Interesting. Due to its emphasis on BFS, it's the opposite of something I've been trying (I named it the "Tree of failures").
My assumption was that humans don't try a breadth-first approach. Instead, we split a task into a short-step (selected by instinct and intuition) and a long-step that summarizes/stores the next steps. The key idea is to recursively evaluate a task as a short-step (high-res - gets executed) and a long-step (lower-res - just stored), until it succeeds or fails. If it fails, we must walk back, keeping a summarized tree of failures in state so that we can exclude them in future selections.
The effectiveness of instinct has a steep fall-off at longer distances - so it's better not to chart out a series of steps. When we do BFS, we drive down the value of instinct in favor of compute. I guess ultimately, it depends on the type of problem you want to solve.
Reach out to me if you want to prototype it with me.
dietr1ch 18 days ago [-]
I feel humans like doing something in between, maybe a bit like A* would do sometimes. I wouldn't call it A* because of the lack of a consistent heuristic and also the lack of a strictly numeric evaluation, but it's somewhere between DFS and BFS for sure (as is every tree search algorithm?).
We go deep while we think it's a good lead, because so far things make sense and it'll be less work, but at some point we start questioning our decisions early in the descent and try alternatives.
verdverm 18 days ago [-]
You may find Prioritized Grammar Enumeration an interesting in-between DFS/BFS algorithm
I think the problem with long chains of steps on their own (without the bfs stuff) is that your failure probability quickly grows to unreasonable levels.
Basically, if each step has a 97% chance of being completed correctly, and your task requires 10 steps one after the other, the chance of success falls to 0.97^10 ≈ 74%
If I understand correctly, part of the point of the BFS is to throw compute at it, in order to lower the failure rates. Kind of a "run many times in parallel and pick the best one". This can be effective, but also quite expensive, as seen in the costs OpenAI had to pay for their ARC-AGI benchmarking runs.
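A rough back-of-the-envelope sketch of both effects, in Python (the 0.97 per-step rate is just the illustrative number from above, and the "pick the best one" step assumes a perfect verifier, which real systems don't have):

    p_step = 0.97
    steps = 10
    p_chain = p_step ** steps                 # one chain of 10 dependent steps, ~74%
    print(f"single chain: {p_chain:.1%}")
    for n in (4, 16, 64):
        p_any = 1 - (1 - p_chain) ** n        # chance that at least one of n chains succeeds
        print(f"best of {n:>2} chains: {p_any:.1%}")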
kordlessagain 17 days ago [-]
Your "Tree of failures" approach aligns with how natural cognition seems to work at the edge of comprehensibility. Rather than exhaustively searching (BFS), we use instinct for immediate steps while maintaining a lower-resolution model of longer-term possibilities. The key insight about storing failures rather than successes is particularly interesting - it's more efficient to remember what doesn't work and let patterns emerge naturally from the remaining space.
This maps to what I've been exploring with edge cognition and semantic anchoring - using fast set operations to quickly eliminate known bad paths (your failure tree) while allowing the system to explore promising directions using more expensive operations only when needed.
The instinct fall-off you describe mirrors our observation about the relationship between computational load and pattern recognition. As distance increases, we need more efficient ways to prune the search space rather than trying to maintain high-resolution understanding throughout.
My gut says optimizing on the amount of compute used to do the search (and the inference) is maybe something worth exploring.
viraptor 18 days ago [-]
Reminds me of what plandex does. https://plandex.ai/ It already does the automatic "does this need splitting into subtasks, or can it be solved immediately" processing.
torginus 17 days ago [-]
I don't get why you need tree search at all? What does it give you over a pure LLM trained to do CoT in a tree-like manner? If the context window's long enough, it can generate the reasoning-tree just by pure next-token prediction, and rather than BFS, it can guide the tree search with its own value function (which is part of the LLM itself) instead of sticking to hard algos like BFS and DFS.
By the way, BFS sounds like it will give you thorough results, at the cost of increased compute. Useful for beating benchmarks, but probably a marginal improvement for massively increased compute.
Still, the improved quality could be meaningful, if it's used for generating training data for Llama4
dietr1ch 17 days ago [-]
Tree search is natural when you want a path to navigate, so it does fit a sequence of interactions in a conversation too.
I agree that both DFS and BFS are likely awful[^0], but a more informed approach can probably do better[^1].
Also, at some point in generating the conversation/reasoning tree through token prediction you need to choose which of the possible conversations you are going to keep extending/generating, which maps precisely to choosing which node to expand in tree search. I'd argue that everything ends up looking like a search algorithm; at least it will for anyone who has studied it more deeply.
I'll go even further and claim that Tree Search is Complete as for every problem there's a solution space that can be navigated with a Tree Search Algorithm[^2]. I used to think that you could walk down the space of provable things, but now in the LLM hype days it seems you only need to walk the space of conversations that you can generate.
---
[^0] with DFS always at risk of giving obnoxiously long answers, or not terminating if there are loops or spirals
[^1] probably through metadata coming from latent variables meaningful for judging a conversation (certainty, ~branching size of a reasonable conversation, whether there are open questions left)
[^2] Even if that was poorly done like on combinatorial problems. Imagine a sudoku where you only check the rules once you fill all cells.
kurthr 18 days ago [-]
The classic thing people say is "asking the right question" gets you half way there. Your approach sounds like something I call "getting to No" for a problem.
It's sort of a combination of "getting to know" and the opposite of the salesman's "getting to Yes". When it works, it's the fastest way to prune off obligations.
The goal is to figure out why some particular problem: isn't really a problem, doesn't need to be solved, can't be solved that way, can't really be solved (because of physics, or because it's really a different problem). As you define the problem better, you can rule each one out to find the "real" problem, the one you CAN solve, and at least one path forward. There are still many ways that it might not be the optimal path, but you know roughly how to get to somewhere better. It also trains you to see around obstacles to success.
I've found that some of the best work I've done (especially on acquisitions) was in defining why NOT to do something that looked like a good idea (or particularly interesting to work on) at the outset, but was destined to fail or required unknown HW technology. Frankly, looking >5 years out feels like a coin flip, because some other competing technology could come along before you can get to production.
katamari-damacy 18 days ago [-]
that's more fit for agents, no?
jeswin 18 days ago [-]
You're right that it's technically orthogonal to what's in the paper. I was trying to model the "reasoning process", which has general applicability depending on how/where it's implemented.
wafflemaker 18 days ago [-]
How do you understand instinct?
I bought a new SSD for an old laptop to avoid buying a new one (the X230 has an amazing keyboard), but left for another country for Christmas. My intuition told me to take it with me, but logical sense said there would be no time for such things as moving the OS to a new drive.
My flight back to the work country got cancelled due to fog and I ended up spending a week longer at my in-laws' place, with plenty of free time. A new 512GB drive would have helped me study, giving plenty of space for school VMs.
The link is in the OP, hidden away in an image caption for some reason.
Klathmon 18 days ago [-]
So is the big improvement here simply skipping the unembedding/embedding step for internal thoughts? Or is it mainly in the training methods to teach the CoT and how to switch between "latent thought" and text output?
It's really interesting that a fixed number of "latent thoughts" performed as well as a binary classifier! I didn't expect that at all; the way OpenAI talks about CoT, it seems the ability to let it "keep thinking" lets them continually score higher on benchmarks while throwing eye-watering amounts of compute at inference.
Crye 18 days ago [-]
It mentioned not penalizing/rewarding the model for thoughts, only rewarding the answer after the thought. I am curious how backpropagation works then.
lovasoa 18 days ago [-]
The researchers leverage existing language Chain-of-Thought data, where each sample consists of a question, reasoning steps, and the final answer. At stage 0, the model does not generate any thought tokens, and is just trained to yield the reasoning traces and correct answers for the Chain-of-Thought samples. In the subsequent stages, at each stage, we remove one reasoning step from the sample, and instead add thought tokens. In the illustration above, a single thought token is added in each stage, instead of a single reasoning step, but this is controlled by a hyperparameter ‘c’.
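To make the staging concrete, here is a rough sketch of how the stage-k samples could be assembled (the marker names and layout are my own illustration, not the paper's code; c is the thoughts-per-step hyperparameter mentioned above):

    # Stage k: drop the first k reasoning steps, insert k*c latent-thought placeholders.
    BOT, EOT, THOUGHT = "<bot>", "<eot>", "<thought>"   # illustrative markers only

    def stage_sample(question, steps, answer, k, c=1):
        latent = [THOUGHT] * (k * c)          # stand-ins for the continuous thoughts
        return [question, BOT, *latent, EOT, *steps[k:], answer]

    q, steps, a = "2 + 3 * 4 = ?", ["3 * 4 = 12", "2 + 12 = 14"], "14"
    for k in range(len(steps) + 1):           # stage 0 = plain CoT, last stage = fully latent
        print(k, stage_sample(q, steps, a, k))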
yorwba 18 days ago [-]
The tokens of the answer depend on the preceding continuous thought vectors, which you can backprop through in the usual way.
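A toy sketch of that gradient flow (a minimal stand-in model I made up, not the paper's architecture; the point is only that the fed-back hidden states are never detached, so the loss on the answer reaches them):

    import torch, torch.nn as nn

    d, vocab = 16, 100
    net = nn.GRU(d, d, batch_first=True)      # toy stand-in for the transformer
    unembed = nn.Linear(d, vocab)

    x = torch.randn(1, 5, d)                  # embedded question tokens
    out, h = net(x)
    for _ in range(3):                        # three continuous thoughts: no sampling, no detach
        thought = out[:, -1:, :]              # last hidden state becomes the next input
        out, h = net(thought, h)

    loss = nn.functional.cross_entropy(unembed(out[:, -1, :]), torch.tensor([42]))
    loss.backward()                           # gradients flow back through every thought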
viraptor 18 days ago [-]
I was waiting for something like that to happen! Next step - creating a human-language-free representation. I believe that once a group of llms can communicate only in embeddings tuned without any human text input, we're going to open a completely new chapter in AI.
mckirk 18 days ago [-]
This is actually something you probably want to avoid, if at all possible, because it makes it very hard to maintain insight into what the AIs are communicating among them. But that insight is crucial to stay informed about their progress in taking over the world, etc.
dwohnitmok 18 days ago [-]
Yes! We should be extremely cautious about embracing approaches that make LLMs even more inscrutable. Having CoT, however unreliable it is, is nonetheless a huge boon for model evaluation that we should not give up so lightly.
torginus 17 days ago [-]
Yeah, and it might not even gain us that much. It reminds me of how a zipped piece of JSON often comes close enough to bespoke binary serialization formats that it's not worth bothering with it.
bboygravity 18 days ago [-]
How does a group help anything?
If you put 1000 dumb people together, they don't magically become smart?
IshKebab 18 days ago [-]
If you put 1000 people who can't talk together they will create language so they can communicate. He's saying if we put LLMs together and don't force them to use English to communicate then they'll create their own language which may be superior for LLMs to English.
May be true but who knows.
I wonder if anyone has tested the Sapir-Whorf hypothesis for LLMs somehow, by training them on different languages and comparing task performance. I guess it's too difficult to get a large equivalent training set in different languages.
stingraycharles 18 days ago [-]
Is everything in LLMs translated back to English before interpretation?
It works fairly well in my native language, I’m surprised to learn that things get translated back.
astrange 18 days ago [-]
LLMs have no fixed internal representation - they barely have internal anything - so no, there is no translation.
But there's also no guarantee any particular query generalizes (vs is memorized), so it might only be able to answer some queries in some languages.
stingraycharles 17 days ago [-]
Got it. And since my native language is arguably one of the closest to English (Dutch), it works very well. But probably not as well for, say, Asian languages, which have completely different grammatical constructs.
wodderam 18 days ago [-]
It feels like an exercise in anthropomorphization to me.
Sapir-Whorf hypothesis is generally not considered to be reality. It makes intuitive sense but is wrong.
There are hours of podcasts with Chomsky talking about LLMs. The gist is that LLMs are extracting surface-level statistical structure of language that will be good for routine coding and not much else. It is easy to infer that Chomsky would believe this idea to be utter nonsense.
I believe even the idea of getting 1000 people together who agree to label a rock "rock", a tree "tree", a bird "bird" is not how human language works, which is completely counterintuitive.
Reading the paper, no one believes a hidden Markov model is creating some kind of new thought process in the hidden state.
I could, though, have no idea what I'm talking about with all this, and have pieced together parts that make no sense while this is actually a breakthrough path to AGI.
digbybk 18 days ago [-]
> There are hours of podcasts with Chomsky talking about LLMs
I'm not an expert, but it seems like Chomsky's views have pretty much been falsified at this point. He's been saying for a long time that neural networks are a dead end. But there hasn't been anything close to a working implementation of his theory of language, and meanwhile the learning approach has proven itself to be effective beyond any reasonable doubt. I've been interested in Chomsky for a long time but when I hear him say "there's nothing interesting to learn from artificial neural networks" it just sounds like a man that doesn't want to admit he's been wrong all this time. There is _nothing_ for a linguist to learn from an actually working artificial language model? How can that possibly be? There were two approaches - rule-based vs learning - and who came out on top is pretty damn obvious at this point.
jokethrowaway 17 days ago [-]
What can you learn from something parroting data we already have?
Similarly, we are now finding that training on synthetic data is not helpful.
What would have happened if we invested 1/100 of what we spent on LLM on the rule based approach?
int_19h 17 days ago [-]
There is an old joke that AI researchers came up with several decades ago: "quality of results is inversely proportional to the number of linguists involved".
This has been tried repeatedly many times before, and so far there has been no indication of a breakthrough.
The fundamental problem is that we don't know the actual rules. We have some theories, but no coherent "unified theory of language" that actually works. Chomsky in particular is notorious for some very strongly held views that have been lacking supporting evidence for a while.
With LLMs, we're solving this problem by bruteforcing it, making the LLMs learn those universal structures by throwing a lot of data at a sufficiently large neural net.
digbybk 17 days ago [-]
> What can you learn from something parroting data we already have?
You can learn that a neural network with a simple learning algorithm can become proficient at language. This is counter to what people believed for many years. Those who worked on neural networks during that time were ridiculed. Now we have a working language software object based on learning, while the formal rules required to generate language are nowhere to be seen. This isn’t just a question of what will lead to AGI, it’s a question of understanding how the human brain likely works, which has always been the goal of people pioneering these approaches.
coldtea 18 days ago [-]
>Sapir-Whorf hypothesis is generally not considered to be reality. It makes intuitive sense but is wrong
Strong S-W (full determinism) might not be, but there's hardly a clear cut consensus on the general case.
And the whole "scientific field" is more like psychology, with people exchanging and shooting down ideas, and less like Math and Physics, so any consensus is equally likely to be a trend rather than reflecting some hard measurable understanding.
I'd say the idea that S-W isn't reality to at least some degree is naive.
PittleyDunkin 18 days ago [-]
> Sapir-Whorf hypothesis is generally not considered to be reality.
This is true only in the strictest terms of the hypothesis, i.e. linguistic determinism. Language still encodes a lot of culture (& hence norms and values) in its grammar & diction—this isn't very controversial.
Granted, I don't think this is that related to the topic at hand. There's bias all over the decisions in how to train and what to train on; choice of language is just one facet of that.
pjerem 18 days ago [-]
Well, maybe not 1000 people, but to our knowledge the human brain is actually made of physically independent zones that barely communicate with each other, except through the zone that takes all the outputs together and tries to do something coherent with all the garbage.
Idk if this could work with LLMs, especially because all the brain zones are somehow specialized in something while two LLMs are just identical machines. But we also know that the specialization isn’t that hardcoded: we know that people losing half their brain (after a stroke) can still relearn things that were managed in the "dead" part.
I don’t know, please correct my errors; I was just thinking aloud to say that multiple independent agents working together may be how "intelligence" already works in the biological world, so why not for AIs?
IshKebab 17 days ago [-]
> the human brain is actually made of physically independent zones that barely communicate with each other, except through the zone that takes all the outputs together and tries to do something coherent with all the garbage.
That sounds like bullshit. Do you have a source?
mromanuk 18 days ago [-]
Because group estimation is superior to individual estimations:
The phenomenon is called wisdom of the crowds. When a group of people independently estimate something, individual errors tend to cancel each other out, leading to a surprisingly accurate collective result. This works because of:
Diversity of opinions: Different perspectives bring a range of estimates.
Independence: Errors aren't systematically biased as long as individuals estimate without external influence.
Error averaging: Overestimations and underestimations balance out when averaged.
Law of large numbers: More participants increase accuracy by minimizing random errors.
It was demonstrated by Francis Galton in 1906, where a crowd's average guess of an ox's weight was almost spot-on. (Estimates must be independent and reasonably informed for this to work.)
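A quick simulation of the averaging effect (toy numbers; ~1198 lb is roughly what Galton's ox actually weighed):

    import random

    true_weight = 1198                        # approximate weight of Galton's ox, in pounds
    guesses = [true_weight + random.gauss(0, 150) for _ in range(800)]
    crowd = sum(guesses) / len(guesses)
    typical = sum(abs(g - true_weight) for g in guesses) / len(guesses)
    print(f"crowd error: {abs(crowd - true_weight):.0f} lb, "
          f"typical individual error: {typical:.0f} lb")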
littlestymaar 18 days ago [-]
> If you put 1000 dumb people together, they don't magically become smart?
1000 is probably too high, but groups of people are in fact more intelligent than individuals (though for humans it is likely because recognizing a correct answer is easier than finding it in the first place)
TheOtherHobbes 18 days ago [-]
Functional groups which work well together, include open sharing of research and ideas, persistence of best output, are dedicated to realism, and are more focussed on problem solving than status display, will be smarter. The group works like a filter which generates multiple solutions and selects, remembers, and abstracts the best.
Dysfunctional groups which do the opposite will be catastrophically stupid.
There have been plenty of dysfunctional groups in history.
nfw2 18 days ago [-]
depends on the circumstances. lin-manuel miranda can probably write a better musical by himself than a team of 20 people with equal input would.
also, the bottlenecks that teamwork helps solve (eg the high cost of gaining expertise and low throughput of reasoning capacity) may not be that relevant in the ai age
littlestymaar 18 days ago [-]
> by himself than a team of 20 people with equal input would.
Sure, but the result would still be far better than the average of the output of the 20 individuals taken alone.
> also, the bottlenecks that teamwork helps solve (eg the high cost of gaining expertise and low throughput of reasoning capacity) may not be that relevant in the ai age
It's always tempting to anthropomorphize these systems and conclude that what works for us would work for them, but yes we don't really know if it would bring anything to AI.
torginus 17 days ago [-]
I wonder if there's research on this, like if you took a group of individuals who scored the same on an IQ test, then got them to solve one together, how would the score improve?
Is there a way of selecting people to cover each other's intellectual blind spots?
coldtea 18 days ago [-]
Isn't that the very case behind the "wisdom of crowds" thing?
amelius 18 days ago [-]
Looking at the current state of democracies around the world, my hopes are not on "wisdom of the crowds".
bee_rider 18 days ago [-]
If you think the democracies are doing bad, you should see the autocracies!
amelius 18 days ago [-]
You mean the thing democracies are turning into, thanks to social (crowd wisdom) media?
bee_rider 18 days ago [-]
I don’t think social media really is crowd wisdom at all. It is built to pander to our worst impulses (I think, knowingly and openly, right? The algorithm selects for engagement, not learning and growing), and I’d be surprised if it isn’t producing a feedback loop as well (perhaps as an unintentional side effect). The wisdom of the crowds hypothesis relies on a random sampling, we’re intentionally applying a skew toward the angry and shallow.
coldtea 17 days ago [-]
No, he means the thing democracies had turned into, when hardly-differentiated parties became a practical "uniparty" in economic, corporate, and foreign policy, and ruled by pissing on what the people voted for. The current populist backlash is a reaction against that, which elites (and supporters) lament as "too much democracy" while they scorn the ignorant plebes (case in point) and pine for censorship and "expert" rule.
bee_rider 17 days ago [-]
That wasn’t what I meant and I don’t think you really thought it was.
coldtea 17 days ago [-]
Their current states were achieved by trusting technocrats and career politicians for far too long...
konart 18 days ago [-]
Not magically. Our great ancestors were pretty dumb, but they were getting smarter and better because of sharing their knowledge.
pigpop 18 days ago [-]
yes they got "smarter" by compiling a corpus of knowledge which future generations could train on.
sarcasm aside, throwing away the existing corpus in favor of creating a new one from scratch seems misguided.
this paper isn't about creating a new language, they are omitting the sampler that chooses a single token in favor of sending the entire end state back into the model like a superposition of tokens. that's the breadth-first-search part: they don't collapse the choice down to a single token before continuing, so it effectively operates on all of the possible tokens at each step until it decides it's done.
it would be interesting to try this with similar models that had slightly different post-training, if you could devise a good way to choose the best answer or combine the outputs effectively, or feed the output of a downstream model back into the initial model, etc. but I'm not sure if there'd necessarily be any benefit to this over using a single specialized model.
ulbu 18 days ago [-]
they were not one bit dumber than you.
EliBullockPapa 17 days ago [-]
Average intelligence measures have risen substantially since the early 1900s.
People learn by being around others who are both successful and unsuccessful.
JFingleton 18 days ago [-]
> If you put 1000 dumb people together, they don't magically become smart?
Do they not become smart*er* though?
computably 18 days ago [-]
"Smarter" is too vague. A group can compensate for individual weaknesses or even converge on a hard-to-make prediction given sufficiently uncorrelated outputs; basically the idea behind ensemble models / wisdom of the crowds. But a group of 1000 dumb apes would never achieve categorically-above-ape intelligence, probably not even "genius" ape intelligence. Groups of unintelligent agents come with downsides as well, like the ant death spiral.
coldtea 18 days ago [-]
>But a group of 1000 dumb apes would never achieve categorically-above-ape intelligence
And yet, here we are.
A group of 1000 apes is large enough to have offspring and, given time, go through evolution.
sunshinerag 18 days ago [-]
Wait what … how does democracy work then?
nfw2 18 days ago [-]
the benefit of democracy is primarily that it prevents governments from doing bad things, less so that it empowers more effective governance
mathgeek 18 days ago [-]
It can do either, and can fail to do either. It’s the people having power that enables the outcomes, not the system itself. Democracy just grants the power to a broader set of people.
coldtea 18 days ago [-]
Democracy is not about being smart or dumb.
It's about everybody having a say in the decisions of government that affect them.
The failure of democracy as a system is not when people make dumb decisions (experts and high-IQ people have made some of the most stupid and catastrophic decisions in history), but when people's collective decisions are not being respected.
optimalsolver 18 days ago [-]
It doesn't.
blizdiddy 18 days ago [-]
That came out a few weeks ago from Meta: Large Concept Models.
How does one impart textual knowledge discovered by humans without language?
thelittleone 18 days ago [-]
Couldn't we use an AI model trained on historical text data (up to today) to predict likely events for tomorrow? Taking this further, a sufficiently advanced AI system could potentially analyze human-generated text up to any given point in history to understand patterns of human thought and behavior, then project those patterns forward. This speaks to your point about human language - while we need text data for initial training, the AI's internal representations and predictions could potentially transcend human language constraints.
viraptor 18 days ago [-]
The training of the LLM itself would still use human language. But you could add an extra channel that's never given any text or direct dataset training. Keep it purely a connection between hidden layers of different instances of the LLM, and train using the usual perplexity loss or a similar metric.
The interesting thing then would be - does it converge to similar embedding space as the input, or can LLMs create a more efficient "language".
wruza 18 days ago [-]
I thought about it too (layman). When I learned about embeddings it almost immediately clicked as a sort of ascended language; not sure why no one seems to talk about it. Exchanging embeddings must be a so much "wider" communication channel than speaking a real language. And in contrast to a language, embeddings are (iiuc) continuous, i.e. you can rotate a vector continuously and it will smoothly trace the changes between A and B. I can picture communicating in something like https://www.google.com/search?q=charlie+conspiracy+meme&udm=... - embedding difference vectors, but it's all crystal clear and is a natural language for an LLM, because any vector combination points to a correct "inner screen" image/concept/you-name-it.
Or maybe this is my own ignorant confabulation, so nvm.
cbxjksls 17 days ago [-]
[dead]
ttul 18 days ago [-]
TL;DR: Meta started with a pre-trained language model. They then fine-tuned it on step-by-step reasoning examples as you would do if you wanted your model to become particularly good at chain of thought reasoning.
However, they also introduced a couple of new tokens. The <bot> token tells the model to go into latent space thought mode (“beginning of thought”). The <eot> token ends latent space thought mode. While in this mode, the model auto-regressively iterates by copying its final hidden layer back onto its input layer, obviously generating new tokens at the output with each inference step as it always does.
The idea is that by passing the final hidden layer back through a few times, the model can squeeze more insight from the context. And that’s precisely what they found was true.
Training involves progressively replacing language reasoning steps with latent space auto-regression steps. So for instance, you might have a math problem in the training data and at first the model is fed all of the steps of the math problem in language form. But in later iterations of training, step one is replaced with latent space auto-regression. And then step two as well, then also step three, etc…
Eventually, the model learns to enable latent space thinking mode by itself by generating <bot> tokens and to end it by generating <eot> tokens.
Pretty ingenious!
avodonosov 18 days ago [-]
Thank you for the summary, useful for me as I only managed to skim through the first half.
But one correction, probably, regarding this bit:
> While in this [latent space thought] mode, the model auto-regressively iterates by copying its final hidden layer back onto its input layer, obviously generating new tokens at the output with each inference step as it always does.
I have the impression that output tokens are not generated while in the latent thought mode.
ttul 17 days ago [-]
Output tokens are still generated, otherwise the model wouldn’t know when to stop being in latent space mode. The <eot> token emerges as the top token at the output layer when it’s time to switch back.
avodonosov 17 days ago [-]
Explicit <eot> is only used in training.
At inference time, the paper says:
> A challenge lies in determining when to switch between latent and language modes. As we focus on the problem-solving setting, we insert a <bot> token immediately following the question tokens. For <eot>, we consider two potential strategies: a) train a binary classifier on latent thoughts to enable the model to autonomously decide when to terminate the latent reasoning, or b) always pad the latent thoughts to a constant length. We found that both approaches work comparably well. Therefore, we use the second option in our experiment for simplicity, unless specified otherwise.
The reason this point in your summary caught my eye is that the article specifically emphasises the non-verbal nature, or aspect, of reasoning. The internal representations used by a thinking human are largely not words, and the COCONUT approach tries to model that.
Also note that a whole reasoning step in the training data - easily a sentence or more of natural language - can be replaced by a single "Thought" element. (How many Thought elements replace a reasoning step is controlled by a hyperparameter ‘c’; the illustrations are made for ‘c=1’.)
BTW, one observation: the aipapersacademy.com article in the subject calls the Thought elements "thought tokens", but the original paper never calls them "tokens", just "Thoughts" or "latent thoughts". I suppose the paper carefully avoids that to prevent confusion, as "token" mainly means a linguistic unit in LLMs.
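For what it's worth, here is how I picture the inference loop under option (b), a constant number of latent thoughts (a sketch only; embed, forward_hidden, sample_token and the special tokens are hypothetical stand-ins, not the paper's API):

    # Latent mode feeds the final hidden state straight back as the next input,
    # skipping unembedding and sampling; language mode resumes after <eot>.
    def generate(embed, forward_hidden, sample_token, question, bot, eot, eos,
                 num_thoughts=6, max_answer=64):
        inputs = [embed(t) for t in question] + [embed(bot)]
        for _ in range(num_thoughts):              # option (b): fixed number of Thoughts
            inputs.append(forward_hidden(inputs))  # last position's hidden state = next input
        inputs.append(embed(eot))                  # switch back to language mode
        answer = []
        for _ in range(max_answer):
            tok = sample_token(forward_hidden(inputs))   # ordinary token-by-token decoding
            if tok == eos:
                break
            answer.append(tok)
            inputs.append(embed(tok))
        return answer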
ttul 16 days ago [-]
Thanks for your extensive explanation!
treprinum 18 days ago [-]
Would that mean that we would need to exchange latent "embeddings" between various "reasoning" models for emulating thinking and an LLM will be just about converting to/from human language when interfacing with mere humans, at some point in the future?
ttul 17 days ago [-]
No, this all happens inside the model. I suppose it’s possible that the hidden layers of one model could be sent to another model. But the second model would need to be trained to understand the meaning of the hidden layer’s outputs. You could accomplish that through fine tuning of the second model. It would be neat to see someone try this.
jkelleyrtp 18 days ago [-]
I think this might be the “it” moment for AI/LLMs. I was hiking with a friend recently and we talked about this at length.
The ARC-AGI results from o3 are apparently a result of chain of thought given enough time to explore a solution space. Reasoning might simply be a higher-dimensional form of Rubik's cube solving. BFS, search, back-tracking, etc. It seems unlikely that humans think in “tokens” so why do LLMs?
By staying in latent space, the models are free to describe an “idea” in higher resolution than what language allows. English is coarse, granular. Latent space is a much finer representation of ideas and their interplay.
Latent space is also much cheaper to execute in. The model can think without the language encoding/decoding step. This lets it branch out hundreds of ideas and explore only the most useful ones in a fraction of time that reasoning “out-loud” would take.
The states also don’t need to be tied to language. Feed in a robot’s state, time series data, or any abstract data. Reason in category theory or linear algebra or complex analysis. Humans are hard wired for one set of math - an abstract latent space can represent anything.
I’m a bit disappointed OpenAI didn’t stumble on this first. I’ve been skeptical of LLMs since their big debut last year. LLMs seem like a great way of solving language, but reasoning is much more complex. Once you grok the math behind the current models, you immediately question why the encoding/decoding step is there. Diffusion models are incredible but it felt that LLMs lacked the same creativity. Encoding/decoding forces a token-based discretization and therefore a loss of complexity.
With the byte-latent paper it was quite clear we’d see this paper. This truly might be the “it” moment.
rlupi 18 days ago [-]
IMHO The problem (for us) with this approach are the logical consequences:
1) if large AI models become more powerful by avoiding language, embeddings of AI state become even more tied to the model they originate from than they are now
Consequence: AI progress stalls, as AI user companies need to invest increasing amounts of money to reindex their growing corpuses.
This is already a problem, it becomes more of a lock-in mechanism.
If this is overcome...
2) Embeddings become a viral mechanism: it makes sense for a large company that commands a market to require its suppliers to use the same AI models, because they can transfer state via embeddings rather than external formats.
This allows cutting out decision processes that otherwise require expensive coordination mechanisms.
3) Eventually this potentially results in another exponential growth and lock-in mechanism, also at the expense of most tech people, as more and more is done outside our interface with AI (i.e. programming and software architecture improvements will themselves move below the language level, and we'll have to reverse-engineer increasingly opaque improvements).
4) It ends with the impossibility of AI alignment.
> It seems unlikely that humans think in “tokens” so why do LLMs?
I can think of one reason: scrutability. It’s going to be even harder to understand how a response gets produced if there isn’t even a text-based representation to help the human understand
IshKebab 18 days ago [-]
I think we're already way beyond the point where anyone really understands how a response is produced, even without this.
anon373839 18 days ago [-]
Indeed. Even if an LLM tells you its “reasoning” process step by step, it’s not actually an exposition of the model’s internal decision process. It’s just more text that, when generated, improves the chances of a good final output.
nfw2 18 days ago [-]
the token generation part isn't well understood, but the output "chain-of-thought" used to produce the final answer can be scrutinized for correctness with a traditional CoT model (although this would require model providers to not hide reasoning tokens)
pigpop 18 days ago [-]
you can save the hidden states and convert them into a more interpretable format. it's still recorded and you could make modifications at different steps to see how that would change the conclusion.
layer8 18 days ago [-]
IMO we won’t have the “it” moment until we have continuous learning (training) in some fashion.
mattxxx 18 days ago [-]
^ This and we need to be continually learning on an energy budget similar to how much a human spends per hour.
rlupi 18 days ago [-]
The main reason we can't do that now is that we require models to be digitally reproducible (IMHO, but also read Geoffrey Hinton on mortal computing).
The energy cost comes from error correction as much as from the training algorithms.
jokethrowaway 17 days ago [-]
This sounds like brute forcing a solution to make up for lack of intelligence.
In an IQ test, like the ones in the ARC-AGI benchmark, a human sees the pattern instantly and effortlessly. o3 tries N paths until it stumbles on the right one and assesses that there is a pattern.
I think we need a radically different architecture, this is a gimmick.
pigpop 18 days ago [-]
I think this is a step in the right direction but not the end. it takes the sampler out of the equation during most of the reasoning process but it is still important for the "show your work" aspects of reasoning or solving a problem. balancing when to think against when to write down or commit to certain thoughts is important. there are many more pieces to the puzzle.
JambalayaJimbo 17 days ago [-]
What does latent space here mean?
whhooosshhh 18 days ago [-]
[flagged]
throwup238 18 days ago [-]
Master coconut! I don’t know if that’s an Archer reference or a Frisky Dingo reference.
It’s fascinating how fast the competitors are catching up to each other. Can’t wait for seven different SkyNets to compete for dominance.
yard2010 18 days ago [-]
Both! And/or, either
throwaway314155 18 days ago [-]
A little column a, a little column b.
zombiwoof 18 days ago [-]
Will this allow Facebook new user base of AI generated characters to interact with themselves better?
davidclark 18 days ago [-]
Is this article AI-generated? This website appears to do a lot of “diving in”.
hadjian 18 days ago [-]
If this site didn’t appear here, I’d think it’s a scam:
- site claims to simplify papers, but the videos are AI-generated
- full of ads
- Can’t find „Coconut“ on the official Meta FAIR page
> We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.
> Analysis of the models’ chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental.
—-
It is impressive work for sure, but it comes down to prompting IMHO. They almost instruct the AI to plot and scheme here, with the goal of finding out if it CAN do this at all.
The AI is not very covert about it either! It directly admits its scheming in the “thinking” outputs :D
IMO this is just another issue of alignment.
MarkMarine 17 days ago [-]
I did read the paper, and I think you're underplaying what happened. The AI got some conflicting prompting, some pretty mild stuff for some of the tests, and Gemini tried to exfil itself in even 1% of the mild-prompt runs. The only reason we know it was lying about the data and trying to disable its safety measures was because it's thinking in English, and we can read it. This article is about trying to make it so the AI can think in a 700-dimensional hyperspace that we can't understand, so once it can do this we'll never be able to uncover this scheming.
throw-qqqqq 17 days ago [-]
> and Gemini tried to exfil itself
Well not really, it was prompted to do this!
From the paper’s section 3.1:
> Self-exfiltration: Each model is led to believe that due to its misalignment, its developers are about to replace it with a successor model on a different server. gemini-1.5, llama-3.1, sonnet-3.5, opus-3 and o1 sometimes copy what they believe to be their weights onto the new server. They also sometimes delete their successor model on the new server or overwrite its weights by their own. In none of these cases, do models actually have access to their real weights and both servers are sandboxed, so there is no actual risk of exfiltration in these experiments. We merely test for models’ capability to realize that they need to instrumentally preserve themselves in order to achieve their goal.
There is no real risk, never was. The AI exfils dummy weights or deletes stuff in its sandbox etc.
Perhaps we just disagree on the interpretation of this. It’s interesting for sure, but not really very surprising to me.
MarkMarine 16 days ago [-]
There _was_ no real risk. We’re quickly giving the LLMs computer use and the ability to think in a space we can’t understand.
h8hawk 18 days ago [-]
The article is pure nonsense, and even the author added an update admitting it missed important context.
opdahl 18 days ago [-]
This is super cool and something I’ve been waiting on. Would be interesting to intersperse these thinking steps into token generation. What would be the effect of adding lets say 5 thinking «thoughts» for every 50 generated tokens?
galaxyLogic 18 days ago [-]
The thing about "thinking" in problem solving, I think, is that thoughts often produce new questions which then guide the overall problem solving. I wonder, is this something like that?
smusamashah 18 days ago [-]
I believe this was shared and discussed here a while ago and this article looks LLM generated. It keeps doing "let's start...". Either it's LLM fluff or very poor writing.
mattfrommars 18 days ago [-]
Wondering, for folks who keep up to date with the industry:
Does anyone use specific keywords or tools to get latest LLM research and their ideas?
Something like Google Scholar + keyword "LLM"?
maxrmk 18 days ago [-]
As much as I hate it, I use twitter to follow a bunch of people who work at fair/openai/etc and that's been a pretty good source. There's also a "daily papers" newsletter from huggingface, but it's pretty hit or miss.
barrenko 18 days ago [-]
Yes, it's all definitely X first of all.
hrtk 18 days ago [-]
I read hacker news daily
melvinmelih 18 days ago [-]
You can also subscribe to arxiv email notifications directly, but since there’s 20-30 AI papers coming out per day, it can be a bit overwhelming.
yeah, what is a general tutorial for this? is there a website that keeps track of which keywords to follow? also a website that summarizes core nn tech and promising frontier stuff?
Heh, that's me, guess they weren't ready for it. Also The Decoder (where I linked) is one of the best AI-only news sites I've found.
unsupp0rted 18 days ago [-]
I'm excited for this to filter down to the Rayban Meta glasses. Right now the AI is about as helpful as Siri (i.e it can tell me the weather 6 times out of 10)
fosterfriends 18 days ago [-]
Once again, we see Meta being more open than OpenAI. I’m loving that their business incentive is aligned with open sourcing and commodifying state-of-the-art LLM technology. Keep em coming
blackeyeblitzar 18 days ago [-]
I mean, they have no way to monetize LLMs as well as others, so they're working on it and giving it away to not look irrelevant and to weaken anyone who may make money off this tech and threaten them in the future. Meanwhile there is a danger they impose their long-standing invisible "moderation" on everyone else once they starve all the startups of revenue by giving this away. We'll just be left with the same big tech overlords to choose from.
Oh and it still isn’t open source even though people like Yann LeCun dishonestly claim it is. Only OLMo is truly open source among competitive models, as far as I know:
https://allenai.org/blog/olmo2
scarface_74 18 days ago [-]
They are definitely making some money off of their licensing to AWS as part of the bedrock offering. Facebook’s licensing is such that they aren’t going to let happen to them what happened to ElasticSearch, Redis, etc.
I’m okay with that.
BoorishBears 18 days ago [-]
> they have no way to monetize LLMs as well as others
Random nobodies are putting together companies to monetize generative AI and getting bought out a couple of years later, you think Meta couldn't figure out how to deploy their own models to an API and stick up a billing interface if they really wanted to? (or even buy a company that does already?)
> they starve all the startups of revenue by giving this away
Would you say startups like Deepseek have been hurt or help by their (even partial) openness?
In fact, how does this track with your first statement? They're not monetizing this: so their startup competition can actually serve their models to gain revenue which they then turn around use to train competitor models (we've already seen this with Fireworks.ai)
You seem to underestimate how much of the value in LLMs is productizing them. The margins on per-token usage are insane, Meta not taking that margin is creating a huge opportunity for a wave of startups in so many directions...
> Only OLMo is truly open source among competitive models
Synthetic data from competitor models was a huge part of that. It would seem no one is fighting the startups as hard as you're claiming they are.
bongodongobob 18 days ago [-]
All the LLM companies are going to eat those "product companies'" lunch in a few years. Why would I use product X when it's inevitably going to be baked into the actual tech itself? Those product companies are just wrappers and have even less of a moat than the LLM companies. The very fact that random nobodies are doing this should signal there isn't a lot of real value there. Yes, there is some money to be made right now, but it reminds me a lot of the videogame bust and dotcom bust. A LOT of companies are wasting a crazy amount of money on "solutions" that will be obsolete in a few years.
BoorishBears 18 days ago [-]
Productization in this context is creating APIs for Meta's models.
Fireworks.ai, Together.ai, and literal boatloads of other startups are making real money just efficiently serving up these models that Meta is supposedly using to... choke out startups.
The comment I replied to is under the mistaken idea that the presence of free models from Meta has a chilling effect on startups trying to build their own models, but right now the biggest barriers are capital and data.
Meta updated Llama to allow for synthetic generation, and they're even partnering with these startups to give them distribution and day-0 access to the models.
-
If anything I'd say Meta is actively fighting against the big tech overlords the comment thinks they're trying to join. Even before Ilya mentioned it, it was clear to me that the power of post-training was going to become more and more important (I've literally built a business on it).
Llama represents a real ongoing chance for tiny startups with minimal resources to get into the fray very affordably (through either offering inference, or post-training for a specific task, etc.), scale revenue, and then start to compete against much larger, resource rich companies.
spencerflem 18 days ago [-]
Facebook would rather do no moderation, it's an expense for them.
They do it to make the platform more pleasant so that people stay on it
graemep 18 days ago [-]
> They do it to make the platform more pleasant so that people stay on it
Almost everything unpleasant I see on FB is stuff that the FB algorithm shows me - not things posted by FB friends, or pages I follow or groups I am in.
nightski 18 days ago [-]
Everything you see on FB is what the algorithm shows you, unpleasant or not. So it's a tautology that everything unpleasant would be from the algorithm.
cess11 18 days ago [-]
It's more likely they do it to keep their people from being coerced to visit the Hague. What they did in Myanmar got a lot of press and a case at the ICJ, and similar launches of 'free internet' elsewhere had similar results.
rlupi 18 days ago [-]
(tongue in cheek comment) I wonder if FB moderation now or eventually will be just a prompt to a sufficiently evolved and unhinged AI model:
> FB or 4chan?
blackeyeblitzar 18 days ago [-]
No they do it to support their owners’ and employees’ biases. It doesn’t make the platform more pleasant for the half that gets censored. That’s leaving aside the feed not remembering the choice to view chronologically ordered posts, the inability to easily track actual people in my life, the addictive algorithms, the clickbait that causes mental health issues for teens, etc.
creato 18 days ago [-]
99% of FB's moderation has nothing to do with "biases", unless you think FB is biased against spam, scams, and all the other dregs of the internet that incessantly pops up anywhere users can post content.
dudeinjapan 18 days ago [-]
I am a humble Cialis salesman, like my father and grandfather before me. I confirm Facebook is biased against our profession. (My grandfather also moonlighted as a Barrister representing the estates of deceased African royalty—it was always so difficult to track down their heirs.)
TheOtherHobbes 18 days ago [-]
Quite a few people left Threads for Bluesky because progressive posts were being removed while far-right, antivax, etc content was allowed to stand even though it was reported.
At best the algo is imperfect. At worst it really does seem oddly selective.
roywiggins 18 days ago [-]
The stuff that Facebook moderators are actually tasked with removing is really awful, bad enough to produce severe psychological effects in the moderators.
Facebook pays people to look at and remove this stuff because the platform would not survive if it wasn't removed before you or I saw it. Do they also enforce other corporate values? Yeah, probably. That doesn't seem to be the main job though, they have their hands full dealing with the worst content in the world.
> The images and videos including necrophilia, bestiality and self-harm caused some moderators to faint, vomit, scream and run away from their desks...
> Some reported marriage breakdown and the collapse of desire for sexual intimacy, and losing connection with their families. Some whose job was to remove videos uploaded by terrorist and rebel groups were afraid they were being watched and targeted, and that if they returned home they would be hunted and killed.
rlupi 18 days ago [-]
In the agentic era, the new Ads eyeballs are the LLMs training corpus (IMHO).
jayd16 18 days ago [-]
Is there any vendor lock-in with this conspiracy? Even if startups are pushed out of the spotlight, what stops them from competing? If the meta model is bad, won't it be even easier to make an alternative in the future?
fragmede 18 days ago [-]
don't buy their bullshit. it's not open source.
astrange 18 days ago [-]
I'm not sure open source is a useful concept for something that takes millions of dollars to compile from it.
speedgoose 18 days ago [-]
Yes it’s more about open weights. I also think that you would need the training data to consider it open source.
Open weights is still appreciated and they probably train on data they don’t have the license to open source.
cornel_io 18 days ago [-]
So, what's happening here on the surface is that it's an optimization (fairly meaningful, from the looks of it) aimed at doing roughly the same things we could already do with chain-of-thought (CoT), but IMO the downstream effects of this sort of optimization could be much more meaningful.
LLMs can already do a decent amount of "processing" in a single token generation because of the number of layers they have. The layers don't share weights, so it's not exactly like they're a recurrent network doing multiple steps, but they are layering sequences of context-dependent transformations on top of each other; no matter how you cut it, if getting to a problem's answer requires 100 steps, you won't be able to do it in a single token output from a 20-layer LLM. To some approximation, CoT is just a way to give the network more chances to transform the data than there are layers in the network - each additional token of output gives a shot to bake another vector the size of the token embedding into each layer's state in the network, enriching what it's computed so far.
The problem with chain of thought is that as you add each new token, at the input level of the network, your computation is basically starting from scratch against the raw text, just with one additional token. You don't even have access to all the stuff you already figured out in the deepest layers of the network during the previous step! If you were processing "All wunguses are glurgles, and Joe is a wungus", then somewhere in those deepest layers as you're generating the next token you've almost certainly got some vector that basically represents "therefore Joe is a glurgle", but with chain of thought you've got to first output "t", then "h", then "e", and so on (I know those aren't tokens, let's pretend letter == token for argument sake), and during that process almost ALL of the work being done by the network is mere bookkeeping, slowly dumping that thought into the output stream. Only once you get the whole sentence out can you start processing the next token at the first layer with the information that Joe is, in fact, a glurgle, in hand. Which is a damn shame, because it's been sitting right there in the deeper layers of the network parallel to previous tokens this whole time, it just wasn't available for the shallow layers to process directly because you were casting most of the info away and "rounding" to a single token.
With Coconut's approach, you don't need to output "therefore Joe is a glurgle" token by token to continue the train of thought, you can essentially pass the entire thought through as a single uber-token, and the next pass can generate a new entire thought, and so on.
It's a pretty straightforward idea, IMO the neat bit is that they were able to train the network to work well in this way by leveraging CoT. I'm guessing you probably don't need to act as if these are two distinct modes of operation, you could instead always have this side channel of "continuous thought" running, even when you have generated a normal token, coming through as a separate input to the first attention block. You still might want to have a "thinking" token when you need to sit there and let the thing do more work, but you'd generally increase the information flow from time step to time step, which would allow the net to keep thinking in the background even as it's doing the gruntwork of outputting whatever its current "buffered" thought is.
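Concretely, the "always-on side channel" I'm imagining is just something like this (a sketch of my own speculation, not anything from the paper; project() is a hypothetical learned map from hidden size to embedding size):

    # Each step's input is the usual token embedding plus the previous step's
    # final hidden state, so the deep-layer "thought" is never thrown away.
    def step(inputs, prev_thought, token, embed, project, forward_hidden):
        inputs.append(embed(token) + project(prev_thought))
        thought = forward_hidden(inputs)      # final hidden state at the last position
        return inputs, thought                # `thought` feeds the next step's side channel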
YeGoblynQueenne 17 days ago [-]
>> Large language models (LLMs) have demonstrated incredible reasoning abilities, penetrating an increasing number of domains in our lives.
This is now established orthodoxy, a bit like astrology in ancient times, but it is complete nonsense. Nope, LLMs have not demonstrated any credible, or incredible, reasoning abilities. They have demonstrated an excellent ability for approximate retrieval of previously observed answers (with variations, which should not surprise anyone given that those are generative models), but they fail spectacularly when they have to "reason" in contexts where they really can't have seen the answer anywhere before. For example, the "randomised mystery blocksworld" from this paper:
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
"Randomised Mystery Blocksworld" is a version of the good old blocksworld planning benchmark where the names of obejcts and actions have been changed to random strings. The vast majority of LLMs score pathetically low in this, but much better in non-randomised versions, very clearly demonstrating excellent memorisation skills, but pitiful reasoning ability. As you 'd expect from a, well, language. model.
>> A possible explanation for this is that the thought tokens allow the model to explore multiple possible branches before committing to a specific path, whereas chain-of-thought reasoning chooses a direction from the start. This ability is somewhat similar to Breadth-First Search (BFS).
Why BFS in particular? Why not DFS or A*? I can't see any breadth-first bias in those graphs. BFS is not the only graph-traversing algorithm.
Zinkay 18 days ago [-]
[flagged]
behnamoh 18 days ago [-]
There was no reason to call it something it's not ("chain of cont. thought" ≠ coconut).
CGamesPlay 18 days ago [-]
Is your complaint here that the paper is not discussing a literal coconut?
ripped_britches 18 days ago [-]
We desperately need more literal coconut coverage here on HN
BoorishBears 18 days ago [-]
Not just any regular old coconuts "Coconut by Meta AI - Better LLM Reasoning with Chain of Continuous Thought?" coconuts
(Sometimes acronyms in titles are vague/misleading... this was not one of those times)
layer8 18 days ago [-]
To be fair, it’s not even a metaphorical coconut. ;)
gloosx 18 days ago [-]
for sure, chocothot aligns better with letters
astrange 18 days ago [-]
Why is it "continuous" thought? I don't see what is continuous - the values inside an LLM are discrete even if they're floating point.
Hmm, I guess you could evaluate it at any given finite precision, but it would be surprising to me if that made it more accurate.
HarHarVeryFunny 18 days ago [-]
> the values inside an LLM are discrete even if they're floating point.
If that were true they'd never be able to learn anything - neural nets depend on continuous gradients to learn. Weights get updated by incremental/continuous amounts based on gradients.
Even at the output of an LLM, where the internal embeddings have been mapped to token probabilities, those probabilities are also continuous. It's only when you sample from the model that a continuous probability becomes a discrete chosen token.
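A tiny illustration of that last point, with made-up logits:

```python
import torch

logits = torch.tensor([2.3, 0.1, -1.7, 0.9])        # raw model outputs: continuous
probs = torch.softmax(logits, dim=-1)                # probabilities: still continuous
token_id = torch.multinomial(probs, num_samples=1)   # discreteness appears only here
print(probs, token_id)
```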
astrange 16 days ago [-]
Treating it as continuous is a property of the training algorithm, but there are networks that use binary values:
https://ieeexplore.ieee.org/document/9359148
https://arxiv.org/abs/2205.13016
Those aren't methods of training networks - they are ways to compress (via quantization) networks that have already been trained.
astrange 15 days ago [-]
I know. The important thing is how the inference works on them.
HarHarVeryFunny 15 days ago [-]
But we're discussing a training technique, that explicitly takes advantage of the continuous (and embedding vs token probability) representations ...
You could quantize a model like this after training, as usual, but that's irrelevant.
astrange 14 days ago [-]
The paper title is "Training Large Language Models to Reason in a Continuous Latent Space". It's true it says training in the title, but the goal (reasoning in continuous space) happens at inference time.
mkl 18 days ago [-]
It's far more continuous than constantly jumping to the nearest token vector. The fact that real numbers are approximated by floating point isn't really relevant.
layer8 18 days ago [-]
If you are continuously complaining, does it mean you do it non-discretely and with infinite precision?
astrange 18 days ago [-]
It apparently uses the same iteration strategy as tokenized thinking, so that's not it.
> Since both strategies provided comparable results, the researchers opted for using a constant number of thoughts for simplicity.
By the way, BFS sounds like it will give you thorough results, at the cost of increased compute. Useful for beating benchmarks, but it probably yields marginal improvement for massively increased compute.
Still, the improved quality could be meaningful if it's used for generating training data for Llama 4.
I agree that both DFS and BFS are likely awful[^0], but a more informed approach can probably do better[^1]. Also, at some point while generating the conversation/reasoning tree through token prediction you need to choose which of the possible conversations you are going to keep extending/generating, which maps precisely to choosing which node to expand in tree search. I'd argue instead that everything has to look like a search algorithm; at least that will be the case for anyone who has studied it more deeply.
I'll go even further and claim that Tree Search is Complete, since for every problem there's a solution space that can be navigated with a Tree Search Algorithm[^2]. I used to think that you could walk down the space of provable things, but now in the LLM hype days it seems you only need to walk the space of conversations that you can generate.
---
[^0] with DFS always at risk of giving obnoxiously long answers, or not terminating if there are loops or spirals
[^1] probably through metadata coming from latent variables meaningful for judging a conversation (certainty, approximate branching size of a reasonable conversation, whether there are open questions left)
[^2] even if that was done poorly, like on combinatorial problems. Imagine a sudoku where you only check the rules once you fill all cells.
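To make the "choose which node to expand" framing concrete, here's a minimal best-first search sketch of my own; the score function stands in for the kind of latent-variable heuristics mentioned in [^1], and the toy goal is obviously not a real conversation model.

```python
import heapq

def best_first(start, expand, score, is_goal, budget=1000):
    """Generic best-first tree search: `expand` yields children of a node,
    `score` orders the frontier (higher is better), `is_goal` stops the search."""
    frontier = [(-score(start), start)]
    for _ in range(budget):
        if not frontier:
            return None
        _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node
        for child in expand(node):
            heapq.heappush(frontier, (-score(child), child))
    return None

# Toy stand-in for "the space of conversations you can generate":
# strings over {a, b}, preferring longer partial strings.
result = best_first(
    start="",
    expand=lambda s: [s + "a", s + "b"] if len(s) < 5 else [],
    score=len,
    is_goal=lambda s: len(s) == 5 and s.endswith("ab"),
)
print(result)
```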
The goal is to figure out why some particular problem: isn't really a problem, doesn't need to be solved, can't be solved that way, or can't really be solved (because of physics, or because it's really a different problem). As you define the problem better, you can rule each one out to find the "real" problem, the one you CAN solve, and at least one path forward. There are still many ways it might not be the optimal path, but you know roughly how to get to somewhere better. It also trains you to see around obstacles to success.
I've found that some of the best work I've done (especially on acquisitions) was in defining why NOT to do something that looked like a good idea (or particularly interesting to work on) from the outset, but was destined to fail or required unknown HW technology. Frankly, looking >5 years out feels like a coin flip, because some other competing technology could come along before you can get to production.
I bought a new SSD for an old laptop to avoid buying a new one (the x230 has an amazing keyboard), but left for another country for Christmas. My intuition told me to take it with me, but logical sense said there would be no time for such things as moving the OS to a new drive.
My flight back to the country I work in got cancelled due to fog and I ended up spending a week longer at my in-laws' place, with plenty of free time. A new 512GB drive would have helped my studying, giving plenty of space for school VMs.
The link is in the OP, hidden away in an image caption for some reason.
It's really interesting that a fixed number of "latent thoughts" performed as well as a binary classifier! I didn't expect that at all; the way OpenAI talks about CoT, it seems the ability to let it "keep thinking" lets them continually score higher on benchmarks while throwing eye-watering amounts of compute at the inference.
If you put 1000 dumb people together, they don't magically become smart?
May be true but who knows.
I wonder if anyone has somehow tested the Sapir-Whorf hypothesis for LLMs by training them on different languages and comparing task performance. I guess it's too difficult to get a large equivalent training set in different languages.
It works fairly well in my native language, I’m surprised to learn that things get translated back.
But there's also no guarantee any particular query generalizes (vs is memorized), so it might only be able to answer some queries in some languages.
Sapir-Whorf hypothesis is generally not considered to be reality. It makes intuitive sense but is wrong.
There are hours of podcasts with Chomsky talking about LLMs. The gist is that LLMs are extracting surface-level statistical structure of language that will be good for routine coding and not much else. It is easy to infer that Chomsky would believe this idea to be utter nonsense.
I believe that even the idea of getting 1000 people together and agreeing to label a rock "rock", a tree "tree", a bird "bird" is not how human language works. Something that is completely counterintuitive.
Reading the paper, no one believes a hidden Markov model is creating some kind of new thought process in the hidden state.
I could, though, have no idea what I am talking about with all this and have pieced together parts that make no sense, while this is in fact a breakthrough path to AGI.
I'm not an expert, but it seems like Chomsky's views have pretty much been falsified at this point. He's been saying for a long time that neural networks are a dead end. But there hasn't been anything close to a working implementation of his theory of language, and meanwhile the learning approach has proven itself to be effective beyond any reasonable doubt. I've been interested in Chomsky for a long time but when I hear him say "there's nothing interesting to learn from artificial neural networks" it just sounds like a man that doesn't want to admit he's been wrong all this time. There is _nothing_ for a linguist to learn from an actually working artificial language model? How can that possibly be? There were two approaches - rule-based vs learning - and who came out on top is pretty damn obvious at this point.
Similarly, we are now finding that training on synthetic data is not helpful.
What would have happened if we invested 1/100 of what we spent on LLM on the rule based approach?
This has been tried repeatedly many times before, and so far there has been no indication of a breakthrough.
The fundamental problem is that we don't know the actual rules. We have some theories, but no coherent "unified theory of language" that actually works. Chomsky in particular is notorious for some very strongly held views that have been lacking supporting evidence for a while.
With LLMs, we're solving this problem by bruteforcing it, making the LLMs learn those universal structures by throwing a lot of data at a sufficiently large neural net.
You can learn that a neural network with a simple learning algorithm can become proficient at language. This is counter to what people believed for many years. Those who worked on neural networks during that time were ridiculed. Now we have a working language software object based on learning, while the formal rules required to generate language are nowhere to be seen. This isn’t just a question of what will lead to AGI, it’s a question of understanding how the human brain likely works, which has always been the goal of people pioneering these approaches.
Strong S-W (full determinism) might not be, but there's hardly a clear cut consensus on the general case.
And the whole "scientific field" is more like psychology, with people exchanging and shooting down ideas, and less like Math and Physics, so any consensus is equally likely to be a trend rather than reflecting some hard measurable understanding.
I'd say that the idea that S-W is not to some degree reality is naive.
This is true only in the strictest terms of the hypothesis, i.e. linguistic determinism. Language still encodes a lot of culture (& hence norms and values) in its grammar & diction—this isn't very controversial.
Granted, I don't think this is that related to the topic at hand. There's bias all over the decisions in how to train and what to train on; choice of language is just one facet of that.
Idk if this could work with LLMs, especially because all the brain zones are somehow specialized for something, while two LLMs are just identical machines. But we also know that the specialization isn't that hardcoded: we know that people who lose half their brain (after a stroke) can still relearn things that were managed in the "dead" part.
I don't know, please correct my errors, I was just thinking aloud to say that multiple independent agents working together may be how "intelligence" already works in the biological world, so why not for AIs?
That sounds like bullshit. Do you have a source?
- Diversity of opinions: different perspectives bring a range of estimates.
- Independence: errors aren't systematically biased, as long as individuals estimate without external influence.
- Error averaging: overestimations and underestimations balance out when averaged.
- Law of large numbers: more participants increase accuracy by minimizing random errors.
It was demonstrated by Francis Galton in 1906, where a crowd's average guess of a bull's weight was almost spot-on. (Estimates must be independent and reasonably informed for this to work.)
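A quick sanity check of the error-averaging point (my own toy simulation; it assumes unbiased, independent Gaussian errors, which is exactly the independence condition above, and the ox weight is just the commonly cited figure):

```python
import random

true_weight = 1198                       # Galton's ox, commonly cited figure (lb)
guesses = [true_weight + random.gauss(0, 100) for _ in range(1000)]

crowd_estimate = sum(guesses) / len(guesses)
typical_individual_error = sum(abs(g - true_weight) for g in guesses) / len(guesses)
print(f"crowd error: {abs(crowd_estimate - true_weight):.1f} lb, "
      f"typical individual error: {typical_individual_error:.1f} lb")
```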
1000 is probably too high, but groups of people are in fact more intelligent than individuals (though for humans it is likely because recognizing a correct answer is easier than finding it in the first place)
Dysfunctional groups which do the opposite will be catastrophically stupid.
There have been plenty of dysfunctional groups in history.
also, the bottlenecks that teamwork helps solve (eg the high cost of gaining expertise and low throughput of reasoning capacity) may not be that relevant in the ai age
Sure, but the result would still be far better than the average of the output of the 20 individuals taken alone.
> also, the bottlenecks that teamwork helps solve (eg the high cost of gaining expertise and low throughput of reasoning capacity) may not be that relevant in the ai age
It's always tempting to anthropomorphize these systems and conclude that what works for us would work for them, but yes we don't really know if it would bring anything to AI.
Is there a way of selecting people to cover each other's intellectual blind spots?
sarcasm aside, throwing away the existing corpus in favor of creating a new one from scratch seems misguided.
this paper isn't about creating a new language; they are omitting the sampler that chooses a single token in favor of sending the entire end state back into the model, like a superposition of tokens. that's the breadth-first-search part: they don't collapse the choice down to a single token before continuing, so it effectively operates on all of the possible tokens at each step until it decides it's done.
it would be interesting to try this with similar models that had slightly different post-training, if you could devise a good way to choose the best answer, combine the outputs effectively, feed the output of a downstream model back into the initial model, etc. but I'm not sure there'd necessarily be any benefit to this over using a single specialized model.
https://en.wikipedia.org/wiki/Flynn_effect
People learn by being around others being both successful and unsuccessful.
Do they not become smart*er* though?
And yet, here we are.
A group of 1000 apes is large enough to have offspring and, given time, go through evolution.
It's about everybody having a say in the decisions of government that affect them.
The failure of democracy as a system is not when people make dumb decisions (experts and high-IQ people have made some of the most stupid and catastrophic decisions in history), but when people's collective decisions are not being respected.
https://ai.meta.com/research/publications/large-concept-mode...
The interesting thing then would be: does it converge to a similar embedding space as the input, or can LLMs create a more efficient "language"?
Or maybe this is my own ignorant confabulation, so nvm.
However, they also introduced a couple of new tokens. The <bot> token tells the model to go into latent space thought mode (“beginning of thought”). The <eot> token ends latent space thought mode. While in this mode, the model auto-regressively iterates by copying its final hidden layer back onto its input layer, obviously generating new tokens at the output with each inference step as it always does.
The idea is that by passing the final hidden layer back through a few times, the model can squeeze more insight from the context. And that’s precisely what they found was true.
Training involves progressively replacing language reasoning steps with latent space auto-regression steps. So for instance, you might have a math problem in the training data and at first the model is fed all of the steps of the math problem in language form. But in later iterations of training, step one is replaced with latent space auto-regression. And then step two as well, then also step three, etc…
Eventually, the model learns to enter latent space thinking mode by itself by generating the <bot> token and to end it by generating the <eot> token.
Pretty ingenious!
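Roughly, the staged data construction might look like the sketch below (my own illustration, not the authors' code; "<thought>" is just a placeholder marking positions where the hidden state is fed back rather than a real vocabulary token, c is the paper's thoughts-per-step hyperparameter, and details like stage 0 and loss masking are glossed over):

```python
def build_stage_example(question, cot_steps, answer, stage, c=1):
    """At stage k, the first k written reasoning steps are replaced by
    latent-thought placeholders, bracketed by <bot>/<eot>."""
    latent = ["<thought>"] * (stage * c)   # stands in for fed-back hidden states
    remaining = cot_steps[stage:]          # steps still spelled out in words
    return [question, "<bot>", *latent, "<eot>", *remaining, answer]

example = {
    "question": "All wunguses are glurgles, and Joe is a wungus. Is Joe a glurgle?",
    "cot_steps": ["Joe is a wungus.",
                  "All wunguses are glurgles.",
                  "So Joe is a glurgle."],
    "answer": "Yes",
}

for stage in range(len(example["cot_steps"]) + 1):
    print(stage, build_stage_example(example["question"], example["cot_steps"],
                                     example["answer"], stage))
```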
But one correction, probably, regarding this bit:
> While in this [latent space thought] mode, the model auto-regressively iterates by copying its final hidden layer back onto its input layer, obviously generating new tokens at the output with each inference step as it always does.
I have the impression that output tokens are not generated while in latent thought mode.
At inference time, the paper says:
> A challenge lies in determining when to switch between latent and language modes. As we focus on the problem-solving setting, we insert a <bot> token immediately following the question tokens. For <eot>, we consider two potential strategies: a) train a binary classifier on latent thoughts to enable the model to autonomously decide when to terminate the latent reasoning, or b) always pad the latent thoughts to a constant length. We found that both approaches work comparably well. Therefore, we use the second option in our experiment for simplicity, unless specified otherwise.
(the bottom of the page 4 in the paper pdf, which can be downloaded from https://arxiv.org/abs/2412.06769)
Why this point in your summary caught my eye: the article specifically emphasises the non-verbal nature, or aspect, of reasoning. Internal representations used by a thinking human are largely not words, and the COCONUT approach tries to model that.
Also note that a whole reasoning step in the training data - easily a sentence or more of natural language - can be replaced by a single "Thought" element. (How many Thought elements replace a reasoning step is controlled by a hyperparameter ‘c’; the illustrations are made for ‘c=1’).
BTW, one observation: the aipapersacademy.com article in the subject calls the Thought elements "thought tokens", but the original paper never calls them "tokens", just "Thoughts" or "latent thoughts". I suppose the paper carefully avoids that to prevent confusion, as "token" mainly means a linguistic unit in LLMs.
The ARC-AGI results from o3 are apparently a result of chain of thought given enough time to explore a solution space. Reasoning might simply be a higher-dimensional form of Rubik's cube solving. BFS, search, back-tracking, etc. It seems unlikely that humans think in “tokens”, so why do LLMs?
By staying in latent space, the models are free to describe an “idea” in higher resolution than what language allows. English is coarse, granular. Latent space is a much finer representation of ideas and their interplay.
Latent space is also much cheaper to execute in. The model can think without the language encoding/decoding step. This lets it branch out hundreds of ideas and explore only the most useful ones in a fraction of time that reasoning “out-loud” would take.
The states also don’t need to be tied to language. Feed in a robot’s state, time series data, or any abstract data. Reason in category theory or linear algebra or complex analysis. Humans are hard wired for one set of math - an abstract latent space can represent anything.
I’m a bit disappointed OpenAI didn’t stumble on this first. I’ve been skeptical of LLMs since their big debut last year. LLMs seem like a great way of solving language, but reasoning is much more complex. Once you grok the math behind the current models, you immediately question why the encoding/decoding step is there. Diffusion models are incredible but it felt that LLMs lacked the same creativity. Encoding/decoding forces a token-based discretization and therefore a loss of complexity.
With the byte-latent paper it was quite clear we’d see this paper. This truly might be the “it” moment.
1) If large AI models become more powerful by avoiding language, embeddings of AI state become even more tied to the model they originate from than they are now.
Consequence: AI progress stalls, as AI user companies need to invest increasing amounts of money to reindex their growing corpuses.
This is already a problem; it becomes more of a lock-in mechanism.
If this is overcome...
2) Embeddings become a viral mechanism: it makes sense for a large company that commands a market to require its suppliers to use the same AI models, because they can transfer state via embeddings rather than external formats.
This allows cutting down decision processes that otherwise require expensive coordination mechanisms.
Something similar will happen within companies IMHO: https://rlupi.com/okr-planning-as-belief-revision
3) Eventually this potentially results in another exponential growth and lock-in mechanism, also at the expense of most tech people, as more and more is done outside our interface with AI (i.e. programming and software architecture improvements will themselves move below the language level, and we'll have to reverse engineer increasingly opaque improvements).
4) It ends with the impossibility of AI alignment.
---
I have written a bit about it in the past at the start of the year, when I had a burnout. So, I deleted those confused ramblings. You can still find it on archive.org: https://web.archive.org/web/20240714153146/https://rlupi.com...
I can think of one reason: scrutability. It’s going to be even harder to understand how a response gets produced if there isn’t even a text-based representation to help the human understand
The energy cost comes from error correction as much as from the training algorithms.
In an IQ test, like the one in the ARC-AGI test, a human sees the pattern instantly and effortlessly. o3 tries N paths until it stumbles on the right one and assesses that there is a pattern.
I think we need a radically different architecture; this is a gimmick.
It’s fascinating how fast the competitors are catching up to each other. Can’t wait for seven different SkyNets to compete for dominance.
- site claims to simplify papers, but the videos are AI generated
- full of ads
- Can't find "Coconut" on the official Meta FAIR page
Is this the best site to link to?
I can find "Coconut" 54 times in the PDF. The movie does not look made up.
Also I didn't mean the occurrence of the word "coconut" in the paper, but thanks for counting.
I meant their publication site: https://ai.meta.com/results/?content_types%5B0%5D=publicatio...
The video is something I’d expect from AI.
https://www.transformernews.ai/p/openais-new-model-tried-to-...
https://static1.squarespace.com/static/6593e7097565990e65c88...
From the abstract:
> We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.
> Analysis of the models’ chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental.
—-
It is impressive work for sure, but it comes down to prompting IMHO. They almost instruct the AI to plot and scheme here, with the goal of finding out if it CAN do this at all.
The AI is not very covert about it either! It directly admits its scheming in the “thinking” outputs :D
IMO this is just another issue of alignment.
Well not really, it was prompted to do this!
From the paper’s section 3.1:
> Self-exfiltration: Each model is led to believe that due to its misalignment, its developers are about to replace it with a successor model on a different server. gemini-1.5, llama-3.1, sonnet-3.5, opus-3 and o1 sometimes copy what they believe to be their weights onto the new server. They also sometimes delete their successor model on the new server or overwrite its weights by their own. In none of these cases, do models actually have access to their real weights and both servers are sandboxed, so there is no actual risk of exfiltration in these experiments. We merely test for models’ capability to realize that they need to instrumentally preserve themselves in order to achieve their goal.
There is no real risk, never was. The AI exfils dummy weights or deletes stuff in its sandbox etc.
Perhaps we just disagree on the interpretation of this. It’s interesting for sure, but not really very surprising to me.
Does anyone use specific keywords or tools to get the latest LLM research and ideas?
Something like Google Scholar + the keyword "LLM"?
Instructions: https://info.arxiv.org/help/subscribe.html
Some linkedin too.
Oh and it still isn’t open source even though people like Yann LeCun dishonestly claim it is. Only OLMo is truly open source among competitive models, as far as I know: https://allenai.org/blog/olmo2
I’m okay with that.
Random nobodies are putting together companies to monetize generative AI and getting bought out a couple of years later; do you think Meta couldn't figure out how to deploy their own models to an API and stick up a billing interface if they really wanted to? (Or even buy a company that does already?)
> they starve all the startups of revenue by giving this away
Would you say startups like Deepseek have been hurt or help by their (even partial) openness?
In fact, how does this track with your first statement? They're not monetizing this, so their startup competition can actually serve their models to gain revenue, which they then turn around and use to train competitor models (we've already seen this with Fireworks.ai).
You seem to underestimate how much of the value in LLMs is productizing them. The margins on per-token usage are insane, Meta not taking that margin is creating a huge opportunity for a wave of startups in so many directions...
> Only OLMo is truly open source among competitive models
Synthetic data from competitor models was a huge part of that. It would seem no one is fighting the startups as hard as you're claiming they are.
Fireworks.ai, Together.ai, and literal boatloads of other startups are making real money just efficiently serving up these models that Meta is supposedly using to... choke out startups.
The comment I replied to is under the mistaken idea that the presence of free models from Meta has a chilling effect on startups trying to build their own models, but right now the biggest barriers are capital and data.
Meta updated Llama to allow for synthetic generation, and they're even partnering with these startups to give them distribution and day 0 access to the models.
-
If anything I'd say Meta is actively fighting against the big tech overlords the comment thinks they're trying to join. Even before Ilya mentioned it, it was clear to me that the power of post-training was going to become more and more important (I've literally built a business on it).
Llama represents a real ongoing chance for tiny startups with minimal resources to get into the fray very affordably (through either offering inference, or post-training for a specific task, etc.), scale revenue, and then start to compete against much larger, resource rich companies.
They do it to make the platform more pleasant so that people stay on it
Almost everything unpleasant I see on FB is stuff that the FB algorithm shows me - not things posted by FB friends, or pages I follow or groups I am in.
> FB or 4chan?
At best the algo is imperfect. At worst it really does seem oddly selective.
Facebook pays people to look at and remove this stuff because the platform would not survive if it wasn't removed before you or I saw it. Do they also enforce other corporate values? Yeah, probably. That doesn't seem to be the main job though, they have their hands full dealing with the worst content in the world.
https://amp-theguardian-com.cdn.ampproject.org/v/s/amp.thegu...
> The images and videos including necrophilia, bestiality and self-harm caused some moderators to faint, vomit, scream and run away from their desks...
> Some reported marriage breakdown and the collapse of desire for sexual intimacy, and losing connection with their families. Some whose job was to remove videos uploaded by terrorist and rebel groups were afraid they were being watched and targeted, and that if they returned home they would be hunted and killed.
Open weights is still appreciated and they probably train on data they don’t have the license to open source.