Can you spot conceptually similar stories by their shape?
For instance, what is the shape of The Ugly Duckling compared to Rudolph the Red-Nosed Reindeer? They are essentially the same story, so presumably on some dimension you should be able to spot them in a group of unrelated stories.
nutanc 16 days ago [-]
Will check for these particular stories. But yes, when we tried this on some stories with a similar arc we saw that their path is similar in the semantic space.
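A minimal sketch of one way to compare story "shapes" as trajectories in embedding space - this assumes the sentence-transformers package; the model name, the centering step and the resampling length are illustrative choices, not the method from the linked articles:

    # Sketch: treat each story as a trajectory of sentence embeddings and
    # compare trajectories after centering (to emphasise shape over topic).
    # Assumes the sentence-transformers package; the model name is illustrative.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def story_shape(sentences, n_points=20):
        emb = model.encode(sentences)                 # (num_sentences, dim)
        emb = emb - emb.mean(axis=0)                  # remove the story's overall "topic"
        emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9)
        idx = np.linspace(0, len(emb) - 1, n_points)  # resample to a fixed length
        return np.array([emb[int(round(i))] for i in idx])

    def shape_similarity(story_a, story_b):
        a, b = story_shape(story_a), story_shape(story_b)
        return float(np.mean(np.sum(a * b, axis=1)))  # mean cosine of aligned points

Whether this actually flags The Ugly Duckling and Rudolph as "the same story" would of course have to be checked empirically.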
stravant 17 days ago [-]
This feels like a failure to learn the bitter lesson: You're just taking the translation to concepts that the LLM is certainly already doing and trying to make it explicitly forced.
mdp2021 17 days ago [-]
It is explicitly stated in the paper that
> One may argue that LLMs are implicitly learning a hierarchical representation, but we stipulate that models with an explicit hierarchical architecture are better suited to create coherent long-form output
And the problem remains that (text surrounding the above):
> Despite the undeniable success of LLMs and continued progress, all current LLMs miss a crucial characteristic of human intelligence: explicit reasoning and planning at multiple levels of abstraction. The human brain does not operate at the word level only. We usually have a top-down process to solve a complex task or compose a long document: we first plan at a higher level the overall structure, and then step-by-step, add details at lower levels of abstraction. [...] Imagine a researcher giving a fifteen-minute talk. In such a situation, researchers do not usually prepare detailed speeches by writing out every single word they will pronounce. Instead, they outline a flow of higher-level ideas they want to communicate. Should they give the same talk multiple times, the actual words being spoken may differ, the talk could even be given in different languages, but the flow of higher-level abstract ideas will remain the same. Similarly, when writing a research paper or essay on a specific topic, humans usually start by preparing an outline that structures the whole document into sections, which they then refine iteratively. Humans also detect and remember dependencies between the different parts of a longer document at an abstract level. If we expand on our previous research writing example, keeping track of dependencies means that we need to provide results for each of the experiment mentioned in the introduction. Finally, when processing and analyzing information, humans rarely consider every single word in a large document. Instead, we use a hierarchical approach: we remember which part of a long document we should search to find a specific piece of information. To the best of our knowledge, this explicit hierarchical structure of information processing and generation, at an abstract level, independent of any instantiation in a particular language or modality, cannot be found in any of the current LLMs
motoboi 17 days ago [-]
I suppose humans need high-level concepts because we can only hold 7* things in working memory. Computers don’t have that limitation.
Also, humans cannot iterate over thousands of possibilities in a second, like computers do.
And finally, animal brains are severely limited by heat dissipation and energy input flow.
Based on that, artificial intelligence may arise from unexpected simple strategies, given the fundamental differences in scale and structure from animal brains.
* where 7 is whatever number is the correct number nowadays.
dr_dshiv 17 days ago [-]
I just don’t understand that — I thought deep neural nets were inherently hierarchical. Or at least emergently hierarchical?
mdp2021 17 days ago [-]
Neural nets can be made hierarchical - the most notable example is the Convolutional Neural Network, so successfully promoted by Yann LeCun.
But the issue with the LLM architectures currently in place is the idea of "predicting the next token", which clashes with the exercise of intelligence - where we instead search for the "neighbouring fitting ideas".
So, "hierarchical" in this context expresses that it is typical of natural intelligence to refine an idea - formulating a hypothesis and improving its form (hence its expression) step after step of pondering. The issue of transparency in current LLMs, together with the framing of "predicting the next token", makes it hard to check whether that typical mechanism of natural intelligence matches any tentative interpretation of LLM internals.
nightski 16 days ago [-]
Is that true? There are many attention/mlp layers stacked on top of each other. Higher level layers aren't performing attention on input tokens, but instead on the output of the previous layer.
mdp2021 16 days ago [-]
> Is that true
Well, if you are referring to «The issue of transparency in current LLMs», I have not read an essay that explains satisfactorily the inner process and world modelling inside LLMs. Some pieces say (guess?) that the engine has no idea what the whole concept in the reply would be before outputting all the tokens, others swear it seems impossible it has no such idea before formulation...
throwawaymaths 16 days ago [-]
there is a sense in which "predicting the next token" is roughly an append-only Turing machine. Obviously the tokens we're using might be suboptimal for wherever the "AGI" goalpost sits at any given time, but the structure/strategies of LLMs are probably not far from a really good one, modulo refactoring for efficiency like Mamba (but still doing token-stream prediction, especially during inference).
motoboi 16 days ago [-]
Not necessarily.
For visual tasks, that is the state of the art, with visual features being "grouped" into more semantically relevant parts ("circles" grouped into "fluffy textures" grouped into "dog ears"). This hierarchy-building behavior is baked into the model.
For transformers, not so much. Although each transformer block's output serves as input for the next block, and they can learn hierarchical relationships (in latent space, not in human language), that is neither baked into nor enforced by the architecture.
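A toy sketch of that contrast, assuming PyTorch (sizes arbitrary): in the conv stack each layer can only combine spatially local outputs of the previous one, so increasingly abstract features are forced by construction, while every transformer block attends over the whole sequence and any hierarchy is at most learned, never enforced.

    # Toy contrast (assumes PyTorch; sizes are arbitrary).
    import torch
    import torch.nn as nn

    # CNN: hierarchy is structural. Each 3x3 conv only sees a local neighbourhood,
    # so deeper layers are forced to summarise larger, more abstract regions.
    cnn = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # receptive field ~3x3 pixels
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # ~8x8 pixels
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # ~18x18 pixels
    )

    # Transformer: every block attends over the full sequence; nothing in the
    # architecture forces later blocks to operate at a "higher" level.
    block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    encoder = nn.TransformerEncoder(block, num_layers=3)

    print(cnn(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 64, 16, 16])
    print(encoder(torch.randn(1, 10, 64)).shape)  # torch.Size([1, 10, 64])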
anon373839 17 days ago [-]
The bitter lesson isn’t a law of nature, though. And as GPT-style LLMs appear to be at the foot of a scaling wall, I personally think inductive bias is due for a comeback.
Der_Einzige 17 days ago [-]
Everyone keeps claiming this but we have zero evidence of any kind of scaling wall whatsoever. Oh, you mean data? Synthetic data, agents, and digitization solve that.
anon373839 17 days ago [-]
I disagree, but I also wasn’t referring to the exhaustion of training materials. I am referring to the fact that exponentially more compute is required to achieve linear gains in performance. At some point, it just won’t be feasible to do $50B training runs, you know?
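For what it's worth, the commonly cited scaling-law fits have a power-law shape (Hoffmann et al. 2022 style; the exponents below are only indicative of that family of fits):

    % Loss falls polynomially in parameter count N and training tokens D
    L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
    \qquad \alpha \approx \beta \approx 0.3

Under a fit of that shape, halving the reducible loss term takes roughly 2^{1/\alpha} ≈ 10x more parameters (and similarly more data), so compute has to grow multiplicatively for what look like modest quality gains - which is the wall being referred to.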
throw5959 17 days ago [-]
50B still seems reasonable compared to the revenue of the Big AI companies.
Maybe Nvidia, but they are a chip / hardware maker first. And even for them, a 50B training run with no exponential gains seems unreasonable.
Better to optimize the architecture / approach first, which also is what most companies are doing now before scaling out.
throw5959 16 days ago [-]
It's not unusual to make infrastructure investments that will pay off in 30-50 years. I don't see why an AI model should be different - unless it's not true that we're at the end of scaling.
cubefox 17 days ago [-]
There were multiple reports confirming that OpenAI's Orion (planned to be GPT-5) yielded unexpectedly weak results.
pegasus 16 days ago [-]
And it's not just OpenAI facing this problem; Anthropic and Google are as well.
Der_Einzige 16 days ago [-]
So Deepseek V3 did nothing to show you how wrong this take is?
UltraSane 16 days ago [-]
And costs $500 million per training run.
UltraSane 16 days ago [-]
There seems to be an affordability wall to scaling.
Jensson 17 days ago [-]
> You're just taking the translation to concepts that the LLM is certainly already doing and trying to make it explicitly forced.
That is what tokens are doing in the first place though, and you get better results with tokens instead of letters.
mdp2021 17 days ago [-]
Well, individual letters in these languages in use* do not convey specific meaning, while individual tokens do - so you cannot really construct a ladder that would go from letter to token, then from token to sentence.
This said, researching whether the search for concepts (in the solution space) works better than the search for tokens seems absolutely due, in the absence of a solid theory showing otherwise.
(*Sounds convey their own meaning e.g. in proto-Indo-European according to some interpretations, but that becomes too remote in the current descendants - you cannot reconstruct the implicit sound-token in words directly in English, just from the spelling.)
IanCal 17 days ago [-]
Is that true? I thought there was a desire to move towards byte-level work rather than tokens, and that the benefit of tokens was more that you are reducing the context size for the same input.
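A rough, dependency-free illustration of the context-size point (the whitespace split below is only a crude stand-in for a real BPE tokenizer):

    # The same input is a much longer sequence at the byte level than at the
    # token level; that sequence length is the main cost of dropping tokenizers.
    text = "The researchers outline a flow of higher-level ideas they want to communicate."

    byte_seq = list(text.encode("utf-8"))   # one position per byte
    word_seq = text.split()                 # crude stand-in for BPE tokens

    print(len(byte_seq), "byte positions")  # 78
    print(len(word_seq), "word positions")  # 12
    # Attention cost grows roughly with the square of sequence length,
    # so ~6x more positions per layer is a substantial difference.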
fngjdflmdflg 16 days ago [-]
>there was a desire to move towards byte level work rather than tokens
Yeah, the latest work on this is from Meta just last month. [0] It showed good results.
[0] https://ai.meta.com/research/publications/byte-latent-transf... (https://news.ycombinator.com/item?id=42415122)
That should be proven. The two approaches - predicting tokens vs predicting "sentences" - should be compared to see how much their outputs differ in terms of quality.
Edit2: ...and both (and their variants) be compared to other ideas such as "multi-token prediction"...
Edit: or, appropriateness of the approach should be demonstrated after acquired "transparency" of how the LLMs effectively internally work. I am not aware of studies that make the inner workings of LLMs adequately clear.
Edit3: Substantially, the architecture should be as solid as possible (and results should reflect that).
blackeyeblitzar 17 days ago [-]
Isn’t “sentence prediction” roughly the same as multi-token prediction of sufficient length? In the end are we just talking about a change to hyperparameters, or maybe a new hyperparameter that controls the granularity of “prediction length”?
mdp2021 17 days ago [-]
> multi-token prediction of sufficient length
Is multi-token prediction the same as predicting the embedding of a complex token (the articulation of those input tokens in a sentence)?
blackeyeblitzar 16 days ago [-]
To be honest I don’t know. Maybe the only way to know is to build and measure all these variations.
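Structurally, at least, the two targets differ, as a toy sketch shows (assuming PyTorch; dimensions are illustrative, and the regression loss is only one of the options discussed for concept-level training): multi-token prediction still emits k distributions over a vocabulary, while a concept-level model emits one continuous vector in a sentence-embedding space that a separate decoder has to turn back into words.

    # Toy contrast between the two prediction targets (assumes PyTorch).
    import torch
    import torch.nn as nn

    d_model, vocab, k, d_concept = 512, 32000, 4, 1024
    h = torch.randn(1, d_model)  # hidden state after reading the context

    # Multi-token prediction: k separate next-token distributions over the vocab.
    multi_token_head = nn.Linear(d_model, k * vocab)
    logits = multi_token_head(h).view(1, k, vocab)    # (1, 4, 32000)

    # Concept prediction: a single continuous vector in a sentence-embedding
    # space (SONAR-like), trained by regression against the embedding of the
    # next sentence; a separate decoder maps it back to text.
    concept_head = nn.Linear(d_model, d_concept)
    next_concept = concept_head(h)                    # (1, 1024)
    target = torch.randn(1, d_concept)                # embedding of the next sentence
    loss = nn.functional.mse_loss(next_concept, target)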
macawfish 17 days ago [-]
At a performance boost of 10-100x :)
mdp2021 17 days ago [-]
> Current best practice for large scale language modeling is to operate at the token level, i.e. to learn to predict the next tokens given a sequence of preceding tokens. There is a large body of research on improvements of LLMs, but most works concentrate on incremental changes and do not question the main underlying architecture. In this paper, we have proposed a new architecture,
For some, 2024 may have ended badly, but reading the lines above shines a great light of hope for the new year.
attentionmech 17 days ago [-]
I like the idea of a "concept"... you can represent a concept with language, visuals, etc., but it isn't any of those. Those are symbols used to communicate a concept or give it a representation, but at the core concepts are just connections between other concepts. The closest thing I feel to this is categories in category theory.
layer8 16 days ago [-]
Concepts need to be linked to reality somehow in order to carry any meaning. They are thus not just relations between themselves.
dr_dshiv 16 days ago [-]
Platonic forms?
attentionmech 16 days ago [-]
interesting concept they are.
rxm 16 days ago [-]
What used to be feature engineering a decade or more ago now seems to have shifted to developing distributed representations. LLMs use word tokens (for words or the entities in images). But there are many more. The 3D Fields (or whatever they have evolved to) developed by Fei-Fei Li's group represent visual information in a way better suited for geometrical tasks. Wav2Vec, the convolutional features for YOLO and friends, and these sentence representations are other examples. I would love to read a review of this circle of ideas.
inshard 17 days ago [-]
This is interesting. I wonder if such a project could dive into lower-level concepts, those akin to prime numbers. The atoms from which all other concepts are built.
lern_too_spel 17 days ago [-]
This is like going back to CNNs. Attention is all you need.
zed1726 17 days ago [-]
Quantum states are all one really needs, but it turns out that it's way too computationally expensive to simulate all that just for the purpose of AI applications - so instead we have to go to higher levels of construction. Attention is surely just about on the cusp of what is computationally reasonable, which means that it's not all we need; we need more efficient and richer constructions.
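Back-of-the-envelope for the cost being alluded to: the attention computation alone grows with the square of the sequence length (numbers illustrative):

    # Rough per-layer attention cost: scores (n*n*d) plus the weighted sum (n*n*d).
    def attention_flops(n, d):
        return 2 * n * n * d

    d = 128  # illustrative head dimension times number of heads
    for n in (1_000, 10_000, 100_000):
        print(n, f"{attention_flops(n, d):.2e}")
    # 10x more context -> ~100x more attention compute per layer.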
mdp2021 17 days ago [-]
We do not need quantum states to build (arithmetic) calculators. Nor, very probably, for complex and much more complex calculators.
katamari-damacy 17 days ago [-]
Yes, just spray Quantum on it
chronic4948412 17 days ago [-]
> Yes, just spray Quantum on it
Careful, don’t give Sam Altman any ideas.
Once OpenAI cannot raise enough capital, he will aim for quantum AGI.
snake_doc 17 days ago [-]
Attention is just communication? It’s orthogonal to the space of the representation.
benreesman 17 days ago [-]
Between this and learned patches and ModernBERT and DeepSeek?
I think it’s time to read up.
upghost 17 days ago [-]
Aside from using the word "concept" instead of "language", I don't see how this is different from an LLM. It's still doing next-token prediction. This is like in D&D where you have two swords with wildly different flavor text but ultimately they both do 1d6+1 damage.
What am I missing -- aside from the marketing? Is there something architecturally different or what? Looks like a regular autoregressive sequence transformer to me.
tantalor 17 days ago [-]
(Guessing here) It does tokenization and prediction for a whole sentence, not fragments of words.
I like this idea because that's how humans think. We mentally formulate a whole sentence, then say it. People who don't do this speak in run-ons and word salad.
botanical76 16 days ago [-]
I would be interested to know how many people do formulate a whole sentence before saying it. "Think before you speak" as they say. I feel I do not have the cognitive window or processing speed to do this; instead, I formulate a concept of how I would like to respond abstractly, and then think of and say phrases of several words one at a time until the sentence ends itself. The latter process leans heavily on some kind of next word anticipation.
mdp2021 16 days ago [-]
> how many people do formulate a whole sentence before saying it
The process is a formulation of precise ideas (complex at some level and verified to some degree, hopefully), which are then translated into sentences for output (not necessarily in those two steps, but through iterations).
This project tries to use sentences as formalizations of ideas - an interesting approach enabled by the availability of tools, and one that allows for good properties like transparency.
upghost 17 days ago [-]
Oh interesting - concepts as tokens. Yeah, I'd buy that. They do something similar with transformers in robotics, except the tokens are actions instead of word chunks. Good eye.
mdp2021 17 days ago [-]
> something architecturally different
An embedding-space engine that accepts sentences (SONAR) is fitted in, so that the "tokens" of this architecture are complex sets of the tokens of past architectures.
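A minimal sketch of how that fits together, with the SONAR encode/decode calls stubbed out as hypothetical helpers (the actual model, training objective and pre-/post-processing in the paper differ):

    # Sketch of a concept-level autoregressive loop (assumes PyTorch).
    # sonar_encode / sonar_decode are stand-ins for a SONAR-like sentence
    # encoder/decoder; here they are random stubs so the sketch runs.
    import torch
    import torch.nn as nn

    D = 1024  # sentence-embedding ("concept") dimension

    def sonar_encode(sentences):   # hypothetical: text -> embeddings
        return torch.randn(len(sentences), D)

    def sonar_decode(embedding):   # hypothetical: embedding -> text
        return "<decoded sentence>"

    # The model now runs over sentence embeddings instead of word tokens.
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
    concept_model = nn.TransformerEncoder(layer, num_layers=2)
    to_next = nn.Linear(D, D)

    context = ["The duckling was mocked by the others.",
               "It wandered off alone through the winter."]
    x = sonar_encode(context).unsqueeze(0)   # (1, n_sentences, D)
    h = concept_model(x)                     # causal mask omitted for brevity
    next_concept = to_next(h[:, -1])         # predicted embedding of sentence 3
    print(sonar_decode(next_concept[0]))

The point is only that the autoregressive unit becomes a sentence embedding (a "concept") rather than a subword token.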
YeGoblynQueenne 17 days ago [-]
From the paper:
>> In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a “concept”.
I wonder if the many authors of the paper know that what they call "concept" is what all of machine learning and AI has also called a "concept" for many decades, and not a new thing that they have just named from scratch.
For instance, classes of "concepts" are the target of learning in Leslie Valiant's "A Theory of the Learnable", the paper that introduced Probably Approximately Correct Learning (PAC-Learning). Quoting from its abstract:
> ABSTRACT: Humans appear to be able to learn new concepts without needing to be programmed explicitly in any conventional sense. In this paper we regard learning as the phenomenon of knowledge acquisition in the absence of explicit programming. We give a precise methodology for studying this phenomenon from a computational viewpoint. It consists of choosing an appropriate information gathering mechanism, the learning protocol, and exploring the class of concepts that can be learned using it in a reasonable (polynomial) number of steps. Although inherent algorithmic complexity appears to set serious limits to the range of concepts that can be learned, we show that there are some important nontrivial classes of propositional concepts that can be learned in a realistic sense.
From: https://web.mit.edu/6.435/www/Valiant84.pdf
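For reference, the standard result that makes "a reasonable (polynomial) number of steps" concrete for a finite concept class H (realizable case, consistent learner):

    m \;\ge\; \frac{1}{\varepsilon}\left(\ln\lvert H\rvert + \ln\frac{1}{\delta}\right)

That many examples suffice to be, with probability at least 1 - δ, within error ε of the target concept, so classes whose description length (hence ln|H|) grows polynomially remain learnable from polynomially many examples.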
Or take this Introduction to Chapter 2 in Tom Mitchell's "Machine Learning" (the original ML textbook, published 1997):
> This chapter considers concept learning: acquiring the definition of a general category given a sample of positive and negative training examples of the category.
From: https://www.cs.cmu.edu/~tom/mlbook.html (click the link in "the book").
I mean, I really wonder sometimes what is going on here. There have been decades of research in AI and machine learning, but recent papers look like their authors have landed in an undiscovered country and are having to invent everything from scratch. That's not good. There are pitfalls that all the previous generations have explored thoroughly by falling into them time and again. Those who don't remember those lessons will have to find them out the hard way.
mdp2021 17 days ago [-]
I am not sure that fits the point, YGQ:
it seems to me the concept of «concept» in the paper is "the embedding vector we get in systems like SONAR (which we could use to generalize ordered sets of tokens into more complex ideas)". That's pretty specific, only marginally related to past handling as mentioned.
YeGoblynQueenne 16 days ago [-]
That's only the representation of a concept. Different systems and different approaches will have different representations but that doesn't change the fact of what is being represented.
mdp2021 16 days ago [-]
But if the issue is about "research in AI has had to deal with the concept of "concept" since the inception" (and of course it had to), the contribution in this paper is to try an operational implementation that could bear fruit and possibly fix architectural shortcomings of the mainstream effort.
(It is not separate from the context of LLMs.)
YeGoblynQueenne 16 days ago [-]
Right, but there have been many operationalisations before. That's what's not new. Tom Mitchell's textbook has plenty of examples. Basically all of machine learning is about learning concepts - in practice as well as in theory. That's the whole point.
We can clearly see in 2D space itself how different "concepts" are explored.
Using the shape of stories for semantic chunking, we can clearly see in multiple articles how we can chunk by "concepts". [2]
Now we are trying to see if we can just use these chunks and train a next "chunk" predictor instead of a next word predictor.
In the paper, they take a sentence to mean a concept. We believe that a "semantic chunk" is better suited for a concept instead of a sentence.
[1] https://gpt3experiments.substack.com/p/the-shape-of-stories-...
[2] https://gpt3experiments.substack.com/p/a-new-chunking-approa...
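A rough sketch of that kind of embedding-based chunking (assuming sentence-transformers; the model name and threshold are illustrative, not the exact method from [2]): split wherever the similarity between adjacent sentence embeddings drops, so each chunk stays on one "concept".

    # Sketch: chunk a document where adjacent-sentence similarity drops,
    # so each chunk roughly corresponds to one "concept".
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def semantic_chunks(sentences, threshold=0.55):
        emb = model.encode(sentences)
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        chunks, current = [], [sentences[0]]
        for i in range(1, len(sentences)):
            if float(np.dot(emb[i - 1], emb[i])) < threshold:  # topic shift
                chunks.append(current)
                current = []
            current.append(sentences[i])
        chunks.append(current)
        return chunks

    # These chunks (rather than single sentences) would then be embedded and
    # used as the units for a next-"chunk" predictor.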