This is... pretty interesting! According to the abstract, models trained long enough use some feature layers for magnitude assessment and others for modular assessment (e.g. even/odd). It's surprising to me that this is a stable outcome for trained LLMs when they encounter math. Definitely not what seems simplest to my meatspace brain.
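To make that split concrete, here's a toy sketch of my own (not the paper's code): a few sinusoidal "Fourier features" of an integer, where the low-frequency features carry rough magnitude and the period-2 feature carries parity.

```python
# Toy illustration only: sinusoidal features of an integer.
# Low-frequency (long-period) features encode coarse magnitude;
# the period-2 feature encodes parity (even/odd).
import numpy as np

def fourier_features(n: int, periods=(2, 10, 100)) -> np.ndarray:
    feats = []
    for T in periods:
        feats += [np.cos(2 * np.pi * n / T), np.sin(2 * np.pi * n / T)]
    return np.array(feats)

n = 123
f = fourier_features(n)
print(round(f[0]))  # period-2 cosine: +1 for even n, -1 for odd n -> -1 here
# The period-100 pair encodes n mod 100 as an angle on a circle:
angle = np.arctan2(f[5], f[4])
print(round(angle / (2 * np.pi) * 100) % 100)  # -> 23, i.e. 123 mod 100
```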
wongarsu 36 days ago [-]
My meatspace brain can do fast accurate math up to about three-digit results. After that I fall back to iterative processes with chain-of-thought, and possibly physical scratch space. My brain can however do magnitude assessment and modular assessment in near-constant time too, which I use to verify the correctness of the chain-of-thought result.
svachalek 36 days ago [-]
Meatspace brain uses digits, LLM uses tokens. i.e. when I enter "7953 + 5205" what gpt-4o is actually computing is [48186, 18, 659, 220, 26069, 20] (https://platform.openai.com/tokenizer)
So saying it's not the simplest is an understatement by far; it's doing millions or billions of times as much math as a calculator would. Ask an LLM to generate a program to do math, rather than doing the math itself.
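If you want to see the chunking for yourself, here's a quick sketch using OpenAI's tiktoken library (assuming a version that knows about gpt-4o; the exact IDs and splits depend on the tokenizer):

```python
# How a model "sees" arithmetic: the input is split into multi-character
# tokens, not individual digits.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("7953 + 5205")
print(tokens)                             # a short list of token IDs
print([enc.decode([t]) for t in tokens])  # shows how the digits get chunked
```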
duskwuff 35 days ago [-]
Nor does it help that numbers are written with the most significant digit first, but actually computing the sum requires the digits to be evaluated in the opposite order.
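A toy sketch of that point: the schoolbook algorithm wants the ones column first, so you effectively have to reverse the digit order the text gives you.

```python
# Column addition consumes digits least-significant first, but text supplies
# them most-significant first, so we reverse before adding.
def add_by_columns(a: str, b: str) -> str:
    a, b = a[::-1], b[::-1]  # ones column first
    digits, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        carry, d = divmod(da + db + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(add_by_columns("7953", "5205"))  # 13158
```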
lsy 35 days ago [-]
Something like this seems expected, right? If you tune a statistical model to very high accuracy in "addition" over tokens, then the resulting structure of the model must correspond to some structure in the training data. And Fourier would make sense for some token like "123", which internally is represented as the integer 7633 but needs to "contain" information about the text digits for math to work. Notably, this still ends up being in some way a statistical endeavor rather than truly learning addition, as even the fine-tuned model doesn't reach 100% accuracy.
wongarsu 35 days ago [-]
> Notably, this still ends up being in some way a statistical endeavor rather than truly learning addition, as even the fine-tuned model doesn't reach 100% accuracy
If that's our metric then most humans haven't truly learned addition either
For any neural network, the standard you can expect for any learned skill is closer to a human learning that skill than to a computer programmed to do that thing. There will be occasional mistakes.
mannykannot 35 days ago [-]
How LLMs tackle addition is an interesting question in its own right, independently of whether their accuracy provides a metric for judging their ability relative to that of typical humans.
Bolwin 35 days ago [-]
Well, we have formally learned addition, but most of the time when I actually do it, I'm not doing it formally; I'm going based on some half-remembered pattern checked against statistical expectations.
I'm sure the LLM could formally do it too
godelski 35 days ago [-]
There's always a lot of research that is "expected", but there's nothing wrong with that. The two most common reasons this happens are:
1. Well somebody has got to do the work and we can't all just go around assuming stuff, even if we're pretty confident. The confirmation helps and is beneficial to the community.
2. It's obvious post hoc and you have gaslit yourself into thinking that you already knew it because you kinda knew it at a high level and you only read the result at a high level too so you entirely miss all the actual details and all the context (especially since there is so much context that never makes it into a paper[0])
3. (bonus) It addresses the same thing that someone else addressed but from a different approach and the new approach can provide additional insight.
Either way, it is beneficial to the community. Sure, nothing is groundbreaking, but that's how science is: 99% incremental steps. And hey, these days most ML papers are just an active demonstration of how well you can brute-force search optimal hyperparameters (i.e. how much compute you can afford). I see far fewer papers sufficiently isolating variables and actually providing strong evidence for the things claimed as novel, not recognizing that benchmark results are far from sufficient. I blame reviewers for that, but also see the rant in [0]
[0] I think the hardest thing about beginning a PhD is that you're reading a bunch of papers going like "why the fuck are they doing this?" and the problem is that you don't have enough breadth or depth to get it. You don't understand the decades long conversation of how we got here and what problems were being addressed along the way. To be fair, a lot of this is never stated explicitly and so you annoyingly have to piece everything together by reading a few hundred papers. But also, good luck providing all that context within page limits and besides, papers are written for /peers/, by which I mean niche peers, not domain peers. Ain't nobody got time to write textbooks, because you're just trying to publish so you don't perish and you're already exhausted from all the grant writing, rebuttals, bureaucratic work, and all that fun jazz.
bongodongobob 35 days ago [-]
Seems closer to "truly learning addition" (whatever that means) than what humans do. We use a mechanical algorithm to carry the 1 etc.
pinkmuffinere 35 days ago [-]
What? This is a crazy take. I feel we (humanity) have truly understood addition to a very high degree. The fact that we can see what an LLM is doing and understand "oh it's doing addition in a rather roundabout way via Fourier transforms" is a testament to just how well we understand addition: we can recognize hundreds of different ways to achieve the operation, know that they are equivalent, and pick the most convenient one for the situation
Ukv 35 days ago [-]
I think humans do two broad types of arithmetic:
#1: For small numbers, the answer is "directly" available to us without consciously applying any steps
#2: For larger numbers, we consciously apply some formal method to break the problem down into smaller steps of type #1. Like column addition, adding digit-by-digit and carrying the ones
Despite seeming direct, I'd argue that if we could see at a low-level what our neurons were actually doing for #1, like to get 3 + 5, it would likely also look roundabout. Might even be a similar process as with LLMs, approximating magnitude (~7-9) then snapping to parity (even, since 3 and 5 are odd).
LLMs should be capable of #2, including choosing an appropriate method, with chain-of-thought reasoning. But in addition to that, and I think this is what bongodongobob is getting at, LLMs appear to have a more robust #1 than us: they can accurately add far larger numbers, whereas we'd normally fall back to a step-by-step method after one or two digits.
globalnode 36 days ago [-]
Great, my mathematical nemesis is now a part of LLM functionality as well. Are people trying to make this stuff harder?
TeMPOraL 36 days ago [-]
IDK, the more I learn the more it seems to me that Fourier transform is reality's cheat code. It keeps showing up everywhere.
Like, the other day I learned[0] that if you shine a light through a small opening, the diffraction pattern you get on the other side is basically the Fourier transform of the aperture outline.
(Yes, this also implies that if you take a Fourier transform of an image and make a diffraction grating off the resulting pattern, projecting light through it should paint you the original image.)

--

[0] - https://www.youtube.com/watch?v=Y9FZ4igNxNA
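A rough numerical sketch of that claim, under the standard Fraunhofer far-field assumption (my own numpy toy, nothing from the video):

```python
# The far-field diffraction pattern behind an aperture is, up to scaling,
# the Fourier transform of the aperture; the screen shows its squared magnitude.
import numpy as np

N = 512
x = np.arange(N) - N // 2
X, Y = np.meshgrid(x, x)
aperture = ((np.abs(X) < 8) & (np.abs(Y) < 8)).astype(float)  # small square slit

far_field = np.fft.fftshift(np.fft.fft2(aperture))  # FT of the aperture
intensity = np.abs(far_field) ** 2                  # what lands on the screen
# For a square opening this is the classic 2D sinc^2 pattern.
```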
Right, physics runs on differential equations, and sinusoids/exponentials are eigenfunctions of the differential operators that appear in them.
You can project reality onto any complete basis of functions you like, but this one tends to diagonalize the physics of our universe, which is an overpowered ability inside of our universe.
mananaysiempre 36 days ago [-]
> tends to diagonalize the physics of our universe
Because it diagonalizes all good translationally invariant operators, and our universe is fond of translation invariance until you get into general relativity. (This sounds less mysterious once you learn that all good translationally invariant operators are essentially convolutions. Neither of these statements is often taught at the elementary level, probably because of the difficulty and ambiguity in defining “all good” and “are essentially”.)
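For the finite case (Z/nZ) the diagonalization half is easy to check numerically; a small sketch: a circular convolution is a circulant matrix, and the discrete Fourier modes are its eigenvectors, with eigenvalues given by the FFT of the kernel.

```python
# On Z/nZ a convolution is a circulant matrix; every discrete Fourier mode is
# an eigenvector of it, with eigenvalue equal to the FFT of the kernel.
import numpy as np

n = 8
kernel = np.array([3.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 2.0])
C = np.array([[kernel[(i - j) % n] for j in range(n)] for i in range(n)])  # C @ x = circular conv

omega = np.exp(2j * np.pi / n)
eigvals = np.fft.fft(kernel)
for k in range(n):
    v = omega ** (k * np.arange(n))            # k-th Fourier mode
    assert np.allclose(C @ v, eigvals[k] * v)  # diagonalized
```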
krackers 35 days ago [-]
> good translationally invariant operators are essentially convolutions.
I first learned this seemingly-obvious-in-hindsight corollary from another comment on HN [1] and it blew my mind. I wish it were included in the usual descriptions of why we choose a complex exponential basis for things like the Laplace transform. It's all well and good that they're eigenfunctions of translations, but it still left me wondering why we care about that in the first place.
(If your first exposure is instead from a physics or EE perspective, I suspect the framing would be more obvious, as compared to how it's usually introduced in DiffEq when the choice of basis just seems like a "neat" trick given that it behaves well under differentiation).

[1] https://news.ycombinator.com/item?id=37915848
> It’s all well and good that they’re eigenfunctions of translations, but it still left me wondering why we care about that in the first place.
You don’t need the convolution statement to see that (as the comment you linked above also demonstrates). A good linear algebra course should have the statement that any set of commuting operators has a common eigenbasis[1]. In particular, if an operator has nondegenerate eigenvalues, then its (essentially unique) eigenbasis is also an eigenbasis for any operator that commutes with it. Take a translation as the former operator and any translation-invariant operator as the latter and you see why all of these just got diagonalized simultaneously.
Above I’ve blatantly ignored all the infinite-dimensional problems that arise when attempting to explain grown-up Fourier transforms, but literally this is actually enough if your Fourier transform is finite-dimensional—the Fourier transform on Z/nZ (aka the discrete Fourier transform on a circle) is most commonly used in applications, but literally everything goes through word for word on an arbitrary finite Abelian group. If your mental powers of abstraction feel like they should be able to acquire some intuition about the real case from the finite case, I highly recommend you read Paul Garrett’s note on the topic[2].
That said, yes, the statement on convolution operators is unreasonably hard to find or stumble upon in the literature. Part of it is that stating it properly is annoying and fussy. Another part is purely terminological: the term “convolution operator” is really rare among books younger than half a century. The usual term is instead “Fourier multiplier” or just “multiplier”, which basically makes sense iff you already know the convolution theorem. Searching for the modern term should give you a plethora of sources. (AFAIU, part of the motivation for this switch is that working with the Fourier transform of your convolution kernel instead of the kernel itself allows one to avoid distributions / generalized functions—and the associated hardcore functional analysis—longer. Consider that, if you want to use kernels, already the literal identity operator forces you to work with the Dirac delta.)
[1] If A and B commute, then any eigenspace ker(A-tE) of A is invariant under B, so decompose your space into a direct sum of eigenspaces of one of your operators and recurse on each. Choose an arbitrary basis when the operators run out.

[2] https://www-users.cse.umn.edu/~garrett/m/repns/notes_2014-15...
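A tiny numerical illustration of the commuting-operators point on Z/nZ (my own sketch, reusing the circulant idea from above):

```python
# The cyclic shift T commutes with any circulant C, and the Fourier modes are
# simultaneously eigenvectors of both -- the finite-dimensional version of the
# "common eigenbasis" statement.
import numpy as np

n = 6
T = np.roll(np.eye(n), 1, axis=0)      # cyclic shift: (T x)[i] = x[(i-1) % n]
kernel = np.array([2.0, 1.0, 0.0, 0.5, 0.0, 0.0])
C = np.array([[kernel[(i - j) % n] for j in range(n)] for i in range(n)])

assert np.allclose(T @ C, C @ T)       # translation invariance = commuting with T

omega = np.exp(2j * np.pi / n)
for k in range(n):
    v = omega ** (k * np.arange(n))                       # Fourier mode
    assert np.allclose(T @ v, omega ** (-k) * v)          # eigenvector of T...
    assert np.allclose(C @ v, np.fft.fft(kernel)[k] * v)  # ...and of C
```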
(Almost-)linear models do linear things, it seems, and the Fourier transform is the quintessential linear thing.
It is also an extremely neat piece of the real world, but I’m hesitant to guess your background and offer an explanation because your phrasing makes me suspect an engineering one. With concepts usually being the first to be culled in a course targeted at engineers, there could be quite a bit of concept debt to pay off before I could really offer something I could honestly call an explanation.
Have you tried the 3Blue1Brown video on the topic[1]? It does not AFAIR offer any answers as to why the Fourier transform should exist or be useful, but it does show very well what it does in the immediate sense.

[1] https://www.youtube.com/watch?v=spUNpyF58BY
Thanks for your reply. I think I'm going to have to start reading up on physics/differential equations. My linalg is ok but quite a bit of my computing background has been "here, this is how you calculate it" instead of concepts. I really feel there's something about Fourier that seems pretty important.
almostgotcaught 36 days ago [-]
You gotta be in the in-crowd to understand that this paper, like so many others, is one of those dumb post-hoc analogy/metaphor papers. These papers are where they just ran a bunch of experiments (i.e. just ran the training script over and over) and formulated a hypothesis empirically. Of course, in order to lend the hypothesis some credibility they have to make an allusion to something formal/mathematical:
> Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain
Brilliant and very rigorous!
wongarsu 36 days ago [-]
Curious that they chose to use GPT-2-XL, given the age of that model. I guess they wanted a small model (1.5B) and started work on this quite a while ago. Today there is a decent selection of much more capable 1.5B models (Qwen2-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, TinySwallow 1.5B, Stella 1.5B, Qwen2.5-Math-1.5B, etc). But they are all derived from the Qwen series of models, which wasn't available when they started this research.
Sharlin 36 days ago [-]
You can think of GPT-2 as the D. melanogaster of language models.
nickpsecurity 36 days ago [-]
I was collecting examples of models trained on a single GPU or at very low cost. A number of projects used BERT or GPT-2 since the implementations were very simple, with some components optimized. There are also a lot of projects that have trained BERT and GPT-2 models, which makes for more scientific comparisons.
With no other information, those would be my guesses as to why one would use a GPT2 model.
scoresmoke 36 days ago [-]
GPT-2 follows the very well-studied decoder-only Transformer architecture, so the outcomes of this study might be applicable to more complicated models.
littlestymaar 36 days ago [-]
TinyLlama would have worked too and is older than the Qwen family.
imjonse 36 days ago [-]
The paper predates Qwen2 and R1, this work is probably a year old.
ImHereToVote 36 days ago [-]
I understand GPT-2's internals have already been mapped to a certain extent.
DoctorOetker 36 days ago [-]
> Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy.
what's the convention on the meaning of "pre-training" vs "training from scratch" ?
Is this a nomenclature shift?
currymj 36 days ago [-]
pre-trained model would mean training a language model to predict text, then starting from there and training it to add numbers.
training from scratch would be initializing a neural network, and training it to add numbers directly.
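In code terms, a hedged sketch of that distinction using the HuggingFace transformers API (not the paper's actual setup):

```python
# "Pre-trained": start from GPT-2 weights learned by language modelling on text,
# then fine-tune on addition. "From scratch": same architecture, random init,
# trained only on addition.
from transformers import GPT2Config, GPT2LMHeadModel

pretrained = GPT2LMHeadModel.from_pretrained("gpt2")  # then fine-tune on "a + b = c" data
scratch = GPT2LMHeadModel(GPT2Config())               # train on the same data from random weights
```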
dvrp 35 days ago [-]
Has anyone seen a similar paper to this applied to DiTs or diffusion in general? (or autoregressive models for image generation)
metadat 36 days ago [-]
Does this also hold for other functions such as sin, asin, multiplication, division, etc?
I agree that the commentary is nice, and should be paid attention to, but it's also from 2007. So perhaps you mean to say "it would be useful to consider this paper in the context of this 2007 commentary"? I would agree with that.
(Either that, or you linked to the wrong commentary. https://www.lesswrong.com/posts/E7z89FKLsHk5DkmDL/language-m... would be closer, other than being a different paper.)
Interpreting Modular Addition in MLPs https://www.lesswrong.com/posts/cbDEjnRheYn38Dpc5/interpreti...
Paper Replication Walkthrough: Reverse-Engineering Modular Addition https://www.neelnanda.io/mechanistic-interpretability/modula...