This piece is basically set up to fail. It's historical, based on math, and doesn't venture towards any drama or stakes. I can easily imagine it being a farticle on a third-rate SEO webzine.
So why does it work okay as a New Yorker piece? How is their writing consistently good?
I think their secret sauce is the (implied) perspective. The impression that the author has a unique, complete, accurate take on the subject, and is letting you in on it piece by piece, in a meandering way.
tucnak 17 days ago [-]
You're exactly right: what you refer to as "perspective" is easily faked.
martindale 17 days ago [-]
Astroturf on HN. Never thought I'd see the day.
sdwr 14 days ago [-]
Yes I'm a shill for big "New Yorker magazine". You are so smart how did you figure it out.
If thresholding of P-values is the issue, E-values -- a recent, much more elegant, easier to work with, and more robust alternative to P-values -- solve this.
Harold Shipman with the NHS killed about one person per month for 30 years, and the response was to ask if doctors could be monitored to discover this earlier. However, the systems they conceived "eventually cast suspicion on the innocent". 25 years later the NHS is still struggling to answer these questions.
I can tell by your response that you are burdened by understanding at least one of a) what a p-value is, b) what an “LLM hallucination” means, or c) what actually causes LLMs to hallucinate.
If you set yourself free from the meaning of all the nouns in the sentence then you can get there.
mnky9800n 17 days ago [-]
I recommend setting yourself free of all nouns in general.
klysm 17 days ago [-]
Okay, if I forget what p-value means and just take it to mean probability, I guess I can see the point? It’s still wrong though
seanhunter 17 days ago [-]
Yes, it is still wrong even if you just think "probability", not "p-value".
For people who don't believe me, spin up your LLM of choice using the API in your favourite language[1] and make some query using a temperature of zero. You will find if you repeat the query multiple times you always get the same response. That is because it is always giving you the highest weighted result in the transformer output, whereas if you set a non-zero temperature (or use the default chat frontend) it does a weighted random sample.
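To make the difference between the two modes concrete, here is a minimal sketch in Python. It is not any provider's actual API: the `logits` list, the four-token vocabulary and the `pick_token` helper are all made up for illustration, and real inference stacks have more moving parts.

```python
import math
import random

def softmax(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, temperature, rng):
    """Return the index of the next token (hypothetical helper)."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Non-zero temperature: weighted random sample over all tokens.
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up scores for a four-token vocabulary (an assumption, not real model output).
logits = [2.0, 1.5, 0.3, -1.0]
rng = random.Random(0)

greedy = {pick_token(logits, 0, rng) for _ in range(100)}
sampled = {pick_token(logits, 1.0, rng) for _ in range(100)}

print(greedy)   # {0}: the same token every time
print(sampled)  # usually several different tokens
```

With temperature zero the choice is a plain argmax, so repeating the call cannot change the answer; with a non-zero temperature the same scores can yield different tokens from run to run.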
So there is no probabilistic variance between responses with temperature set to zero for a given model, but you will nonetheless find that you can get the LLM to hallucinate. One way I've found to get LLMs to frequently hallucinate is to ask the difference between two concepts that are actually the same (eg gemini gave me a very convincing looking but totally wrong "explanation" of the difference between a linear map and a linear transformation in linear algebra [2]).
Therefore the probabilistic nature of a normal LLM response cannot be the reason for hallucination, because when we turn that off we find we still get hallucinations.
The real reason that LLMs hallucinate is more mundane and yet more fundamental: hallucinating (in the normal sense of the word) is actually all that LLMs do. This is what Karpathy is talking about when he says that LLMs "dream documents". We just specifically call it "hallucination" when the results are somehow undesirable, typically because they don't correspond with some particular facts we would like the model's output to be grounded in.
But LLMs don't have any sort of model of the world, they have weights which are a lossy compression of their raw training data, so in response to some prompt they give the response that they have learned in instruction fine-tuning minimizes whatever loss function was used for that fine-tuning process. That's all. When we use words like "hallucination" we are in danger of anthropomorphising the model and using our reasoning process to try to back into how the model actually works.
[1] You need to use the programming API rather than the usual web frontend to set the temperature parameter.
[2] For the curious, it more or less said that for one of them (I forget which) you could move the origin, which turned it into an affine transformation, but it mangled the underlying maths further. The evidence has fallen out of my gemini history so I can't share it, but that sort of approach has been fruitful in the past. Neither chatgpt nor claude fall for that specific example fwliw.
BlueTemplar 17 days ago [-]
While I like to call out people for using "hallucinate" for this kind of behavior too (for language models, at least, it might actually be appropriate for visual models ?)-
> One way I've found to get LLMs to frequently hallucinate is to ask the difference between two concepts that are actually the same
-this only confirms my belief that "bullshitting" is an appropriate term to use for this behavior : doesn't exactly the same thing happen with (not savvy enough) human students ?
You call it "anthropomorphizing", and "not having a model of the world", but isn't it more like forcing a model of the world on the student / language model by the way that you frame the question ?
(Interestingly, there might be a parallel here with the article : with the language model not being a real student, but a statistical average over all students, including being "with one breast and one testicle".)
seanhunter 17 days ago [-]
Yes actually I think you're right.
zrm 17 days ago [-]
LLMs are essentially predicting what token could plausibly come next. They sort the possible next tokens by probability, often throw out the ones with very low probabilities (p-value too low) and then use a weighted random number generator to choose one from the rest.
Sometimes that means you exclude a good next token, or include a bad one, so a bad one gets chosen. And then once it has, the thing is going to pick whatever is most likely to come after that, which will be some malarkey because it has already emitted a nonsense token and is now using that as context. But whatever it is, it will still sound plausible because it's still choosing from the most likely things to follow what has already been emitted.
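The sort, discard and sample steps described above can be sketched roughly as follows. This is only an illustration: the probabilities and the fixed `cutoff` are invented, and real samplers more commonly use top-k or top-p (nucleus) truncation rather than a flat threshold.

```python
import random

def truncate_and_sample(probs, cutoff, rng):
    """Drop low-probability tokens, then sample from the survivors."""
    # Sort token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep tokens at or above the cutoff; always keep at least the top one.
    kept = [i for i in ranked if probs[i] >= cutoff] or ranked[:1]
    # Weighted random choice among the kept tokens (weights need not sum to 1).
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept

# Made-up next-token probabilities for a five-token vocabulary.
probs = [0.50, 0.30, 0.15, 0.04, 0.01]
rng = random.Random(0)

token, kept = truncate_and_sample(probs, cutoff=0.05, rng=rng)
print(kept)   # [0, 1, 2]: the two least likely tokens are discarded
print(token)  # one of the kept indices
```

Note that the cutoff here is applied to an ordinary probability; nothing in this procedure involves a hypothesis test.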
seanhunter 17 days ago [-]
A p-value isn’t just any old probability; it has a specific meaning related to hypothesis testing[1]. A p-value is the conditional probability of seeing a result at least as extreme as some observation under the null hypothesis.
Yes LLMs generate tokens using a stochastic process, so it is probabilistic. Everyone knows that, but in the normal process of generating text, LLMs aren’t doing a hypothesis test so by definition p-values are completely irrelevant to how LLMs hallucinate.
This is such a common thing to misunderstand that I'm going to respond to my own message to give an explanation, because many of the links I've found make sense once you know the lingo etc but might not before then.
Say you go into a bar and just by chance there is a football[1] match on television between Denmark and France. You see a bunch of fans of each country and you think "Hey, the Danes look taller than the French". You want to find out whether this is true in general, so to test this hypothesis you persuade them during a lull in the match to line up and get measured. As luck would have it there are exactly n people from each country.
H_0 (the null hypothesis) is that the two population means are the same. That is, that Danish people have the same average height as French people.
H_1 (the alternative hypothesis) is that Danish people are taller on average than French people (ie the population mean is larger).
So you take the average height and see that the Danes in this bar are say 5cm taller on average than the French people in this bar.
The p-value is how likely it would be to select a random sample of n people from each of two populations (one from Danes, one from French people) with an average height of the Danes in the sample being at least 5cm larger than the French if the actual average height of the underlying populations you sampled from (all Danes and all French people) were the same.
How you use a p-value is typically that if it is smaller than some threshold chosen in advance (the significance level, often 0.05) you "reject the null hypothesis" at that significance level. So in this case if the p-value was small enough you conclude that the population means are unlikely to be the same.
Actually calculating the p-value is going to depend a bit on the distribution etc but that's what a p-value is. As you can see it's not just a probability.
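As a rough illustration of the bar example, here is one way to estimate such a p-value in Python. It uses a permutation test, which is a swapped-in technique chosen because it needs no distributional assumptions (a t-test would be the more conventional choice), and the heights are invented numbers, not real data.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(danes, french, n_shuffles, rng):
    """One-sided p-value for 'Danes are taller', estimated by shuffling labels."""
    observed = mean(danes) - mean(french)
    pooled = danes + french  # under H_0 the nationality labels are interchangeable
    n = len(danes)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = mean(pooled[:n]) - mean(pooled[n:])
        if diff >= observed:  # at least as extreme as what we saw
            hits += 1
    return observed, hits / n_shuffles

# Invented heights in cm for n = 8 fans from each country.
danes = [183, 179, 186, 181, 177, 184, 180, 182]
french = [178, 174, 181, 176, 172, 179, 175, 177]

observed, p = permutation_p_value(danes, french, n_shuffles=5000, rng=random.Random(0))
print(observed)  # 5.0, the gap between the two sample means
print(p)         # estimated chance of a gap this large if H_0 were true
```

The result is the fraction of relabelled samples that show a gap at least as big as the observed one, which is the "at least as extreme, assuming the null" idea in miniature.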
[1]soccer if you're from the US
zrm 14 days ago [-]
I was vaguely aware of this and I don't know if I want to say I was just being careless or if that was the point because the original comment was kind of a mess and I was defending it in jest.
ellisv 17 days ago [-]
Would you elaborate?
svnt 17 days ago [-]
If I take it at face value they seem to be blaming LLM hallucinations on questionable training data from published science done badly while focused on p-values.
https://arxiv.org/abs/2312.08040 https://arxiv.org/abs/2205.00901 https://arxiv.org/abs/2210.01948 https://arxiv.org/abs/2410.23614
[1] https://math.libretexts.org/Courses/Queens_College/Introduct...