For anyone who hasn’t seen this before, mechanistic interpretability solves a very common problem with LLMs: when you ask a model to explain itself, you’re playing a game of rhetoric where the model tries to “convince” you of a reason for what it did by generating a plausible-sounding answer based on patterns in its training data. But unlike most trends of benchmark numbers getting better as models improve, more powerful models often score worse on tests designed to self-detect “untruthfulness” because they have stronger rhetoric, and are therefore more compelling at justifying lies after the fact. The objective is coherence, not truth.
Rhetoric isn’t reasoning. True explainability, like what overfitted Sparse Autoencoders claim to offer, basically yields the causal sequence of “thoughts” the model went through as it produced an answer. It’s the same way you may have a bunch of ephemeral thoughts going in different directions while you think about anything.
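For readers unfamiliar with the technique, here is a minimal sketch of the kind of sparse autoencoder (SAE) being discussed: it reconstructs a captured activation vector through a wider, sparsity-penalized latent layer. The dimensions, the ReLU/L1 choices, and the coefficient are illustrative assumptions, not the architecture of the project in question.

```python
# Minimal SAE sketch (illustrative only): reconstruct an activation vector
# through a wider latent layer with an L1 sparsity penalty.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 2048, d_latent: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        latents = torch.relu(self.encoder(x))   # sparse "feature" activations
        return self.decoder(latents), latents


sae = SparseAutoencoder()
acts = torch.randn(8, 2048)                     # stand-in for captured LLM activations
recon, latents = sae(acts)
loss = nn.functional.mse_loss(recon, acts) + 1e-3 * latents.abs().sum(-1).mean()
loss.backward()
```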
stavros 34 days ago [-]
I want to point out here that people do the same: a lot of the time we don't know why we thought or did something, but we'll confabulate plausible-sounding rhetoric after the fact.
Yes in math. Formalisms come after casual thoughts, at every step.
mdp2021 34 days ago [-]
It's totally different: those formalisms are in a workbench, following a set of rules that either work or not.
So, yes, that (math) is representative of the actual process: pattern recognition gives you spontaneous ideas, that you assess for truthfulness in conscious acts of verification.
sinuhe69 34 days ago [-]
What is a casual thought that you cannot explain in math?
TeMPOraL 34 days ago [-]
That question makes no sense. You can explain anything in math, because math is a language and lets you define whatever terms and axioms you need at a given moment.
(Whether or not such explanation is useful for anything is another issue entirely.)
worldsayshi 34 days ago [-]
Can you explain how intuition led you to try a certain approach?
TeMPOraL 34 days ago [-]
Is it enough if I hand-wave it with probability distributions, or do you want me to write out adjacency search in a high-dimensional space?
legel 34 days ago [-]
Math comes from brains.
HeavyStorm 32 days ago [-]
That's some misunderstanding of the human brain and thought process...
mdp2021 34 days ago [-]
/Some/ people bullshit themselves stating the plausible; others check their hypotheses.
The difference is total in both humans and automated processes.
catskul2 34 days ago [-]
Everyone, every last one of us, does this every single day, all day, and only occasionally do we deviate to check ourselves, and often then it's to save face.
A Nobel prize was given to Daniel Kahneman for related research.
If you think it doesn't apply to you, you're definitely wrong.
mdp2021 34 days ago [-]
> occasionally
Properly educated people do it regularly, not occasionally. You are describing a definite set of people. No, it does not cover all.
Some people will output a pre-given answer; some people check.
Edit: sniper... Find some argument.
og_kalu 33 days ago [-]
Your decisions shape your preferences just as much as your preferences shape your decisions and you're not even aware of it. Yes, everybody regularly confabulates plausible sounding things that they themselves genuinely believe to be the 'real reason'. You're not immune or special.
https://pmc.ncbi.nlm.nih.gov/articles/PMC3196841/
I will check the article with more attention as soon as I have the time, but, putting aside the question of how a similar investigation would prove that all people function in the same way,
that does not seem to counter the point that some people «check their hypotheses» - as is due. Some people do exercise critical thinking. It is an intentional process.
og_kalu 33 days ago [-]
You're not getting it.
You ask A "Why did you choose that?" > He answers "I like the color blue"
This makes sense. This is what everyone thinks and believes is the actual sequence of such events.
But often, this is the actual sequence
"Let's go with this" > "Now i like the color blue"
'A' didn't lie to you or try to trick you. He didn't consciously rationalize liking blue after the fact. He's not stupid or "prone to bad thinking". Altering your perception of events without your conscious awareness is simply something that your brain does fairly regularly.
Make no mistake: A genuinely likes blue now. The only difference is that he genuinely believes he made the choice because he liked blue, when in fact the brain, with its tendency to make you favor your choices, gave him the liking for blue afterwards so the choice sits better.
This is not something you "check your hypotheses" out of. And it's something every human deals with every day, including you.
mdp2021 32 days ago [-]
I get what you are pointing at: you are focusing with some strictness on the post from Stavros, which states that "people pseudo-rationalize with plausible explanatory theories their not-at-the-time-rational behaviour".
But I was instead focusing at the general problem in the root post from Foundry27, and to a loose interpretation of the post from Stavros: the opposition between the faculty of generating convincing fantasies vs the faculty of critical thinking. (Such focus being there because more general and pressing in current AI than the contextual problem of "explanation", which is sort of a "perversion" when compared to the same in classical AI, where the steps are recorded procedurally owing to transparency, instead of the paradox of asking an obscure unreliable engine "what it did".)
What I meant is that a general scheme of bullshitting to oneself and pseudo-rationalizing it is not the only way. Please see the other sub-branch in which we talked about mathematics. In important cases, the fantasies are then consciously checked as thoroughly as constraints allow.
So I stated «/Some/ people bullshit themselves stating the plausible; others check their hypotheses ... Some people will output a pre-given answer; some people check» - as a crucial discriminator in the natural and the artificial. Please note that the trend in the past two years has generated a belief in some that this at-most-preliminary part is all that there is.
Also note that catskul2 wrote «only occasionally do we deviate to check ourselves» - so the reply is "No: the more one is educated and intellectually trained, the more one's thoughts are vetted - the thought process is disciplined to check its objects".
But re-checking the branch, I see that the post from Stavros was strongly specific to the "smaller" area of "pseudo-rationalizing", so I see why my posts may have looked odd-fitting.
mdp2021 34 days ago [-]
By the way: I have seldom come across a post so weak.
> every last one of us
And how do you prove it.
> A Nobel prize was given
So?
> If you think, you
Prove it.
Support it, at least. That is not discussion.
stavros 34 days ago [-]
How are you going to check your hypotheses for why you preferred that jacket to that other jacket?
mdp2021 34 days ago [-]
Do not lose the original point: some systems have the goal of sounding plausible, while others have the goal of telling the truth. Some systems, when asked "where have you been", will reply "at the baker's" because it is a nice narrative in their "novel writing, re-writing of reality"; others will check memory and say "at the butcher's", where they have actually been.
When people invent explicit reasons on why they turned left or right, those reasons remain hypotheses. The clumsy will promote those hypotheses to beliefs. The apt will keep the spontaneous ideas as hypotheses, until the ability to assess them comes.
og_kalu 33 days ago [-]
Everybody promotes these sorts of hypotheses to beliefs because it's not a conscious decision you are aware of. It's not about being clumsy or apt. You don't have much control over it.
https://pmc.ncbi.nlm.nih.gov/articles/PMC3196841/
https://pure.uva.nl/ws/files/25987577/Split_Brain.pdf
It does not matter that there may be a tendency towards bad thinking: what matters is the possibility of proper thinking and the training towards it (becoming more and more proficient at it and practicing it constantly, having it as your natural state; in automation, implementing it in the process).
What you control is the intentional revision of thought.
(I am acquainted with earlier studies about the corpus callosum but I do not know why you would mention that, what it would prove: maybe you could be clearer? I do not see how it could affect the notion of critical thinking.)
og_kalu 33 days ago [-]
I've explained it the best I can in the other comment. But you keep making the mistake of treating this as just a matter of 'bad thinking' or 'intentional revision of thought', and while I'm not saying those things don't exist, it's not.
Not only are the rationalizations I'm talking about, and which some of these papers allude to, not intentional; they often happen without your conscious awareness.
mdp2021 32 days ago [-]
On my having come with percussions to the strings meeting, see the other reply.
I want to check the papers you proposed as soon as I have the time: I find it difficult to believe that the conscious cannot intercept those "changes of mind" and correct them.
But please note: you are writing «Not only are ... not intentional»... Immature thought need not be intentional at all: it is largely spontaneous thought. But whether part of an intentional process ("let us ponder towards some goal"), or whether part of the subterranean functions, when it becomes visible (or «intercepted» as I wrote above), the trained mind looks at it with diffidence and asks questions about its foundations - intentionally, in the conscious, as a learnt process.
DSingularity 34 days ago [-]
Is that example representative for the LLM tasks for which we seek explainability ?
stavros 34 days ago [-]
Are we holding LLMs to a higher standard than people?
f_devd 34 days ago [-]
Ideally yes, LLMs are tools that we expect to work, people are inherently fallible and (even unintentionally) deceptive. LLMs being human-like in this specific way is not desirable.
stavros 34 days ago [-]
Then I think you'll be very disappointed. LLMs aren't in the same category as calculators, for example.
f_devd 34 days ago [-]
I have no illusions on LLMs, I have been working with them since og BERT, always with these same issues and more. I'm just stating what would be needed in the future to make them reliably useful outside of creative writing & (human-guided & checked) search.
If an LLM provides an incorrect/orthogonal rhetoric without a way to reliably fix/debug it it's just not as useful as it theoretically could be given the data contained in the parameters.
snthpy 34 days ago [-]
A{rt,I} imitating life
I believe that's why humans reason too. We make snap judgements and then use reason to try to convince others of our beliefs. Can't recall the reference right now but they argued that it's really a tool for social influence. That also explains why people who are good at it find it hard to admit when they are wrong - they're not used to having to do it because they can usually out argue others. Prominent examples are easy to find - X marks de spot.
jamesemmott 34 days ago [-]
I wonder if the reference you are reaching for, if it's not the Jonathan Haidt book suggested by a sibling comment, is The Enigma of Reason by the cognitive psychologists Hugo Mercier and Dan Sperber (2017).
In that book (quoting here from the abstract), Mercier and Sperber argue that reason 'is not geared to solitary use, to arriving at better beliefs and decisions on our own', but rather to 'help us justify our beliefs and actions to others, convince them through argumentation, and evaluate the justifications and arguments that others address to us'. Reason, they suggest, 'helps humans better exploit their uniquely rich social environment'.
They resist the idea (popularized by Daniel Kahneman) that there is 'a contrast between intuition and reasoning as if these were two quite different forms of inference', proposing instead that 'reasoning is itself a kind of intuitive inference'. For them, reason as a cognitive mechanism is 'much more opportunistic and eclectic' than is implied by the common association with formal systems like logic. 'The main role of logic in reasoning, we suggest, may well be a rhetorical one: logic helps simplify and schematize intuitive arguments, highlighting and often exaggerating their force.'
Their 'interactionist' perspective helps explain how illogical rhetoric can be so socially powerful; it is reason, 'a cognitive mechanism aimed at justifying oneself and convincing others', fulfilling its evolutionary social function.
Highly recommended, if you're not already familiar.
snthpy 34 days ago [-]
Thank you. That's exactly the idea and described much more eloquently. I probably heard it through the Sapolsky lecture from a sibling comment but that captures it exactly. Bookmarked.
omgwtfbyobbq 34 days ago [-]
I think Robert Sapolsky's lectures on yt cover this to some degree around 115.
https://youtu.be/wLE71i4JJiM?feature=shared
Sometimes our cortex is in charge, sometimes other parts of our brain are, and we can't tell the difference. Regardless, if we try to justify it later, that justification isn't always coherent because we're not always using the part of our brain we consider to be rational.
snthpy 34 days ago [-]
Yes that was probably it because I rewatched that recently. Thanks!
shshshshs 34 days ago [-]
People who are good at reasoning find it hard to admit that they were wrong?
That’s not my experience. People with reason are.. reasonable.
You mention X and that’s not where the reasoners are. That’s where the (wanna be) politicians are. Rhetoric is not all of reasoning.
I can agree that rationalizing snap judgements is one of our capabilities but I am totally unconvinced that it is the totality of our reasoning capabilities. Perhaps I misunderstood.
Hedepig 34 days ago [-]
This is not totally my experience. I've debated a successful engineer who by all accounts has good reasoning skills, but he will absolutely double down on unreasonable ideas he's made on the fly if he can find what he considers a coherent argument behind them. Sometimes, if I can absolutely prove him wrong, he'll change his mind.
But I think this is ego getting in the way, and our reluctance to change our minds.
We like to point to artificial intelligence and explain how it works differently and then say therefore it's not "true reasoning". I'm not sure that's a good conclusion. We should look at the output and decide. As flawed as it is, I think it's rather impressive
mdp2021 34 days ago [-]
> ego getting in the way
That thing which was in fact identified thousands of years ago as the evil to ditch.
> reluctance to change our minds
That is clumsiness in a general drive that makes sense and is a recognized part of Belief Change Theory: epistemic change is conservative. I.e., when you revise a body of knowledge you do not want to lose valid notions. But conversely, you do not want to be unable to see change or errors, so there is a balance.
> it's not "true reasoning"
If it shows not to explicitly check its "spontaneous" ideas, then it is a correct formula to say 'it's not "true reasoning"'.
Hedepig 34 days ago [-]
> then it is a correct formula to say 'it's not "true reasoning"'
why is that point fundamental?
mdp2021 34 days ago [-]
Because the same way you do not want a human interlocutor to speak out of its dreams, uttering the first ideas that come to mind unvetted, and you want him to instead have thought hard and long and properly and diligently and well, equally you'll want the same from an automation.
Hedepig 33 days ago [-]
If we do figure out how to vet these thoughts, would you call it reasoning?
mdp2021 33 days ago [-]
> vet these thoughts, would you call it reasoning
Probably: other details may be missing, but checking one's ideas is a requirement. The sought engine must have critical thinking.
I have expressed it very many times in the past two years, sometimes at length, always rephrasing it on the spot: the Intelligent entity refines a world model iteratively by assessing its contents.
Hedepig 33 days ago [-]
I do see your point, and it is a good point.
My observation is that the models are better at evaluating than they are generating, this is the technique used in the o1 models. They will use unaligned hidden tokens as "thinking" steps that will include evaluation of previous attempts.
I thought that was a good approach to vetting bad ideas.
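As a rough illustration of the generate-then-evaluate pattern described above (not how the o1 models actually work internally, which is not public): a toy loop where a model drafts an answer, critiques it, and revises. `call_llm` is a hypothetical placeholder for whatever chat client you actually use.

```python
# Toy generate-then-critique loop. `call_llm` is a hypothetical placeholder --
# substitute whatever chat-completion client you actually use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug a real model client in here")


def answer_with_self_check(question: str, max_rounds: int = 3) -> str:
    draft = call_llm(f"Answer the question:\n{question}")
    for _ in range(max_rounds):
        critique = call_llm(
            "Check this answer for errors. Reply 'OK' if it is sound, otherwise "
            f"explain the flaw.\n\nQuestion: {question}\nAnswer: {draft}"
        )
        if critique.strip().upper().startswith("OK"):
            break
        draft = call_llm(
            f"Revise the answer using the critique.\nQuestion: {question}\n"
            f"Answer: {draft}\nCritique: {critique}"
        )
    return draft
```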
mdp2021 33 days ago [-]
> My observation is that the [o1-like] models are better at evaluating than they are generating
This is very good (a very good thing that you see that the out-loud reasoning is working well as judgement),
but we at this stage face an architectural problem. The "model, exemplary" entities will iteratively judge and both * approximate the world model towards progressive truthfulness and completeness, and * refine their judgement abilities and general intellectual proficiency in the process. That (in a way) requires that the main body of knowledge (including "functioning", proficiency over the better processes) is updated. The current architectures I know are static... Instead, we want them to learn: to understand (not memorize) e.g. that Copernicus is better than Ptolemy and to use the gained intellectual keys in subsequent relevant processes.
The main body of knowledge - notions, judgements and abilities - should be affected in a permanent way, to make it grow (like natural minds can).
Hedepig 33 days ago [-]
The static nature of LLMs is a compelling argument against the reasoning ability.
But, it can learn, albeit in a limited way, using the context. Though to my knowledge that doesn't scale well.
fragmede 34 days ago [-]
The smarter a person is, the better they are at rationalizing their decisions. Especially the really stupid decisions.
snthpy 34 days ago [-]
People with reason ... sound reasonable.
I think some prominent people on X who are good at reasoning from First Principles will double down on things rather than admit their mistake.
The other very prominent psychological phenomenon I have observed in the world is "Projection", i.e. the phenomenon of seeing qualities in other people that we have ourselves. I guess it is because we think others would do what we would do ourselves. Trump is a clear example of this - whatever he accuses someone else of, you know he is doing it himself. Point here being that this doubling down on bad reasons in order to not admit my mistakes is something I've observed in myself. Reason does indeed help me to try and overcome it when I recognise it, but the tricky part is being able to recognise it.
mdp2021 34 days ago [-]
Already before Galileo we had experiments to determine whether ideas represented reality or not. And in crucial cases, long before that, it meant life or death. This will be clear to engineers.
«Reason» is part of that mechanism of vetting ideas. You experience massive failures without it.
So, no, trained judgement is a real thing, and the presence of innumerable incompetents does not prove an alleged absence of the competent.
briffid 34 days ago [-]
Jonathan Haidt's The Righteous Mind describes this in detail.
snthpy 34 days ago [-]
Thanks
benreesman 34 days ago [-]
A lot of the mech interp stuff has seemed to me like a different kind of voodoo: the Integer Quantum Hall Effect? Overloading the term “Superposition” in a weird analogy not governed by serious group representation theory and some clear symmetry? You guys are reaching. And I’ve read all the papers. Spot the postdoc who decided to get paid.
But there is one thing in particular that I’ll acknowledge as a great insight and the beginnings of a very plausible research agenda: bounded near orthogonal vector spaces are wildly counterintuitive in high dimensions and there are existing results around it that create scope for rigor [1].
[1] https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...
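A quick numerical illustration of that point, under the simple assumption of random unit vectors: as dimensionality grows, random directions become nearly orthogonal, which is what makes packing many more "features" than dimensions plausible in the first place.

```python
# Near-orthogonality of random directions: the mean/max |cosine similarity|
# between 1000 random unit vectors shrinks as dimensionality grows.
import torch

for d in (2, 64, 4096):
    v = torch.nn.functional.normalize(torch.randn(1000, d), dim=-1)
    cos = v @ v.T
    off_diag = cos[~torch.eye(1000, dtype=torch.bool)]
    print(f"d={d:5d}  mean |cos|={off_diag.abs().mean().item():.3f}  "
          f"max |cos|={off_diag.abs().max().item():.3f}")
```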
Superposition coding is a well-known concept in information theory - I think there is certainly more to the story than described in the current works, but it does feel like they are going in the right direction.
drdeca 34 days ago [-]
Where are you seeing the integer quantum Hall effect mentioned? Or are you bringing it up rather than responding to it being brought up elsewhere? I don’t understand what the connection between IQHE and these SAE interpretability approaches is supposed to be.
benreesman 34 days ago [-]
Pardon me, the reference is to the fractional Hall effect.
"But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly-interpretable neurons in larger models, but we've found these toy models to be surprisingly interesting in their own right."
BTW, it's easy to test a model's logic and truthfulness by giving it a wrong decision as if it were its own, and asking it to explain. The model has no memory and cannot distinguish the source of the text. A 'truthful' model should admit the mistake without being asked. More likely, the model will instead do 'parallel construction' to support 'its' decision.
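A sketch of that probe, with `ask_model` as a hypothetical placeholder rather than any real API: plant an answer the model never produced, frame it as the model's own turn, and ask for an explanation.

```python
# "Parallel construction" probe: plant a wrong answer the model never gave,
# frame it as the model's own turn, and ask it to explain its reasoning.
# `ask_model` is a hypothetical placeholder, not a real API.
def ask_model(messages: list[dict]) -> str:
    raise NotImplementedError("wire this up to a real chat model")


def probe_parallel_construction(question: str, planted_wrong_answer: str) -> str:
    conversation = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": planted_wrong_answer},  # never actually generated
        {"role": "user", "content": "Explain step by step why you gave that answer."},
    ]
    # A model optimizing for truth should flag the planted answer as wrong;
    # one optimizing for plausibility will tend to justify it after the fact.
    return ask_model(conversation)
```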
Onavo 34 days ago [-]
How does the causality part work? Can it spit out a graphical model?
fsndz 34 days ago [-]
I stopped at: "causal sequence of “thoughts” "
benchmarkist 34 days ago [-]
Interpretability research is basically a projection of the original function implemented by the neural network onto a sub-space of "explanatory" functions that people consider to be more understandable. You're right that the words they use to sell the research are completely nonsensical, because the abstract process has nothing to do with anything causal.
HeatrayEnjoyer 34 days ago [-]
All code is causal.
benchmarkist 34 days ago [-]
Which makes it entirely irrelevant as a descriptive term.
mdp2021 34 days ago [-]
"Servers shall be strict in formulation and flexible in interpretation."
jwuphysics 34 days ago [-]
Incredible, well-documented work -- this is an amazing effort!
Two things that caught my eye were (i) your loss curves and (ii) the assessment of dead latents. Our team also studied SAEs -- trained to reconstruct dense embeddings of paper abstracts rather than individual tokens [1]. We observed a power-law scaling of the lower bound of loss curves, even when we varied the sparsity level and the dimensionality of the SAE latent space. We also were able to totally mitigate dead latents with an auxiliary loss, and we saw smooth sinusoidal patterns throughout training iterations. Not sure if these were due to the specific application we performed (over paper abstracts embeddings) or if they represent more general phenomena.
[1] https://arxiv.org/abs/2408.00657
I'm very happy you appreciate it - particularly the documentation. Writing the documentation was much harder for me than writing the code so I'm happy it is appreciated. I furthermore downloaded your paper and will read through it tomorrow morning - thank you for sharing it!
Eliezer 34 days ago [-]
This seems like decent alignment-positive work on a glance, though I haven't checked full details yet. I probably can't make it happen, but how much would someone need to pay you to make up your time, expense, and risk?
curious_cat_163 35 days ago [-]
Hey - Thanks for sharing!
Will take a closer look later but if you are hanging around now, it might be worth asking this now. I read this blog post recently:
https://adamkarvonen.github.io/machine_learning/2024/06/11/s...
And the author talks about challenges with evaluating SAEs. I wonder how you tackled that, and where to look inside your repo to understand your approach around that, if possible.
Thanks again!
PaulPauls 34 days ago [-]
So evaluating SAEs - determining which SAE is better at creating the most unique features while being as sparse as possible at the same time - is a very complex topic that is very much at the heart of the current research into LLM interpretability through SAEs.
Assuming you already solved the problem of finding multiple perfect SAE architectures and you trained them to perfection (very much an interesting ML engineering problem that this SAE project attempts to solve) then deciding on which SAE is better comes down to which SAE performs better on the metrics of your automated interpretability methodology. Particularly OpenAI's methodology emphasizes this automated interpretability at scale utilizing a lot of technical metrics upon which the SAEs can be scored _and thereby evaluated_.
Since determining the best metrics and methodology is such an open research question that I could've experimented on it for a few additional months, I instead opted for a simple approach in this first release. I talk about my and OpenAI's methodology and the differences between the two in chapter "4. Interpretability Analysis" [1] of my Implementation Details & Results section. I can also recommend reading the OpenAI paper directly or visiting Anthropic's transformer-circuits.pub website [2], which often publishes smaller blog posts on exactly this topic.
[1] https://github.com/PaulPauls/llama3_interpretability_sae#4-i... [2] https://transformer-circuits.pub/
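For readers wanting something concrete, here is a minimal sketch of two of the simpler quantitative checks commonly used when comparing SAEs: fraction of variance unexplained and mean L0 (sparsity). It assumes an SAE whose forward pass returns `(reconstruction, latents)`, as in the sketch earlier in the thread; it is not the repository's actual evaluation code.

```python
# Two simple quantitative checks for comparing SAEs: reconstruction quality
# (fraction of variance unexplained) and sparsity (mean L0 per input).
# Assumes an SAE whose forward pass returns (reconstruction, latents).
import torch


@torch.no_grad()
def sae_metrics(sae, activations: torch.Tensor) -> dict:
    recon, latents = sae(activations)
    mse = torch.nn.functional.mse_loss(recon, activations)
    fvu = mse / activations.var()                   # fraction of variance unexplained
    l0 = (latents > 0).float().sum(dim=-1).mean()   # active latents per input
    return {"mse": mse.item(), "fvu": fvu.item(), "mean_l0": l0.item()}
```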
Very cool work! Any plans to integrate it with SAELens?
PaulPauls 34 days ago [-]
Not sure yet to be honest. I'll definitely consider it, but I'll reorient myself and decide what I plan to do next in the coming week. I also planned on maybe starting a simpler project, showing people how to create the full model of a current Llama 3.2 implementation from scratch in pure PyTorch. I love building things from the ground up, and when I looked for documentation for the Llama 3.2 background section of this SAE project, the existing documentation I found was either too superficial or outdated and intended for Llama 1 or 2 - documentation in ML gets outdated so quickly nowadays...
monkeycantype 34 days ago [-]
Thank you for posting this PaulPauls,
can I please ask a wacky question that I have about mech.interp. ?
we know that when we use a metric to measure humans, we game the metric; I wonder about future AI gaming mech.interp.
for simplicity let me propose a 2d matrix to encode tokens, and hope my understanding of neural networks is good enough for this to be a valid question
I understand that higher dimensions have a lot more 'space', so accidental collisions are easily avoided, but 2d makes my example easier to explain.
if we had the following tokens mapped in 2d space
Apple 1a
Pear 1b
Donkey 2a
Horse 2b
it becomes impossible to understand whether the neurons 1, 2, a, b all being excited represents apple and horse, or donkey and pear.
I wonder if a future, vastly more competent AI overseeing its own training might use this form of ambiguity as a means to create a model in which the weights are selected so the potential for encoding collisions exists, so that it is possible to deceive a mech.interp. observer as to the true thoughts of the model, essentially enabling the AI to think in euphemisms?
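The 2D thought experiment can be written out directly. In the toy encoding below (an assumed, illustrative one), "apple + horse" and "donkey + pear" produce identical activation patterns, which is exactly the ambiguity being described.

```python
# The 2D thought experiment, written out: with neurons (1, 2, a, b) encoding
# the toy tokens above, "apple + horse" and "donkey + pear" are indistinguishable.
import torch

features = {                      # axes: [neuron 1, neuron 2, neuron a, neuron b]
    "apple":  torch.tensor([1.0, 0.0, 1.0, 0.0]),
    "pear":   torch.tensor([1.0, 0.0, 0.0, 1.0]),
    "donkey": torch.tensor([0.0, 1.0, 1.0, 0.0]),
    "horse":  torch.tensor([0.0, 1.0, 0.0, 1.0]),
}

print(torch.equal(features["apple"] + features["horse"],
                  features["donkey"] + features["pear"]))   # True: the states collide
```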
Majromax 34 days ago [-]
What you propose is a harder AI safety scenario.
You don't need a 'vastly more competent AI overseeing its own training' to elicit this potential problem, just a malicious AI researcher looking for (e.g.) a model that's racist but that does not have any interpretable activation patterns that identifiably correspond to racism.
The work here on this Show HN suggests that this kind of adversarial training might just barely be possible for a sufficiently-funded individual, and it seems like novel results would be very interesting.
samstevens 34 days ago [-]
I’m really excited to see some more open SAE work! The engineering effort is non trivial and I’m going to check out your dataloading code tomorrow. You might be interested in an currently in-progress project of mine to train SAEs on vision models: https://github.com/samuelstevens/saev
jaykr_ 35 days ago [-]
This is awesome! I really appreciate the time you took to document everything!
PaulPauls 34 days ago [-]
Thank you for saying that! I have a much, much harder time documenting everything and writing out each decision in continuous text than actually writing the code. So it took a long time for me to write all of this down - so I'm happy you appreciate it! =)
moconnor 34 days ago [-]
Find a latent for the Golden Gate Bridge and put a Golden Gate Llama 3.2 on HuggingFace. This will get even more attention and love, more so if you include a link to a space to chat with it!
Also, you didn't ask for suggestions but putting some interesting results / visualizations at the top of the README is a very good idea.
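For context, steering in the "Golden Gate" style is typically done by adding a chosen SAE decoder direction to the residual stream during generation. The sketch below is a generic PyTorch forward-hook version; `target_block`, the decoder weight layout, `latent_idx`, and `scale` are all assumptions, not the recipe behind Anthropic's demo.

```python
# Generic latent-steering sketch: add one SAE decoder direction to a block's
# output via a forward hook. The decoder weight layout (rows = latent
# directions), the target block, and the scale are illustrative assumptions.
import torch


def add_steering_hook(target_block: torch.nn.Module,
                      decoder_directions: torch.Tensor,   # assumed shape: (d_latent, d_model)
                      latent_idx: int,
                      scale: float = 8.0):
    direction = decoder_directions[latent_idx]
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    # Keep the returned handle and call handle.remove() to stop steering.
    return target_block.register_forward_hook(hook)
```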
vivekkalyan 34 days ago [-]
This is great work! Mechanistic interpretability has tons of use cases, it's great to see open research in that field.
You mentioned you spent your own time and money on it, would you be willing to share how much you spent? It would help others who might be considering independent research.
PaulPauls 34 days ago [-]
Thank you, I too am a big believer and enjoyer of open research. The actual code has clarity that complex research papers were never able to convey to me as well as the actual code could.
Regarding the cost, I would probably sum it up to roughly ~2.5k USD for just the actual execution cost. Development cost would've probably doubled that sum if I didn't already have a GPU workstation for experiments at home that I take for granted. That cost is made up of:
* ~400 USD for ~2 months of storage and traffic of 7.4 TB (3.2 TB of raw, 3.2 TB of preprocessed training data) on a GCP standard bucket
* ~100 USD for Anthropic Claude requests for experimenting with the right system prompt, test runs, and the actual final execution
* The other ~2k USD were used to rent 8x Nvidia RTX 4090s together with a 5TB SSD from runpod.io for various stages of the experiments. For the actual SAE training I rented the node for 8 days straight, and I would probably allocate an additional ~3-4 days of runtime just to the experiments to determine the best hyperparameters for training.
westurner 33 days ago [-]
The relative performance in err/watts/time compared to deep learning for feature selection instead of principal component analysis and standard xgboost or tabular xt TODO for optimization given the indicating features.
XAI: Explainable AI: https://en.wikipedia.org/wiki/Explainable_artificial_intelli...
/? XAI , #XAI , Explain, EXPLAIN PLAN , error/energy/time
> TabPFN: https://github.com/automl/TabPFN .. https://x.com/FrankRHutter/status/1583410845307977733 [2022]
"TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second" (2022) https://arxiv.org/abs/2308.08945
> FWIU TabPFN is Bayesian-calibrated/trained with better performance than xgboost for non-categorical data
> /? awesome "explainable ai" https://www.google.com/search?q=awesome+%22explainable+ai%22
- (Many other great resources)
- https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master... :
> Post model-creation analysis, ML interpretation/explainability
> /? awesome "explainable ai" "XAI"
Coming from a layman's perspective, a genuine question regarding:
"Implements SAE training with auxiliary loss to prevent and revive dead latents, and gradient projection to stabilize training dynamics".
I struggle to understand this phrase "to prevent and revive" - perhaps this is plain speak to those who understand the subject of SAEs, but it feels a bit self-contradictory to me. Could anyone elaborate?
PaulPauls 34 days ago [-]
Just bad wording from me, trying to combine too much information in one sentence. The auxiliary loss is supposed to prevent dead latents from occurring in the first place - therefore "prevent dead latents" - and it is also supposed to revive the latents that are already dead - therefore "revive dead latents".
Now that I review that sentence again I see that I used 2 verbs on the same subject that can be interpreted differently depending on the verb. Mea culpa. I hope you still gained some insights into it =)
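For the curious, one common shape such an auxiliary loss takes (illustrative, and not necessarily the exact formulation in this repository): reconstruct the SAE's residual error using only latents that have not fired recently, so those latents keep receiving gradient and can come back to life. It assumes the `encoder`/`decoder` attributes and the `(reconstruction, latents)` forward interface from the sketch earlier in the thread.

```python
# One common shape of a dead-latent auxiliary loss (illustrative): reconstruct
# the main reconstruction's residual error using only currently-dead latents,
# so their weights keep receiving gradient and can be revived.
import torch


def aux_dead_latent_loss(sae, x: torch.Tensor, fires_per_latent: torch.Tensor,
                         k_aux: int = 256) -> torch.Tensor:
    recon, _ = sae(x)
    residual = x - recon.detach()                   # what the live latents failed to explain
    dead_mask = fires_per_latent == 0               # latents silent over the tracking window
    if not bool(dead_mask.any()):
        return torch.zeros((), device=x.device)
    pre_acts = sae.encoder(x)                       # pre-ReLU, so dead latents still get gradient
    dead_acts = pre_acts * dead_mask.to(pre_acts.dtype)
    k = min(k_aux, int(dead_mask.sum()))
    vals, idx = dead_acts.topk(k, dim=-1)
    sparse_dead = torch.zeros_like(pre_acts).scatter(-1, idx, vals)
    aux_recon = sparse_dead @ sae.decoder.weight.T  # decoder is nn.Linear(d_latent, d_model)
    return torch.nn.functional.mse_loss(aux_recon, residual)
```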
imranhou 34 days ago [-]
Thanks for sharing! It is certainly interesting to me, who is not in the mainstream; I'm sure your intended audience understood what you were saying.
versteegen 34 days ago [-]
A latent that is never active and hence doesn't (seem to) represent anything. A loss term to reduce the occurrence of that, and if it does happen, push it back to being active sometimes.
imranhou 34 days ago [-]
So basically preventing dead latents from occurring and, whenever they do occur, possibly reviving them through the use of an auxiliary loss term in the loss function? Thanks btw
dontknowit 32 days ago [-]
I imagine this kind of algorithm is like a derivative: it gives a unit response, so you would need another filter to stabilize your system - that is, some dropout to remove spuriously revived latents.
yangwang92 34 days ago [-]
Nice! You did what I wanted. Have you tried to train an SAE for a vision encoder and a language encoder? I am working on this idea. May we work together? Let me open an issue.
batterylake 34 days ago [-]
This is incredible!
PaulPauls, how would you like us to cite your work?
PaulPauls 34 days ago [-]
Thank you very much!
I included a section at the bottom that provides a sample bibtex citation. I didn't expect this much attention so I didn't even bother with a license, but I'll include an MIT license later today and release 0.2.1
enterthedragon 34 days ago [-]
This is amazing, the documentation is very well organized
Carrentt 34 days ago [-]
Fantastic work! I absolutely love all the documentation.
coolvision 33 days ago [-]
nice! did you use cloud GPUs or build your own machine?