I am curious why the author chose a genetic algorithm rather than standard backprop to distill the evil. Logistic regression seems like a pretty reasonable choice and it’ll be a lot faster than a genetic algorithm. Add an L1 penalty for sparsity.
In the past I’ve tried distilling not just the eval function but the outcome of the search (something like depth 20) in a neural net. It kind of works, but not very well until the net is pretty big. Deepmind did something similar.
obrhubr 15 days ago [-]
I didn’t think of that, but I guess the conclusion as to performance of the model would have remained the same.
janalsncm 15 days ago [-]
Reading over your article again, it seems like you made the same “mistake” as I did. In other words, the evals you’re seeing in the Lichess PGN aren’t the raw outputs of the Stockfish evaluation function, they’re the output of the search, which is a highly nonlinear function of millions of evals. If I recall from their docs, it’s something like depth 18, so your chances of distilling it with 20k parameters is essentially zero.
(I put mistake in quotes here because Deepmind showed that with a server farm of TPUs it is possible to distill the search as well. So it’s not impossible per se.)
But that’s ok! You’ve created a super fast evaluation function. Instead of trying to distill the depth 18 output, it will be much more realistic to distill the depth zero output, the NNUE. If you rerun those FENs in Stockfish you can pretty quickly create an easier dataset.
vunderba 15 days ago [-]
As part of my undergrad work, I used similar principles to the article (steady state genetic algorithms) to create a bot capable of playing Reversi where the fitness function was loosely defined as a set of "weights" on each square across the 8x8 board. These were used as part of the evaluation function in the classic minimax algorithm.
We trained over the course of 5-6 days, and the end generation was capable of beating an intermediate player but got completely destroyed by experts. It was a fun project!
renox 15 days ago [-]
I remember coding a Reversi as a school project: players liked my project because it was so weak that they could beat it!
senthilnayagam 15 days ago [-]
I built a list of simple games in pygame with code and solver generated by AI.
Have implemented 2048, Minesweeper, Sudoku and chess. First three are single player games have made good progress.
I still don't understand UCI interface and have not thought through chess solving. Hope will take another try this weekend
I especially enjoyed the link to The Bitter Lesson by Rich Sutton, which I hadn't read before. Now I wonder what "discoveries" have been built into today's AI models and how they might come to be detrimental.
Probably LLM maximalist ideas that suggest infinite “scaling laws” for LLMs (they are not laws), leading to ridiculous conclusions like building a $1 trillion cluster is the fastest way to AGI. People like Leopold Aschenbrenner are in this camp.
Imagine if LLMs were the only way we had to play chess. You’d need a centralized server and peak performance wouldn't even best a grandmaster. You spend $1 trillion building a super cluster because that’s all you know.
< - - This is where AI is today.
And then some startup creates Stockfish, a chess engine better than any LLM or grandmaster and can run on a smartphone.
PaulHoule 15 days ago [-]
Has there ever been a serious effort to play chess by "rule-based" methods as opposed to search?
(... other than the evaluation function being based on handwritten or learned rules)
janalsncm 15 days ago [-]
In the past, yes. That was essentially the approach up until the 1980s. Computers were too slow to run millions of evaluations per move, so you got a lot of efforts in a more heuristic-driven direction.
Examples of heuristic engines are Hans Berliner’s CAPS-II and PARADISE by David Wilkins. PARADISE used a database of 200 rules.
There are also engines that try to retroactively apply rules to Stockfish evals but they’re fairly confusing in my experience. It’s like being told the answer and having to come up with a reason after the fact.
zelphirkalt 15 days ago [-]
All this I guess comes after laying the ground work, like implementing a bitboard representation or something else, and implementing the logic for being able to tell, whether a move is valid or invalid. That in itself is already a lot of work. iiuc the idea here is "merely" writing the engine, and take everything else as a given.
obrhubr 15 days ago [-]
The game’s implementation itself was furnished with the competition by Sebastian Lague.
I completely agree that writing the move logic, validation, etc… is a difficult undertaking especially when it comes to optimisation which is what allows the bots built on top to perform well.
zelphirkalt 15 days ago [-]
That makes sense for such a competition. Thanks for making this even clearer!
Of course another interesting competition could be to develop _all_ of a chess game and engine and see how low in code people can go. But that would be an entirely different competition.
james_marks 15 days ago [-]
Is there not an open standard for this? I glanced and didn’t see anything, but seems crazy every chess project would need to define the game.
qsort 15 days ago [-]
There are open components for generic purposes, e.g. draw the board given a PGN, show the moves, check if moves are legal and so on, but they aren't suitable for engine internals. There are several tricks with bits to store positions, hashing schemes for transposition tables and the likes that are key to unlock the maximum performance possible, and they are usually tied to the engine implementation.
james_marks 13 days ago [-]
Interesting, that makes sense.
Given that, I think what I’m imagining is a PGN linter & renderer.
snickerbockers 15 days ago [-]
This is something that's been on my bucket list for a while now, although I'd do it using "normal" programming instead of DL.
i feel like if you can bruteforce every possible board configuration for the next 3 turns and then make the move that leads to more desirable outcomes, that ought to be enough to thrash most amateurs. Probably not good enough to beat somebody who actually understands chess tactics on a deeper level, but I expect most non-masters are just "winging it" and making things up as they go along, so a machine that can think farther ahead than you would win more often than not.
But like I said, this is all just me fantasizing about a project I haven't even started yet so my hypothesis could easily be wrong.
vunderba 15 days ago [-]
> bruteforce every possible board configuration for the next 3 turns and then make the move that leads to more desirable outcomes
I'd recommend familiarizing yourself with the minimax algorithm - since that's exactly what it does where "desirable outcome" is essentially the tunable evaluation aspect of the equation.
I did exactly this when the Laser chess variant was posted on Hacker News about last year. It does indeed work pretty well. Players eventually learn patterns that look good in 3 moves but not with a greater depth, though.
It's how real chess engines like Stockfish work, but they are highly optimized so they can search extreme depths.
janalsncm 15 days ago [-]
> make the move that leads to more desirable outcomes
This is kind of begging the question, right? How do you know what a “desirable” outcome is, other than mate-in-N positions? (In your case, N<=3.) The point of an evaluation function is to take a position and return a value which describes how desirable it is.
snickerbockers 15 days ago [-]
It's basically just a depth-first tree traversal with some heuristics to weigh different turns against each other (where a turn is defined as a possible board configuration that can be reached from the current configuration).
First you'd have a heuristic that ranks outcomes from turn 3, which would just mean weighing the pieces the computer lost against the pieces the opponent lost. Then you rank each turn-2 node based on how desirable the turn-3 leaf-nodes it leads to are and how likely the opponent is to actually make the moves that lead to it. Then you'd do the same for turn-1 and then pick the move that scored highest.
Quekid5 14 days ago [-]
Ah, yes... "just". I think you're seriously underestimating the difficulty here.
snickerbockers 12 days ago [-]
yeah, maybe. nothing's ever as easy in reality as on paper, but i've always considered my pet projects as long-term endeavors. i can keep working on the same thing for years at a time, and the reason i haven't actually done the chess thing despite often fantasizing about it is that i don't know if i can commit to it without adversely impacting my other pet projects.
binarymax 15 days ago [-]
Doesn’t look like failure to me - looks like a fun project that disproved your distillation hypothesis!
Great write up and thanks for sharing.
obrhubr 15 days ago [-]
Thanks for the kind words!
bob1029 15 days ago [-]
> To be exact every participant has a maximum of 1024 tokens at his disposal to craft the best chess bot they can.
I'd be quite tempted to try linear genetic programming with a variant of a Brainfuck-style UTM for this kind of constraint. Assuming 1024 tokens = 1024 bytes = 1024 instructions, I think there is some degree of performance that is feasible.
This is really interesting to me because hand-coding BF to do something like chess is absolutely beyond me. I wouldn't even know where to begin. A LGP-coded BF program would almost certainly outperform anything I could design with my own hands.
obrhubr 15 days ago [-]
This is highly specific to C# which was the language imposed on all participants.
But I agree that some languages might be especially adept to these kinds of tasks and it would be interesting to see which.
deadbabe 15 days ago [-]
Why build a chess engine these days, just use an LLM?
achierius 15 days ago [-]
LLMs are quite bad at chess actually, even compared to human players -- and certainly compared to proper modern chess engines
PaulHoule 15 days ago [-]
They usually struggle to always generate valid moves.
And that's pivotal.
If you have a program which always makes valid moves and gives up when it has lost you wrote a proper chess playing program. It may play badly, but it plays.
ElFitz 15 days ago [-]
That could now be achieved by precomputing all valid moves and using outlines[0] or Structured Outputs[1] to constrain the output.
… or just a valid move checker that prompts it to try again if it fails to make a valid move.
ElFitz 11 days ago [-]
Yes, but LLMs are expensive to run.
emptybits 15 days ago [-]
Why consume when you can create?
Building is often more fun and almost always more educational than just using. Even in “failure”. :-)
ziddoap 15 days ago [-]
Fun, a learning experience, practicing a skill set, curiosity, etc.
Take your pick.
janalsncm 15 days ago [-]
Your question sounds flippant but it’s actually quite deep. Why aren’t LLMs good at chess? The answer is likely that to be good at chess at some level requires search (and probably some ordering constraint on your evaluation function too that I’m not smart enough to figure out).
LLMs aren’t searching. They are memorizing. This explains their extremely poor performance on out of domain positions, whereas Stockfish can easily understand them.
deadbabe 15 days ago [-]
I think chess is a great example to use to demonstrate that LLMs don’t actually know anything, and will be entirely unsuitable to replace humans on anything except thoroughly solved and documented problems.
vikhack 15 days ago [-]
I don't really know how to do that but I'm really putting in my effort
In the past I’ve tried distilling not just the eval function but the outcome of the search (something like depth 20) in a neural net. It kind of works, but not very well until the net is pretty big. Deepmind did something similar.
(I put mistake in quotes here because Deepmind showed that with a server farm of TPUs it is possible to distill the search as well. So it’s not impossible per se.)
But that’s ok! You’ve created a super fast evaluation function. Instead of trying to distill the depth 18 output, it will be much more realistic to distill the depth zero output, the NNUE. If you rerun those FENs in Stockfish you can pretty quickly create an easier dataset.
We trained over the course of 5-6 days, and the end generation was capable of beating an intermediate player but got completely destroyed by experts. It was a fun project!
Have implemented 2048, Minesweeper, Sudoku and chess. First three are single player games have made good progress.
I still don't understand UCI interface and have not thought through chess solving. Hope will take another try this weekend
https://github.com/muonium-ai/simplegames
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Imagine if LLMs were the only way we had to play chess. You’d need a centralized server and peak performance wouldn't even best a grandmaster. You spend $1 trillion building a super cluster because that’s all you know.
< - - This is where AI is today.
And then some startup creates Stockfish, a chess engine better than any LLM or grandmaster and can run on a smartphone.
(... other than the evaluation function being based on handwritten or learned rules)
Examples of heuristic engines are Hans Berliner’s CAPS-II and PARADISE by David Wilkins. PARADISE used a database of 200 rules.
There are also engines that try to retroactively apply rules to Stockfish evals but they’re fairly confusing in my experience. It’s like being told the answer and having to come up with a reason after the fact.
Of course another interesting competition could be to develop _all_ of a chess game and engine and see how low in code people can go. But that would be an entirely different competition.
Given that, I think what I’m imagining is a PGN linter & renderer.
i feel like if you can bruteforce every possible board configuration for the next 3 turns and then make the move that leads to more desirable outcomes, that ought to be enough to thrash most amateurs. Probably not good enough to beat somebody who actually understands chess tactics on a deeper level, but I expect most non-masters are just "winging it" and making things up as they go along, so a machine that can think farther ahead than you would win more often than not.
But like I said, this is all just me fantasizing about a project I haven't even started yet so my hypothesis could easily be wrong.
I'd recommend familiarizing yourself with the minimax algorithm - since that's exactly what it does where "desirable outcome" is essentially the tunable evaluation aspect of the equation.
https://en.wikipedia.org/wiki/Minimax
It's how real chess engines like Stockfish work, but they are highly optimized so they can search extreme depths.
This is kind of begging the question, right? How do you know what a “desirable” outcome is, other than mate-in-N positions? (In your case, N<=3.) The point of an evaluation function is to take a position and return a value which describes how desirable it is.
First you'd have a heuristic that ranks outcomes from turn 3, which would just mean weighing the pieces the computer lost against the pieces the opponent lost. Then you rank each turn-2 node based on how desirable the turn-3 leaf-nodes it leads to are and how likely the opponent is to actually make the moves that lead to it. Then you'd do the same for turn-1 and then pick the move that scored highest.
Great write up and thanks for sharing.
I'd be quite tempted to try linear genetic programming with a variant of a Brainfuck-style UTM for this kind of constraint. Assuming 1024 tokens = 1024 bytes = 1024 instructions, I think there is some degree of performance that is feasible.
This is really interesting to me because hand-coding BF to do something like chess is absolutely beyond me. I wouldn't even know where to begin. A LGP-coded BF program would almost certainly outperform anything I could design with my own hands.
And that's pivotal.
If you have a program which always makes valid moves and gives up when it has lost you wrote a proper chess playing program. It may play badly, but it plays.
[0]: https://github.com/dottxt-ai/outlines
[1]: https://openai.com/index/introducing-structured-outputs-in-t...
Building is often more fun and almost always more educational than just using. Even in “failure”. :-)
Take your pick.
LLMs aren’t searching. They are memorizing. This explains their extremely poor performance on out of domain positions, whereas Stockfish can easily understand them.