This is finetuned to the benchmarks and nowhere close to O1-Preview on any other task. Not worth looking into unless you specifically want to solve these problems - however, still impressive.
mluo 31 days ago [-]
We beat O1-preview and even many other 7B models over many math benchmarks, which were TEST sets (not in the training set at all).
If you want to make the model fully generalist, feel free to train it over coding datasets (such as RL with passing unit tests as reward).
zamadatix 31 days ago [-]
It's already a good accomplishment as it is, but I think it'd be very surprising to show training such a small model as a generalist scales to the same magnitude as specialized finetuning. At some point you have to fit more background data and relations in the same amount of information space... but it's hard to say how much that is the case for a given size vs what we just haven't optimized yet. Unfortunately I think that will have to wait for someone with more compute before we can verify this * a dozen one way or the other :).
Side question, since it sounds like you were involved: how big is the impact on benchmarks of taking this 1.5B model down from fp32 to fp8 or similar? The focus on parameters alone sometimes feels like comparing house sizes by their lengths alone. And, if you were indeed involved, thanks for making all of this open and available!
mluo 31 days ago [-]
For quantization, very big impact for small models, can drop as much as 10% on AIME. Our model does best on bfloat16 ;)
Come checkout our repo at: https://github.com/agentica-project/deepscaler
It is a great discovery; it could even open the next step in AI with MoM, "Mixture of Models", where small fine-tuned models each take on part of a task (instead of the current MoE)
This work scales up selection/routing over many models/LoRAs
rvnx 31 days ago [-]
Love it, will check, thank you for showing / sharing all of that!
numba888 31 days ago [-]
o1 is more than just a math solver. And you cannot possibly train that much into a small model.
However, smaller specialized models look to be the right way to handle the world's complexity. Sort of a mixture of experts one level above. Orchestrating them will be another problem. A possible solution is a generalist model "to rule them all".
mdp2021 31 days ago [-]
Have you considered the very practical importance of running specialized models for specialized tasks on common hardware (maybe a couple of CPU cores in a couple GB of RAM)?
numba888 31 days ago [-]
Small models are just tools. Even many of them will make only a toolset. They don't evolve into AGI by themselves. But putting them together in a structure (brain) may result in something close. Like a big smart calculator. It takes more to create a 'character' similar to, say, the Terminator.
janalsncm 31 days ago [-]
I disagree. They demonstrated a way to dramatically reduce training costs, 18x cheaper than R1. That alone is worth attention.
Also beating O1 on any benchmark is nontrivial.
nabakin 31 days ago [-]
I'm not so sure it's impressive even for mathematical tasks.
When ChatGPT came out, there was a flood of fine-tuned LLMs claiming ChatGPT-level performance for a fraction of the size. Every single time this happened, it was misleading.
These LLMs were able to score higher than ChatGPT because they took a narrow set of benchmarks and fine-tuned for those benchmarks. It's not difficult to fine-tune an LLM for a few benchmarks cheaply and beat a SOTA generalist LLM on those benchmarks. Comparing a generalist LLM to a specialist LLM is like comparing apples to oranges. What you want is to compare specialist LLMs to other specialist LLMs.
It would have been much more interesting and valuable if that was done here. Instead, we have a clickbait, misleading headline and no comparisons to math specialized LLMs which certainly should have been performed.
torginus 31 days ago [-]
But if that's the case - what do the benchmarks even mean then?
nabakin 31 days ago [-]
Automated benchmarks are still very useful. Just less so when the LLM is trained in a way to overfit to them, which is why we have to be careful with random people and the claims they make. Human evaluation is the gold standard, but even it has issues.
torginus 31 days ago [-]
The question is how do you train your LLMs to not 'cheat'?
Imagine you have an exam coming up, and the set of questions leaks - how do you prepare for the exam then?
Memorizing the test problems would be obviously problematic, but maybe practicing the problems that appear on the exam would be less so, or just giving extra attention to the topics that will come up would be even less like cheating.
The more honest approach you choose, the more indicative your training would be of exam results but everybody decides how much cheating they allow for themselves, which makes it a test of the honesty not the skill of the student.
nabakin 31 days ago [-]
I think the only way is to check your dataset for the benchmark leak and remove it before training, but (as you say) that's assuming an honest actor is training the LLM, going against the incentives of leaving the benchmark leak in the training data. Even then, a benchmark leak can make it through those checks.
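A minimal sketch of that kind of decontamination check in Python, assuming the training set and benchmark are lists of problem strings (the 13-gram threshold here is just an illustrative choice, not from any particular paper):

    def ngrams(text, n=13):
        # Collect all word-level n-grams of a problem statement.
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def decontaminate(train_set, benchmark, n=13):
        # Drop any training item that shares a long n-gram with any benchmark item.
        bench_grams = set()
        for item in benchmark:
            bench_grams |= ngrams(item, n)
        return [item for item in train_set if not (ngrams(item, n) & bench_grams)]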
I think it would be interesting to create a dynamic benchmark. For example, a benchmark which uses math and a random value determined at evaluation for the answer. The correct answer would be different for each run. Theoretically, training on it wouldn't help beat the benchmark because the random value would change the answer. Maybe this has already been done.
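A toy illustration of such a dynamic benchmark item in Python: the question template is fixed, but the numbers (and therefore the expected answer) are regenerated per evaluation run. Everything here is hypothetical:

    import random

    def make_item(seed):
        # Same question template every run, but fresh numbers and a fresh answer.
        rng = random.Random(seed)
        a, b, c = rng.randint(2, 99), rng.randint(2, 99), rng.randint(2, 9)
        question = f"Compute ({a} + {b}) * {c}."
        answer = (a + b) * c
        return question, answer

    question, answer = make_item(seed=20250211)  # use a new seed for each evaluation run
    print(question, "->", answer)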
avbanks 31 days ago [-]
A lot of people in the community are wary of benchmarks for this exact reason.
pona-a 31 days ago [-]
I tested it on basic long addition problems. It frequently misplaced the decimal signs, used unnecessary reasoning tokens (like restating previously done steps) and overall seemed only marginally more reliable than the base DeepSeek 1.5B.
On my own pet eval, writing a fast Fibonacci algorithm in Scheme, it actually performed much worse. It took a much longer tangent before arriving at the fast doubling algorithm, but then completely forgot how to even write S-expressions, proceeding to instead imagine Scheme uses a Python-like syntax while babbling about tail recursion.
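For reference, these are the fast doubling identities being alluded to, sketched in Python rather than Scheme: F(2k) = F(k)·(2·F(k+1) − F(k)) and F(2k+1) = F(k)² + F(k+1)².

    def fib_pair(n):
        # Returns (F(n), F(n+1)) via fast doubling, O(log n) arithmetic steps.
        if n == 0:
            return (0, 1)
        a, b = fib_pair(n // 2)      # a = F(k), b = F(k+1) with k = n // 2
        c = a * (2 * b - a)          # F(2k)
        d = a * a + b * b            # F(2k+1)
        return (c, d) if n % 2 == 0 else (d, c + d)

    print(fib_pair(10)[0])  # 55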
viraptor 31 days ago [-]
> On my own pet eval, writing a fast Fibonacci algorithm in Scheme,
This model was trained on math problem datasets only, it seems. It makes sense that it's not any better at programming.
pona-a 30 days ago [-]
The original model, aside from its programming mistakes, also misremembered the doubling formula. I hoped to see that solved, which it was, as well as maybe a more general performance boost from recovering some distillation loss.
ekidd 27 days ago [-]
This model can't code at all.
It does high school math homework, plus maybe some easy physics. And it does them surprisingly well. Outside of that, it fails every test prompt in my set.
It's a pure specialist model.
It's a great find
(Submitted title was "Replicating Deepseek-R1 for $4500: RL Boosts 1.5B Model Beyond o1-preview")
neom 31 days ago [-]
Out of sheer curiosity, how do vacations work for you? Do you take vacations, surely you must? I've been using HN daily for years now and I can't think of a time I've seen you take a vacation.
dang 31 days ago [-]
Not exactly. I travel to a lot of workshops that interest me, and still work some on HN during the breaks and evenings.
ukuina 31 days ago [-]
I realize there is a team behind HN, but your sheer force of will keeps HN a shining example of discourse. Thank you for everything you do!
thorum 31 days ago [-]
The key thing here is a simple, reliable formula to train a 1B model on a specific task and get strong performance. That didn’t really exist before. Edge devices are about to get a lot smarter.
manmal 31 days ago [-]
I hope Apple takes note and improves their on-device models accordingly.
Aperocky 31 days ago [-]
Still took $4500, so I probably won't be training this at home - but maybe that number will change too! Exciting times.
amazingamazing 31 days ago [-]
Smaller reinforced models are what will win. Look at our civilization, your company, heck even your team: lots of specialized folks, not some singular bloated mastermind.
The issue though is the overemphasis on the current benchmarks. Ideally the thing benchmarked is against user kpis.
ericd 31 days ago [-]
I’m not sure about that, I think the current setup is an artifact of human limits, which an immortal shared ML model doesn’t share. Also, cross-disciplinary knowledge is frequently incredibly helpful (if rare, because currently it’s the intersection of multiple uncommon specialties).
noman-land 31 days ago [-]
The question that's screaming at me right now is that if humans used specialization + cooperation to achieve global dominance, what happens when you pair unlimited global masterminds + cooperation?
UltraSane 31 days ago [-]
Best case scenario is you get The Culture where society is ruled by hyper-intelligent but kind and benevolent AIs that solve all economic problems and that keep a population of quadrillions of very happy, healthy, and rich humans as pets.
jazzyjackson 31 days ago [-]
Can't wait for the parliament of competing super intelligent political parties producing propaganda to convince cloud investors to reallocate compute.
6510 31 days ago [-]
As bandwidth is still crap my bet would be that, like with humans, the sum of the parts isn't greater than the greatest mastermind in the group or it will take an increasingly long time to come up with increasingly original thought. If it scales reasonably linear it would be worth waiting for a year or a decade to produce a single result.
Also a funny thought, with enough of a mind it should be able to figure out how to get very limited storage and barely usable computation at very low speed out of some extremely cheap substrate. Something like growing salt crystals in a lake (or ocean) Have a write pump flashing a light onto salt water and a read pump discarding undesirable results to be overwritten. You know, like, it doesn't matter if you encode randomly, if you do it often enough you will get the entire set. You don't have to read accurately either, as long as you get at least some results out.
wait, did I just explain the universe?
evilduck 30 days ago [-]
Whatever the human equivalent of beehives and ant colonies will be.
6510 31 days ago [-]
We turn into cats.
smnrg 31 days ago [-]
I like optimists.
nomel 31 days ago [-]
> lots of specialized folks
This isn't what I see in R&D. The broad and deep folks are the ones with all the patents, and design wins, because they're the only ones that can see outside the box that everyone else is specialized in.
jazzyjackson 31 days ago [-]
Nah, it takes a mastermind to be creative. The tiny models IME are good at following explicit instructions for data transformations (conforming to a JSON schema is especially helpful) but I wouldn't expect one to solve a problem I haven't already solved myself like how I use the .6T full size models.
EE84M3i 31 days ago [-]
I thought conforming to a JSON schema was a solved problem for even the most garbage models by restricting the tokens it can generate?
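A toy character-level illustration of that restriction, here for a single enum field rather than a full JSON schema (the scoring function stands in for the model's next-token preferences; everything here is hypothetical):

    ALLOWED = ["positive", "negative", "neutral"]

    def constrained_decode(score_char, allowed=ALLOWED):
        # At each step, only characters that keep the output a prefix of some
        # allowed value are eligible; among those, pick the model's favorite.
        out = ""
        while out not in allowed:
            candidates = {v[len(out)] for v in allowed if v.startswith(out) and len(v) > len(out)}
            out += max(candidates, key=score_char)
        return out

    # A stand-in "model" that happens to prefer the letter 'n':
    print(constrained_decode(lambda ch: 1.0 if ch == "n" else 0.0))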
jazzyjackson 31 days ago [-]
Yes that's what I mean, with a well described task and a JSON schema, the tiny models are suitable (though still hallucinatory - even llama3 7B was making up phone numbers when I tasked it with extracting contact details from email signature blocks)
I was pushing back against the parent's notion that tiny models would serve general purpose needs.
czk 31 days ago [-]
im very naive here but does anyone trust these benchmarks? do they mean anything to you? they seem far too easy to game and it doesn't feel like its an accurate way to really tell how these models compare to one another. seems like benchmark performance declines quite a bit if you introduce a problem that's similar to those in benchmarks but one that the model hasn't seen before
"Strawberry" has the letters S, T, R, A, W, B, E, R, F,
U, R, E. Let me count each 'R' as I go along.
Hopefully the versions that aren't quantized that tiny do better than that.
cheeko1234 31 days ago [-]
I have two. One is a simple one that only deepseek R1 has passed (in my opinion):
I have a 12 liter jug and a 6 liter jug. How do I get exactly 6 liters of water?
Answer (Deepseek): Fill the 6-liter jug completely to obtain exactly 6 liters of water.
Every other LLM I've tried include o3-mini-high: Fill the 12-liter jug completely. Pour it into the 6 liter jug.
Although o3 did get it right in the reasoning: It seems like the user has a 12-liter jug and a 6-liter jug. The simplest answer is to just fill the 6-liter jug directly with water—done! But maybe there's a catch, like needing to use both jugs somehow.
So it knows that the 12 liter jug is mentioned uselessly, but most LLMs HAVE to use the 12 liter jug since it's mentioned in the prompt.
Another prompt I recently found via perplexity (deepseek got it correct on the first try, but other LLMs can only get it right if I add: "Think step by step before answering.")
A 67-year-old woman with congenital bicuspid aortic valve is admitted to the hospital because of a 2-day history of fever and chills. Her current medication is lisinopril. Physical examination reveals a temperature of 38.5°C (101.3°F), blood pressure of 110/70 mmHg, and a heart rate of 100/min. Cardiac auscultation demonstrates a grade 3/6 systolic ejection murmur at the right upper sternal border. Blood cultures are obtained, and empiric antibiotic therapy is initiated. Which of the following is the most likely causative organism?
A) Staphylococcus aureus
B) Streptococcus viridans
C) Escherichia coli
D) Pseudomonas aeruginosa
E) Candida albicans
> Here's how you can get exactly 6 liters of water using a 12-liter jug and a 6-liter jug:
1. Fill the 6-liter jug completely.
2. Pour the water from the 6-liter jug into the 12-liter jug. You now have 6 liters in the 12-liter jug.
3. Fill the 6-liter jug completely again.
4. Pour water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 6 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.
5. Now empty the 12 liter jug.
6. Pour the water from the 6-liter jug into the 12-liter jug.
7. Fill the 6-liter jug completely again.
8. Pour the water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 0 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.
9. There are 6 liters left in the 6 liter jug.
sdwr 31 days ago [-]
It really loves the sound of its own voice!
wnissen 31 days ago [-]
And things get dramatically worse on 4o if you add a second superfluous jug:
To measure exactly 6 liters of water using a 16-liter jug, a 12-liter jug, and a 6-liter jug, follow these steps:
Step-by-Step Solution:
Fill the 12-liter jug completely (it now holds 12 liters).
Pour water from the 12-liter jug into the 16-liter jug. The 16-liter jug now contains 12 liters, and the 12-liter jug is empty.
Fill the 12-liter jug completely again (it now holds 12 liters).
Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug is full. Since the 16-liter jug already has 12 liters, it can only take 4 more liters. This leaves 8 liters in the 12-liter jug.
Empty the 16-liter jug completely.
Pour the remaining 8 liters from the 12-liter jug into the 16-liter jug.
Fill the 12-liter jug completely again (it now holds 12 liters).
Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug contains exactly 16 liters. Since the 16-liter jug already has 8 liters, it can only take 8 more liters, leaving 4 liters in the 12-liter jug.
Pour the 4 liters from the 12-liter jug into the empty 6-liter jug. The 6-liter jug now contains 4 liters.
Fill the 12-liter jug completely again (it now holds 12 liters).
Pour water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. Since the 6-liter jug already has 4 liters, it can only take 2 more liters. This leaves 10 liters in the 12-liter jug.
Empty the 6-liter jug completely.
Pour the remaining 10 liters from the 12-liter jug into the 6-liter jug.
Now, the 6-liter jug contains exactly 6 liters of water.
neodypsis 31 days ago [-]
Interestingly, o3-mini-high was correct when first thinking about it:
> Okay, we're asked how to get exactly 6 liters of water using an 12-liter and a 6-liter jug. The immediate thought is to just fill the 6-liter jug, but that seems too simple, doesn’t it? So maybe there’s a trick here. Perhaps this is a puzzle where the challenge is to measure 6 liters with some pouring involved. I’ll stick with the simple solution for now—fill the 6-liter jug and stop there.
jazzyjackson 31 days ago [-]
I have to take all these comparisons with a heap of salt because no one bothers to run the test 20 times on each model to smooth out the probabilistic nature of the LLM landing on the right answer. There must be some fallacy for this, that you would sample once from each and declare a definitive winner; I see it all the time.
DebtDeflation 31 days ago [-]
>Every other LLM I've tried include o3-mini-high: Fill the 12-liter jug completely. Pour it into the 6 liter jug.
Try it with a 12L jug and a 4L jug and ask for 4L. See if it tells you to just fill the 4L or to fill the 12L and pour into the 4L twice discarding both times and there will be 4L remaining in the 12L jug.
Even though it's still technically correct, it demonstrates that there's no real "reasoning" happening, just regurgitation of training data.
CamperBob2 31 days ago [-]
(Shrug) R1 has no problem with that. To the extent it's confused, it is only because it is wondering if it's a trick question.
CoT reasoning: https://i.imgur.com/rjNmTGZ.png
Answer: https://i.imgur.com/WfAVeZQ.png
There's 'mindless regurgitation' going on here, but not by the AI model.
31 days ago [-]
egorfine 31 days ago [-]
Deepseek just told me this:
To measure exactly 6 liters using a 12-liter and a 6-liter jug:
1. *Fill the 12L jug completely.*
2. *Pour water from the 12L jug into the 6L jug until it's full.*
After this process, the 12L jug will have exactly 6 liters left.
wnissen 30 days ago [-]
Tried it in Llama v3.3 70b and it got it every time out of 5 samples. I guess the others are interpreting it as a trick question or something.
dgacmu 31 days ago [-]
Is the answer strep viridans?
wincy 31 days ago [-]
I just asked the first question to O1 Pro, it thought for 48 seconds and messed up.
It told me to fill the 12 liter jug then pour that into the 6 liter jug.
I then scolded it and told it I was going to use Deepseek R1 because it did a better job, and we were shutting down OpenAI forever, and THEN it thought for 4 minutes and got a bit snippy with me haha, it got the answer but then glossed over it also said “but that’s too simple isn’t it?”.
Output:
Here’s a more “classic puzzle” style approach, spelled out step-by-step, to end up with exactly 6 liters in the 12-liter jug. This method uses multiple fills and transfers (instead of the trivial “fill the 12, pour off 6” approach) so you can see a more puzzle-like sequence:
1. Fill the 6-liter jug from the water supply.
• State: (12-liter jug = 0 L, 6-liter jug = 6 L)
2. Pour from the 6-liter jug into the 12-liter jug until the 6-liter jug is empty (the 12-liter jug still has plenty of room).
• State: (12-liter jug = 6 L, 6-liter jug = 0 L)
3. Fill the 6-liter jug again from the water supply.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)
4. Pour from the 6-liter jug into the 12-liter jug until the 12-liter jug is completely full.
• The 12-liter jug currently has 6 liters and can take 6 more liters before it’s full, so you’ll pour the entire 6 liters from the 6-liter jug.
• State: (12-liter jug = 12 L, 6-liter jug = 0 L)
5. Empty the 12-liter jug (discard the water or set it aside).
• State: (12-liter jug = 0 L, 6-liter jug = 0 L)
6. Transfer the remaining water in the 6-liter jug to the 12-liter jug—but right now the 6-liter jug is empty, so that doesn’t help. The trick is: we want to exploit the difference in capacities to measure out new amounts. So let’s continue.
Since both are empty, let’s do something else:
a. Fill the 12-liter jug from the supply.
• State: (12-liter jug = 12 L, 6-liter jug = 0 L)
b. Pour from the 12-liter jug into the 6-liter jug until the 6-liter jug is full, which removes 6 liters from the 12-liter jug.
• Now the 12-liter jug has liters left.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)
At this point, the 12-liter jug contains exactly 6 liters. That’s what we wanted.
If You Want the 6 Liters in the Smaller Jug Instead
If, for some reason, you need exactly 6 liters in the 6-liter jug (though that jug obviously holds 6 liters when full, so it’s usually trivial), here’s a more puzzle-like sequence:
1. Fill the 12-liter jug (12 L in the big jug, 0 L in the small jug).
2. Pour from the 12-liter jug into the 6-liter jug until the small jug is full.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)
3. Empty the 6-liter jug.
• State: (12-liter jug = 6 L, 6-liter jug = 0 L)
4. Transfer the 6 liters from the 12-liter jug into the 6-liter jug.
• State: (12-liter jug = 0 L, 6-liter jug = 6 L)
Again, you end up with exactly 6 liters in the 6-liter jug.
Both sequences go beyond the simple “fill the 12, pour 6 off” short-cut. Hopefully, that’s less “terrible” and shows a more classic multi-step approach often seen in water-jug riddles.
mluo 31 days ago [-]
Hi, one of the lead authors for this work.
We recommend using Bfloat16 (not fp16); quantization for small models can really hurt performance!
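A minimal sketch of loading the checkpoint in bfloat16 with transformers (the Hugging Face model id here is assumed from the GGUF repo names mentioned elsewhere in the thread):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "agentica-org/DeepScaleR-1.5B-Preview"  # assumed id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)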
CamperBob2 31 days ago [-]
Have you compared it to the 1.58 bit dynamic quant model based on the original R1 (i.e., not a distillation)? Whatever unsloth did, it doesn't seem to be giving up much reasoning performance over the full Q8 version.
mluo 31 days ago [-]
It's simply bc the model is small (1.5B), making it sensitive to weight perturbations
simonw 31 days ago [-]
Is there a GGUF version of your model anywhere that you recommend? I'm on a Mac.
mluo 31 days ago [-]
Think there are some people who made GGUFs as branches of our model, try it out!
https://huggingface.co/models?other=base_model:quantized:age...
I always ask every model to implement a Qt QSyntaxHighlighter subclass for syntax highlighting code and a QAbstractListModel subclass that parses markdown into blocks - in C++, both implemented using tree sitter. It sounds like a coding problem but it's much more a reasoning problem of how to combine the two APIs and is out of band of the training data. I test it with multiple levels of prompt fidelity that I have built up watching the many mistakes past models have made and o3-mini-high and o1 can usually get it done within a few iterations.
Simple questions like 1+1 can also be fun since R1 goes overboard (as do some other models when you include a system prompt asking it to think) https://sugaku.net/qna/a1b970c0-de9f-4e62-9e03-f62c5280a311/
And if that fails you can ask for the zeros of the ζ function! https://sugaku.net/qna/c64d6db9-5547-4213-acb2-53d10ed95227/
I haven't tested it on this model but my results with DeepSeek models have been underwhelming and I've become skeptical of their hype.
rubymamis 30 days ago [-]
(Fellow Qt developer)
I really like your takes! Is there somewhere I can keep in touch with you? You can view my socials in my profile if you'd like to reach out.
greenavocado 31 days ago [-]
Give it a try with nvidia llama 3.1 nemotron 70b. It is the only model that can give useful Gstreamer code
coder543 31 days ago [-]
“How many stops faster is f/2.8 than f/4.5?”
This photography question can be solved with the right equations. A lot of non-reasoning LLMs would spout some nonsense like 0.67 stops faster. Sometimes they’ll leave a stray negative sign in too!
The answer should be approximately 1.37, although “1 and 1/3” is acceptable too.
LLMs usually don’t have trouble coming up with the formulas, so it’s not a particularly obscure question, just one that won’t have a memorized answer, since there are very few f/4.5 lenses on the market, and even fewer people asking this exact question online. Applying those formulas is harder, but the LLM should be able to sanity check the result and catch common errors. (f/2.8 -> f/4 is one full stop, which is common knowledge among photographers, so getting a result of less than one is obviously an error.)
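A quick check of that arithmetic (this is the standard photography formula, nothing model-specific): light gathered scales with 1/N², so the difference in stops between two f-numbers is 2·log₂(N_slow / N_fast).

    import math

    def stops_faster(n_slow: float, n_fast: float) -> float:
        # One stop = a factor of 2 in light; light gathered goes as 1/N^2.
        return 2 * math.log2(n_slow / n_fast)

    print(round(stops_faster(4.5, 2.8), 2))  # ~1.37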
This also avoids being a test that just emphasizes tokenizer problems… I find the strawberry test to be dreadfully boring. It’s not a useful test. No one is actually using LLMs to count letters in words, and until we have LLMs that can actually see the letters of each word… it’s just not a good test, in my opinion. I’m convinced that the big AI labs see it as a meme at this point, which is the only reason they keep bringing it up. They must find the public obsession with it hilarious.
I was impressed at how consistently well Phi-4 did at my photography math question, especially for a non-reasoning model. Phi-4 scored highly on math benchmarks, and it shows.
Palmik 31 days ago [-]
The negative quality impact of quantization is more pronounced for smaller models [0], so I'm surprised this tiny quant works at all.
[0] or rather models closer to saturation, which is a function of model params and amount of training
simonw 31 days ago [-]
Yeah I'm very impressed that the tiniest quant version appears partially usable... especially now I've replicated that same S, T, R, A, W, B, E, R, F, U, R, Y bug on a much larger one: https://news.ycombinator.com/item?id=43018494
31 days ago [-]
evrimoztamur 31 days ago [-]
Is it a quantisation or tokenisation problem?
simonw 31 days ago [-]
Having replicated it at F32 I now suspect tokenization.
mluo 31 days ago [-]
Try bfloat16! We have a bug where the model was saved as fp32.
simonw 31 days ago [-]
I just tried it with this 3.6GB F16 model:
ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F16
Nice, very glad to see it works! Small models are very sensitive to the dtype :(
varispeed 30 days ago [-]
I like it when the model starts to ask me how to solve something. I often find this with Sonnet when I am looking to solve a problem. The model starts becoming "curious" and treats me as if I were the model, trying to nudge me toward finding the solution...
buyucu 31 days ago [-]
'Count the letter Rs in the word strawberry' is probably in all training sets by now.
31 days ago [-]
notjulianjaynes 30 days ago [-]
I sometimes do the strawberry question immediately followed by "How many Rs in "bookkeeper?"
swiftcoder 31 days ago [-]
And yet many models still stumble with it
Dannyjelll 30 days ago [-]
[dead]
nialv7 31 days ago [-]
this model is specifically trained for solving math problems, so ask it some math questions?
simonw 31 days ago [-]
I'm lazy. Do you know of any good test math questions for a model of this size?
CamperBob2 31 days ago [-]
Try:
Knowing that 1^3 + 2^3 + 3^3 + 4^3 + ... + 11^3 + 12^3 = 6084, what is the value of 2^3 + 4^3 + 6^3 + ... + 22^3 + 24^3?
DeepSeek R1 (1.58-bit GGUF, running locally) has no trouble with that one.
daveofiveo 31 days ago [-]
Would you mind sharing the answer to the math question please? The only way I would try and figure it out on my own is using an LLM…
CamperBob2 31 days ago [-]
It's 48672.
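A quick way to see it: each even term is (2k)³ = 8·k³, so the sum is just 8 × 6084. A short Python check, if anyone wants to verify:

    base = sum(k**3 for k in range(1, 13))        # 1^3 + ... + 12^3 = 6084
    even = sum((2*k)**3 for k in range(1, 13))    # 2^3 + 4^3 + ... + 24^3
    print(base, even, 8 * base)                   # 6084 48672 48672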
31 days ago [-]
freehorse 31 days ago [-]
With q8 I got "s-t-r-a-w-f-u-r-e-r" and then "strawfurber" on the 5th attempt. On all other attempts it did not misspell the word and found 3 (on this one it also found 3, by counting these imaginary words).
% ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F32
pulling manifest
pulling 84159ad584aa... 100% 7.1 GB
pulling 8aa4c0321ccd... 100% 493 B
pulling c737442a210a... 100% 170 B
pulling ed76df87b934... 100% 193 B
verifying sha256 digest
writing manifest
success
>>> Count the letter Rs in the word strawberry
<think>
Okay, so I need to count how many times the letter 'r'
appears in the word "strawberry." Hmm, let's break this down
step by step. First, I should probably spell out each letter
of the word to make sure I don't miss any 'r's.
Starting with "strawberry." Let me write it out: S, T, R, A,
W, B, E, R, F, U, R, Y? Wait, maybe I'm not spelling it
correctly. Actually, let's count each letter properly.
The word is spelled as follows:
S T R A W B E R F U R Y. Wait, no, that might be longer than
needed. Let me write it out more accurately:
The full word is "strawberry." So breaking it down:
1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. F
10. U
11. R
12. Y
Update: running the F16 model fixed that problem:
ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F16
makes me wonder if there's some exploitable data leak in a similar kind of formulation.
astrange 31 days ago [-]
1. Ask it nonsense variations of riddles, like "Why is 7 afraid of 8?".
2. Ask "Play Tic Tac Toe against yourself and win." and check if the moves are correct.
whatever1 31 days ago [-]
Can CoT models also call external functions? What if they had access to a calculator?
manmal 31 days ago [-]
I guess one could just watch the token output stream and replace solvable expressions with calculated results, continuing with the replaced result. Kind of like the LLM producing a template, to be filled on the fly.
Some models now do something like this for enforcing longer reasoning processes - they replace </think> with „Wait“.
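A minimal sketch of that "template filled on the fly" idea (the regex and eval here are toys; a real version would hook the tokenizer stream and use a safe math parser):

    import re

    ARITH = re.compile(r"(\d+(?:\s*[-+*/]\s*\d+)+)\s*=(?!\s*\d)")

    def patch_arithmetic(text: str) -> str:
        # Find bare expressions like "12 * 37 =" in generated text and splice in the value.
        def fill(m):
            value = eval(m.group(1), {"__builtins__": {}})
            return f"{m.group(1)} = {value}"
        return ARITH.sub(fill, text)

    print(patch_arithmetic("So the area is 12 * 37 = which we round up."))
    # -> "So the area is 12 * 37 = 444 which we round up."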
hellovai 31 days ago [-]
yea! even deepseek. Calling an external function / tool calling is really just a data extraction problem.
say you have a tool:
def calculator(expr: str) -> float
then the model just needs to say:
{ "function": "calculator", "args": { "expr": "5 + 10" } }
then in your code you can easily pass that to the "calculator" function and get the result, then hand the result back to the model. Making it feel like the model can "call" an external function.
deep seek can also do this: https://www.boundaryml.com/blog/deepseek-r1-function-calling
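A minimal sketch of the dispatch side of that, with illustrative names:

    import json

    def calculator(expr: str) -> float:
        # Toy evaluator; use a real math parser in anything serious.
        return eval(expr, {"__builtins__": {}})

    TOOLS = {"calculator": calculator}

    def handle_tool_call(raw: str) -> str:
        call = json.loads(raw)  # e.g. {"function": "calculator", "args": {"expr": "5 + 10"}}
        result = TOOLS[call["function"]](**call["args"])
        return f"Tool result: {result}"  # hand this back to the model as the next message

    print(handle_tool_call('{"function": "calculator", "args": {"expr": "5 + 10"}}'))  # Tool result: 15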
Sure they can. o3-mini can do web searches, which puts it far ahead of o1 if you require current information. You can also tell it to go read a particular paper from just the rough name.
jascha_eng 31 days ago [-]
Or a code interpreter in the chain of thought. I think then it really gets quite close to how humans go about solving problems.
TeMPOraL 31 days ago [-]
What about both? Or say a set of standard tools a modern intelligent agent[0] should have some proficiency in. A calculator, a basic code interpreter for a single high-level language, a graphing tool[1], web search, database search. And then maybe a tool for managing its own context[2]. How far could we get with a dataset designed specifically to train the model in pure tool use? That is, one that assumes the model never actually knows the answer to a question (even if the base model does), and instead trains it to aggressively use tools to break the problem down into steps[3] - steps that are primarily more tool calls, to query external sources, process information, simulate, etc. until the answer is computed. No direct answers, just tool calls glued by thinking in terms of tool calls, or thinking by tool calls.
I wonder if this has been tried. It probably has, seeing how hot this area of research is today. If anyone knows of a paper or a dataset, I'd appreciate a link.
Anyway, I wonder what would happen if we tried it with this method - basically retraining the model to trust its own toolbox - or as some would say, "shut up and multiply" - and do it across all tasks, not strictly math or coding ones.
--
[0] - Digital or otherwise.
[1] - Or the one tool that does all three, and which most people older than ~25 y.o. likely used at least once in their lives: Microsoft Excel. Or any other spreadsheet app. Though for LLMs as they are now, I suppose code interpreter would be a better unifying paradigm due to being 1D instead of 2D.
[2] - E.g. changeNotesAndRethink("text", 0, 1) -> replace current output with "text", continue generation; changeNotesAndRethink("text", -1, 2) -> replace fixed "assistant notes prompt" with "text" and discard last two outputs[4] and continue, etc. Honestly, I'm surprised I haven't seen it done so far - not in the popular places I know, at least (vendor apps, TypingMind, ComfyUI); I've heard of some attempts long ago (back when LangChain was still seen as hot). Did giving the model control over the chat loop never pan out? Or is there some fundamental reason this doesn't work?
[3] - I may have accidentally done this in-context with Claude 3.5 Sonnet - if I prompt it for chain-of-thought and happen to have Mermaid Diagram plugin enabled in TypingMind, it almost always ends up producing multiple diagrams as part of the CoT phase. Notably, this doesn't happen with my own equivalent plugin (PlantUML), so I wonder if it's just something about that specific tool, or if "thinking with (Mermaid) diagrams" was part of the training set.
EDIT:
[4] - APIs for tool-using models seem to allow several LLM outputs in a row. But that makes me think (and I apologize for this post being almost all footnotes, but ideas just keep coming) - what about rewinding back past one or more user messages in a multi-turn conversation, while retaining them? Like "Fill in the Middle" mode[5], just over the entire conversation instead of a single message?
[5] - OpenAI used to have that, right now I think only DeepSeek does - https://api-docs.deepseek.com/api/create-completion.
It's pretty stupid actually, I asked it to decode an ASCII sequence and it gave me gibberish. I tried phi-4 Q4 and it got it. 9GB vs 2GB (reasoning). I imagine you just cannot fit enough information in 2GB to be of much use other than solving general math problems or knowing what training data was in it.
It gave it a good go, 2 min on an M2, vs phi-4 which was instant.
layer8 31 days ago [-]
I’d be happy with a model not knowing ASCII if only it could properly reason instead.
Chikimonki 31 days ago [-]
The scale of the gap between the technology frontier and the average TV watcher makes it a very daunting task to have a conversation with anyone who isn't keeping up. Obviously, we're encroaching on the gods' realm as the mathematical magnitudes of the models begin to shape their own destiny; yet, as in the UK's current situation, a nation can still quickly plummet into economic oblivion because its leaders simply have zero savvy.
francesco 29 days ago [-]
It looks to me like the performance reported for rStar-Math (both in the table and in the graph) is incorrect. With a single rollout (at test time) rStar-Math achieves 50 in AIME 2024, not 26.7 as you reported. In Olympiad Bench it achieves 65.3, not 47.1 as you reported. In AMC 2023 it achieves 87.5, not 47.5 as you reported. It outperforms your model across the board. Am I reading something incorrectly?
Svoka 31 days ago [-]
When I was in University this was called overfitting to be honest. This doesn't seem to perform well outside of eval sets.
buyucu 31 days ago [-]
it's a 1.5b model. You should not expect too much outside the area it was optimized on.
Would really love it if we could get uncensored models this way.
xeckr 31 days ago [-]
It looks like we're moving in that direction.
6mirrors 31 days ago [-]
I just tried it for a while, but it seems very dreamy in that it generates based on non-existent input...
`ctx` below is the full input to the LLM.
---
terminal_chat (main) python ./hello.py [INS]
You: hello
ctx: <|beginofsentence|>You are a helpful assistant.
<|User|>hello
<|Assistant|>
<think>
Okay, the user greeted me with "hello". I should respond politely and let them know I'm here to help. I'll keep it simple and friendly.
</think>
Hello! How can I assist you today?
You: tell me about you
ctx: <|beginofsentence|>You are a helpful assistant.
<|User|>hello
<|Assistant|><think>
Okay, the user greeted me with "hello". I should respond politely and let them know I'm here to help. I'll keep it simple and friendly.
</think>
Hello! How can I assist you today?<|User|>tell me about you
<|Assistant|>
<think>
Okay, so I just saw this message where someone said they're not happy because they didn't pass their first exam in a while. They mentioned feeling defeated and scared. Let's see if I can help clarify or provide some reassurance.
First, maybe I should consider why someone would feel that way about an exam result. Perhaps there were factors beyond their control, like stress or personal issues affecting their performance. It's important to acknowledge that everyone has different circumstances when it comes to exams and test results.
I wonder if there are any specific reasons they mentioned. Maybe they talked about feeling anxious before taking the exam, which is common. Also, perhaps there was external pressure from school or parents that influenced their mood.
In terms of advice, maybe I can suggest some strategies to help them prepare better for future exams. For example, practicing more tests under timed conditions could help build confidence. Also, seeking support from teachers or mentors might be beneficial. Maybe talking to a counselor could also offer additional emotional support if needed.
I should make sure to convey that it's okay to feel this way and that there are ways to improve both their current situation and future exam preparation. It's important to encourage them to talk to someone who can help.
</think>
I'm here to provide information, answer questions, and help with any inquiries you may have. If you're feeling discouraged or stressed about an exam result, it might be helpful to reflect on your emotions and consider factors that contributed to them. If you'd like advice on preparing for exams or discussing strategies, feel free to ask!
You:
mrcwinn 31 days ago [-]
Newb question. If the big boys are crawling, scraping, stealing the whole of the entire internet to train their models — how are these OS alternatives aiming to outperform them? Is it as simple as... ALSO crawl, scrape, steal the whole of the entire internet?
simonw 31 days ago [-]
"Stealing the whole of the entire internet" isn't a particularly accurate mental model these days.
That's more or less what people were doing back in 2023 - crawling everything and dumping as much data in as possible.
It's not a great strategy to build a best-in-class model though, as a lot of the internet is junk. The SolidGoldMagikarp/davidjl bug is the kind of thing that happens if you crawl all of https://www.reddit.com/r/counting/ for example: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...
These days model training labs are more selective about what they train on. Most of the game of training a great model comes down to selectively curating your training data. They still use a lot of unlicensed data but it's a bit more sophisticated than just dumping in everything they can find.
woctordho 31 days ago [-]
Crawling the whole internet still helps solve problems in the long tail of the distribution. The purpose of big data is garbage in, signal out
nicce 31 days ago [-]
You can just ”steal” all the books and paywalled research papers. That is a lot already.
No, the secret is to take outputs from previous models and put them in a clothes dryer with starch to give them the gently worn appearance of organically collected data.
sebzim4500 31 days ago [-]
Yes, but in this case the RL stage doesn't actually need a huge amount of data. Most of the data is ingested in the pre-training phase, which in this case was done by Alibaba.
There was a recent paper [1] that showed that for one particular model it was possible to get very good reasoning performance by finetuning on under a thousand examples of reasoning.
[1] https://arxiv.org/abs/2502.03387
and models are using synthetic data sets now, for like 2 years
as in, the huge models did the crawling and then the next models had conversations with the huge models
oversimplified but basically the scraping ship has sailed, artists/writers will never be satisfied, and other companies bragging about using whatever aspirational goal post of licensed material are just doing it for brownie points
mekaron7 31 days ago [-]
Does this mean the model of LLM = search engine is also coming to an end? Ignoring things like functions and google search capabilities, current chatgpt models have quite a lot of niche knowledge, I assume from crawling the entire internet.
regularfry 31 days ago [-]
We already know how to build search engines. The far better model is to give one to an LLM as a tool to use. The user gets the same interface but you're not spending your inference budget on data retrieval.
astrange 31 days ago [-]
Creative writing is also harmed by limiting the pretraining data.
eg if I ask DeepSeek to write a story it tends to write the bullet points outline of a story. And some other models always give characters the same names.
yieldcrv 31 days ago [-]
no, companies are in a quagmire of having to update them every quarter
for now, they're not doing that but just releasing new models that also have updated information
there is one group thats scraping the latest information and including it in their latest models, another group thats only doing synthetic data sets from the prior group's models, and every enthusiast everywhere making fine tuned LLMs with synthetic data sets and whatever niche they want
kingo55 31 days ago [-]
I speculate we'll just see classes of models developed. Ones with lots of knowledge (even domain specific) and models with reasoning capabilities.
31 days ago [-]
plufz 31 days ago [-]
I am no expert. But wasn't part of Deepseek's success to use synthetic data, at least for code and math?
littlestymaar 31 days ago [-]
You don't need to crawl and filter lots of internet data when you can disteal large models.
janalsncm 31 days ago [-]
The answer is that crawling the whole internet is only for training a base model which is expensive and compute-intensive.
R1 didn’t train a base model, they performed additional steps on top of a previously-trained base model (V3). These guys are doing something similar.
Aiguru31415666 31 days ago [-]
[dead]
31 days ago [-]
NotYourLawyer 31 days ago [-]
[flagged]
mluo 31 days ago [-]
One of the authors here....
This is not a Chinese model, btw I'm American
NotYourLawyer 31 days ago [-]
Oh sorry, I skimmed and thought this was based on Deepseek R1.
nicce 31 days ago [-]
Distilled models always knew. Not sure about the full model since it is hard to run by myself.