This is particularly interesting as there seems to be, for decades, a general consensus that the problem of text compression is the same as the problem of artificial intelligence, for example https://en.wikipedia.org/wiki/Hutter_Prize
bravura 19 days ago [-]
"It is well established that compression is essentially prediction, which effectively links compression and language models (Delétang et al., 2023). The source coding theory from Shannon’s information theory (Shannon, 1948) suggests that the number of bits required by an optimal entropy encoder to compress a message ... is equal to the NLL of the message given by a statistical model." (https://ar5iv.labs.arxiv.org/html//2402.00861)
I will say again that Li et al 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression", which evaluates LLMs on their ability to predict future text, is amazing work that the field is currently sleeping on.
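To make the quoted claim concrete, here is a minimal sketch of "code length equals NLL". It assumes a toy per-character unigram model with add-one smoothing; the function names and the training string are made up for illustration and have nothing to do with the models in the paper. An ideal entropy coder driven by a model needs about -log2 p(symbol) bits per symbol, so a model that predicts the message better yields a shorter code.

```python
import math
from collections import Counter

def ideal_code_length_bits(message: str, probs: dict) -> float:
    """Negative log-likelihood of `message` in bits under a per-character model."""
    return -sum(math.log2(probs[ch]) for ch in message)

def unigram_model(training_text: str, alphabet: str) -> dict:
    """Character frequencies with add-one smoothing, so every symbol has p > 0."""
    counts = Counter(training_text)
    total = len(training_text) + len(alphabet)
    return {ch: (counts[ch] + 1) / total for ch in alphabet}

alphabet = "abcdefghijklmnopqrstuvwxyz "

# Baseline: a model that knows nothing and treats all 27 symbols as equally likely.
uniform = {ch: 1 / len(alphabet) for ch in alphabet}

# Hypothetical training text, chosen only so the model has seen similar input.
trained = unigram_model("the cat sat on the mat and the rat sat on the cat", alphabet)

message = "the cat sat on the mat"
bits_uniform = ideal_code_length_bits(message, uniform)
bits_trained = ideal_code_length_bits(message, trained)

print(f"uniform model: {bits_uniform:.1f} bits")
print(f"trained model: {bits_trained:.1f} bits")
```

An arithmetic coder gets within a couple of bits of these totals, which is why a better predictor is, in practice, a better compressor.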
larodi 19 days ago [-]
I’m not sure how this generalises to grammar-based compression, such as SEQUITUR for example… incidentally LZW is grammar-based too, though not advertised as such.
Math seems very limited when it comes to reasoning about generative grammars and their unfolding into text. Had the apparatus been there, we’d probably have had grammar/Prolog-based AI long ago…
jll29 18 days ago [-]
Grammars are not AI, it's just another formalism (like regular expressions, Turing machines etc.) - formalism alone doesn't solve anything.
In formal language theory, you have different classes of grammars; the most general ones correspond to Turing machines, i.e. they are a glorified assembler and you can do anything. The most restricted (in the Chomsky hierarchy), "Type 3" grammars, are basically another notation for regular expressions, and they describe regular languages.
There are algorithms for learning grammars, but the issue with that is that the induced grammars may not resemble anything that a human may write (in the same way that a clustering algorithm often does not give you the clusters you want).
But to answer your question, we need to separate the discussion between appropriate representation and method to solve a problem.
I believe grammar-based compression - if you accept probabilistic grammars - is similar to LLM-based compression at some level, in the sense that highly probable sequences of words get learned (whether by dictionary, grammar, or neural network = LLM could be just an implementation detail). Whichever you choose, you still need to solve the problem you are trying to solve (any grammar formalism still needs a parsing algorithm, and an actual grammar that does something useful - even after you develop a parser generator).
[Side rant, not responding specifically to the parent or OP: as a linguist, I'd also warn everybody against using "AI" with an article: *"an AI" (asterisk marks wrong use). It wrongly suggests human-like properties when it's actually just a matrix of numbers that encodes a model. Here is a test of whether you are using "AI" right: replace it by "Applied Statistics" in a sentence and see if you would still say it.]
AI is just an academic field (ill-named for historic reasons), subpart of computer science, and while it's fair to talk about useful representations for modeling human-like behaviors, we should focus on what intelligence is, and talk about the limits of concrete models and possibilities to extend them.
The thing about LLMs is they are a bit like the perfect snake oil salesman: extremely articulate, but they know very little or nothing about a lot, and understand nothing. (Whatever one criticises, they do the one thing that they are designed for very well: to generate text. Sadly that misleads a lot of people into forgetting that they are just next-word/next-sentence predictors.)
larodi 16 days ago [-]
You are very brave to call or not call something AI, but it is precisely generative grammars (stochastic ones) that were initially considered AI - as a linguist you should know this better than myself.
retrac 19 days ago [-]
There's a general consensus that entropy is deeply spooky. It pops up in physics in black holes and the heat death of the universe. The physicist Erwin Schrodinger suggested that life itself consumes negative entropy, and others have proposed other definitions of life that are entropic. Some definitions of intelligence also centre on entropy.
What to make of all that, however, has anything but consensus.
vintermann 18 days ago [-]
To have entropy, you need to have a notion of information. To have information, you have to decide which differences matter, i.e. which states you classify as the same.
This isn't a problem for physics, or for computer science. But it is a problem for would-be philosophers (including a few physicists and computer scientists!) who thought information was a shortcut to avoid answering big questions about what matters, what we care about.
pstuart 19 days ago [-]
I liked the awe you shared -- it made me want to learn more about entropy.
Y_Y 19 days ago [-]
[flagged]
endofreach 19 days ago [-]
> but on the internet you don't have to say anything and if you do it may as well have some substance
Seems like we're using different internets. Which i am glad about. I just wish mine had less of the negativity that's coming over from yours. Guess in the end, the people on your internet realize, it's more fun over here.
You could have expressed all of that with less maliciousness towards the person. Thank god, in my internet everyone can say whatever they want if they want. Because– and more people should remember this apparently– if i don't like it, i just turn off the internet, like grandma!
Wish all the best to you and everyone you care about in real life. I might be just a bot. You might be. We'll never know for certain. Don't let some bits mess with your feels.
Y_Y 18 days ago [-]
I'm sorry for leaking negativity into your internet. I don't think negativity is inherently undesirable, but I don't think it's useful to express it towards people's selves. I meant only to criticize the comment without further implication.
In fact I went and got some references I really liked because I was hoping to add what I felt was missing from the discussion on entropy. My motivation in the end was to share my personal feeling of awe, and in a way that was accessible to the parent poster as well as other readers. How do you like that internet?
jancsika 18 days ago [-]
> My motivation in the end was to share my personal feeling of awe, and in a way that was accessible to the parent poster as well as other readers.
Then write it that way:
1. Remove the first paragraph, where you treat the OP like a child by telling them where it is and isn't appropriate to express their idea
2. Remove the first two sentences of the 2nd paragraph
3. Remove the clause "but you can't get that from a quip."
Now we've got the beginnings of a delicious comment! You could even garnish it at the beginning with something like "Not sure if we're talking about the same thing, but..." But you don't even really need it.
That's the difference between playing in a sandbox with others, and unwittingly kicking someone out of one.
retrac 19 days ago [-]
I was trying to convey a subjective and emotional experience. Obviously I failed.
I hope that when you try to express awe it isn't dismissed as weasel words.
I give up. Delete my account please dang. This site isn't good for my mental health.
AdieuToLogic 19 days ago [-]
> I give up. Delete my account please dang. This site isn't good for my mental health.
While I cannot speak to your conclusion, I can humbly suggest to not put any credence in what some rando says on the Internet. Including myself. :-)
Far better is it to dare mighty things, to win glorious
triumphs, even though checkered by failure... than to rank
with those poor spirits who neither enjoy nor suffer much,
because they live in a gray twilight that knows not victory
nor defeat.[0]
I didn't like your comment, that's all. I'm just one anonymous asshole, I can't invalidate your sense of awe.
FWIW, I didn't want or expect to harm your mental health.
AdieuToLogic 19 days ago [-]
>> This is all weasel words, and you've misspelled "Schroedinger"/"Schrödinger". That sort of comment might be fine for the pub, but on the internet you don't have to say anything and if you do it may as well have some substance.
> ... I can't invalidate your sense of awe.
Actually, yes. Yes, you can.
And so could I, or anyone really, given sufficiently focused vitriol.
For example, your sentence fragment "This is all weasel words" is incorrect English. "This is" should use the plural form "These are" as the subject is "words" and not "weasel", as well as the modifier "all" emphasizing plurality.
The irony of your subsequently pointing out a spelling error and then chastising the OP for same has not been lost.
WalterBright 19 days ago [-]
At least 50% of posts that point out a spelling or grammatical error contain one as well.
AdieuToLogic 19 days ago [-]
> At least 50% of posts that point out a spelling or grammatical error contain one as well.
Quite true. While I do not generally claim to be a grammatical wizard, I do know when I hear from one (hello Zortech-C++, it's been too long!).
If you don't mind pointing out my mistake(s) above, I would appreciate it as my goal was to exemplify the social effect of pedantic critique. Being corrected when doing same could serve as an additional benefit.
WalterBright 19 days ago [-]
It's nice to hear from a ZTC++ user!
Y_Y 18 days ago [-]
What's the unconditional rate of errors in posts generally? Without the prior I don't know if whinginging about spelling or grammar makes my posts correcter or incorrecter.
Y_Y 18 days ago [-]
> "This [comment] is all weasel words."
The subject was "this", referring to the comment.
By what standard of English did you reckon my post incorrect? I appreciate your effort to cheer up your parent post, and to improve my language skills, of course.
(I'm not the language usage police, though I am fussy about correctly rendering people's names.)
I didn't understand your gainsaying about invalidating awe. Whether or not the poster's awe was a real and worthwhile feeling seems to me entirely independent of my opinions.
I find your aims admirable. However, I regret to say that for me the irony, and purpose of this comment thread, have indeed been lost.
AdieuToLogic 16 days ago [-]
>> "This [comment] is all weasel words."
> The subject was "this", referring to the comment.
While I understand your clarification being the intent, in the original context "this" is in its determiner form and not pronoun form. Would the addition of "comment" have been included, then I believe most (if not all) readers would understand its use as the pronoun form it is often used as well as being associated with the noun form of "comment."
More important than my pedantry was an attempt to illustrate how corrections in this medium can be interpreted quite differently based on the person. As you intimate, my example did not affect you adversely (which is great BTW). How the OP responded to your original reply indicated a different effect unfortunately. I am not judging, only providing my observation.
A quote I wish I knew much earlier in my life is:
A sharp tongue is the only edge tool that grows keener with
constant use.[0]
Thank you, especially for deft use of pedantry as a tool for good, and that quote which I'll retain.
jakeogh 16 days ago [-]
Your comment is excellent, inspiring and quite true.
Please stay, otherwise the rest of us are stuck with the alternative (which is essentially someone saying "read this Wikipedia and Schrödinger's original talks", with a perplexing pile of unhappiness, pretending to correct things that you didn't get wrong)
Y_Y 13 days ago [-]
Inspiring and true
I could leave if you like
CamperBob2 19 days ago [-]
You wouldn't toss out your radio because it picks up a bit of static now and then, would you? That's all that posts like that one amount to... static.
WhitneyLand 19 days ago [-]
I’m not sure this is strictly true. It seems more accurate to say there are deep connections between the two rather than they are theoretically equivalent problems. His work is really cool though no doubt.
micimize 19 days ago [-]
In the sense I understand that comparison, or have usually seen it referred to, the compressed representation is the internal latent in a (V)AE. Still, I haven't seen many attempts at compression that would store the latent + a delta to form lossless compression, that an AI system could then maybe use natively at high performance. Or if I have... I have not understood them.
nialv7 18 days ago [-]
it is true, but i think it's only of philosophical interest. for example, in a sense our physical laws are just humans' attempt at compressing our universe.
the text model used here probably isn't going to be "intelligent" the same way those chat-oriented LLMs are. you can probably still sample text from it, but you can actually do the same with gzip[1].
Also worth checking out some of the author's other compressors e.g. another one of their neural network solutions using a transformer https://bellard.org/nncp/ holds the top spot in the Large Text Compression Benchmark. It's ~3 orders of magnitude slower though.
remram 19 days ago [-]
If I read this correctly, the largest test reported on this page is the "enwik9" dataset, which compresses to 213 MB with xz and only 135 MB with this method, a 78 MB difference... using a model that is 340 MB (and was probably trained on the test data).
No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
Please let me know if I misunderstand.
zamadatix 19 days ago [-]
> using a model that is 340 MB
"The model is quantized to 8 bits per parameter and evaluated using BF16 floating point numbers" means the model is stored as 1 byte per parameter even though it's using a 2 byte type during compute. This is backed up by checking the size of the download, which comes out as 171,363,973 bytes for the model file.
> and was probably trained on the test data
This is likely a safe assumption (enwik8 is the default training set for RWKV and no mention of using other data was given) however:
> No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
The Ts_zip+enwik9 size comes out to less than the 197,368,568 for xz+enwik9 listed in the Large Text Compression Benchmark despite the large model file. Getting 20,929,618 total bytes smaller while keeping a good runtime speed is not bad and puts it decently high in the list (even when sorted by total size) despite the difference in approach. Keep in mind the top entry at 107,261,318 total bytes in the table is nncp by the same author (neural net but not LLM based) so it makes sense to keep an open mind as to why they thought this would be worth publishing.
remram 19 days ago [-]
I wouldn't be surprised if my math was wrong but I can't quite follow yours. ts_zip(171 MB you say)+llm-enwik9(135MB) = 306MB is still larger than xz(0.3MB)+xz-enwik9(213MB) = 213MB.
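The comparison in this comment is just two sums, written out below as a sanity check. The figures are the rounded megabyte values quoted in the thread (171 MB model file, 135 MB and 213 MB compressed enwik9, 0.3 MB for the xz binary); nothing here was measured independently.

```python
# All sizes in megabytes, rounded as quoted in the comments above.
ts_zip_model = 171    # model file shipped with ts_zip (171,363,973 bytes)
ts_zip_enwik9 = 135   # enwik9 compressed with ts_zip
xz_binary = 0.3       # size of the xz decompressor used in the comment
xz_enwik9 = 213       # enwik9 compressed with xz

# Benchmark-style total: compressed output plus whatever is needed to decompress it.
ts_zip_total = ts_zip_model + ts_zip_enwik9
xz_total = xz_binary + xz_enwik9

print(f"ts_zip total: {ts_zip_total} MB")
print(f"xz total:     {xz_total} MB")
print("ts_zip smaller overall?", ts_zip_total < xz_total)
```

Counted this way, for a single file, the model size outweighs the 78 MB saved on the compressed output.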
zamadatix 19 days ago [-]
I done did went and copied the enwik8 value for ts_zip when doing that compare, good catch!
I guess that leaves the question of "how well do the LLM's predictions work for things we're certain weren't in the test data set". If it's truly just the prebuilt RWKV then it is only trained on enwik8 and enwik9 is already a generalization, but there's nothing really guaranteeing that assumption. On the other hand... I can't think of GB class open datasets of plain English to test with that aren't already in use on the page.
pmayrgundter 19 days ago [-]
Not following. That top entry is marked as Transformer, which does mean it's an LLM
zamadatix 19 days ago [-]
Of the two, nncp uses transformers but isn't an LLM, while ts_zip doesn't use transformers but is an LLM. Remember LLM just means large language model; it doesn't make any assumptions about how it's built. Similarly, transformers just relate tokens according to attention; they don't make any assumptions that those tokens must represent natural language.
I.e. anything you can tokenize can be wrangled using a transformer, not just language. Thankfully the same author also has a handy example of this: transformer based audio compression https://bellard.org/tsac/
pmayrgundter 18 days ago [-]
Fair nuff. Thanks!
binary132 19 days ago [-]
If you’re compressing 100 or 100k such datasets, presuming that it is not custom tuned for this corpus, then wouldn’t you still save much more than you spend?
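A rough way to put numbers on this amortization argument is sketched below. It assumes, purely for illustration, that every additional dataset is the size of enwik9 and compresses exactly as well as enwik9 did (135 MB versus 213 MB), and that the 171 MB model is shipped once; `break_even_files` is a hypothetical helper, not something from the linked page.

```python
import math

def break_even_files(model_mb: float, saving_per_file_mb: float) -> int:
    """Smallest number of files at which shipping the model once beats the baseline.

    We need n * saving_per_file_mb > model_mb, so n is the first integer
    strictly above model_mb / saving_per_file_mb.
    """
    return math.floor(model_mb / saving_per_file_mb) + 1

model_mb = 171          # one-off cost of the model file
saving_mb = 213 - 135   # per-file saving over xz, using the enwik9 figures

n = break_even_files(model_mb, saving_mb)
print(f"model pays for itself from {n} enwik9-sized files onward")

for files in (1, n, 100):
    llm_total = model_mb + files * 135
    xz_total = files * 213
    print(files, llm_total, xz_total)
```

Whether real corpora behave like enwik9 is exactly the generalization question raised elsewhere in the thread, so treat the break-even count as an upper-level estimate only.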
remram 19 days ago [-]
I'm not saying the result is completely useless, I am comparing it to the age-old technique of using a dictionary. Does this new LLM-powered technique improve upon the old dictionary technique?
Dictionaries also don't require a GPU or this amount of RAM.
Where I assume LLMs would shine is lossy compression.
binary132 19 days ago [-]
Ah ok, I think we made different assumptions about whether the model was specific to the particular dataset so each one would need a new model — a dictionary is specific to the particular dataset being compressed, right? I was thinking the LLM would be a general-purpose text compression model.
remram 19 days ago [-]
Not particularly. You could make a dictionary from "the English web", with common character sequences found on those sites you use as input.
ksec 19 days ago [-]
I have the same question: what is the difference between an LLM and a dictionary in the context of compression? Can I not "train" a dictionary?
binary132 19 days ago [-]
AIUI, a dictionary is built during compression to specify the heuristics of a particular dataset and belongs to that specific dataset only. For example, it could be a ranking of the most frequent 10 symbols in the compressed file. That will be different for every input file.
mbreese 19 days ago [-]
> That will be different for every input file
That could be different for every input file, but it doesn't have to be. It could also be a fixed dictionary. For example, ZLIB allows for a user-defined dictionary [1].
In this case, I'd consider the LLM to be a fixed dictionary of sorts. A very large, fixed dictionary with probabilistic return values.
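For anyone who has not used the preset-dictionary feature, here is a small sketch with Python's standard `zlib` module, which exposes it through the `zdict` argument of `zlib.compressobj` and `zlib.decompressobj`. The dictionary and message contents are invented for the example; the only point is that both sides must hold the same dictionary, much as both sides must hold the same model in the LLM case.

```python
import zlib
from typing import Optional

# Hypothetical shared dictionary: byte sequences we expect to recur in messages.
shared_dict = b'{"user": "", "status": "active", "email": "@example.com"}'
message = b'{"user": "alice", "status": "active", "email": "alice@example.com"}'

def compress(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Deflate `data`, optionally priming the compressor with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Inflate `blob`; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

plain = compress(message)
primed = compress(message, shared_dict)

print("without dictionary:", len(plain), "bytes")
print("with dictionary:   ", len(primed), "bytes")
print("round trip ok:", decompress(primed, shared_dict) == message)
```

The dictionary is never stored in the output, only a checksum identifying it, so the saving shows up in the compressed size while the dictionary cost is paid once out of band.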
Ah, I see. I’d never thought of the possibility of using a dictionary not created specifically from the given input dataset, heh
mbreese 18 days ago [-]
Admittedly, I don’t think it is common, but I think there was a project a few years ago (Google?) that tried to compress HTML using at least a partially fixed dictionary.
Nowadays though, it’s apparently still something that’s being tried. Chrome now supports shared dictionaries for Zstd and Brotli. One idea being, you would likely benefit from having a shared dictionary used to decompress multiple artifacts for a site. But you may not want everything compressed all together, so this way you get the compression benefit but can have those artifacts split into different files.
Notably, solutions specialized for enwik9 (specifically fx2-cmix) take up only 110 MB, including the size of the decompressor.
justmarc 19 days ago [-]
This man is an absolute wizard, and a legend who hasn't stopped since the fantastic LZEXE days.
bhouston 19 days ago [-]
I believe almost all LLMs are trained using Wikipedia these days. So compressing Wikipedia well without including the size of the LLM in the compression result is a bit of a cheat. I guess one would argue it is a universal dataset representing understanding the English language and real-world relationships at this point, but it is still a bit of a cheat.
atiedebee 19 days ago [-]
There's a reason compression benchmarks often times include the size of the executable when benchmarking compression ratios. Although Matt Mahoney's large text compression benchmark[0] does currently have a transformer model at number 1.
Looks like it’s been updated since then; commenters in that thread are saying the decompressor needs to run on the same hardware as the compressor; now the link says:
> “The model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads. This key point ensures that a compressed file can be decompressed using a different hardware or software configuration.”
0-_-0 19 days ago [-]
1 MBps is insanely fast for a method like this, it must be in the 100k tokens per second range. Probably with large batches.
droideqa 19 days ago [-]
I have always thought compression to be an analog to intelligence. The smarter you are, the better at summarization you are.
Twirrim 19 days ago [-]
"(and hopefully decompress)" is a horrifying descriptor.
hansvm 19 days ago [-]
It adds levity to the article and also introduces the reader to the sorts of things that can go wrong if they try it at home.
The last paragraph highlights how they fixed one of the main pitfalls I normally see in this sort of thing, where floating-point operations are mangled in myriad ways in the name of efficiency (almost always correct for physics or whatever, but a single bit being incorrect will occasionally mangle this compression scheme).
Mind you, actually doing what they claimed in that last paragraph is usually painful. The easiest approaches re-implement floating-point operations in software using integer instructions, and the complexity increases from there.
orbital-decay 18 days ago [-]
Not just efficiency, if you have e.g. floating point values arriving asynchronously to be accumulated, you'll always have a slightly unpredictable result.
Fun fact: Gemini 2.0 Flash is 100% deterministic with temp 0, unlike most models. This must be related to TPUs somehow, not sure why all previous Gemini versions are not like that, though.
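The accumulation-order point is easy to reproduce on a CPU with three numbers. The sketch below only illustrates why the order of floating-point additions matters; it says nothing about how ts_zip or any GPU kernel actually enforces determinism. `math.fsum` is used as one example of an order-independent sum, since it returns the correctly rounded value of the exact total.

```python
import math
import operator
from functools import reduce
from itertools import permutations

values = [0.1, 0.2, 0.3]

# Same three numbers, two groupings: floating-point addition is not associative.
left_first = (values[0] + values[1]) + values[2]
right_first = values[0] + (values[1] + values[2])
print(left_first == right_first, left_first, right_first)

# Plain left-to-right addition over every arrival order of the same values.
# (reduce with operator.add is used so each step is an ordinary float addition.)
naive_results = {reduce(operator.add, order) for order in permutations(values)}

# An exactly rounded sum gives one answer regardless of order.
exact_results = {math.fsum(order) for order in permutations(values)}

print("distinct naive sums:", len(naive_results))
print("distinct fsum sums: ", len(exact_results))
```

A compressor that feeds such sums into an arithmetic coder needs the decoder to reproduce them bit for bit, which is why a fixed evaluation order (or integer arithmetic) matters so much here.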
perching_aix 19 days ago [-]
They're clearly just poking fun at it.
mikevin 19 days ago [-]
I'm curious what the compressed text looks like. Anyone have an example?
Lerc 19 days ago [-]
If it is within cooee of state of the art the compressed text should look like a pile of random bits.
If it looks like anything at all other than randomness then you can describe whatever it is that it looks like to get more compression.
munch117 19 days ago [-]
Binary goo, barely distinguishable from random data, if at all. The arithmetic coder will make sure of that.
It's the nature of compression: Any discernible pattern could have been exploited for further compression.
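A quick way to see the "looks like random bits" effect without an LLM is to measure byte-level entropy before and after an ordinary compressor. This sketch uses zlib on synthetic text built from a made-up word list with a fixed seed; it is a stand-in for illustration, not the output of ts_zip, and deflate is far from state of the art, so its output is only approximately random.

```python
import math
import random
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Empirical Shannon entropy of the byte histogram, in bits per byte (max 8)."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic, highly patterned input: words drawn from a small hypothetical list.
rng = random.Random(0)
words = ["entropy", "model", "token", "predict", "compress", "grammar", "text", "bit"]
text = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

packed = zlib.compress(text, 9)

print(f"original:   {len(text)} bytes, {byte_entropy(text):.2f} bits/byte")
print(f"compressed: {len(packed)} bytes, {byte_entropy(packed):.2f} bits/byte")
```

The byte histogram of the compressed stream is much flatter than that of the input; any remaining structure is exactly what a stronger model would squeeze out.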
j_juggernaut 19 days ago [-]
Made a quick and dirty Streamlit app to play around with encrypt/decrypt
Devising the minimal grammar that generates the text is NP-hard (https://en.m.wikipedia.org/wiki/Smallest_grammar_problem)
[0] https://www.brainyquote.com/quotes/theodore_roosevelt_103499
HTH
[0] https://www.brainyquote.com/quotes/washington_irving_384249
[1]: https://github.com/Futrell/ziplm
[1] https://www.rfc-editor.org/rfc/rfc1950#page-9
https://developer.chrome.com/blog/shared-dictionary-compress...
[0] http://www.mattmahoney.net/dc/text.html
Demo and code? Available at bellard.org as well.
Has anyone done the work of comparing this to other similar extreme audio compression solutions?
https://llmencryptdecrypt-euyfofcjh8bf2utuha2zox.streamlit.a...