The way this works is awesome. If I understand correctly, it works like this: given (part of) a sentence, the actual next token in the sequence will usually be among the model's top-scoring predictions, so most next tokens can be mapped to very low numbers (0 if the actual next token is the best token in the LLM prediction, 1 if it is the second best, ...). These small numbers can be encoded very efficiently using trivial old techniques. And boom: done.
So for instance:
> In my pasta I put a lot of [cheese]
LLM top N tokens for "In my pasta I put a lot of" will be [0:tomato, 1:cheese, 2:oil]
The real next token is "cheese" so I'll store "1".
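A rough sketch of that rank idea in Python (the candidate table is a toy stand-in for the LLM; a real implementation would rank the model's whole vocabulary by logit at every step):

    # Toy stand-in for an LLM: context -> candidate next tokens, best first.
    def ranked_candidates(context):
        table = {"In my pasta I put a lot of": ["tomato", "cheese", "oil", "salt"]}
        return table.get(context, [])

    def encode_ranks(context, actual_tokens):
        """Replace each actual token by its rank in the model's prediction."""
        ranks = []
        for tok in actual_tokens:
            ranks.append(ranked_candidates(context).index(tok))  # small if the model is good
            context = context + " " + tok  # feed the real token back as context
        return ranks

    print(encode_ranks("In my pasta I put a lot of", ["cheese"]))  # -> [1]

Decoding mirrors it: run the same model and take the candidate at the stored rank.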
Well, this is neat, but also very computationally expensive :D So for my small ESP32 LoRa devices I used this: https://github.com/antirez/smaz2
And so forth.
gliptic 19 days ago [-]
I'm pretty sure it doesn't use ranking. That leaves a lot of performance on the table. Instead you would use the actual predicted token probabilities and arithmetic coding.
antirez 19 days ago [-]
I supposed it used arithmetic coding with the ranking because the ranks have a distribution that's easy to exploit: zero more likely, one a bit less, and so forth. What's your guess? Unfortunately Bellard is as smart as he is hermetic. We are here guessing at what should be in a README file.
gliptic 19 days ago [-]
The model gives you a probability distribution over the tokens. You could use that directly with arithmetic coding, but there are ways to convert that to a distribution over e.g. the next byte instead, which would improve efficiency further by removing the redundancy in alternative token encodings. ts_zip does this, and the README says this works similarly to ts_zip.
EDIT: Hm, or maybe ts_zip uses just the token probabilities directly. I thought it was slightly more efficient about it.
"The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities."
antirez 19 days ago [-]
Oh, that makes sense! So they use the probability of the next token itself. Thanks for clarifying. Also a clever trick about handling the multiple potential token encodings of the same text.
bravura 19 days ago [-]
antirez, it's probably identical to the approach in this paper:
Li et al 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression" (https://ar5iv.labs.arxiv.org/html//2402.00861).
There's a pretty straight line from assigning probabilities (to a sequence of tokens) to arithmetic compression as an optimal compression algorithm for that distribution.
gus_massa 19 days ago [-]
If you are going to zip the resulting file, it may be useful to have a lot of 0s.
If you are going to send the result as is, Huffman coding (with some escape for unusual words?) will be better. I think even better than the other method that forgets the probabilities and then tries to compress the result.
antirez 19 days ago [-]
Just to clarify: even storing the ranking here would likely produce good results, but not as good as storing the probability, since the latter better exploits arithmetic coding's ability to store fractional intervals. But here the fundamental trick is that the LLM can compress the "next in sequence" information into a distribution that is much easier to compress than the initial data itself.
gliptic 19 days ago [-]
This is especially true for instance when you have two or more tokens that are about equally likely, or one token that is virtually certain, which ranking would obscure.
antirez 19 days ago [-]
Indeed.
astrange 18 days ago [-]
Arithmetic coding is better than Huffman coding because it can use a fractional number of bits per code, while Huffman has to use a whole number of bits.
IIRC the only reason it wasn't always used for everything is patents. A secondary reason being that it can be slow if you design it without thinking about performance, eg if you use divisions.
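A quick illustration of the fractional-bit point (the probabilities here are made up, just to show the arithmetic):

    import math

    # The ideal code length for a symbol with probability p is -log2(p) bits.
    # A prefix code like Huffman must spend a whole number of bits (at least 1)
    # per symbol; arithmetic coding can effectively spend the fractional amount.
    for p in (0.99, 0.9, 0.5):
        print(f"p={p}: ideal {-math.log2(p):.3f} bits, any prefix code >= 1 bit")

So a token the model is 99% sure about costs ~0.014 bits under arithmetic coding, but at least a full bit under Huffman, which matters a lot when most predictions are near-certain.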
gliptic 18 days ago [-]
Huffman still has a performance edge for static distributions. ANS bridges some of the performance gap between arithmetic coding and Huffman.
astrange 16 days ago [-]
But there's no such thing as a static distribution :)
amelius 19 days ago [-]
This is very similar to how many compression schemes work. Look up Huffman coding to begin with: https://en.wikipedia.org/wiki/Huffman_coding
For anyone interested in this topic, Primeagen has a pretty great video on how he used several encoding schemes to save bandwidth in one of his projects: https://www.youtube.com/watch?v=3f9tbqSIm-E
Seems like an ideal compression method for LoRa/Meshtastic-style communication. An LLM wouldn't run on an ESP32, but there are several that could run on a raspberry pi.
It's not just natural language that could be compressed this way, either. Code (HTML, JS, etc) could be compressed with the same technique/models. I bet that the same general idea could work for image compression as well, using an image/diffusion model (or perhaps a multimodal model for everything.)
This could lead to an entire internet of content using just a few bits.
soulofmischief 19 days ago [-]
The key insight is that the larger the shared context between parties, the more efficient communication can be, as communication tends towards a purely relational construct. The limit of this is two parties that share the exact same context and inputs: the same inputs produce the same hidden state in both parties, and communication is not even necessary because both already have the same knowledge and state.
That's not new to anyone familiar with compression or information theory, but the novelty here is the LLM itself. It's absolutely plausible that, given an already highly compressed relationally-encoded context like a trained LLM, very few bits could suffice to communicate very abstract and complex ideas, letting the LLM recontextualize information which has been compressed across several semantic and contextual layers, effectively leveraging a complete (but lossy) history of human knowledge against every single bit of information communicated.
nullc 19 days ago [-]
LoRa is also pretty slow; the 'long fast' mode that most Meshtastic users use is about a kilobit per second... and presumably only a small percentage of the traffic at any time is in channels that you're monitoring.
Probably decoding a few tokens per second is fast enough to deliver more goodput than the existing uncompressed usage.
Retr0id 19 days ago [-]
It'd be a fun experiment to try making it lossy.
You could adjust tokens towards what's more statistically probable, and therefore more compressible (in your example, it'd be picking tomato instead of cheese)
lxgr 19 days ago [-]
I could see that as a plot point in a science fiction story: Intergalactic telegrams are prohibitively expensive, so before sending one you're offered various variants of your text that amount to the same thing but save data due to using more generic (per zeitgeist) language :)
Compare also with commercial code [1], a close historical analog, albeit with handcrafted, as opposed to ML-derived, compression tables. (There was a single code point for "twins, born alive and well, one boy and one girl", for example! [2])
[1] https://en.wikipedia.org/wiki/Commercial_code_(communication...
[2] https://archive.org/details/unicodeuniversa00unkngoog/
Comes up as a minor plot point in Vernor Vinge's The Blabber (1988):
> “And from your standpoint, Hamid, there’s one big drawback. The mean bandwidth of this thing [an ansible, more or less] is just under six bits per minute.”
> “Huh? Ten seconds to send a single bit?”
> “Yup. Skandr left three protocols at the Lothlrimarre end: ASCII, a Hamming map to a subset of English, and an AI scheme that guesses what you’d say if you used more bits. The first is Skandr’s idea of a joke, and I wouldn’t trust the third more than wishful thinking.”
(Good advice at the end there.)
dekhn 19 days ago [-]
A bit of an aside- in one of the sequels to A Fire Upon the Deep, somebody has to interpret some very lossy audio and video into the most likely explanation, but they are stuck with a stupider than usual AI and it misinterprets the results (it's implied if they had their full AI it would have gotten the interpretation correct even with the ambiguity). This episode in the book completely changed how I think about imputation under uncertainty. Often, I don't want a single high confidence prediction, I want a probability distribution of the most likely predictions, rank ordered.
wat10000 19 days ago [-]
Fire has evocations, which are videos compressed down to something like just a description, then rendered at the receiving end in a way that hopefully has some resemblance to the original.
One viewer stumbles onto a key insight about the struggle taking place, but they only have evocations so they’re not sure. And they sound like a total kook so everyone ignores them.
dekhn 19 days ago [-]
I can't find an exact reference and I don't want to spoil too much, but I think this was in Children of the Sky: at some point one of the wolf creatures is imprisoned and somebody uses the ship's reduced AI to spy on them.
abecedarius 19 days ago [-]
Slight correction: the person with the spy system didn't believe its reports in the end -- the doubt/dismissal problem came before sharing with others, as I remember it. Agreed this was in Children of the Sky.
(aFutD also had a case of high-stakes suspicion of highly compressed messages during Zone turbulence, as I think the GP meant.)
askvictor 19 days ago [-]
I shudder to think how current LLMs would go with this. I guess we can currently do this easily for still images and audio. Kind of reminds me of Translation Party.
duskwuff 19 days ago [-]
"Hexapodia is the key insight."
taylorius 18 days ago [-]
I see you Twirlip of the mists
antirez 19 days ago [-]
Yep. For lossy compression, what could work even better is an encoder-decoder model, so that it is possible to just save the embedding and later turn the embedding back into the meaning.
srush 19 days ago [-]
I've tried to build this sort of model several times, but could never get it to work. The challenge is that small perturbations in encoder space lead to removing semantically important details (e.g. dates). You really want these to mess up syntax instead to get something more analogous to a lossy video encoder.
nullc 19 days ago [-]
I built a lossy text compressor in the days before LLMs.
I used a word embedding to convert the text to a space where similar tokens had similar semantic meaning, then I modified an ordinary LZ encoder to choose cheaper tokens if they were 'close enough' according to some tunable loss parameter.
It "worked", but was better at producing amusing outputs than any other purpose. Perhaps you wouldn't have considered that working!
In terms of a modern implementation using an LLM, I would think that I could improve the retention of details like that by adapting the loss parameter based on the flatness of the model. E.g. for a date the model may be confident that the figures are numbers but pretty uniform among the numbers. Though I bet those details you want to preserve have a lot of the document's actual entropy.
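For what it's worth, here's a toy Python sketch of just that "close enough" substitution step, with made-up word vectors and frequencies (the real thing did this inside a modified LZ encoder, which this doesn't attempt):

    import math

    # Toy embedding and frequency tables; "cheaper" here just means more frequent.
    VEC = {"large": (1.0, 0.1), "big": (0.95, 0.15), "huge": (0.9, 0.3), "tiny": (-1.0, 0.0)}
    FREQ = {"big": 1000, "large": 400, "tiny": 300, "huge": 50}

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    def lossy_substitute(word, max_loss=0.02):
        """Swap a word for a cheaper near-synonym if the semantic drift stays under max_loss."""
        if word not in VEC:
            return word
        best = word
        for cand in VEC:
            if FREQ.get(cand, 0) > FREQ.get(best, 0) and 1 - cosine(VEC[word], VEC[cand]) <= max_loss:
                best = cand
        return best

    print([lossy_substitute(w) for w in ("large", "tiny")])  # ['big', 'tiny'] with these numbers

Cranking the loss parameter up gives progressively weirder paraphrases, which matches the "amusing outputs" experience.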
antirez 19 days ago [-]
Yep, makes sense... Something like 20 years ago I experimented with encoder/decoder models for lossy image compression and it worked very well, but it's a completely different domain indeed, where there aren't single local concentrations of entropy that mess with the whole result.
mikaraento 18 days ago [-]
I guess text in images would be similar, and is indeed where image generation models struggle to get the details right.
E.g., making a greeting card with somebody's name spelled correctly.
__MatrixMan__ 18 days ago [-]
Seems like an opportunity to do some steganography. If the model isn't known by the attacker (or perhaps they can be "salted" to become unknown) then the actual message can be encoded in the offsets.
This would be much nicer than text-in-image steganography because services often alter images before displaying them, but they rarely do that to text (assuming the usual charset and no consecutive whitespace).
The idea seems similar enough that I wanted to share. The same way you can hide information in the text to prove it was generated by a specific model and version, of course you can use this for secrets as well.
Groxx 18 days ago [-]
tbh I'm not sure this would qualify as steganography - the message doesn't exist at all in the encoded form. It's not hidden, it's completely gone, the information is now split into two pieces.
So it's cryptography. With a shared dictionary. Basically just ECB, though with an unbelievably large and complicated code book.
dTal 17 days ago [-]
Realistically, 1) the model will be one of a small number of available models which can be tested against, 2) LLMs converge on similar predictions, so even an "unknown" model can hardly be considered cryptographic. The advantage of using an LLM this way is the ability to hide an information stream in an innocuous looking text, in the form of subtle deviations from the most likely next word. Hence, steganography.
Incidentally, this technique is actually an old one that has already seen use in magic shows and confidence tricks; two people who wish to communicate covertly, such as a magician and an "audience member", can exchange small amounts of information by varying the wording of simple responses: "yes", "that's right", "quite so". This can be used to, for instance, discreetly reveal the value of a card that the magician is "guessing" through "ESP".
__MatrixMan__ 18 days ago [-]
I may be abusing some definition or another, but I'd say that if the primary design goal is that your cyphertext can masquerade as cleartext, "steganography" scratches the itch pretty well, if not precisely.
Groxx 18 days ago [-]
48298346,1,3,2,3,1,2,3 doesn't really masquerade as cleartext.
you could hide that as text in other text, and that'd be steganography.
__MatrixMan__ 17 days ago [-]
Sorry I wasn't very complete with my description. I mean that 0,0,0,0... would correspond with the "most probable" continuation of some prompt and it would map to sensical english. And then 48298346,1,3,2... would correspond with a less probable continuation of the prompt, but it would also map to sensical english. But where more vs less probable, and the associated probabilities, are only knowable by someone with access to the secret LLM.
So you'd feed the algorithm some starter text like: "Here's my favorite recipe for brownies", and then you'd give it some data to encode, and depending on which data you gave it, you'd get a different, but "plausible", recipe for brownies. The recipient could reverse the recipe back into numbers, and from that they'd decode the hidden message.
The trick would be balancing the LLM's attempt to make sense against whatever additional constraints came along with your data encoding scheme. If you tried to encode too much cyphertext into a too-short brownies recipe, the recipe would fail to be convincingly a recipe. Conveniently, it's conventional to prefix recipes with a tremendous amount of text that nobody reads, so you've got a lot of entropy to play in.
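A toy Python sketch of that, hiding 2 bits per generated word by choosing among the top 4 candidates at each step (the candidate table stands in for the secret LLM; both parties would need the exact same model and prompt):

    # Ranked next-word candidates at each step, as the shared model would produce them.
    CANDIDATES = [
        ["rich", "dense", "dark", "classic"],
        ["chocolate", "cocoa", "fudge", "mocha"],
    ]

    def hide(two_bit_values):
        """Each value in 0..3 picks one of the model's top-4 candidates."""
        return " ".join(CANDIDATES[i][v] for i, v in enumerate(two_bit_values))

    def recover(text):
        return [CANDIDATES[i].index(w) for i, w in enumerate(text.split())]

    cover = hide([2, 1])              # -> "dark cocoa", reads like normal recipe copy
    print(cover, recover(cover))      # the recipient gets [2, 1] back

In practice you'd weight the choice by the model's probabilities (arithmetic-coding style) so the cover text stays plausible while still carrying bits.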
Groxx 15 days ago [-]
oooh, yeah, I was thinking about it backwards. sorry about that! yeah I'd agree that's steganography.
I would definitely expect something like this to happen at some point. as long as people use LLMs with a non-zero temperature, you'd expect variation from anyone's output, so it'd be super hard to detect / super deniable with just the output.
userbinator 19 days ago [-]
> very computationally expensive
The same goes for all the other higher-order probability models, which are used in what is currently the best known compression algorithm:
https://en.wikipedia.org/wiki/PAQ
LLMs are just another way to do the probability modeling.
throawayonthe 19 days ago [-]
[dead]
userbinator 19 days ago [-]
The download is 153MB, compressed... didn't even bother to wait for it to finish once I saw the size.
The brotli comparison is IMHO slightly misleading. Yes, it "embeds a dictionary to optimize the compression of small messages", but that dictionary is a few orders of magnitude smaller than the embedded "dictionary" which is the LLM in ts_sms.
There's a reason the Hutter Prize (and the demoscene) counts the whole data necessary to reproduce its output. In other words, ts_sms took around 18 bytes + ~152MB while brotli took around 70 bytes + ~128KB (approximately the size of its dictionary and decompressor).
theamk 19 days ago [-]
Life is more than that competition?
For example, antirez mentioned LoRa in the earlier thread - that's a cheap, license-free radio which achieves long range at the expense of a low data rate (250 bit/sec). That's about 30 bytes/second, not including framing overhead and retransmission.
If you wanted to build a communication system out of those, this compression method would be great. You'd have a LoRa device that connects to a regular cell phone and provides connectivity, and all the compression/decompression and UI happens on the cell phone. 150MB is nothing for modern phones, but you'd see a real improvement in message speed.
f33d5173 19 days ago [-]
You can compress arbitrarily many messages with it, and the dictionary remains 153MB. The reason it's worth pointing out that brotli already uses a dictionary is that otherwise it would have to generate the dictionary as it compressed, meaning that short messages would be pessimized. So brotli is in some sense the state of the art for short messages.
tshaddox 19 days ago [-]
This is obviously relevant to the Hutter Prize, which is intended to incentivize AI research by awarding cash to people who can losslessly compress a large English text corpus:
https://en.wikipedia.org/wiki/Hutter_Prize
From a cursory web search it doesn't appear that LLMs have been useful for this particular challenge, presumably because the challenge imposes rather strict size, CPU, and memory constraints.
willvarfar 19 days ago [-]
This compressor is by a certain Fabrice Bellard, an overactive overachieving powerhouse of a programmer who happens to be leading the Large Text Compression Benchmark https://www.mattmahoney.net/dc/text.html, which is maintained by Matt Mahoney who happens to run the Hutter Prize :)
Fabrice also makes some programs you might use, like FFMEG and QEMU
https://bellard.org/
Fabrice hasn't worked on ffmpeg for a very long time now; the main developer is Michael Niedermayer. (Or was, last time I checked.)
TacticalCoder 19 days ago [-]
> FFMEG
For those unaware, it's a typo. willvarfar meant FFMPEG.
berbec 19 days ago [-]
> Fabrice also makes some programs you might use, like FFMEG and QEMU
This would be the one sentence that wouldn't cause me to look down on someone, if used as a third-person humble-brag.
evertedsphere 19 days ago [-]
it's simpler: the hutter prize imposes a 110M constraint on the sum of the sizes of your program (including any data it needs to run) and the compressed data
llms are generally large
gliptic 19 days ago [-]
This could be circumvented by _training_ the LLM on the fly on the previously observed file data. This is what Bellard's other NN compressor, nncp, does [1], which is currently #1 on Mahoney's benchmark [2]. Unfortunately this is too slow, especially running on the CPU as Hutter's challenge stipulates IIRC.
[1] https://bellard.org/nncp/
[2] http://mattmahoney.net/dc/text.html
In fact, pretty much every adaptive compression algorithm does. The eventual compression ratio would thus be determined by the algorithm (nncp, cmix, ...; also includes smaller tweaks like those typically made by the Hutter Prize winners) and its hyperparameters.
gliptic 18 days ago [-]
Yes, the only exception is dictionaries used in preprocessing, but I think that's mostly a tradeoff to reduce the runtime.
vlovich123 19 days ago [-]
More because it's lossy compression that has been analogized to intelligence, and this prize is doing a sleight of hand by inserting lossless compression as if that doesn't make a difference. That's more why LLMs aren't really all that useful here.
tshaddox 19 days ago [-]
I wouldn't call that a sleight of hand. Surely better lossy compression can be trivially used to implement better lossless compression, and the latter is just much easier to quantify for a benchmark.
vlovich123 19 days ago [-]
Not a single lossless compression technique I’m aware of starts off with lossy compression.
They have different goals and utilize completely different techniques.
At most lossy techniques leverage lossless techniques (eg to compress non-perceptual binary headers) not the other way round.
leijurv 19 days ago [-]
Here's the submission that won the Hutter Prize in 2021: https://github.com/amargaritov/starlit It uses an LSTM to predict the next token lossily, then uses https://en.wikipedia.org/wiki/Arithmetic_coding to convert that to lossless compression. Lossless compression can definitely leverage a lossy compressor, such as via arithmetic coding. Also see: https://en.wikipedia.org/wiki/Context-adaptive_binary_arithm... which has a simple "Example" section - imagine if the top prediction made by your neural network was correct, you emit "0", if the 2nd was correct, you emit "10", if the 3rd, "110", if the 4th, "1110". As you can see, this is lossless, but the fundamental prediction is lossy, and the better that prediction is, the better the compression. (In actuality, you wouldn't waste your 1 bits like this, you'd use arithmetic coding instead).
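That 0/10/110 scheme in a few lines of Python (a unary-style prefix code over prediction ranks; as the parenthetical says, a real system would use arithmetic coding instead):

    def rank_to_code(rank):
        """0 -> '0', 1 -> '10', 2 -> '110', ...: cheap whenever the model's top guess is right."""
        return "1" * rank + "0"

    def read_rank(bits):
        rank = bits.index("0")          # count the leading 1s
        return rank, bits[rank + 1:]    # rank plus the remaining bitstream

    ranks = [0, 0, 1, 0, 3]                               # ranks of the true tokens
    stream = "".join(rank_to_code(r) for r in ranks)      # '001001110'
    decoded, rest = [], stream
    while rest:
        r, rest = read_rank(rest)
        decoded.append(r)
    print(stream, decoded)                                # lossless round trip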
willvarfar 19 days ago [-]
Yes, one way to think about arithmetic compression is encoding the difference between the prediction and reality.
This isn't normally what people mean by lossy compression, though. In lossy compression (e.g. mainstream media compression like JPEG) you work out what the user doesn't value and throw it away.
vlovich123 18 days ago [-]
That’s a stretch to call it lossy. To my eye the purpose of the LSTM seems indistinguishable from a traditional compression dictionary.
And that still doesn’t show how lossless compression is tied to intelligence. The example I always like to give is, “What’s more intelligent? Reciting the US War of Independence Wikipedia page verbatim every time, or being able to synthesize a useful summary in your own words and provide relevant contextual information such as its role in the French Revolution?”
tshaddox 18 days ago [-]
These lossless compression algorithms don’t just recall a fixed string of text. Obviously that can be accomplished trivially by simply storing the text directly.
These lossless compression algorithms compress a large corpus of English text from an encyclopedia. The idea is that you can compress this text more if you know more about English grammar, the subject matter of the text, logic, etc.
I think you’re distracted by the lossless part. The only difference here between lossy and lossless compression is that the lossy algorithm also needs to generate the diff between its initial output and the real target text. Clearly a lossy algorithm with lower error needs to waste fewer bits representing that error.
vlovich123 17 days ago [-]
There’s no immediately obvious reason that you should be able to come up with a diff correction, applied to recreate the lossless original, that is more efficient than traditional compressors. In essence you’re still stuck with the idea that you have a compression dictionary and are trying to build a more efficient dictionary. It’s not clear there’s a link between that and intelligence.
canjobear 19 days ago [-]
This is standard lossless compression. None of the concepts particular to lossy compression (like rate-distortion theory) are used.
kianN 19 days ago [-]
For those wondering how it works:
> The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities. [1]
It’s also mentioned that the model is configured to be deterministic, which is how I would guess the decompression is able to map a set of token likelihoods to the original token?
[1] https://bellard.org/ts_zip/
Discussed (once more) in a neighbor thread: https://news.ycombinator.com/item?id=42549083
Isn’t an LLM itself basically a compression of the texts from the internet? You can download the model and decompress the (larger) content with compute power (lossily).
kianN 19 days ago [-]
Yeah that’s exactly how I think of llms in my head: lossy compression that interpolates in order to fill in gaps. Hallucination is simply interpolation error. Which is guaranteed in lossy compression.
giovannibonetti 19 days ago [-]
Regarding lossless text compression, does anyone know a simple way to compress repetitive JSON(B) data in a regular Postgres table? Ideally I would use columnar compression [1], but I'm limited to the extensions supported by Google Cloud SQL [2].
Since my JSON(B) data is fairly repetitive, my bet would be to store some sort of JSON schema in a parent table. I'm storing the response body from an API call to a third-party API, so normalizing it by hand is probably out of the question.
I wonder if Avro can be helpful for storing the JSON schema. Even if I had to create custom PL/SQL functions for my top 10 JSON schemas it would be ok, since the data is growing very quickly and I imagine it could be compressed at least 10x compared to regular JSON or JSONB columns.
[1] https://github.com/citusdata/citus?tab=readme-ov-file#creati...
[2] https://cloud.google.com/sql/docs/postgres/extensions
I ended up with a similar problem. We replaced the data with a simple binary serialization format, gzip’ed that, and then base64 encoded the gzipped data. It’s far from perfect, but it was a 250x saving in our case, taking it from “stupidly large” to “we don’t care” with an hour’s work.
coder543 19 days ago [-]
For compressing short (<100 bytes), repetitive strings, you could potentially train a zstd dictionary on your dataset, and then use that same dictionary for all rows. Of course, you’d want to disable several zstd defaults, like outputting the zstd header, since every single byte counts for short string compression.
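A sketch of that with the python-zstandard bindings (the sample rows, sizes, and 4 KB dictionary size are made up; treat it as an outline, not a tuned recipe):

    import zstandard

    # Hypothetical rows of short, repetitive JSON standing in for the real table data.
    samples = [b'{"status":"ok","user":{"id":%d,"plan":"free"}}' % i for i in range(2000)]

    # Train a shared dictionary once and store it next to the table...
    dict_data = zstandard.train_dictionary(4 * 1024, samples)

    # ...then reuse it for every row, in both directions.
    cctx = zstandard.ZstdCompressor(level=19, dict_data=dict_data)
    dctx = zstandard.ZstdDecompressor(dict_data=dict_data)

    row = b'{"status":"ok","user":{"id":4242,"plan":"free"}}'
    blob = cctx.compress(row)
    assert dctx.decompress(blob) == row
    print(len(row), "->", len(blob))

For truly tiny values you'd also strip the frame header/checksum fields the format writes by default, since every byte counts at that scale.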
ianburrell 19 days ago [-]
Postgres supports TOAST (long record) compression. It seems to support enabling it on columns. It looks like it supports LZ4 and Zstd now. Zstd has better compression at the expense of more time.
Too 18 days ago [-]
Two similar approaches, using the fact that large portions of the text in a record are static:
Using CLP: https://www.uber.com/blog/reducing-logging-cost-by-two-order...
https://messagetemplates.org/
Rewrite the log to extract and group by common identifiers: https://bit.kevinslin.com/p/lossless-log-aggregation
TOAST compression is likely your best option for that data. You may need to lower the data size threshold for toast for that column.
brody_hamer 19 days ago [-]
I haven’t played around with it too much myself, but I remember reading that gzip (or at least Python’s compatible zlib library) supports a “seed dictionary” of expected fragments.
I gather that you’d supply the same “seed” during both compression and decompression, and this would reduce the amount of information embedded into the compressed result.
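That's exposed as the zdict parameter in Python's zlib module; a minimal sketch with an invented seed and message:

    import zlib

    # A shared "seed" of expected fragments that both sides agree on ahead of time.
    seed = b'{"status":"ok","user":{"id":,"plan":"free"}}'
    msg = b'{"status":"ok","user":{"id":4242,"plan":"free"}}'

    co = zlib.compressobj(level=9, zdict=seed)
    packed = co.compress(msg) + co.flush()

    do = zlib.decompressobj(zdict=seed)        # decompression needs the exact same seed
    assert do.decompress(packed) + do.flush() == msg
    print(len(msg), "->", len(packed))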
duskwuff 19 days ago [-]
Many other compression libraries, like zstd, support functionality along those lines. For that matter, brotli's big party trick is having a built-in dictionary, tuned for web content.
It's easy to implement in LZ-style compressors - it amounts to injecting the dictionary as context, as if it had been previously output by the decompressor. (There's a striking parallel to how LLM prompting works.)
tdiff 19 days ago [-]
How is an LLM here better than Markov chains created from a corpus of English text? I guess a similar idea must have been explored a million times in traditional compression studies.
max_ 19 days ago [-]
Does this guy (Fabrice Bellard) have a podcast interview anyone would recommend?
silisili 19 days ago [-]
AFAIK, he doesn't do videos or interviews. His web presence is pretty sparse. I remember trying to dig something up last year and coming up blank. Totally respect that, but a bummer for folks hoping to get a peek inside his mind.
If nothing else, I hope he finds time to write his thoughts into a book at some point.
usr1106 19 days ago [-]
He seems to spend all his time writing truly amazing software.
mNovak 19 days ago [-]
I recall someone using one of the image generation models for pretty impressive (lossy) compression as well -- I wonder if AI data compression/inflation will be a viable concept in the future; the cost of inference right now is high, but it feels similar to the way cryptographic functions were more expensive before they got universal hardware acceleration.
hangonhn 19 days ago [-]
At a startup where I worked many years ago, they trained a model to take the image and screen size as the input, and it would output the JPG compression level to use so that the image appears the same to people. It worked so exceedingly well that a major software company offered to acquire the startup just for that. Alas, the founders were too ambitious/greedy and said no. It all burned down.
kevmo314 19 days ago [-]
That seems like a fun project to replicate independently. You didn't want to rebuild it?
stabbles 19 days ago [-]
It's a bit confusing to show the output as multibyte utf-8 characters and compare that to a base64 string
Retr0id 19 days ago [-]
The comparison example uses base64 too
stabbles 19 days ago [-]
Ah, my mistake. I thought that was meant to show a dictionary and brotli encoded string separately.
slater 19 days ago [-]
i always wondered if e.g. telcos had special short codes for stuff people often send, like at xmas many people write "merry christmas" in an SMS, and the telco just sends out "[code:mx]" to all recipient phones, to save on bandwidth and disk space?
qingcharles 19 days ago [-]
No, the systems are not that sophisticated, from having worked on them in the past.
I wonder if this is at all similar to what Apple uses for their satellite iMessage/SMS service, as that's a domain where it's probably worth spending significant compute on both sides to shave off even a single byte to transmit.
Retr0id 19 days ago [-]
What's the throughput like, for both compression and decompression?
crazygringo 19 days ago [-]
What is this encoding scheme that produces Chinese characters from binary data? E.g. from the first example:
> 뮭䅰㼦覞㻪紹陠聚牊
I've never seen that before. The base64 below it, in contrast, is quite familiar.
lxgr 19 days ago [-]
There's a family of encodings optimized for fitting the most information possible into a Unicode string of a given length, e.g. for gimmicks like fitting the most possible binary data into tweets. For example: https://github.com/qntm/base65536
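A toy Python sketch of the idea (definitely not Bellard's actual output encoding): pack one byte per codepoint by offsetting into the CJK block, so the visible string is shorter than the ~4/3-length base64 form:

    # One byte per character; base65536 and friends go further and pack 16 bits
    # per character by spanning many Unicode blocks.
    def to_cjk(data: bytes) -> str:
        return "".join(chr(0x4E00 + b) for b in data)

    def from_cjk(text: str) -> bytes:
        return bytes(ord(c) - 0x4E00 for c in text)

    blob = bytes([0x12, 0xAB, 0xFF])
    s = to_cjk(blob)
    print(s, from_cjk(s) == blob)   # three CJK-looking characters, round-trips losslessly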
For short messages in the mobile phone (i.e. GSM/3GPP) sense, which was my first association for "short message compression", I doubt that it works better than just sending binary messages with the appropriate header, but if that's not an option, it might just beat a custom alphabet based on the 7-bit GSM charset [1] (since that allows 100% of possible 7-bit characters to be used, whereas UTF-16 probably has at least some reserved codepoints that might be causing problems).
[1] https://en.wikipedia.org/wiki/GSM_03.38
The practical use for this could be satellite messaging (e.g. InReach) where a message is limited to ~160 characters, and costs about a dollar per message.
the5avage 18 days ago [-]
Is there a paper explaining it in more detail? I also saw on his website he has a similar algorithm for audio compression...
deadbabe 19 days ago [-]
Could this become an attack vector somehow? The greatest minds could probably find a way to get a malicious payload decompressed into the output.
Retr0id 19 days ago [-]
It's lossless, at worst you'd make the compression ratio worse for certain inputs.
deadbabe 19 days ago [-]
With LLM based compression, could we get something like the opposite of lossless, like hallucinatory? All the original content, plus more?
ychen306 19 days ago [-]
How this works is the LLM predicts the probability of the next token and then an arithmetic coder turns that probability distribution into bits. So it will never hallucinate. In the worst case, when the LLM makes an outrageous prediction, you just use more bits, but it doesn't affect correctness.
Retr0id 19 days ago [-]
Not if the compression scheme is lossless, which it is here, per my previous comment.
tshaddox 19 days ago [-]
Presuming the software is implemented correctly, that can't happen (per the definition of "lossless"). I can imagine this happening with a careless implementation, e.g. if circumstances conspire to allow a slightly different version or configuration of the LLM to be used across compression and decompression.
semiquaver 19 days ago [-]
LLMs are deterministic at zero temperature.
gloflo 18 days ago [-]
Deterministically lossy/hallucinatory.
unit149 19 days ago [-]
[dead]
gcr 18 days ago [-]
Decoding random gibberish into semantically meaningful sentences is fascinating.
It's really fun to see what happens when you feed the model keysmash! Each part of the input space seems highly semantically meaningful.
Here's a few decompressions of short strings (in base64):
$ ./ts_sms.exe d -F base64 sAbC
Functional improvements of the wva
$ ./ts_sms.exe d -F base64 aBcDefGh
In the Case of Detained Van Vliet {#
$ ./ts_sms.exe d -F base64 yolo9000
Give the best tendering
$ ./ts_sms.exe d -F base64 elonMuskSuckss=
As a result, there are safety mandates on radium-based medical devices
$ ./ts_sms.exe d -F base64 trump4Prezident=
Order Fostering Actions Supported in May
In our yellow
$ ./ts_sms.exe d -F base64 harris4Prezident=
Colleges Beto O'Rourke voted with Cher ¡La
$ ./ts_sms.exe d -F base64 obama4Prezident=
2018 AFC Champions League activity televised live on Telegram:
$ ./ts_sms.exe d -F base64 hunter2=
All contact and birthday parties
$ ./ts_sms.exe d -F base64 'correctHorseBatteryStaples='
---
author:
- Stefano Vezzalini
- Paolo Di Rio
- Petros Maev
- Chris Copi
- Andreas Smit
bibliography:
$ ./ts_sms.exe d -F base64 'https//news/ycombinator/com/item/id/42517035'
Allergen-specific Tregs or Treg used in cancer immunotherapy.
Tregs are a critical feature of immunotherapies for cancer. Our previous
studies indicated a role of Tregs in multiple
cancers such as breast, liver, prostate, lung, renal and pancreatitis. Ten years ago, most clinical studies were positive, and zero percent response rates
$ ./ts_sms.exe d -F base64 'helloWorld='
US Internal Revenue Service (IRS) seized $1.6 billion worth of bitcoin and
In terms of compression, set phrases are pretty short:
$ ./ts_sms.exe c -F base64 'I love you'
G5eY
$ ./ts_sms.exe c -F base64 'Happy Birthday'
6C+g
Common mutations lead to much shorter output than uncommon mutations / typos, as expected:
$ ./ts_sms.exe c -F base64 'one in the hand is worth two in the bush'
Y+ox+lmtc++G
$ ./ts_sms.exe c -F base64 'One in the hand is worth two in the bush'
kC4Y5cUJgL3s
$ ./ts_sms.exe c -F base64 'One in the hand is worth two in the bush.'
kC4Y5cUJgL3b
$ ./ts_sms.exe c -F base64 'One in the hand .is worth two in the bush.'
kC4Y5c+urSDmrod4
Note that the correct version of this idiom is a couple bits shorter:
$ ./ts_sms.exe c -F base64 'A bird in the hand is worth two in the bush.'
ERdNZC0WYw==
Slight corruptions at different points lead to wildly different (but meaningful) output:
$ ./ts_sms.exe d -F base64 FRdNZC0WYw==
Dionis Ellison
Dionis Ellison is an American film director,
$ ./ts_sms.exe d -F base64 ERcNZC0WYw==
A preliminary assessment of an endodontic periapical fluor
$ ./ts_sms.exe d -F base64 ERdNYC0WYw==
A bird in the hand and love of the divine
$ ./ts_sms.exe d -F base64 ERdNZC1WYw==
A bird in the hand is worth thinking about
$ ./ts_sms.exe d -F base64 ERdNZD0WYw==
A bird in the hand is nearly as big as the human body
$ ./ts_sms.exe d -F base64 ERdNZC0wYw==
A bird in the hand is worth something!
Friday
$ ./ts_sms.exe d -F base64 ERdNZC0XYw==
A bird in the hand is worth two studies
userbinator 18 days ago [-]
$ ./ts_sms.exe d -F base64 elonMuskSuckss=
As a result, there are safety mandates on radium-based medical devices
LOL!
That "decompression" is reminiscent of Terry Davis of TempleOS fame, who had written a random sentence generator that he interpreted as "speaking to God".
yalok 19 days ago [-]
What’s the size of the model used here?
GaggiX 19 days ago [-]
The model used is RWKV 169M v4.
jonplackett 19 days ago [-]
Would this also work for video encoding using something like Sora?
Get Sora to guess the next frame and then correct any parts that are wrong?
I mean, it would be an absolutely insane waste of power, but maybe one day it’ll make sense!
MPSimmons 18 days ago [-]
Cool, now all I need is the tiny encoded message and a 7+B weight model, plus some electricity.
This is more like a book cipher than a compression algorithm.
mlok 19 days ago [-]
LLMs, and now this, make me think of the (non-existent) "Sloot Digital Coding System" that could be viewed as a form of "compression".
https://en.m.wikipedia.org/wiki/Sloot_Digital_Coding_System
I view it as a form of fraud. There's no way that worked or could have worked.
dekhn 19 days ago [-]
Perhaps not literally, but you can easily imagine training an embedding on a large amount of existing video, and then delivering somebody "the point in space that decodes to the video with the least residual compared to the original".
Conceptually, most modern movies are just linear combinations of basis tropes (tvtropes.org).
mlok 19 days ago [-]
Yes that is why I specified it was non-existent. But the idea behind it is in the same vein somehow. Maybe what Sloot envisioned was something similar to LLMs.
perching_aix 19 days ago [-]
I thought I had come up with something with a similar performance once. Then a couple hours later I realized that I just (still) suck at combinatorics :)
RandomThoughts3 19 days ago [-]
It’s a very clever idea.
I could see it becoming very useful if on device LLM becomes a thing. That might allow storing a lot of original sources for not much additional data. We might be able to get an on device chat bot sending you to a copy of Wikipedia/reference material all stored on device and working fully offline.
zamadatix 19 days ago [-]
If mobile phone conversations over the last 2 decades have taught me anything it's that people talk about anything but battery life and ultimately the crowd ends up doing "whatever means I don't have to put it on the charger twice a day". Especially when the base iPhone SE already has enough storage to fit more text than one could read in their life anyways.
lxgr 19 days ago [-]
If you like that idea, give Kiwix a try! Best ~60 GB I have stored on my phone :) And it comes in handy more often than initially expected.