i think it's cool that both RNN and LSTM (with xLSTM) now have modern attention-inspired variants that solve the previous issues. I wonder 1) if it's possible to overcome the "hardware lottery" that transformers have now won, and 2) if recurrent/selective state can do the kind of proper lookback on extremely long context that we will want it to do to compete with full attention (easy to say no, harder to propose what to do about it).
there's also Liquid AI, whatever it is that they do.
HarHarVeryFunny 17 days ago [-]
The Transformer was specifically conceived to take advantage of pre-existing massively parallel hardware, so it's a bit backwards to say it "won the hardware lottery". Where the Transformer did "win the lottery" is that the key-value form of self-attention (invented by Noam Shazeer) needed to make parallel processing work seems to have accidentally unlocked capabilities like "induction heads" that make this type of architecture extremely well suited to language prediction.
Given limits on clock speed, massive parallelism is always going to be the way to approach brain-like levels of parallel computation, so any model architecture aspiring to human level AGI needs to be able to take advantage of that.
swyx 17 days ago [-]
you are correct of course but i meant hardware lottery in the sense of dedicated silicon companies like Etched and MatX that have now emerged to make chips that only run transformers (not exactly true for matx but hey i am simplifying. would be cool if matx ran other archs but it's not a priority)
shawntan 17 days ago [-]
Although marketed as such, RWKV isn't really an RNN.
In the recent RWKV7 incarnation, you could argue it's a type of Linear RNN, but past versions took their previous state from a lower layer. That allows for parallelism, but makes them closer to a convolution than a recurrent computation.
As for 1), I'd like to believe so, but it's hard to get people away from the addictive drug that is the easily parallelised transformer. As for 2), (actual) RNNs and attention mechanisms seem fairly powerful to me (expressivity-wise) and are perhaps the most acceptable to the community.
bravura 17 days ago [-]
Recent work by Feng et al. from Bengio's lab focuses on how attention can be formulated as an RNN ("Attention as RNN": https://arxiv.org/pdf/2405.13956) and how minimal versions of GRUs and LSTMs can be trained in parallel by removing some parameters ("Were RNNs All We Needed?": https://arxiv.org/pdf/2410.01201).
It's possible we'll start seeing more blended versions of RNN/attention architectures exploring different LLM properties.
In particular, the Aaren architecture in the former paper "can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs)."
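A minimal sketch of the "attention as a recurrence" idea, not the paper's code: for a single fixed query, softmax attention can be accumulated one (key, value) pair at a time while keeping only a running max, a running denominator, and a running numerator. The function names are made up for illustration, and the batch version is included only to check that the two agree.

```python
import math

def attention_recurrent(query, keys, values):
    """Softmax attention for one query, consuming (key, value) pairs one by one.

    Only three accumulators are kept, so memory is constant in sequence length.
    """
    m = -math.inf                    # running max score (numerical stability)
    denom = 0.0                      # running sum of exp(score - m)
    numer = [0.0] * len(values[0])   # running sum of exp(score - m) * value
    for k, v in zip(keys, values):
        s = sum(qi * ki for qi, ki in zip(query, k))
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # rescales old sums; exp(-inf) is 0.0
        w = math.exp(s - m_new)
        denom = denom * scale + w
        numer = [n * scale + w * vi for n, vi in zip(numer, v)]
        m = m_new
    return [n / denom for n in numer]

def attention_batch(query, keys, values):
    """Reference: the usual all-at-once softmax attention for one query."""
    scores = [sum(qi * ki for qi, ki in zip(query, k)) for k in keys]
    mx = max(scores)
    ws = [math.exp(s - mx) for s in scores]
    z = sum(ws)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(ws, values)) / z for j in range(dim)]
```

Both functions return the same vector up to floating-point error. The recurrent one never stores past keys or values, which is the "constant memory for inference" property quoted above.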
shawntan 17 days ago [-]
The formulations in attention as rnn have similar issues as rwkv. Fundamentally it's a question of what we call an RNN.
Personally I think it's important not to call some of these recent architectures RNNs because they have theoretical properties that do not match (read: they're worse) what we've "classically" called RNNs.
As a rule of thumb: you generally don't get parallelism for free, you pay for it with poorer expressivity.
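One way to see the trade-off, as a toy sketch with made-up function names: a linear recurrence h_t = a_t * h_{t-1} + b_t can be parallelised because composing two affine steps gives another affine step, so a prefix scan applies. A classic nonlinear update such as h_t = tanh(a_t * h_{t-1} + b_t) has no such fixed-size composition, which is roughly where the expressivity gets paid for.

```python
def combine(f, g):
    """Compose two affine steps (apply f first, then g). (a, b) means h -> a*h + b."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def sequential(steps, h0=0.0):
    """Plain recurrence: one step after another, inherently serial."""
    out, h = [], h0
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out

def prefix_scan(steps):
    """Inclusive scan by recursive doubling (Hillis-Steele style).

    Within each round, every combine is independent of the others, so a round
    could run in parallel; there are about log2(len(steps)) rounds.
    """
    acc = list(steps)
    shift = 1
    while shift < len(acc):
        acc = [acc[i] if i < shift else combine(acc[i - shift], acc[i])
               for i in range(len(acc))]
        shift *= 2
    return acc

def scanned(steps, h0=0.0):
    """Same hidden states as sequential(), recovered from the scanned maps."""
    return [a * h0 + b for a, b in prefix_scan(steps)]
```

`sequential(steps, h0)` and `scanned(steps, h0)` agree up to floating-point error for any list of `(a, b)` pairs.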
inciampati 18 days ago [-]
the recurrent model needs a mechanism to replay past context. no need to go quadratic to access all of it. they could replay multiple times to get effects similar to attention.
the hardware lottery, well... imo it's really about leveraging fully parallel training to learn how to use a memory. attention is quadratic but it can be computed in parallel. it's an end to end learned memory. getting that kind of pattern into RNNs won't be easy but it's going to be crucial before we boil the ocean.
pico_creator 18 days ago [-]
RWKV already solves the parallel compute problem for GPUs, based on the changes it has made - so it is a recurrent model that can scale to thousands++ of GPUs, no issue.
Arguably the same goes for other recurrent architectures (State Space, etc.), with very different design implementations. The issue with the old recurrent designs was just the way the LSTM was designed, not the recurrent nature.
intalentive 18 days ago [-]
Idea for killer app for recurrent models: low latency, low memory LLM / TTS coupling. Start decoding / generating speech as soon as new tokens are generated. When the LLM is cranking out token t, the TTS is already working on token t-1. It doesn’t have to wait. Then when the LLM is finished, the TTS is nearly finished too. With the two models colocated, you also save a network call.
Recurrent models with constant hidden state are naturally suited to streaming data, potentially opening the door to unexplored new use cases.
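A rough sketch of the dataflow being described, using Python generators. `fake_llm` and `fake_tts` are hypothetical stand-ins, not real model APIs, and this version only interleaves the two stages on one thread; real overlap (TTS on token t-1 while the LLM computes token t) would need threads, processes, or async I/O.

```python
def fake_llm(prompt, log):
    """Stand-in for a streaming LLM: yields one token at a time."""
    for tok in prompt.split():
        log.append(("llm", tok))
        yield tok

def fake_tts(tokens, log):
    """Stand-in for a streaming TTS: emits a chunk as soon as a token arrives."""
    for tok in tokens:
        log.append(("tts", tok))
        yield f"<audio:{tok}>"

def stream_speech(prompt):
    """Chain the two stages; the log records the order in which work happened."""
    log = []
    chunks = list(fake_tts(fake_llm(prompt, log), log))
    return chunks, log
```

The log alternates between `llm` and `tts` entries, showing that the second stage starts on each token before the first stage has produced the next one. Chaining more stages is just more nesting of the same pattern.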
computerex 18 days ago [-]
New multimodal models take raw speech input and provide raw speech output, no tts in the middle.
Seems like the future - so much meaning and context is lost otherwise.
intalentive 18 days ago [-]
Very cool. Logical next step. Would be interested to know what the dataset looks like.
moffkalast 18 days ago [-]
Youtube. Youtube is the dataset.
pico_creator 18 days ago [-]
This is actually the hypothesis for cartesia (state space team), and hence their deep focus on voice models specifically: taking full advantage of recurrent models' constant-time compute, for low latencies.
The RWKV team's focus, however, is still first on the multi-lingual text space, then the multi-modal space in the future.
On a side note, and that's what led me to the link above, I wonder if it would be possible to chain N streaming LLMs in an agent workflow and get a final output stream almost instantaneously without waiting for the first N-1 LLMs to complete their replies.
yshui 18 days ago [-]
Any autoregressive model can do what you are describing. transformers are generating one token at a time too, not all at once.
intalentive 17 days ago [-]
True but memory requirements grow with sequence length. For recurrent models the memory requirement is constant. This is why I qualified with "low memory".
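A back-of-the-envelope sketch of that difference. The model shape (32 layers, 32 heads, head size 128, 2-byte elements) and the matrix-shaped recurrent state are assumptions chosen for illustration, not the dimensions of any specific model.

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2):
    """Transformer KV cache: keys and values for every past token in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

def recurrent_state_bytes(seq_len, layers=32, heads=32, head_dim=128, bytes_per_elem=2):
    """Assumed recurrent state: one head_dim x head_dim matrix per head per layer.

    seq_len is accepted but deliberately unused: the state does not grow with it.
    """
    return layers * heads * head_dim * head_dim * bytes_per_elem

MIB = 1024 * 1024
for n in (1_000, 100_000):
    print(n, kv_cache_bytes(n) / MIB, recurrent_state_bytes(n) / MIB)
```

Under these assumptions the KV cache costs 0.5 MiB per token and keeps growing, while the recurrent state stays at 32 MiB regardless of context length.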
whimsicalism 17 days ago [-]
yes but transformers are much slower than state space models
pico_creator 18 days ago [-]
Hey there, I'm Eugene / PicoCreator - co-leading the RWKV project - feel free to AMA =)
Ey7NFZ3P0nzAe 18 days ago [-]
I noticed the lack of support from ollama and llama.cpp for RWKV. As those are (to my eyes) very strong drivers of experimentation (i.e. supporting them means vastly more outreach), I was wondering whether you were considering taking this into your own hands by contributing code to them? Or is the fact that you are not (AFAIK) doing it down to a lack of bandwidth in terms of manpower, or some other reason?
nickpsecurity 17 days ago [-]
It’s really interesting work. I’m glad you’ve kept at it. I’d like to ask you about two issues.
I keep seeing papers like “Repeat After Me” claiming serious weaknesses of state space vs transformer models. What are the current weaknesses of RWKV vs transformers? Have you mitigated them? If so, how?
The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, etc means I can’t train models with most data legally. Pre-1920’s works in Project Gutenberg are totally public domain. Both the model and the training data would be 100% legal for reproducible research. Would your team be willing to train a 3B-7B model on only Gutenberg and release it to the public domain?
(Note: The Stack without GitHub Issues can be used for permissive code. However, there could be contamination issues like incorrect licenses, PII, etc. So, maybe at least one, 100% legal model. Maybe a second with Gutenberg and The Stack for coding research.)
> The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, etc means I can’t train models with most data legally.
That really depends on whether LLM pretraining ends up held as an infringing use. (Of course, it’ll take a while for the cases to work through the courts and for a body of jurisprudence to be developed on this subject.)
nickpsecurity 17 days ago [-]
There’s two legal issues: sharing copyrighted data; training on it. It’s the latter that’s ambiguous. My problem is the former.
Making copies of and sharing copyrighted works without the authors’ permission is already illegal, as proven in countless file-sharing cases. The AI trainers do this with data sets like Common Crawl, The Pile, and RefinedWeb. Just sharing them is illegal for most of the content in them.
I got ideas for how to deal with that in countries with TDM exceptions, like Singapore. For now, the only things we can share with others for model training are (a) public domain works and (b) content licensed for permissive use and sharing. Gutenberg entries before a certain year should be pretty risk-free.
Ey7NFZ3P0nzAe 18 days ago [-]
Has there been progress towards making RWKV multimodal? Can we use projector layers to send images to RWKV?
It's the same principle as open transformer models, where an adapter is used to generate the embedding.
However, the core team's focus is currently on scaling the core text model, as this would be the key performance driver, before adapting it to multi-modal.
The tech is there, the base model needs to be better.
Ey7NFZ3P0nzAe 18 days ago [-]
I'm quite interested in repeng [0] (representation engineering) for steerability of (so far transformer-based) LLMs and was wondering if anyone had tried such methods on rwkv (or mamba for that matter). Maybe there is some low-hanging fruit there.
One of the interesting "new directions" for RWKV and Mamba (or any recurrent model) is the monitoring and manipulation of the state in between tokens. For steerability, alignment, etc =)
Not saying it's a good or bad idea, but pointing out that having a fixed state in between has interesting applications in this space
low_tech_punk 18 days ago [-]
Thanks! The 0.1B version looks perfect for embedded systems. What is the key benefit of an attention-free architecture?
pico_creator 18 days ago [-]
Lower compute cost, especially over longer sequence lengths. Depending on context length, it's 10x, 100x, or even 1000x+ cheaper. (quadratic vs linear cost difference)
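A toy cost model for the quadratic vs linear point, counting only token-mixing multiply-adds in one layer. The formulas, the matrix-state recurrent update, and the chosen sizes are simplifying assumptions; the MLP and projection costs, which are linear in both architectures, are ignored, so end-to-end ratios will be smaller than these.

```python
def attention_mixing_madds(T, d_model):
    """Causal attention: token t scores t keys and sums t values (d_model each).

    Closed form of sum(2 * t * d_model for t in 1..T).
    """
    return d_model * T * (T + 1)

def recurrent_mixing_madds(T, d_model, head_dim):
    """Assumed linear recurrence: per token, update and read a
    head_dim x head_dim state in each of d_model // head_dim heads."""
    heads = d_model // head_dim
    return T * 2 * heads * head_dim * head_dim

def ratio(T, d_model=4096, head_dim=64):
    """How many times more mixing work attention does at context length T."""
    return attention_mixing_madds(T, d_model) / recurrent_mixing_madds(T, d_model, head_dim)

for T in (1_279, 12_799, 127_999):
    print(T, ratio(T))
```

In this model the ratio is (T + 1) / (2 * head_dim), so it grows linearly with context length: each 10x in context is roughly another 10x in relative cost.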
bratao 18 days ago [-]
What would be the most performant way to run inference using RWKV?
Do you have any speed comparison to a similar-sized transformer?
I have a task (OCR cleaning) for which I'm evaluating faster options, and it looks like RWKV would be a nice alternative.
littlestymaar 18 days ago [-]
Have there been any plans to build a “reasoning” llm using RWKV? With the increase in inference token count caused by such methods, the much lower footprint of a recurrent architecture could really make a difference for such a use-case.
theLiminator 18 days ago [-]
Do you have an in depth comparison between RWKV and models like mamba or s4?
ps Eugene you should brag about that on the homepage of RWKV.
smusamashah 18 days ago [-]
How does it compare with other LLMs in terms of performance? Is it near GPT-3 or Llama or what?
Fischgericht 17 days ago [-]
"RWKV (pronounced RwaKuv)" - love it. How does the crow go? Rwa! Rwa! Rwa!
bbor 17 days ago [-]
Thank god I’m not the only one stunned by that. I don’t need IPA, but this isn’t even vaguely pronounceable!
upghost 18 days ago [-]
Seems really cool. Does anyone have any sample code to link to? Do RNN models use the same pytorch/hugging face Python stuff or is it completely different...?
bigattichouse 16 days ago [-]
I've spent an afternoon attempting to compile and run RWKV7 locally.. and I just don't get it. lotta errors in compiling... and it's a lot. Like a lot, a lot... it's walls of versions and sub projects.
Any kind of quickstart guide?
Also found/tried rwkv.cpp, and I can't seem to compile that either.
nullc 18 days ago [-]
Anyone ever look at doing a MoE like composition with RWKV and a transformer?
pico_creator 18 days ago [-]
Not an MoE, but we have already done hybrid models, and found them to be highly performant (as per the training budget)
This is a full drop-in replacement for any transformer model use cases on model sizes 32B and under, as it has equal performance to existing open 32B models in most benchmarks
We are working on a 70B, which will be a full drop-in replacement for most text use cases
lostmsu 18 days ago [-]
Why aren't you on lmarena (former chatbot arena) leaderboard?
pico_creator 18 days ago [-]
kinda on a todo list, the model is open source on HF for anyone who is willing to make it work with lmarena
swyx 18 days ago [-]
how about finetuning your 32B to be R1QWQKV?
pico_creator 18 days ago [-]
There is currently a lack of "O1 style" reasoning datasets in the open source space. QWQ did not release their dataset. So that would take some time for the community to prepare.
It's definitely something we are tracking to do as well =)
2023: https://latent.space/p/rwkv
2024: https://www.youtube.com/watch?v=LPe6iC73lrc <- offers a bit of professional compare and contrast vs state space models.
Ref: https://arxiv.org/abs/2404.08819
it's one of those retrospectively obvious/genius insights that i wish i understood when i first met him
Example use of Gutenberg:
https://www.tensorflow.org/datasets/catalog/pg19
[0] https://github.com/vgel/repeng/issues
https://arxiv.org/abs/2407.12077