> In summary, we used a 2.4-million-parameter small language model to achieve accuracy comparable to a 100-million-parameter BERT model in text classification.
Neat, but the question will be how the scaling laws hold up
PaulHoule 13 days ago [-]
Doesn't have to.
I use models like the 100M parameter BERT model for text classification and they work great. I get a 0.78 AUC with one model; TikTok gets about 0.82 for a similar problem and I'm sure they spent at least 500x what I spent on mine. I could 10x my parameters and get a 0.79 AUC, but I don't know if I'd feel the difference. (I got about 0.71 AUC with bag of words + logistic regression and perceive a big difference between the output of the SBERT model and that.)
My current model can do a complete training cycle, which involves training about 20 models and picking the best, in about 3 minutes. The process is highly reliable and can run unattended every day; I could run it every hour if I wanted. I worked on another classifier based on fine-tuning a larger model, and it took about 30 minutes to train just one model and was not reliable at all.
If you can get 50x the speed of the BERT model with 1/50 the resources, that's a big boon that makes text classification more accessible. The only excuse people have now is that it is too hard to make a training set.
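For anyone who wants to see the shape of a "train about 20 models and pick the best" cycle, here is a minimal sketch. It is not the actual pipeline described above: the synthetic 2-D features stand in for sentence embeddings, the candidates are plain logistic regressions that differ only in L2 strength and shuffle seed, and every name (`make_data`, `train_logreg`, `pick_best`, the `l2_grid` values) is made up for illustration. Selection is by validation AUC, computed as the fraction of positive/negative pairs ranked correctly.

```python
import math
import random


def roc_auc(labels, scores):
    """AUC = P(random positive outscores random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_data(n, seed):
    """Synthetic stand-in for (embedding, label) pairs. Assumption, not real data."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label else -1.0
        X.append([rng.gauss(center, 1.5), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


def train_logreg(X, y, l2, epochs=20, lr=0.1, seed=0):
    """SGD logistic regression; returns weights with the bias as the last entry."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + w[-1]
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            grad = 1.0 / (1.0 + math.exp(-z)) - y[i]
            for j, xj in enumerate(X[i]):
                w[j] -= lr * (grad * xj + l2 * w[j])
            w[-1] -= lr * grad
    return w


def score(w, X):
    """Raw decision scores; AUC only needs the ranking, not probabilities."""
    return [sum(wj * xj for wj, xj in zip(w, x)) + w[-1] for x in X]


def pick_best(X_train, y_train, X_val, y_val, l2_grid):
    """Train one candidate per L2 value and keep the best validation AUC."""
    best = None
    for k, l2 in enumerate(l2_grid):
        w = train_logreg(X_train, y_train, l2, seed=k)
        auc = roc_auc(y_val, score(w, X_val))
        if best is None or auc > best[0]:
            best = (auc, l2, w)
    return best


X_train, y_train = make_data(200, seed=1)
X_val, y_val = make_data(200, seed=2)
l2_grid = [10 ** (-4 + 4 * k / 19) for k in range(20)]  # 20 candidates, 1e-4 to 1
best_auc, best_l2, best_w = pick_best(X_train, y_train, X_val, y_val, l2_grid)
print(f"best validation AUC {best_auc:.3f} at l2={best_l2:.4g}")
```

Because each candidate is cheap, the whole sweep can rerun unattended on a schedule, which is the property that gets lost once a single fine-tuning run takes half an hour.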
jerpint 13 days ago [-]
Somewhat agreed for use cases of text classification, but for anything requiring more language understanding it is a desirable property