The best build I have seen so far had 6x4090's. Video: https://www.youtube.com/watch?v=C548PLVwjHA
An interesting choice to go with 256GB of DDR5 ECC; if spending so much on the 6x4090's, might as well try to hit 1 TB of RAM as well.
The cost of this... not even sure. Astronomical.
manmal 22 days ago [-]
On Reddit there are reports of 8x4090, or even 8xH100. I don't know where people get this kind of money, and why they don't rent infra instead.
FridgeSeal 22 days ago [-]
Probably because they are after a lot of fast, local storage, and _that_ is where rented ML infra providers will sting you.
Edit: could also just be more-money-than-sense. Never discount stupidity.
dogma1138 21 days ago [-]
Hardware can be resold, and bought 2nd hand also.
The 4090 will likely maintain 50% of its current value over the next 12-18 months due to its memory capacity.
CapEx vs OpEx is a thing even if you are not a business…
KuriousCat 21 days ago [-]
Why do you think RoI is better when infra is rented?
taskforcegemini 20 days ago [-]
I mean, at some point someone has to buy them to be able to offer services on them to others.
Renting comes with certain limitations owners don't have.
And some people have too much money to not invest in fun.
belter 22 days ago [-]
Don't forget to talk to your local power company one year in advance. They will need to upgrade your local substation transformer... :-)
amluto 21 days ago [-]
This build is 3kVA max. That’s about 1/3 of a current gen EV, only 15% of an original Tesla Model S with dual chargers, and about equal to a standard American oven. This is much more polite to the grid than, say, a couple of tea kettles or especially a reasonably sized electric tankless water heater.
dkkergoog 21 days ago [-]
[dead]
keyle 22 days ago [-]
This article was written or rewritten via your model, right?
The last paragraphs feel totally like AI.
Anyway, I'd like a follow-up on the curating, cleaning and training part, which is far more interesting than how to select hardware, which we've been doing for over 25 years.
red2awn 22 days ago [-]
> Architecture Advantages: Enhanced ray tracing, Shader Execution Reordering, and DLSS 3 technology for improved efficiency.
This jumps right out as written by AI, these features have nothing to do with training LLMs.
sabareesh 22 days ago [-]
Yes it is, thanks for the feedback. I will soon add it to GitHub.
_just7_ 22 days ago [-]
I would be much more interested in a piece on what you can train with this kind of rig, rather than the rig itself.
minimaxir 22 days ago [-]
The bottleneck for most model training sizes is VRAM, and since each 4090 has 24 GB VRAM, that's 96 GB VRAM total. The article mentions that it can train LLMs from scratch up to 1 billion hyperparameters, which tracks.
Nowadays that's not a lot: a single H100 that you can now rent has 80 GB VRAM, and doesn't have the technical overhead of handling work across GPUs.
tmostak 22 days ago [-]
You should be able to train/full-fine-tune (i.e. full weight updates, not LoRA) a much larger model with 96GB of VRAM. I generally have been able to do a full fine-tune (which is equivalent to training a model from scratch) of 34B parameter models at full bf16 using 8XA100 servers (640GB of VRAM) if I enable gradient checkpointing, meaning a 96GB VRAM box should be able to handle models of up to 5B parameters. Of course if you use LoRA, you should be able to go much larger than this, depending on your rank.
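As a rough back-of-envelope version of that sizing (assuming bf16 weights and grads plus fp32 Adam moments, ~12 bytes/param, and a modest activation overhead with gradient checkpointing; actual usage depends heavily on sequence length, batch size and sharding):

    # Rough sketch, not a benchmark: VRAM for a full bf16 train with Adam.
    # Assumes 2 B (weights) + 2 B (grads) + 8 B (fp32 Adam m/v) per parameter,
    # plus a fudge factor for activations with gradient checkpointing enabled.
    def estimate_train_vram_gb(params_billion, activation_overhead=1.3):
        bytes_per_param = 2 + 2 + 8
        return params_billion * bytes_per_param * activation_overhead

    print(estimate_train_vram_gb(5))    # ~78 GB  -> plausible on the 96 GB 4x4090 box
    print(estimate_train_vram_gb(34))   # ~530 GB -> in line with the 640 GB 8xA100 figure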
sabareesh 22 days ago [-]
Definitely agree, but part of the reason why I built this was to learn about all the overhead and gotchas.
llm_nerd 22 days ago [-]
Is there a reason you used hyperparameters rather than parameters? I was going to politely correct the terminology, but you seem to have been in AI for some time, so either it was a mistype or I am misunderstanding what you are referencing.
didgeoridoo 22 days ago [-]
I imagine that when you get really deep into model training, it can seem like there are a billion hyperparameters you have to worry about.
minimaxir 22 days ago [-]
It's a force of habit, parameters would be more accurate (almost everyone uses them interchangeably nowadays)
21 days ago [-]
unixpickle 22 days ago [-]
Wait what? Who actually calls trainable params "hyperparameters"? Nobody at OpenAI does, as far as I know.
minimaxir 22 days ago [-]
People who are making quick social media posts while taking a casual walk outside on websites that don't make it easy to edit posts and are not expecting to be nitpicked about it.
Overall, it's something I've seen very often on social media and less technical articles about LLMs. OpenAI would fall into the "almost" category.
llm_nerd 21 days ago [-]
It's okay to say that you mistyped or whatever, while taking a casual walk outside on websites that don't make it easy to edit posts and are not expected to be nitpicked about it. Throwing in that everyone uses them interchangeably, however, is just profoundly wrong on every level.
I wasn't nitpicking. It is a HUGE differentiation, and I pointed it out specifically because people pick up on terminology so people who might not know better will go forward and just drop in the more super duper hyperparameter, not realizing that it makes them look like they don't know what they're talking about. As I said in the other post, no one who knows anything uses them interchangeably. It is just completely wrong.
minimaxir 21 days ago [-]
Again, I've heard and used the terminology "model hyperparameter" in place of "model parameter", and I've also heard "model parameter" in place of "model hyperparameter" because not every human interaction is a paper on arXiv and the terms are obviously very similar. The context of the term is what matters in the end (as demonstrated by other comments following my correct intent), and society will not crumble if using either term incorrectly in casual conversation. No one intentionally uses the wrong term, but as jokingly said in another comment "when you get really deep into model training, it can seem like there are a billion hyperparameters you have to worry about."
I appreciate being corrected, but you are the one who asked for my opinion based on my extensive time in AI, you can choose to believe it or not.
It's a rabbit hole I stay away from for pragmatic reasons.
Bancakes 22 days ago [-]
I doubt the VRAM simply adds up. I think that's a feature reserved for their NVLinked HPC series cards. In fact, without NVLink, I don't see how you'd connect them together to compute a single task in a performant and efficient way.
How long does training a 1B or 500M model take approximately on the 4-GPU setup? Or does that dramatically depend on the training data? I didn’t see that info on your pages.
sabareesh 22 days ago [-]
Roughly, it takes 7 days to train a 500M model on 100B tokens.
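A quick sanity check of that figure using the usual ~6 × params × tokens FLOPs approximation (the implied utilization is an inference, not a number from the article):

    # Back-of-envelope: what sustained throughput does "500M params, 100B tokens,
    # 7 days on 4 GPUs" imply? 6*N*D is the standard rough estimate of training FLOPs.
    params, tokens, gpus = 0.5e9, 100e9, 4
    total_flops = 6 * params * tokens                 # ~3e20 FLOPs
    seconds = 7 * 24 * 3600
    tflops_per_gpu = total_flops / seconds / gpus / 1e12
    print(f"~{tflops_per_gpu:.0f} sustained TFLOPs per 4090")  # ~125, plausible at bf16/fp8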
paxys 22 days ago [-]
And where you get the training data from.
sabareesh 22 days ago [-]
Start with FineWebEdu
sabareesh 22 days ago [-]
Hey HN, I am sharing my experience of how I pretrained my own LLM by building an ML rig at home.
senectus1 22 days ago [-]
this is a decent bird's-eye view, thanks. Could you expand on this to show how long it took to produce... what model you produced? What did you produce? What did you train for? The post seems to suggest it's for diffusion purposes?
On a tangent, if I wished to fine-tune one of those medium sized models like Gemma2 9B or Llama 3.2 Vision 11B, what kind of hardware would I need and how would I go about it?
I see a lot of guides but most focus on getting the toolchain up and running, and not much talk about what kind of dataset do I need to do a good fine tuning.
Any pointers appreciated.
pilooch 22 days ago [-]
I do this for many applications. 2 to 4 RTX A5000s do the job (LoRA finetune). As for the dataset, depending on your task, you need image/text pairs.
magicalhippo 22 days ago [-]
> As for dataset, depending on your task, you need image / text pairs.
I guess the main question is, do you just prepare training data as if you were training from scratch, or is there some particularities to finetuning that should be considered?
MuffinFlavored 21 days ago [-]
What would you expect from fine tuning? What would the input training material be, and what would the expected differences in output be?
magicalhippo 21 days ago [-]
In several cases I've been wanting better prompt adherence.
Llama 3.2 Vision is very strictly trained to output a summary at the end, which I find difficult to get it to stop doing, for example.
Another one is that when given a math problem and asked to generate some code that computes the result, most models output code fine but insist on doing the calculations themselves even if the prompt explicitly says they shouldn't. As expected, sometimes these intermediate calculations are incorrect, and hence I don't want the LLM to do that when the produced code would handle it perfectly. If the input prompt contains "four times five" I want the model to generate "4 * 5" rather than "20", consistently.
I've been curious to see if I could tune them to adhere better to the kind of prompts I would be giving.
For Llama 3.2 Vision I've also been curious whether I can get it to focus on different details when asked to describe certain images. In many cases it is great but sometimes misses some key aspects.
As for the input training material, that's what I'm trying to figure out. I feel a lot of the guides are like that "how to draw an owl" meme[1], leaving out some crucial aspects of the whole process. Obviously I need input prompts and expected answers, but how many, how much variation on each example, and do I need to include data it was already trained on to avoid overfitting or something like that? None of the guides I've found so far touch on these aspects.
[1]: https://knowyourmeme.com/memes/how-to-draw-an-owl
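As a purely illustrative sketch (file name, JSONL layout and field names are assumptions, not taken from any specific guide), a supervised fine-tuning set targeting the "emit the expression, not the result" behaviour described above might start out like this; real sets typically need hundreds to thousands of varied examples, often mixed with general instruction data to limit forgetting:

    import json

    # Hypothetical prompt/response pairs for the behaviour described above.
    examples = [
        {"prompt": "Write Python that computes four times five.",
         "response": "result = 4 * 5"},
        {"prompt": "Write Python that computes twelve plus seven, then doubles it.",
         "response": "result = (12 + 7) * 2"},
    ]
    with open("sft_math_style.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")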
nice writeup, but i feel that for most people, the software side of training models should be more interesting and accessible.
for one, "full" gpu utilization, one or many, remains an open topic in training workflows. spending efforts towards that, while renting from cloud, is a more accessible and fruitful to me than to finetune for marginal improvements.
this course was a nice source of inspiration - https://efficientml.ai/ - and i highly recommend looking into this to see what to do next with whatever hardware you have to work with.
KeplerBoy 22 days ago [-]
Let's talk riser cables. I keep encountering issues with riser connectors claiming to support PCIe 4.0, which seem to have sub-par performance. They work fine with the GPUs and NICs I tested them with, but attaching an NVMe drive causes all kinds of issues and prevents the machine from booting. I guess NVMe isn't as tolerant of elevated bit-error rates.
That just doesn't inspire a lot of confidence in those risers, so now I'm contemplating mcio risers.
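One cheap sanity check on Linux is to read the negotiated link out of sysfs; a riser that silently trained at a lower PCIe generation or width than expected will show up here (standard sysfs attributes, nothing riser-specific assumed):

    # Print the negotiated PCIe link speed/width for every device that reports one.
    import glob, pathlib

    for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
        p = pathlib.Path(dev)
        speed, width = p / "current_link_speed", p / "current_link_width"
        if speed.exists() and width.exists():
            print(p.name, speed.read_text().strip(), "x" + width.read_text().strip())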
Neywiny 22 days ago [-]
NVMe sits over PCIe. I'd be more inclined to believe they're playing games with their voltage levels to lower power consumption on mobile/embedded (not based on anything but I wouldn't be surprised). Or, if you're then going to an m.2 adapter, something with that.
I'd love to read something you wrote, not something you had an AI model write for you.
abc-1 22 days ago [-]
Fun for a wealthy hobbyist, but if you want to do real work, you’re better off renting from Runpod. Good blog though.
sabareesh 22 days ago [-]
One of the motivations is to do distillation, experimentation and research. But as you mentioned, there are better ways to do this.
bb88 22 days ago [-]
All you need is a 4x 4090 GPUs and a dedicated 30 amp circuit.
andrewmcwatters 22 days ago [-]
Why are people downvoting this? Yes, you really do need a dedicated circuit to run this type of machine. You will trip your circuit breaker if you don't have sufficient wattage on the line to run something rated for this power draw.
Commercial setups are not appropriate for typical 15 amp circuit loads.
andrewmcwatters 22 days ago [-]
Further, if you can afford to build this, you can afford to purchase at least the Romex, an AFCI circuit breaker, and raceway, and run it into whatever room in the house you plan on operating this in.
fzzzy 22 days ago [-]
You sure? In my experiments with multi gpu inference, I couldn't get anywhere close to max theoretical power draw.
bb88 22 days ago [-]
Yes!
His power supplies are 2x1500W. That puts it at 3 kW max, which is more than a 20A circuit can provide (2400W).
The standard outlet is typically rated at 15 amps or 1800W. And the 15A breaker is on one circuit. You can get 20A circuits but they need to be wired for it, and replacing the breaker won't cut it.
Assuming his GPU is ~450W (his number) and power supplies are 80% efficient, well that means he's pulling close to ~2400 watts which is super close to the limit of a 20A circuit.
4 * 450 / 0.80 efficiency = 2250W.
That doesn't include the power consumed by the CPU or mother board or other things on that circuit. But a 170W CPU would easily push this over 2400W provided by a 20A circuit.
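The same arithmetic as a quick sketch, with the 80% continuous-load rule of thumb added (the CPU wattage, PSU efficiency and 120V are assumptions carried over from above):

    gpus, gpu_w, cpu_w, psu_eff, volts = 4, 450, 170, 0.80, 120
    wall_w = (gpus * gpu_w + cpu_w) / psu_eff      # ~2460 W at the wall
    amps = wall_w / volts                          # ~20.5 A
    # NEC treats this as a continuous load: a 20 A breaker should only see ~16 A.
    print(round(wall_w), "W,", round(amps, 1), "A")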
leoc 22 days ago [-]
In the US. The UK or EU will do you 3000W out of a standard domestic socket.
taneq 22 days ago [-]
2.4kW yeah, 3kW technically needs a 15A socket.
bb88 22 days ago [-]
In the UK they get that by doubling the voltage. The current draw will still be similar to the US.
It's over current that causes fires.
bpye 22 days ago [-]
The _power_ draw will be similar, but a 13A 230V outlet can do 2990W, vs a 15A 110V outlet at 1650W.
bb88 22 days ago [-]
You proved my argument! Lol.
sabareesh 22 days ago [-]
Well, during training the GPUs were each consuming a max of ~450W.
fzzzy 22 days ago [-]
Thanks, good to know. Perhaps it is different for diffusion; with LLMs, layers are generally split across GPUs, meaning inference has to finish on one GPU before the activations can be passed across the layer split.
Y_Y 22 days ago [-]
That's only if your model is too big for a single GPU and you're not batching.
fzzzy 22 days ago [-]
Yes, that's what I was doing. Thanks for the info.
22 days ago [-]
halyconWays 22 days ago [-]
Why not 3090s? Same VRAM and cheaper. With both setups you'd be limited to 1B. By contrast, you can run 4-bit quants of Llama 70B on two {3,4}090s, and it's still pretty lobotomized by modern standards.
You can also train your own model even without GPUs. Just depends on parameter size.
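For the 70B-on-two-cards case, a minimal sketch with Hugging Face transformers + bitsandbytes (the model id is only an example of a 70B checkpoint, and device_map="auto" is what spreads the 4-bit layers across whatever GPUs are visible):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-3.1-70B-Instruct"   # example 70B checkpoint (gated)
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb,
        device_map="auto",   # shards quantized layers across the available GPUs
    )
    inputs = tok("Hello", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))

At 4-bit a 70B model is roughly 35-40 GB of weights before KV cache, which is why it takes two 24 GB cards rather than one.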
sabareesh 22 days ago [-]
It is the previous architecture and it doesn't support the newer versions of Flash Attention, FP8 training, etc.
jszymborski 22 days ago [-]
It is, however, like 3x cheaper.
halyconWays 21 days ago [-]
That's fair. I did run into that issue when trying to speed up Hunyuan
anonytrary 22 days ago [-]
Thanks for sharing. Have you prodded the model with various inputs and written an article that shows various output examples? I'd love to get an idea of what sort of "end product" 4x4090s is capable of producing.
sabareesh 22 days ago [-]
You might find more information here helpful https://sabareesh.com/posts/llm-intro/
But I am still in the process of evaluating the post-training process with RL. RLHF is almost a mirage that shows what is possible but not the full capability of what the model can do.
NKosmatos 22 days ago [-]
Wouldn’t a cluster of M4 minis cost less and provide more VRAM? There are posts about people getting decent performance for a lot less than 12k USD.
lostmsu 22 days ago [-]
If you want to wait for over a year to get your model trained (vs 7 days).
angoragoats 22 days ago [-]
If you are willing and able to put together the type of system described in the OP (a workstation-class PC, with multiple discrete GPUs and often multiple power supplies), a Mac never makes sense. There are hardware options available at essentially every price point that beat (in some cases drastically) the performance and memory capacity of a Mac.
And I say this at the risk of being called pedantic, but a cluster of Mac minis would have zero VRAM.
sabareesh 22 days ago [-]
You get more vram but not enough cores
whimsicalism 22 days ago [-]
no, these chips are optimized for inference not training & frankly cuda is still table stakes.
HN loves it some Apple
lostmsu 22 days ago [-]
They are not optimized for inference vs RTX GPUs.
jmward01 22 days ago [-]
You can get 4060 Ti 16GB cards for ~$450 or 4070 Ti 16GB for ~$850 instead of the $2.5k for a 4090. I wonder how well 4 of those cards would perform. The 4060 TDP is 165W instead of 450W for the 4090. The 4070 looks like the best tradeoff for cost/power/etc, though. You could probably set up an 8-card 4070 Ti 16GB system for less than the 4-card 4090 system.
magicalhippo 22 days ago [-]
The 4060 Ti is hampered by having a narrow memory bus. There are various benchmarks out there; here[1][2] are some examples, and here's[3] one which tests dual 4060 Ti's.
I’ve heard that people buy multiple 24GB P40’s for a bucket of dirt. But that was for inference, not sure about training.
g Tesla p40 llm reddit
[1]: https://www.pugetsystems.com/labs/articles/llm-inference-con... (8GB model tested but it has same bus width and overall bandwidth as 16GB model)
[2]: https://www.reddit.com/r/LocalLLaMA/comments/1b5uwr4/some_gr...
[3]: https://www.reddit.com/r/LocalLLaMA/comments/178gkr0/perform...
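The bus-width point in rough numbers (per-pin data rates below are from memory of the spec sheets, so worth double-checking):

    # Memory bandwidth = bus width (bits) / 8 * per-pin data rate (Gbps).
    def mem_bw_gb_s(bus_bits, gbps_per_pin):
        return bus_bits / 8 * gbps_per_pin

    print(mem_bw_gb_s(128, 18))   # 4060 Ti (128-bit GDDR6):   ~288 GB/s
    print(mem_bw_gb_s(384, 21))   # 4090    (384-bit GDDR6X): ~1008 GB/s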
sabareesh 22 days ago [-]
I was eyeing the 4060 before going with the 4090. But it boils down to CUDA cores and memory bandwidth.
jmward01 22 days ago [-]
The 4090's compute per watt is the best (on paper) among the 4060 Ti, 4070 Ti and 4090. Best bang for the $$, though, looks like the 4070 Ti 16GB. I've been eyeing that one for a new dual-card training rig.
AnarchismIsCool 22 days ago [-]
Couldn't you do better with 2x AGX Orin 64gb?
jsheard 22 days ago [-]
It's probably better to hold out for the 5090 at this point; it's coming very soon and is expected to have 32GB of VRAM.
paxys 22 days ago [-]
Coming soon maybe, but when will you actually be able to get your hands on one?
sabareesh 22 days ago [-]
Yeah depends on the price, definitely 24GB is limiting
nitred 19 days ago [-]
Can someone definitively say for sure that I can just use two independent PSUs? One for GPUs and one for GPUs and motherboard and SATA? No additional hardware?
Bancakes 22 days ago [-]
Anyone care to publish AMD training/inference benchmarks using ROCm? They’re hard to find.
sabareesh 22 days ago [-]
At this point it is still not worth considering AMD, but maybe this will change soon. I would look into the SemiAnalysis report.
mcdeltat 21 days ago [-]
Is anyone else concerned with the power usage of recent AI? Computational efficiency doesn't seem to be a strong point... And for what benefit? IMO the usefulness payoff is too low
JacksonDam 22 days ago [-]
Interesting that DLSS 3 is mentioned as an advantage?
Retr0id 22 days ago [-]
Because the article was clearly co-authored by AI
sabareesh 22 days ago [-]
It is co-authored by AI, but I left it in because it made some indirect sense. I clarified in the parent comment.
sabareesh 22 days ago [-]
I clarified a bit more in the article regarding this. But basically: "Well, this may not directly provide a benefit, but because this is a consumer-grade card, these features come alongside support for more advanced features such as bfloat16 and even float8 training, plus the sheer number of CUDA cores."
486sx33 22 days ago [-]
I’d love to hear the dev story of the H100; it seemed to come out of left field!
paxys 22 days ago [-]
Where exactly do you plug in this beast?
m463 21 days ago [-]
"This needs 30 AMP circuit..." lol
master_crab 22 days ago [-]
All you need is 4x 4090 GPUs to Train Your Own Model -- and $12000 to buy them
kristopolous 22 days ago [-]
The GPU rental market is fairly reasonable. There's lots of companies doing it. (I work at one of them). 4x 4090 can be fetched for around $0.40/hour on some platforms ... about $1.20 on others depending on how available you want it. Regardless, all in, you can do an average 10-or-so-day train for < $500.
If you want on-prem, wait a few months. The supply of 5000 series (probably announced at CES in a few days) should push more 4000 on the market and, maybe, for a bit, over-supply and push the price down.
Nvidia stopped manufacturing the 4000 a few months ago because they don't have endless factories. Those resources were reallocated to 5000 series and thus pushed the price for the 4000 up to the ridiculous place it is now (about $2,000 on ebay)
I think the current appetite for crypto and ai is big enough to consume all 4000 and 5000 series cards to a point of scarcity (even 3090s are still fetching about $1000) but there should be a window where things aren't crazy expensive coming up.
There's no evidence supply will continually outstrip demand unless something unusual happens.
22 days ago [-]
whimsicalism 22 days ago [-]
don't you need nvlink? feel like an 80gb a100 would start being worth it at a $1.20/4x 4090 price point
kristopolous 22 days ago [-]
Some suppliers have support for it, some don't. They either use Docker or KVM, and it depends on how clever their hosting software is. We can do it, but that's a recent thing. It's really hit or miss.
whimsicalism 21 days ago [-]
? sorry i really don't understand this reply... some suppliers have support for nvlink on 4090? i doubt that
yieldcrv 22 days ago [-]
How soon could I break even on renting my GPUs out?
We aim at $1200/y for a 3090, so around a year given decent electricity prices.
Highly recommend setting a lower power limit (usually 250W for 3090).
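For anyone wanting to try that, the limit can be set with nvidia-smi -pl 250 (as root), or via NVML; a small sketch with the pynvml bindings, where 250 W is just the figure suggested above:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    print("current limit:", pynvml.nvmlDeviceGetPowerManagementLimit(handle) // 1000, "W")
    # Setting the limit needs root; the value is in milliwatts.
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, 250_000)
    pynvml.nvmlShutdown()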
kristopolous 21 days ago [-]
Btw, for other people reading this, the main player in the "rentable gamer gpu" space is salad.com who 6 months ago cut a deal with civitai (https://blog.salad.com/civitai-salad/). They're trying to capture enterprise customers to use the extra cycles on teenager's gaming rigs.
The industry is full of effectively "imitation companies" right now. For instance, runpod, quickpod, simplepod and clore are the ones cloning us at vast right now.
We see them in our discord, they try to snipe away customers, get in our comment threads on reddit and twitter with self-promotes, clone our features ... these are the ferocious wild west days of this industry. I've even gotten personal emails from a few who I guess scanned their database looking for registration addresses from other companies in the space.
There are even companies like primeintellect which are trying to become the market of markets - but they have their own program - it's clearly a play to snipe other customers by funneling them through some interface where they'll eventually push out the other companies and promote their own instances.
Then there's interesting insider hype players with their own infra like sfcompute who are trying to pretend like they invented interruptible instances and somehow get a bunch of people treating them like they're innovators. The resellable contracts they talk about are a pretty common feature and especially from the host's programmatic command line controller, it's just usually tucked deep in the documentation. They're doing effectively a re-prioritization play.
I guess my angle is "highest integrity possible". It's certainly a gamble - scammy companies sometimes capture a market then become unscammy - I'll hold my tongue but there's plenty of examples.
It's interesting times.
lostmsu 18 days ago [-]
Wow, I question the ethical side of this comment. It starts praising a company as if it were an unrelated entity, then quietly switches to "us", then makes implications about competing entrepreneurial efforts being scams without any evidence. And "clones" (as if everyone knew about them - I didn't until about 1y into mine, for instance).
There's also the hypocrisy of complaining about competitors jumping in on "their threads" in a comment on a competitor thread.
Yes, this comment of yours is highly unethical.
yieldcrv 20 days ago [-]
yo, I don’t care bro
I guess what I’m missing is, what’s scammy about them?
even in the web3 space, AI gpu compute markets are oversaturated
but why is an end user supposed to care about the user acquisition strategy?
if they’re cheaper, more profitable for the gpu owner, or solving a need better, that’s all that matters
kristopolous 20 days ago [-]
> what’s scammy about them?
You can multi sell a machine, use qemu to lie about the hardware, have hidden fees... there's a bunch of hustle
> AI gpu compute markets are oversaturated
This is not the case. We see a moving average of over 90% utilization of our network. There's a lot of players, but the demand is outstripping supply
> why is an end user supposed to care about the user acquisition strategy?
Well hn is founder/insider talk but for a more direct answer, more legit institutions get higher retention and easier customers.
We're a two sided marketplace so we need to create a platform where people see integrity.
kristopolous 22 days ago [-]
Is your electricity free? Some of these cards probably cost about $0.10/hr to run ... depending on your card/electricity rate etc.
It's probably somewhere between 12months-never depending on how the market shakes out. Maybe 2 years is a good idea ... really, if power is cheap/free and the machine is on and idle then it's free money - that's the way to look at it.
yieldcrv 22 days ago [-]
My electricity is not free, I would be satisfied with partially subsidizing these units too though
kristopolous 22 days ago [-]
Well ok, I guess I'll plug my employer's site for setting up:
https://cloud.vast.ai/host/setup
There's a lot of competition in the "airbnb gpu" so if you don't like us, the number is around 12 or so globally. We're probably either #2 or #3. Companies don't really disclose these things so it's hard to know.
Some people probably list on more than one platform. There may be some host management software somewhere that helps with that. I haven't actually checked.
I'd be happy to talk more about these privately. Some are better than others and I've got no interest posting less than charitable things about our competitors publicly, regardless of how accurate I think it is. My email is in my profile.
echelon 22 days ago [-]
You can get a used A100 for that cost and have better software support for training.
4090s are too small for training and you'll have to write your own suboptimal batching.
Unless you value the learning, it'd be better to rent GPUs in the cloud for training.
sabareesh 22 days ago [-]
Yup, my initial reason behind it is to learn all the quirks.
echelon 22 days ago [-]
Consumer cards are a very different ecosystem, and you'll hit different use cases and challenges.
This might pull you down a path towards distilling and quantizing models, for instance.
sabareesh 22 days ago [-]
I was contemplating building a rig vs using the cloud, but for some reason I wanted to get hands-on. You can always rent them for a fraction of the cost.
bfung 22 days ago [-]
Also (at least in Southern California) electricity prices and how long the rig is on. Not as bad as the initial build cost, but run costs will add up over time.
sabareesh 22 days ago [-]
That is a real concern, especially since the 4090 is not as power efficient as the A100, H100 and H200. I live in Reno so it was OK.
KeplerBoy 22 days ago [-]
You can always reduce the clock and voltage to hit better Flops/Joule.
yieldcrv 22 days ago [-]
That's way less than the 6 or 7 figure sums from a year ago.
I’m glad to know
andrewmcwatters 22 days ago [-]
The last time I checked, a modern Threadripper build is a bit over $10,000. So if you have the budget for that but need something GPU-oriented instead, then I could see that being a reasonable option.
KeplerBoy 22 days ago [-]
The thing is you need a threadripper-class build to make use of 4 GPUs in the first place. Ordinary PCs don't have the PCIe lanes necessary for that.
But pricing is okay-ish, have a look at Geohot's Tinybox for turnkey solutions.
andrewmcwatters 22 days ago [-]
Ah, of course. I forgot about PCI-e lane requirements. Yeah, you're not going to casually find 8-slot (12?) PCIe x16 motherboard configurations.
Dylan16807 22 days ago [-]
How much PCIe bandwidth do you need to avoid it being the bottleneck?
KeplerBoy 21 days ago [-]
Depends on the application. In Bitcoin mining it famously was not an issue at all; manufacturers came up with the weirdest motherboards featuring many x1 PCIe slots. Look up the Biostar TB360-BTC PRO 2.0 if you want to see a curiosity.
In Deep Learning it depends on your sharding strategy.
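As a rough feel for the plain data-parallel case (assumptions: bf16 gradients, ring all-reduce over PCIe 4.0 x16, and no overlap with the backward pass, so this is an upper bound; DDP normally overlaps most of it):

    # Per-step gradient all-reduce traffic per GPU for plain data parallelism.
    params, gpus = 1e9, 4
    grad_bytes = params * 2                         # bf16 gradients
    traffic = 2 * (gpus - 1) / gpus * grad_bytes    # ring all-reduce, send+recv per GPU
    pcie4_x16 = 32e9                                # ~32 GB/s theoretical each direction
    print(f"~{traffic / pcie4_x16:.2f} s per step if nothing overlaps")  # ~0.09 s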