The problem is that performance achievements on AMD consumer-grade GPUs (RX7900XTX) are not representative of, or transferable to, the datacenter-grade GPUs (MI300X). Consumer GPUs are based on the RDNA architecture, while datacenter GPUs are based on the CDNA architecture, and AMD is not expected to release the unifying UDNA architecture until sometime around 2026 [1]. At CentML we are currently working on integrating AMD CDNA and HIP support into our Hidet deep learning compiler [2], which will also power inference workloads for all Nvidia GPUs, AMD GPUs, Google TPUs and AWS Inf2 chips on our platform [3].
The problem is that the specs of AMD consumer-grade GPUs do not translate into compute performance when you try to chain more than one together.
I have 7 NVidia 4090s under my desk happily chugging along on week-long training runs. I once managed to get a Radeon VII to run for six hours without shitting itself.
mpreda 27 days ago [-]
> I have 7 NVidia 4090s under my desk
I have 6 Radeon Pro VII under my desk (in a single system BTW), and they run hard for weeks until I choose to reboot e.g. for Linux kernel updates.
I bought them "new old stock" for $300 apiece. So that's $1800 for all six.
highwaylights 27 days ago [-]
How does the compute performance compare to 4090’s for these workloads?
(I realise it will be significantly lower; I'm just trying to get as much of a comparison as is possible.)
crest 27 days ago [-]
The Radeon VII is special compared to most older (and current) affordable GPUs in that it used HBM, giving it memory bandwidth comparable to modern cards (~1 TB/s), and it has reasonable FP64 throughput (1:4) instead of (1:64). So this card can still be pretty interesting for running memory-bandwidth-intensive FP64 workloads. Anything affordable afterward from either AMD or Nvidia crippled realistic FP64 throughput to below what an AVX-512 many-core CPU can do.
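To put those ratios into rough numbers, here is a quick sketch; the peak FP32 figures are approximate boost-clock values I'm assuming, not something stated above:

  # Peak FP64 estimated from peak FP32 and the FP64:FP32 ratio (rough boost-clock figures).
  cards = {
      "Radeon VII":     (13.4, 1 / 4),    # HBM2, ~1 TB/s
      "Radeon Pro VII": (13.1, 1 / 2),
      "RTX 4090":       (82.6, 1 / 64),
  }
  for name, (fp32_tflops, fp64_ratio) in cards.items():
      print(f"{name}: ~{fp32_tflops * fp64_ratio:.1f} TFLOPS FP64")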
nine_k 27 days ago [-]
If we speak about FP64, are your loads more like fluid dynamics than ML training?
cainxinth 27 days ago [-]
The 4090 offers 82.58 teraflops of single-precision performance compared to the Radeon Pro VII's 13.06 teraflops.
adrian_b 27 days ago [-]
On the other hand, for double precision a Radeon Pro VII is many times faster than a RTX 4090 (due to 1:2 vs. 1:64 FP64:FP32 ratio).
Moreover, for workloads limited by memory bandwidth, a Radeon Pro VII and an RTX 4090 will have about the same speed, regardless of what kind of computations are performed. ML/AI inference is frequently said to be limited by memory bandwidth.
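A back-of-the-envelope way to see why (a minimal sketch; the batch-1 and FP16-weight assumptions are mine):

  # For batch-1 generation every token has to stream all the weights from VRAM once,
  # so memory bandwidth puts a hard ceiling on tokens per second.
  def token_rate_ceiling(bandwidth_gb_s, model_size_gb):
      return bandwidth_gb_s / model_size_gb

  # ~1 TB/s on both cards, 7B-parameter model at FP16 (~14 GB of weights)
  for name, bw in [("Radeon Pro VII", 1024), ("RTX 4090", 1008)]:
      print(name, round(token_rate_ceiling(bw, 14)), "tok/s ceiling")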
ryao 26 days ago [-]
Double precision is not used in either inference or training as far as I know.
adrian_b 26 days ago [-]
Even the single precision given by the previous poster is seldom used for inference or training.
Because the previous poster had mentioned only single precision, where the RTX 4090 is better, I had to complete the data with double precision, where the RTX 4090 is worse, and memory bandwidth, where the RTX 4090 is about the same; otherwise people may believe that progress in GPUs over 5 years has been much greater than it really is.
Moreover, memory bandwidth is very relevant for inference, much more relevant than FP32 throughput.
llm_trw 24 days ago [-]
For people wondering, peak FP64 throughput:
Titan V: 7.8 TFLOPs
AMD Radeon Pro VII: 6.5 TFLOPs
AMD Radeon VII: 3.52 TFLOPs
4090: 1.3 TFLOPs
llm_trw 27 days ago [-]
For inference sure, for training: no.
varelse 27 days ago [-]
[dead]
llm_trw 27 days ago [-]
Are you running ml workloads or solving differential equations?
The two are rather different and one market is worth trillions, the other isn't.
comboy 27 days ago [-]
I think there is some money to be made in machine learning too.
tspng 27 days ago [-]
Wow, are these 7 RTX 4090s in a single setup? Care to share more how you build it (case, cooling, power, ..)?
You might find the journey of Tinycorp's Tinybox interesting: it's a machine with 6 to 8 4090 GPUs, and you should be able to track down a lot of their hardware choices, including pictures, on their Twitter, plus other info in George Hotz's livestreams.
icelancer 26 days ago [-]
EPYC + Supermicro + C-Payne retimers/cabling. 208-240V power typically mandatory for the most affordable power supplies (chain a server/crypto PSU for the GPUs from ParallelMiner to an ATX PSU for general use).
There's a bunch of similar setups and there are a couple of dozen people that have done something similar on /r/localllama.
adakbar 27 days ago [-]
I'd like to know too
archi42 26 days ago [-]
How do you manage heat? I'm looking at a hashcat build with a few 5090s, and water cooling seems to be the sensible solution if we scale beyond two cards.
ThinkBeat 26 days ago [-]
What motherboard are you using to have space and ports for 7 of them?
slavik81 26 days ago [-]
The ASRock Rack ROMED8-2T has seven PCIe x16 slots. They're too close together to directly put seven 4090s on the board, but you'd just need some riser cables to mount the cards on a frame.
majke 26 days ago [-]
What software stack do you use for training?
rcdwealth 26 days ago [-]
[dead]
zozbot234 27 days ago [-]
It looks like AMD's CDNA GPUs are supported by Mesa, which ought to suffice for Vulkan Compute and SYCL support. So there should be ways to run ML workloads on the hardware without going through HIP/ROCm.
shihab 27 days ago [-]
I have come across quite a few startups that are trying a similar idea: break the Nvidia monopoly by utilizing AMD GPUs (for inference at least): Felafax, Lamini, Tensorwave (partially), SlashML. I even saw optimistic claims from some of them that the CUDA moat is only 18 months deep [1]. Let's see.
AMD GPUs are becoming a serious contender for LLM inference. vLLM is already showing impressive performance on AMD [1], even with consumer-grade Radeon cards (which even support GGUF) [2]. This could be a game-changer for folks who want to run LLMs without shelling out for expensive NVIDIA hardware.
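For anyone who wants to see what that looks like, vLLM's Python API is identical whether the wheel was built against CUDA or ROCm. A minimal sketch (the model name is just an example, not something from the linked posts):

  from vllm import LLM, SamplingParams

  # On a ROCm build of vLLM this runs on AMD GPUs with no code changes.
  llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # swap in whatever model you actually run
  params = SamplingParams(temperature=0.7, max_tokens=64)
  outputs = llm.generate(["Why does memory bandwidth matter for inference?"], params)
  print(outputs[0].outputs[0].text)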
That is GH200 and it is likely due to an amd64 dependency in vLLM.
adrian_b 27 days ago [-]
That seems like a CPU problem, not a GPU problem (due to Aarch64 replacing x86-64).
treprinum 27 days ago [-]
AMD decided not to release a high-end GPU this cycle, so any investment into a 7x00 or 6x00 is going to be wasted: the Nvidia 5x00 series is likely going to destroy any ROI from the older cards, and AMD won't have an answer for at least two years, possibly never, due to being non-existent in high-end consumer GPUs usable for compute.
BearOso 26 days ago [-]
No high-end consumer RDNA4 GPU this cycle. And it's only missing the very high-end model. So we'll still get at least a 7800xt equivalent and whatever CDNA MI models they come out with.
The market for the extreme high-end consumer is pretty small, so they're only missing out on clout.
treprinum 26 days ago [-]
The top-end RDNA4 GPU will have 16GB RAM. That's a massive regression compared to 7900XTX and performance-wise it should be at best at the 7900XTX level. We are discussing AMD cards for LLM inference where VRAM is arguably the most important aspect of a GPU and AMD just threw in the towel for this cycle.
latchkey 26 days ago [-]
These blog posts were written based on my company, Hot Aisle, donating the compute. =) Super proud of being able to support this.
ryukoposting 27 days ago [-]
Peculiar business model, at a glance. It seems like they're doing work that AMD ought to be doing, and is probably doing behind the scenes. Who is the customer for a third-party GPU driver shim?
dpkirchner 27 days ago [-]
Could be trying to make themselves a target for a big acquihire.
to11mtm 27 days ago [-]
Cynical take: Try to get acquired by Intel for Arc.
dogma1138 27 days ago [-]
Intel is in vastly better shape than AMD; they have the software pretty much nailed down.
lhl 27 days ago [-]
I've recently been poking around with Intel oneAPI and IPEX-LLM. While there are things that I find refreshing (like their ability to actually respond to bug reports in a timely manner, or at all), on the whole, support/maturity actually doesn't match the current state of ROCm.
PyTorch requires its own support kit separate from the oneAPI Toolkit (and runs slightly different versions of everything); the vLLM xpu support doesn't work - both source and the docker failed to build/run for me; the IPEX-LLM whisper support is completely borked, etc, etc.
moffkalast 27 days ago [-]
I've recently been trying to get IPEX working as well, apparently picking Ubuntu 24.04 was a mistake, because while things compile, everything fails at runtime. I've tried native, docker, different oneAPI versions, threw away a solid week of afternoons for nothing.
SYCL with llama.cpp is great though, at least at FP16 (since it supports nothing else); even Arc iGPUs easily give 2-4x the performance of CPU inference.
Intel should've just contributed to SYCL instead of trying to make their own thing and then forgetting to keep maintaining it halfway through.
lhl 27 days ago [-]
My testing has been w/ a Lunar Lake Core 258V chip (Xe2 - Arc 140V) on Arch Linux. It sounds like you've tried a lot of things already, but in case it helps, my notes for installing llama.cpp and PyTorch: https://llm-tracker.info/howto/Intel-GPUs
I have some benchmarks as well, and the IPEX-LLM backend performed a fair bit better than the SYCL llama.cpp backend for me (almost +50% pp512 and almost 2X tg128) so worth getting it working if you plan on using llama.cpp much on an Intel system. SYCL still performs significantly better than Vulkan and CPU backends, though.
As an end-user, I agree that it'd be way better if they could just contribute upstream somehow (whether to the SYCL backend, or if not possible, to a dependency-minimized IPEX backend). The IPEX backend is one of the more maintained parts of IPEX-LLM, btw. I found a lot of stuff in that repo that depends on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
moffkalast 27 days ago [-]
Well that's funny, I think we already spoke on Reddit. I'm the guy who was testing the 125H recently. I guess there's like 5 of us who have intel hardware in total and we keep running into each other :P
Honestly I think there's just something seriously broken with the way IPEX expects the GPU driver to be on 24.04 and there's nothing I can really do about it except wait for them to fix it if I want to keep using this OS.
I am vaguely considering adding another drive and installing 22.04 or 20.04 with the exact kernel they want, to see if that might finally work in the meantime, but honestly I'm fairly satisfied with the speed I get from SYCL already. The problem is more that it's annoying to integrate it directly through the server endpoint; every project expects a damn Ollama API or llama-cpp-python these days, and I'm a fan of neither since it's just another layer of headaches to get those compiled with SYCL.
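For what it's worth, llama.cpp's server has had an OpenAI-compatible endpoint for a while now, so many of those projects can be pointed at it without Ollama or llama-cpp-python in the middle. A rough sketch, assuming a SYCL-built llama-server is already listening on the default localhost:8080:

  import requests

  # llama.cpp's llama-server exposes an OpenAI-style chat completions route;
  # tools that only speak the OpenAI API can use it as a drop-in base URL.
  resp = requests.post(
      "http://localhost:8080/v1/chat/completions",
      json={
          "messages": [{"role": "user", "content": "Hello from a SYCL build of llama.cpp"}],
          "max_tokens": 64,
      },
  )
  print(resp.json()["choices"][0]["message"]["content"])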
> I found a lot of stuff in that repo that depend on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
Yeah, well, the fact that oneAPI 2025 got released, broke IPEX, and they still haven't figured out a way to patch it months later makes me think it's total chaos internally, where teams work against each other instead of talking and coordinating.
0xDEADFED5 26 days ago [-]
FWIW, on 22.04 I can use the current kernel but otherwise follow Intel's instructions, and the stuff works (old as it is now). I'm currently trying to figure out the best way to finetune Qwen 2.5 3B; the old axolotl isn't up to it. Not sure if I'm going to work on a fork of axolotl or try something else at this point.
0xDEADFED5 26 days ago [-]
Big agree on Intel working on SYCL. I've run millions of tasks through SYCL llama.cpp at this point, and though SYCL reliably does 5-6x the prompt processing speed of the Vulkan builds, current Vulkan builds are now up to 50% faster at token generation than SYCL on my Intel GPU.
indolering 27 days ago [-]
Tell that to the board.
bboygravity 27 days ago [-]
Someone never used Intel Killer WiFi software.
dangero 27 days ago [-]
More cynical take: Trying to get acquired by nvidia
dizhn 27 days ago [-]
Person below says they (the whole team) already joined Nvidia.
shiroiushi 27 days ago [-]
More cynical take: this would be a bad strategy, because Intel hasn't shown much competence in its leadership for a long time, especially in regards to GPUs.
rockskon 27 days ago [-]
They've actually been making positive moves with GPUs lately along with a success story for the B580.
kimixa 27 days ago [-]
The B580 being a "success" is purely a business decision: it's a loss leader to get their name into the market. A larger die on a newer node than either Nvidia or AMD means their per-unit costs are higher, and they are selling it at a lower price.
That's not a long-term success strategy. Maybe good for getting your name in the conversation, but not sustainable.
bitmasher9 27 days ago [-]
It’s a long term strategy to release a hardware platform with minimal margins in the beginning to attract software support needed for long term viability.
One of the benefits of being Intel.
rockskon 25 days ago [-]
Well yes, it's in the name "loss leader". It's not meant to be sustainable. It's meant to get their name out there as a good alternative to Radeon cards for the lower-end GPU market.
Profit can come after positive brand recognition for the product.
jvanderbot 27 days ago [-]
I was reading this whole thread as about technical accomplishment and non-nvidia GPU capabilities, not business. So I think you're talking about different definitions of "Success". Definitely counts, but not what I was reading.
7speter 27 days ago [-]
I don’t know if this matters but while the B580 has a die comparable in size to a 4070 (~280mm^2), it has about half the transistors (~17-18 billion), iirc.
ryao 26 days ago [-]
Tom Petersen said in a hardware unboxed video that they only reported “active” transistors, such that there are more transistors in the B580 than what they reported. I do not think this is the correct way to report them since one, TSMC counts all transistors when reporting the density of their process and two, Intel is unlikely to reduce the reported transistor count for the B570, which will certainly have fewer active transistors.
That said, the 4070 die is 294mm^2 while the B580 die is 272mm^2.
ryao 27 days ago [-]
Is it a loss leader? I looked up the price of 16Gbit GDDR6 ICs the other day at DRAMeXchange, and the cost of 12GB is $48. Using the Gamers Nexus die measurements, we can calculate that they get at least 214 dies per wafer. At $12095 per wafer, which is reportedly the price at TSMC for 5nm wafers in 2025, that is $57 per die.
While defects ordinarily reduce yields, Intel put plenty of redundant transistors into the silicon. This is ordinarily not possible to estimate, but Tom Petersen reported in his interview with hardware unboxed that they did not count those when reporting the transistor count. Given that the density based on reported transistors is about 40% less than the density others get from the same process and the silicon in GPUs is already fairly redundant, they likely have a backup component for just about everything on the die. The consequence is that they should be able to use at least 99% of those dies even after tossing unusable dies, such that the $57 per die figure is likely correct.
As for the rest of the card, there is not much in it that would not be part of the price of an $80 Asrock motherboard. The main thing would be the bundled game, which they likely can get in bulk at around $5 per copy. This seems reasonable given how much Epic games pays for their giveaways:
That brings the total cost to $190. If we assume ASRock and the retailer both have a 10% margin on the $80 motherboard used as a substitute for the costs of the rest of the things, then it is $174. Then we need to add margins for board partners and the retailers. If we assume they both get 10% of the $250, then that leaves a $26 profit for Intel, provided that they have economies of scale such that the $80 motherboard approximation for the rest of the cost of the graphics card is accurate.
That is about a 10% margin for Intel. That is not a huge margin, but provided enough sales volume (to match the sales volume Asrock gets on their $80 motherboards), Intel should turn a profit on these versus not selling these at all. Interestingly, their board partners are not able/willing to hit the $250 MSRP and the closest they come to it is $260 so Intel is likely not sharing very much with them.
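Writing that napkin math out (every input is the estimate above, not an official figure):

  # Rough per-card estimate for the B580; all numbers are assumptions from this comment.
  die_cost     = 12095 / 214      # reported 5nm wafer price / estimated dies per wafer, ~$57
  gddr6_12gb   = 48               # spot price for 12 GB of 16Gbit GDDR6
  board_etc    = 80 * 0.8         # $80 motherboard stand-in, minus the two 10% margins in it
  bundled_game = 5
  unit_cost    = die_cost + gddr6_12gb + board_etc + bundled_game    # ~$174

  msrp = 250
  partner_and_retail_cut = 2 * 0.10 * msrp                           # ~$50
  intel_profit = msrp - partner_and_retail_cut - unit_cost           # ~$26
  print(round(unit_cost), round(intel_profit), f"{intel_profit / msrp:.1%}")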
It should be noted that Tom Petersen claimed during his hardware unboxed interview that they were not making money on these. However, that predated the B580 being a hit and likely relied on expected low production volumes due to low sales projections. Since the B580 is a hit and napkin math says it is profitable as long as they build enough of them, I imagine that they are ramping production to meet demand and reach profitability.
SixtyHurtz 26 days ago [-]
That's just BOM. When you factor in R&D they are clearly still losing money on B580. There's no way they can recoup R&D this generation with a 10% gross margin.
Still, that's to be expected considering this is still only the second generation of Arc. If they can break even on the next gen, that would be an accomplishment.
ryao 26 days ago [-]
To be fair, the R&D is shared with Intel’s integrated graphics as they use the same IP blocks, so they really only need to recoup the R&D that was needed to turn that into a discrete GPU. If that was $50 million and they sell 2 million of these, they would probably recoup it. Even if they fail to recoup their R&D funds, they would be losing more money by not selling these at all, since no sales means 0 dollars of R&D would be recouped.
While this is not an ideal situation, it is a decent foundation on which to build the next generation, which should be able to improve profitability.
schmidtleonard 27 days ago [-]
Yeah but MLID says they are losing money on every one and have been winding down the internal development resources. That doesn't bode well for the future.
I want to believe he's wrong, but on the parts of his show where I am in a position to verify, he generally checks out. Whatever the opposite of Gell-Mann Amnesia is, he's got it going for him.
sodality2 27 days ago [-]
MLID on Intel is starting to become the same as UserBenchmark on AMD (except for the generally reputable sources)... he's beginning to sound like he simply wants Intel to fail, to my insider-info-lacking ears. For competition's sake I really hope that MLID has it wrong (at least the opining about the imminent failure of Intel's GPU division), and that the B series will encourage Intel to push farther to spark more competition in the GPU space.
ryao 26 days ago [-]
My analysis is that the B580 is profitable if they build enough of them:
The margins might be describable as razor thin, but they are there. Whether it can recoup the R&D that they spent designing it is hard to say definitively since I do not have numbers for their R&D costs. However, their iGPUs share the same IP blocks, so the iGPUs should be able to recoup the R&D costs that they have in common with the discrete version. Presumably, Intel can recoup the costs specific to the discrete version if they sell enough discrete cards.
While this is not a great picture, it is not terrible either. As long as Intel keeps improving its graphics technology with each generation, profitability should gradually improve. Although I have no insider knowledge, I noticed a few things that they could change to improve their profitability in the next generation:
* Tom Petersen made a big deal about 16-lane SIMD in Battlemage being what games want rather than the 8-lane SIMD in Alchemist. However, that is not quite true, since both Nvidia and AMD graphics use 32-lane SIMD. If the number of lanes really matters, and I can certainly see how it would if game shaders have horizontal operations, then a switch to 32-lane SIMD should yield further improvements.
* Tom Petersen said in his interview with Hardware Unboxed that Intel reported the active transistor count for the B580 rather than the total transistor count. This is contrary to others, who report the total transistor count (as evidenced by their density figures being close to what TSMC claims the process can do). Tom Petersen also stated that they would not necessarily be forced by defects to turn dies into B570 cards. This suggests to me that they have substantial redundant logic in the GPU to prevent defects from rendering chips unusable, and that logic is intended to be disabled in production. GPUs are already highly redundant. They could drop much of the planned dark silicon and let defects force a larger percentage of the dies to be usable only as cut-down models.
I could have read too much into things that Tom Petersen said. Then again, he did say that their design team is conservative and the doubling rather than quadrupling of the SIMD lane count and the sheer amount of dark silicon (>40% of the die by my calculation) spent on what should be redundant components strike me as conservative design choices. Hopefully the next generation addresses these things.
Also, they really do have >40% dark silicon when doing density comparisons:
They have 41% less density than Nvidia and 48% less density than TSMC claims the process can obtain. We also know that they have additional transistors on the die that are not active from Tom Petersen’s comments. Presumably, they are for redundancy. Otherwise, there really is no sane explanation that I can see for so much dark silicon. If they are using transistors that are twice the size as the density figure might be interpreted to suggest, they might as well have used TSMC’s 7nm process since while a smaller process can etch larger features, it is a waste of money.
Note that we can rule out the cache lowering the density. The L1 + L2 cache on the 4070 Ti is 79872 KB while it is 59392 KB on the B580. We can also rule out IO logic as lowering the density, as the 4070 Ti has a 256-bit memory bus while the B580 has a 192-bit memory bus.
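For reference, the density gap works out like this (a quick sketch; the transistor counts are the publicly reported figures, which I am taking on faith):

  # Million transistors per mm^2, using reported ("active", in Intel's case) transistor counts.
  b580  = 19_600 / 272     # ~72 MTr/mm^2: 19.6B reported transistors on a 272 mm^2 die
  ad104 = 35_800 / 294     # ~122 MTr/mm^2: RTX 4070 Ti die
  tsmc_n5_claim = 138      # TSMC's quoted logic density for its N5-class nodes, MTr/mm^2

  print(f"B580 vs AD104: {1 - b580 / ad104:.0%} lower density")        # ~41%
  print(f"B580 vs TSMC claim: {1 - b580 / tsmc_n5_claim:.0%} lower")   # ~48%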
> Tom Petersen made a big deal about 16-lane SIMD in Battlemage [...]
Where? The only mention I see in that interview is him briefly saying they have native 16 with "simple emulation" for 32 because some games want 32. I see no mention of or comparison to 8.
And it doesn't make sense to me that switching to actual 32 would be an improvement. Wider means less flexible here. I'd say a more accurate framing is whether the control circuitry is 1/8 or 1/16 or 1/32. Faking extra width is the part that is useful and also pretty easy.
ryao 26 days ago [-]
For context, Alchemist was SIMD8. They made a big deal out of this at the alchemist launch if I recall correctly since they thought it would be more efficient. Unfortunately, it turned out to be less efficient.
Tom Petersen did a bunch of interviews right before the Intel B580 launch. In the hardware unboxed interview, he mentioned it, but accidentally misspoke. I must have interpreted his misspeak as meaning games want SIMD16 and noted it that way in my mind, as what he says elsewhere seems to suggest that games want SIMD16. It was only after thinking about what I heard that I realized otherwise. Here is an interview where he talks about native SIMD16 being better:
> But we also have native SIMD support—SIMD16 native support, which is going to say that you don’t have to like recode your computer shader to match a particular topology. You can use the one that you use for everyone else, and it’ll just run well on ARC. So I’m pretty excited about that.
In an interview with gamers nexus, he has a nice slide where he attributes a performance gain directly to SIMD16:
At the start of the gamers nexus video, Steve mentions that Tom‘s slides are from a presentation. I vaguely remember seeing a video of it where he talked more about SIMD16 being an improvement, but I am having trouble finding it.
Having to schedule fewer things is a definite benefit of 32 lanes over a smaller lane count. Interestingly, AMD switched from a 16 lane count to a 32 lane count with RDNA, and RDNA turned out to be a huge improvement in efficiency. The switch is actually somewhat weird since they had been emulating SIMD64 using their SIMD16 hardware, so the hardware simultaneously became wider and narrower at the same time. Their emulation of SIMD64 in SIMD16 is mentioned in this old GCN documentation describing cross lane operations:
That documentation talks about writing to a temporary location and reading from a temporary location in order to do cross lane operations. Contrast this with 12.5.1 of the RDNA 3 ISA documentation, where the native SIMD32 units just fetch the values from each others’ registers with no mention of a temporary location:
That strikes me as much more efficient. While I do not write shaders, I have written CUDA kernels and in CUDA kernels, you sometimes need to do what Nvidia calls a parallel reduction across lanes, which are cross lane operations (Intel’s CPU division calls these horizontal operations). For example, you might need to sum across all lanes (e.g. for an average, matrix vector multiplication or dot product). When your thread count matches the SIMD lane count, you can do this without going to shared memory, which is fast. If you need to emulate a higher lane width, you need to use a temporary storage location (like what AMD described), which is not as fast.
If games’ shaders are written with an assumption that SIMD32 is used, then native SIMD32 is going to be more performant than native SIMD16 because of faster cross lane operations. Intel’s slide attributes a 0.3ms reduction in render time to their switch from SIMD8 to SIMD16. I suspect that they would see a further reduction with SIMD32, since that would eliminate the need to emulate SIMD32 for games that expect SIMD32 due to Nvidia (at least since Turing) and AMD (since RDNA 1) both using SIMD32.
To illustrate this, here are some CUDA kernels that I wrote:
The softmax kernel for example has the hardware emulate SIMD1024, although you would need to look at the kernel invocations in the corresponding rung.c file to know that. The purpose of doing 1024 threads is to ensure that the kernel is memory bandwidth bound since the hardware bottleneck for this operation should be memory bandwidth. In order to efficiently do the parallel reductions to calculate the max and sum values in different parts of softmax, I use the fast SIMD32 reduction in every SIMD32 unit. I then write the results to shared memory from each of the 32 SIMD32 units that performed this (since 32 * 32 = 1024). I then have all 32x SIMD32 units read from shared memory and simultaneously do the same reduction to calculate the final value. Afterward, the leader in each unit tells all others the value and everything continues. Now imagine having a compiler compile this for a native SIMD16.
A naive approach would introduce a trip to shared memory for both reductions, giving us 3 trips to shared memory and 4 reductions. A more clever approach would do 2 trips to shared memory and 3 reductions. Either way, SIMD16 is less efficient. The smart thing to do would be to recognize that 256 threads is likely okay too and just do the same exact thing with a smaller number of threads, but a compiler is not expected to be able to make such a high level optimization, especially since the high level API says “use 1024 threads”. Thus you need the developer to rewrite this for SIMD16 hardware to get it to run at full speed and with Intel’s low marketshare, that is not very likely to happen. Of course, this is CUDA code and not a shader, but a shader is likely in a similar situation.
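To make the trip-counting concrete, here is a tiny model of the "clever" hierarchical approach (my own simplification, not the actual kernel):

  # Reducing `threads` values when cross-lane reductions are only `lanes` wide:
  # each level reduces groups of `lanes`, and every level after the first
  # costs a round trip through shared memory.
  def reduction_cost(threads, lanes):
      reductions = 0
      while threads > 1:
          threads = -(-threads // lanes)   # ceil-divide: one cross-lane reduction per group
          reductions += 1
      return reductions, reductions - 1    # (reductions, shared-memory trips)

  print(reduction_cost(1024, 32))   # (2, 1): the SIMD32 softmax kernel described above
  print(reduction_cost(1024, 16))   # (3, 2): the "more clever" SIMD16 variant
  print(reduction_cost(1024, 8))    # (4, 3)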
Dylan16807 26 days ago [-]
> Having to schedule fewer things is a definite benefit of 32 lanes over a smaller lane count.
From a hardware design perspective, it saves you some die size in the scheduler.
From a performance perspective, as long as the hardware designer kept 32 in mind, it can schedule 32 lanes and duplicate the signals to the 16 or 8 wide lanes with no loss of performance.
> That documentation talks about writing to a temporary location and reading form a temporary location in order to do cross lane operations.
> If games’ shaders are written with an assumption that SIMD32 is used, then native SIMD32 is going to be more performant than native SIMD16 because of faster cross lane operations.
So this is a situation where wider lanes actually need more hardware to run at full speed and not having it causes a penalty. I see your point here, but I will note that you can add that criss-cross hardware for 32-wide operations while still having 16-wide be your default.
ryao 26 days ago [-]
> From a performance perspective, as long as the hardware designer kept 32 in mind, it can schedule 32 lanes and duplicate the signals to the 16 or 8 wide lanes with no loss of performance.
I was looking at the things that were said for XE2 in Lunar Lake and it appears that the slides suggest that they had special handling to emulate SIMD32 using SIMD16 in hardware, so you might be right.
> So this is a situation where wider lanes actually need more hardware to run at full speed and not having it causes a penalty. I see your point here, but I will note that you can add that criss-cross hardware for 32-wide operations while still having 16-wide be your default.
To go from SIMD8 to SIMD16, Intel halved the number of units while making them double the width. They could have done that again to avoid the need for additional hardware.
I have not seen the Xe2 instruction set to have any hints about how they are doing these operations in their hardware. I am going to leave it at that since I have spent far too much time analyzing the technical marketing for a GPU architecture that I am not likely to use. No matter how well they made it, it just was not scaled up enough to make it interesting to me as a developer that owns a RTX 3090 Ti. I only looked into it as much as I did since I am excited to see Intel moving forward here. That said, if they launched a 48GB variant, I would buy it in a heartbeat and start writing code to run on it.
ryao 26 days ago [-]
There is a typo in the Tom Petersen quote. He said “compute shader”, not “computer shader”. Autocorrect changed it when I had transcribed it and I did not catch this during the edit window.
oofabz 27 days ago [-]
The die size of the B580 is 272 mm^2, which is a lot of silicon for $249. The performance of the GPU is good for its price but bad for its die size. Manufacturing cost is closely tied to die size.
272 mm^2 puts the B580 in the same league as the Radeon 7700XT, a $449 card, and the GeForce RTX 4070 Super, which is $599. The idea that Intel is selling these cards at a loss sounds reasonable to me.
tjoff 27 days ago [-]
Though you assume the prices of the competition are reasonable. There are plenty of reasons for them not to be. Availability issues, lack of competition, other more lucrative avenues etc.
Intel has none of those, or at least not to the same degree.
KeplerBoy 27 days ago [-]
At a loss seems a bit overly dramatic. I'd guess Nvidia sells SKUs for three times their marginal cost. Intel is probably operating at cost without any hopes of recouping R&D with the current SKUs, but that's reasonable for an aspiring competitor.
7speter 27 days ago [-]
It kinda seems they are covering the cost of throwing massive amounts of resources trying to get Arc’s drivers in shape.
KeplerBoy 27 days ago [-]
I really hope they stick with it and become a viable competitor in every market segment a few more years down the line.
ryao 26 days ago [-]
The drivers are shared by their iGPUs, so the cost of improving the drivers is likely shared by those.
ryao 26 days ago [-]
The idea that Intel is selling these at a loss does not sound reasonable to me:
The only way this would be at a loss is if they refuse to raise production to meet demand. That said, I believe their margins on these are unusually low for the industry. They might even fall into razor thin territory.
derektank 27 days ago [-]
Wait, are they losing money on every one in the sense that they haven't broken even on research and development yet? Or in the sense that they cost more to manufacture than they're sold at? Because one is much worse than the other.
That being said, the IP blocks are shared by their iGPUs, so the discrete GPUs do not need to recoup the costs of most of the R&D, as it would have been done anyway for the iGPUs.
rockskon 27 days ago [-]
They're trying to unseat Radeon as the budget card. That means making a more enticing offer than AMD for a temporary period of time.
ryao 26 days ago [-]
That guy’s reasoning is faulty. To start, he has made math mistakes in every video that he has posted recently involving math. To give 3 recent examples:
At 10m3s in the following video, he claims to add a 60% margin by multiplying by 1.6, but in reality he is adding a 37.5% margin and would have needed to multiply by 2.5 to add a 60% margin. This can be calculated from Cost Scaling Factor = 1 / (1 - Normalized Profit Margin):
At 48m13s in the following video, he claims that Intel’s B580 is 80% worse than Nvidia’s hardware. He took the 4070 Ti as being 82% better than the 2080 SUPER, assumed based on leaks from his reviewer friends that the B580 was about at the performance of the 2080 SUPER and then claimed that the B580 would be around 80% worse than the 4070 Ti. Unfortunately for him, that is 45% worse, not 80% worse. His chart is from Techpowerup and if he had taken the time to do some math (1 - 1/(1 + 0.82) ~ 0.45), or clicked to the 2080 SUPER page, he would have seen it has 55% of the performance of the 4070 Ti, which is 45% worse:
At 1m2s in the following video, he makes a similar math mistake by saying that the B580 has 8% better price/performance than the RTX 3060 when in fact it is 9% better. He mistakenly equated the RTX 3060 being 8% worse than the B580 to mean that it is 8% better, but math does not work that way. Luckily for him, the math error is small here, but he still failed to do math correctly and his reasoning grows increasingly faulty with the scale of his math errors. What he should have done that gives the correct normalized factor is:
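The corrected arithmetic in all three examples is a one-liner each (just restating the math above):

  # Markup factor needed to hit a target margin, with margin as a fraction of the selling price.
  def markup_for_margin(margin):
      return 1 / (1 - margin)

  print(markup_for_margin(0.60))   # 2.5x is needed for a 60% margin...
  print(1 - 1 / 1.6)               # ...while a 1.6x markup only yields a 37.5% margin
  print(1 - 1 / 1.82)              # "82% faster" inverted is ~45% slower, not 80%
  print(1 / (1 - 0.08) - 1)        # "8% worse" inverted is ~8.7% (roughly 9%) better, not 8%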
He not only fails at mathematical reasoning, but also lacks a basic understanding of how hardware manufacturing works. He said that if Intel loses $20 per card in low production volumes, then making 10 million cards will result in a $200 million loss. In reality, things become cheaper due to economies of scale, and simple napkin math shows that they can turn a profit on these cards:
His behavior is consistent with being on a vendetta rather than being a technology journalist. For example, at 55m13s in the following video, he puts words in Tom Petersen’s mouth and then, with a malicious smile on his face, cheers while claiming that Tom Petersen declared discrete ARC cards to be dead when Tom Petersen said nothing of the kind. Earlier in the same video, at around 44m14s, he calls Tom Petersen a professional liar. However, he sees no problem expecting people to believe words he shoved into the “liar’s” mouth:
If you scrutinize his replies to criticism in his comments section, you would see he is dodging criticism of the actual issues with his coverage while saying “I was right about <insert thing completely unrelated to the complaint here>” or “facts don’t care about your feelings”. You would also notice that he is copy and pasting the same statements rather than writing replies addressing the details of the complaints. To be clear, I am paraphrasing in those two quotes.
He also shows contempt for his viewers that object to his behavior in the following video around 18m53s where he calls them “corporate cheerleaders”:
In short, Tom at MLID is unable to do mathematical reasoning, does not understand how hardware manufacturing works, has a clear vendetta against Intel’s discrete graphics, is unable to take constructive criticism and lashes out at those who try to tell him when he is wrong. I suggest being skeptical of anything he says about Intel’s graphics division.
dboreham 27 days ago [-]
> Could be trying to make themselves a target for a big acquihire.
Is this something anyone sets out to do?
ryukoposting 27 days ago [-]
It definitely is, yes.
seeknotfind 27 days ago [-]
Yes.
tesch1 27 days ago [-]
AMD. Just one more dot to connect ;)
dylan604 27 days ago [-]
It would be interesting to find out AMD is funding these other companies to ensure the shim happens while they focus on not doing it.
bushbaba 27 days ago [-]
AMD is kind of doing that funding by pricing its GPUs low and/or giving them away at cost to these startups
shmerl 27 days ago [-]
Is this effort benefiting everyone? I.e. where is it going / is it open source?
From Lamini, we have a private AMD GPU cluster, ready to serve anyone who wants to try MI300x or MI250 with inference and tuning.
We just onboarded a customer to move from the OpenAI API to an on-prem solution; we are currently evaluating MI300x for inference.
Email me at my profile email.
3abiton 27 days ago [-]
My understanding is that once JAX takes off, the CUDA advantage is gone for Nvidia. That's a big if/when though.
jsheard 27 days ago [-]
Tinygrad was another one, but they ended up getting frustrated with AMD and semi-pivoted to Nvidia.
noch 27 days ago [-]
> Tinygrad was another one, but they ended up getting frustrated with AMD and semi-pivoted to Nvidia.
From their announcement on 20241219[^0]:
"We are the only company to get AMD on MLPerf, and we have a completely custom driver that's 50x simpler than the stock one. A bit shocked by how little AMD cared, but we'll take the trillions instead of them."
From 20241211[^1]:
"We gave up and soon tinygrad will depend on 0 AMD code except what's required by code signing.
We did this for the 7900XTX (tinybox red). If AMD was thinking strategically, they'd be begging us to take some free MI300s to add support for it."
Interesting. I wonder if focusing on GPUs and CPUs is something that requires two companies instead of one, or whether the concentration of resources just leads to one arm of your company being much better than the other.
halJordan 26 days ago [-]
Nvidia maintains a competitive CPU...
kranke155 26 days ago [-]
I had no idea. Thanks for sharing.
latchkey 26 days ago [-]
Hot Aisle (my company) has MI300x compute available for rent too! =)
jroesch 27 days ago [-]
Note: this is old work, and much of the team working on TVM, and MLC were from OctoAI and we have all recently joined NVIDIA.
sebmellen 27 days ago [-]
Is there no hope for AMD anymore? After George Hotz/Tinygrad gave up on AMD I feel there’s no realistic chance of using their chips to break the CUDA dominance.
comex 27 days ago [-]
Maybe from Modular (the company Chris Lattner is working for). In this recent announcement they said they had achieved competitive ML performance… on NVIDIA GPUs, but with their own custom stack completely replacing CUDA. And they’re targeting AMD next.
IMO the hope shouldn't be that AMD specifically wins, rather it's best for consumers that hardware becomes commoditized and prices come down.
And that's what's happening, slowly anyway. Google, Apple and Amazon all have their own AI chips, Intel has Gaudi, AMD had their thing, and the software is at least working on more than just Nvidia. Which is a win. Even if it's not perfect. I'm personally hoping that everyone piles in on a standard like SYCL.
steeve 27 days ago [-]
We (ZML) have AMD MI300X working just fine, in fact, faster than H100
That's almost word for word what geohotz said last year?
refulgentis 27 days ago [-]
What part?
I assume the part where she said there's "gaps in the software stack", because that's the only part that's attributed to her.
But I must be wrong because that hasn't been in dispute or in the news in a decade, it's not a geohot discovery from last year.
Hell, I remember a subargument of a subargument re: this being an issue a decade ago in macOS dev (TL;DR: whether to invest in OpenCL).
bn-l 27 days ago [-]
I went through the thread. There’s an argument to be made in firing Su for being so spaced out as to miss an op for their own CUDA for free.
hedgehog 27 days ago [-]
Not remotely, how did you get to that idea?
refulgentis 27 days ago [-]
Kids this days (shakes fist)
tl;dr there's a not-insubstantial number of people who learn a lot from geohot. I'd say about 3% of people here would be confused if you thought of him as less than a top technical expert across many comp sci fields.
And he did the geohot thing recently, way tl;dr: acted like there was a scandal being covered up by AMD around drivers that was causing them to "lose" to nVidia.
He then framed AMD not engaging with him on this topic as further covering-up and choosing to lose.
So if you're of a certain set of experiences, you see an anodyne quote from the CEO that would have been utterly unsurprising dating back to when ATI was still a company, and you'd read it as the CEO breezily admitting in public that geohot was right about how there was malfeasance, followed by a cover up, implying extreme dereliction of duty, because she either helped or didn't realize till now.
I'd argue this is partially due to stonk-ification of discussions, there was a vague, yet often communicated, sense there was something illegal happening. Idea was it was financial dereliction of duty to shareholders.
brookst 26 days ago [-]
Like Matt Levine says, “everything is securities fraud”. Company gets hacked? Securities fraud because they failed to disclose the exact probability of this event in their SEC filings. Company’s latest product is a flop? Securities fraud because they failed to disclose the bad decisions leading to the flop. Etc, etc.
quotemstr 27 days ago [-]
The world is bigger than AMD and Nvidia. Plenty of interesting new AI-tuned non-GPU accelerators coming online.
grigio 27 days ago [-]
I hope so. Name some NPU that can run a 70B model...
fweimer 26 days ago [-]
Isn't AMD rather strong in the HPC space?
Quite frankly, I have difficulty reconciling a lot of comments here with that, and my own experience as an AMD GPU user (although not for compute, and not on Windows).
llm_trw 27 days ago [-]
Not really.
AMD is constitutionally incapable of shipping anything but mid range hardware that requires no innovation.
The only reason why they are doing so well in CPUs right now is that Intel has basically destroyed itself without any outside help.
adrian_b 27 days ago [-]
In CPUs, AMD has made many innovations that have been copied by Intel only after many years and this delay had an important contribution to Intel's downfall.
The most important has been the fact that AMD has predicted correctly that big monolithic CPUs will no longer be feasible in the future CMOS fabrication technologies, so they have designed the Zen family since the beginning with a chiplet-based architecture. Intel had attempted to ridicule them, but after losing many billions they have been forced to copy this strategy.
Also in the microarchitecture of their CPUs AMD has made the right choices since the beginning and then they have improved it constantly with each generation. The result is that now the latest Intel big core, Lion Cove, has a microarchitecture that is much more similar to AMD Zen 5 than to any of the previous Intel cores, because they had to do this to get a competitive core.
In the distant past, AMD has also introduced a lot of innovations long before they were copied by Intel, but it is true that those had not been invented by AMD, but they had been copied by AMD from more expensive CPUs, like DEC Alpha or Cray or IBM POWER, but Intel has also copied them only after being forced by the competition with AMD.
ksec 27 days ago [-]
Everything is comparative. AMD isn't perfect. As an ex-shareholder I have argued they did well partly because of Intel's downfall. In terms of execution it is far from perfect.
But Nvidia is a different beast. It is a bit like Apple in the late 00s: take business, forecasting, marketing, operations, software, hardware, sales, etc., and any part of it is industry leading. And having industry-leading capability is only part of the game; having it all work together is completely another thing. And unlike Apple, which lost direction once Steve Jobs passed away and wasn't sure how to deploy capital, Jensen is still here, and they have more resources now, making Nvidia even more competitive.
People often underestimate the magnitude of the task required (I like to tell the story of an Intel GPU engineer in 2016 arguing they could take dGPU market share by 2020, and we are now in 2025), overestimate the capability of an organisation, and underestimate the rival's speed of innovation and execution. These three things combined are why most people's estimates are often off by an order of magnitude.
llm_trw 27 days ago [-]
Yeah, no.
We are in the middle of a monopoly squeeze by NVidia on the most innovative part of the economy right now. I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.
By comparison if AMD could write a driver that didn't shit itself when it had to multiply more than two matrices in a row they'd be selling cards faster than they can make them. You don't need to sell the best shovels in a gold rush to make mountains of money, but you can't sell teaspoons as premium shovels and expect people to come back.
ksec 27 days ago [-]
>We are in the middle of a monopoly squeeze by NVidia on the most innovative part of the economy right now.
I am not sure which part of Nvidia is monopoly. That is like suggesting TSMC has a monopoly.
vitus 27 days ago [-]
> That is like suggesting TSMC has a monopoly.
They... do have a monopoly on foundry capacity, especially if you're looking at the most advanced nodes? Nobody's going to Intel or Samsung to build 3nm processors. Hell, there have been whispers over the past month that even Samsung might start outsourcing Exynos to TSMC; Intel already did that with Lunar Lake.
Having a monopoly doesn't mean that you are engaging in anticompetitive behavior, just that you are the only real option in town.
brookst 26 days ago [-]
This gets at the classic problem in defining a monopoly: how you define the market. Every company is a monopoly if you define the market narrowly enough. Ford has a monopoly on F-150s.
I would argue that defining a semiconductor market in terms of node size is too narrow. Just because TSMC is getting the newest nodes first does not mean they have a monopoly in the semiconductor market. We can play semantics, but for any meaningful discussion of monopolistic behaviors, a temporary technical advantage seems a poor way to define the term.
vitus 26 days ago [-]
> Just because TSMC is getting the newest nodes first does not mean they have a monopoly in the semiconductor market.
Sure. Market research also places them as having somewhere around 65% of worldwide foundry sales [0], with Samsung coming in second place with about 12% (mostly first-party production). Fact is that nobody else comes close to providing real competition for TSMC, so they can charge whatever prices they want, whether you're talking about the 3nm node or the 10nm node.
Rounding out the top five... SMIC (6%) is out of the question unless you're based in China due to various sanctions, UMC (5%) mainly sell decade+-old processes (22nm and larger), and Global Foundries explicitly has abandoned keeping up with the latest technologies.
If you exclude the various Chinese foundries and subtract off Samsung's first-party development, TSMC's share of available foundry capacity for third-party contracts likely grows to 70% or more. At what point do you consider this to be a monopoly? Microsoft Windows has about 72% of desktop OS share.
Vecr 27 days ago [-]
Will they? Given the structure of global controls on GPUs, Nvidia is a de-facto self funding US government company.
Maybe the US will do something if GPU price becomes the limit instead of the supply of chips and power.
kadoban 27 days ago [-]
What effect did the DOJ have on MS in the 90s? Didn't all of that get rolled back before they had to pay a dime, and all it amounted to was that browser choice screen that was around for a while? Hardly a crippling blow. If anything that showed the weakness of regulators in fights against big tech, just outlast them and you're fine.
shiroiushi 27 days ago [-]
>I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.
It sounds like you're expecting extreme competence from the DOJ. Given their history with regulating big tech companies, and even worse, the incoming administration, I think this is a very unrealistic expectation.
perching_aix 27 days ago [-]
And I'm supposed to believe that HN is this amazing platform for technology and science discussions, totally unlike its peers...
Also, I'd take HN as being an amazing platform for the overall consistency and quality of moderation. Anything beyond that depends more on who you're talking to than where at.
petesergeant 27 days ago [-]
Maybe be the change you want to see and tell us what the real story is?
perching_aix 27 days ago [-]
We seem to disagree on what the change in the world I'd like to see is like, which is a real shocker I'm sure.
Personally, I think that's when somebody who has no real information to contribute doesn't try to pretend that they do.
So thanks for the offer, but I think I'm already delivering in that realm.
shadowgovt 26 days ago [-]
Oh, there's basically no chance of getting that on the Internet.
The Internet is a machine that highly simplifies the otherwise complex technical challenge of wide-casting ignorance. It wide-casts wisdom too, but it's an exercise for the reader to distinguish them.
llm_trw 27 days ago [-]
I don't really care what you believe.
Everyone who's dug deep into what AMD is doing has left in disgust if they are lucky, and in bankruptcy if they are not.
If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.
AnthonyMouse 27 days ago [-]
> If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.
This seems like unuseful advice if you've already given up on them.
You tried it and at some point in the past it wasn't ready. But by not being ready they're losing money, so they have a direct incentive to fix it. Which would take a certain amount of time, but once you've given up you no longer know if they've done it yet or not, at which point your advice would be stale.
Meanwhile the people who attempt it apparently seem to get acquired by Nvidia, for some strange reason. Which implies it should be a worthwhile thing to do. If they've fixed it by now which you wouldn't know if you've stopped looking, or they fix it in the near future, you have a competitive advantage because you have access to lower cost GPUs than your rivals. If not, but you've demonstrated a serious attempt to fix it for everyone yourself, Nvidia comes to you with a sack full of money to make sure you don't finish, and then you get a sack full of money. That's win/win, so rather than nobody doing it, it seems like everybody should be doing it.
llm_trw 27 days ago [-]
I've tried it three times.
I've seen people try it every six months for two decades now.
At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.
I'm deeply worried about stagnation in the CPU space now that they are top dog and Intel is dead in the water.
Here's hoping China and RISC-V save us.
>Meanwhile the people who attempt it apparently seem to get acquired by Nvidia
Everyone I've seen base jumping has gotten a sponsorship from Red Bull; ergo, everyone should base jump.
Ignore the red smears around the parking lot.
AnthonyMouse 27 days ago [-]
> At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.
AMD has always punched above their weight. Historically their problem was that they were the much smaller company and under heavy resource constraints.
Around the turn of the century the Athlon was faster than the Pentium III and then they made x86 64-bit when Intel was trying to screw everyone with Itanic. But the Pentium 4 was a marketing-optimized design that maximized clock speed at the expense of heat and performance per clock. Intel was outselling them even though the Athlon 64 was at least as good if not better. The Pentium 4 was rubbish for laptops because of the heat problems, so Intel eventually had to design a separate chip for that, but they also had the resources to do it.
That was the point that AMD made their biggest mistake. When they set out to design their next chip the competition was the Pentium 4, so they made a power-hungry monster designed to hit high clock speeds at the expense of performance per clock. But the reason more people didn't buy the Athlon 64 wasn't that they couldn't figure out that a 2.4GHz CPU could be faster than a 2.8GHz CPU, it was all the anti-competitive shenanigans Intel was doing behind closed doors to e.g. keep PC OEMs from featuring systems with AMD CPUs. Meanwhile by then Intel had figured out that the Pentium 4 was, in fact, a bad design, when their own Pentium M laptops started outperforming the Pentium 4 desktops. So the Pentium 4 line got canceled and Bulldozer had to go up against the Pentium M-based Core, which nearly bankrupted AMD and compromised their ability to fund the R&D needed to sustain state of the art fabs.
Since then they've been climbing back out of the hole but it wasn't until Ryzen in 2017 that you could safely conclude they weren't on the verge of bankruptcy, and even then they were saddled with a lot of debt and contracts requiring them to use the uncompetitive Global Foundries fabs for several years. It wasn't until Zen4 in 2022 that they finally got to switch the whole package to TSMC.
So until quite recently the answer to the question "why didn't they do X?" was obvious. They didn't have the money. But now they do.
Dylan16807 26 days ago [-]
> So until quite recently the answer to the question "why didn't they do X?" was obvious. They didn't have the money. But now they do.
Seven and a half years.
The excuse is threadbare at best. They are not doing a reasonable job of making compute work off the shelf.
AnthonyMouse 26 days ago [-]
> Seven and a half years.
Seven and a half years was the 2017 Ryzen release date. Zen 1 took them from being completely hopeless to having something competitive but only just, because they were still having the whole thing fabbed by GF. Their revenue didn't exceed what it was in 2011 until 2019 and didn't exceed Intel's until 2022. It's still less than Nvidia, even though AMD is fielding CPUs competitive with Intel and GPUs competitive with Nvidia at the same time.
They had a pretty good revenue jump in 2021 but much of that was used to pay down debt, because debt taken on when you're almost bankrupt tends to have unfavorable terms. So it wasn't until somewhere in 2022 that they finally got free of GF and the old debt and could start doing something about this. But then it takes some amount of time to actually do it, and you would expect to be seeing the results of that approximately right now. Which seems like a silly time to stop looking.
Also, somewhat counterintuitively, George Hotz et al seem to be employing a strategy in the nature of "say bad things about them in public to shame them into improving", which has the dual result of actually working (they fix a lot of the things he's complaining about) but also making people think that things are worse than they are because there is now a large public archive of rants about things they've already fixed. It's not clear if this is the company not providing a good mechanism for people to complain about things like that in private and have them fixed promptly so it doesn't have take media attention to make it happen, or it's George Hotz seeking publicity as is his custom, or some combination of both.
Dylan16807 26 days ago [-]
It has also been quite a while since Zen+ and Zen 2. Those poured in money, and they absolutely did not need to wait until they had more revenue than some chunk of Intel or until their debt was gone. If you think they got properly started on this in 2022, that's pretty damning.
I'm not basing anything on geohotz, just general discussions from people that have tried, and my own experience of trying to get some popular compute code bases to run. It has been so lacking compared to AMD's own support for games. I'm not going to be "silly" and "stop looking" going forward, but I'm not going to forget how long my card was largely abandoned. It went directly from "not ready yet, working on it" to "obsolete, maybe dregs will be added later".
AnthonyMouse 25 days ago [-]
> It has also been quite a while since Zen+ and Zen 2. Those poured in money
Zen+ and Zen 2 were released in 2019. Their revenue in 2019 was only 2.5% higher than it was in 2011; adjusted for inflation it was still down more than 10%.
> they absolutely did not need to wait until they had more revenue than some chunk of Intel or until their debt was gone.
The premise of the comparison is that it shows the resources they have available. To make the same level of investment as a bigger company you either have to take it out of profit (not possible when your net profit has a minus sign in front of it or is only a single digit) or you have to make more money first.
And carrying high interest debt when you're now at much lower risk of default is pretty foolish. You'd be paying interest that could be going to R&D. Even if you want to borrow money in order to invest it, the thing to do is to pay back the high interest debt and then borrow the money again now that you can get better terms, which seems to be just what they did.
Dylan16807 25 days ago [-]
I never said anything about wanting them to invest the same amount as nvidia or Intel. I think a handful of extra people could have made a big difference, in particular if some of them had the sole task of bringing their consumer cards into the support list.
It is so bad that they had major cards that were never on the support list for compute.
> You'd be paying interest that could be going to R&D.
Getting people to actually consider your datacenter cards, because they know how to use your cards, will get you more R&D money.
Const-me 27 days ago [-]
> I've tried it three times
Have you tried compute shaders instead of that weird HPC-only stuff?
Compute shaders are widely used by millions of gamers every day. GPU vendors have huge incentive to make them reliable and efficient: modern game engines are using them for lots of things, e.g. UE5 can even render triangle meshes with GPU compute instead of graphics (the tech is called nanite virtualized geometry). In practice they work fine on all GPUs, ML included: https://github.com/Const-me/Cgml
perching_aix 27 days ago [-]
I'd be very concerned if somebody makes a $100K decision based on a comment where the author couldn't even differentiate between the words "constitutionally" and "institutionally", while providing as much substance as any other random techbro on any random forum and being overwhelmingly oblivious to it.
lofaszvanitt 27 days ago [-]
It had to destroy itself. These companies do not act on their own...
zamalek 27 days ago [-]
I have been playing around with Phi-4 Q6 on my 7950x and 7900XT (with HSA_OVERRIDE_GFX_VERSION). It's bloody fast, even with CPU alone - in practical terms it beats hosted models due to the roundtrip time. Obviously perf is more important if you're hosting this stuff, but we've definitely reached AMD usability at home.
slavik81 26 days ago [-]
If you're not using your iGPU, you can disable it in BIOS and you won't need to set HSA_OVERRIDE_GFX_VERSION.
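(For anyone who wants to try the pattern being discussed, here is a minimal sketch assuming a ROCm build of PyTorch. The "11.0.0" value targets the RDNA3 gfx1100 ISA used by the 7900-series cards and is an assumption; adjust it, or drop it entirely, for your own setup, e.g. if you disable the iGPU as suggested above.)

    # Minimal sketch, assuming a ROCm build of PyTorch. The override must be set
    # before the HIP runtime initializes, i.e. before importing torch.
    import os
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")  # assumed RDNA3/gfx1100 target

    import torch  # ROCm builds expose HIP devices through the torch.cuda API

    if torch.cuda.is_available():
        print("Using", torch.cuda.get_device_name(0))
    else:
        print("No HIP device visible; falling back to CPU")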
throwaway314155 27 days ago [-]
> Aug 9, 2023
Ignoring the very old (in ML time) date of the article...
What's the catch? People are still struggling with this a year later so I have to assume it doesn't work as well as claimed.
I'm guessing this is buggy in practice and only works for the HF models they chose to test with?
Const-me 27 days ago [-]
It’s not terribly hard to port ML inference to alternative GPU APIs. I did it for D3D11 and the performance is pretty good too: https://github.com/Const-me/Cgml
The only catch is, for some reason developers of ML libraries like PyTorch aren't interested in open GPU APIs like D3D or Vulkan. Instead, they focus on proprietary ones, i.e. CUDA and, to a lesser extent, ROCm. I don't know why that is.
D3D-based videogames have been using GPU compute heavily for more than a decade now. Since Valve shipped the Steam Deck, the same applies to Vulkan on Linux. By now, both technologies are stable, reliable and performant.
jsheard 27 days ago [-]
Isn't part of it because the first-party libraries like cuDNN are only available through CUDA? Nvidia has poured a ton of effort into tuning those libraries so it's hard to justify not using them.
Const-me 27 days ago [-]
Unlike training, ML inference is almost always bound by memory bandwidth as opposed to computations. For this reason, tensor cores, cuDNN, and other advanced shenanigans make very little sense for the use case.
OTOH, general-purpose compute instead of fixed-function blocks used by cuDNN enables custom compression algorithms for these weights, which does help by saving memory bandwidth. For example, I did custom 5 bits/weight quantization which works on all GPUs, no hardware support necessary, just simple HLSL code: https://github.com/Const-me/Cgml?tab=readme-ov-file#bcml1-co...
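(As an aside, here is a hedged NumPy sketch of the general idea - blockwise low-bit weights plus a per-block scale, decompressed on the fly. This is not Const-me's BCML1 format, just an illustration of why fewer bits per weight means fewer bytes to stream per token.)

    # Illustrative blockwise 4-bit quantization -- not BCML1, just the general idea.
    import numpy as np

    BLOCK = 32  # weights per block; each block stores one fp16 scale

    def quantize_q4(w):
        w = w.reshape(-1, BLOCK)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12  # map into [-7, 7]
        # Kept as int8 for clarity; a real format packs two 4-bit values per byte,
        # which is what the byte count below assumes.
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale.astype(np.float16)

    def dequantize_q4(q, scale):
        return (q.astype(np.float32) * scale).reshape(-1)

    w = np.random.randn(4096 * 4096).astype(np.float32)
    q, s = quantize_q4(w)

    fp16_bytes = w.size * 2
    q4_bytes = w.size // 2 + (w.size // BLOCK) * 2  # 4 bits/weight + fp16 scales
    print(f"fp16: {fp16_bytes / 2**20:.0f} MiB, q4: {q4_bytes / 2**20:.0f} MiB "
          f"(~{fp16_bytes / q4_bytes:.1f}x less to stream per token)")
    print("max abs error:", float(np.abs(dequantize_q4(q, s) - w).max()))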
boroboro4 27 days ago [-]
Only local (read: batch size 1) ML inference is memory bound; production loads are pretty much compute bound. The prefill phase is very compute bound, and with continuous batching the generation phase gets mixed with prefill, which makes the whole process compute bound as well. So no, tensor cores and all the other shenanigans are absolutely critical for performant inference infrastructure.
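(To put rough numbers on that: a back-of-the-envelope roofline sketch, assuming fp16 weights and ballpark RTX 3090-class spec-sheet figures of ~71 fp16 TFLOPS and ~936 GB/s. The exact numbers are assumptions; only the ratio matters.)

    # When does decoding stop being memory bound? Ballpark spec-sheet values.
    TFLOPS = 71e12        # assumed dense fp16 tensor throughput, FLOP/s
    BANDWIDTH = 936e9     # assumed memory bandwidth, bytes/s
    BYTES_PER_WEIGHT = 2  # fp16 weights

    # A batch of B token streams reuses each weight B times, so the matmuls do
    # roughly 2*B FLOPs (multiply + add) per weight read from memory.
    machine_balance = TFLOPS / BANDWIDTH               # FLOPs the GPU can do per byte
    breakeven_batch = machine_balance * BYTES_PER_WEIGHT / 2

    print(f"machine balance ~{machine_balance:.0f} FLOPs/byte")
    print(f"decode stays memory bound below a batch size of roughly {breakeven_batch:.0f}")
    # Batch 1 (local chat) is ~1 FLOP/byte -> deeply memory bound.
    # Prefill processes the whole prompt at once, so it behaves like a large
    # batch and is compute bound, as described above for production serving.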
Const-me 27 days ago [-]
PyTorch is a project by Linux foundation. The about page with the mission of the foundation contains phrases like “empowering generations of open source innovators”, “democratize code”, and “removing barriers to adoption”.
I would argue running local inference with batch size=1 is more useful for empowering innovators compared to running production loads on shared servers owned by companies. Local inference increases count of potential innovators by orders of magnitude.
BTW, in the long run it may also benefit these companies because in theory, an easy migration path from CUDA puts a downward pressure on nVidia’s prices.
idonotknowwhy 27 days ago [-]
Most people running local inference do so through quants with llama.cpp (which runs on everything) or awq/exl2/mlx with vllm/tabbyAPI/lmstudio, which are much faster than using PyTorch directly
llama.cpp has a much bigger supported model list, as does vLLM and of course PyTorch/HF transformers covers everything else, all of which work w/ ROCm on RDNA3 w/o too much fuss these days.
For inference, the biggest caveat is that Flash Attention is only an aotriton implementation, which besides being less performant sometimes, also doesn't support SWA. For CDNA there is a better CK-based version of FA, but CK does not have RDNA support. There are a couple of people at AMD apparently working on native FlexAttention, so I guess we'll see how that turns out.
(Note: the recent SemiAccurate piece was on training, which I'd agree is in a much worse state; I have personal experience with it being often broken for even the simplest distributed training runs. Funnily enough, if you're running simple fine tunes on a single RDNA3 card, you'll probably have a better time. OOTB, a 7900 XTX will train at about the same speed as an RTX 3090. 4090s blow both of those away, but you'll probably want more cards and VRAM, or just move to H100s.)
27 days ago [-]
mattfrommars 27 days ago [-]
Great, I have yet to understand why the ML community doesn't really push to move away from CUDA. To me, it feels like a dinosaur move to build on top of CUDA, which screams proprietary: nothing about it is open source or cross platform.
The reason I say it's a dinosaur move is: imagine we as a dev community had continued to build on top of Flash or Microsoft Silverlight...
LLMs and ML have been out for quite a while, and given the pace of AI/LLM advancement, the move to cross platform should have happened much more quickly. But it hasn't yet, and I'm not sure when it will.
Building a translation layer on top of CUDA is not the answer to this problem either.
idonotknowwhy 27 days ago [-]
For me personally, hacking together projects as a hobbyist, 2 reasons:
1. It just works. When I tried to build things on Intel Arcs, I spent way more hours bikeshedding ipex and driver issues than developing
2. LLMs seem to have more cuda code in their training data. I can leverage claude and 4o to help me build things with cuda, but trying to get them to help me do the same things on ipex just doesn't work.
I'd very much love a translation layer for Cuda, like a dxvk or wine equivalent.
Would save a lot of money since Arc gpus are in the bargain bin and nvidia cloud servers are double the price of AMD.
As it stands now, my dual Intel Arc rig is now just a llama.cpp inference server for the family to use.
jeroenhd 27 days ago [-]
If CUDA counts as “just works”, I dread to see the dark, unholy rituals you need to invoke to get ROCm to work. I have spent too many hours browsing the Nvidia forums for obscure error codes and driver messages to ever consider updating my CUDA install and every time I reboot my desktop for an update I dread having to do it all over again.
What kind of models do you run, and what's the token output like on Intel GPUs?
dwood_dev 27 days ago [-]
Except I never hear complaints about CUDA from a quality perspective. The complaints are always about lock in to the best GPUs on the market. The desire to shift away is to make cheaper hardware with inferior software quality more usable. Flash was an abomination, CUDA is not.
AnthonyMouse 27 days ago [-]
Flash was popular because it was an attractive platform for the developer. Back then there was no HTML5 and browsers didn't otherwise support a lot of the things Flash did. Flash Player was an abomination, it was crashy and full of security vulnerabilities, but that was a problem for the user rather than the developer and it was the developer choosing what to use to make the site.
This is pretty much exactly what happens with CUDA. Developers like it but then the users have to use expensive hardware with proprietary drivers/firmware, which is the relevant abomination. But users have some ability to influence developers, so as soon as we get the GPU equivalent of HTML5, what happens?
wqaatwt 27 days ago [-]
> users have to use expensive hardware with proprietary drivers/firmware
What do you mean by that? People trying to run their own models are not “the users” they are a tiny insignificant niche segment.
AnthonyMouse 27 days ago [-]
There are far more people running llama.cpp, various image generators, etc. than there are people developing that code. Even when the "users" are corporate entities, they're not necessarily doing any development in excess of integrating the existing code with their other systems.
We're also likely to see a stronger swing away from "do inference in the cloud" because of the aligned incentives of "companies don't want to pay for all that hardware and electricity" and "users have privacy concerns" such that companies doing inference on the local device will have both lower costs and a feature they can advertise over the competition.
What this is waiting for is hardware in the hands of the users that can actually do this for a mass market price, but there is no shortage of companies wanting a piece of that. In particular, Apple is going to be pushing that hard and despite the price they do a lot of volume, and then you're going to start seeing more PCs with high-VRAM GPUs or iGPUs with dedicated GDDR/HBM on the package as their competitors want feature parity for the thing everybody is talking about, the cost of which isn't actually that high, e.g. 40GB of GDDR6 is less than $100.
xedrac 27 days ago [-]
Maybe the situation has gotten better in recent years, but my experience with Nvidia toolchains was a complete nightmare back in 2018.
claytonjy 27 days ago [-]
The CUDA situation is definitely better. The Nvidia struggles are now with the higher-level software they're pushing (Triton, TensorRT-LLM, Riva, etc), tools that are the most performant option when they work, but a garbage developer experience when you step outside the golden path
cameron_b 27 days ago [-]
I want to double-down on this statement, and call attention to the competitive nature of it. Specifically, I have recently tried to set up Triton on Arm hardware. One might presume Nvidia would give attention to an architecture they develop, but the way forward is not easy. On some versions of Ubuntu you might have the correct version of Python (usually older than what's packaged), but the current LTS is out of luck for guidance or packages.
I think you’ve mixed up your Tritons; I’m talking about Triton Inference Server from NVIDIA while you’re talking about Triton, the CUDA replacement from OpenAI
lasermike026 27 days ago [-]
I believe these efforts are very important. If we want this stuff to be practical we are going to have to work on efficiency. Price efficiency is good. Power and compute efficiency would be better.
I have been playing with llama.cpp to run inference on conventional CPUs. No conclusions but it's interesting. I need to look at llamafile next.
A used 3090 is $600-900, performs better than 7900, and is much more versatile because CUDA
Uehreka 27 days ago [-]
Reality check for anyone considering this: I just got a used 3090 for $900 last month. It works great.
I would not recommend buying one for $600, it probably either won’t arrive or will be broken. Someone will reply saying they got one for $600 and it works, that doesn’t mean it will happen if you do it.
I’d say the market is realistically $900-1100, maybe $800 if you know the person or can watch the card running first.
All that said, this advice will expire in a month or two when the 5090 comes out.
idonotknowwhy 27 days ago [-]
I've bought 5 used and they're all perfect. But that's what buyer protection on ebay is for. Had to send back an Epyc mobo with bent pins and ebay handled it fine.
fireant 26 days ago [-]
I bought a used 3090 last year for ML, and while it works fine and has the correct DRAM and so on, when I tried gaming on it I noticed that it is significantly slower than my 3080. I'm not sure if the seller pulled some shenanigans on me or the card actually degraded during whatever mining they did.
Just beware, the card might be "working fine" at first glance, but actually be damaged.
ryao 26 days ago [-]
I got a refurbished $800 3090 Ti FE earlier this year from microcenter. Sadly, they sold out and never restocked.
coolspot 26 days ago [-]
Zotac official website has refurb 3090 ti for $899
melodyogonna 27 days ago [-]
Modular claims that it achieves 93% GPU utilization on AMD GPUs [1], official preview release coming early next year, we'll see. I must say I'm bullish because of feedback I've seen people give about the performance on Nvidia GPUs
Just an FYI, this is writeup from August 2023 and a lot has changed (for the better!) for RDNA3 AI/ML support.
That being said, I did some very recent inference testing on a W7900 (using the same testing methodology used by Embedded LLM's recent post to compare to vLLM's recently added Radeon GGUF support [1]) and MLC continues to perform quite well. On Llama 3.1 8B, MLC's q4f16_1 (4.21GB weights) performed +35% faster than llama.cpp w/ Q4_K_M w/ their ROCm/HIP backend (4.30GB weights, 2% size difference).
That makes MLC still the generally fastest standalone inference engine for RDNA3 by a country mile. However, you have much less flexibility with quants and by and large have to compile your own for every model, so llama.cpp is probably still more flexible for general use. Also, llama.cpp's recently added (in llama-server) speculative decoding can give some pretty sizable performance gains. Using a 70B Q4_K_M + 1B Q8_0 draft model improves output token throughput by 59% on the same ShareGPT testing. I've also been running tests with Qwen2.5-Coder, and using a 0.5-3B draft model for speculative decoding gives even bigger gains on average (depends highly on acceptance rate).
Note, I think for local use, vLLM GGUF is still not suitable at all. When testing w/ a 70B Q4_K_M model (only 40GB), loading, engine warmup, and graph compilation took on avg 40 minutes. llama.cpp takes 7-8s to load the same model.
At this point for RDNA3, basically everything I need works/runs for my use cases (primarily LLM development and local inferencing), but almost always slower than an RTX 3090/A6000 Ampere (a new 24GB 7900 XTX is $850 atm, used or refurbished 24 GB RTX 3090s are in the same ballpark, about $800 atm; a new 48GB W7900 goes for $3600 while a 48GB A6000 (Ampere) goes for $4600). The efficiency differences can be sizable. Eg, on my standard llama-bench test w/ llama2-7b-q4_0, the RTX 3090 gets a tg128 of 168 t/s while the 7900 XTX only gets 118 t/s even though both have similar memory bandwidth (936.2 GB/s vs 960 GB/s). It's also worth noting that since the beginning of the year, the llama.cpp CUDA implementation has gotten almost 25% faster, while the ROCm version's performance has stayed static.
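(For a sense of how much of that gap is software: a rough sketch of the bandwidth-implied ceiling for batch-1 decoding, assuming ~3.6 GB of Q4_0 weights for llama2-7b. The model size is an assumption, not a measurement.)

    # Rough ceiling on batch-1 decode: every token has to stream (at least) the
    # full set of weights over the memory bus. ~3.6 GB for llama2-7b Q4_0 is assumed.
    MODEL_BYTES = 3.6e9

    cards = {
        "RTX 3090 (936.2 GB/s)": (936.2e9, 168),  # measured tg128 t/s quoted above
        "7900 XTX (960 GB/s)":   (960.0e9, 118),
    }

    for name, (bw, measured) in cards.items():
        ceiling = bw / MODEL_BYTES  # tokens/s if perfectly bandwidth bound
        print(f"{name}: ceiling ~{ceiling:.0f} t/s, measured {measured} t/s "
              f"({measured / ceiling:.0%} of ceiling)")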
There is an actively (solo dev) maintained fork of llama.cpp that sticks close to HEAD but basically applies a rocWMMA patch that can improve performance if you use the llama.cpp FA (still performs worse than w/ FA disabled) and in certain long-context inference generations (on llama-bench and w/ this ShareGPT serving test you won't see much difference) here: https://github.com/hjc4869/llama.cpp - The fact that no one from AMD has shown any interest in helping improve llama.cpp performance (despite often citing llama.cpp-based apps in marketing/blog posts, etc.) is disappointing ... but sadly on brand for AMD GPUs.
Anyway, for those interested in more information and testing for AI/ML setup for RDNA3 (and AMD ROCm in general), I keep a doc with lots of details here: https://llm-tracker.info/howto/AMD-GPUs
Intriguing. I thought AMD GPUs didn't have tensor cores (or matrix multiplication units) like NVidia. I believe they only have dot product / fused multiply-accumulate instructions.
Are these LLMs just absurdly memory bound so it doesn't matter?
boroboro4 27 days ago [-]
They absolutely do have cores similar to tensor cores; they're called matrix cores. And they have particular instructions to utilize them (MFMA).
Note I'm talking about DC compute chips, like MI300.
LLMs aren't memory bound in production loads; they are pretty much compute bound, at least in the prefill phase, but in practice in general too.
almostgotcaught 27 days ago [-]
Ya people in these comments don't know what they're talking about (no one ever does in these threads). AMDGPU has had MMA and WMMA for a while now
They don’t, but GPUs were designed for doing matrix multiplications even without the special hardware instructions for doing matrix multiplication tiles. Also, the forward pass for transformers is memory bound, and that is what does token generation.
dragontamer 27 days ago [-]
Well sure, but in other GPU tasks, like Raytracing, the difference between these GPUs is far more pronounced.
And AMD has passable Raytracing units (NVidia's are better, but the difference is bigger than what these LLM results show).
If RAM is the main bottleneck then CPUs should be on the table.
IX-103 27 days ago [-]
> If RAM is the main bottleneck then CPUs should be on the table
That's certainly not the case. The graphics memory model is very different from the CPU memory model. Graphics memory is explicitly designed for multiple simultaneous reads (spread across several different buses) at the cost of generality (only portions of memory may be available on each bus) and speed (the extra complexity means reads are slower). This makes them fast at doing simple operations on a large amount of data.
CPU memory only has one bus, so only a single read can happen at a time (a cache line read), but can happen relatively quickly. So CPUs are better for workloads with high memory locality and frequent reuse of memory locations (as is common in procedural programs).
dragontamer 27 days ago [-]
> CPU memory only has one bus
If people are paying $15,000 or more per GPU, then I can choose $15,000 CPUs like EPYC that have 12-channels or dual-socket 24-channel RAM.
Even desktop CPUs are dual-channel at a minimum, and arguably DDR5 is closer to 2 or 4 buses per channel.
GPUs are about extremely parallel performance, above and beyond what traditional single-threaded (or limited-SIMD) CPUs can do.
But if you're waiting on RAM anyway?? Then the compute method doesn't matter. It's all about RAM.
ryao 26 days ago [-]
Where are these GPUs with multiple buses? I only know of GPUs with wide buses.
webmaven 27 days ago [-]
RAM is (often) the bottleneck for highly parallel GPUs, but not for CPUs.
Though the distinction between the two categories is blurring.
ryao 27 days ago [-]
Memory bandwidth is the bottleneck for both when running GEMV, which is the main operation used by token generation in inference. It has always been this way.
schmidtleonard 27 days ago [-]
CPUs have pitiful RAM bandwidth compared to GPUs. The speeds aren't so different but GPU RAM busses are wiiiiiiiide.
teleforce 27 days ago [-]
Compute Express Link (CXL) should mostly solve limited RAM with CPU:
Gigabytes per second? What is this, bandwidth for ants?
My years old pleb tier non-HBM GPU has more than 4 times the bandwidth you would get from a PCIe Gen 7 x16 link, which doesn't even officially exist yet.
teleforce 27 days ago [-]
Yes, CXL will soon benefit from PCIe Gen 7 x16 with an expected 64GB/s in 2025, and the non-HBM bandwidth I/O alternatives are improving rapidly by the day. For most near real-time LLM inference it will be feasible. For the majority of SME companies and other DIY users (humans or ants) running their localized LLMs, it should not be an issue [1],[2]. In addition, new techniques for more efficient LLMs are being discovered that reduce memory consumption [3].
[1] Forget ChatGPT: why researchers now run small AIs on their laptops:
The smaller LLM stuff in 1 and 2 is overrated. LLMs get plenty of things wrong, and while the capabilities of small LLMs are increasing, they are just never as good as the larger LLMs in my testing. To give an example, between a small LLM that gets things right 20% of the time and a large one that gets things right 40% of the time, you are never going to want to deal with the small one. Even when they improve, you will just find new things that they are not able to do well. At least, that is my experience.
Finally, the 75% savings figure in 3 is misleading. It applies to the context, not the LLMs themselves. It is very likely that nobody will use it, since it is a form of lossy compression that will ruin the ability of the LLM to repeat what is in its memory.
schmidtleonard 27 days ago [-]
No. Memory bandwidth is the important factor for LLM inference. 64GB/s is 4x less than the hypothetical I granted you (Gen7x16 = 256GB/s), which is 4x less than the memory bandwidth on my 2 year old pleb GPU (1TB/s), which is 10x less than a state of the art professional GPU (10TB/s), which is what the cloud services will be using.
That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.
Nothing says it can't be useful. My most-used model is running in a microcontroller. Just keep those expectations tempered.
(EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)
Bandwidth between where the LLM is stored and where your matrix*vector multiplies are done is the important figure for inference. You want to measure this in terabytes per second, not gigabytes per second.
A 7900XTX also has 1TB/s on paper, but you'll need awkward workarounds every time you want to do something (see: article) and half of your workloads will stop dead with driver crashes and you need to decide if that's worth $500 to you.
Stacking 3090s is the move if you want to pinch pennies. They have 24GB of memory and 936GB/s of bandwidth each, so almost as good as the 4090, but they're as cheap as the 7900XTX with none of the problems. They aren't as good for gaming or training workloads, but for local inference 3090 is king.
It's not a coincidence that the article lists the same 3 cards. These are the 3 cards you should decide between for local LLM, and these are the 3 cards a true competitor should aim to exceed.
Dylan16807 27 days ago [-]
A 4090 is not "years old pleb tier". Same for 3090 and 7900XTX.
There's a serious gap between CXL and RAM, but it's not nearly as big as it used to be.
ryao 27 days ago [-]
The 3090 Ti and 4090 both have 1.01TB/sec memory bandwidth:
But as I addressed earlier, those are not "years old pleb tier".
adrian_b 27 days ago [-]
Already an ancient Radeon VII from 5 years ago had 1 terabyte per second of memory bandwidth.
Later consumer GPUs have regressed and only RTX 4090 offers the same memory bandwidth in the current NVIDIA generation.
Dylan16807 27 days ago [-]
Radeon VII had HBM.
So I can understand a call for returning to HBM, but it's an expensive choice and doesn't fit the description.
ryao 27 days ago [-]
That seems unlikely given that the full HBM supply for the next year has been earmarked for enterprise GPUs. That said, it would be definitely nice if HBM became available for consumer GPUs.
fc417fc802 27 days ago [-]
RTX 4090 comes to mind. Dunno that I'd consider that a "years old pleb tier non-HBM GPU" though.
ryao 27 days ago [-]
The main bottleneck is memory bandwidth. CPUs have less memory bandwidth than GPUs.
throwaway314155 27 days ago [-]
> Are these LLMs just absurdly memory bound so it doesn't matter?
During inference? Definitely. Training is another story.
mrcsharp 27 days ago [-]
I will only consider AMD GPUs for LLM when I can easily make my AMD GPU available within WSL and Docker on Windows.
For now, it is as if AMD does not exist in this field for me.
e-max 26 days ago [-]
Isn't it already available somehow? I didn't test it seriously, I just needed to quickly run Whisper but
I got a "gaming" PC for LLM inference with an RTX 3060. I could have gotten more VRAM for my buck with AMD, but didn't because at the time alot of inference needed CUDA.
As soon AMD is as good as Nvidia for inference, I'll switch over.
But I've read on here that their hardware engineers aren't even given enough hardware to test with...
Sparkyte 27 days ago [-]
More players in the market the better. AI shouldn't be owned by one business.
sroussey 27 days ago [-]
[2023]
Btw, this is from MLC-LLM which makes WebLLM and other good stuff.
guerrilla 27 days ago [-]
So, does ollama use this work or does it do something else? How does it compare?
starlite-5008 27 days ago [-]
[dead]
varelse 26 days ago [-]
[dead]
27 days ago [-]
leonewton253 27 days ago [-]
This benchmark doesn't look right. Is it using the tensor cores in the Nvidia GPU? AMD does not have AI cores, so it should run noticeably slower.
Beyond that, not much else.
There's a bunch of similar setups and there are a couple of dozen people that have done something similar on /r/localllama.
[1] https://www.linkedin.com/feed/update/urn:li:activity:7275885...
[1] https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html [2] https://embeddedllm.com/blog/vllm-now-supports-running-gguf-...
https://x.com/nisten/status/1871325538335486049
The market for the extreme high-end consumer is pretty small, so they're only missing out on clout.
PyTorch requires its own support kit separate from the oneAPI Toolkit (and runs slightly different versions of everything), the vLLM xpu support doesn't work - both source and the docker failed to build/run for me. The IPEX-LLM whisper support is completely borked, etc, etc.
SYCL with llama.cpp is great though, at least at FP16 (it supports nothing else), but even Arc iGPUs easily give 2-4x the performance of CPU inference.
Intel should've just contributed to SYCL instead of trying to make their own thing and then forgetting to keep maintaining it halfway through.
I have some benchmarks as well, and the IPEX-LLM backend performed a fair bit better than the SYCL llama.cpp backend for me (almost +50% pp512 and almost 2X tg128) so worth getting it working if you plan on using llama.cpp much on an Intel system. SYCL still performs significantly better than Vulkan and CPU backends, though.
As an end-user, I agree that it'd be way better if they could just contribute upstream somehow (whether to the SYCL backend, or if not possible, to a dependency-minimized IPEX backend). The IPEX backend is one of the more maintained parts of IPEX-LLM, btw. I found a lot of stuff in that repo that depends on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
Honestly I think there's just something seriously broken with the way IPEX expects the GPU driver to be on 24.04 and there's nothing I can really do about it except wait for them to fix it if I want to keep using this OS.
I am vaguely considering adding another drive and installing 22.04 or 20.04 with the exact kernel they want, to see if that might finally work in the meantime, but honestly I'm fairly satisfied with the speed I get from SYCL already. The problem is more that it's annoying to integrate it directly through the server endpoint; every project expects a damn ollama API or llama-cpp-python these days, and I'm a fan of neither since it's just another layer of headaches to get those compiled with SYCL.
> I found a lot of stuff in that repo that depend on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
Yeah well the fact that oneAPI 2025 got released, broke IPEX, and they still haven't figured out a way to patch it for months makes me think it's total chaos internally, where teams work against each other instead of talking and coordinating.
That's not a long-term success strategy. Maybe good for getting your name in the conversation, but not sustainable.
One of the benefits of being Intel.
Profit can come after positive brand recognition for the product.
That said, the 4070 die is 294mm^2 while the B580 die is 272mm^2.
While defects ordinarily reduce yields, Intel put plenty of redundant transistors into the silicon. This is ordinarily not possible to estimate, but Tom Petersen reported in his interview with hardware unboxed that they did not count those when reporting the transistor count. Given that the density based on reported transistors is about 40% less than the density others get from the same process and the silicon in GPUs is already fairly redundant, they likely have a backup component for just about everything on the die. The consequence is that they should be able to use at least 99% of those dies even after tossing unusable dies, such that the $57 per die figure is likely correct.
As for the rest of the card, there is not much in it that would not be part of the price of an $80 Asrock motherboard. The main thing would be the bundled game, which they likely can get in bulk at around $5 per copy. This seems reasonable given how much Epic games pays for their giveaways:
https://x.com/simoncarless/status/1389297530341519362
That brings the total cost to $190. If we assume Asrock and the retailer both have a 10% margin on the $80 motherboard used as a substitute for the costs of the rest of the things, then it is $174. Then we need to add margins for board partners and the retailers. If we assume they both get 10% of the $250, then that leaves a $26 profit for Intel, provided that they have economics of scale such that the $80 motherboard approximation for the rest of the cost of the graphics card is accurate.
That is about a 10% margin for Intel. That is not a huge margin, but provided enough sales volume (to match the sales volume Asrock gets on their $80 motherboards), Intel should turn a profit on these versus not selling these at all. Interestingly, their board partners are not able/willing to hit the $250 MSRP and the closest they come to it is $260 so Intel is likely not sharing very much with them.
It should be noted that Tom Petersen claimed during his hardware unboxed interview that they were not making money on these. However, that predated the B580 being a hit and likely relied on expected low production volumes due to low sales projections. Since the B580 is a hit and napkin math says it is profitable as long as they build enough of them, I imagine that they are ramping production to meet demand and reach profitability.
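(A sketch that just re-runs the napkin math above, taking the ~$190 build-cost estimate and the 10% margin assumptions from the comment as given:)

    # Re-running the napkin math; the inputs come from the estimate above,
    # not from independent sourcing.
    MSRP = 250
    build_cost = 190   # die + rest of the card + bundled game, per the estimate above
    board_proxy = 80   # the $80 motherboard used as a stand-in for "the rest"

    # The $80 stand-in already includes ~10% margins for Asrock and the retailer,
    # so strip two 10% cuts of $80 to approximate Intel's own cost.
    intel_cost = build_cost - 2 * 0.10 * board_proxy   # ~$174

    # Board partner and retailer each take ~10% of the $250 MSRP.
    channel_cut = 2 * 0.10 * MSRP                      # $50

    intel_profit = MSRP - channel_cut - intel_cost     # ~$26
    print(f"Intel cost ~${intel_cost:.0f}, profit ~${intel_profit:.0f} "
          f"({intel_profit / MSRP:.0%} of MSRP)")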
Still, that's to be expected considering this is still only the second generation of Arc. If they can break even on the next gen, that would be an accomplishment.
While this is not an ideal situation, it is a decent foundation on which to build the next generation, which should be able to improve profitability.
I want to believe he's wrong, but on the parts of his show where I am in a position to verify, he generally checks out. Whatever the opposite of Gell-Mann Amnesia is, he's got it going for him.
https://news.ycombinator.com/item?id=42505496
The margins might be describable as razor thin, but they are there. Whether it can recoup the R&D that they spent designing it is hard to say definitively since I do not have numbers for their R&D costs. However, their iGPUs share the same IP blocks, so the iGPUs should be able to recoup the R&D costs that they have in common with the discrete version. Presumably, Intel can recoup the costs specific to the discrete version if they sell enough discrete cards.
While this is not a great picture, it is not terrible either. As long as Intel keeps improving its graphics technology with each generation, profitability should gradually improve. Although I have no insider knowledge, I noticed a few things that they could change to improve their profitability in the next generation:
I could have read too much into things that Tom Petersen said. Then again, he did say that their design team is conservative, and the doubling rather than quadrupling of the SIMD lane count and the sheer amount of dark silicon (>40% of the die by my calculation) spent on what should be redundant components strike me as conservative design choices. Hopefully the next generation addresses these things.
Also, they really do have >40% dark silicon when doing density comparisons:
They have 41% less density than Nvidia and 48% less density than TSMC claims the process can obtain. We also know from Tom Petersen's comments that they have additional transistors on the die that are not active. Presumably, they are for redundancy. Otherwise, there really is no sane explanation that I can see for so much dark silicon. If they are using transistors that are twice the size that the density figure might suggest, they might as well have used TSMC's 7nm process, since while a smaller process can etch larger features, it is a waste of money.
Note that we can rule out the cache lowering the density. The L1 + L2 cache on the 4070 Ti is 79872 KB while it is 59392 KB on the B580. We can also rule out IO logic as lowering the density, as the 4070 Ti has a 256-bit memory bus while the B580 has a 192-bit memory bus.
https://www.techpowerup.com/gpu-specs/arc-b580.c4244
https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3...
https://en.wikipedia.org/wiki/5_nm_process#Nodes
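(Checking the density math with the figures from the pages linked above; the transistor counts, die sizes, and TSMC's claimed N5 density are assumptions taken from those sources: ~19.6B transistors / 272 mm^2 for the B580, ~35.8B / 294.5 mm^2 for the 4070 Ti's AD104, and ~138 MTr/mm^2 for N5.)

    # Density comparison behind the dark-silicon claim.
    b580_density = 19.6e3 / 272      # million transistors per mm^2
    ad104_density = 35.8e3 / 294.5
    tsmc_n5_claim = 138.0

    print(f"B580:  {b580_density:.0f} MTr/mm^2")
    print(f"AD104: {ad104_density:.0f} MTr/mm^2")
    print(f"vs Nvidia:       {1 - b580_density / ad104_density:.0%} less dense")
    print(f"vs TSMC's claim: {1 - b580_density / tsmc_n5_claim:.0%} less dense")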
The hardware unboxed interview of Tom Petersen is here:
https://youtu.be/XYZyai-xjNM
Where? The only mention I see in that interview is him briefly saying they have native 16 with "simple emulation" for 32 because some games want 32. I see no mention of or comparison to 8.
And it doesn't make sense to me that switching to actual 32 would be an improvement. Wider means less flexible here. I'd say a more accurate framing is whether the control circuitry is 1/8 or 1/16 or 1/32. Faking extra width is the part that is useful and also pretty easy.
Tom Petersen did a bunch of interviews right before the Intel B580 launch. In the hardware unboxed interview, he mentioned it, but accidentally misspoke. I must have interpreted his misspeak as meaning games want SIMD16 and noted it that way in my mind, as what he says elsewhere seems to suggest that games want SIMD16. It was only after thinking about what I heard that I realized otherwise. Here is an interview where he talks about native SIMD16 being better:
https://www.youtube.com/live/z7mjKeck7k0?t=35m38s
In specific, he says:
> But we also have native SIMD support—SIMD16 native support, which is going to say that you don’t have to like recode your compute shader to match a particular topology. You can use the one that you use for everyone else, and it’ll just run well on ARC. So I’m pretty excited about that.
In an interview with gamers nexus, he has a nice slide where he attributes a performance gain directly to SIMD16:
https://youtu.be/ACOlBthEFUw?t=16m35s
At the start of the gamers nexus video, Steve mentions that Tom‘s slides are from a presentation. I vaguely remember seeing a video of it where he talked more about SIMD16 being an improvement, but I am having trouble finding it.
Having to schedule fewer things is a definite benefit of 32 lanes over a smaller lane count. Interestingly, AMD switched from a 16 lane count to a 32 lane count with RDNA, and RDNA turned out to be a huge improvement in efficiency. The switch is actually somewhat weird since they had been emulating SIMD64 using their SIMD16 hardware, so the hardware simultaneously became wider and narrower at the same time. Their emulation of SIMD64 in SIMD16 is mentioned in this old GCN documentation describing cross lane operations:
https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operat...
That documentation talks about writing to a temporary location and reading from a temporary location in order to do cross lane operations. Contrast this with 12.5.1 of the RDNA 3 ISA documentation, where the native SIMD32 units just fetch the values from each other's registers with no mention of a temporary location:
https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
That strikes me as much more efficient. While I do not write shaders, I have written CUDA kernels and in CUDA kernels, you sometimes need to do what Nvidia calls a parallel reduction across lanes, which are cross lane operations (Intel’s CPU division calls these horizontal operations). For example, you might need to sum across all lanes (e.g. for an average, matrix vector multiplication or dot product). When your thread count matches the SIMD lane count, you can do this without going to shared memory, which is fast. If you need to emulate a higher lane width, you need to use a temporary storage location (like what AMD described), which is not as fast.
If games’ shaders are written with an assumption that SIMD32 is used, then native SIMD32 is going to be more performant than native SIMD16 because of faster cross lane operations. Intel’s slide attributes a 0.3ms reduction in render time to their switch from SIMD8 to SIMD16. I suspect that they would see a further reduction with SIMD32 since that would eliminate the need to emulate SIMD32 for games that expect SIMD32 due to Nvidia (since as late as Turing) and AMD (since RDNA 1) both using SIMD32.
To illustrate this, here are some CUDA kernels that I wrote:
https://github.com/ryao/llama3.c/blob/master/rung.cu#L15
The softmax kernel for example has the hardware emulate SIMD1024, although you would need to look at the kernel invocations in the corresponding rung.c file to know that. The purpose of doing 1024 threads is to ensure that the kernel is memory bandwidth bound since the hardware bottleneck for this operation should be memory bandwidth. In order to efficiently do the parallel reductions to calculate the max and sum values in different parts of softmax, I use the fast SIMD32 reduction in every SIMD32 unit. I then write the results to shared memory from each of the 32 SIMD32 units that performed this (since 32 * 32 = 1024). I then have all 32x SIMD32 units read from shared memory and simultaneously do the same reduction to calculate the final value. Afterward, the leader in each unit tells all others the value and everything continues. Now imagine having a compiler compile this for a native SIMD16.
A naive approach would introduce a trip to shared memory for both reductions, giving us 3 trips to shared memory and 4 reductions. A more clever approach would do 2 trips to shared memory and 3 reductions. Either way, SIMD16 is less efficient. The smart thing to do would be to recognize that 256 threads is likely okay too and just do the same exact thing with a smaller number of threads, but a compiler is not expected to be able to make such a high level optimization, especially since the high level API says “use 1024 threads”. Thus you need the developer to rewrite this for SIMD16 hardware to get it to run at full speed and with Intel’s low marketshare, that is not very likely to happen. Of course, this is CUDA code and not a shader, but a shader is likely in a similar situation.
From a hardware design perspective, it saves you some die size in the scheduler.
From a performance perspective, as long as the hardware designer kept 32 in mind, it can schedule 32 lanes and duplicate the signals to the 16 or 8 wide lanes with no loss of performance.
> That documentation talks about writing to a temporary location and reading from a temporary location in order to do cross lane operations.
> If games’ shaders are written with an assumption that SIMD32 is used, then native SIMD32 is going to be more performant than native SIMD16 because of faster cross lane operations.
So this is a situation where wider lanes actually need more hardware to run at full speed and not having it causes a penalty. I see your point here, but I will note that you can add that criss-cross hardware for 32-wide operations while still having 16-wide be your default.
I was looking at the things that were said for XE2 in Lunar Lake and it appears that the slides suggest that they had special handling to emulate SIMD32 using SIMD16 in hardware, so you might be right.
> So this is a situation where wider lanes actually need more hardware to run at full speed and not having it causes a penalty. I see your point here, but I will note that you can add that criss-cross hardware for 32-wide operations while still having 16-wide be your default.
To go from SIMD8 to SIMD16, Intel halved the number of units while making them double the width. They could have done that again to avoid the need for additional hardware.
I have not seen the Xe2 instruction set to have any hints about how they are doing these operations in their hardware. I am going to leave it at that since I have spent far too much time analyzing the technical marketing for a GPU architecture that I am not likely to use. No matter how well they made it, it just was not scaled up enough to make it interesting to me as a developer that owns a RTX 3090 Ti. I only looked into it as much as I did since I am excited to see Intel moving forward here. That said, if they launched a 48GB variant, I would buy it in a heartbeat and start writing code to run on it.
272 mm2 puts the B580 in the same league as the Radeon 7700XT, a $449 card, and the GeForce 4070 Super, which is $599. The idea that Intel is selling these cards at a loss sounds reasonable to me.
Intel has neither, or at least not as much of them.
https://news.ycombinator.com/item?id=42505496
The only way this would be at a loss is if they refuse to raise production to meet demand. That said, I believe their margins on these are unusually low for the industry. They might even fall into razor thin territory.
https://news.ycombinator.com/item?id=42505496
That being said, the IP blocks are shared by their iGPUs, so the discrete GPUs do not need to recoup the costs of most of the R&D, as it would have been done anyway for the iGPUs.
At 10m3s in the following video, he claims to add a 60% margin by multiplying by 1.6, but in reality he is adding a 37.5% margin and would have needed to multiply by 2.5 to add a 60% margin. This can be seen by computing Cost Scaling Factor = 1 / (1 - Normalized Profit Margin):
2.5 = 1 / (1 - 0.6)
1.6 = 1 / (1 - 0.375)
https://youtu.be/pq5G4mPOOPQ
At 48m13s in the following video, he claims that Intel’s B580 is 80% worse than Nvidia’s hardware. He took the 4070 Ti as being 82% better than the 2080 SUPER, assumed based on leaks from his reviewer friends that the B580 was about at the performance of the 2080 SUPER and then claimed that the B580 would be around 80% worse than the 4070 Ti. Unfortunately for him, that is 45% worse, not 80% worse. His chart is from Techpowerup and if he had taken the time to do some math (1 - 1/(1 + 0.82) ~ 0.45), or clicked to the 2080 SUPER page, he would have seen it has 55% of the performance of the 4070 Ti, which is 45% worse:
https://youtu.be/-lv52n078dw
At 1m2s in the following video, he makes a similar math mistake by saying that the B580 has 8% better price/performance than the RTX 3060 when in fact it is 9% better. He mistakenly equated the RTX 3060 being 8% worse than the B580 to mean that it is 8% better, but math does not work that way. Luckily for him, the math error is small here, but he still failed to do math correctly and his reasoning grows increasingly faulty with the scale of his math errors. What he should have done that gives the correct normalized factor is:
1.09 ~ 1 / (1 - 0.08)
A factor of 1.09 better is 9% better.
https://youtu.be/3jy6GDGzgbg
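(The three conversions at issue, spelled out in a few lines of Python:)

    # Margin vs. markup, and "X% faster" vs. "X% slower".
    def margin_from_multiplier(m):   # profit as a share of the selling price
        return 1 - 1 / m

    def slower_from_faster(f):       # A is f faster than B -> how much slower B is
        return 1 - 1 / (1 + f)

    print(f"x1.6 -> {margin_from_multiplier(1.6):.1%} margin (not 60%)")
    print(f"x2.5 -> {margin_from_multiplier(2.5):.1%} margin")
    print(f"82% faster -> {slower_from_faster(0.82):.0%} slower (not 80%)")
    print(f"8% worse   -> {1 / (1 - 0.08) - 1:.0%} better (not 8%)")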
He not only fails at mathematical reasoning, but also lacks a basic understanding of how hardware manufacturing works. He said that if Intel loses $20 per card in low production volumes, then making 10 million cards will result in a $200 million loss. In reality, things become cheaper due to economies of scale, and simple napkin math shows that they can turn a profit on these cards:
https://news.ycombinator.com/item?id=42505496
His $20 loss per card remark is at 11m40s:
https://youtu.be/3jy6GDGzgbg
His behavior is consistent with being on a vendetta rather than being a technology journalist. For example, at 55m13s in the following video, he puts words in Tom Petersen’s mouth and then, with a malicious smile, cheers while claiming that Tom Petersen declared discrete ARC cards to be dead when Tom Petersen said nothing of the kind. Earlier in the same video at around 44m14s, he calls Tom Petersen a professional liar. However, he sees no problem expecting people to believe words he shoved into the “liar’s” mouth:
https://youtu.be/xVKcmGKQyXU
If you scrutinize his replies to criticism in his comments section, you would see he is dodging criticism of the actual issues with his coverage while saying “I was right about <insert thing completely unrelated to the complaint here>” or “facts don’t care about your feelings”. You would also notice that he is copy and pasting the same statements rather than writing replies addressing the details of the complaints. To be clear, I am paraphrasing in those two quotes.
He also shows contempt for his viewers that object to his behavior in the following video around 18m53s where he calls them “corporate cheerleaders”:
https://youtu.be/pq5G4mPOOPQ
In short, Tom at MLID is unable to do mathematical reasoning, does not understand how hardware manufacturing works, has a clear vendetta against Intel’s discrete graphics, is unable to take constructive criticism and lashes out at those who try to tell him when he is wrong. I suggest being skeptical of anything he says about Intel’s graphics division.
Is this something anyone sets out to do?
We just onboarded a customer moving from the OpenAI API to an on-prem solution; we're currently evaluating the MI300X for inference.
Email me at my profile email.
From their announcement on 20241219[^0]:
"We are the only company to get AMD on MLPerf, and we have a completely custom driver that's 50x simpler than the stock one. A bit shocked by how little AMD cared, but we'll take the trillions instead of them."
From 20241211[^1]:
"We gave up and soon tinygrad will depend on 0 AMD code except what's required by code signing.
We did this for the 7900XTX (tinybox red). If AMD was thinking strategically, they'd be begging us to take some free MI300s to add support for it."
---
[^0]: https://x.com/__tinygrad__/status/1869620002015572023
[^1]: https://x.com/__tinygrad__/status/1866889544299319606
[1] https://youtube.com/watch?v=dNrTrx42DGQ&t=3218
https://www.modular.com/blog/introducing-max-24-6-a-gpu-nati...
But that is irrelevant to the conversation because this is not about Mojo but something they call MAX. [1]
1. https://www.modular.com/max
And that's what's happening, slowly anyway. Google, Apple and Amazon all have their own AI chips, Intel has Gaudi, AMD had their thing, and the software is at least working on more than just Nvidia. Which is a win. Even if it's not perfect. I'm personally hoping that everyone piles in on a standard like SYCL.
I assume the part where she said there's "gaps in the software stack", because that's the only part that's attributed to her.
But I must be wrong because that hasn't been in dispute or in the news in a decade, it's not a geohot discovery from last year.
Hell I remember a subargument of a subargument re: this being an issue a decade ago in macOS dev (TL;Dr whether to invest in opencl)
tl;dr there's a not-insubstantial # of people who learn a lot from geohot. I'd say about 3% of people here will be confused if you thought of him as less than a top technical expert across many comp sci fields.
And he did the geohot thing recently, way tl;dr: acted like there was a scandal being covered up by AMD around drivers that was causing them to "lose" to nVidia.
He then framed AMD not engaging with him on this topic as further covering-up and choosing to lose.
So if you're of a certain set of experiences, you see an anodyne quote from the CEO that would have been utterly unsurprising dating back to when ATI was still a company, and you'd read it as the CEO breezily admitting in public that geohot was right about how there was malfeasance, followed by a cover up, implying extreme dereliction of duty, because she either helped or didn't realize till now.
I'd argue this is partially due to stonk-ification of discussions, there was a vague, yet often communicated, sense there was something illegal happening. Idea was it was financial dereliction of duty to shareholders.
Quite frankly, I have difficulty reconciling a lot of comments here with that, and my own experience as an AMD GPU user (although not for compute, and not on Windows).
AMD is constitutionally incapable of shipping anything but mid range hardware that requires no innovation.
The only reason why they are doing so well in CPUs right now is that Intel has basically destroyed itself without any outside help.
The most important has been the fact that AMD has predicted correctly that big monolithic CPUs will no longer be feasible in the future CMOS fabrication technologies, so they have designed the Zen family since the beginning with a chiplet-based architecture. Intel had attempted to ridicule them, but after losing many billions they have been forced to copy this strategy.
Also in the microarchitecture of their CPUs AMD has made the right choices since the beginning and then they have improved it constantly with each generation. The result is that now the latest Intel big core, Lion Cove, has a microarchitecture that is much more similar to AMD Zen 5 than to any of the previous Intel cores, because they had to do this to get a competitive core.
In the distant past, AMD has also introduced a lot of innovations long before they were copied by Intel, but it is true that those had not been invented by AMD, but they had been copied by AMD from more expensive CPUs, like DEC Alpha or Cray or IBM POWER, but Intel has also copied them only after being forced by the competition with AMD.
But Nvidia is a different beast. It is a bit like Apple in the late 00s: take business, forecasting, marketing, operations, software, hardware, sales, etc. - take any part of it and they are all industry leading. And having industry-leading capability is only part of the game; having it all work together is another thing entirely. And unlike Apple, which lost direction once Steve Jobs passed away and wasn't sure how to deploy capital, Jensen is still here, and they have more resources now, making Nvidia even more competitive.
Most people underestimate the magnitude of the task required (I like to tell the story about an Intel GPU engineer in 2016 arguing they could take dGPU market share by 2020, and we are now in 2025), overestimate the capability of an organisation, and underestimate the rival's speed of innovation and execution. These three things combined are why most people's estimates are often off by an order of magnitude.
We are in the middle of a monopoly squeeze by NVidia on the most innovative part of the economy right now. I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.
By comparison if AMD could write a driver that didn't shit itself when it had to multiply more than two matrices in a row they'd be selling cards faster than they can make them. You don't need to sell the best shovels in a gold rush to make mountains of money, but you can't sell teaspoons as premium shovels and expect people to come back.
I am not sure which part of Nvidia is monopoly. That is like suggesting TSMC has a monopoly.
They... do have a monopoly on foundry capacity, especially if you're looking at the most advanced nodes? Nobody's going to Intel or Samsung to build 3nm processors. Hell, there have been whispers over the past month that even Samsung might start outsourcing Exynos to TSMC; Intel already did that with Lunar Lake.
Having a monopoly doesn't mean that you are engaging in anticompetitive behavior, just that you are the only real option in town.
I would argue that defining a semiconductor market in terms of node size is too narrow. Just because TSMC is getting the newest nodes first does not mean they have a monopoly in the semiconductor market. We can play semantics, but for any meaningful discussion of monopolistic behaviors, a temporary technical advantage seems a poor way to define the term.
Sure. Market research also places them as having somewhere around 65% of worldwide foundry sales [0], with Samsung coming in second place with about 12% (mostly first-party production). Fact is that nobody else comes close to providing real competition for TSMC, so they can charge whatever prices they want, whether you're talking about the 3nm node or the 10nm node.
[0] https://www.counterpointresearch.com/insights/global-semicon...
Rounding out the top five: SMIC (6%) is out of the question unless you're based in China, due to various sanctions; UMC (5%) mainly sells decade-plus-old processes (22nm and larger); and GlobalFoundries has explicitly abandoned keeping up with the latest nodes.
If you exclude the various Chinese foundries and subtract off Samsung's first-party development, TSMC's share of available foundry capacity for third-party contracts likely grows to 70% or more. At what point do you consider this to be a monopoly? Microsoft Windows has about 72% of desktop OS share.
Maybe the US will do something if GPU price becomes the limit instead of the supply of chips and power.
It sounds like you're expecting extreme competence from the DOJ. Given their history with regulating big tech companies, and even worse, the incoming administration, I think this is a very unrealistic expectation.
Also, I'd credit HN as an amazing platform for the overall consistency and quality of its moderation. Anything beyond that depends more on who you're talking to than on where you're talking.
Personally, I think that's when somebody who has no real information to contribute doesn't try to pretend that they do.
So thanks for the offer, but I think I'm already delivering in that realm.
The Internet is a machine that highly simplifies the otherwise complex technical challenge of wide-casting ignorance. It wide-casts wisdom too, but it's an exercise for the reader to distinguish them.
Everyone who's dug deep into what AMD is doing has left in disgust if they were lucky, and in bankruptcy if they were not.
If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.
This doesn't seem like useful advice if you've already given up on them.
You tried it and at some point in the past it wasn't ready. But by not being ready they're losing money, so they have a direct incentive to fix it. Which would take a certain amount of time, but once you've given up you no longer know if they've done it yet or not, at which point your advice would be stale.
Meanwhile, the people who attempt it seem to keep getting acquired by Nvidia, for some strange reason, which implies it's a worthwhile thing to do. If they've fixed it by now (which you wouldn't know, having stopped looking), or they fix it in the near future, you have a competitive advantage because you have access to lower-cost GPUs than your rivals. If not, but you've demonstrated a serious attempt to fix it for everyone yourself, Nvidia comes to you with a sack full of money to make sure you don't finish, and then you have a sack full of money. That's win/win, so rather than nobody doing it, it seems like everybody should be doing it.
I've seen people try it every six months for two decades now.
At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.
I'm deeply worried about stagnation in the CPU space now that they are top dog and Intel is dead in the water.
Here's hoping China and RISC-V save us.
>Meanwhile the people who attempt it apparently seem to get acquired by Nvidia
Everyone I've seen base jumping has gotten a sponsorship from Red Bull, ergo everyone should base jump.
Ignore the red smears around the parking lot.
AMD has always punched above their weight. Historically their problem was that they were the much smaller company and under heavy resource constraints.
Around the turn of the century the Athlon was faster than the Pentium III and then they made x86 64-bit when Intel was trying to screw everyone with Itanic. But the Pentium 4 was a marketing-optimized design that maximized clock speed at the expense of heat and performance per clock. Intel was outselling them even though the Athlon 64 was at least as good if not better. The Pentium 4 was rubbish for laptops because of the heat problems, so Intel eventually had to design a separate chip for that, but they also had the resources to do it.
That was the point that AMD made their biggest mistake. When they set out to design their next chip the competition was the Pentium 4, so they made a power-hungry monster designed to hit high clock speeds at the expense of performance per clock. But the reason more people didn't buy the Athlon 64 wasn't that they couldn't figure out that a 2.4GHz CPU could be faster than a 2.8GHz CPU, it was all the anti-competitive shenanigans Intel was doing behind closed doors to e.g. keep PC OEMs from featuring systems with AMD CPUs. Meanwhile by then Intel had figured out that the Pentium 4 was, in fact, a bad design, when their own Pentium M laptops started outperforming the Pentium 4 desktops. So the Pentium 4 line got canceled and Bulldozer had to go up against the Pentium M-based Core, which nearly bankrupted AMD and compromised their ability to fund the R&D needed to sustain state of the art fabs.
Since then they've been climbing back out of the hole but it wasn't until Ryzen in 2017 that you could safely conclude they weren't on the verge of bankruptcy, and even then they were saddled with a lot of debt and contracts requiring them to use the uncompetitive Global Foundries fabs for several years. It wasn't until Zen4 in 2022 that they finally got to switch the whole package to TSMC.
So until quite recently the answer to the question "why didn't they do X?" was obvious. They didn't have the money. But now they do.
Seven and a half years.
The excuse is threadbare at best. They are not doing a reasonable job of making compute work off the shelf.
Seven and a half years was the 2017 Ryzen release date. Zen 1 took them from being completely hopeless to having something competitive but only just, because they were still having the whole thing fabbed by GF. Their revenue didn't exceed what it was in 2011 until 2019 and didn't exceed Intel's until 2022. It's still less than Nvidia, even though AMD is fielding CPUs competitive with Intel and GPUs competitive with Nvidia at the same time.
They had a pretty good revenue jump in 2021 but much of that was used to pay down debt, because debt taken on when you're almost bankrupt tends to have unfavorable terms. So it wasn't until somewhere in 2022 that they finally got free of GF and the old debt and could start doing something about this. But then it takes some amount of time to actually do it, and you would expect to be seeing the results of that approximately right now. Which seems like a silly time to stop looking.
Also, somewhat counterintuitively, George Hotz et al. seem to be employing a strategy of saying bad things about AMD in public to shame them into improving. This has a dual result: it actually works (they fix a lot of the things he complains about), but it also makes people think things are worse than they are, because there is now a large public archive of rants about issues that have already been fixed. It's not clear whether this is the company failing to provide a good channel for reporting problems like that privately and having them fixed promptly, so that it doesn't take media attention to make it happen, or George Hotz seeking publicity as is his custom, or some combination of both.
I'm not basing anything on geohotz, just general discussions from people that have tried, and my own experience of trying to get some popular compute code bases to run. It has been so lacking compared to AMD's own support for games. I'm not going to be "silly" and "stop looking" going forward, but I'm not going to forget how long my card was largely abandoned. It went directly from "not ready yet, working on it" to "obsolete, maybe dregs will be added later".
Zen+ was released in 2018 and Zen 2 in 2019. Their revenue in 2019 was only 2.5% higher than it was in 2011; adjusted for inflation it was still down more than 10%.
> they absolutely did not need to wait until they had more revenue than some chunk of Intel or until their debt was gone.
The premise of the comparison is that it shows the resources they have available. To make the same level of investment as a bigger company you either have to take it out of profit (not possible when your net profit has a minus sign in front of it, or is only a single-digit percentage of revenue) or you have to make more money first.
And carrying high interest debt when you're now at much lower risk of default is pretty foolish. You'd be paying interest that could be going to R&D. Even if you want to borrow money in order to invest it, the thing to do is to pay back the high interest debt and then borrow the money again now that you can get better terms, which seems to be just what they did.
It is so bad that they had major cards that were never on the support list for compute.
> You'd be paying interest that could be going to R&D.
Getting people to actually consider your datacenter cards, because they know how to use your cards, will get you more R&D money.
Have you tried compute shaders instead of that weird HPC-only stuff?
Compute shaders are widely used by millions of gamers every day. GPU vendors have a huge incentive to make them reliable and efficient: modern game engines use them for lots of things, e.g. UE5 can even render triangle meshes with GPU compute instead of the graphics pipeline (the tech is called Nanite virtualized geometry). In practice they work fine on all GPUs, ML included: https://github.com/Const-me/Cgml
Ignoring the very old (in ML time) date of the article...
What's the catch? People are still struggling with this a year later so I have to assume it doesn't work as well as claimed.
I'm guessing this is buggy in practice and only works for the HF models they chose to test with?
The only catch is that, for some reason, developers of ML libraries like PyTorch aren't interested in open GPU APIs like D3D or Vulkan. Instead, they focus on proprietary ones, i.e. CUDA and, to a lesser extent, ROCm. I don't know why that is.
D3D-based videogames have been using GPU compute heavily for more than a decade now. Since Valve shipped the Steam Deck, the same applies to Vulkan on Linux. By now, both technologies are stable, reliable and performant.
OTOH, general-purpose compute instead of the fixed-function blocks used by cuDNN enables custom compression algorithms for these weights, which does help by saving memory bandwidth. For example, I did a custom 5 bits/weight quantization which works on all GPUs, no hardware support necessary, just simple HLSL code: https://github.com/Const-me/Cgml?tab=readme-ov-file#bcml1-co...
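To make the idea concrete, here is a minimal Python/NumPy sketch of that style of scheme (per-block scale and offset, 5-bit codes packed into a byte stream). It only illustrates the general concept; it is not the actual BCML1 codec or its HLSL implementation:

    import numpy as np

    def quantize_5bit(weights, block=32):
        # Toy block quantizer: per-block min/scale, codes in [0, 31] (5 bits each).
        w = weights.reshape(-1, block).astype(np.float32)
        lo = w.min(axis=1, keepdims=True)
        scale = (w.max(axis=1, keepdims=True) - lo) / 31.0
        q = np.clip(np.round((w - lo) / np.where(scale == 0, 1, scale)), 0, 31).astype(np.uint8)
        # Pack the 5-bit codes: 32 codes -> 160 bits -> 20 bytes per block.
        bits = ((q[..., None] >> np.arange(4, -1, -1)) & 1).astype(np.uint8)
        return np.packbits(bits.reshape(q.shape[0], -1), axis=1), scale, lo

    def dequantize_5bit(packed, scale, lo, block=32):
        bits = np.unpackbits(packed, axis=1)[:, :block * 5].reshape(-1, block, 5)
        q = (bits * (1 << np.arange(4, -1, -1))).sum(axis=2)
        return (q * scale + lo).reshape(-1)

    w = np.random.randn(1024).astype(np.float32)
    packed, scale, lo = quantize_5bit(w)
    print(packed.nbytes / w.size)                                # 0.625 bytes/weight, plus per-block scale/offset
    print(np.abs(dequantize_5bit(packed, scale, lo) - w).max())  # error bounded by ~scale/2

The GPU-side dequantization is just the unpack step above rewritten as a shader and done on the fly while streaming the weights, which is where the bandwidth saving comes from.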
I would argue running local inference with batch size = 1 is more useful for empowering innovators than running production loads on shared servers owned by companies. Local inference increases the number of potential innovators by orders of magnitude.
BTW, in the long run it may also benefit these companies, because in theory an easy migration path from CUDA puts downward pressure on Nvidia's prices.
llama.cpp has a much bigger supported model list, as does vLLM and of course PyTorch/HF transformers covers everything else, all of which work w/ ROCm on RDNA3 w/o too much fuss these days.
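As a quick illustration of why the fuss has mostly gone away: on a ROCm build of PyTorch, HIP devices are exposed through the usual torch.cuda API, so most CUDA-targeting code runs unchanged. A minimal sketch, assuming a working ROCm install of PyTorch on an RDNA3 card:

    import torch

    print(torch.cuda.is_available())       # True on a working ROCm build
    print(torch.version.hip)               # HIP/ROCm version string (None on CUDA builds)
    print(torch.cuda.get_device_name(0))   # e.g. "AMD Radeon RX 7900 XTX"

    # Tensors and models still use the "cuda" device name; matmuls are
    # dispatched to rocBLAS/hipBLAS under the hood.
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    y = x @ x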
For inference, the biggest caveat is that Flash Attention is only available as an aotriton implementation, which, besides sometimes being less performant, also doesn't support SWA. For CDNA there is a better CK-based version of FA, but CK does not have RDNA support. There are a couple of people at AMD apparently working on native FlexAttention, so I guess we'll see how that turns out.
Note that the recent SemiAccurate piece was on training, which I'd agree is in a much worse state (I have personal experience with it often being broken for even the simplest distributed training runs). Funnily enough, if you're running simple fine-tunes on a single RDNA3 card, you'll probably have a better time. OOTB, a 7900 XTX will train at about the same speed as an RTX 3090 (4090s blow both of those away, but you'll probably want more cards and VRAM, or to just move to H100s).
The reason I say it's a dinosaur is this: imagine we as a dev community had continued to build on top of Flash or Microsoft Silverlight...
LLMs and ML have been around for quite a while, and given the pace of AI/LLM advancement, the move to cross-platform tooling should have happened much faster. But it hasn't yet, and it's not clear when it will.
Building a translation layer on top of CUDA is not the answer to this problem either.
1. It just works. When I tried to build things on Intel Arcs, I spent way more hours bikeshedding IPEX and driver issues than developing.
2. LLMs seem to have more CUDA code in their training data. I can leverage Claude and 4o to help me build things with CUDA, but trying to get them to help me do the same things with IPEX just doesn't work.
I'd very much love a translation layer for CUDA, like a dxvk or Wine equivalent.
Would save a lot of money since Arc gpus are in the bargain bin and nvidia cloud servers are double the price of AMD.
As it stands now, my dual Intel Arc rig is now just a llama.cpp inference server for the family to use.
https://github.com/vosen/ZLUDA
https://github.com/ROCm/HIPIFY
This is pretty much exactly what happens with CUDA. Developers like it but then the users have to use expensive hardware with proprietary drivers/firmware, which is the relevant abomination. But users have some ability to influence developers, so as soon as we get the GPU equivalent of HTML5, what happens?
What do you mean by that? People trying to run their own models are not “the users” they are a tiny insignificant niche segment.
We're also likely to see a stronger swing away from "do inference in the cloud" because of the aligned incentives of "companies don't want to pay for all that hardware and electricity" and "users have privacy concerns" such that companies doing inference on the local device will have both lower costs and a feature they can advertise over the competition.
What this is waiting for is hardware in the hands of the users that can actually do this for a mass market price, but there is no shortage of companies wanting a piece of that. In particular, Apple is going to be pushing that hard and despite the price they do a lot of volume, and then you're going to start seeing more PCs with high-VRAM GPUs or iGPUs with dedicated GDDR/HBM on the package as their competitors want feature parity for the thing everybody is talking about, the cost of which isn't actually that high, e.g. 40GB of GDDR6 is less than $100.
https://github.com/triton-lang/triton/issues/4978
I have been playing with llama.cpp to run inference on conventional CPUs. No conclusions yet, but it's interesting. I need to look at llamafile next.
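For anyone curious what that looks like in practice, here is a minimal sketch using the llama-cpp-python bindings; the model path, prompt and thread count are placeholders:

    from llama_cpp import Llama

    # Any GGUF quant works; smaller quants trade accuracy for less RAM traffic,
    # which matters a lot for CPU inference.
    llm = Llama(model_path="models/llama-2-7b.Q4_0.gguf", n_ctx=2048, n_threads=8)

    out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
    print(out["choices"][0]["text"])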
Making AMD GPUs competitive for LLM inference https://news.ycombinator.com/item?id=37066522 (August 9, 2023 — 354 points, 132 comments)
I would not recommend buying one for $600, it probably either won’t arrive or will be broken. Someone will reply saying they got one for $600 and it works, that doesn’t mean it will happen if you do it.
I’d say the market is realistically $900-1100, maybe $800 if you know the person or can watch the card running first.
All that said, this advice will expire in a month or two when the 5090 comes out.
Just beware, the card might be "working fine" on a first glance, but actually be damaged.
1. https://www.modular.com/max
That being said, I did some very recent inference testing on a W7900 (using the same testing methodology as Embedded LLM's recent post on vLLM's newly added Radeon GGUF support [1]) and MLC continues to perform quite well. On Llama 3.1 8B, MLC's q4f16_1 (4.21GB weights) performed 35% faster than llama.cpp w/ Q4_K_M on their ROCm/HIP backend (4.30GB weights, a 2% size difference).
That makes MLC still the generally fastest standalone inference engine for RDNA3 by a country mile. However, you have much less flexibility with quants and by and large have to compile your own for every model, so llama.cpp is probably still more flexible for general use. Also llama.cpp's (recently added to llama-server) speculative decoding can also give some pretty sizable performance gains. Using a 70B Q4_K_M + 1B Q8_0 draft model improves output token throughput by 59% on the same ShareGPT testing. I've also been running tests with Qwen2.5-Coder and using a 0.5-3B draft model for speculative decoding gives even bigger gains on average (depends highly on acceptance rate).
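For anyone unfamiliar with why a small draft model helps: the big model only has to verify the draft's guesses, and it can check several of them in a single forward pass. A toy greedy version in Python (draft and target are hypothetical callables returning next-token logits; real engines batch the verification step on the GPU, which is where the speedup comes from):

    import numpy as np

    def speculative_decode(target, draft, tokens, n_new=64, k=4):
        # Output is identical to plain greedy decoding with `target`;
        # only the number of expensive target passes changes.
        tokens, goal = list(tokens), len(tokens) + n_new
        while len(tokens) < goal:
            # 1. The cheap draft model proposes k tokens autoregressively.
            proposal, ctx = [], list(tokens)
            for _ in range(k):
                t = int(np.argmax(draft(ctx)))
                proposal.append(t)
                ctx.append(t)
            # 2. The big model verifies the proposal and keeps the longest
            #    prefix it agrees with (one batched pass in a real engine).
            for t in proposal:
                want = int(np.argmax(target(tokens)))
                if want != t:
                    tokens.append(want)   # big model's own token on a mismatch
                    break
                tokens.append(t)
        return tokens

The gain depends entirely on the acceptance rate, which is why a closely related draft model (e.g. a 0.5-3B model from the same family) matters so much.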
Note, I think for local use, vLLM GGUF is still not suitable at all. When testing w/ a 70B Q4_K_M model (only 40GB), loading, engine warmup, and graph compilation took on avg 40 minutes. llama.cpp takes 7-8s to load the same model.
At this point for RDNA3, basically everything I need works/runs for my use cases (primarily LLM development and local inferencing), but almost always slower than an RTX 3090/A6000 Ampere (a new 24GB 7900 XTX is $850 atm; used or refurbished 24GB RTX 3090s are in the same ballpark, about $800 atm; a new 48GB W7900 goes for $3600 while a 48GB A6000 (Ampere) goes for $4600). The efficiency gap can be sizable. E.g., on my standard llama-bench test w/ llama2-7b-q4_0, the RTX 3090 gets a tg128 of 168 t/s while the 7900 XTX only gets 118 t/s, even though both have similar memory bandwidth (936.2 GB/s vs 960 GB/s). It's also worth noting that since the beginning of the year, the llama.cpp CUDA implementation has gotten almost 25% faster, while the ROCm version's performance has stayed static.
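Those tg128 numbers line up with a simple bandwidth-ceiling estimate: at batch size 1, every generated token has to stream essentially the whole weight file from VRAM, so tokens/s is bounded by bandwidth divided by model size. A rough back-of-the-envelope sketch (the ~3.8GB figure for a llama2-7b Q4_0 GGUF is an approximation):

    model_gb = 3.8   # approx. llama2-7b Q4_0 GGUF size

    for name, bw_gbs, measured in [("RTX 3090", 936.2, 168), ("7900 XTX", 960.0, 118)]:
        ceiling = bw_gbs / model_gb   # theoretical tg upper bound, tokens/s
        print(f"{name}: ceiling ~{ceiling:.0f} t/s, measured {measured} t/s "
              f"({measured / ceiling:.0%} of the bandwidth limit)")

That works out to roughly 68% of the theoretical limit for the CUDA backend versus roughly 47% for ROCm on nearly identical bandwidth, which is the software gap in a nutshell.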
There is an actively maintained (solo dev) fork of llama.cpp that sticks close to HEAD but applies a rocWMMA patch. It can improve performance if you use llama.cpp's FA (which still performs worse than with FA disabled) and in certain long-context inference generations (on llama-bench and the ShareGPT serving test above you won't see much difference): https://github.com/hjc4869/llama.cpp - The fact that no one from AMD has shown any interest in helping improve llama.cpp performance (despite often citing llama.cpp-based apps in marketing/blog posts, etc.) is disappointing... but sadly on brand for AMD GPUs.
Anyway, for those interested in more information and testing for AI/ML setup for RDNA3 (and AMD ROCm in general), I keep a doc with lots of details here: https://llm-tracker.info/howto/AMD-GPUs
[1] https://embeddedllm.com/blog/vllm-now-supports-running-gguf-...
Are these LLMs just absurdly memory bound so it doesn't matter?
LLMs aren't purely memory bound under production loads; they are pretty much compute bound as well, certainly in the prefill phase, and in practice (with large batch sizes) largely in decode too.
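A rough way to see both regimes at once, with ballpark numbers for a 4090-class card (all figures approximate): at batch 1, each generated token reads every weight byte but only does a couple of FLOPs per parameter, far below the card's FLOPs-per-byte ratio, so single-user decode is bandwidth bound. Batching multiplies the arithmetic done per weight byte read, and prefill is effectively a huge batch, which is why production serving ends up compute bound:

    peak_tflops = 83          # rough shader FP32 throughput, 4090-class card
    bandwidth_gbs = 1008      # rough VRAM bandwidth, GB/s
    machine_balance = peak_tflops * 1e12 / (bandwidth_gbs * 1e9)   # ~82 FLOPs per byte

    flops_per_param = 2       # one multiply-accumulate per weight per token
    bytes_per_param = 0.5     # 4-bit quantized weights

    intensity_batch1 = flops_per_param / bytes_per_param           # ~4 FLOPs per byte
    crossover_batch = machine_balance / intensity_batch1           # ~20 concurrent sequences

    print(f"machine balance ~{machine_balance:.0f} FLOPs/byte, "
          f"batch-1 intensity ~{intensity_batch1:.0f} FLOPs/byte, "
          f"compute bound above roughly {crossover_batch:.0f} parallel requests")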
https://rocm.docs.amd.com/projects/rocWMMA/en/latest/what-is...
And AMD has passable ray tracing units (Nvidia's are better, but the difference is bigger than in these LLM results).
If RAM is the main bottleneck then CPUs should be on the table.
That's certainly not the case. The graphics memory model is very different from the CPU memory model. Graphics memory is explicitly designed for multiple simultaneous reads (spread across several different buses) at the cost of generality (only portions of memory may be available on each bus) and latency (the extra complexity means individual reads are slower). This makes them fast at doing simple operations on large amounts of data.
CPU memory only has one bus, so only a single read can happen at a time (a cache line read), but can happen relatively quickly. So CPUs are better for workloads with high memory locality and frequent reuse of memory locations (as is common in procedural programs).
If people are paying $15,000 or more per GPU, then I can choose $15,000 CPUs like EPYC that have 12-channels or dual-socket 24-channel RAM.
Even desktop CPUs are dual-channel at a minimum, and arguably DDR5 is closer to 2 or 4 buses per channel.
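The channel math, as a rough sketch (DDR5-4800 assumed; real sustained bandwidth is lower than these peak figures):

    mts = 4800                 # DDR5-4800: mega-transfers per second per channel
    bytes_per_transfer = 8     # 64-bit channel
    per_channel = mts * bytes_per_transfer / 1000   # ~38.4 GB/s

    for name, channels in [("desktop, dual channel", 2),
                           ("EPYC, 12 channels", 12),
                           ("dual-socket EPYC, 24 channels", 24)]:
        print(f"{name}: ~{channels * per_channel:.0f} GB/s peak")

So a 12-channel EPYC lands around 460 GB/s and a dual-socket box around 920 GB/s on paper, in the same neighborhood as a 3090 or 7900 XTX, though NUMA effects and sustained-vs-peak gaps eat into that in practice.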
Now yes, GPU RAM can be faster, but guess what?
https://www.tomshardware.com/pc-components/cpus/amd-crafts-c...
GPUs are about extremely parallel performance, above and beyond what traditional single-threaded (or limited-SIMD) CPUs can do.
But if you're waiting on RAM anyway? Then the compute method doesn't matter. It's all about RAM.
Though the distinction between the two categories is blurring.
1) Compute Express Link (CXL):
https://en.wikipedia.org/wiki/Compute_Express_Link
PCIe vs. CXL for Memory and Storage:
https://news.ycombinator.com/item?id=38125885
My years old pleb tier non-HBM GPU has more than 4 times the bandwidth you would get from a PCIe Gen 7 x16 link, which doesn't even officially exist yet.
[1] Forget ChatGPT: why researchers now run small AIs on their laptops:
https://news.ycombinator.com/item?id=41609393
[2] Welcome to LLMflation – LLM inference cost is going down fast:
https://a16z.com/llmflation-llm-inference-cost/
[3] New LLM optimization technique slashes memory costs up to 75%:
https://news.ycombinator.com/item?id=42411409
https://github.com/ryao/llama3.c/blob/master/run.c
First, CXL is useless as far as I am concerned.
The smaller LLM stuff in 1 and 2 is overrated. LLMs get plenty of things wrong, and while the capabilities of small LLMs are increasing, they are just never as good as the larger LLMs in my testing. To give an example: between a small LLM that gets things right 20% of the time and a large one that gets things right 40% of the time, you are never going to want to deal with the small one. Even as they improve, you will just find new things that they are not able to do well. At least, that is my experience.
Finally, the 75% savings figure in 3 is misleading. It applies to the context, not the LLMs themselves. It is very likely that nobody will use it, since it is a form of lossy compression that will ruin the ability of the LLM to repeat what is in its memory.
That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.
Nothing says it can't be useful. My most-used model is running in a microcontroller. Just keep those expectations tempered.
(EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)
https://en.wikipedia.org/wiki/PCI_Express#PCI_Express_7.0
So you have a full terabyte per second of bandwidth? What GPU is that?
(The 64GB/s number is an x4 link. If you meant you have over four times that, then it sounds like CXL would be pretty competitive.)
A 7900XTX also has 1TB/s on paper, but you'll need awkward workarounds every time you want to do something (see: article) and half of your workloads will stop dead with driver crashes and you need to decide if that's worth $500 to you.
Stacking 3090s is the move if you want to pinch pennies. They have 24GB of memory and 936GB/s of bandwidth each, so almost as good as the 4090, but they're as cheap as the 7900XTX with none of the problems. They aren't as good for gaming or training workloads, but for local inference 3090 is king.
It's not a coincidence that the article lists the same 3 cards. These are the 3 cards you should decide between for local LLM, and these are the 3 cards a true competitor should aim to exceed.
There's a serious gap between CXL and RAM, but it's not nearly as big as it used to be.
https://www.techpowerup.com/gpu-specs/geforce-rtx-3090-ti.c3...
Later consumer GPUs have regressed and only RTX 4090 offers the same memory bandwidth in the current NVIDIA generation.
So I can understand a call for returning to HBM, but it's an expensive choice and doesn't fit the description.
During inference? Definitely. Training is another story.
For now, it is as if AMD does not exist in this field for me.
[1] https://rocm.docs.amd.com/projects/radeon/en/latest/docs/ins...
As soon AMD is as good as Nvidia for inference, I'll switch over.
But I've read on here that their hardware engineers aren't even given enough hardware to test with...
Btw, this is from MLC-LLM which makes WebLLM and other good stuff.