• 0 Posts
  • 4 Comments
Joined 3 years ago
Cake day: June 15th, 2023

  • You can actually get kind of acceptable performance on CPU alone, but you need fairly specific CPUs: Sapphire Rapids (SPR) or newer Intel Xeons. Those support AMX (Advanced Matrix Extensions), which is almost like a mini tensor core, so you can get decent throughput in TFLOPS out of Granite Rapids (GNR) Xeons. Memory bandwidth with all channels populated is also acceptable, something like ~800 GB/s per socket with maxed-out MRDIMMs, which isn’t too far behind consumer GPUs like the 3090 and 4090.

    Not anywhere near the performance of real GPUs, of course, and not something suitable for scale or production workloads, but good enough for local inference (rough bandwidth math below).
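    As a rough sanity check on those numbers: batch-1 token generation is roughly memory-bound, so a ceiling on decode speed is just bandwidth divided by bytes read per token. A minimal sketch in Python (the model size and quantization here are my own illustrative assumptions, not measurements):

    ```python
    # Rough, bandwidth-bound ceiling on batch-1 decode speed.
    # Assumes the full set of weights is streamed from memory once per token.

    def tokens_per_second(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
        """Upper bound on decode tokens/s when memory-bandwidth-bound."""
        bytes_per_token = params_b * 1e9 * bytes_per_param  # weights read per token
        return bandwidth_gb_s * 1e9 / bytes_per_token

    # Hypothetical 70B model at ~4.5 bits/weight (~0.56 bytes/param):
    for name, bw in [("GNR Xeon, 1 socket (~800 GB/s)", 800), ("RTX 3090 (~936 GB/s)", 936)]:
        print(f"{name}: ~{tokens_per_second(bw, 70, 0.56):.0f} tok/s ceiling")
    ```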


  • Makes sense, even Flash is fairly sizable! KTransformers also has a “llamafile” backend which uses GGUFs, but ik_llama will almost certainly perform better if you’re not on a NUMA setup. In my case, I’m using a dual-socket motherboard, so KTransformers performs quite a bit better (I don’t think ik_llama has implemented extensive NUMA optimizations yet, but it sounds like they’re coming; see the pinning sketch below), though I normally use KTransformers for native FP8 weights.
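    For anyone stuck on a dual-socket box with a backend that isn’t NUMA-aware yet, the usual workaround is pinning the process to one node so weights don’t get pulled across the inter-socket link. A minimal Linux-only sketch (the sysfs path is standard Linux, but treat this as an illustration, not how KTransformers or ik_llama handle NUMA internally):

    ```python
    # Pin the current process to the CPUs of one NUMA node, so a backend
    # without NUMA-aware allocation doesn't thrash the inter-socket link.
    import os

    def pin_to_numa_node(node: int) -> None:
        """Restrict this process to the CPUs listed for one NUMA node (Linux)."""
        with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
            spec = f.read().strip()  # e.g. "0-47,96-143"
        cpus = set()
        for part in spec.split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        os.sched_setaffinity(0, cpus)

    pin_to_numa_node(0)  # do this before loading weights so pages land on node 0
    ```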



  • To be fair, the raw FLOPS count doesn’t tell the whole story. On a lot of workloads (including token generation during LLM inference), you’re bound by memory bandwidth rather than compute throughput. On H100/H200, keeping the tensor cores fully occupied is surprisingly difficult (rough roofline math below), and that’s with 3+ TB/s of memory bandwidth. And I believe those cards have much higher throughput than Ascend (at least at FP8; Ascend wins at FP4, since H100/H200 don’t support it).

    The Ascend 950PR units have far lower memory bandwidth, reportedly at 1.4 TB/s. Compare that to Blackwell, which has something like 8TB/s of bandwidth. I believe they’re manufacturing their own kind of HBM, so that’s still really impressive considering this is a fairly recent push into manufacturing accelerators. But I’m a bit skeptical it actually outperforms NVIDIA at scale.