I’ve spent the last months researching and building a dedicated inference workstation that can handle some serious local AI tasks. This is the second part in that series. The first article where I covered the hardware can be found here:
Since I put together my local LLM inference rig with dual Radeon AI Pro R9700 cards, I kept wondering how they actually stack up against the consumer RX 9070 XT, the Nvidia RTX 5080 and the Apple Silicon machines I had access to. Rather than keep guessing, I built a small benchmark suite so I could measure it properly and re-run it whenever a new card shows up. For this test I used only a single R9700. A future article will cover single versus dual card performance and the pros and cons.
There are a few things I think are important to keep in mind when reading this post. This was a quick test I wanted to run on different hardware to see what differences there were and to have a baseline to compare against. It is by no means a perfect test, but it’s something, and I can keep improving the test suite over time. What I learned along the way was that I really enjoyed the testing part. It was fun and interesting to work on the benchmark, see the results, and figure out how to visualize them.
It’s also important to mention that the test and its numbers do not take quality into account, as that’s more model-specific than performance. A higher number means faster output, not more or less accurate output. Different models are trained and optimized for different things. Some are aimed at coding, others at vision or general chat. What a model is good at depends on what it has been trained on and for what purpose.
Test setup and LLM models
Everything runs through llama-bench from llama.cpp. It’s the standard tool for this kind of thing and it reports the same metrics on every platform, so the results are comparable across AMD, Nvidia, Apple and whatever I add later.
I went with Qwen3 as the model family for the tests. The 8B, 14B and 32B parameter sizes seemed like a good way to test different workloads as the size of the model increases.
These three models are all Q4_K_M. I went with 4-bit quantization as it’s a good middle ground. Below 4 bits you generally lose too much quality, and above 4 bits it’s marginally better but at the cost of size.
- Qwen3-8B
- Qwen3-14B
- Qwen3-32B
Two things get measured:
- Prompt processing (pp512) is how fast the card reads a prompt. This is compute bound.
- Token generation (tg128) is how fast it writes the answer. This is memory bandwidth bound, and it’s the number you actually feel when you sit and chat with a model.
Each test runs at context depths of 0, 2048 and 8192 tokens, 5 repetitions, flash attention on, all layers on the GPU.
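As a rough sketch of how a run like this could be launched, the snippet below assembles a llama-bench command line matching the settings above. The exact command used for these results isn't shown in this post, so the model path and the JSON output option are assumptions, and the flag spellings should be checked against `llama-bench --help` for your build.

```python
import shlex


def llama_bench_args(model_path, depths=(0, 2048, 8192), reps=5):
    """Build a llama-bench argument list for the settings described above.

    model_path is a hypothetical location; point it at your own GGUF file.
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-p", "512,2048",                         # prompt processing: pp512 and pp2048
        "-n", "128",                              # token generation: tg128
        "-d", ",".join(str(d) for d in depths),   # context depths to test at
        "-r", str(reps),                          # repetitions per test
        "-fa", "1",                               # flash attention on
        "-ngl", "99",                             # offload all layers to the GPU
        "-o", "json",                             # assumption: JSON output for later parsing
    ]


args = llama_bench_args("models/Qwen3-8B-Q4_K_M.gguf")
print(shlex.join(args))  # copy-pasteable shell command
```

Building the arguments as a list keeps the same settings reusable across the three model files, and the list can be handed to `subprocess.run` on a machine that has llama.cpp installed.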
Quick overview and handpicked dataset
Here’s an overview for anyone who doesn’t want to read through everything and just wants a quick glance at what to expect from the different GPUs, using a handpicked dataset that gives a good picture of the results.
How the ratings are decided
Every rating comes down to two questions: does the model fit in memory, and once it fits, how fast does it generate?
1. Does it fit in memory?
A model that doesn’t fit can’t run at all. That’s an automatic “Won’t fit”, whatever the speed. Each bar is the usable memory a card can give a model; unified-memory machines lose some to the operating system, because the CPU and GPU share the same memory pool.
A model fits only if its dashed line sits inside the coloured bar. The 32B needs more than 23 GB, so only the R9700 and the 32 GB M4 reach it.
2. How fast does it generate?
Once a model fits, generation speed sets the experience. Tokens per second, measured at tg128.
Reading speed sits around 6 tok/s, so even “Limited” is faster than you read. It just feels sluggish for back-and-forth. The same bands apply at every model size: 25 tok/s on a 32B and 25 tok/s on an 8B both land in “Good”. Bigger models are not graded on a curve. For the long-context column (Document RAG), the band is applied to tg128 @ depth 8192 instead of depth 0. Same bands, deeper context.
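The two questions can be sketched as a small rating function. The band cut-offs below (40, 25 and 15 tok/s) are assumptions, not values stated in this post; they were picked because they reproduce the ratings in the tables that follow.

```python
# Assumed lower bounds in tok/s for each band (not published in the post).
BANDS = [(40.0, "Excellent"), (25.0, "Good"), (15.0, "OK")]


def rate(tok_per_s, fits=True):
    """Return the rating for a measured tg128 speed.

    A model that does not fit in memory is rated "Won't fit" regardless of speed.
    """
    if not fits:
        return "Won't fit"
    for floor, label in BANDS:
        if tok_per_s >= floor:
            return label
    return "Limited"


# Values taken from the measured tables in this post.
print(rate(95.3))             # R9700, 8B
print(rate(24.9))             # R9700, 32B
print(rate(13.9))             # MacBook Pro M5, 14B
print(rate(0.0, fits=False))  # 16 GB card, 32B
```

For the Document RAG column the same function would be called with the tg128 figure measured at depth 8192 instead of depth 0.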
What can each GPU actually run?
Local LLM inference benchmarked on Qwen3 at Q4_K_M. Ratings come from measured throughput, not spec sheets.
| GPU / workload | Quick chat (8B · real-time) | Code assistant (14B · IDE context) | Document RAG (14B · 16K context) | Long-form output (14B · sustained) | Heavy reasoning (32B · slower OK) |
|---|---|---|---|---|---|
| AMD AI Pro R9700 (32 GB · workstation) | Excellent · 95 tok/s | Excellent · 53 tok/s | Excellent · 45 tok/s @ depth 8192 | Excellent · 53 tok/s | OK · 24.9 tok/s |
| AMD RX 9070 XT (16 GB · consumer) | Excellent · 96 tok/s | Excellent · 54 tok/s | Excellent · 47 tok/s @ depth 8192 | Excellent · 54 tok/s | Won’t fit · out of memory |
| Nvidia RTX 5080 (16 GB · consumer) | Excellent · 160 tok/s | Excellent · 93 tok/s | Excellent · 81 tok/s @ depth 8192 | Excellent · 93 tok/s | Won’t fit · out of memory |
| Apple MacBook Pro M5 (24 GB unified · base) | Good · 25 tok/s | Limited · 14 tok/s | Limited · 12 tok/s @ depth 8192 | Limited · 14 tok/s | Won’t fit · marginal on 24 GB |
| Apple MacBook Air M4 (32 GB unified · Air) | OK · 21 tok/s | Limited · 8.7 tok/s | Limited · 7 tok/s @ depth 8192 | Limited · 8.7 tok/s | Limited · 3.7 tok/s |
Updated June 2026 · timmyit.com / LLM-Benchmark
Token generation
This is the one most people care about.
Token generation is how fast an LLM produces its answer, measured in tokens per second, where a token is roughly three quarters of a word. It’s the speed you actually feel sitting there waiting for a response, so it matters more than the headline specs. A card pushing 40-50 tok/s feels instant, while anything under 10 starts to drag when you’re going back and forth with the model. Personally, around 25 tok/s feels okay-ish, but anything under 15 tok/s feels unusable to me.
Tokens per second at empty context
Token generation, by model size
Tokens per second at empty context (tg128, depth 0). Deeper green is faster; greyed cells are models that don’t fit.
Each AMD card shows its ROCm and Vulkan results on separate rows. They land on the same colour because they’re within 1% of each other. That’s the point: the backend barely matters on RDNA 4.
Prompt processing
Prompt processing is the time the model spends reading your input before it starts answering, measured in tokens per second like generation but covering everything you send in. It matters most when you’re feeding it big context, a long document, a chunk of code, or a chat history that keeps growing, since that’s where the wait piles up. The good thing is you only pay it once at the start of the turn, so a slower number here stings less than slow token generation does.
Tokens per second processing a 512 token prompt at empty context:
Prompt processing, by model size
Tokens per second ingesting the prompt (pp512, depth 0). This is how fast a card reads your input before it starts answering. Deeper green is faster; greyed cells are models that don’t fit.
As with token generation, each AMD card’s ROCm and Vulkan rows land on the same colour: within about 2% of each other, so the backend choice barely moves prompt processing either.
Prompt processing decides how long before the answer starts. A 2,000-word file is roughly 2,800 tokens: the RTX 5080 ingests it in about a third of a second on the 8B, the R9700 takes under a second, and the M4 takes around 14 seconds on the 8B and over a minute and a half on the 32B. This is the compute-bound side of inference, and it is where the discrete cards pull furthest ahead.
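The wait before the first token can be estimated by dividing the prompt length by the prompt-processing rate. The sketch below redoes that arithmetic with pp512 figures from the tables in this post. It assumes the pp512 rate holds for the whole prompt, which is optimistic, since the pp2048 numbers are slightly lower.

```python
def ingest_seconds(prompt_tokens, pp_tok_per_s):
    """Estimated time to read a prompt at a constant prompt-processing rate."""
    return prompt_tokens / pp_tok_per_s


PROMPT_TOKENS = 2800  # roughly a 2,000-word file

# pp512 at depth 0, taken from the measured tables in this post.
pp512 = {
    "RTX 5080, 8B": 8342,
    "R9700 (ROCm), 8B": 3891,
    "MacBook Air M4, 8B": 198,
    "MacBook Air M4, 32B": 29,
}

for name, speed in pp512.items():
    print(f"{name}: {ingest_seconds(PROMPT_TOKENS, speed):.1f} s")
```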
Model size specific data
Now let’s look at the specific models (8B, 14B and 32B) and how they perform across different workloads:
Quick chat — Casual back-and-forth with a small model: ask, answer, ask again.
Code assistant — Coding help in your editor, where the speed needs to keep up with how fast you type and think.
Document RAG — Asking questions about your own documents (PDFs, notes, internal docs), so the model has to read a lot of text before it can answer.
Long-form output — Generating longer pieces like articles, reports, summaries without the speed dropping off partway through.
Heavy reasoning — Hard problems where you’d rather wait for a good answer than rush a bad one.
Qwen3-8B Q4_K_M
Qwen3-8B Q4_K_M (~7 GB): The small one. It loads in a second or two and leaves almost all of the VRAM free for context, so it’s what you can run when you just want a fast answer for summaries or quick drafts and don’t need it to be the best one.
Qwen3-8B Q4_K_M — workload fitness
How each GPU handles the five canonical workloads with the 8B model.
| GPU / workload | Quick chat (8B · real-time) | Code assistant (8B · IDE context) | Document RAG (8B · 16K context) | Long-form output (8B · sustained) | Heavy reasoning (8B · slower OK) |
|---|---|---|---|---|---|
| AMD AI Pro R9700 (32 GB · workstation) | Excellent · 94 tok/s | Excellent · 94 tok/s | Excellent · 80 tok/s @ depth 8192 | Excellent · 94 tok/s sustained | Excellent · 94 tok/s |
| AMD RX 9070 XT (16 GB · consumer) | Excellent · 97 tok/s | Excellent · 97 tok/s | Excellent · 82 tok/s @ depth 8192 | Excellent · 97 tok/s sustained | Excellent · 97 tok/s |
| Nvidia RTX 5080 (16 GB · consumer) | Excellent · 160 tok/s | Excellent · 160 tok/s | Excellent · 130 tok/s @ depth 8192 | Excellent · 160 tok/s sustained | Excellent · 160 tok/s |
| Apple MacBook Pro M5 (24 GB unified · base) | Good · 25 tok/s | Good · 25 tok/s | OK · 20 tok/s @ depth 8192 | Good · 25 tok/s sustained | Good · 25 tok/s |
| Apple MacBook Air M4 (32 GB unified · Air) | OK · 21 tok/s | OK · 21 tok/s | Limited · 13 tok/s @ depth 8192 | OK · 21 tok/s sustained | OK · 21 tok/s |
Qwen3-8B Q4_K_M — measured throughput
Raw numbers. All values in tok/s.
| GPU / backend | pp512 @ depth 0 | pp2048 @ depth 0 | tg128 @ depth 0 | tg128 @ depth 2048 | tg128 @ depth 8192 |
|---|---|---|---|---|---|
| R9700 (ROCm · 32 GB) | 3891 | 3721 | 94.3 | 90.2 | 79.9 |
| R9700 (Vulkan · 32 GB) | 3978 | 3795 | 95.3 | 91.1 | 80.6 |
| RX 9070 XT (ROCm · 16 GB) | 4053 | 3878 | 96.6 | 92.3 | 81.5 |
| RX 9070 XT (Vulkan · 16 GB) | 4039 | 3868 | 96.6 | 92.3 | 81.6 |
| RTX 5080 (CUDA · 16 GB) | 8342 | 8153 | 159.6 | 149.8 | 130.2 |
| MacBook Pro M5 (Metal · 24 GB unified) | 664 | 616 | 25.3 | 23.9 | 20.4 |
| MacBook Air M4 (Metal · 32 GB unified) | 198 | 171 | 20.8 | 15.4 | 12.5 |
5 reps + 1 warmup discarded · flash attention on · ngl=99.
Qwen3-14B Q4_K_M
Qwen3-14B Q4_K_M (~11 GB): The middle one. Better reasoning than the 8B but still small enough to stay fast, so this is the one you can use when you want a decent answer without the slower turnaround of the 32B.
Qwen3-14B Q4_K_M — workload fitness
How each GPU handles the five canonical workloads with the 14B model.
| GPU / workload | Quick chat (14B · real-time) | Code assistant (14B · IDE context) | Document RAG (14B · 16K context) | Long-form output (14B · sustained) | Heavy reasoning (14B · slower OK) |
|---|---|---|---|---|---|
| AMD AI Pro R9700 (32 GB · workstation) | Excellent · 53 tok/s | Excellent · 53 tok/s | Excellent · 45 tok/s @ depth 8192 | Excellent · 53 tok/s sustained | Excellent · 53 tok/s |
| AMD RX 9070 XT (16 GB · consumer) | Excellent · 54 tok/s | Excellent · 54 tok/s | Excellent · 47 tok/s @ depth 8192 | Excellent · 54 tok/s sustained | Excellent · 54 tok/s |
| Nvidia RTX 5080 (16 GB · consumer) | Excellent · 93 tok/s | Excellent · 93 tok/s | Excellent · 81 tok/s @ depth 8192 | Excellent · 93 tok/s sustained | Excellent · 93 tok/s |
| Apple MacBook Pro M5 (24 GB unified · base) | Limited · 14 tok/s | Limited · 14 tok/s | Limited · 12 tok/s @ depth 8192 | Limited · 14 tok/s sustained | Limited · 14 tok/s |
| Apple MacBook Air M4 (32 GB unified · Air) | Limited · 8.7 tok/s | Limited · 8.7 tok/s | Limited · 7 tok/s @ depth 8192 | Limited · 8.7 tok/s sustained | Limited · 8.7 tok/s |
Qwen3-14B Q4_K_M — measured throughput
Raw numbers. All values in tok/s.
| GPU / backend | pp512 @ depth 0 | pp2048 @ depth 0 | tg128 @ depth 0 | tg128 @ depth 2048 | tg128 @ depth 8192 |
|---|---|---|---|---|---|
| AMD AI Pro R9700 (ROCm · 32 GB) | 2139 | 1919 | 52.7 | 50.6 | 45.3 |
| AMD AI Pro R9700 (Vulkan · 32 GB) | 2168 | 1943 | 53.1 | 50.8 | 45.4 |
| AMD RX 9070 XT (ROCm · 16 GB) | 2235 | 2007 | 54.5 | 52.3 | 47.1 |
| AMD RX 9070 XT (Vulkan · 16 GB) | 2226 | 2003 | 54.5 | 52.3 | 47.1 |
| Nvidia RTX 5080 (CUDA · 16 GB) | 4638 | 4493 | 93.2 | 89.6 | 81.3 |
| Apple MacBook Pro M5 (Metal · 24 GB unified) | 343 | 307 | 13.9 | 13.4 | 12.0 |
| Apple MacBook Air M4 (Metal · 32 GB unified) | 72 | 71 | 8.7 | 8.3 | 7.1 |
5 reps + 1 warmup discarded · flash attention on · ngl=99.
Qwen3-32B Q4_K_M
Qwen3-32B Q4_K_M (~23 GB): The biggest dense Qwen3 of the three, and the one I pick when I care more about the quality of the answer than how fast it comes back. At around 23 GB it still fits on a single 32 GB card with room left over for context, so there’s not much reason to skip it when you’re not in a hurry.
Qwen3-32B Q4_K_M — workload fitness
How each GPU handles the five canonical workloads with the 32B model. The 32B is ~18 GB before KV cache, so 16 GB cards and unified-memory Macs with only ~17 GB usable (the 24 GB M5) cannot load it; those rows are Won’t fit across the board.
| GPU / workload | Quick chat (32B · real-time) | Code assistant (32B · IDE context) | Document RAG (32B · 16K context) | Long-form output (32B · sustained) | Heavy reasoning (32B · slower OK) |
|---|---|---|---|---|---|
| AMD AI Pro R9700 (32 GB · workstation) | OK · 24.9 tok/s | OK · 24.9 tok/s | OK · 22.9 tok/s @ depth 8192 | OK · 24.9 tok/s sustained | OK · 24.9 tok/s |
| AMD RX 9070 XT (16 GB · consumer) | Won’t fit · out of memory | Won’t fit · out of memory | Won’t fit · out of memory | Won’t fit · out of memory | Won’t fit · out of memory |
| Nvidia RTX 5080 (16 GB · consumer) | Won’t fit · out of memory | Won’t fit · out of memory | Won’t fit · out of memory | Won’t fit · out of memory | Won’t fit · out of memory |
| Apple MacBook Pro M5 (24 GB unified · base) | Won’t fit · marginal on 24 GB | Won’t fit · marginal on 24 GB | Won’t fit · marginal on 24 GB | Won’t fit · marginal on 24 GB | Won’t fit · marginal on 24 GB |
| Apple MacBook Air M4 (32 GB unified · Air) | Limited · 3.7 tok/s | Limited · 3.7 tok/s | Limited · 3.2 tok/s @ depth 8192 | Limited · 3.7 tok/s sustained | Limited · 3.7 tok/s |
Qwen3-32B Q4_K_M — measured throughput
Raw numbers. All values in tok/s. The 32B is ~18 GB before KV cache, so 16 GB cards and unified-memory Macs with only ~17 GB usable (the 24 GB M5) cannot load it; those rows show 0.
| GPU / backend | pp512 @ depth 0 | pp2048 @ depth 0 | tg128 @ depth 0 | tg128 @ depth 2048 | tg128 @ depth 8192 |
|---|---|---|---|---|---|
| AMD AI Pro R9700 (ROCm · 32 GB) | 921 | 852 | 24.9 | 24.4 | 22.9 |
| AMD AI Pro R9700 (Vulkan · 32 GB) | 927 | 856 | 25.0 | 24.4 | 22.9 |
| AMD RX 9070 XT (ROCm · 16 GB) | 0 | 0 | 0 | 0 | 0 |
| AMD RX 9070 XT (Vulkan · 16 GB) | 0 | 0 | 0 | 0 | 0 |
| Nvidia RTX 5080 (CUDA · 16 GB) | 0 | 0 | 0 | 0 | 0 |
| Apple MacBook Pro M5 (Metal · 24 GB unified) | 0 | 0 | 0 | 0 | 0 |
| Apple MacBook Air M4 (Metal · 32 GB unified) | 29 | 28 | 3.7 | 3.7 | 3.2 |
5 reps + 1 warmup discarded · flash attention on · ngl=99. 0 = model did not fit in available VRAM.
A few interesting observations
The R9700 and the RX 9070 XT use the same chip
The numbers show it. The RX 9070 XT is roughly 1 to 5 percent faster in every test, which is just the consumer card running slightly higher clocks. There is no meaningful inference advantage to the workstation card on raw speed.
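The gap between the two AMD cards can be recomputed from the raw 8B and 14B tables. The snippet below is a small sketch of that arithmetic; the dictionary layout and variable names are purely illustrative, and the values are copied from the measured-throughput tables in this post.

```python
# Column order: pp512, pp2048, tg128 @ depth 0, tg128 @ 2048, tg128 @ 8192.
workstation = {  # AI Pro R9700, 32 GB
    ("8B", "rocm"): [3891, 3721, 94.3, 90.2, 79.9],
    ("8B", "vulkan"): [3978, 3795, 95.3, 91.1, 80.6],
    ("14B", "rocm"): [2139, 1919, 52.7, 50.6, 45.3],
    ("14B", "vulkan"): [2168, 1943, 53.1, 50.8, 45.4],
}
consumer = {  # 16 GB consumer card
    ("8B", "rocm"): [4053, 3878, 96.6, 92.3, 81.5],
    ("8B", "vulkan"): [4039, 3868, 96.6, 92.3, 81.6],
    ("14B", "rocm"): [2235, 2007, 54.5, 52.3, 47.1],
    ("14B", "vulkan"): [2226, 2003, 54.5, 52.3, 47.1],
}

# Percentage lead of the consumer card in every matching test.
gaps = [
    100.0 * (c / w - 1.0)
    for key in workstation
    for w, c in zip(workstation[key], consumer[key])
]

print(f"consumer card lead: {min(gaps):.1f}% to {max(gaps):.1f}%")
```

Comparing like for like (same model, same backend, same test) keeps backend differences out of the result.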
Where the R9700 earns its place is the 32 GB. It’s the only AMD card here that loads the 32B model at all, and it runs it at a reasonable 24.9 tokens per second. The RX 9070 XT runs out of memory and can’t load it. So the R9700 is not the faster card; it’s the card that can hold larger models. For local inference, more memory means room for larger models and longer context.
ROCm and Vulkan perform the same on RDNA 4
This one surprised me a little. I expected ROCm to have a clear lead, but on both cards Vulkan lands within about 1 percent of ROCm on token generation and about 2 percent on prompt processing, and it is occasionally ahead.
For anyone who struggled with the ROCm install, that is good news. On these cards you can run the Vulkan backend and lose effectively nothing, which is a lot less setup than getting ROCm working.
Comparing the two AMD cards with the Nvidia RTX 5080
Is this a fair comparison? No, it’s not. But it shows a few interesting things when it comes to memory. The RX 9070 XT has the same 16 GB of memory as the 5080. These cards are aimed at different segments of the market: the RX 9070 XT is a mid-tier consumer card and the RTX 5080 is a high-tier consumer card. The 9070 XT is half the price of the 5080.
The AI Pro R9700 32 GB is a workstation-class card but costs around the same as the 5080. It has double the memory, so larger models and longer context can fit.
The M5 is a big jump over the M4 on compute, but not on bandwidth
The Apple side is where it gets interesting. Look at prompt processing. The M5 is 3.4 times faster than the M4 on the 8B (664 vs 198) and almost 5 times faster on the 14B (343 vs 72). That is a huge generational jump.
But token generation tells a different story. There the M5 is only around 20 to 60 percent faster than the M4. Prompt processing is compute bound and token generation is bandwidth bound, so what this says is the M5 made a large leap in GPU compute while memory bandwidth went up far less. The M5 feels much quicker at reading a prompt, only a bit quicker at writing the answer.
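To put the two sides next to each other, the sketch below computes the M5-over-M4 speedup for prompt processing and token generation at depth 0, using the figures from the raw tables in this post. The data layout is only illustrative.

```python
# pp512 and tg128 at depth 0, in tok/s, from the measured tables.
m4 = {"8B": {"pp512": 198, "tg128": 20.8}, "14B": {"pp512": 72, "tg128": 8.7}}
m5 = {"8B": {"pp512": 664, "tg128": 25.3}, "14B": {"pp512": 343, "tg128": 13.9}}

# Speedup factor of the M5 over the M4 for each model and test.
speedups = {
    model: {test: m5[model][test] / m4[model][test] for test in m4[model]}
    for model in m4
}

for model, tests in speedups.items():
    print(
        f"{model}: prompt processing {tests['pp512']:.2f}x, "
        f"token generation {tests['tg128']:.2f}x"
    )
```

The prompt-processing factor is several times larger than the token-generation factor for both models, which is the compute versus bandwidth split described above.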
With that said, there are other models out there that are optimized for Apple silicon, and you would not necessarily run the models I’ve tested here on Macs anyway.
The 32B reality check
The R9700 runs the 32B at 24.9 tokens per second, which is fine for actual use. The M4 Air with its 32 GB technically loads it too, but at 3.68 tokens per second it’s not even worth running. The M5 base and the RX 9070 XT can’t load it at all: the RX 9070 XT tops out at 16 GB, and the 24 GB M5 has only about 17 GB usable for the model.
There’s a small irony in there. The older M4 Air with 32 GB runs a model the newer M5 MacBook Pro can’t, purely because it has more memory. Capacity and speed are two different things, and it’s worth keeping that in mind when shopping around.
That’s it for this time. If this was useful and you want more of the same, you can find me on X (Twitter) @timmyitdotcom, Bluesky @timmyit.com, or over on LinkedIn.