Local LLM Server with Dual AMD R9700 32GB – Part 2: Performance

I’ve spent the last months researching and building a dedicated inference workstation that can handle some serious local AI tasks. This is the second part in that series. The first article where I covered the hardware can be found here:


Since I put together my local LLM inference rig with dual Radeon AI Pro R9700 cards I kept wondering how they actually stack up against the consumer RX 9700 XT, Nvidia RTX 5080 and Apple Silicon machines I had access to. Rather than keep guessing I built a small benchmark suite so I could measure it properly and re-run it whenever a new card shows up. For the test I only used a single R9700. I will do a future article on performance for single vs dual cards and the pros & cons.

There are a few things I think is important to understand and to keep in mind when reading this post. This was just a quick test I wanted to perform on different hardware to see what differences there were and to have something to compare it against. This is no mean a perfect test but its something and I can continue to work on it and improve the test suite over time. What I learned during my testing was that I really enjoyed the testing part. It was fun and interesting to work on the benchmark and see the results and also work on how to visualize it.

Also important to mention is that the test and numbers does not take quality in to account as thats more model specific than performance. So higher number might be faster but it does not mean its more or less accurate. Different models are trained and optimized for different things. Some are aimed towards coding and others might be vision or general chat. What a model is good at depends on what its been trained on and for what purpose its trying to achieve.

Test setup and LLM models

Everything runs through llama-bench from llama.cpp. Its the standard tool for this kind of thing and it outputs the same numbers regardless of platform, so the results are comparable across AMD, Nvidia, Apple and whatever I add later.

I went with Qwen3 as the model to perform the tests on. 8B, 14B and 32B parameters seemed to me like a good way of testing different workloads as the size of the model increase.
These three models are all Q4_K_M. I went with Quantization of 4 as its a good middle ground. Below 4 you generally loose too much quality and above 4 its marginally better but at the cost of size.

Qwen3-8B

Qwen3-14B

Qwen3-32B

Two things get measured:

  • Prompt processing (pp512) is how fast the card reads a prompt. This is compute bound.
  • Token generation (tg128) is how fast it writes the answer. This is memory bandwidth bound, and its the number you actually feel when you sit and chat with a model.

Each test runs at context depths of 0, 2048 and 8192 tokens, 5 repetitions, flash attention on, all layers on the GPU.

Quick overview and handpicked dataset

Here’s an overview for anyone who doesn’t want to read through everything and just want to get a quick glance on what to expect with the different GPUs at a handpicked dataset that gives a good understanding on the results.

How the ratings are decided

Every rating comes down to two questions: does the model fit in memory, and once it fits, how fast does it generate?

1.  Does it fit in memory?

A model that doesn’t fit can’t run at all. That’s an automatic “Won’t fit”, whatever the speed. Each bar is the usable memory a card can give a model; unified-memory machines lose some to the operating system as the system shares memory between RAM and the GPU.

8B~7 GB 14B~11 GB 32B~23 GB
AMD AI Pro R970032GB
32 GB
AMD RX 9700 XT16GB
16 GB
Nvidia RTX 508016GB
16 GB
Apple Macbook Pro M524 GB unified
~17 GB
Apple Macbook Air M432 GB unified
~25 GB

A model fits only if its dashed line sits inside the coloured bar. The 32B needs more than 23 GB, so only the R9700 and the 32 GB M4 reach it

2.  How fast does it generate?

Once a model fits, generation speed sets the experience. Tokens per second, measured at tg128.

Limitedunder 15
OK15–24.9
Good25–39.9
Excellent40+
0 15 25 40 60+ tok/s reading speed

Reading speed sits around 6 tok/s, so even “Limited” is faster than you read. It just feels sluggish for back-and-forth. The same bands apply at every model size: 25 tok/s on a 32B and 25 tok/s on an 8B both land in “Good”. Bigger models do not get a curve. For the long-context column (Document RAG), the band is applied to tg128 @ depth 8192 instead of depth 0. Same bands, deeper context.

What can each GPU actually run?

Local LLM inference benchmarked on Qwen3 at Q4_K_M. Ratings come from measured throughput, not spec sheets.

GPU / workload Quick chat8B · real-time Code assistant14B · IDE context Document RAG14B · 16K context Long-form output14B · sustained Heavy reasoning32B · slower OK
AMD AI Pro R970032 GB · workstation Excellent95 tok/s Excellent53 tok/s Excellent45 tok/s @ depth 8192 Excellent53 tok/s OK24.9 tok/s
AMD RX 9700 XT16 GB · consumer Excellent96 tok/s Excellent54 tok/s Excellent47 tok/s @ depth 8192 Excellent54 tok/s Won’t fitout of memory
Nvidia RTX 508016 GB · consumer Excellent160 tok/s Excellent93 tok/s Excellent81 tok/s @ depth 8192 Excellent93 tok/s Won’t fitout of memory
Apple Macbook Pro M524 GB unified · base Good25 tok/s Limited14 tok/s Limited12 tok/s @ depth 8192 Limited14 tok/s Won’t fitmarginal on 24 GB
Apple Macbook Air M432 GB unified · Air OK21 tok/s Limited8.7 tok/s Limited7 tok/s @ depth 8192 Limited8.7 tok/s Limited3.7 tok/s
Excellent Good OK Limited Won’t fit

Updated June 2026 · timmyit.com / LLM-Benchmark

Token generation

This is the one most people care about.

Token generation is how fast an LLM produces its answer, measured in tokens per second where a token is roughly half a short word. It’s the speed you actually feel sitting there waiting for a response, so it matters more than the headline specs. A card pushing 40-50 tok/s feels instant, while anything under 10 starts to drag when you’re going back and forth with the model. Personally around 25 tok/s feels okay-ish but anything under 15 tok/s to me feels unusable.

Tokens per second at empty context

Token generation, by model size

Tokens per second at empty context (tg128, depth 0). Deeper green is faster; greyed cells are models that don’t fit.

GPU / backend
8BQwen3 Q4_K_M
14BQwen3 Q4_K_M
32BQwen3 Q4_K_M
R9700 32GBROCm
94.3
52.7
24.9
R9700 32GBVulkan
95.3
53.1
25.0
RX 9700 XT 16GBROCm
96.6
54.5
RX 9700 XT 16GBVulkan
96.6
54.5
RTX 5080 16GBCUDA
159.6
93.2
M5 16GBMetal
25.3
13.9
M4 32GBMetal
20.8
8.7
3.7

Each AMD card shows its ROCm and Vulkan results on separate rows. They land on the same colour because they’re within 1% of each other. That’s the point: the backend barely matters on RDNA 4.

0 25 50 75 100 tok/s reading speed faster than you read

tg128 at depth 0 · Qwen3 Q4_K_M · timmyit.com / LLM-Benchmark

Prompt processing

Prompt processing is the time the model spends reading your input before it starts answering, measured in tokens per second like generation but covering everything you send in. It matters most when you’re feeding it big context, a long document, a chunk of code, or a chat history that keeps growing, since that’s where the wait piles up. The good thing is you only pay it once at the start of the turn, so a slower number here stings less than slow token generation does.

Tokens per second processing a 512 token prompt at empty context:

Prompt processing, by model size

Tokens per second ingesting the prompt (pp512, depth 0). This is how fast a card reads your input before it starts answering. Deeper green is faster; greyed cells are models that don’t fit.

GPU / backend
8BQwen3 Q4_K_M
14BQwen3 Q4_K_M
32BQwen3 Q4_K_M
AMD AI Pro R9700ROCm
3,891
2,139
921
AMD AI Pro R9700Vulkan
3,978
2,168
927
AMD RX 9700 XTROCm
4,053
2,235
AMD RX 9700 XTVulkan
4,039
2,226
Nvidia RTX 5080CUDA
8,342
4,638
Apple Macbook Pro M524 GB, Metal
664
343
Apple Macbook Air M432 GB, Metal
198
72
29

As with token generation, each AMD card’s ROCm and Vulkan rows land on the same colour: within about 2% of each other, so the backend choice barely moves prompt processing either.

0 1,000 2,000 3,000 4,000 tok/s

Prompt processing decides how long before the answer starts. A 2,000-word file is roughly 2,800 tokens: the RTX 5080 ingests it in about a third of a second on the 8B, the R9700 takes under a second, and the M4 takes around 14 seconds on the 8B and over a minute and a half on the 32B. This is the compute-bound side of inference, and it is where the discrete cards pull furthest ahead.

pp512 at depth 0 · Qwen3 Q4_K_M · timmyit.com / LLM-Benchmark

Model size specific data

Now lets have a look at the specific models 8B,14B & 32B and the performance in relation to different workloads:

Quick chat — Casual back-and-forth with a small model: ask, answer, ask again.

Code assistant — Coding help in your editor, where the speed needs to keep up with how fast you type and think.

Document RAG — Asking questions about your own documents (PDFs, notes, internal docs), so the model has to read a lot of text before it can answer.

Long-form output — Generating longer pieces like articles, reports, summaries without the speed dropping off partway through.

Heavy reasoning — Hard problems where you’d rather wait for a good answer than rush a bad one.

Qwen3-8B Q4_K_M

Qwen3-8B Q4_K_M (~7 GB): The small one. It loads in a second or two and leaves almost all of the VRAM free for context, so it’s what you can run when you just want a fast answer for summaries or quick drafts and don’t need it to be the best one.

Qwen3-8B Q4_K_M — workload fitness

How each GPU handles the five canonical workloads with the 8B model.

GPU / workload Quick chat8B · real-time Code assistant8B · IDE context Document RAG8B · 16K context Long-form output8B · sustained Heavy reasoning8B · slower OK
AMD AI Pro R970032 GB · workstation Excellent94 tok/s Excellent94 tok/s Excellent80 tok/s @ depth 8192 Excellent94 tok/s sustained Excellent94 tok/s
AMD RX 9700 XT16 GB · consumer Excellent97 tok/s Excellent97 tok/s Excellent82 tok/s @ depth 8192 Excellent97 tok/s sustained Excellent97 tok/s
Nvidia RTX 508016 GB · consumer Excellent160 tok/s Excellent160 tok/s Excellent130 tok/s @ depth 8192 Excellent160 tok/s sustained Excellent160 tok/s
Apple Macbook Pro M524 GB unified · base Good25 tok/s Good25 tok/s OK20 tok/s @ depth 8192 Good25 tok/s sustained Good25 tok/s
Apple Macbook Air M432 GB unified · Air OK21 tok/s OK21 tok/s Limited13 tok/s @ depth 8192 OK21 tok/s sustained OK21 tok/s
Excellent Good OK Limited Won’t fit TBD

Updated June 2026 · timmyit.com / LLM-Benchmark

Qwen3-8B Q4_K_M — measured throughput

Raw numbers. All values in tok/s.

GPU / backend Prompt processing @ depth 0 Token generation (tg128) by depth
pp512 pp2048 0 2048 8192
R9700rocm · 32 GB 3891 3721 94.3 90.2 79.9
R9700vulkan · 32 GB 3978 3795 95.3 91.1 80.6
RX 9700 XTrocm · 16 GB 4053 3878 96.6 92.3 81.5
RX 9700 XTvulkan · 16 GB 4039 3868 96.6 92.3 81.6
RTX 5080cuda · 16 GB 8342 8153 159.6 149.8 130.2
MacBook Pro M5metal · 24 GB unified 664 616 25.3 23.9 20.4
MacBook Air M4metal · 32 GB unified 198 171 20.8 15.4 12.5

5 reps + 1 warmup discarded · flash attention on · ngl=99. Updated June 2026 · timmyit.com / LLM-Benchmark

Qwen3-14B Q4_K_M

Qwen3-14B Q4_K_M (~11 GB): The middle one. Better reasoning than the 8B but still small enough to stay fast, so this is the one you can use when you want a decent answer without the slower turnaround of the 32B.

Qwen3-14B Q4_K_M — workload fitness

How each GPU handles the five canonical workloads with the 14B model.

GPU / workload Quick chat14B · real-time Code assistant14B · IDE context Document RAG14B · 16K context Long-form output14B · sustained Heavy reasoning14B · slower OK
AMD AI Pro R970032 GB · workstation Excellent53 tok/s Excellent53 tok/s Excellent45 tok/s @ depth 8192 Excellent53 tok/s sustained Excellent53 tok/s
AMD RX 9700 XT16 GB · consumer Excellent54 tok/s Excellent54 tok/s Excellent47 tok/s @ depth 8192 Excellent54 tok/s sustained Excellent54 tok/s
Nvidia RTX 508016 GB · consumer Excellent93 tok/s Excellent93 tok/s Excellent81 tok/s @ depth 8192 Excellent93 tok/s sustained Excellent93 tok/s
Apple Macbook Pro M524 GB unified · base Limited14 tok/s Limited14 tok/s Limited12 tok/s @ depth 8192 Limited14 tok/s sustained Limited14 tok/s
Apple Macbook Air M432 GB unified · Air Limited8.7 tok/s Limited8.7 tok/s Limited7 tok/s @ depth 8192 Limited8.7 tok/s sustained Limited8.7 tok/s
Excellent Good OK Limited Won’t fit TBD

Updated June 2026 · timmyit.com / LLM-Benchmark

Qwen3-14B Q4_K_M — measured throughput

Raw numbers. All values in tok/s.

GPU / backend Prompt processing @ depth 0 Token generation (tg128) by depth
pp512 pp2048 0 2048 8192
AMD AI Pro R9700rocm · 32 GB 2139 1919 52.7 50.6 45.3
AMD AI Pro R9700vulkan · 32 GB 2168 1943 53.1 50.8 45.4
AMD RX 9700 XTrocm · 16 GB 2235 2007 54.5 52.3 47.1
AMD RX 9700 XTvulkan · 16 GB 2226 2003 54.5 52.3 47.1
Nvidia RTX 5080cuda · 16 GB 4638 4493 93.2 89.6 81.3
Apple Macbook Pro M5metal · 24 GB unified 343 307 13.9 13.4 12.0
Apple Macbook Air M4metal · 32 GB unified 72 71 8.7 8.3 7.1

5 reps + 1 warmup discarded · flash attention on · ngl=99. Updated June 2026 · timmyit.com / LLM-Benchmark

Qwen3-32B Q4_K_M

Qwen3-32B Q4_K_M (~23 GB): The biggest dense Qwen3 of the three, and the one I pick when I care more about the quality of the answer than how fast it comes back. At around 23 GB it still fits on a single 32 GB card with room left over for context, so there’s not much reason to skip it when you’re not in a hurry.

Qwen3-32B Q4_K_M — workload fitness

How each GPU handles the five canonical workloads with the 32B model. The 32B is ~18 GB before KV cache, so 16 GB cards and ~17 GB-usable unified-memory Macs cannot load it, those rows are Won’t fit across the board.

GPU / workload Quick chat32B · real-time Code assistant32B · IDE context Document RAG32B · 16K context Long-form output32B · sustained Heavy reasoning32B · slower OK
AMD AI Pro R970032 GB · workstation OK24.9 tok/s OK24.9 tok/s OK22.9 tok/s @ depth 8192 OK24.9 tok/s sustained OK24.9 tok/s
AMD RX 9700 XT16 GB · consumer Won’t fitout of memory Won’t fitout of memory Won’t fitout of memory Won’t fitout of memory Won’t fitout of memory
Nvidia RTX 508016 GB · consumer Won’t fitout of memory Won’t fitout of memory Won’t fitout of memory Won’t fitout of memory Won’t fitout of memory
Apple Macbook Pro M524 GB unified · base Won’t fitmarginal on 24 GB Won’t fitmarginal on 24 GB Won’t fitmarginal on 24 GB Won’t fitmarginal on 24 GB Won’t fitmarginal on 24 GB
Apple Macbook Air M432 GB unified · Air Limited3.7 tok/s Limited3.7 tok/s Limited3.2 tok/s @ depth 8192 Limited3.7 tok/s sustained Limited3.7 tok/s
Excellent Good OK Limited Won’t fit TBD

Updated June 2026 · timmyit.com / LLM-Benchmark

Qwen3-32B Q4_K_M — measured throughput

Raw numbers. All values in tok/s. The 32B is ~18 GB before KV cache, so 16 GB cards and ~17 GB-usable unified-memory Macs cannot load it, those rows show 0.

GPU / backend Prompt processing @ depth 0 Token generation (tg128) by depth
pp512 pp2048 0 2048 8192
AMD AI Pro R9700rocm · 32 GB 921 852 24.9 24.4 22.9
AMD AI Pro R9700vulkan · 32 GB 927 856 25.0 24.4 22.9
AMD RX 9700 XTrocm · 16 GB 0 0 0 0 0
AMD RX 9700 XTvulkan · 16 GB 0 0 0 0 0
Nvidia RTX 5080cuda · 16 GB 0 0 0 0 0
Apple Macbook Pro M5metal · 24 GB unified 0 0 0 0 0
Apple Macbook Air M4metal · 32 GB unified 29 28 3.7 3.7 3.2

5 reps + 1 warmup discarded · flash attention on · ngl=99. 0 = model did not fit in available VRAM. Updated June 2026 · timmyit.com / LLM-Benchmark

Few interesting observations

The R9700 and the RX 9700 XT use the same chip

The numbers show it. The RX 9700 XT is 2 to 4 percent faster across every test, which is just the consumer card running slightly higher clocks. There is no meaningful inference advantage to the workstation card on raw speed.

Where the R9700 earns its place is the 32GB. Its the only AMD card here that loads the 32B model at all, and it runs it at a reasonable 24.9 tokens per second. The RX 9700 XT runs out of memory and cant load it. So the R9700 is not the faster card, its the card that can handle larger data. For local inference bigger means being able to store larger context and or handle larger datasets.

ROCm and Vulkan perform the same on RDNA4

This one surprised me a little. I expected ROCm to have a clear lead, but on both cards Vulkan lands within about 1 percent of ROCm and is occasionally ahead on prompt processing.

For anyone who struggled with ROCm install, that is good news. On these cards you can run the Vulkan backend and lose effectively nothing, which is a lot less setup than getting ROCm working.

Comparing the two 9700 with Nvidia RTX 5080

Is this a fair comparison? No its not. But it shows a few interesting things when it comes to memory. The RX 9700 XT have the same amount of 16GB memory as the 5080. These cards are aimed towards different segments of the market. the RX 9700 XT is a mid-tier consumer card and the RTX 5080 is a high-tier consumer card. The 9700 XT is half the price of the 5080.


The AI Pro R9700 32GB is a workstations class card but cost around the same as the 5080. It has double the memory amount so larger models and larger context can fit.

The M5 is a big jump over the M4 on compute, but not on bandwidth

The Apple side is where it gets interesting. Look at prompt processing. The M5 is 3.4 times faster than the M4 on the 8B (664 vs 198) and almost 5 times faster on the 14B (343 vs 72). That is a huge generational jump.

But token generation tells a different story. There the M5 is only around 20 to 60 percent faster than the M4. Prompt processing is compute bound and token generation is bandwidth bound, so what this says is the M5 made a large leap in GPU compute while memory bandwidth went up far less. The M5 feels much quicker at reading a prompt, only a bit quicker at writing the answer.

With that said, there are other models out there that are optimized for Apple silicon and you would not necessary run these models I’ve tested here on Macs anyway.

The 32B reality check

The R9700 runs the 32B at 24.9 tokens per second, which is fine for actually use. The M4 Air with its 32GB technically loads it too, but at 3.68 tokens per second its not worth even running. The M5 base and the RX 9700 XT cant load it at all because of the 16GB ceiling.

Theres a small irony in there. The older M4 Air with 32GB runs a model the newer M5 Pro cant, purely because it has more memory. Capacity and speed are two different things and its worth keeping that in mind when shopping around.

That’s it for this time. If this was useful and you want more of the same, you can find me on X (twitter) @timmyitdotcom, BlueSky @timmyit.com, or over on LinkedIn

One comment

Leave a Reply