Benchmarking Local LLM Inference on AMD and Nvidia GPUs

I’ve spent the last few months researching and building a dedicated inference workstation that can handle some serious local AI tasks. This is the second part of that series. The first article, where I covered the hardware, can be found here:

Building a local AI workstation with Dual AMD AI Pro R9700 32GB – Part 1: Hardware

Since I put together my local LLM inference rig with dual Radeon AI Pro R9700 cards, I kept wondering how they actually stack up against the consumer RX 9700 XT, the Nvidia RTX 5080 and the Apple Silicon machines I had access to. Rather than keep guessing, I built a small benchmark suite so I could measure it properly and re-run it whenever a new card shows up. For this test I only used a single R9700. I will do a future article on single vs dual card performance and the pros and cons.

There are a few things I think are important to understand and keep in mind when reading this post. This was just a quick test I wanted to run on different hardware to see what differences there were and to have something to compare against. This is by no means a perfect test, but it’s something, and I can continue to improve the test suite over time. What I learned during my testing was that I really enjoyed the testing part. It was fun and interesting to work on the benchmark, see the results and figure out how to visualize them.

Also important to mention is that the tests and numbers do not take quality into account, as that depends more on the model than on the hardware. A higher number means faster, but it does not mean more or less accurate. Different models are trained and optimized for different things. Some are aimed at coding and others at vision or general chat. What a model is good at depends on what it’s been trained on and what purpose it’s meant to serve.

Test setup and LLM models

Everything runs through llama-bench from llama.cpp. It’s the standard tool for this kind of thing and it reports the same metrics regardless of platform, so the results are comparable across AMD, Nvidia, Apple and whatever I add later.

I went with Qwen3 as the model family to run the tests on. 8B, 14B and 32B parameters seemed like a good way to test different workloads as the model size increases.
These three models are all Q4_K_M. I went with 4-bit quantization as it’s a good middle ground. Below 4 bits you generally lose too much quality, and above 4 bits it’s marginally better but at the cost of size.

– Qwen3-8B

– Qwen3-14B

– Qwen3-32B

Two things get measured:

Prompt processing (pp512) is how fast the card reads a 512-token prompt. This is compute bound.
Token generation (tg128) is how fast it writes a 128-token answer. This is memory bandwidth bound, and it’s the number you actually feel when you sit and chat with a model.

Each test runs at context depths of 0, 2048 and 8192 tokens, 5 repetitions, flash attention on, all layers on the GPU.
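A minimal sketch of how these settings could be passed to llama-bench, assuming a recent llama.cpp build that supports the depth flag. The model path is a placeholder, the helper name is made up for illustration, and this is not the exact command used for the results in this post.

```python
import shlex


def build_llama_bench_cmd(model_path, depths=(0, 2048, 8192), reps=5):
    """Build a llama-bench argument list for the settings described above."""
    return [
        "llama-bench",
        "-m", model_path,                        # GGUF model file (placeholder path)
        "-p", "512",                             # prompt processing test: pp512
        "-n", "128",                             # token generation test: tg128
        "-d", ",".join(str(d) for d in depths),  # context depths to test at
        "-r", str(reps),                         # repetitions per test
        "-fa", "1",                              # flash attention on
        "-ngl", "99",                            # offload all layers to the GPU
        "-o", "json",                            # machine-readable output
    ]


# Hypothetical file name; point this at wherever the GGUF file actually lives.
cmd = build_llama_bench_cmd("models/Qwen3-8B-Q4_K_M.gguf")
print(shlex.join(cmd))
```

Building the command as a list keeps it easy to loop over the three model files and hand each one to subprocess.run without shell quoting problems.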

Quick overview and handpicked dataset

Here’s an overview for anyone who doesn’t want to read through everything and just wants a quick glance at what to expect from the different GPUs, using a handpicked dataset that gives a good understanding of the results.

How the ratings are decided

Every rating comes down to two questions: does the model fit in memory, and once it fits, how fast does it generate?

1. Does it fit in memory?

A model that doesn’t fit can’t run at all. That’s an automatic “Won’t fit”, whatever the speed. The figures below are the usable memory each machine can give a model; unified-memory machines lose some to the operating system, since the same memory pool is shared between the CPU and the GPU.

Approximate model sizes: 8B ~7 GB, 14B ~11 GB, 32B ~23 GB.

GPU	Memory	Usable for a model
AMD AI Pro R9700	32 GB	32 GB
AMD RX 9700 XT	16 GB	16 GB
Nvidia RTX 5080	16 GB	16 GB
Apple MacBook Pro M5	24 GB unified	~17 GB
Apple MacBook Air M4	32 GB unified	~25 GB

A model fits only if its size is within the usable memory. The 32B needs about 23 GB, so only the R9700 and the 32 GB M4 can load it.
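The fit rule can be sketched in a few lines. This is a simplification that treats the approximate sizes listed above as hard limits; in practice a long context needs extra memory on top, so a model sitting right at the limit may still fail to load.

```python
# Approximate memory each model needs, in GB (figures listed above).
MODEL_GB = {"8B": 7, "14B": 11, "32B": 23}

# Usable memory per machine, in GB (figures listed above).
USABLE_GB = {
    "AMD AI Pro R9700": 32,
    "AMD RX 9700 XT": 16,
    "Nvidia RTX 5080": 16,
    "MacBook Pro M5 (24 GB unified)": 17,
    "MacBook Air M4 (32 GB unified)": 25,
}


def fits(model, gpu):
    """Return True if the model's footprint is within the usable memory."""
    return MODEL_GB[model] <= USABLE_GB[gpu]


for gpu in USABLE_GB:
    runnable = [m for m in MODEL_GB if fits(m, gpu)]
    print(f"{gpu}: {', '.join(runnable)}")
```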

2. How fast does it generate?

Once a model fits, generation speed sets the experience. Tokens per second, measured at tg128.

Rating	Tokens per second
Limited	under 15
OK	15–24.9
Good	25–39.9
Excellent	40+

Reading speed sits around 6 tok/s, so even “Limited” is faster than you read. It just feels sluggish for back-and-forth. The same bands apply at every model size: 25 tok/s on a 32B and 25 tok/s on an 8B both land in “Good”. Bigger models are not graded on a curve. For the long-context column (Document RAG), the band is applied to tg128 @ depth 8192 instead of depth 0. Same bands, deeper context.
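Put together, the two questions boil down to a small lookup. The sketch below uses the band limits given above; the function name and the use of None for a model that doesn’t fit are choices made for illustration only.

```python
def rate(tok_per_s):
    """Map a tg128 result in tokens per second to a rating band.

    Pass None for a model that does not fit in memory.
    """
    if tok_per_s is None:
        return "Won't fit"
    if tok_per_s >= 40:
        return "Excellent"
    if tok_per_s >= 25:
        return "Good"
    if tok_per_s >= 15:
        return "OK"
    return "Limited"


# A few tg128 results at depth 0 taken from the tables in this post.
for label, speed in [("R9700, 32B", 24.9), ("M5, 8B", 25.3), ("RTX 5080, 32B", None)]:
    print(f"{label}: {rate(speed)}")
```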

What can each GPU actually run?

Local LLM inference benchmarked on Qwen3 at Q4_K_M. Ratings come from measured throughput, not spec sheets.

GPU / workload	Quick chat (8B · real-time)	Code assistant (14B · IDE context)	Document RAG (14B · 16K context)	Long-form output (14B · sustained)	Heavy reasoning (32B · slower OK)
AMD AI Pro R9700 (32 GB · workstation)	Excellent, 95 tok/s	Excellent, 53 tok/s	Excellent, 45 tok/s @ depth 8192	Excellent, 53 tok/s	OK, 24.9 tok/s
AMD RX 9700 XT (16 GB · consumer)	Excellent, 96 tok/s	Excellent, 54 tok/s	Excellent, 47 tok/s @ depth 8192	Excellent, 54 tok/s	Won’t fit (out of memory)
Nvidia RTX 5080 (16 GB · consumer)	Excellent, 160 tok/s	Excellent, 93 tok/s	Excellent, 81 tok/s @ depth 8192	Excellent, 93 tok/s	Won’t fit (out of memory)
Apple MacBook Pro M5 (24 GB unified · base)	Good, 25 tok/s	Limited, 14 tok/s	Limited, 12 tok/s @ depth 8192	Limited, 14 tok/s	Won’t fit (marginal on 24 GB)
Apple MacBook Air M4 (32 GB unified · Air)	OK, 21 tok/s	Limited, 8.7 tok/s	Limited, 7 tok/s @ depth 8192	Limited, 8.7 tok/s	Limited, 3.7 tok/s

Ratings, best to worst: Excellent, Good, OK, Limited, Won’t fit.

Updated June 2026 · timmyit.com / LLM-Benchmark

Token generation

This is the one most people care about.

Token generation is how fast an LLM produces its answer, measured in tokens per second, where a token is roughly three quarters of a word. It’s the speed you actually feel sitting there waiting for a response, so it matters more than the headline specs. A card pushing 40-50 tok/s feels instant, while anything under 10 starts to drag when you’re going back and forth with the model. Personally, around 25 tok/s feels okay-ish, but anything under 15 tok/s feels unusable to me.

Tokens per second at empty context

Token generation, by model size

Tokens per second at empty context (tg128, depth 0), Qwen3 Q4_K_M. Cells marked n/a are models that don’t fit.

GPU / backend	8B	14B	32B
R9700 32GB (ROCm)	94.3	52.7	24.9
R9700 32GB (Vulkan)	95.3	53.1	25.0
RX 9700 XT 16GB (ROCm)	96.6	54.5	n/a
RX 9700 XT 16GB (Vulkan)	96.6	54.5	n/a
RTX 5080 16GB (CUDA)	159.6	93.2	n/a
M5 24GB (Metal)	25.3	13.9	n/a
M4 32GB (Metal)	20.8	8.7	3.7

Each AMD card shows its ROCm and Vulkan results on separate rows. They’re within about 1% of each other. That’s the point: the backend barely matters on RDNA 4.

tg128 at depth 0 · Qwen3 Q4_K_M · timmyit.com / LLM-Benchmark

Prompt processing

Prompt processing is how fast the model reads your input before it starts answering, measured in tokens per second like generation but covering everything you send in. It matters most when you’re feeding it a lot of context: a long document, a chunk of code, or a chat history that keeps growing, since that’s where the wait piles up. The good thing is you only pay it once at the start of the turn, so a slower number here stings less than slow token generation does.

Tokens per second processing a 512 token prompt at empty context:

Prompt processing, by model size

Tokens per second ingesting the prompt (pp512, depth 0), Qwen3 Q4_K_M. This is how fast a card reads your input before it starts answering. Cells marked n/a are models that don’t fit.

GPU / backend	8B	14B	32B
AMD AI Pro R9700 (ROCm)	3,891	2,139	921
AMD AI Pro R9700 (Vulkan)	3,978	2,168	927
AMD RX 9700 XT (ROCm)	4,053	2,235	n/a
AMD RX 9700 XT (Vulkan)	4,039	2,226	n/a
Nvidia RTX 5080 (CUDA)	8,342	4,638	n/a
Apple MacBook Pro M5 (24 GB, Metal)	664	343	n/a
Apple MacBook Air M4 (32 GB, Metal)	198	72	29

As with token generation, each AMD card’s ROCm and Vulkan rows are within about 2% of each other, so the backend choice barely moves prompt processing either.

Prompt processing decides how long before the answer starts. A 2,000-word file is roughly 2,800 tokens: the RTX 5080 ingests it in about a third of a second on the 8B, the R9700 takes under a second, and the M4 takes around 14 seconds on the 8B and over a minute and a half on the 32B. This is the compute-bound side of inference, and it is where the discrete cards pull furthest ahead.
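The arithmetic behind those wait times is just prompt length divided by prompt processing speed. The sketch below assumes the pp512 rate holds for a longer prompt, which is optimistic since the pp2048 results are a little lower, so read the output as a lower bound.

```python
# pp512 results at depth 0 from the tables in this post, in tokens per second.
PP512 = {
    ("RTX 5080", "8B"): 8342,
    ("R9700 ROCm", "8B"): 3891,
    ("M4", "8B"): 198,
    ("M4", "32B"): 29,
}

PROMPT_TOKENS = 2800  # roughly a 2,000-word file


def ingest_seconds(prompt_tokens, pp_tok_per_s):
    """Estimate seconds spent reading a prompt before the first output token."""
    return prompt_tokens / pp_tok_per_s


for (gpu, model), speed in PP512.items():
    print(f"{gpu} {model}: {ingest_seconds(PROMPT_TOKENS, speed):.1f} s")
```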

pp512 at depth 0 · Qwen3 Q4_K_M · timmyit.com / LLM-Benchmark

Model size specific data

Now let’s have a look at the specific models (8B, 14B and 32B) and their performance in relation to different workloads:

Quick chat — Casual back-and-forth with a small model: ask, answer, ask again.

Code assistant — Coding help in your editor, where the speed needs to keep up with how fast you type and think.

Document RAG — Asking questions about your own documents (PDFs, notes, internal docs), so the model has to read a lot of text before it can answer.

Long-form output — Generating longer pieces like articles, reports, summaries without the speed dropping off partway through.

Heavy reasoning — Hard problems where you’d rather wait for a good answer than rush a bad one.

Qwen3-8B Q4_K_M

Qwen3-8B Q4_K_M (~7 GB): The small one. It loads in a second or two and leaves almost all of the VRAM free for context, so it’s what you can run when you just want a fast answer for summaries or quick drafts and don’t need it to be the best one.

Qwen3-8B Q4_K_M — workload fitness

How each GPU handles the five canonical workloads with the 8B model.

GPU / workload	Quick chat (8B · real-time)	Code assistant (8B · IDE context)	Document RAG (8B · 16K context)	Long-form output (8B · sustained)	Heavy reasoning (8B · slower OK)
AMD AI Pro R9700 (32 GB · workstation)	Excellent, 94 tok/s	Excellent, 94 tok/s	Excellent, 80 tok/s @ depth 8192	Excellent, 94 tok/s sustained	Excellent, 94 tok/s
AMD RX 9700 XT (16 GB · consumer)	Excellent, 97 tok/s	Excellent, 97 tok/s	Excellent, 82 tok/s @ depth 8192	Excellent, 97 tok/s sustained	Excellent, 97 tok/s
Nvidia RTX 5080 (16 GB · consumer)	Excellent, 160 tok/s	Excellent, 160 tok/s	Excellent, 130 tok/s @ depth 8192	Excellent, 160 tok/s sustained	Excellent, 160 tok/s
Apple MacBook Pro M5 (24 GB unified · base)	Good, 25 tok/s	Good, 25 tok/s	OK, 20 tok/s @ depth 8192	Good, 25 tok/s sustained	Good, 25 tok/s
Apple MacBook Air M4 (32 GB unified · Air)	OK, 21 tok/s	OK, 21 tok/s	Limited, 13 tok/s @ depth 8192	OK, 21 tok/s sustained	OK, 21 tok/s


Qwen3-8B Q4_K_M — measured throughput

Raw numbers. All values in tok/s.

GPU / backend	pp512 @ depth 0	pp2048 @ depth 0	tg128 @ depth 0	tg128 @ depth 2048	tg128 @ depth 8192
R9700 (ROCm · 32 GB)	3891	3721	94.3	90.2	79.9
R9700 (Vulkan · 32 GB)	3978	3795	95.3	91.1	80.6
RX 9700 XT (ROCm · 16 GB)	4053	3878	96.6	92.3	81.5
RX 9700 XT (Vulkan · 16 GB)	4039	3868	96.6	92.3	81.6
RTX 5080 (CUDA · 16 GB)	8342	8153	159.6	149.8	130.2
MacBook Pro M5 (Metal · 24 GB unified)	664	616	25.3	23.9	20.4
MacBook Air M4 (Metal · 32 GB unified)	198	171	20.8	15.4	12.5

5 reps + 1 warmup discarded · flash attention on · ngl=99. Updated June 2026 · timmyit.com / LLM-Benchmark

Qwen3-14B Q4_K_M

Qwen3-14B Q4_K_M (~11 GB): The middle one. Better reasoning than the 8B but still small enough to stay fast, so this is the one you can use when you want a decent answer without the slower turnaround of the 32B.

Qwen3-14B Q4_K_M — workload fitness

How each GPU handles the five canonical workloads with the 14B model.

GPU / workload	Quick chat (14B · real-time)	Code assistant (14B · IDE context)	Document RAG (14B · 16K context)	Long-form output (14B · sustained)	Heavy reasoning (14B · slower OK)
AMD AI Pro R9700 (32 GB · workstation)	Excellent, 53 tok/s	Excellent, 53 tok/s	Excellent, 45 tok/s @ depth 8192	Excellent, 53 tok/s sustained	Excellent, 53 tok/s
AMD RX 9700 XT (16 GB · consumer)	Excellent, 54 tok/s	Excellent, 54 tok/s	Excellent, 47 tok/s @ depth 8192	Excellent, 54 tok/s sustained	Excellent, 54 tok/s
Nvidia RTX 5080 (16 GB · consumer)	Excellent, 93 tok/s	Excellent, 93 tok/s	Excellent, 81 tok/s @ depth 8192	Excellent, 93 tok/s sustained	Excellent, 93 tok/s
Apple MacBook Pro M5 (24 GB unified · base)	Limited, 14 tok/s	Limited, 14 tok/s	Limited, 12 tok/s @ depth 8192	Limited, 14 tok/s sustained	Limited, 14 tok/s
Apple MacBook Air M4 (32 GB unified · Air)	Limited, 8.7 tok/s	Limited, 8.7 tok/s	Limited, 7 tok/s @ depth 8192	Limited, 8.7 tok/s sustained	Limited, 8.7 tok/s


Qwen3-14B Q4_K_M — measured throughput

Raw numbers. All values in tok/s.

GPU / backend	pp512 @ depth 0	pp2048 @ depth 0	tg128 @ depth 0	tg128 @ depth 2048	tg128 @ depth 8192
AMD AI Pro R9700 (ROCm · 32 GB)	2139	1919	52.7	50.6	45.3
AMD AI Pro R9700 (Vulkan · 32 GB)	2168	1943	53.1	50.8	45.4
AMD RX 9700 XT (ROCm · 16 GB)	2235	2007	54.5	52.3	47.1
AMD RX 9700 XT (Vulkan · 16 GB)	2226	2003	54.5	52.3	47.1
Nvidia RTX 5080 (CUDA · 16 GB)	4638	4493	93.2	89.6	81.3
Apple MacBook Pro M5 (Metal · 24 GB unified)	343	307	13.9	13.4	12.0
Apple MacBook Air M4 (Metal · 32 GB unified)	72	71	8.7	8.3	7.1

5 reps + 1 warmup discarded · flash attention on · ngl=99. Updated June 2026 · timmyit.com / LLM-Benchmark

Qwen3-32B Q4_K_M

Qwen3-32B Q4_K_M (~23 GB): The biggest dense Qwen3 of the three, and the one I pick when I care more about the quality of the answer than how fast it comes back. At around 23 GB it still fits on a single 32 GB card with room left over for context, so there’s not much reason to skip it when you’re not in a hurry.

Qwen3-32B Q4_K_M — workload fitness

How each GPU handles the five canonical workloads with the 32B model. The 32B is ~18 GB before KV cache, so 16 GB cards and ~17 GB-usable unified-memory Macs cannot load it. Those rows are Won’t fit across the board.

GPU / workload	Quick chat (32B · real-time)	Code assistant (32B · IDE context)	Document RAG (32B · 16K context)	Long-form output (32B · sustained)	Heavy reasoning (32B · slower OK)
AMD AI Pro R9700 (32 GB · workstation)	OK, 24.9 tok/s	OK, 24.9 tok/s	OK, 22.9 tok/s @ depth 8192	OK, 24.9 tok/s sustained	OK, 24.9 tok/s
AMD RX 9700 XT (16 GB · consumer)	Won’t fit (out of memory)	Won’t fit (out of memory)	Won’t fit (out of memory)	Won’t fit (out of memory)	Won’t fit (out of memory)
Nvidia RTX 5080 (16 GB · consumer)	Won’t fit (out of memory)	Won’t fit (out of memory)	Won’t fit (out of memory)	Won’t fit (out of memory)	Won’t fit (out of memory)
Apple MacBook Pro M5 (24 GB unified · base)	Won’t fit (marginal on 24 GB)	Won’t fit (marginal on 24 GB)	Won’t fit (marginal on 24 GB)	Won’t fit (marginal on 24 GB)	Won’t fit (marginal on 24 GB)
Apple MacBook Air M4 (32 GB unified · Air)	Limited, 3.7 tok/s	Limited, 3.7 tok/s	Limited, 3.2 tok/s @ depth 8192	Limited, 3.7 tok/s sustained	Limited, 3.7 tok/s


Qwen3-32B Q4_K_M — measured throughput

Raw numbers. All values in tok/s. The 32B is ~18 GB before KV cache, so 16 GB cards and ~17 GB-usable unified-memory Macs cannot load it. Those rows show 0.

GPU / backend	pp512 @ depth 0	pp2048 @ depth 0	tg128 @ depth 0	tg128 @ depth 2048	tg128 @ depth 8192
AMD AI Pro R9700 (ROCm · 32 GB)	921	852	24.9	24.4	22.9
AMD AI Pro R9700 (Vulkan · 32 GB)	927	856	25.0	24.4	22.9
AMD RX 9700 XT (ROCm · 16 GB)	0	0	0	0	0
AMD RX 9700 XT (Vulkan · 16 GB)	0	0	0	0	0
Nvidia RTX 5080 (CUDA · 16 GB)	0	0	0	0	0
Apple MacBook Pro M5 (Metal · 24 GB unified)	0	0	0	0	0
Apple MacBook Air M4 (Metal · 32 GB unified)	29	28	3.7	3.7	3.2

5 reps + 1 warmup discarded · flash attention on · ngl=99. 0 = model did not fit in available VRAM. Updated June 2026 · timmyit.com / LLM-Benchmark

A few interesting observations

The R9700 and the RX 9700 XT use the same chip

The numbers show it. The RX 9700 XT is a few percent faster in every test (roughly 1 to 5 percent depending on model and backend), which is just the consumer card running slightly higher clocks. There is no meaningful inference advantage to the workstation card on raw speed.

Where the R9700 earns its place is the 32 GB. It’s the only AMD card here that loads the 32B model at all, and it runs it at a reasonable 24.9 tokens per second. The RX 9700 XT runs out of memory and can’t load it. So the R9700 is not the faster card, it’s the card that can hold more. For local inference, more memory means room for larger models and longer context.

ROCm and Vulkan perform the same on RDNA4

This one surprised me a little. I expected ROCm to have a clear lead, but on both cards Vulkan lands within about 1 to 2 percent of ROCm and is occasionally ahead on prompt processing.

For anyone who has struggled with a ROCm install, that is good news. On these cards you can run the Vulkan backend and lose effectively nothing, which is a lot less setup than getting ROCm working.
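To put a number on the gap, here is a small sketch that computes the Vulkan vs ROCm difference from the 8B results at depth 0 in the tables above. The dictionary keys are just labels chosen for this example.

```python
# (ROCm, Vulkan) results at depth 0 from the 8B tables in this post, in tok/s.
RESULTS = {
    "R9700 8B pp512": (3891, 3978),
    "R9700 8B tg128": (94.3, 95.3),
    "RX 9700 XT 8B pp512": (4053, 4039),
    "RX 9700 XT 8B tg128": (96.6, 96.6),
}


def vulkan_vs_rocm_pct(rocm, vulkan):
    """Percentage by which Vulkan is faster (positive) or slower (negative) than ROCm."""
    return (vulkan - rocm) / rocm * 100


for name, (rocm, vulkan) in RESULTS.items():
    print(f"{name}: {vulkan_vs_rocm_pct(rocm, vulkan):+.1f}%")
```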

Comparing the two 9700 cards with the Nvidia RTX 5080

Is this a fair comparison? No, it’s not. But it shows a few interesting things when it comes to memory. The RX 9700 XT has the same 16 GB of memory as the 5080, but the two cards are aimed at different segments of the market: the RX 9700 XT is a mid-tier consumer card and the RTX 5080 is a high-tier consumer card. The 9700 XT is half the price of the 5080.

The AI Pro R9700 32GB is a workstation-class card but costs around the same as the 5080. It has double the memory, so larger models and longer context can fit.

The M5 is a big jump over the M4 on compute, but not on bandwidth

The Apple side is where it gets interesting. Look at prompt processing. The M5 is 3.4 times faster than the M4 on the 8B (664 vs 198) and almost 5 times faster on the 14B (343 vs 72). That is a huge generational jump.

But token generation tells a different story. There the M5 is only around 20 to 60 percent faster than the M4. Prompt processing is compute bound and token generation is bandwidth bound, so what this says is the M5 made a large leap in GPU compute while memory bandwidth went up far less. The M5 feels much quicker at reading a prompt, only a bit quicker at writing the answer.
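The two ratios are easy to check side by side. This sketch uses the depth 0 results for the 8B and 14B from the tables above; the 32B is left out because it doesn’t fit on the M5.

```python
# (M5, M4) results at depth 0 from the tables in this post, in tok/s.
PP512 = {"8B": (664, 198), "14B": (343, 72)}
TG128 = {"8B": (25.3, 20.8), "14B": (13.9, 8.7)}


def speedup(new, old):
    """How many times faster the new result is than the old one."""
    return new / old


for model in PP512:
    pp = speedup(*PP512[model])
    tg = speedup(*TG128[model])
    print(f"{model}: prompt processing {pp:.1f}x, token generation {tg:.1f}x")
```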

With that said, there are other models out there that are optimized for Apple silicon, and you would not necessarily run the models I’ve tested here on Macs anyway.

The 32B reality check

The R9700 runs the 32B at 24.9 tokens per second, which is fine for actual use. The M4 Air with its 32 GB technically loads it too, but at 3.7 tokens per second it’s not worth running. The M5 MacBook Pro and the RX 9700 XT can’t load it at all because they top out at roughly 16 to 17 GB of usable memory.

There’s a small irony in there. The older M4 Air with 32 GB runs a model the newer M5 MacBook Pro can’t, purely because it has more memory. Capacity and speed are two different things, and it’s worth keeping that in mind when shopping around.

That’s it for this time. If this was useful and you want more of the same, you can find me on X (Twitter) @timmyitdotcom, BlueSky @timmyit.com, or over on LinkedIn.
