The Gemma 4 26B-A4B is getting serious attention right now because it’s a Mixture-of-Experts model that actually fits on hardware most developers already own. Instead of activating all 26 billion parameters for every token, it routes each token through roughly 4 billion active parameters. That’s why it runs fast on a single GPU while still feeling like a much larger model.
This guide covers the full setup — from installing Ollama to running your first prompt — plus what to do when things break, and rough performance numbers so you know what to expect before committing to an 18GB download.
Not sure if the 26B-A4B is the right model for your use case? Read the Gemma 4 model comparison first.
What you need before starting
For a comfortable experience you want at least 16GB of VRAM. A 24GB card like the RTX 3090 or 4090 is the sweet spot — you get clean 4-bit or 5-bit quantization with room left for longer contexts. On 12GB cards it works but you’ll be limited to short context lengths and slower generation.
CPU-only is possible with llama.cpp but expect very slow speeds — around 1–3 tokens per second depending on your RAM and CPU. Fine for experimenting, not for anything production-facing.
You’ll also need:
- NVIDIA GPU: CUDA 12.1 or higher (check with nvidia-smi)
- AMD GPU: ROCm 5.6+ (Linux only)
- Apple Silicon: M1 or newer, any RAM configuration works but 32GB+ is comfortable
- About 20GB of free disk space for the model files
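Before kicking off an 18GB download, it's worth confirming the disk-space requirement is actually met. A minimal sketch using only the Python standard library (the 20GB figure comes from the list above):

```python
import shutil

def enough_disk_space(path=".", required_gb=20):
    """Return True if `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

# Prints True if the current drive has >= 20GB free.
print(enough_disk_space())
```

Point `path` at wherever your runtime stores model files (for Ollama on Linux that's typically under your home directory) rather than the current drive if they differ.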
Option 1: Ollama (recommended starting point)
Ollama is the fastest way to go from zero to running. It handles downloading, quantization, and serving — no configuration files, no Python environment to set up.
Step 1 — Install Ollama
On Linux:
curl -fsSL https://ollama.com/install.sh | sh

On macOS, download the app from ollama.com and move it to your Applications folder. On Windows, grab the installer from the same site.
Verify it’s running:
ollama --version

Step 2 — Pull and run the model
ollama run gemma4:26b

This pulls the Q4_K_M quantized version — 18GB — and drops you straight into a chat prompt when it’s done. If you want to be explicit about the instruction-tuned quantized version:
ollama run gemma4:26b-a4b-it-q4_K_M

Step 3 — Test it with a real prompt
Once the model loads, try something from your actual project. Here’s a quick example of what output looks like:
Prompt: Explain how Mixture-of-Experts models differ from dense transformers in two paragraphs, for a technical audience.
Output (trimmed): In a dense transformer, every token is processed by all parameters in every layer — attention heads, FFN weights, everything. This gives consistent behaviour but is computationally expensive because you’re paying the full cost on every forward pass regardless of how simple or complex the input is…
Response quality at Q4_K_M is solid. You’ll notice some degradation on complex reasoning chains if you push it hard, but for most builder workflows it’s completely usable.
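Once the interactive chat feels right, you can script prompts against Ollama's local REST API, which listens on http://localhost:11434 by default. A minimal stdlib-only sketch using Ollama's documented /api/generate endpoint (the model tag is the one pulled above):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="gemma4:26b"):
    """Build the JSON payload Ollama's /api/generate endpoint expects."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="gemma4:26b"):
    """Send a prompt to the local Ollama server and return the response text."""
    data = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires the Ollama server to be running locally:
# print(generate("Summarize Mixture-of-Experts routing in one sentence."))
```

Setting "stream": False returns the full completion in one response; with streaming enabled, Ollama sends newline-delimited JSON chunks instead.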
Rough performance numbers
| GPU | VRAM | Quant | Tokens/sec (approx) | Max practical context |
|---|---|---|---|---|
| RTX 4090 | 24GB | Q5_K_M | ~35–45 t/s | 64K |
| RTX 3090 | 24GB | Q4_K_M | ~25–35 t/s | 32K |
| RTX 4080 | 16GB | Q4_K_M | ~20–28 t/s | 16K |
| RTX 4070 Ti | 12GB | Q4_K_M | ~12–18 t/s | 8K |
| Apple M3 Max | 36GB unified | Q5_K_M | ~20–30 t/s | 64K |
| CPU only | — | Q4_K_M | ~1–3 t/s | 4K |
These are community estimates from early testing — your numbers will vary based on system RAM, cooling, and what else is running.
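To translate those throughput figures into wall-clock expectations, a back-of-the-envelope helper (the tokens-per-second values plugged in below are the approximate ones from the table):

```python
def response_time_seconds(tokens, tokens_per_sec):
    """Rough wall-clock time to generate `tokens` at a given throughput.
    Ignores prompt-processing time, which adds extra seconds for long prompts."""
    return tokens / tokens_per_sec

# A ~400-token answer on an RTX 3090 at ~30 t/s vs. CPU-only at ~2 t/s:
print(round(response_time_seconds(400, 30), 1))  # 13.3
print(round(response_time_seconds(400, 2), 1))   # 200.0
```

That 100x-plus gap between GPU and CPU is why the CPU path is fine for experiments but painful for interactive use.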
Option 2: LM Studio (no terminal required)
If you’d rather not touch the command line, LM Studio gives you a full GUI — model browser, chat interface, and a local server you can point other tools at. Download it from lmstudio.ai, search for “gemma-4-26b” in the model browser, and pick the Q4_K_M or Q5_K_M GGUF from the bartowski or unsloth repos.
LM Studio also lets you enable the local server (Settings → Local Server), which exposes an OpenAI-compatible endpoint at http://localhost:1234/v1. Point Cursor or any other tool there and it works out of the box.
Option 3: vLLM (OpenAI-compatible API server)
For a production-grade local API — connecting to Cursor, a custom frontend, or anything expecting OpenAI format — vLLM is the cleanest option:
vllm serve google/gemma-4-26B-A4B-it \
--max-model-len 32768 \
--gpu-memory-utilization 0.85 \
  --tensor-parallel-size 1

The --gpu-memory-utilization 0.85 leaves headroom for longer contexts. If you hit out-of-memory errors, drop it to 0.75 and reduce --max-model-len to 16384. The server starts on port 8000 by default and accepts standard OpenAI API calls.
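With the server up, any OpenAI-style client can talk to it. A stdlib-only sketch against the standard /v1/chat/completions endpoint (port 8000 is vLLM's default as noted above; the model name must match whatever you passed to vllm serve):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_payload(prompt, model="google/gemma-4-26B-A4B-it"):
    """Build an OpenAI-format chat completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt):
    """POST the payload to the local vLLM server and return the reply text."""
    data = json.dumps(build_chat_payload(prompt)).encode()
    req = urllib.request.Request(
        VLLM_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Requires the vLLM server to be running locally:
# print(chat("Write a haiku about local inference."))
```

Because the format is standard OpenAI, the same payload works unchanged against LM Studio's server on port 1234, just with a different URL and model name.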
Option 4: llama.cpp (maximum speed and control)
For the lowest latency or when running on limited hardware, llama.cpp with a GGUF file is the most efficient option. Grab the quantized files from the unsloth or bartowski repos on Hugging Face:
./llama-server \
-m gemma-4-26b-a4b-it-Q5_K_M.gguf \
--port 8080 \
--ctx-size 32768 \
  -ngl 99

The -ngl 99 flag offloads all layers to the GPU. Without it you’ll get CPU fallback and much slower speeds. For 12GB VRAM cards, try -ngl 40 first and adjust up until you hit memory limits.
Quantization levels to choose from: Q4_K_M is the smallest and fastest but loses the most quality; Q5_K_M is the best balance for most people; Q6_K keeps more of the original quality at the cost of extra memory.
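The size difference between those levels is roughly linear in bits per weight. A back-of-the-envelope sketch — the bits-per-weight figures below are approximate averages for each GGUF quant type, and the parameter count is taken as 26B from the model name; real files run somewhat larger because some tensors (embeddings, for example) are typically kept at higher precision:

```python
def gguf_size_gb(total_params_b=26, bits_per_weight=4.8):
    """Approximate GGUF file size in GB for an average quantization width.
    Rough averages: Q4_K_M ~4.8, Q5_K_M ~5.7, Q6_K ~6.6 bits per weight."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q6_K", 6.6)]:
    print(f"{name}: ~{gguf_size_gb(bits_per_weight=bpw):.1f} GB")
```

Whatever the estimate says, the file size shown on the Hugging Face repo page is the authoritative number to check against your disk and VRAM.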
Troubleshooting common problems
“CUDA out of memory” error
Reduce your context length first — that’s the biggest VRAM consumer. In Ollama, set OLLAMA_MAX_LOADED_MODELS=1 and make sure no other models are loaded. In vLLM, lower --gpu-memory-utilization to 0.7 and drop --max-model-len. In llama.cpp, reduce -ngl to move some layers to CPU.
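Context length dominates VRAM usage because the KV cache grows linearly with it. The shape of that calculation, sketched with placeholder architecture numbers — the model's real layer and head counts aren't given in this guide, so the values below are purely illustrative:

```python
def kv_cache_gb(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GB: 2 (K and V) x layers x kv_heads x head_dim
    x context length x bytes per element (2 for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Hypothetical architecture values, purely illustrative:
print(round(kv_cache_gb(ctx_len=32768, n_layers=48, n_kv_heads=8, head_dim=128), 2))  # 6.44
```

Halving the context length halves the KV cache, which is why dropping from 32K to 16K is usually the quickest fix for an out-of-memory error.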
“Model not found” in Ollama
Run ollama list to see what’s actually downloaded. If the model isn’t there, run ollama pull gemma4:26b before trying to run it. Tags are case-sensitive — gemma4:26b works, Gemma4:26B doesn’t.
CUDA not detected
Check your driver version with nvidia-smi. You need CUDA 12.1 or higher for most current frameworks. If your driver is older, update it from NVIDIA’s site before reinstalling Ollama or vLLM. On WSL2, make sure you’re using the Windows GPU driver, not a Linux one.
Very slow generation even with GPU
Check if the model is actually using the GPU with nvidia-smi while it’s generating — you should see GPU utilization above 90%. If it’s near zero, your framework is falling back to CPU. In Ollama, reinstall and make sure CUDA is detected during install. In llama.cpp, confirm you compiled with LLAMA_CUDA=1.
Model gives repetitive or low-quality output
This usually means the quantization is too aggressive for your use case. Try stepping up from Q4_K_M to Q5_K_M if your VRAM allows. Also check that you’re using the instruction-tuned version (the -it suffix in the model name) — the base model without instruction tuning needs manual prompting structure to produce useful output.
Frequently asked questions
Can I run it with only 8GB VRAM?
Technically yes with very aggressive quantization (Q2 or Q3) and short context, but the quality drop is significant enough that it’s probably not worth it for serious use. You’d be better off using a smaller model like Gemma 4 E4B on 8GB.
Does it work without a GPU?
Yes, via llama.cpp on CPU. Expect 1–3 tokens per second on a modern desktop CPU. Usable for testing and quick experiments, not for anything that needs fast responses.
Is the model censored?
The instruction-tuned version has safety tuning applied, similar to other Google models. It’s not heavily restricted — most coding, writing, and analysis tasks work fine — but it will decline some requests. The base model (without -it) has less filtering.
What’s the actual context window?
The model supports 256K tokens in theory. In practice, most people run it at 8K–32K because higher values consume VRAM quickly and slow generation noticeably. For most coding and writing tasks, 32K is more than enough.
Can I connect it to Cursor or VS Code?
Yes. Use vLLM or LM Studio’s local server to expose an OpenAI-compatible endpoint, then point Cursor to http://localhost:8000/v1 (vLLM) or http://localhost:1234/v1 (LM Studio). Set the model name to whatever you used when starting the server.
Where to go from here
Start with Ollama and run the model against a few real prompts from your actual project. That 10-minute test tells you more than any benchmark — you’ll know immediately whether the speed and quality work for your use case, or whether you need to step up to the 31B or try a different quantization level.
If you want to go deeper on Gemma 4 — fine-tuning, multimodal inputs, building on top of it — the Gemma 4 complete guide covers all of that.