Gemma 4 dropped on April 2, 2026, and if you’ve been waiting for a genuinely capable open model that you can run locally, fine-tune, and build a product on without paying anyone a licensing fee – this is the one worth paying attention to. I’m going to walk you through everything you need to actually get started: what the models are, how to pick the right one for your situation, the exact code to run them, how thinking mode works, how to pass images and audio, how function calling works for agents, how to fine-tune, and where to go deeper. No fluff.
Start here — which model do you actually need?
Before anything else, answer these two questions: what hardware do you have, and what are you building? Those two answers determine everything else.
I want to test it right now with zero setup → Go to aistudio.google.com and load the 31B or 26B model. It’s free, it’s live today, and you don’t install anything. This is your fastest path to understanding what the model can actually do before you commit to any infrastructure.
I’m on a phone or building a mobile app → E2B or E4B. Download the Google AI Edge Gallery app on Android to test them right now on actual device hardware. These are the only two variants with audio input.
I’m on a Mac (Apple Silicon) → Pull the model via Ollama or use MLX. More on the MLX path below.
I have a consumer GPU (RTX 3090, 4090, etc.) → The 26B MoE quantized is your sweet spot. It only activates 3.8B parameters during inference so it’s fast, and it fits on a 24 GB card with Q4 quantization.
I have a 40 GB or 80 GB GPU (A100, H100) → Run the 31B unquantized for maximum quality, or use the 26B MoE at higher precision for speed.
I’m building a SaaS and want to replace my OpenAI API bill → Rent an H100 on RunPod or Lambda Labs and serve the 31B via vLLM. I’ll show you exactly how.
I want to fine-tune on my own data → 31B dense is the right base. QLoRA on a single A100 is the practical path. More on this at the end.
The four models explained plainly
There are four models. Here’s what each one actually is under the hood, not just the marketing description.
E2B and E4B – The “E” stands for effective parameters. These use a technique called Per-Layer Embeddings (PLE), where each decoder layer gets its own small embedding table per token. The total parameter count including those embedding tables is 5.1B (E2B) and 8B (E4B), but the actively computed “effective” parameters during inference are 2.3B and 4.5B. That’s how they stay fast enough for phone-class hardware. They have a 128K token context window, handle text, images, video, and audio, and run completely offline. Audio max length is 30 seconds. Video max is 60 seconds at one frame per second.
26B A4B MoE – This is a Mixture-of-Experts architecture. Total parameter count is 25.2B, but here’s the thing: MoE means the model has 128 expert subnetworks plus 1 shared expert, and during inference it only routes computation through 8 of those experts at once, giving you 3.8B active parameters. So it generates tokens almost as fast as a 4B model despite having 26B total weights. That’s the practical win – fast generation, strong quality, 256K context window. No audio input on this one – text and images only.
31B Dense – All 31 billion parameters fire on every token. No routing, no sparsity tricks. This is the highest quality option, the slowest generator, and the best base if you’re fine-tuning because dense models give you cleaner, more predictable gradient updates than MoE. Also 256K context. Text and images, no audio. Fits unquantized on a single 80 GB H100.
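A quick sanity check on those hardware pairings is simple weight-count arithmetic. The sketch below is a rough rule of thumb (weights only; KV cache, activations, and framework overhead add several more GB on top), and it also shows why the MoE still needs all 25.2B weights resident in memory even though only 3.8B are active per token:

```python
# Back-of-envelope VRAM for the raw weights alone. Ignores KV cache,
# activations, and framework overhead, which add several more GB.
BYTES_PER_PARAM = {"bf16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_footprint_gb(total_params_b: float, precision: str) -> float:
    """Approximate weight memory in GB for a parameter count in billions."""
    return total_params_b * BYTES_PER_PARAM[precision]

# 26B MoE: all 25.2B weights must fit in VRAM even with 3.8B active
print(round(weight_footprint_gb(25.2, "q4"), 1))    # ~12.6 GB, fits a 24 GB card
print(round(weight_footprint_gb(31.0, "bf16"), 1))  # ~62.0 GB, fits an 80 GB H100
```

MoE sparsity buys you generation speed (fewer FLOPs per token), not a smaller memory footprint, which is why the 26B still needs the same class of card as a dense model of similar total size.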
The benchmark numbers – what’s real and what it means for you
I’m pulling from Google’s official model card and independently verified Artificial Analysis numbers, not press releases.
On AIME 2026 (brutal multi-step math, no tools): 31B scores 89.2%, 26B scores 88.3%, E4B scores 42.5%, E2B scores 37.5%. Compare that to Gemma 3 27B at 20.8% – that’s a massive generational jump in reasoning.
On LiveCodeBench v6 (real competitive coding problems): 31B hits 80%, 26B hits 77.1%. These are numbers competitive with models several times larger.
On GPQA Diamond (expert-level scientific reasoning): Google’s own model card puts the 31B at 84.3% and the 26B at 82.3%. Artificial Analysis’s independent evaluation puts the 31B at 85.7% – a slight difference explained by different evaluation conditions, but both numbers are legitimate. Either way, it’s the second-best result among all open models under 40B parameters, one tenth of a point behind Qwen3.5 27B at 85.8% on the Artificial Analysis leaderboard.
On Codeforces ELO (competitive programming): 31B scores 2150, 26B scores 1718. For reference, a Codeforces ELO of 2150 puts you in Master territory – the 2100–2299 range on the official rating scale. That’s genuinely elite. The next tier up, International Master, starts at 2300. Not a model that writes bad code.
The honest caveat: all the headline numbers are with thinking mode enabled. In non-thinking mode the scores drop substantially. And benchmark performance on generic tasks doesn’t guarantee performance on your specific domain – always test on your actual workload before committing to an architecture.
Running it locally – the complete setup
Option 1: Ollama (fastest for local testing)
One command to install, one command to run. Ollama handles the download, quantization selection, and serving automatically.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Run the 26B MoE (recommended for most consumer GPUs)
ollama run gemma4:26b
# Run the 31B Dense (for 80GB GPU or quantized on consumer GPU)
ollama run gemma4:31b
# Run E4B (for laptops and low-VRAM machines)
ollama run gemma4:e4b
# Run E2B (smallest, for very constrained hardware)
ollama run gemma4:e2b
Note: the default ollama run gemma4 with no tag loads the E4B model. Always specify the tag explicitly so you know what you’re running.
Once it’s running, Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Your existing app that uses the OpenAI Python SDK needs only a new base URL and a dummy API key:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required but not validated
)
response = client.chat.completions.create(
model="gemma4:26b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain RAG in plain English"}
]
)
print(response.choices[0].message.content)
That’s it. LangChain, n8n, any OpenAI-compatible tool – they all work against this endpoint unchanged.
One heads-up on audio via Ollama: the official model card confirms E2B and E4B support audio input, but Ollama’s current UI only lists “Text, Image” for those models. Audio support through Ollama may lag behind the model card – if you specifically need audio, test it directly or use the Hugging Face Transformers path below which gives you full control.
Option 2: Hugging Face Transformers (for Python developers)
This is the path if you want full control, want to do custom preprocessing, or are building inference into a Python application directly. First install the dependencies:
pip install -U transformers torch accelerate
Then load the model. The official model IDs on Hugging Face follow this pattern – note the capitalisation matches the official naming:
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
# Official model IDs (use exactly as shown):
# "google/gemma-4-E2B-it"
# "google/gemma-4-E4B-it"
# "google/gemma-4-26b-a4b-it"
# "google/gemma-4-31b-it"
MODEL_ID = "google/gemma-4-26b-a4b-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
dtype=torch.bfloat16,
device_map="auto" # automatically uses available GPUs
)
Basic text generation:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarize the key risks of deploying LLMs in healthcare."}
]
# enable_thinking=False for fast responses, True for hard problems
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(
**inputs,
max_new_tokens=1024,
temperature=1.0, # recommended by Google
top_p=0.95, # recommended by Google
top_k=64 # recommended by Google
)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))
Option 3: Apple Silicon via MLX
If you’re on an M1/M2/M3/M4 Mac, MLX is the right path – it uses the unified memory architecture properly and is significantly faster than running through the CPU path.
pip install mlx-lm
# Run the E4B (fits easily in 16GB unified memory)
python -m mlx_lm.generate --model google/gemma-4-E4B-it \
--prompt "Explain how attention mechanisms work"
# Run the 26B MoE quantized (needs 32GB+ unified memory)
python -m mlx_lm.generate --model mlx-community/gemma-4-26b-a4b-it-4bit \
--prompt "Write a Python function to chunk text for RAG"
For the larger models on Mac, check the mlx-community org on Hugging Face – they maintain pre-quantized MLX versions of most major models.
Option 4: vLLM for production (SaaS / high throughput)
If you’re replacing an OpenAI API dependency in a real product with actual concurrent users, vLLM is what you want. It implements continuous batching and paged attention, which means it serves many concurrent users efficiently instead of processing requests one at a time. Spin up a GPU instance on RunPod, Lambda Labs, or Vast.ai (check current pricing on each platform before committing – GPU spot rates fluctuate significantly based on demand), then:
pip install vllm
# Serve the 26B MoE on a single A100 80GB
python -m vllm.entrypoints.openai.api_server \
--model google/gemma-4-26b-a4b-it \
--tensor-parallel-size 1 \
--port 8000 \
--max-model-len 65536
# For the 31B on a single H100 80GB
python -m vllm.entrypoints.openai.api_server \
--model google/gemma-4-31b-it \
--tensor-parallel-size 1 \
--port 8000Same OpenAI-compatible endpoint, same client code as above – just change the base URL to your server’s IP. If you’re behind Nginx, add basic auth in front of port 8000 before exposing it to the internet.
Thinking mode – the exact syntax and when to actually use it
This is the feature with the most confusion around it, so let me be specific about how it actually works.
Thinking mode lets the model internally reason through a problem step by step before generating its answer. You don’t see the thinking steps in the output by default – the processor’s parse_response() function strips them and returns just the final answer. When enabled, the model uses <|channel>thought\n[reasoning]<channel|> tags internally before outputting its response.
Turn it on via the enable_thinking parameter in apply_chat_template:
# Thinking OFF — fast, good for simple tasks, chat, summarization
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
# Thinking ON — slower, significantly better on hard reasoning,
# math, coding problems, multi-step analysis
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True
)
One important gotcha for multi-turn conversations: when you’re building up a conversation history, do not include the model’s internal thinking from previous turns in the history. Only include the final parsed response. If you pipe the raw output (thinking tags included) back as assistant history, the model gets confused. Always store processor.parse_response(raw_output) as the assistant turn, not the raw generation.
When to use thinking: hard math, complex code generation, multi-step reasoning, analysis tasks where getting the right answer matters more than speed. When to skip it: customer support chatbots, FAQ answering, text summarization, anything conversational where latency is visible to the user.
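To make that history hygiene concrete, here is a minimal illustration. It assumes the thought-tag format described above, which may not match the real tokens exactly; in real code, prefer storing processor.parse_response(raw_output) directly rather than stripping tags yourself:

```python
import re

# Illustrative only: processor.parse_response() is the supported way to do
# this. The regex assumes the <|channel>thought ... <channel|> marker format
# described above, which is an assumption about the exact token syntax.
THOUGHT_RE = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)

def to_history_turn(raw_output: str) -> dict:
    """Strip internal reasoning so only the final answer enters the history."""
    final = THOUGHT_RE.sub("", raw_output).strip()
    return {"role": "assistant", "content": final}

raw = "<|channel>thought\nLet me work through this...<channel|>The answer is 42."
print(to_history_turn(raw))
# {'role': 'assistant', 'content': 'The answer is 42.'}
```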
Passing images – the code you actually need
All four models handle images. One thing Google specifically recommends: put your image content before the text in your prompt. Multimodal models process input left-to-right, and placing the visual context before the question tends to perform better than the reverse.
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch
import requests
from io import BytesIO
MODEL_ID = "google/gemma-4-26b-a4b-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, dtype=torch.bfloat16, device_map="auto"
)
# Load an image (local file or URL)
image = Image.open("invoice.png")
# or from URL:
# response = requests.get(url)
# image = Image.open(BytesIO(response.content))
# Image goes BEFORE the text question
messages = [
{
"role": "user",
"content": [
{"type": "image"}, # image placeholder first
{"type": "text", "text": "Extract all line items, quantities, and totals from this invoice as JSON"}
]
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
text=text,
images=[image],
return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=True)
print(response)
One practical setting worth knowing: the model has configurable visual token budgets that control how many tokens are used to represent an image. The supported values are 70, 140, 280, 560, and 1120. Higher budget = more visual detail preserved = more compute. Lower budget = faster inference but loses fine detail. The rule of thumb from Google: use lower budgets (70–280) for classification, captioning, or video frame analysis. Use higher budgets (560–1120) for OCR, document parsing, or reading small text in images.
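The exact parameter for setting the budget isn’t shown here (check Google’s image-understanding docs for the processor config), but the rule of thumb itself is easy to encode. The task names below are illustrative, not an official taxonomy:

```python
# Rule-of-thumb budget picker encoding Google's guidance quoted above.
# Supported budgets: 70, 140, 280, 560, 1120. How you pass the chosen value
# to the processor is an assumption -- consult the image docs for the API.
COARSE_TASKS = {"classification", "captioning", "video_frames"}
FINE_TASKS = {"ocr", "document_parsing", "small_text"}

def pick_visual_budget(task: str) -> int:
    if task in COARSE_TASKS:
        return 280    # low end: fine detail doesn't matter
    if task in FINE_TASKS:
        return 1120   # maximum detail for reading text in images
    return 560        # reasonable middle ground otherwise

print(pick_visual_budget("ocr"))  # 1120
```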
Audio input – E2B and E4B only
Remember: audio is only supported on the two edge models. If you pass audio to the 26B or 31B, it won’t work. Audio max length is 30 seconds. Here’s the basic pattern for speech recognition:
import soundfile as sf
from transformers import AutoProcessor, AutoModelForCausalLM
import torch
MODEL_ID = "google/gemma-4-E4B-it" # E2B or E4B only
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, dtype=torch.bfloat16, device_map="auto"
)
# Load audio file (WAV format)
audio_data, sample_rate = sf.read("recording.wav")
# Audio goes BEFORE text in the prompt
messages = [
{
"role": "user",
"content": [
{"type": "audio"}, # audio placeholder first
{"type": "text", "text": (
"Transcribe the following speech segment in English into English text. "
"Only output the transcription, with no newlines. "
"When transcribing numbers, write the digits."
)}
]
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
text=text,
audios=[(audio_data, sample_rate)],
return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0][input_len:], skip_special_tokens=True))
Function calling – how to build agents with it
This is the capability that makes Gemma 4 genuinely useful for SaaS products and automation tools, and it’s the one most articles completely skip. Native function calling means you can give the model a list of tools – functions it can call – and it will decide when to call them and with what arguments, structured as JSON you can parse and execute.
Here’s a practical example: a simple agent that can look up weather and search the web.
from transformers import AutoProcessor, AutoModelForCausalLM
import torch
import json
MODEL_ID = "google/gemma-4-26b-a4b-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, dtype=torch.bfloat16, device_map="auto"
)
# Define your tools as JSON schema
tools = [
{
"name": "get_weather",
"description": "Get the current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city name"
},
"units": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature units"
}
},
"required": ["city"]
}
},
{
"name": "search_web",
"description": "Search the web for current information",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query"
}
},
"required": ["query"]
}
}
]
messages = [
{"role": "system", "content": "You are a helpful assistant with access to tools."},
{"role": "user", "content": "What's the weather in Karachi right now?"}
]
# Pass tools to the template
text = processor.apply_chat_template(
messages,
tools=tools,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=512)
raw_response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
# The model outputs a function call like:
# {"name": "get_weather", "arguments": {"city": "Karachi", "units": "celsius"}}
# Parse it, execute your actual function, then feed the result back
parsed = processor.parse_response(raw_response)
print(parsed)
In a real agent loop, you’d parse the function call, execute your actual function (call a real weather API, run a real web search, query your database), then add the result back to the conversation as a tool result message and call the model again to get its final response. That’s the full agentic pattern. The model handles deciding when to call tools and what arguments to use – you just execute them and return the results.
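Here is what that loop looks like with the model call stubbed out. This is a sketch only: the exact shape of the tool-result message (the role name and fields) depends on Gemma 4’s chat template, so check the function calling docs before relying on it:

```python
import json

# Sketch of the agent loop described above, model call stubbed out.
# In real code, generate_fn would run the Gemma 4 generation shown earlier,
# and the {"role": "tool", ...} message shape must match the chat template.

def get_weather(city: str, units: str = "celsius") -> dict:
    return {"city": city, "temp": 31, "units": units}  # stand-in for a real API

TOOL_IMPL = {"get_weather": get_weather}

def run_agent_turn(parsed, messages, generate_fn):
    """If the model asked for a tool, execute it, append the result, re-generate."""
    if isinstance(parsed, dict) and "name" in parsed:
        result = TOOL_IMPL[parsed["name"]](**parsed.get("arguments", {}))
        messages.append({"role": "tool", "name": parsed["name"],
                         "content": json.dumps(result)})
        return generate_fn(messages)   # second pass: model sees the tool result
    return parsed                      # plain text answer, no tool needed

# Simulated model output for the weather question:
call = {"name": "get_weather", "arguments": {"city": "Karachi", "units": "celsius"}}
final = run_agent_turn(
    call, [],
    lambda msgs: f"It's {json.loads(msgs[-1]['content'])['temp']} degrees in Karachi."
)
print(final)  # It's 31 degrees in Karachi.
```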
Full function calling documentation is at ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4.
For vibe coders and SaaS builders – the practical picture
Let’s be direct about where Gemma 4 fits if you’re building something and paying for API calls.
The 26B MoE on a rented A100 running vLLM handles multiple concurrent users. For a typical SaaS with under a few hundred concurrent users, one GPU instance is probably enough. You’re paying for uptime, not per-token. GPU instance pricing on RunPod, Lambda Labs, and Vast.ai fluctuates daily based on availability – check current rates on each platform before planning your budget. If your OpenAI bill is over $500/month, the math starts working in your favour pretty quickly. Below that threshold, the operational overhead of managing your own inference server probably isn’t worth it yet – stay on the API, but benchmark Gemma 4 on your specific tasks in AI Studio first so you know what quality you’d be getting.
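The break-even logic is just flat-rate versus metered spend. Every number below is a placeholder, since GPU rates change constantly; substitute current quotes from your provider:

```python
# Hedged break-even sketch: a rented GPU costs the same whether it serves one
# request or a million, so compare flat monthly cost to your API bill.
# The $0.60/hr figure is a made-up example rate, not a real quote.

def monthly_gpu_cost(hourly_rate: float, hours: float = 730) -> float:
    """Flat cost of keeping one instance up all month (~730 hours)."""
    return hourly_rate * hours

def break_even_api_bill(hourly_rate: float) -> float:
    """API spend above which a 24/7 GPU instance becomes cheaper."""
    return monthly_gpu_cost(hourly_rate)

print(round(break_even_api_bill(0.60)))  # 438
```

At a hypothetical $0.60/hr spot rate the flat cost lands near $438/month, which is the kind of arithmetic behind the rough $500/month threshold above; ops overhead pushes the real break-even higher.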
For the healthcare, legal, and fintech builders in the room: this is your path to a compliant AI stack. Apache 2.0 license, data never leaves your server, Google’s safety evaluation is done against the same standards as their proprietary Gemini models. That’s a real answer to the “our compliance team won’t let us use OpenAI” conversation.
One thing to build in from day one: a fallback. For critical outputs – anything a user will act on in a real-world context – don’t trust any model’s response unchecked. Either implement RAG so the model grounds answers in your actual documents, or add a verification step. The 31B scores 89.2% on AIME math. That means it gets roughly 1 in 10 hard math problems wrong with complete confidence. Build for that failure mode, not against it.
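One cheap version of that verification step, using the invoice-extraction task from earlier as the example (the field names here are illustrative, not a real schema): parse the output, check required fields, and cross-check the arithmetic before acting on anything.

```python
import json

# Never act on model output until it parses and passes basic sanity checks.
# "line_items" and "total" are hypothetical field names for the invoice task.
REQUIRED_FIELDS = {"line_items", "total"}

def safe_parse_invoice(model_output: str):
    """Return parsed JSON if it passes checks, else None (trigger a fallback)."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not REQUIRED_FIELDS.issubset(data):
        return None
    # cross-check: do the line items actually sum to the stated total?
    if abs(sum(i["amount"] for i in data["line_items"]) - data["total"]) > 0.01:
        return None
    return data

good = '{"line_items": [{"amount": 10.0}, {"amount": 5.5}], "total": 15.5}'
bad = '{"line_items": [{"amount": 10.0}], "total": 99.0}'
print(safe_parse_invoice(good) is not None, safe_parse_invoice(bad))  # True None
```

When the check fails, your fallback might be a retry with thinking mode on, a second-model review, or a human-in-the-loop queue; the point is that failure has a defined path instead of flowing straight to the user.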
Fine-tuning on your own data
Fine-tuning makes sense when you have a specific domain (legal documents, medical terminology, your own product’s knowledge base) and the base model’s general-purpose behaviour isn’t quite right. The practical path for most builders is QLoRA – it trains a small set of adapter weights on top of the frozen, 4-bit-quantized base model, which means it fits on a single A100 because gradients are only computed for the adapters, never for the full base weights.
# Install dependencies
pip install transformers peft datasets accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch
MODEL_ID = "google/gemma-4-31b-it"
# 4-bit quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
quantization_config=bnb_config,
device_map="auto"
)
# LoRA config — target the attention layers
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # rank — higher = more capacity but more memory
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints something like: trainable params: 40M || all params: 31B || trainable%: 0.13%
If you want a faster route, Unsloth has pre-configured Gemma 4 fine-tuning notebooks that cut memory usage significantly and run faster than the vanilla approach. For full step-by-step guides, Google’s official fine-tuning documentation covers QLoRA with Transformers, LoRA with Keras, and distributed fine-tuning: ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora.
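The snippet above stops just before training. The missing piece is formatting your raw data into chat-style examples that apply_chat_template and a trainer can consume; here is a minimal, framework-agnostic sketch of that step (the example pair contents are made up):

```python
# Turn raw (prompt, answer) pairs into the chat-message structure expected by
# apply_chat_template. The actual training loop (e.g. a Hugging Face Trainer
# over the tokenized text) is covered in Google's QLoRA guide linked above.

def to_chat_example(prompt: str, answer: str) -> list[dict]:
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer},
    ]

def build_dataset(pairs):
    """Format raw pairs; tokenize the result with the processor before training."""
    return [to_chat_example(p, a) for p, a in pairs]

pairs = [("What is QLoRA?", "LoRA adapters trained on a 4-bit quantized base.")]
print(build_dataset(pairs)[0][1]["role"])  # assistant
```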
What it genuinely doesn’t do well
The Humanity’s Last Exam benchmark (HLE) is the hardest reasoning benchmark that exists right now – problems designed to stump PhD-level experts. Gemma 4 31B scores 19.5% without tools and 26.5% with search. For context, that’s still meaningful performance on an extremely hard benchmark, but it tells you the frontier of reasoning is not here yet. For the hardest possible problems – frontier research, novel mathematical proofs, highly specialised expert knowledge – you’ll hit limits.
The 26B and 31B also don’t have audio. If audio is central to your use case, you’re on the E4B with its significantly lower reasoning scores (42.5% on AIME vs 88.3% for the 26B). That’s a real architectural trade-off to plan around, not a minor detail.
And the knowledge cutoff is January 2025. Anything that happened after that, the model doesn’t know about. For current events, real-time data, or recent developments in fast-moving fields – build RAG or use the function calling pattern with a web search tool, don’t rely on the model’s internal knowledge.
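The RAG pattern itself is simple enough to sketch in a few lines. This toy version ranks documents by keyword overlap purely to show the shape of the pipeline; a real system would use embedding search over chunked documents:

```python
# Toy RAG sketch: retrieve relevant chunks, then ground the prompt in them
# so the model answers from your documents instead of its internal knowledge.

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by word overlap with the query and keep the top k."""
    qwords = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(qwords & set(d.lower().split())))
    return scored[:k]

def grounded_prompt(query: str, docs: list[str]) -> str:
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = ["Gemma 4 shipped in April 2026.", "The cutoff is January 2025.",
        "Bananas are yellow."]
print(retrieve("When did Gemma 4 ship?", docs, k=1))
```

Swap the overlap scorer for an embedding model and a vector index and this is the same pipeline most production RAG stacks run.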
Resources – where to go deeper
Official documentation
- ai.google.dev/gemma/docs – The master documentation hub. Everything from basic inference to deployment on GKE.
- ai.google.dev/gemma/docs/core/model_card_4 – The full model card. Benchmark tables, architecture details, sampling recommendations, safety evaluation results.
- ai.google.dev/gemma/docs/core/prompt-formatting-gemma4 – Exact prompt format including thinking mode tokens. Read this before you build anything serious.
- ai.google.dev/gemma/docs/capabilities/thinking – Complete thinking mode guide.
- ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4 – Full function calling documentation.
- ai.google.dev/gemma/docs/capabilities/vision/image – Image understanding guide including token budget configuration.
- ai.google.dev/gemma/docs/capabilities/audio – Audio guide (E2B/E4B only).
Fine-tuning
- ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora – QLoRA fine-tuning with Hugging Face Transformers. Start here.
- ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora – Vision fine-tuning with QLoRA.
- unsloth.ai/blog/gemma3 – Unsloth’s optimised fine-tuning notebooks. Faster and lower memory than vanilla Transformers.
- Google Colab LoRA notebook – Run fine-tuning in the browser for free on a T4 GPU.
Model weights
- huggingface.co/collections/google/gemma-4 – All model variants. This is where you pull weights for Transformers, vLLM, and llama.cpp.
- ollama.com/library/gemma4 – Ollama model page with all available tags and sizes.
- kaggle.com/models/google/gemma-4 – Kaggle model hub, useful if you want to run fine-tuning on Kaggle’s free GPUs.
Deployment
- ai.google.dev/gemma/docs/integrations/ollama – Official Ollama integration guide.
- Google Cloud vLLM on GKE tutorial – Production deployment on Kubernetes with GPU.
- ai.google.dev/gemma/docs/integrations/langchain – LangChain integration guide.
Edge / mobile
- ai.google.dev/edge – Google AI Edge platform for E2B/E4B deployment.
- Google AI Edge Gallery on Play Store – Test E2B and E4B on your Android device right now.
- MediaPipe LLM Inference API – For integrating edge models into Android/iOS apps.
Community
- discuss.ai.google.dev – Google AI developer forum. Good for troubleshooting and seeing what others are building.
- Hugging Face Gemma 4 collection – Community fine-tunes, quantized versions, and model discussions all live here.
- deepmind.google/models/gemma/gemmaverse – Showcases of what people are building with Gemma.
Start with Google AI Studio to test before you set up any infrastructure. Once you know the model works for your use case, pick your deployment path based on your hardware and user volume. The documentation is genuinely solid – Google shipped this with full guides for every framework from day one, which is not always the case with open model releases.