This is a submission for the Gemma 4 Challenge: Write About Gemma 4
I shipped a Gemma 4 assistant on Android in 17 days. Voice, vision, RAG, eight device actions, everything offline once the model is on device. The project is called PocketClaw, and you can read about it here if that's what you came for.
This post is about the 5 things I had to figure out the hard way. Not in the flutter_gemma README. Not in Google's MediaPipe docs. Not in any of the half-dozen "run Gemma on Android" tutorials I read this week.
If you're about to ship Gemma 4 on Android, I hope this saves you a weekend.
📱 Companion post: How I built PocketClaw, a fully offline AI assistant on Android with Gemma 4 E2B. Demo video, architecture deep-dive, full source code.
1. Small models drop facts buried mid-prompt. Put what matters at the top.
I've been building agents on Claude and GPT-4 for about 18 months. Both of them handle a long system prompt fine. You can mix instructions and facts in any order, and the model figures out what's a fact versus what's a behavior rule.
Gemma 4 E2B doesn't.
My first system prompt for PocketClaw looked like this:
You are Claw. You run locally and offline. You are talking to Manoj Shetty.
Match your answer length to the question. Prefer plain answers over preambles.
Never restate the question. If unsure, say so briefly.
User asks "what's my name?" Claw answers "I do not know your name." Every time. The name is sitting right there in sentence three.
I spent about an hour staring at this. My theory after the fact is that "never restate the question" was acting as a dominant instruction, and the model generalized it to "don't reference any user context." That's the kind of overgeneralization a 2B-effective model does. Cloud LLMs don't.
The fix was structural, not lexical:
// Put the user's name first, as its own flat sentence, when one is set.
final namePart = (userName != null && userName.trim().isNotEmpty)
    ? 'The name of the user is ${userName.trim()}.\n\n'
    : '';
final systemPreamble = '${namePart}You are Claw, ...';
Same fact. Moved to the first line of the prompt. On its own. Flat declarative sentence. Worked the first time I tried it.
The lesson generalizes. When you're working with a 2B-class model, anything you actually want the model to remember goes at the front of the prompt, in simple sentences, with no competing instructions in the same paragraph. The model is paying way more attention to the opening than to the middle. Treat the system prompt like a slot-filling template, not a paragraph.
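Here is a minimal sketch of what "slot-filling template" can look like in Dart. The function name `buildSystemPreamble` and its `facts` and `rules` parameters are hypothetical, made up for illustration; this is not PocketClaw's actual prompt builder. It only shows the ordering: facts first, one flat sentence per line, then a blank line, then identity and behavior rules.

```dart
/// Builds a system prompt with facts first and behavior rules last.
/// Hypothetical helper for illustration, not PocketClaw's real code.
String buildSystemPreamble({
  String? userName,
  List<String> facts = const [],
  List<String> rules = const [],
}) {
  final buffer = StringBuffer();

  // Facts the model must remember go first, one flat sentence per line.
  final name = userName?.trim() ?? '';
  if (name.isNotEmpty) {
    buffer.writeln('The name of the user is $name.');
  }
  for (final fact in facts) {
    buffer.writeln(fact.trim());
  }
  if (buffer.isNotEmpty) {
    buffer.writeln(); // blank line separates facts from instructions
  }

  // Identity and behavior rules come only after the facts.
  buffer.writeln('You are Claw. You run locally and offline.');
  for (final rule in rules) {
    buffer.writeln(rule.trim());
  }
  return buffer.toString();
}
```

Calling `buildSystemPreamble(userName: 'Manoj', rules: ['If unsure, say so briefly.'])` yields a prompt whose first line is the name sentence, so no instruction shares a paragraph with it.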
2. Vanilla RAG breaks on the queries users actually type.
If you've worked with RAG before, you know the textbook setup. Chunk the document, embed the chunks, store the vectors, embed the query at retrieval time, find the closest matches. Works great on benchmarks where the queries are specific.
It doesn't work on "summarise the document."
I caught this on Friday. I'd been heads-down for two weeks. I thought I had a working build. I uploaded a PDF I had lying around (a paper on edge LLMs), typed "summarise the document", hit send. Claw said "Summarize the document." back to me. Just that. Like an echo. I tried twice more with different phrasing. Same answer.
Eventually I typed "summarise llmaiedge.pdf" with the actual filename. Got a real summary.
The problem is obvious once you see it. "Summarise the document" has almost no semantic overlap with the document text. The PDF doesn't contain the words "summarise" or "the document." Cosine similarity finds nothing useful, and the retrieved chunk list comes back empty or nearly empty. Gemma gets the user's question with no actual document context attached. So it does what a small model does when starved for context: it either echoes the prompt, as it did for me, or hallucinates a generic answer from its training data.
The fix I shipped is two heuristics deep:
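// `lower` is the lowercased user query. The heuristic fires only when
// retrieval came back (nearly) empty AND the query reads like a
// whole-document request rather than a content question.
final isGenericIntent = hits.length <= 1 && (
    lower.contains('summari') ||  // matches "summarise" and "summarize"
    lower.contains('tldr') ||
    lower.contains('explain') ||
    lower.contains('describe') ||
    lower.contains('the document') ||
    lower.contains('the pdf')
);
if (isGenericIntent) {
  // Fall back to chunks fetched per document instead of per query.
  hits = await RagService.instance.getDocStarts(
    conversationId: _conversation.id,
  );
}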
getDocStarts is a small fallback method. It runs searchSimilar once per indexed doc, using each doc's filename as the query. Filenames are rare distinctive tokens. Every chunk in the vector store has the filename in metadata. So this pulls back real chunks regardless of how the user phrased their question.
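To make the fallback concrete, here is a rough sketch of a `getDocStarts`-style method. Everything about its shape is an assumption: the `Chunk` class, the `SearchSimilar` typedef, the `docNames` and `perDoc` parameters, and the filter on `docName` are hypothetical stand-ins, not the signatures PocketClaw or any RAG library actually uses. The only part taken from the description above is the core idea: one similarity search per indexed document, with the filename as the query.

```dart
/// One retrieved chunk. Hypothetical shape for illustration.
class Chunk {
  final String docName;
  final String text;
  const Chunk(this.docName, this.text);
}

/// Stand-in for the real vector store call (assumed signature).
typedef SearchSimilar = Future<List<Chunk>> Function(String query, int limit);

/// Runs one search per indexed document, using the filename as the query,
/// and keeps only the chunks that belong to that document.
Future<List<Chunk>> getDocStarts({
  required List<String> docNames,
  required SearchSimilar searchSimilar,
  int perDoc = 3,
}) async {
  final results = <Chunk>[];
  for (final name in docNames) {
    final hits = await searchSimilar(name, perDoc);
    // Assumption: drop chunks from other documents that happened to match.
    results.addAll(hits.where((chunk) => chunk.docName == name));
  }
  return results;
}
```

Passing the search function in as a parameter keeps the sketch testable with a fake store. A real implementation would call its vector store directly.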
One conditional and one fallback call. The difference between "Claw can summarise PDFs" and "Claw echoes your question back at you."
If you're building RAG on Gemma 4 (or any small on-device model really), test it with generic queries before you ship. Your textbook similarity search will fail in ways that look like the model is broken.
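One way to make that testing cheap is to pull the heuristic out into a pure function and run a list of generic phrasings through it. The sketch below does that. The function `isGenericIntent` and the `genericMarkers` constant are illustrative names; the marker list copies the substrings from the heuristic shown earlier in this section, and the `hitCount <= 1` cutoff is the same one used there.

```dart
/// Substrings that signal a whole-document request rather than a
/// content-specific question.
const genericMarkers = [
  'summari', // matches "summarise" and "summarize"
  'tldr',
  'explain',
  'describe',
  'the document',
  'the pdf',
];

/// True when retrieval came back (nearly) empty and the query reads
/// like a whole-document request.
bool isGenericIntent(String query, int hitCount) {
  final lower = query.toLowerCase();
  return hitCount <= 1 && genericMarkers.any((m) => lower.contains(m));
}
```

A handful of checks such as `isGenericIntent('Summarise the document', 0)` and `isGenericIntent('TLDR?', 1)` in a unit test will catch the echo failure before a user does. Substring matching is deliberately crude: it will also fire on a specific question that happens to contain "explain", which is why the hit-count condition matters.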
3. The flutter_gemma plugin bundles ~33 MB of native libraries you probably don't use.
Stock APK for PocketClaw came out at 185 MB. That felt heavy.
When I unzipped it and looked at the native libs (arm64-v8a only, I was already shipping a single-arch build), this is what I saw:
26 MB  libllm_inference_engine_jni.so                      (needed)
24 MB  libLiteRtLm.so                                      (needed)
17 MB  libgemma_embedding_model_jni.so                     (don't use, using Gecko)
17 MB  libgecko_embedding_model_jni.so                     (needed)
14 MB  libmediapipe_tasks_vision_jni.so                    (needed, vision input)
14 MB  libmediapipe_tasks_vision_image_generator_jni.so    (NOT USED)
10 MB  libimagegenerator_gpu.so                            (NOT USED)
 9 MB  libtext_chunker_jni.so                              (needed)
 8 MB  libLiteRtGpuAccelerator.so                          (needed)
 8 MB  libLiteRtWebGpuAccelerator.so                       (NOT USED, Android has OpenCL)
The image-generation libs back MediaPipe's text-to-image Image Generator task. PocketClaw only consumes images (vision input to multimodal Gemma). I'm never going to generate. The WebGPU accelerator is for browsers, while Android uses OpenCL. None of it does anything on my target platform.
Four lines in android/app/build.gradle.kts:
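// Goes inside the android { } block.
packaging {
    jniLibs {
        excludes.addAll(listOf(
            "**/libimagegenerator_gpu.so",                        // image generation
            "**/libmediapipe_tasks_vision_image_generator_jni.so", // image generation
            "**/libLiteRtWebGpuAccelerator.so",                    // WebGPU backend
            "**/libLiteRtTopKWebGpuSampler.so"                     // WebGPU sampler
        ))
    }
}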
APK dropped from 185 MB to 152 MB. 33 MB cut. Vision input still works, embedder still works, inference still works.
If your use case is similar (chat + vision input + RAG, no image generation), copy these excludes. If your use case is different, say you actually want on-device image generation, leave the image-gen libs in. The point is, audit what your plugin pulls in and exclude what you don't use. flutter_gemma is built for general capability surface, not minimum-bytes-on-device.
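For the audit itself, the arithmetic is simple enough to script. The sketch below hard-codes the sizes from the listing earlier in this section and sums the ones matched by the four excludes. The names `nativeLibsMb`, `excludedLibs`, and `excludedMb` are made up for this example. Note that `libLiteRtTopKWebGpuSampler.so` does not appear in that listing, so the listed sizes account for 32 MB of the 33 MB the APK actually lost.

```dart
/// Sizes in MB, copied from the arm64-v8a listing in this section.
const nativeLibsMb = <String, int>{
  'libllm_inference_engine_jni.so': 26,
  'libLiteRtLm.so': 24,
  'libgemma_embedding_model_jni.so': 17,
  'libgecko_embedding_model_jni.so': 17,
  'libmediapipe_tasks_vision_jni.so': 14,
  'libmediapipe_tasks_vision_image_generator_jni.so': 14,
  'libimagegenerator_gpu.so': 10,
  'libtext_chunker_jni.so': 9,
  'libLiteRtGpuAccelerator.so': 8,
  'libLiteRtWebGpuAccelerator.so': 8,
};

/// The four Gradle excludes, reduced to bare filenames.
const excludedLibs = <String>{
  'libimagegenerator_gpu.so',
  'libmediapipe_tasks_vision_image_generator_jni.so',
  'libLiteRtWebGpuAccelerator.so',
  'libLiteRtTopKWebGpuSampler.so', // not in the listing, size unknown here
};

/// Sums the sizes of all listed libraries that match an exclude.
int excludedMb(Map<String, int> libs, Set<String> excludes) => libs.entries
    .where((entry) => excludes.contains(entry.key))
    .fold<int>(0, (sum, entry) => sum + entry.value);
```

`excludedMb(nativeLibsMb, excludedLibs)` comes to 14 + 10 + 8 = 32. Swapping in your own plugin's library sizes gives a quick estimate of what an exclude list buys before you rebuild.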
There's a second-order point here that matters more. MediaPipe is the reason flutter_gemma is so big. It's also the reason it handles vision and (in Gemma 3n's case) audio. Text-focused alternatives like llama.cpp wrappers can ship at 30-60 MB on Android but with much more limited or no multimodal coverage today. So the choice is really: 152 MB with mature vision support, or 60 MB without. There's no free lunch where you get multimodal at the size of a text-only stack. Pick based on what your product actually needs.
4. Don't feed the 128K context window. Compact it.
Gemma 4 has a 128K context window. Sounds great in theory. In practice it's a footgun.
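Every token in the prompt costs latency: the model has to process the whole prompt before it emits the first output token, and a longer context makes each decoded token slower too. Every token also costs RAM for the KV cache. On a phone, both of those are tight. If you naively shove the whole chat history into context every turn, you'll find that turn 20 takes noticeably longer than turn 5, and turn 50 might OOM the app.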
PocketClaw keeps a sliding window of the most recent 24 messages in their raw form. Anything older runs through a compaction pass:
- Extract explicit facts the user has stated ("I am X", "Remember Y", "My name is Z").
- Capture unresolved goals (keywords like "fix", "todo", "issue").
- Compile both into a single lightweight summary paragraph that gets prepended to the prompt as memory.
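The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not PocketClaw's implementation: the `Message` and `CompactedHistory` classes, the `compact` function, and both regular expressions are hypothetical, and the patterns only cover the example phrases and keywords listed above.

```dart
class Message {
  final String role; // 'user' or 'assistant'
  final String text;
  const Message(this.role, this.text);
}

class CompactedHistory {
  final String memory; // summary of older messages, may be empty
  final List<Message> recent; // raw sliding window
  const CompactedHistory(this.memory, this.recent);
}

// Assumed patterns: explicit facts and unresolved-goal keywords.
final _factPattern =
    RegExp(r"\b(i am|i'm|my name is|remember)\b", caseSensitive: false);
final _goalPattern = RegExp(r'\b(fix|todo|issue)\b', caseSensitive: false);

/// Keeps the last [window] messages raw and folds older user messages
/// into a short memory paragraph.
CompactedHistory compact(List<Message> history, {int window = 24}) {
  if (history.length <= window) {
    return CompactedHistory('', List.of(history));
  }
  final cut = history.length - window;
  final older = history.sublist(0, cut);
  final recent = history.sublist(cut);

  final facts = <String>[];
  final goals = <String>[];
  for (final m in older.where((m) => m.role == 'user')) {
    if (_factPattern.hasMatch(m.text)) facts.add(m.text.trim());
    if (_goalPattern.hasMatch(m.text)) goals.add(m.text.trim());
  }

  final factLine = facts.join(' ');
  final goalLine = goals.join(' ');
  final parts = <String>[
    if (facts.isNotEmpty) 'Facts the user stated: $factLine',
    if (goals.isNotEmpty) 'Unresolved goals: $goalLine',
  ];
  return CompactedHistory(parts.join('\n'), recent);
}
```

The `memory` string is what gets prepended to the prompt, and `recent` is sent as-is. Keyword matching like this is cheap and runs without a model call, at the cost of missing facts phrased in ways the patterns don't anticipate.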
That's the chat part. The aggressive part is image handling.
A typical user-uploaded photo is roughly 1 MB. By my estimate, carrying it in the prompt costs around 30,000 tokens. That's nearly a quarter of the entire 128K context window for one image. If the user uploads three photos across a conversation, your context budget is in trouble.
So PocketClaw does this: when an image message slides past the 24-message boundary, the raw image bytes get deleted from memory. What's preserved is the assistant's prior textual description of that image:
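// Builds the text-only memory line that replaces an evicted image.
// _shorten (not shown) caps the description at 1000 characters.
String _imageMemoryFromAssistant({
  required String? imageName,
  required String assistantText,
}) {
  final label = imageName ?? 'uploaded image';
  return 'Assistant previously described $label as: '
      '${_shorten(assistantText, 1000)}';
}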
So Claw still "remembers" what it saw, but only the description goes through the prompt. The 30,000-token image becomes a description capped at 1,000 characters, which is roughly 250 tokens at about four characters per token. That's a reduction of more than 100x in image memory.
The kicker: this works really well for the kind of follow-up questions users actually ask. "What was in that photo I sent earlier?" is answerable from the description alone. The model rarely needs the pixels back. If it does, the user can re-upload.
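Here is a small sketch of the eviction step, assuming a message type that carries optional image bytes. The `ChatMessage` class, `dropImage`, `shorten`, and `estimateTokens` are all hypothetical names for illustration. `shorten` is a guess at what a `_shorten`-style helper might do, and the four-characters-per-token figure is a rough rule of thumb, not a measurement of Gemma's tokenizer.

```dart
class ChatMessage {
  final String text;
  final List<int>? imageBytes; // null once the image has been evicted
  final String? imageName;
  const ChatMessage(this.text, {this.imageBytes, this.imageName});
}

/// Truncates [text] to at most [maxChars] characters, ending in "..."
/// when something was cut. Hypothetical stand-in for _shorten.
String shorten(String text, int maxChars) {
  final clean = text.trim();
  if (clean.length <= maxChars) return clean;
  if (maxChars <= 3) return clean.substring(0, maxChars);
  return '${clean.substring(0, maxChars - 3).trimRight()}...';
}

/// Rough token estimate, assuming about 4 characters per token.
int estimateTokens(String text) => (text.length / 4).ceil();

/// Replaces an image message with a text-only memory line built from
/// the assistant's earlier description. The bytes are not carried over.
ChatMessage dropImage(ChatMessage imageMsg, String assistantDescription) {
  final label = imageMsg.imageName ?? 'uploaded image';
  return ChatMessage(
    'Assistant previously described $label as: '
    '${shorten(assistantDescription, 1000)}',
  );
}
```

With a 1,000-character cap, `estimateTokens` puts the worst-case description at about 250 tokens, which is the number to weigh against whatever an image costs on your runtime.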
The general pattern here is: don't think of the context window as "free space up to the limit." Think of it as a budget. Spend it on things the model needs for the current turn. Everything else gets compacted to a textual summary.
5. Whether audio works in flutter_gemma depends on your model file, not the Gemma version.
This one I want you to know so you don't spend a day chasing the wrong thing like I did.
Gemma 4 E2B's model card lists native audio as a supported modality. So I thought: cool, I'll skip the speech-to-text plugin entirely, feed raw audio bytes straight to Gemma, get a single multimodal call instead of an STT-then-LLM pipeline. Cleaner.
I dug into flutter_gemma v0.15.1 source. The plugin's documentation consistently frames audio as a Gemma 3n E4B feature:
/// [supportAudio] - whether the model supports audio (Gemma 3n E4B only).
bool supportAudio = false,
That phrasing shows up in eight different files in the plugin. The interface, the API docs, the example app, the native Android side: they all treat audio as 3n territory.
But here's what's interesting once you read further. There is no hardcoded model-version check anywhere in the plugin. The actual gate is just if (config.supportAudio == true). So what's really limiting audio isn't the Dart code rejecting Gemma 4. It's whether the model file you downloaded actually contains the audio encoder.
The example app's model.dart has the clearest hint:
supportAudio: true, // .litertlm files have TF_LITE_AUDIO_ENCODER
supportAudio: false, // .task files don't have audio encoder
So the real question for any model you want to use with audio isn't "is it Gemma 3n?" It's "does my .litertlm file include the audio encoder for this model?" The plugin's docs assume the answer is yes only for Gemma 3n E4B because that's what's been tested and shipped that way. For Gemma 4 E2B, the model card says audio is supported by the model itself, but I haven't found a .litertlm build of E2B that bundles the encoder. If one ships, the plugin should handle it, since there's no version gate to stop it.
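If you want that distinction to live in code rather than in your head, a tiny helper can encode it. This is a sketch built only on the hint from the example app (.litertlm files can carry the audio encoder, .task files don't). The `AudioSupport` enum and `audioSupportFor` function are hypothetical and are not part of flutter_gemma.

```dart
/// What the file type alone can tell you about audio input.
enum AudioSupport {
  unavailable, // container format that never bundles the audio encoder
  possible, // may bundle it; only a real test on the file can confirm
}

/// Guesses audio support from the model file extension.
/// Assumption: only .litertlm files can carry the audio encoder.
AudioSupport audioSupportFor(String modelPath) {
  final lower = modelPath.toLowerCase();
  if (lower.endsWith('.litertlm')) return AudioSupport.possible;
  return AudioSupport.unavailable;
}
```

The helper deliberately never returns a confident "yes". A `possible` result means it is worth setting `supportAudio: true` and trying a short clip, while `unavailable` means you can skip straight to an STT fallback.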
For PocketClaw I went with Android's system STT (the speech_to_text package). Practical reasons. I get live transcription as the user speaks (text appears word by word while they're holding the mic). That's a noticeably better UX than the "hold, speak, release, wait" pattern you'd get from on-model audio. And it side-steps the question of whether my specific E2B .litertlm file has the encoder.
The takeaway: read your plugin's source before you trust its capability flags. And read it carefully enough to separate documentation framing from actual gating logic. The plugin's docs say "Gemma 3n E4B only" eight times, but the code itself doesn't enforce that. If you have an E2B build with the audio encoder, it's worth testing.
Closing
Five patterns. None of them are in the README. None of them are in Google's docs. I learned all of them by shipping something real and watching it fail in interesting ways.
If you're building on Gemma 4 for Android, these will save you time. If you want to see all five running together in a real app, PocketClaw is fully open source, MIT licensed.
The thing I keep coming back to, after 17 days with Gemma 4 E2B on a mid-range Android phone, is how capable a 2B model can be when it's running fast on the user's device. The latency feels different from cloud. There's no perceived "AI thinking" delay because there's no network. It just answers at the speed of your phone.
That's worth optimizing for.
Editorial assistance from Claude. All code, decisions, bugs, and engineering choices in this post are mine.













