/g/ - Thread 108566129

Anonymous
Is Gemma4 good enough for local vibecoding? 04/09/26(Thu)14:44:03 No.108566129

Is Gemma4 good enough for local vibecoding? Anonymous 04/09/26(Thu)14:44:03 No.108566129 [Reply]▶

Has anyone tried to use the new Gemma 4 models with any agent harnesses locally? I mainly use Opencode and my current machine is powerful enough to run gpt-oss 120b at q4_k_m quantization (I could use higher quants but then the t/s and prompt processing speeds fall off a cliff the longer the context gets) but apparently Gemma 4 curb stumps it despite it only being 31b. Is it actually worth trying or is it just more benchmaxxing? Also, I've seen people here say that it's not worth using Moe models because they are inherently "dumber" than sense models The only advantage to using moe is faster t/s, especially if you're using weaker hardware. To those who say that, does that mean I should just only be concerned with the dense 31B model? Does the KV cache behave differently? Like, does the moe kv cache build up slower and lead to lesser slowdowns at longer contexts than dense models or does it behave around the same?

58 RepliesView Thread

Showing all 58 replies.

Anonymous
04/09/26(Thu)15:14:11 No.108566331

Anonymous 04/09/26(Thu)15:14:11 No.108566331▶

>>108566129
It's way too big for my pc

Anonymous
04/09/26(Thu)15:40:21 No.108566479

Anonymous 04/09/26(Thu)15:40:21 No.108566479▶

>>108566129
iirc, gemma is a general purpose model and not a coding model. so you can get pretty good performance on crappier hardware than other similar models, but it's not going to be especially good at coding

Anonymous
04/09/26(Thu)15:45:04 No.108566500

Anonymous 04/09/26(Thu)15:45:04 No.108566500▶

>>108566129
Qwen 3.5 Opus 4.6 Distillation is decent for coding. Even better if you use another model to keep it on task

Anonymous
04/09/26(Thu)16:26:01 No.108566812

Anonymous 04/09/26(Thu)16:26:01 No.108566812▶

>>108566331
I can run it on my handheld PC.

Anonymous
04/09/26(Thu)16:43:59 No.108566938

Anonymous 04/09/26(Thu)16:43:59 No.108566938▶

You would think that the audio and image capabilities would hurt the coding performance.

Anonymous
04/09/26(Thu)17:07:33 No.108567093

Anonymous 04/09/26(Thu)17:07:33 No.108567093▶

asked it to change the Neovim theme in my dotfiles folder. It said, ‘Sure!’, read a bunch of files, and then immediately ran out of context and forgot what it was doing. Asked it to change my theme, but it wasn’t retaining the information it had already read, so it kept rereading the files ad infinitum.

Anonymous
04/09/26(Thu)17:10:56 No.108567108

Anonymous 04/09/26(Thu)17:10:56 No.108567108▶

File: 1769672499352563.jpg (37.1 KB)

37.1 KB JPG

>>108567093
>Ran out of context

Let me guess. You were running it with llama.cpp at the back end and forgot to correctly set the -c perimeter to a reasonably high number

Anonymous
04/09/26(Thu)17:17:45 No.108567153

Anonymous 04/09/26(Thu)17:17:45 No.108567153▶

I'm new to local ai's

Anyone here using wsl and rocm? I'm struggling to get ollama working with it. Just wondering if anyone has any tips or other alternatives

deleted
04/09/26(Thu)17:18:06 No.108567158

deleted 04/09/26(Thu)17:18:06 No.108567158▶

>>108567108
no i increased the context

Anonymous
04/09/26(Thu)17:20:05 No.108567170

Anonymous 04/09/26(Thu)17:20:05 No.108567170▶

>>108567158
>no i increased the context
To what? Other parameters like temperature matter a lot too. Did you use their recommended settings?

>>108567153
>wsl
For Windows? For what purpose?

https://ollama.com/download/windows

Anonymous
04/09/26(Thu)17:23:11 No.108567194

Anonymous 04/09/26(Thu)17:23:11 No.108567194▶

>>108567170
Windows Subsystem Linux.

I've been using it for 6 years in my job and it's a habit. I haven't created a python script in windows.

I did see that gemma4 recommends "lemonade" so I'll give that a go when I can be bothered to take another crack at it.

Anonymous
04/09/26(Thu)17:31:35 No.108567258

Anonymous 04/09/26(Thu)17:31:35 No.108567258▶

>>108566129
I can't even get it to work
it chats fine but crash when trying to use with any tools
Did anyone actually tried this shit out?

Anonymous
04/09/26(Thu)17:43:11 No.108567335

Anonymous 04/09/26(Thu)17:43:11 No.108567335▶

>>108567153
I gave up on wsl and just used PowerShell instead

Anonymous
04/09/26(Thu)17:45:54 No.108567359

Anonymous 04/09/26(Thu)17:45:54 No.108567359▶

>>108567335
I think it's about time I learn power shell if I get failures tomorrow. Cheers

Anonymous
04/09/26(Thu)18:40:25 No.108567750

Anonymous 04/09/26(Thu)18:40:25 No.108567750▶

File: Screenshot 2026-04-09 at 21.37.38.png (462.7 KB)

462.7 KB PNG

>>108567258
werkz on my machine. You might have to update the agent harness you're using. pic rel is opencode using the moe version locally.

Anonymous
04/09/26(Thu)18:42:35 No.108567760

Anonymous 04/09/26(Thu)18:42:35 No.108567760▶

>>108567750
what gpu

Anonymous
04/09/26(Thu)18:49:49 No.108567809

Anonymous 04/09/26(Thu)18:49:49 No.108567809▶

File: Screenshot 2026-04-09 at 21.48.49.png (1.2 MB)

1.2 MB PNG

>>108567760

Anonymous
04/09/26(Thu)18:50:52 No.108567820

Anonymous 04/09/26(Thu)18:50:52 No.108567820▶

>>108566129
Is it not available at openrouter etc? Try it out there. And report in, it's actually interesting.

Anonymous
04/09/26(Thu)18:59:06 No.108567881

Anonymous 04/09/26(Thu)18:59:06 No.108567881▶

>>108567750
I cant be the only one with this problem its literally unusable
do u see any vram spike
https://github.com/ggml-org/llama.cpp/issues/21690

Anonymous
04/09/26(Thu)19:01:13 No.108567900

Anonymous 04/09/26(Thu)19:01:13 No.108567900▶

>>108566938
More data to train on, so it ends up with better capabilities overall.

Anonymous
04/09/26(Thu)19:28:17 No.108568059

Anonymous 04/09/26(Thu)19:28:17 No.108568059▶

>>108567158
>>108567108
>>108567093
just increasing context during inference won't help you much if the model wasn't trained to work at high context length

Anonymous
04/09/26(Thu)19:31:57 No.108568075

Anonymous 04/09/26(Thu)19:31:57 No.108568075▶

>>108567809
Cool shit

Anonymous
04/09/26(Thu)19:38:14 No.108568109

Anonymous 04/09/26(Thu)19:38:14 No.108568109▶

>>108567881
My backend is ollama (which itself is based heavily on llama.cpp) so whatever issue you're running into doesn't seem to be the case for me (I haven't done any heavy usage of gemma4 yet so for all I know it could shit itself at long contexts like what he's >>108568059 >>108567093 experiencing so who knows. So far all I've done is have it create a README for this https://github.com/AiArtFactory/llava-image-tagger and the had it spoonfeed me how to push a verified update-commit to main. I think next I'll see if it can create and modify custom nodes and workflows for me for ComfyUI like Kimi-k2.5 was able to do form me

Anonymous
04/09/26(Thu)20:11:55 No.108568293

Anonymous 04/09/26(Thu)20:11:55 No.108568293▶

>>108566129
If they are not lying about 26B performance, it's worth looking into, maybe.

Gemma4 31B (dense model) Q4
3090 - 36t/s
4090 - 43t/s
5090 - 65t/s

Gemma 4 26B (MoE) Q4
3090-120t/s
4090-147t/s
5090-182t/s

Anonymous
04/09/26(Thu)20:15:32 No.108568314

Anonymous 04/09/26(Thu)20:15:32 No.108568314▶

>>108567809
It won't go far actually. High RAM macs are a waste.

gemma4 31B (dense)
M3 Max 36GB - 12t/s

gemma4 26B (MoE)
M3 Max 36GB - 56t/s

Anonymous
04/09/26(Thu)20:24:33 No.108568360

Anonymous 04/09/26(Thu)20:24:33 No.108568360▶

File: Gemma4-local-tokens-per-second.png (366.3 KB)

366.3 KB PNG

>>108568314
performance on my machine. I'm >>108567750

Anonymous
04/09/26(Thu)20:24:41 No.108568361

Anonymous 04/09/26(Thu)20:24:41 No.108568361▶

bonus

Gemma4 E2B Q4_K_S (from unsloth)
RX580 - 11t/s

Gemma4 E2B (not sure about Q, it's the AI Edge Gallery version, 2.5GB in memory)
ARM Mali-G57 MC2 - 5t/s

Anonymous
04/09/26(Thu)20:27:01 No.108568376

Anonymous 04/09/26(Thu)20:27:01 No.108568376▶

>>108568314
>High RAM macs are a waste.
why?

Anonymous
04/09/26(Thu)20:28:42 No.108568385

Anonymous 04/09/26(Thu)20:28:42 No.108568385▶

>>108568314
nobody cares about 36gb macs lol.
"high ram macs" are 128GB mac studios and up.

Anonymous
04/09/26(Thu)20:31:16 No.108568404

Anonymous 04/09/26(Thu)20:31:16 No.108568404▶

File: Werks-on-my-machine_Gemma4-local-tokens-per-second.png (661.6 KB)

661.6 KB PNG

>>108568385
nta. way ahead of ya

Anonymous
04/09/26(Thu)20:38:16 No.108568445

Anonymous 04/09/26(Thu)20:38:16 No.108568445▶

>>108568376
Nice to have big stuff in RAM. Useless when it's too slow to be useful. Like with LLMs.
>>108568404
Not a dense model. But yeah, this one seems usable. But then again... Do you need more than 64GB? Probably not. Even 32 used to be a waste, before OSS were released. What existed prior was mostly garbage.

Anonymous
04/09/26(Thu)20:42:25 No.108568474

Anonymous 04/09/26(Thu)20:42:25 No.108568474▶

File: 1752989845418212.png (361 KB)

361 KB PNG

>>108568445
>Not a dense model
???

https://huggingface.co/google/gemma-4-31B

>seems usable.

Define unusable. I really hope t/s isn't the metric you're using....

Anonymous
04/09/26(Thu)20:54:35 No.108568567

Anonymous 04/09/26(Thu)20:54:35 No.108568567▶

>>108568109
>ollama
Do yourself a huge favor, switch to dockerized llama.cpp

Anonymous
04/09/26(Thu)21:14:30 No.108568711

Anonymous 04/09/26(Thu)21:14:30 No.108568711▶

>>108568474
>???
Look again. At the top of your screen shot it says
ollama run gemma4:26b-lalala

Anonymous
04/09/26(Thu)21:43:29 No.108568923

Anonymous 04/09/26(Thu)21:43:29 No.108568923▶

>>108568711
You said "not a dense model" but I tested the moe AND the dense model

Anonymous
04/09/26(Thu)21:43:59 No.108568928

Anonymous 04/09/26(Thu)21:43:59 No.108568928▶

mmm im more curious of that technology of google to zip usage of memory, i guess they have implement it in this local model,

Anonymous
04/09/26(Thu)21:44:10 No.108568930

Anonymous 04/09/26(Thu)21:44:10 No.108568930▶

>>108566129
>Is it actually worth trying
what do you mean, you already have local llm setup. you just download the file and select it from the agent dropdown

Anonymous
04/10/26(Fri)00:35:20 No.108569918

Anonymous 04/10/26(Fri)00:35:20 No.108569918▶

File: Screenshot_20260409_213231.png (86.9 KB)

86.9 KB PNG

>>108566129
Rate my game.

Anonymous
04/10/26(Fri)01:23:56 No.108570129

Anonymous 04/10/26(Fri)01:23:56 No.108570129▶

>>108568711
Now look in the middle of that same screenshot

Anonymous
04/10/26(Fri)01:42:26 No.108570227

Anonymous 04/10/26(Fri)01:42:26 No.108570227▶

>>108568109
>ollama
>I haven't done any heavy usage
of course you haven't

Anonymous
04/10/26(Fri)01:55:22 No.108570279

Anonymous 04/10/26(Fri)01:55:22 No.108570279▶

Anyone having Gemma 4 be completely retarded on Openrouter? I think there is some kind of issue with their backends.

Anonymous
04/10/26(Fri)05:15:59 No.108571182

Anonymous 04/10/26(Fri)05:15:59 No.108571182▶

>>108570227
It just werkz. The ollama hate is so forced and performative.

Anonymous
04/10/26(Fri)05:24:23 No.108571199

Anonymous 04/10/26(Fri)05:24:23 No.108571199▶

neat, just tried gemma 4 heretic and it seems to be far more logically consistent writing fetish porn than what i've tried before

Anonymous
04/10/26(Fri)05:37:08 No.108571232

Anonymous 04/10/26(Fri)05:37:08 No.108571232▶

>>108569918
gay/10

Anonymous
04/10/26(Fri)08:50:09 No.108571912

Anonymous 04/10/26(Fri)08:50:09 No.108571912▶

>>108566938
31b doesn't have audio.

Anonymous
04/10/26(Fri)09:13:19 No.108571995

Anonymous 04/10/26(Fri)09:13:19 No.108571995▶

File: Screenshot 2026-04-10 at 12-04-52 Welcome Gemma 4 Frontier multimodal intelligence on device.png (34.6 KB)

34.6 KB PNG

>>108571912
https://huggingface.co/blog/gemma4
what did they mean by pick related?

>These models are trained to answer questions about speech in audio. Music and non-speech sounds were not part of the training data.
But I guess it doesn't matter.

Anonymous
04/10/26(Fri)10:18:14 No.108572245

Anonymous 04/10/26(Fri)10:18:14 No.108572245▶

>>108571995
Retard only the E 2 and 4b model have audio and video, you can see it on the model pages if you read them...

Anonymous
04/10/26(Fri)10:43:26 No.108572366

Anonymous 04/10/26(Fri)10:43:26 No.108572366▶

>>108567359
yep powershell works

Anonymous
04/10/26(Fri)11:10:49 No.108572511

Anonymous 04/10/26(Fri)11:10:49 No.108572511▶

>>108572245
>reading model pages
lol, lmao even

Anonymous
04/10/26(Fri)11:12:41 No.108572518

Anonymous 04/10/26(Fri)11:12:41 No.108572518▶

File: 1773459510256705.jpg (23.5 KB)

23.5 KB JPG

>>108572245

>>108572511 it's probably the same person that was surprised The single digit parameter "effective" models kept shitting the bed when they try to use them for tool calling

Anonymous
04/10/26(Fri)13:28:29 No.108573353

Anonymous 04/10/26(Fri)13:28:29 No.108573353▶

>>108566129
Honestly I didn't like Gemma4. Qwen3.5 is still better overall if you have the RAM.

Anonymous
04/10/26(Fri)14:09:42 No.108573624

Anonymous 04/10/26(Fri)14:09:42 No.108573624▶

>>108566129
I got better results with qwen 3.5 than Gemma 4. You should try with qwen 3.5 35b on your setup.

Anonymous
04/10/26(Fri)17:39:55 No.108575169

Anonymous 04/10/26(Fri)17:39:55 No.108575169▶

File: 1768632778147069.gif (5.7 KB)

5.7 KB GIF

Op reporting in

The people that were saying gemma4 is useless at long contexts were not lying or exaggerating. If anything they were understating it.

>>108573353
>>108573624
I had it inspecting and proposing changes to a relatively simple code repo that had literally one single script inside, but it also directed it to read two other code repositories in order to learn how the technology and those worked in order to implement the change I was proposing. I think after the 70,000 token marl (This takes basically no time to reach if you're using agent harnesses and your directing it to read hefty code bases) it's thinking output got caught in a loop and it literally just stopped generating anything eventually. Try to get it working again by just telling it "hey you seem to have gotten stick can you try again?" But at that point the context was so huge that I didn't feel like waiting for it to reprocess the bloated context. Switched to Qwen3.5-35B-A3B (specifically the ollama coding variant at q8_0 precision) and it was able to execute the task I gave it in our reasonable amount of time. I guess that chart they showed of it being comparable and performance to glm 5 or Kimi (Gemma 4 — Google DeepMind https://deepmind.google/models/gemma/gemma-4/) was too good to be true. Then again that particular chart was an ELO score benchmark Which if my understanding of what it measures is correct, is utterly worthless would determining whether or not model is good for agentic tasks/vibe coding. Is also gives me the impression They don't even use that test to evaluate it at long context either so not only is it a worthless benchmark they half-assed it by not actually using it for anything that would require a lot of work and patience

TLDR: ELO scores are the most worthless benchmark and I'm a fool for thinking they mattered in the slightest when I came to hell the normal performs and long contexts doing agentic-coding related shit. At least I learned something new though.

Anonymous
04/10/26(Fri)21:34:29 No.108577076

Anonymous 04/10/26(Fri)21:34:29 No.108577076▶

>>108575169
Try again with the proper precursor flags to negate llama's improper implementation, you need a jinja chat template, to reduce the default checkpoint quantity, you wanna turn off parallel execute, and you want to set a ram cache, finally you can further quaternize the checkpoint if needed. After adding these preprocessor flags it stopped getting stuck in tool calls. Didn't include my flags cause you're so smug you look them up.

Anonymous
04/10/26(Fri)21:37:10 No.108577091

Anonymous 04/10/26(Fri)21:37:10 No.108577091▶

File: 1753267830704616.jpg (131 KB)

131 KB JPG

>>108577076
>All of this work and extra hoot the jump through when other models do the job better the first try

A "You're just holding it wrong" teir post, but I'll look into it later

Anonymous
04/10/26(Fri)22:24:39 No.108577468

Anonymous 04/10/26(Fri)22:24:39 No.108577468▶

>>108577091
Reasonable reply, honestly I think llama just programmed some of its defaults for qwen that dont work well for gemma, here -a "Gemma" -c 131072 -ngl 99 -b 1024 --host 0.0.0.0 -fa on --jinja --chat-template-file Gemma26B_chat_template.jinja -np 1 --cache-ram 8192 --ctx-checkpoints 8 -ctk q8_0 -ctv q8_0 this is my 3090 for 26B high context no image mmproj had no problems with it making me a bread formulator app and editing it a bunch, if you run outa physical ram lower ctx-checkpoints, raise them if you want better parsing of memory, I been using 31B more but they both improve with the jinja templates enabled.

Anonymous
04/11/26(Sat)04:08:06 No.108579639

Anonymous 04/11/26(Sat)04:08:06 No.108579639▶

>>108577468
Qwen3-35b-a3b still beats it in agentic tasks if the benchmarks scores are to be believed but your settings are worth looking into

Anonymous
04/11/26(Sat)04:12:34 No.108579660

Anonymous 04/11/26(Sat)04:12:34 No.108579660▶

>>108567258
It's gotten fixes recently, update llama.cpp or whatever you're using to run it.

Subject
Name
Comment
File	Supported: JPG, PNG, GIF, WebP, WebM, MP4, MP3 (max 4MB)
CAPTCHA

Reply to Thread #108566129

🔍 Search & Sort