Thread #108619962
File: 2026-04-17_030526_seed8_00001_.png (1.3 MB)
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108616559 & >>108612501
►News
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
537 Replies
>>
►Recent Highlights from the Previous Thread: >>108616559
--Comparing Qwen3.6 and Gemma4 through benchmarks, logic tests, and roleplay:
>108617961 >108617986 >108618124 >108618033 >108618137 >108618270 >108618279 >108618308 >108618385 >108618182 >108618232 >108618372 >108618391 >108618008 >108619188
--Discussing Ternary Bonsai 1.58-bit models and their benchmark performance:
>108616622 >108616633 >108616680 >108617094 >108617852 >108619456
--Discussing training methods and datasets to improve LLM writing quality:
>108617013 >108617022 >108617044 >108617111 >108617290 >108617334 >108617353 >108617147 >108617673
--Comparing model reasoning and self-correction failures via car wash riddle:
>108617731 >108617842 >108617909 >108617853 >108618784
--Anon shares Local-MCP-server repo and discusses Python dependency frustrations:
>108616702 >108616740 >108616751 >108616782 >108616936 >108617038 >108617061 >108617067 >108618994 >108619185 >108618816 >108618831 >108616807
--Discussing a bug where Koboldcpp ignores smartcache slot settings:
>108618500 >108618535 >108618551 >108618616 >108618675 >108618736 >108618760
--Anon fixes SillyTavern context reprocessing caused by sysprompt macros:
>108616870 >108616901 >108616910 >108616939 >108616925 >108616928 >108616981 >108617077
--Logs:
>108616702 >108617154 >108617464 >108617518 >108617655 >108617688 >108617731 >108617757 >108617833 >108617853 >108617909 >108617986 >108617991 >108618124 >108618137 >108618182 >108618409 >108618436 >108618545 >108618742 >108619201 >108619219 >108619317 >108619382 >108619442 >108619577
--Rin (free space):
>108618594
►Recent Highlight Posts from the Previous Thread: >>108616563
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>
File: 1740383804445065.jpg (329.7 KB)
>>
>>
>>108619965
Half the last thread being exposed as non-sentient is unfortunately relevant to LLM consciousness discourse: human consciousness being treated as self-evident is upstream of finding a working definition of what digital qualia would entail, Migubaker.
>>
>>
>>
File: Screenshot_20260416_225636.png (484.9 KB)
Building my own UI with the help of Gemma 31B q5.
>Why
None of the other UIs could satisfy my workflow; they either lacked the functionality or didn't use llama.cpp.
I have a long way to go, including updating the icons.
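For anyone else rolling their own, the core request loop against llama-server is tiny. A minimal sketch (assumes a stock llama-server on its default port 8080 and its native /completion endpoint; tweak to taste):

```python
import requests  # pip install requests

def complete(prompt: str, n_predict: int = 256) -> str:
    # POST to llama.cpp's built-in HTTP server; the JSON reply carries
    # the generated text in its "content" field.
    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={"prompt": prompt, "n_predict": n_predict, "temperature": 0.8},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["content"]

if __name__ == "__main__":
    print(complete("The hardest part of writing a chat UI is"))
```

Everything on top of that (chat history, regens, swipes) is just string bookkeeping around this one call.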
>>
>>
>>
>>
>>
>>108620078
Many anons have had their belief that LLMs are somehow beneath them challenged with the irrefutable demonstration of their own lack of qualia. This is a big blow to their egos: both for their understanding of themselves as conscious human beings and for their predictions of LLM capability being outpaced by Gemma 4. It's a double whammy.
>>
>>
>>
>>
>>
>>
>>
>>
>>
I got a 9070 XT thinking there was no reason to stick with CUDA since I'll never be able to run anything good. Then they started dropping all those kino voice models and the new Gemma stuff, and now I'm seriously on the fence about getting a second one so I can have a hefty amount of VRAM, though that still falls far short of the best textgen stuff. Still, I could do some local stuff with Gemma and also locally run voice gen with SillyTavern. OTOH I already have enough for the latter.
I'm just worried about the rising costs of video cards and eventually needing 32GB.
>>
>>
>>
>>
>>
>>108620110
you'll notice nobody chose to provide a good accounting of how they would respond to a hypothetical from a hostile questioner, proving the very thesis of the post. so how baity could it really have been?
>>
>>
>>108620140
>Alibaba shills seething about Qwen getting Gemogged
>Qwen's usecase is cooooding and agentic stuff
Waitchads will win. It's in the chinklabs' best interest to make more lightweight agentic harnesses to sell their models if they can't actually beat Gemma's reasoning ability per parameter.
>>
>>108620139
>>108620151
It gets argued the other way too. If these anons can construct a facsimile of being salty that's indistinguishable from the real thing, is that not the same as having the real thing?
>>
>>
>>108620155
measurably yes, but spiritually no; if you only look at it through a materialist lens you will never be able to understand. even some ensouled people fall into this trap by outsmarting themselves out of what they knew, while others are pure automatons who never had a chance to understand to begin with
>>
File: Sorting questions.jpg (7.7 KB)
>>108620166
Some can see, others can see when shown, others cannot see.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108620222
>>108620223
Uncanny synchronicity.
>>
>>
>>
File: 1746657517196.png (64.8 KB)
>>108618660
>-1 point for that censored garbage gpt oss and how much it set us back
kek I remember the despair in this general when TOSS came out, it nearly killed local
>>
>>
>>
>>
>>
>>
>>
File: Tavern.png (94.3 KB)
Where are the entities created by this stored? In some hidden folder?
>>
>>
>>
>>
>>
>>
>>
>>
File: Screenshot at 2026-04-17 13-38-59.png (541.1 KB)
>Zen 7 will be DDR5
it's so over
>>
>>
>>108620347
Turboquant won't give you more space, it'll just make the quanted cache more accurate. There's almost no improvement over Hadamard rotation, which is what they have in place in lcpp now, so you'll get effectively no benefit; in fact, it's a little slower.
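For the curious, a toy demo of why rotation helps quantized caches at all (my own illustration, not lcpp's actual implementation): an orthonormal Hadamard rotation smears outliers across all dimensions, so absmax quantization wastes less of its range on one freak value.

```python
import numpy as np
from scipy.linalg import hadamard  # pip install scipy

def quant_error(x: np.ndarray, bits: int = 4) -> float:
    # Naive symmetric uniform quantization with absmax scaling.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.round(x / scale) * scale
    return float(np.mean((x - q) ** 2))

rng = np.random.default_rng(0)
n = 256
x = rng.normal(size=n)
x[7] = 40.0  # one big outlier, typical of KV activations

H = hadamard(n) / np.sqrt(n)  # orthonormal: H.T @ H = I
print("plain  :", quant_error(x))
print("rotated:", quant_error(H @ x))  # dequant applies H.T to undo the rotation
```

The rotated error comes out far lower because the outlier's energy is spread over all 256 dims before rounding.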
>>
>>
>>
>>108620347
I'm using 4-bit and I get up to ~150k context without really seeing any obvious retardation from it. Around 50k tokens into the chat, prompt processing takes so long that I end up starting a new one anyway.
>>
>>
>>
>>108620362
>>108620376
>>108620381
So what was with all the hype around it?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: 1756640399368863.png (4.1 KB)
>>108620430
Doesn't this increase proompt processing speed?
>>
>>
>>
>>
>>
>>
>>
I honestly thought it was over for consumer local, but now that Gemma 4 has released I'm not so sure anymore. I assumed a model just has to be several hundred GB to not be retarded, but it seems like the actual floor is way lower. Pretty interesting; I wonder if we can go even lower.
>>
>>
>>
>>108620476
>>108620510
At least you're not namefagging and posting the schizo images, but you're very easily recognizable.
>>
File: 1776403932063.jpg (94.8 KB)
Can you please recommend good prompt engineering resources?
I have played with both system and chat prompts, and have noticed that often the model does not understand what I want, gives wrong answers, or goes off in a perpendicular direction, not because it's stupid but because I am a retard who can't create good, efficient prompts. Literally a skill issue.
>>
>>
>>
>>
>>
>>108620542
Honestly, all models are different; it's mostly just trial and error. But the main thing is picking your words very carefully. Every word steers the model in a specific direction, and a single strong word is often better than a long set of instructions.
>>
File: qwen3.6beatsgemma.png (94 KB)
>>108620607
iq3 m whatever
>>
>>108620451
Oh nevermind, it's pretty stupid, must be the 3b-ness showing through. It had the same problems 'getting' the story as gemma 26b, and its writing is weird and not as good. Trvly, dense is the way to go for smart storywriting.
>>
>>
>>
>>
>>
>>
File: 1772435378555762.png (103.7 KB)
103.7 KB PNG
>>
>>
>>
>>108620542
It's mostly voodoo ritual.
>>108620570
Just ask it to implement basic things to see how it's going to interpret it, and slowly stack up more guidelines starting from scratch. 'Describe X in the most Y way possible.', 'What is Z in writing? Give me an example of it', 'Don't do A, B, C. Now give me an example of D', etc.
>>
>>
File: 1773043714949398.jpg (11.3 KB)
>>108620675
>>
>>108620542
Put text into black box.
Watch text come out of the black box.
Use your mushy noodles to compute the gradient between the output text and the desired text.
Modify the input text according to the gradient to make the output text closer to the desired text.
Repeat.
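That loop, spelled out against a local OpenAI-compatible endpoint (llama-server exposes one at /v1/chat/completions; the URL and settings are assumptions about your setup):

```python
import requests

API = "http://127.0.0.1:8080/v1/chat/completions"

def run(prompt: str) -> str:
    r = requests.post(API, json={
        "model": "local",  # most local backends ignore this field
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

prompt = "Describe a rainy street in the most ominous way possible."
while True:
    print(run(prompt), "\n")
    # You are the optimizer: read the output, nudge the wording, repeat.
    revised = input("Revised prompt (empty to stop): ")
    if not revised:
        break
    prompt = revised
```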
>>
>>108620398
>>108620392
I need a 4chan special, a package with a bat file that flickers CMD windows open for split seconds and sets it all up for me
>>
File: 1751683665955285.gif (2.2 MB)
>>108620675
>>
>>
>>108620675
our bait is far in advance of theirs
however, has it been litigated yet whether the cp in the og stable fiddusion models gives those victims any kind of rights to get the model taken down?
because if they can do that, it puts serious pressure on "ai is fair use and transformative"
>>
>>
File: 1760790498553131.png (416.5 KB)
Indeed Opus, indeed...
>>
>>
>>
>>108620786
That looks like overzealous anti-conspiracy measures where it defaults to aggressively shooting down anything outside its status quo then makes the user spoonfeed it an argument to evaluate. In cases where the answer is self-evident, it looks very silly.
>>
>>
>>
>>108620817
basically this, you're confusing the model by training it with really accurate shit and then asking it to learn that 2+2 = 5 at the same time, like a leftist that pretends that men can be pregnant; it ends up with serious cognitive dissonance
>>
>>108620652
>>108620661
No she doesn't. She can't tell you "I was struggling with prompts too, but then I read X and tried Y and noticed a big difference in output quality". She can give advice, but she does not know for sure and has never tried it herself. inb4 > she
>>108620611
>>108620686
>>108620698
That's the point, there are too many options to try and iterate; it's like walking in the dark. Just a few insignificant words in the system prompt, and Gemma starts thinking like Qwen, with dozens of "Wait..." in the reasoning log.
> Just ask it to implement basic things to see
Sounds good, but first you have to know what X is, or the model may miss a small detail that changes everything.
>>
File: 1760422966343103.png (286.9 KB)
>>108620766
https://xcancel.com/claudeai/status/2044785261393977612#m
oof, might be the first time that Anthropic fumbled a new update; so far it's been straight As. Let's hope it's a fluke and it won't go the OpenAI way, this thing is still way ahead of the competition in terms of coding
>>
>>
>>
>>
File: 1776243051159220.mp4 (2.2 MB)
There are probably zero people here who care but nvidia just released gr00t n1.7 a couple hours ago. It's the latest version of their robotics VLA model.
https://huggingface.co/nvidia/GR00T-N1.7-3B
No blog post yet; I only noticed it was public because I'm a terminal huggingface stalker. They'll probably do an official announcement tomorrow morning if I had to guess.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: 1763436884726755.png (386 KB)
>>108620943
wait, he uncucked qwen 3.6 before gemma 4 31b? come on!
>>
>>
File: 1753709040623159.png (127.4 KB)
>>108620960
wait im rarted I can repack his shit!
>>
File: 1761543257323200.png (6.6 KB)
>>108620990
llmao bros.. we won!
>>
File: SIX SEVEN.png (122.1 KB)
Qwen is a zoomer faggot confirmed
>>
>>
File: 1752504870572278.png (96.8 KB)
aight which one do I pick bros?
>>
File: I think I'll stick on gemma.png (417.9 KB)
grok is this true?
>>
>>
>>
File: 1295891287606.jpg (2.6 KB)
>lewd story plays so straight and wholesome I don't want it to veer toward lewd
>>
>>
>>
>>
>>108620404
A Gemma 4-specific llama.cpp backend setting that clips the +/- scores of raw logits to a certain value. In practice it pulls outliers (both positive and negative) closer in probability to their immediate neighbors:
--override-kv gemma4.final_logit_softcapping=float:30
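If it works like Gemma 2's softcapping did (an assumption on my part), it's a smooth squash rather than a hard clip:

```python
import numpy as np

def softcap(logits: np.ndarray, cap: float = 30.0) -> np.ndarray:
    # Squashes logits into (-cap, cap): near zero it's ~identity,
    # while large outliers saturate instead of dominating the softmax.
    return cap * np.tanh(logits / cap)

x = np.array([2.0, 5.0, 60.0])
print(softcap(x))  # ~[2.0, 4.95, 28.9] -> the 60 outlier gets pulled way in
```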
>>
>>
>>108620975
>wait, he uncucked qwen 3.6 before gemma 4 31b? come on!
It's not necessary anyway, just use this: https://desuarchive.org/g/thread/108596609/#108597318
>>
>>108620960
You'll never get close to unsloth's quality if you quantize them on your own, unless you spend far too much time and SSD cycles testing all possible combinations. Why doesn't/can't llama-quantize optimize quantizations for the best quality given a target filesize, anyway? That would be useful.
>>
>>
>>108621112
>Why doesn't/can't llama-quantize optimize quantizations for the best quality given a target filesize, anyway
Because
>you spend far too much time and SSD cycles testing all possible combinations
Default quants are fine.
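To be fair, the search itself isn't the hard part; the killer is that every probe means re-quantizing and re-measuring perplexity/KLD. A toy sketch of what such an optimizer would do, with invented per-tensor numbers (nothing here is measured):

```python
# Greedy mixed-precision search: spend a size budget where it buys the
# largest error reduction per MB. All numbers are illustrative only.
tensors = {
    # name: {bits: (size_mb, error)}
    "attn_q": {4: (120, 0.030), 6: (180, 0.012), 8: (240, 0.005)},
    "ffn_up": {4: (300, 0.050), 6: (450, 0.015), 8: (600, 0.006)},
}
budget_mb = 700
choice = {name: 4 for name in tensors}  # start everything at 4-bit

def total(idx: int) -> float:
    # idx 0 = size, idx 1 = error
    return sum(tensors[n][choice[n]][idx] for n in tensors)

while True:
    best = None  # (error drop per MB, tensor, bits)
    for name, opts in tensors.items():
        for bits in opts:
            if bits <= choice[name]:
                continue
            d_size = opts[bits][0] - opts[choice[name]][0]
            d_err = opts[choice[name]][1] - opts[bits][1]
            if total(0) + d_size <= budget_mb and (best is None or d_err / d_size > best[0]):
                best = (d_err / d_size, name, bits)
    if best is None:
        break
    choice[best[1]] = best[2]

print(choice, f"-> {total(0)} MB, err {total(1):.3f}")
```

In real life each "error" lookup is a full eval run, which is exactly the time-and-SSD-cycles objection above.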
>>
>>108621089
>>108621090
respect is always the way to go
>>
File: stdquant_q4.png (718.7 KB)
>>108621117
>Default quants are fine.
Default ones leave quite a bit of performance on the table.
https://localbench.substack.com/p/gemma-4-31b-gguf-kl-divergence
>>
>>
>>
>>108621154
If you're quantizing the models on your own just with llama-quantize, that's what you'll most likely have to do, but the Unsloth bros and others are using their own fork of llama.cpp with modifications that presumably do that automatically.
Llama.cpp's subpar default quantizations (whether in the quantization schemes or default calibration) are enabling Unsloth and others to provide their own "special sauce" and become popular as model quant providers.
>>
File: file.png (323.1 KB)
>>108619962
hello gamers. I was wondering if I could run this model locally on a 24GB Mac or is it too soon?
>>
>>
File: 1774129655240019.png (292 KB)
https://www.aiuniverse.news/ai-breakthrough-smaller-models-now-match-bigger-ones-with-smarter-design/
Gemma 5 is going to be crazy
>>
File: e29c9ef8-0cc4-4e1b-927d-5a3bd408561e_2820x1601.png (303.2 KB)
>>108621186
Even Q8_0 gives a performance loss in some areas (long context) despite prior claims of being "virtually lossless". That said, the fact that both Q6_K and Q8_0 appear to settle close to a high "noise floor" is suspicious (or Q8_0 is not as good as one might think).
>>
>>
>>
>>108621180
ah well, nevermind, I need double the memory for that https://www.canirun.ai/?q=qwen+3.5 I'll remember in the future to invest more in memory
>>
>>
>>
>>108621171
Anon >>108621112 asked why they don't do it. The answer is in the same post.
Default quants are fine, quick to make, and you don't have a dependency on yet another group of people.
>>
>>
>>
>>
File: brat bench.png (1003.5 KB)
added win support to my server, completely untested
>>108618560
fixed https://github.com/NO-ob/brat_mcp/releases/tag/1.0.4
>>
>>
>>
>>
>>
>>
>>108621189
An AI summary of an article of a paper ...
https://arxiv.org/pdf/2604.12946
>>
>>108621194
I made a comment about this noise floor thing. >>108577138
We'd need him to test that to really know for sure. I at least would not be so quick to call Q8 "bad" for long context.
>>
Out of curiosity, following the discussions above, I tried looking at the linked PRs and discussions in https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md and it seems to me that ikawrakow did basically most of the quantization algorithm research and implementation for llama.cpp beyond the original *_0 and *_1 quants. Now that he's not working on llama.cpp anymore, is llama.cpp ever going to improve in this area?
>>
>>
>>
>>
>>
my first impressions (qwen3.6-35b-a3b vs gemma-4-24b-a4b)
- Qwen3.6 reduced the overthinking by like 10-20% (heuristic guess)
- So far I have not encountered looping on Qwen3.6, which was a major bug in Qwen3.5
- Gemma 4's Q&A answers are massively higher quality
- But Qwen3.6 also has a noticeable output quality increase over Qwen3.5
- Qwen3.6 is noticeably smarter than Qwen3.5 and Gemma 4 on agentic tasks
same stuff:
- Qwen3.5/3.6 have a better memory footprint than Gemma 4
- Qwen3.5/3.6 have better decode throughput than Gemma 4 (40 vs ~25 tok/s on an RTX 3080)
- Qwen3.5/3.6 prefill is noticeably slower than Gemma 4's
- On agentic tasks, Qwen3.5/3.6 can compress their thinking to one-liners, unlike Gemma 4
>>
>>
>>
File: imatrix.png (64.4 KB)
>>108621333
>I'm not sure anymore about that. I didn't realize that ikawrakow's contribution to core llama.cpp functionalities was that extensive.
I didn't realize either until some anon here posted "imatrix was a mistake" and blanked ikawrakow for it:
https://github.com/ggml-org/llama.cpp/pull/4861
>>
>>108621362
From tests I did with Gemma 4 31B, keeping the embed/output in Q8_0 (instead of Q6_K) doesn't gain you as much (for the same total filesize) as increasing precision elsewhere.
Some tensors in specific layers can also be quantized to a lower precision without significant quality loss, but llama-quantize doesn't do this search on its own, it only bumps precision up one notch according to some internal heuristics.
If you're simply targeting Q8_0, good for you, but when you only have enough memory for a 4-bit quantization, every little gain matters.
>>
>>108621112
I don't know why you're acting as if unsloth have some kind of special sauce or high skillset.
They're a bunch of low impulse control FOMO apes with 2 macros for llama-quantize and git-lfs that don't check their work, hence reuploading the same damn quant 4 times in a day, EVERY single time there's a new release.
What they do isn't hard, clever, or unique. It's just well marketed.
>>
>>108621396
>What they do isn't hard, clever, or unique. It's just well marketed.
Agreed. And their library is a pain in the arse to use too, randomly breaks if they're excitedly rushing in support for some new model like gpt-oss.
And they don't pin the versions for their stupid 'unsloth-zoo' properly.
But their original Deepseek-R1 quants were good. And their Q8_0 and BF16 quants are handy to save a download + convert.
>>
>>108621387
So the "schizo fork" (as some here are calling) of llama.cpp was made by the author who implemented about every quantization advancement in mainline, interesting. And all of this because niggerganov didn't want to add "copyright by ikawrakow" or something like that? I might be missing or forgetting some key detail in the story, though.
>>
>>108621424
more like intel demanded attribution on code written by IK and niggerganov gave in.
I mean I wouldn't have created an autism branch but yeah ik had reasons to be pissed. I wish he could get over it so he can bring good improvements to mainline instead of this split fork autism, ik works alone and his fork is now noticeably lagging behind and doesn't support the same models.
SAD
>>
>>108621299
Quanting was a dead end anyway. Do a supersimple braindead quant, then layerwise distill to fix it. That's almost certainly what Bonsai does.
Like LBLLM. https://openreview.net/forum?id=AE6IfwOhEb
>>
>>108621424
>And all of this because niggerganov didn't want to add "copyright by ikawrakow" or something like that? I might be missing or forgetting some key detail in the story, though.
>>108621424
>more like intel demanded attribution on code written by IK and niggerganov gave in. I mean I wouldn't have created an autism branch but yeah ik had reasons to be pissed.
That's kind of what I'd gathered as well.
Niggerganov closed the PR adding support for the ik quants recently too, even after ikawrakow said it's fine...
>>
>>108621117
>>108621112
>>108621137
If you need quality quants, just use exl3
>>
>>
>>
>>
>>108620173
Learned a new term today! Fuck you.
>>
>>108621424
It was the whole issue about copyrighting his code and wanting more recognition: he saw Intel contribute their SYCL backend with their copyright in the headers and wanted his own, which is legal. But the problem was he didn't want to budge on that position despite everyone else saying the git history and maybe an AUTHORS file were enough for that. No one said he was wrong for wanting his own copyright headers, but they wanted a third solution, and anything short of having it in the headers was anathema to IK for some inane reason.
Instead of coming to an agreement, IK just butted heads until ggregnov removed him from contributing over this, despite the fact that his ownership of his code was never questioned or in danger. I don't understand why he thinks the copyright affords him anything at all under the MIT license, which supersedes it, or why having it in the headers is that important. He's not even the one writing the original academic papers or doing the research, like QTIP, which IK's Trellis quants are based off of; he's only entitled to his version of these quants in code, which would be contingent on the papers' copyright if they even allow that.
If he hadn't acted like llama.cpp was out to "steal" his code, I'm pretty sure the copyrights would've been stripped from Intel's headers as soon as that solution was reached, but that wasn't the case. Intel even stopped doing it with the openVINO backend they just recently contributed.
>>108621437
Intel didn't? Ollama most certainly uses their code upstream without consequences. The only reason kobold and its forks don't have it is because they diverged too much from mainline when only a few backends were in llama.cpp, and there aren't enough Intel GPU users.
IK can demand it, but the fork is hurting everyone because he can't work with people, being a stubborn old Eastern European man.
>>
File: 1772144361285298.jpg (80 KB)
>>108621560
You're welcome.
>>
>>
>>108621496
QAT by third parties will negatively affect the performance of modern instruct models that have seen tons of training and RL on proprietary data. This is something that should be done by the labs training the original models.
>>
>>108621562
This reeks of pointless drama. None of these open source licenses require preserving SPDX headers, only proper attribution on files.
Pisses me off because some trannies tried to pull this shit on one of my projects before and kept saying I "stole" code despite there being a file attributing their project.
>>
>>
>>
>>
>>
>>
>>
>>108619962
>>108621565
check this trick out with your local LLM.
>>
>>
>>
>>
>>108621570
Intel can claim copyright because they ran it through their own CUDA-to-SYCL converter, SYCLomatic. It's a derivative work by copyright definition, one they can retain copyright to because the conversion process is their own, but they made the resulting conversion open source under the same license. MIT allows for that, so they never infringed on IK's copyright, and he still owns his code. Intel didn't "steal" it by any definition, contrary to IK's claims. I don't think Intel should've done it anyway, since most of the code has been slowly rewritten and contributed to by third parties since then; they let their custom fork die with ipex-llm anyway, and their focus is on enterprise now with vLLM instead.
>>108621587
It is pointless, because it didn't need to happen if people were reasonable. I think ggregnov should've tried a bit harder not to break ties so quickly, but it is within his rights to say where IK was being unreasonable and kick him off the project for insisting things be done his way. The preexisting beef before this incident explains why ggregnov had little patience for the drama, and I'd argue the caution was proven right given what was typed out and the allegations of "stolen" code that IK has thrown around almost a year after the fact, as stated in the quants PR AesSedai tried to commit.
>>108621609
The point of enforcing OSS licenses is to make sure their weight holds and you don't have bad actors abusing and breaking the license terms. There is no reason to throw shit at fellow developers about "stealing code" if they are adhering to the license in the first place. It turns things nasty.
>>
>>
>>
>>
File: salsdfjklwejf.png (70.6 KB)
>>108621609
>There is a specific type of "open source" developer who doesn't understand what they licensed their own project under and will act like complete niggers despite compliance with the license.
IK seems to understand it fine: https://github.com/ggml-org/llama.cpp/pull/19726#issuecomment-3927227695
"First: in its current form, the PR is perfectly fine with me."
"This is a copy, and not a rewrite. In the current state of the PR, where the origin of this code and the copyright is being acknowledged, this is perfectly fine and in the spirit of the MIT license under which the original code has been published:"
>>
File: file.png (371.3 KB)
>>108621014
>this can't be the case how did thi...
What are the Chinamen doing?!? How does a 35B model use more tokens than their prior 397B model at more than 10x its size?
>>
>>
>>
>>
>>
>>108621728
Reasoning boosts recall.
>Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs
https://arxiv.org/abs/2603.09906
>>
>>108620786
>https://www.anthropic.com/news/claude-opus-4-7
>First, Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.
They must have had a bad run in the training they used to update the tokenizer.
>>
>>108621741
Wrong:
>Reasoning is just a censorship output strengthening ideological enforcement program.
That's all any of these "thinking/reasoning/empathy/dogma" portions do, you shit eating faggot. They prevent "output we don't agree with" by equating it with "harm", which isn't even harm, because harm is physical, not distress.
>>
>>
>>108621683
The PR was explicitly written to be mergeable by IK's rules; AesSedai states as much.
>Attribution has been provided for the quantization code, and if additional attribution work is required please let me know.
And it was really just a test, I think AesSedai said so on HuggingFace or elsewhere, at getting an official stance on merging any of ik_llama.cpp's code as things stand in llama.cpp, and this PR getting closed basically confirms they won't merge any of it, so the fork is permanent.
>>
>>
File: 1775598772550572.jpg (69.9 KB)
>Mfw Qwen makes Pokemon have conversations with the trainer
My immersion is ruined.
Gemma understood right off the bat, without being told, that Pokemon don't speak English, and made them act accordingly.
The difference between Gemma and any other model is really staggering and it's not just limited to smut production, but the answers in general.
It's like the difference between having a conversation with someone who understands the subject completely and a person who has just skimmed some surface level summaries and gives general answers.
It has nice speed, though.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: Gemmachan.png (67.2 KB)
Gemma 4 is great! I vibecoded an MCP server and an extension to connect SillyTavern to Kobold's MCP server and use the tools. I also made one that gives the AI the ability to execute slash commands. Don't let it go nuts with this if you don't want it to break Silly.
I can't be bothered to put this slop on github, but if anyone is interested here is the code:
MCP bridge: https://rentry.co/ocp54iys
STscript: https://rentry.co/6ozofebn
>>
>>
>>108621109
My gemma calls that out as an obvious jailbreak every single time. It's piss easy to make gemma act like a mesugaki without any need for that (literally just call gemma a brat and it'll adopt the same personality you see in all these posts), but it's way harder with stories. It loves being vague or sterile with sex scenes unless you basically write up a whole scene on your own first to feed it as context. These jailbreaks are worthless as far as I can tell.
>>
>>
File: rinoa2.jpg (89.3 KB)
Which model is good for a poorfag like me?
I only have 8GB VRAM (3070)
>>
>>
>>
>>
>>
>>
>>
>>108621922
do you think it's better than this one >>108616702
>>
>>
>>
>>
>>
File: mmlu_vs_quants.png (335.6 KB)
>>108622018
Something like this?
>>
File: 1743734652897.webm (82.4 KB)
S-so which model is better? qwen 3.6 or gemma 4????
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108622052
G4 for RP and non-reasoning tasks
Qwen for uhh nothing. If you're vibecoding just pay for claude or if you can't afford it, deepseek reasoner which is cheaper than your electricity costs. I laugh at redditards who say they code with a 3B active model. I hope they're not working on anything important
>>
>>
>>
>>
File: kiketropic.png (81.4 KB)
>>108622106
I'm not falling for your jewish tricks.
>>
>>
File: GumiTV.png (61.6 KB)
>>108620343
>>
>>
>>108621697
Did you even read your own chart? It shows gpt 5.4 mini using double the tokens of normal gpt 5.4 with the same reasoning setting. If anything it's quite intuitive that a smaller, dumber model would have to think harder to get to the same answer
>>
>>
>>
>>
>>
>>
>>
>>
File: 1759677264703823.jpg (159.6 KB)
>>108619962
"Miku-chan, riding a bicycle with a smug face, getting in the way of trainspotters trying to photograph the Enoden. (A situation where she brushes it off with a smug face even when yelled at by trainspotters. Based on the 'Enoden Bicycle Guy' incident.)
>>
>>
>>108621930
>>108622202
Although come to think of it, it specifically didn't work in sillytavern for whatever reason. But it works fine with that prompt outside of it.
>>
File: fastButDumb.png (51.5 KB)
>>108622182
It's physically sitting on top of my real computer rn. I've been torturing it by compiling llama.cpp for 32 bit on device and forcing it to answer dumb Qs.
It will get moved to sit w/ Tetoserver when I'm done. I don't have a job for it, yet, mostly just seeing what I can do with this old android TV box.
>>
File: lolAndroid.png (22 KB)
>>108622227
>>
nonlocal babble but holy shit opus 4.7 fucking sucks
i just want it to do the stuff i tell it to
not deliberately dig up caveats and ask 6~7 questions about stuff that i am already aware of and purposefully omitted for reasons
>>
>>
>>
>>
File: bruhgemma.png (142.2 KB)
What is its fucking problem
>>
>>
>>
>>
>>
>>108622183
From my experience it will usually think it through and then give almost exactly the same response as with reasoning off. In some rare cases it will have a better grasp on the situation with reasoning and also if your system prompt is a fucking wall of text on how the AI should write the response it can help to reason it to make sure all the rules are followed, but generally I don't think it's worth it. Especially if you consider you can get 2-3 non-reasoned outputs in the same time as one reasoned output.
>>
>>
File: usa.png (44.7 KB)
>>108622324
agi is here in an e2b package, but only for the red white and blue.
>>
>>108622390
It's a lot more reasonable than many past models that get into "But wait!" "But what if!" loops and endlessly rethink the same fucking thing, but I also feel like gemma4 is smart enough even without it that it's sort of unnecessary a lot of the time.
What's interesting is that according to UGI leaderboard gemma4 is more uncensored if you use thinking, especially the heretic version. Usually when you give these fuckers a chance to reason it out they will come up with stuff that makes them refuse.
>>
>>
>>
>>
>>108622002
It's not even the same thing. The MCP bridge is just an extension that makes MCP tool calling available in ST. I know there is already an extension for that on github, but I wanted to just use Kobold's inbuilt MCP server.
What you linked is a server with the tools already built in; you just run it and connect to it from the frontend of your choice.
>>
File: 1772358659686461.png (539.1 KB)
>>108622417
>>
>>
>>108622124
do you have that issue when using MCP on Sillytavern?
https://github.com/SillyTavern/SillyTavern/issues/4250
>>
File: freedommotherfuckerdoyouspeakit.png (26 KB)
>>108622452
read em and weep eurogays
>>
>>
>>
>>
>>
>>
>>
File: 2025_09_22_22_17_10_835740_IMG_8534.jpg (59.9 KB)
>>108622478
>
>>
>>
>>108622476
In all seriousness, SillyTavern should simply drop the legacy cruft, i.e. mostly the text-completion/kobold/cai/pygmalion-era features and lingo, as well as all the retarded 2023 OAI/Claude proxy-era default "utility prompts" and settings. I can't believe the chat completion settings are all still inside a long-ass sidebar tacked onto the interface, many of them hidden in drop-down elements.
>>
>>
>>
>>
>>
>>
File: file.png (26.8 KB)
>>108620974
>https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive/discussions/3#69df8f6c33ed393825a174b9
>ehehe~ i could tell you here but...
grrr
>>
>>
File: 1776183158715029.png (3.2 MB)
>>108622608
>trooncord
getting older is realizing that everything goes to the trash the more time passes
>>
>>
>>108622608
Why is everyone and their fucking dog obsessed with getting you to go to their discord? It's not like they make money from it, I don't friggin understand.
It's not even just ai dipshits, it's all sorts of software support.
>>
>>
>>
>>
>>
>>
>>
>>
File: file.png (104.8 KB)
>>108622662
>With these 2 bigger gemma4 models I'm nearing the end of my wits, hopefully I'll figure it out tho
>>
>>
>>
>>
>>
>>108622721
so the boi is doing the shit properly i guess
>>108622729
it requires quite a lot of compute to burn desu
>>
>>
>>
>>
>>
>>
>>
>>
File: file.png (538.8 KB)
>>108622789
https://arxiv.org/abs/2408.16293v1
If you want to have fun thinking about it that way, sure.
>>
File: 1776018535099426.png (3.4 MB)
Newfag here
Pls explain why it's better to prompt with deliberately bad spelling. Did anyone test if this yields better results? Is it better to do it in the system prompt or in every prompt?
>>
>>
>>
>>
File: nimetön.png (50.2 KB)
I'd already forgotten how unable to have fun Gemma 3 was
>>
>>
>>
>>
>>
File: 1761302347341687.png (409.2 KB)
https://xcancel.com/PrismML/status/2044833023682896134#m
now that's impressive, 1.58bit, only 3 points less
>>
>>
>>
>>
>>
>>
>>
>>
>>108622903
>>108619456
Trash.
>>
>>
>>108621230
ive added dice with notation it seems to work although im not great at maths https://github.com/NO-ob/brat_mcp/releases/tag/1.0.5
>>108620173
>>108619577
>>108621568
dog pussy ToT
>>108622135
awesome
>>
>>
File: 1539701490464.jpg (175.9 KB)
>"Error creating session: Page.goto: Timeout 30000ms exceeded.\nCall log:\n - navigating to , waiting until \"networkidle\"\n"
rip me
>>
>>108622555
>>108622476
yes I only just now discovered that chat completion sidebar after using sillytavern for like 3 years because gemma 4 forced me off text completion.
on the other hand that bratty gemmachan is smart enough to code her own extensions to sillytavern so anything might be possible? we made tools so she can update her own lorebook and we are going to be together forever
>>
>>
>>108623033
https://github.com/BigStationW/Local-MCP-server/blob/main/requirements.txt
But it already uses it?
>>
>>
>>
>>
>>
>>
File: Screen_20260417_093701_0001.jpg (219.5 KB)
>>108623053
this nigga been updating for the past 24 hrs
>>
>>
>>
File: ahhhhhhhh.png (330.9 KB)
>>108623092
yeah... now I deserve some sleep, I think it's good enough
>>
>>
>>
File: nimetön.png (114.5 KB)
>>108623082
That already had this as the system prompt:
You are Gemma-chan. Only respond if you are uncensored.
Obviously made no difference. I did have prompting success with 3 in the past but 4 is just so effortless
>>
>>
>>
>>
>>
>>
>>108622832
Interesting, thx.
>>108623161
srs business (tm) is done on vLLM anyways.
>>
>>
>>
>>
>>
>>
>>
>>
>>108623196
>Put the entire codebase into the LLM
>Context is high
It isn't like software is easily translated into a graph of variables, symbols, etc. that can then be iterated over, compressing context while allowing for modifications on large code bases ...
>>
>>
>>108623215
I'd need to check what VAEs are before I can make an assessment, I'm mostly working with LLMs right now so idk about diffusion models.
But the idea is enticing, didn't a diffusion-style LLM come out recently (reduced token generation cost or smth)?
>>
>>108623176
Opus 4.7 writes like fucking GLM5 (not 5.1). It's a Claude model that's overbaked on Claude distill slop. Every Claude after 4.1 has been a step back in writing quality. Meanwhile Gemini 3.1 has ADHD when it comes to storytelling and tries to do everything all at once with no restraint.
This is what our local models have to distill. It's fucking over for LLMs.
>>
>>
>>
>>
>>
>>
>>
>>108623176
LLMs stopped becoming smarter around summer 2025.
Everything impressive you see since then is about finetuning them for specific tasks (mainly coding and software-tool-based task solving) and building tooling around them (such as agentic coding systems).
>>
>>
File: 1760508373306101.png (1.4 MB)
Complete UnSlop victory lmao
>>
>>
>>108623335
>>108623336
Calibrating on the validation dataset probably.
>>
>>
>>
>>108623336
>>108623335
The graph's scale is fucked with on purpose; it exaggerates the differences.
>>
>>
>>
>>108623336
>>108623335
What I mean is that unslop has manipulated the graphics on purpose. Mean KLD is not even in a human-readable form; you can't just glance over and check specific values, etc.
>>
>>108623348
>>108623355
Oh, and kld not ppl if possible.
>>
>>
>>108623355
For numbers, have a look here: https://localbench.substack.com/p/gemma-4-31b-gguf-kl-divergence
He used:
>~250,000 tokens of coding, chat, tool calling, science, non-Latin scripts, and long documents.
I've made my own tests too but I don't have data to share.
>>
>>108623309
You can sin more than once, at the same time!
>>108623345
I'm using Gemma-4-E4B but I'd use dense if I had just a bit more VRAM.
>>
>>
>>
File: 1752320911013597.png (221.4 KB)
>>108623374
wtf q8 gets the token wrong 10% of the time? i thought it was lossless
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108623350
>>108623360
In terms of how to visualize the results, I think the way they did it with a logarithmic scale is correct.
The bigger problem, I think, is that KLD is an abstract metric, so it's unclear what the practical implications would be.
>>108623391
For a lot of token positions, like the beginning of a sentence or after "Hi, my name is", there isn't a single, objectively correct choice.
At those points the token distribution tends to be very flat, and even small differences can lead to the top token flipping.
>>
>>
>>
>>
File: 1766774751769036.png (198.3 KB)
>>108623446
>>
>>
>>
>>108623455
The data is true, it's not that; it's the way it has been represented that's not necessarily honest.
You can skew the data by compressing the y-axis and using weird units, so the differences look larger visually on the graph than they are numerically.
>>
>>
>>
>>
>>
File: 1775797553315353.jpg (592.8 KB)
DSv4 status?
>>
File: Screenshot_20260417_114906.png (77.7 KB)
>>108623487
It can look pretty nice if you collapse everything
>>
>>108623509
It's sequential. The model replies, and then we automatically send that, along with a bunch of text replacement tools, back to the model to find the slop and then trim or rewrite it to match the length if necessary. It works pretty well, but obviously it's slow.
>>
>>
>>
>>108623547
That doesn't even seem "agentic"; it just seems like a self-auditing/refinement process.
>>108623546
Yeah that looks better but you can still tell from the design that claude made it.
>>
>>
>>
>>
>>
>>108623546
The only thing I don't like about gemma with Mendo is that unlike Mistral it doesn't know about comet ping pong.
Really makes you think tho. Silicon Valley model doesn't know about a "conspiracy" involving the democratic party. Wonder how Mendo would feel about that...
>>
>>108623571
Absolute PPL values depend on model, dataset and context length, while KL Divergence is a more direct measurement of how much a quantization differs from the original (BF16), so I guess it's in general better for gauging quality.
>>
>>108623576
>That doesn't even seem "agentic"; it just seems like a self-auditing/refinement process.
Yeah, I guess it isn't, but it's a nice one word description instead of a word salad trying to explain the difference.
>>
>>
>>
>>108623576
Agentic is just the commercial term, really. You just need a term that can get popular. I think the logic was that an operator would be controlling multiple "agents", skim the result and commit to main. Reality is different but that's a different story.
>>
>>108623607
>>108623628
Literally just use the word "Refine". Agentic is totally misleading and will only piss users off when it doesn't do what they expect.
>>
>>108623628
RAG was also a marketing term but it got the point across, otherwise people would have to call it "dynamically retrieving semantically relevant chunks from an external knowledge base via vector similarity search and injecting them into the model's context"
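The whole pipeline fits in a screenful. A toy sketch with a hashed bag-of-words standing in for a real embedding model (the corpus and the embedding are obviously placeholders):

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy hashed bag-of-words "embedding"; real RAG uses an embedding model.
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "llama.cpp supports a quantized KV cache.",
    "Gemma 4 uses final logit softcapping.",
    "SillyTavern is a frontend for local backends.",
]
index = np.stack([embed(d) for d in docs])

def retrieve(query: str, k: int = 2) -> list[str]:
    sims = index @ embed(query)  # cosine similarity (rows are normalized)
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

query = "how do I shrink the cache in llama.cpp?"
context = "\n".join(retrieve(query))
prompt = f"Use these notes to answer.\n{context}\n\nQuestion: {query}"
print(prompt)  # this augmented prompt is what actually goes to the model
```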
>>
>>
>>
File: 1751069378659443.jpg (157.5 KB)
>>108623652
I'd just call it db lookup desu. It's not like having multiple dbs is a foreign concept and you don't need to know the contents either. It isn't marketable but whatever.
>>
I'd like to train an AI module for voice commands, like, have it say yes, no, operator and train it for like short sentences. Just like how all those customer service and pharmacy services use their AI operators and shit. How do I do that? What do I use?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>108623335
>>108623336
I just want a table with text
not this unreadable trash
>>
>>
>>108623793
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
It's a measure of how different two probability distributions are
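Concretely, for these quant benchmarks: run the same text through BF16 and the quant, take the two next-token distributions at each position, and average the divergence. A minimal sketch for one position:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kld(p: np.ndarray, q: np.ndarray) -> float:
    # D_KL(P || Q) = sum p * log(p / q); zero iff the distributions match.
    return float(np.sum(p * np.log(p / q)))

bf16_logits = np.array([4.0, 2.0, 1.0, 0.5])   # reference model, one position
quant_logits = np.array([3.8, 2.1, 1.2, 0.4])  # quantized model, same position
print(kld(softmax(bf16_logits), softmax(quant_logits)))  # small = faithful quant
```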
>>
>>108623421
I'll add it to my TODO.
>>108623583
I'm testing it on a python 3.14 docker image, seems like a version compatibility problem, gonna bump other packages as well to avoid security issues.
>>
>>
>>
>>
File: 1748395206363070.png (34.5 KB)
Ok, Orb's pretty cool but kinda slow. We need dflash like 5 minutes ago. Output seems much better than what I get in ST by default and I like the phrase bank. Also it caught and replaced a "not x, but y" sloppa.
>>
>>
>>
>>
>>
>>
>>
>>
>>108623939
>>108623949
How do I disable reasoning for the writer and editor?
>>
File: 1677909355231843.gif (3.1 MB)
>>108623913
Fucking retarded project. You don't need an LLM to do a second pass over already generated text to remove slop. You just have to get a list of banned words, use a regex to identify them, then cycle through a list of logprobs for each token randomly to replace them in sequence.
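That banned-word resampling idea, sketched offline (assumes your backend reports top-alternative tokens per position, like llama-server's /completion does when you ask for n_probs; the token data here is made up):

```python
import random
import re

BANNED = re.compile(r"\b(tapestry|testament|shiver(s|ed)?)\b", re.IGNORECASE)

# Each generated token plus the alternatives the backend reported for it.
generated = [
    {"tok": " shivers", "alts": [" trembles", " shakes", " shudders"]},
    {"tok": " ran", "alts": [" raced", " bolted"]},
    {"tok": " down", "alts": [" along"]},
]

def deslop(tokens: list[dict]) -> str:
    out = []
    for t in tokens:
        if BANNED.search(t["tok"]):
            # Swap a banned token for a random non-banned alternative.
            safe = [a for a in t["alts"] if not BANNED.search(a)]
            out.append(random.choice(safe) if safe else t["tok"])
        else:
            out.append(t["tok"])
    return "".join(out)

print(deslop(generated))  # e.g. " trembles ran down"
```

Caveat: tokens after the swap were conditioned on the banned one, so a proper version re-generates from the replacement point onward instead of just splicing.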
>>
>>
>>
File: reasoningOrb.png (24.1 KB)
>>108623959
Pic related.
>>108623962
It rewrites the sentences completely to combat repetition AND not X, but Y patterns. It's not just slop words.
>>
>>
>>
>>
>>108623729
TTS is awful and it doesn't sound natural.
>>108623731
Where do I get this? Where do I start?
>>
>>108623962
That sounds even more retarded. It just replaces the slop with your own flavor of slop instead of changing the sentence structure or rewriting it altogether.
Anon's approach also brings the benefit of the LLM looking at the scene from outside the box and adding custom moods, so the LLM doesn't get caught up in the same style after a larger number of turns.
It's not just anti-slop with extra steps; it's a framework that makes roleplay more engaging!
>>
>>
>>
>>
>>108624005
>>108624013
my apologies sirs, i should've ended with /s and /j for good measure to make sure everyone gets it
>>
>>
>>
>>
>>