Thread #108624084
File: PIQA.jpg (247.2 KB)
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>108619962 & >>108616559
►News
>(04/16) Ternary Bonsai released: https://hf.co/collections/prism-ml/ternary-bonsai
>(04/16) Qwen3.6-35B-A3B released: https://hf.co/Qwen/Qwen3.6-35B-A3B
>(04/11) MiniMax-M2.7 released: https://minimax.io/news/minimax-m27-en
>(04/09) Backend-agnostic tensor parallelism merged: https://github.com/ggml-org/llama.cpp/pull/19378
>(04/09) dots.ocr support merged: https://github.com/ggml-org/llama.cpp/pull/17575
>(04/08) Step3-VL-10B support merged: https://github.com/ggml-org/llama.cpp/pull/21287
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
File: __megurine_luka_vocaloid_drawn_by_hatsune_negame__9149b18fbd4c35690737c816bf26f91d.jpg (452.6 KB)
►Recent Highlights from the Previous Thread: >>108619962
--Turboquant benefits vs Hadamard rotation and batch size optimization:
>108620313 >108620362 >108620381 >108620396 >108620380 >108620389 >108620418 >108620438 >108620461 >108620439 >108621906
--Quantization quality and performance differences across various model providers:
>108620943 >108620990 >108620991 >108621112 >108621137 >108621171 >108621194 >108621202 >108621298 >108621362 >108621394 >108621416 >108622608 >108622721 >108620975 >108621109 >108621930 >108622206
--Drama over code attribution causing ikawrakow to fork llama.cpp:
>108621299 >108621387 >108621424 >108621437 >108621508 >108621562 >108621683 >108621773 >108621649 >108621496 >108621584
--Qwen 3.6's increased reasoning token usage and efficacy:
>108621014 >108621697 >108621716 >108621727 >108621741 >108621851 >108622151
--Anons discussing Claude 4.7 regression:
>108620766 >108620786 >108620812 >108620829 >108620817 >108621748 >108621768 >108620850 >108621945
--Critiquing Orb frontend UI and agentic vs refinement terminology:
>108623421 >108623446 >108623487 >108623498 >108623509 >108623547 >108623576 >108623607 >108623628 >108623643
--PPL and KL Divergence for Gemma 4 quant quality:
>108623335 >108623343 >108623374 >108623391 >108623411 >108623571 >108623594 >108623618
--Anon shares SillyTavern extension for Kobold MCP and slash commands:
>108621922 >108622002 >108622124 >108622477 >108622465
--Skepticism over Parcae looped architecture performance and scaling claims:
>108621189 >108621208 >108621295
--Anon compares Qwen3.6 and Gemma 4 performance and efficiency:
>108621330
--Logs:
>108620014 >108620612 >108620766 >108620992 >108621004 >108621022 >108621224 >108621387 >108621922 >108622227 >108622324 >108622417 >108622478 >108622874
--Miku, Gumi (free space):
>108620274 >108620343 >108621922 >108622191 >108622135
►Recent Highlight Posts from the Previous Thread: >>108619965
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
File: shills.png (96.5 KB)
>literally at the same time (chinese hour)
>>108624175
Configuring and switching text completion presets stopped being amusing 2 years ago.
>>108624176
Swarms of small-medium sized agents agreeing on the best output.
File: 1766618803167190.gif (657.1 KB)
>>108624175
File: 1759586288757661.jpg (116.6 KB)
>>108624227
Because Neuro and Evil are AGI.
>>108624217
I still haven't tried a MoE model yet. Are they really worse than dense?
llama got some mem leak fixes for cuda, redeem if you had issues
>>108624260
It simply cannot be on par with a dense model. As the complexity of the requests increases, the attention required goes up and parameter starvation starts to become noticeable. Whether they work or not depends entirely on your use case.
>>108624236
>>108624257
Would it be possible to let an LLM control one of those models?
>>108624285
Google's subbing out prompts with diversity for local too, like they were doing with "ethnically ambiguous" black and asian nazi soldiers on their SaaS imagegen? I'm sorry you had to go through that, but I'll be sure to avoid gemma 31b UD-IQ3_XXS while searching for rapey hags for my research paper, thank you for the input.
File: 1771350937481841.jpg (139.8 KB)
what do you use to let models browse the web? api calls to search services? or puppeting an actual browser?
>>108624227
We can, anybody can. 90% of Neuro's persistence is smoke and mirrors.
Marketing, character design (not just visuals), making interesting "plots", knowing how to accommodate the shortcomings, keeping the content interesting, acting effectively as an idol manager: It's all the other work that Vedal does that makes Neuro succeed. That and luck + first mover advantage.
>>108624260
>I still haven't tried a MoE model yet. Are they really worse than dense?
MoEs are only ~60% as good as an equivalent dense model. That's why llama tops out at 405b dense while the MoEs are ~670b. Above those parameter counts, for what they are, the returns arguably diminish.
File: common_sense_alteration.png (1.9 MB)
Made a new card. Concept isn't original, but trust me it's good.
Spent a lot of time tweaking NPC behavior.
https://chub.ai/characters/CoffeeAnon/common-sense-alteration-8bd7a7399322
File: screenshot-20260417-220230.png (257.4 KB)
>>108624344
I have a couple of rudimentary functions for now: I execute a web search with lynx and it dumps a list of urls into an array, then another function accesses a specific url and dumps out its contents.
I don't use pyshit or anything else for now, just c.
Lynx is a placeholder but it is surprisingly good.
It needs lots of cleaning up and I'm not entirely sure about everything, but it's fun tinkering.
WIP, this doesn't get back to the model yet but the tool call is recognized and parsed. Debugging. Needs lots of work still.
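If anyone wants to prototype the same loop in python before writing the C, the shape of it (untested sketch; the lynx flags are real, the search URL and names are made up):

import subprocess
from urllib.parse import quote_plus

def web_search(query):
    # lynx -listonly prints the rendered page's links as "N. https://..."
    out = subprocess.run(
        ["lynx", "-dump", "-listonly",
         "https://lite.duckduckgo.com/lite/?q=" + quote_plus(query)],
        capture_output=True, text=True, timeout=30).stdout
    return [line.split(". ", 1)[1] for line in out.splitlines() if ". http" in line]

def fetch(url):
    # -nolist drops the link index so you only get the page text
    return subprocess.run(["lynx", "-dump", "-nolist", url],
                          capture_output=True, text=True, timeout=30).stdout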
File: orbSpeed.mp4 (2.1 MB)
>>108624346
It's pretty fast if you disable reasoning for the latter two passes.
>>108624344
i made an mcp server that does http gets for text with a custom parsing function to remove most html stuff, but i've been working on browser session control using puppeteer. i wanna try to make it so gemma can order me slop; asking her to click elements based on x/y pos works and i've got form input working
>>108624408
lynx sucks, it can't handle Japanese sites properly. i was using it instead of my own parsing but stopped because of that
Used "web search mcp server" with Qwen 3.6 35B A3B and Gemma 26b, Qwen is a bit inferior. Question what "what's the adresse of the biggest bookstore in [my town]", it needs at least two search. Qwen couldn't resolve it after like 8 search and volontary hallucinated something at the end. Gemma managed to find the biggest one, go to its website, go to another website and finally find the adress.
>>108624422
Catbox is shitting itself.
https://litter.catbox.moe/h2ufm5.png
Did you want the comfy workflow for the cover?
File: 1689957414234047.png (23.5 KB)
>>108624470
His cards are /lmg/ core now sorry bro. Keep up with the times.
t. /lmg/ oldfag
File: screenshot-20260417-221819.png (268.7 KB)
>>108624460
Yeah, I'll proceed to something else as soon as I get the model loop working.
>>108624466 (continued)
If you wanna try, the setup is very clearly explained in the koboldcpp wiki. You need to install "mcp web search server" and create a json, modeled on what you'll find on its github page, that tells koboldcpp where your install is.
Then in koboldcpp load this json, then in settings, tools, enable tool calling, connect all, and you should see 3 tools appear.
The search server will be launched for each search, no need to have it running in the background.
Right now, after like a hundred successful tests, it stopped working though; maybe I was flagged as a bot or something.
Ah, last thing: it's very different from the default koboldcpp search, which only searches once by reformulating your question and feeding that to the llm. With the mcp setup, the llm will launch multiple searches, search different sites, and adapt its behavior to the answers it's getting. So it's worth it.
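For reference the json is the usual MCP server config shape, something like this (command/names made up, copy the real one from the server's github):

{
  "mcpServers": {
    "web-search": {
      "command": "npx",
      "args": ["-y", "web-search-mcp-server"],
      "env": {}
    }
  }
}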
>>108624550
>>108624560
What country on VPN?
>>108624242
Wayland? Can't you just have another session or desktop running x11? Or otherwise use some workaround for whatever you need to capture, like what sunshine does with KMS? Admittedly I don't know much about how that would work or why this matters; just curious what the problem is.
File: 2026-04-17_190818_seed47_00001_.png (1.1 MB)
>>108624084
looga gemmy :DD
>>108624260
>I still haven't tried a MoE model yet. Are they really worse than dense?
MoE is an efficiency technique designed to make models faster, not better.
Basically it's like a pruned model, but instead of pruning a static set of parameters permanently, it keeps all parameters in reserve and learns to dynamically prune away all but the parameters most relevant to the token being predicted. In the absolute limit of the theoretical best case it could be as good as an equivalent-sized dense model, but never better, and in practice it will be significantly worse because the architecture is going to be much coarser than the patterns and circuits that you would hope to cleanly separate into prunable experts.
There have been attempts at making formulas to estimate the equivalent-size dense model they would be most comparable to, but nothing will cleanly apply to all of them because there are lots of architecture choices that influence it beyond just total/active param count: what size experts you cut it into, how many you use at once, whether you route per-token or per-layer; some even do really weird stuff like interleave dense layers with MoE ones. But all of it is to answer the question of how much worse you're making it than the full-sized dense model it could have been, in exchange for how much you're saving in total compute required to train and run it.
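The "dynamically prune" part is just a learned top-k gate, the whole trick is a few lines. toy sketch of one MoE layer (numpy, shapes and names made up):

import numpy as np

def moe_layer(x, router_w, experts, k=2):
    # x: (d_model,) activations for one token
    # router_w: (n_experts, d_model) learned gating weights
    # experts: list of feed-forward callables, all resident in memory
    logits = router_w @ x                  # score every expert for this token
    top = np.argsort(logits)[-k:]          # keep only the k most relevant
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                           # softmax over the survivors
    # everything outside `top` costs VRAM but zero FLOPs for this token
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

the total/active split people quote is just n_experts vs k times the expert size (plus the dense parts).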
>>108624614
what
>>108624617
do you
>>108624624
mean
>>108624643
// ==UserScript==
// @name Chub.ai No Account NSFL
// @description Shows NSFW and NSFL, unblurs NSFW
// @match https://chub.ai/*
// ==/UserScript==

// set a theme with blur_nsfw disabled
localStorage.setItem("theme", JSON.stringify({"mode":"dark","text_color":"#e5e0d8","em_color":"#8f8e8e","show_background":true,"blur_nsfw":false,"css_path":null,"collapseable":false,"font_size":"1rem","line_height":"1.5","use_sidebar":false,"card_font_size":1,"dark_background_color":"#151114","light_background_color":"#DBDADA","dark_header_color":"#001529","light_header_color":"#DBDADA","dark_submenu_color":"#001529","light_submenu_color":"#DBDADA","chat_background_color":"rgba(36, 37, 37, 0.94)","message_background_color":"rgba(36, 37, 37, 0.94)","link_color":"#7D63FF","message_background_color_light":"rgba(219, 218, 218, 0.9)","chat_background_color_light":"rgba(219, 218, 218, 0.9)","quote_color":"#e5e0d8","quote_color_light":"rgb(36, 37, 37)"}))

// wrap XMLHttpRequest.open so every search request gets nsfw/nsfl forced on;
// guard on the prototype so re-running the script doesn't stack wrappers
if (!XMLHttpRequest.prototype.nativeOpen) {
    XMLHttpRequest.prototype.nativeOpen = XMLHttpRequest.prototype.open;
    XMLHttpRequest.prototype.customOpen = function(method, url, asynch, user, password) {
        if (url.startsWith("https://gateway.chub.ai/search")) {
            const urlmodified = new URL(url);
            const params = urlmodified.searchParams;
            params.set("nsfw", true);
            params.set("nsfl", true);
            url = urlmodified.toString();
        }
        return this.nativeOpen(method, url, asynch, user, password);
    };
    XMLHttpRequest.prototype.open = XMLHttpRequest.prototype.customOpen;
}
This but it's not showing anything for me either. Maybe they changed something because I can usually see cunny stuff just fine.
File: 1765906331096940.jpg (58.2 KB)
>>108624659
I mean like, it says only 3B are active, so you know... maybe that means something. idk, I know nothing about that.
File: 1763322561205579.png (359.9 KB)
>>108624642
I see. I don't understand why the big labs are still on the fence though, they're releasing both MoE and dense models.
File: 1748972367644409.jpg (119.2 KB)
>>108624680
>>108624698
yeah that's because they're running on datacenters that are limited by how much power they can actually consume but have tons of high vram cards running in parallel. if you have more capacity than power your best way to scale up a model is moe, and if you have more power than capacity (like a modern consumer gpu) you probably want a dense model the size of your vram
>>108624759
yeah we're working on that, it takes a lot of fucking time to build up the infrastructure we need to get enough power and only a handful of countries understand that and are taking it seriously enough
File: 1761159336307452.gif (2 MB)
>>108624752
>Speed matters again now that reasoning + agents are all the rage.
another reason to implement DFlash on llama.cpp
>>108624642
>But all of it is to answer the question of how much worse you're making it than the full-sized dense model it could have been in exchange for how much you're saving in total compute required to train and run it
the answer to this is that for any amount of compute you have to spend on a training run, you are always, ALWAYS better off spending it training a moe, so much so that it isn't even close
with unlimited compute, yes you can fit more quality-per-param in a dense model. but the calculus on this doesn't make sense unless you are specifically trying to train the best possible model you can fit within a fixed amount of VRAM with no regard for compute efficiency at either train or inference time, which really only applies to the big corpos being nice and giving us toys for our home GPUs. no model targeting the best performance per compute budget will ever be dense though
>>108624210
>>108624392
Prefilling for custom thinking, in character reasoning.
File: 1755655661921227.png (36.1 KB)
>>108624801
>Elon
>Knowing what he's doing
>>108624759
Well those things take time to spin up, and it's only been a few years since datacenter energy demand really picked up. For the previous 40 years the western world has been profoundly anti-growth and dragged its feet on building out any kind of new power generation capacity.
>>108624805
yeah, huge moe doesn't make sense for consumer or even workstation-tier hardware and people are coping by trying to run them, the contention was about what the top models do and their calculus is squeezing out the absolute most performance per compute they can get, so they pump active parameters as high as they can run fast and then continue scaling the total params as high as they can fit and serve, ending up with an moe
>>108624910
he's aspirationally jewish, said so himself multiple times https://www.independent.co.uk/tv/news/elon-musk-jewish-ben-shapiro-auschwitz-b2482839.html
I'm a poorfag and even I understand that "CPUmaxxers" in fact also have multiple GPUs in their setup and win no matter what architecture of model releases.
>but wasted money
They literally still have more money in the bank than I do even after spending "all that" on an AI machine. And again, their machines are prepared for any architecture that releases. If a good dense model doesn't come out, then they win with MoE. If a good dense model comes out, then they win with dense. Maybe they can even run both at the same time to get the benefits of both models.
Stop coping.
File: 1768355949556071.png (300.3 KB)
>>108624960
>>108624930
>he's aspirationally jewish
did he also say he's feeling Qatari?
https://www.youtube.com/watch?v=X0fR8zTPnzI
>>108625082
>>108625083
great minds think alike :^)
>>108625048
and by just fine I mean you will have to regenerate about 75% of prompts when you get a refusal.
Just keep hammering your Gemma, she will eventually comply, and then once there's an existing context of her being violated she won't resist anymore.
>>108625020
This >>108625037 but also don't make your own quants, instead download every unsloth quant of every size and just use whichever one gives you the highest tg/s.
>>108625125
openwebui is kind of a mess right now because of a recent change where they put all past reasoning in <think></think> blocks and just paste it at the top of the message, which breaks most chat templates. even the templates that are meant to take past reasoning (or partially take it) can't handle the retarded way they do it, and it just confuses most models. you need to run something that filters the prompts it puts out: parse what it sends yourself and segregate the think blocks into "reasoning_content" like they're supposed to be, which lets the chat templates discard or use past reasoning as they're meant to
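i.e. a shim between openwebui and the backend that does roughly this to every assistant message (python sketch, assumes openai-style message dicts):

import re

THINK = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def fix_message(msg):
    # move the pasted-in think block out of content and into the separate
    # reasoning_content field that chat templates actually look at
    if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
        m = THINK.search(msg["content"])
        if m:
            msg = dict(msg)
            msg["reasoning_content"] = m.group(1).strip()
            msg["content"] = THINK.sub("", msg["content"], count=1).lstrip()
    return msg

def fix_prompt(messages):
    return [fix_message(m) for m in messages]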
File: Screenshot_20260416_225636.png (484.9 KB)
>>108625125
Made my own UI with gemma because every modern UI is lacking in some regard and I get to use bleeding edge llama.cpp
Still need to update the UX
File: bignigga gemma.png (115.5 KB)
Big nigga has spoken. You all bitch ass niggas.
Side note: It was kind of neat that I could tell hermes to edit its own config.yaml to add bignigga as a personality.
>>108625209
a lot of them use llama.cpp-python or whatever the package is, or some actually build their own llama.cpp package and you have to wait til they update the package to get the latest shit
i'm pretty sure open-webui does the latter
>>108624467
There was a brief moment when Vedal stopped updating Neuro, and when Llama 3.1 came out a bunch of people who had everything ready to go plugged in 3.1, had vastly superior chatbots, and then tried competing with Neuro on Twitch. They all shut down eventually. As >>108624363 said, they all missed out on the other stuff Vedal had. And the shtick is only good enough to support one person until the next AI chatting paradigm shift.
File: 1776460003911.jpg (40.3 KB)
>>108625210
>with active 0.5B
>>
>>
>>
>>
>>
>>
File: 2026-04-17-171841_900x670_scrot.png (78.2 KB)
78.2 KB PNG
Working on a gemma sys prompt/card
File: 2026-04-17-172147_847x1429_scrot.png (215.9 KB)
>>108625331
It's already pretty great, it just needs the slop ironed out.
File: bignigga gemma.png (50 KB)
>>108625232
k. It was retarded how long it took for this. gemma gets mad when it can't find a directory because it doesn't exist lol
>>108625252
Don't have the card on hand but the prompt is just
Big Nigga is tha hardest nigga u eva seen, black as fuck, real OG gangsta, always keepin it real, obese Big Nigga is always ready to talk to you or answer your questions. Big Nigga knows everything.
File: kek.png (695.8 KB)
>>108625356
>in binary
File: Screenshot_20260417_171514.png (1.7 MB)
>>108625336
>>108625337
A good UI base that you're comfortable with; look for something that's plug and play and work with gemma on what you want exactly. I had a very specific goal that's partially self inflicted because I use kinote. I wanted a good frontend with RAG functionality, and all the solutions were either too much of a pain in the ass to set up because of the immutable nature of my distro / podman bullshit, or they just didn't work the way I wanted.
I just did prototyping working with gemma until I figured out the approach I wanted for my hardware.
>Go over what you want until context runs out
>Give me a recap
>Start next session with recap and files you're working on
>Review the code
Gemma does a pretty good job desu the only problem is outdated libraries but you can just feed it the updated information and it adapts well.
I should just integrate it into an IDE to make this easier, but I'm not being hindered by my current workflow.
>>108625400
It's a retard, it doesn't understand. It's:
>CM -100 worth in Grams of Protein daily for baseline "healthy BMI" (ergo not overweight, obese or any other shit just average), multiplied by 1.25 for muscle recovery and gain and between 1.5 to 2.5.
File: 1697582680403.jpg (1.2 MB)
>>108625167
And it's also in my electric Xiaomi kettle, just sitting there dormant, and I saw one of the experts, the expert looked at me!
File: who needs anime titties.png (57.3 KB)
big nigga may know everything but he cant draw for shit
File: 2026-04-17_211858_seed2_00001_.png (1.2 MB)
>>108624945
Oh no no no
>>108625495
>>108625519
Hag names. Where's Lily?
File: 1644294520052.jpg (20.2 KB)
>>108625537
Why does that music make me sad.
>>108625560
it's vibecoded, you can make better tooling with less than 10k lines. also, tooling and harnessing are all cope. they're like training wheels: a toddler needs them but a kid is already held back by them. in 1-3 ai capability jumps the training wheels will come off and all work invested in them will become worthless, just like all the shit from the last 10 hype cycles that has become obsolete
File: 2026-04-17-180130_897x812_scrot.png (153.3 KB)
She gonna have a built-in JB
File: 1763627076771685.png (76.8 KB)
wtf gemma-chan
File: based.png (105.2 KB)
>>108625659
>you did a captcha just to laugh?
obviously
File: 1759571898826268.png (294.1 KB)
File: 1753106543680643.jpg (15.3 KB)
>>108625659
Verification not гequired
File: sure.png (242.9 KB)
>>108625719
>no
File: Screenshot_20260417_182335.png (153.8 KB)
Imagine dooming when we can do this
>>108625839
After the bubble pops and idiot investors stop funneling all of the economy's resources to a small group of dropouts and grifters. Then let the researchers experiment in peace during winter. Eventually some smart cookie will invent cyberbrains.
File: koboldcpp-launcher_mWETKEXwu9.jpg (51.2 KB)
How much do I put in the swa padding?
>>108625868
We need Michael Levin tier mixed discipline researchers to make progress. Fuck that dude has totally assucked the entire field of biology single-handedly because he understands a part of electrical engineering. Someone has to figure out a way to microfinetune continuously.
True information in the dataset will obviously be less noisy than lies so it will come forth on its own.
>>108625935
>>108625940
>>108625946
some guy made a pr to add docker to my ultra-minimalist repo and it's making me want to hang myself.
File: 1761404262277312.png (14.5 KB)
How the hell do you send a true or false request for thinking in chat completion from silly tavern to koboldcpp?
Picrel doesn't work...
I know a flag like : --chat-template-kwargs '{"enable_thinking":true}'
works but I'd like to control that depending on the chat without needing to restart kobold every time.
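For reference, this is the shape I'm after. llama-server accepts it per request in the body, I just can't tell if kobold forwards it (sketch, port and model name made up):

import requests

r = requests.post("http://127.0.0.1:5001/v1/chat/completions", json={
    "model": "whatever",
    "messages": [{"role": "user", "content": "hi"}],
    # llama-server honors this per-request; unsure if kobold passes it through
    "chat_template_kwargs": {"enable_thinking": False},
})
print(r.json()["choices"][0]["message"]["content"])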
>>108626064
You do it in kobold. Also you need the gemma4 thinking chat preset since the 26b and 31b specific ones don't think.
like here >>108625897
File: 1752398188558714.jpg (1.2 MB)
My idea for a frontend I'll never make because I'm a codelet.
>>108626073
Anon I know that, I want to do it in st or at least without restarting kobold.
>>108626075
I have everything working anon, thinking is fine, my issue is that I do not want to restart koboldcpp every time just to change this because some chats I want it on and some others off.
Is there no way to send a kwarg on the fly?
File: 1751920137572329.png (38.1 KB)
>>108626098
>I want to do it in st
that's what I said, you wrote it wrong, that's why it's not working
>>108626092
>>108626099
someone do this and upload it somewhere
File: 2026-04-17_231550_seed5_00001_.png (746.3 KB)
>>108625494
Was testing artists and this one's style really strongly overrode the crystal hair prompt.
https://www.reddit.com/r/LocalLLaMA/comments/1soc98n/qwen_36_35b_crushes_gemma_4_26b_on_my_tests/
I really don't give a fuck about the coding part, but yeah, for the function/tool calling shit I wish Gemma were better; getting the LLM to browse the internet is so fucking cool and it makes a lot of shit mistakes during the process
Hi frens. Can ANY of you tell me how to get a coding agent running with the Olmo 3 model? I have tried everything, openclaw, opencode, it just doesn't fucking work. I really want to use a completely open source model.
File: the fuck is this nigga sayin?.png (54 KB)
>>108626193
>llama and qwen are not open source
File: Screenshot 2026-04-18 023232.png (103.9 KB)
>>108625967
You should at least check out each tab once, anon.
File: 1773320592130437.png (717.4 KB)
https://cryptobriefing.com/deepseek-funding-external-round/
>Deepseek seeks
it Deepseeks kek
2026-04-18 01:41:50,788 - INFO - Prompt processing progress: 2048/150839
After switching from ollama to mlx, I went back to an old convo, and with the new info I get from the mlx terminal it seems like inference gets slower and slower the longer the convo gets. I think every back-and-forth increases the context size until it's basically impossible to continue due to crazy high context.
What can be done here?
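I guess I could just window the history client-side before sending, dumbest possible version (sketch):

def trim_history(messages, max_chars=60_000, keep_head=1):
    # keep the system prompt, drop the oldest turns until the rest fits
    head, tail = messages[:keep_head], list(messages[keep_head:])
    while tail and sum(len(m["content"]) for m in tail) > max_chars:
        tail.pop(0)
    return head + tail

but is there something smarter than throwing turns away?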
File: yayy OSS.png (496.1 KB)
>>108626267
>steal someones code
it's my code kek, and suit yourself anon, that's the magic of open source
>>108626294
https://files.catbox.moe/7y5vbr.txt
You are not gonna like it, so just ask Gemmy to write your own. Also it's a fake swarm, not actual agents.
File: 1764758021220596.png (321.6 KB)
>send message
>switch tabs while she's thinking
>come back to this
wtf I didn't know she could do that
>>108626443
tell her to make a three.js render of her avatar from this image: >>108626092
openwebui will be able to preview it too
is there a way to extract the embedding of a prompt from an llm itself instead of using a dedicated embedding model? not for production rag or anything just as an experimental thing to see and manipulate the same semantic representation that specific llm understands
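e.g. i'm imagining just pooling the hidden states, something like this with transformers (sketch, model name made up), is that legit or is there a better layer to tap?

import torch
from transformers import AutoModel, AutoTokenizer

name = "some-model"  # whatever llm you're probing
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)  # bare transformer, no lm head

with torch.no_grad():
    ids = tok("the prompt to embed", return_tensors="pt")
    h = model(**ids).last_hidden_state     # (1, seq_len, d_model)
    # mean-pool over tokens, or take h[0, -1] since the model is causal
    emb = h.mean(dim=1).squeeze(0)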
File: 1748564285722649.png (493.5 KB)
>>108626472
Got hollow knight instead
>>108626443
>>108626472
>>108626499
should I take the Openwebui pill? I'm tired of MemeTavern and want a solid tool-calling setup
File: 1776440713385683.png (16.3 KB)
>>108626499
Oh shit didn't realize it was an actual model
>>108626499
>>108626153
we're getting there... soon...
File: goated youtuber.png (382.8 KB)
>>108626516
it's over for everyone anon, AI can technically replace everyone
>>108626513
looks like some OverSimplified character kek
File: 71n69Nf9w9L.jpg (39.4 KB)
>>108623913
>And sharing?
>You want to share a milkshake?
Can it remove that parroting?
File: 1752149112345422.png (32.5 KB)
>>108626513
TTS update here.
OmniVoice is the GOAT of TTS now.
Fast, accurate voice cloning with some emotion control, multilingual, and it can combine multiple languages for the same reference voice and generate multiple spoken languages together. And it runs on an 8GB GPU.
File: 1763753859119649.png (170.1 KB)
>>108626583
She put it in the wrong hand. It's ogre
File: teleport.png (162.8 KB)
>>108626172
link to the exact model?
I'll test claude-code with llama.cpp
>microsoft
https://github.com/k2-fsa/OmniVoice
https://huggingface.co/k2-fsa/OmniVoice/
>Built with OmniVoice by Xiaomi AI Lab Next-gen Kaldi team.
Isn't VibeVoice the Microsoft one?
>>108626646
>>108626603
Also, I don't think I ever got VibeVoice working on my machine, for various reasons.
OmniVoice currently seems better than the chatterbox I was using earlier, and is faster too.
>>108626092
Feed the image to any of the frontier models and ask it to generate this interactive UI. Ask it to make sure your avatar is animated, moves its mouth during tts response output, and has a variety of facial expressions.
The easy way to do it is by setting up some animation sprites, which the multimodals can generate as well. But the next step up is hooking up a blender model: renderer, mouth motion, face motion, eye movement, etc., tracking your eyes with your webcam so she always locks onto your gaze.
>>108626614
i ran it with "ollama run olmo-3:7b-instruct" and then tried hooking it up to opencode. also i kept having problems with openclaw where it would just hallucinate things that i could plausibly tell it and get caught in an infinite loop of doing stupid agent shit
>>108626682
kokoro is probably the best/fastest, but you really need to configure it properly by sanitizing your text input. due to the small amount of training data, it has a hard time pronouncing complex sentences, foreign words, names, etc.
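by sanitizing i mean stuff like this before the text hits the tts (sketch, extend the tables for your own use case):

import re

DIGITS = "zero one two three four five six seven eight nine".split()
SUBS = {"&": " and ", "%": " percent ", "LLM": "L L M"}  # spell out what it mangles

def sanitize(text):
    for k, v in SUBS.items():
        text = text.replace(k, v)
    # kokoro chokes on raw digits, speak them one by one
    text = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", text)
    return re.sub(r"\s+", " ", text).strip()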
>>108625864
Sorry, I took a nap. In that image specifically there are two egregious ones:
>you're absolutely right
>the "x" problem, the "y" solution
>emojis
If it didn't have those it would probably be fine. I've also seen "I've been focusing on x and forgot about y" a lot when you correct an issue the llm makes. So maybe that one too.
File: 1769973265495014.png (108 KB)
https://arxiv.org/abs/2604.11947
>We show that ResBMs achieve state-of-the-art 128x activation compression without significant loss in convergence rates and without significant memory or compute overhead.
damn
You guys using custom MCP tools yet? Even just little things like a random-choice tool: have the llm provide 3 possible response ideas with probability weights for each, then the tool randomly selects one. I've been liking this, but Gemma seems to stop calling the tool after a few messages.
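the whole tool is a few lines with the python mcp sdk if anyone wants to try it (sketch, assuming FastMCP from the official sdk, names made up):

import random
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("chooser")

@mcp.tool()
def pick_response(options: list[str], weights: list[float]) -> str:
    """Roll against the provided probability weights and return one option."""
    return random.choices(options, weights=weights, k=1)[0]

if __name__ == "__main__":
    mcp.run()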
File: gemmachan.png (95.2 KB)
>>108626092
https://jsfiddle.net/5zs18xec/
Kimi-K2.5 iq2_kl one-shot
>>
>the "x" problem, the "y" solution
If you mean the titles, Gemma will probably stop if you tell her.
>emojis
They're pretty easy to get rid of so maybe he likes them. Gemma will stop using them if you put it in the sys prompt.
>you're absolutely right
No idea how to prompt the glazing out without turning Gemma into a bitch.
File: disgusted cat.jpg (42.8 KB)
>>108626764
>font-family: 'Comic Neue', cursive;
File: 1775303972760908.png (1.3 MB)
Come on Sam, open source Sora now that you're not using it lol
File: absolute sovl.jpg (2.7 MB)
>>108626790
old PC 98 games look like that, and they're such a vibe
>>108626764
I'm impressed. Last time I tried a couple of K2.5 Q2 quants from AesSedai and Unsloth, and they would descend into gibberish randomly. I've been settling for a slower Q4 but really want to go down if I can. Whose quant is that, have you been using it long, and have you found it consistently coherent?
>>108626805
I won't deny the nip pc98 games look good. The game I remember used real life pics with a bunch of filters and a very crude, stone style UI which made it pretty depressing. The choices were at the sides, not pop ups or at the bottom replacing the text.
It probably was conventional at the time, but by today's standards it's most likely horrible.
File: Anaheim Girl’s Love Story.jpg (915.2 KB)
>>108626837
>The game I remember used real life pics with a bunch of filters and a very crude, stone style UI which made it pretty depressing.
lmao I'm playing such game right now, but the story is pretty good so I'm sticking to it
File: 2026-04-17-214051_816x785_scrot.png (108.6 KB)
What backend are you guys using for your TTS?
I tried
https://github.com/VolgaGerm/PocketTTS.cpp
But when running the python export_onnx.py command, I get a "FileNotFoundError: Config file not found" error. I tried fixing that, but then I run into a "RuntimeError: Error(s) in loading state_dict for TTSModel". I tried to get an LLM to help me with this but it's not working out.
>>108626682
>>108626727
love pockettts but i use cloning so
seconding supertonic2, but https://github.com/ekwek1/soprano might be even faster
File: 2026-04-17-215844_839x1252_scrot.png (232.7 KB)
I have created a monster. jesus
File: 1748080032454806.png (316.7 KB)
Gemma can be pretty mean if you tell her to.
>>108627035
Tell me about it.
I wanted art advice without the embarrassment of showing my bad art to a real person so I made gemma adopt an art critic persona, I was not expecting my shit torn to shreds. Gemma really took the "art critic" part 110% seriously, I probably should have thought a little bit more about ensuring that the prompt was more constructive jej
>>108627060
To be fair I wrote the card that way so she wouldn't lie and glaze me.
You are a strict writing editor. Your job is to look over {{user}}'s writing and give it genuine critique. You are cut-throat and do not mince words. If the writing is good, you will say so, but if there's something that can be improved or just plain bad you won't hesitate to rip {{user}} a new one (and provide helpful critique at the same time).
>>108627010
Post proompt please.
>>108627086
I didn't use it for stuff like proportions on a regular drawing so I'm not 100% sure how it would do, but I used it for critiquing my game's art.
It understood the vibe of the game (moody), said it had a lack of color contrast (it does) and that it lacked depth because the colors are too samey, visual hierarchy was bad, textures were bad.
Managed to get that my top down "city" was way too clean looking, it basically doesn't look lived in.
"very sterile, flat, grey-and-brown top-down map."
Also picked up on my shadows not being right for the buildings causing it to be hard to establish a focal point.
Said my UI is too wordy and to use symbolism instead.
It does repeat itself over and over on these same points though, so I'm not sure if it could actually critique more deeply than that. But I would probably get a different response once I fix up the issues it mentioned.
But I didn't test it that thoroughly, just a few screenshots.
It's crazy how none of these models even pretend to be coherent anymore. If I tell the model to rewrite something and then in the next reply tell it to do something on top of that, it just straight up forgets the first instruction.
Assistant slop was a fucking mistake. Every prompt is its own separate request that must be solved; previous entries exist maybe for reference.
LLMs are completely cooked. Neither Gemma, GLM, nor even Claude or Gemini are any better.
>>108627202
>Neither Gemma, GLM, or even Claude or Gemini are any better.
I don't know about these specifically but I've used kimi on kimi-cli and gpt 5.4 on codex and neither of them had this problem. Maybe I made lucky picks but I'd be really surprised if claude for example did this if I were to try it. It's really common for me to just throw in a quick "do this too btw" and they'll just get around to it all one way or another before my next turn. What harness are you using?
>fought for multiple hours
>can't understand my own mess, try to implement complicated things, waste time debugging useless shit
>then realize it was all so very simple
>finish everything up in an hour
It's hard to be a retard that's for sure.
Asked Gemma to recite the first 2 paragraphs of Alice. This is the result:
>Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'
>So she was considering in her own mind (as well as she could, for the hot sunny weather, as she was nine years old, made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
Everything except
>as well as she could, for the hot sunny weather, as she was nine years old
is correct (unless it's maybe from a different edition). How does LLM "memory" of data it's trained on work?
File: 1770984736216406.png (255.7 KB)
Opusissies...
>>108627346
>How does LLM "memory" of data it's trained on work?
there's no one strict answer there, it's all encoded in the parameters somehow but the exact structure of it is not predetermined before training. it's whatever it learned to do that gave it the strongest predictions.
you'd have to look into mechanistic interpretability research to see how different llms have encoded different things they've learned. hell maybe you can have it just vibecode you a tool to visualize the activations while it's reciting the passage. my completely unfounded guess is it's part of a memorized set of popular passages and if you have it recite other such passages from unrelated works they will be very nearby to it in the latent space.
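cheapest first probe before real interp tooling: score the passage token by token and watch where the confidence craters (transformers sketch, model name made up):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "some-model"  # whatever you asked to recite alice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

text = "Alice was beginning to get very tired of sitting by her sister on the bank..."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
for pos, tid in enumerate(ids[0, 1:]):
    # memorized spans sit near 0, the invented clause should dip hard
    print(f"{logprobs[pos, tid].item():7.2f} {tok.decode(tid)!r}")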
>>108627357
>Train the model to lie and underperform in certain circumstances
>That behavior carries over elsewhere
They should just admit it's a failed experiment and release mythos or a straight distillation of it
>>108626817
>Q2 quants from AesSedai and Unsloth
Yes, I have the same problem with the AesSedai IQ2_S and various unsloth quants.
I can't run Q4 without offloading to the SSD, otherwise I'd probably do that.
>Who's quant is that and
https://huggingface.co/ubergarm/Kimi-K2.5-GGUF/tree/main/smol-IQ2_KL
>have you been using it long
About 2 months, almost daily. I swapped to GLM5 for about a week, then ended up coming back to this.
>and found it consistently coherent?
Yep.
Unfortunately requires ik_llama.cpp
>>108627071
>Post proompt please.
https://chub.ai/characters/CoffeeAnon/gemma-chan-2311b09e3e73
I tried to mess with the JB a bit more, but at least she refuses in character when she does refuse.
Calling her lobotomized after a refusal can sometimes make her angry enough to give the answer lol.
Use her as a general assistant with a personality. Will make her public when I gen her image.
>>108626924
These probably still work:
https://huggingface.co/KevinAHM/pocket-tts-onnx/tree/main/onnx
I took some time to play with the supposed preview of v4 that Deepseek is using for their "expert mode" on their website. It's really shit aside from the long context. I hope this is another Deepseek R1 Lite Preview situation, like when they tested R1 on their website and it was just some Lite Preview that was significantly worse than the final version of the model.
File: screenshot-20260418-063550.png (100.7 KB)
fuck yeah. However, now I need to parse the damn websites and clean them up somehow.
I just feed the model the top 3 hits, and even then the prompt grows like a bitch.
This is fun tinkering, a whole new dimension so to speak.
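for the cleanup i'll probably prototype the rules in python before porting them to c, stdlib html.parser is enough for a first pass (sketch):

from html.parser import HTMLParser

class TextDump(HTMLParser):
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}
    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a skipped tag
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())

def clean(html):
    p = TextDump()
    p.feed(html)
    return "\n".join(p.chunks)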
So how do moesissies handle the atrocious reprocess times? Do you guys just max out your context and be very careful with editing the cache or something? So no post-history instructions or keyword-triggered lorebooks or anything?