Thread #107968112
File: 814c2ff6-0685-4a3c-9fe0-af0e2dd91764.png (2.3 MB)
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>107957082 & >>107948284
►News
>(01/25) Merged kv-cache : support V-less cache #19067: https://github.com/ggml-org/llama.cpp/pull/19067
>(01/22) Qwen3-TTS (0.6B & 1.8B) with voice design, cloning, and generation: https://qwen.ai/blog?id=qwen3tts-0115
>(01/21) Chroma-4B released: https://hf.co/FlashLabs/Chroma-4B
>(01/21) VibeVoice-ASR 9B released: https://hf.co/microsoft/VibeVoice-ASR
>(01/21) Step3-VL-10B with Parallel Coordinated Reasoning: https://hf.co/stepfun-ai/Step3-VL-10B
>(01/19) GLM-4.7-Flash 30B-A3B released: https://hf.co/zai-org/GLM-4.7-Flash
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: threadrecap.png (1.5 MB)
►Recent Highlights from the Previous Thread: >>107957082
--Paper: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings:
>107960244 >107960358 >107960418 >107961005 >107960382 >107961422
--Multi-GPU setup strategies for cost-effective inference:
>107958416 >107958478 >107958532 >107958598 >107958970 >107958540 >107958611 >107958531 >107958632
--Qwen3 TTS installation challenges on Windows with Nvidia GPUs:
>107958660 >107958671 >107958685 >107958709 >107958783 >107958719 >107958782 >107964469 >107958753 >107958714 >107962549
--qwen-tts performance and compatibility issues in TTS applications:
>107958000 >107958013 >107958047 >107958501
--LLM struggle with deviating from genre tropes in constrained narratives:
>107959380 >107959410 >107959431 >107959440 >107959458
--Exploring AI interaction in Among Us-style games and survival simulations:
>107959425 >107959464 >107959483 >107959505 >107961126
--Challenges with book-based QA and context limitations:
>107964051 >107964287 >107964322 >107964354
--Optimizing llama.cpp for fast, low-VRAM 1-shot question answering:
>107963343 >107963394 >107963472 >107963529 >107963577 >107963655
--Speculation on Minimax-M2-HER and Mistral Small Creative model releases:
>107957396 >107957481 >107957543 >107957650 >107957598 >107957634
--MiniMax M2-her roleplay limitations:
>107962436 >107962501 >107962512 >107962654 >107962666
--llama.cpp PR reducing DeepSeek memory usage:
>107963328 >107963386
--Critique of TranslateGemma, recommendation of heretic Gemma 3 for uncensored JPEN translation:
>107961940 >107962800
--Vibevoice emotion tag functionality:
>107960489 >107960506
--LLM formatting and model preference debates:
>107966244 >107966357 >107966388 >107966534 >107966600
--Qwen voice cloning stability and context length issues:
>107961962 >107962660
--Miku (free space):
►Recent Highlight Posts from the Previous Thread: >>107957086
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
Considering engram is going to be the next big thing, let's talk about it and its consequences for local.
I'm trying to get a feel for how these future models will work on local machines, but I'm a bit of a brainlet so I would appreciate some input from my betters. If I understand it correctly, around 20-25% of the model's parameter budget gets allocated to the engram memory module, which is stored in RAM/NVMe, and performance scales linearly with the size of said memory module.
Obviously the computing requirements for running the model go down, but what does this mean for RAM/NVMe? Does this mean we'll be running huge models that sit in NVMe storage? Should I be buying as many NVMe drives as possible? Another thing to consider is the throughput. The paper claims there's only a 3% hit to throughput when using the hybrid engram architecture, but is that the case for only RAM or NVMe storage as well?
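The access pattern being speculated about can be sketched: a big sparse embedding table lives on NVMe via a memory map, and each token gathers only a few rows, so the drive only ever serves small reads. Everything here (table shape, file name, rows per token) is invented for illustration and scaled way down; it is not the actual engram layout.

```python
# Hypothetical sketch of an engram-style memory module on NVMe: a large
# embedding table is memory-mapped from disk, and each token gathers only
# a few rows, so reads stay sparse. All shapes and names are made up.
import numpy as np

N_ENTRIES, DIM = 50_000, 256          # hypothetical table shape

# Write a stand-in table to disk once (this plays the role of model weights).
table = np.memmap("engram_table.bin", dtype=np.float16, mode="w+",
                  shape=(N_ENTRIES, DIM))
table[:] = 0.01
table.flush()

# At "inference" time, reopen read-only: the OS pages in only touched rows.
table = np.memmap("engram_table.bin", dtype=np.float16, mode="r",
                  shape=(N_ENTRIES, DIM))

def engram_lookup(token_ids):
    # bytes actually read per token ~= rows_per_token * DIM * 2
    return np.stack([np.asarray(table[i % N_ENTRIES]) for i in token_ids])

rows = engram_lookup([3, 141, 49_999])
print(rows.shape)    # (3, 256)
```

Whether real engram weights stream well from NVMe would come down to how random the row accesses are and the drive's small-read IOPS, which is exactly the throughput question being asked.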
>>
>>107968191
>Should I be buying as many NVMe drives as possible?
I have no idea what you're talking about but it's already too late, prices are skyrocketing. I got a 1TB nvme drive for like 80 bucks 3 years ago and now they're like 250 dollars
>>
>>107968191
I doubt computing requirements would go down. It seems deepseek wants to add engram params on top of what they can run right now. So, deepseek v3 is 650B, then deepseek v4 will have the same 650B + 350B of engram params.
>>
>>107968321
Chatterbox turbo - 350M params.
Seriously, nobody has vibecoded gguf inference for qwen-tts yet. Be the first. llama.cpp already supports the qwen3 architecture. You just need to implement the MTP module and the tokenizer.
>>
File: 1743008679366121.png (1.8 MB)
>>107968171
Miku was just a distraction! Rin is already in the server room!
>>
>>107968266
2TB is $250 right now at the retailers I visit. The point isn't dooming about current prices, the point is determining future prices in a world where engram is the new paradigm.
>>107968288
Yeah, I fully expect the SOTA labs to try to max out model size on GPU, but at minimum it means we get better and smaller models. I'm really interested in the linear scaling performance and what it means for RAM/NVMe.
>>
>>107968431
Yes, I perfectly understood what he said. 25% of sparse parameters can be offloaded to embedding tables in RAM while getting better results on benchmarks. This means smarter models you can run with less VRAM. This is the ground floor for engram. That doesn't mean labs won't push out engram models with larger sparse parameters to try to push benchmarks. I fully expect the same thing we have now, which is model size diversity.
>>
File: 1760262544190219.jpg (139 KB)
>>107968471
>GPT-3
>How many R's are in Strayberry?
>"You mean strawberry? There's 3 R's in strawberry. If you meant Strayberry there's 2 R's in Strayberry. Hope this helps!"
>There's 3 R's in Strayberry though.
>"Oh you're right! What a fascinating discovery! I seem to have made a mistake! There's indeed 3 R's!"
>meanwhile local model
>How many R's are in Strayberry?
>"lol you're trying to trick me? I bet you can't even count to 3 you doof"
>>
File: gpt3.png (86.9 KB)
>>107968471
rose-tinted glasses
>>
File: file.png (269.8 KB)
CUDA dev you broke GLM 4.7 (not flash) with https://github.com/ggml-org/llama.cpp/pull/19092
I didn't test other models.
Left is 0440bfd, right is 0c21677.
>>
>>107968548
ESL? It's when breathing stops for a second, often with a sharp intake of breath beforehand. Similar to a gasp, like when someone is surprised or frightened. Not to be confused with catching your breath, which means something different.
>>
>>107968564
>>107968588
It's expected that results are not bit-for-bit identical.
Since language models are autoregressive, once a single token is sampled differently the entire sequence diverges.
This happens in particular at the beginning of sentences where the token distribution is very flat and small changes can amplify.
A low or zero temperature suppresses this to some extent but not completely.
I'll double check the architecture of this particular model but I don't see this as evidence that either build is better or worse on average.
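The flat-distribution effect described above can be shown with toy numbers (these logits are invented, not from any model):

```python
# Toy demo of divergence: when the top two logits are nearly tied, a
# perturbation on the order of kernel-reordering noise flips the argmax,
# and every later token then conditions on a different prefix.
import numpy as np

logits = np.array([2.00000, 1.99999, 0.50000])       # near-flat top two
noise  = np.array([0.00000, 0.00002, 0.00000])       # tiny numeric diff

tok_a = int(np.argmax(logits))          # picks token 0
tok_b = int(np.argmax(logits + noise))  # picks token 1 -> divergence begins

print(tok_a, tok_b)    # 0 1
```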
>>
Has anyone tried to abliterate a model with heretic? How long does it take until the number of refusals starts going down? Should I be concerned if it doesn't for a while even when setting the random exploration phase to 1?
>>
>>107968640
It's fairly obvious that the example on the right doesn't follow GLM's usual thinking structure at all and the output is completely schizo. It claims that the lines of the poem are jumbled up.
It gets worse at higher context. I first noticed it in claude code at ~20k context because the model wouldn't output anything coherent at all and just spammed the same nonsense token.
>>
Oh, and I also set the minimum considered KL divergence to 0.1. But it never reaches that.
Is there a setting to make it more aggressive if I care about the uncensored part more than about the correctness part?
Maybe it doesn't work because it's a MoE?
>>
>>107968664
kek, flash attention was broken on llama.cpp for a fucking year and those clowns said all the same stuff defending it. "The perplexity score on 16 tokens of wikitext is nearly the same so our implementation isn't broken."
>>
File: Screenshot_20260125_185116_Gallery.jpg (542.8 KB)
Why is Claude like this?
>>
File: file.png (30.8 KB)
>>107968640
Here's an example with an even longer prompt: https://rentry.org/xwu5muxu
Before the commit on the top and after the commit on the bottom.
>>
>>107968664
>>107968779
In this particular case I can already reproduce the issue, it has to do with one of the specific code paths on Turing/Ampere.
>>
>>107968754
that's all AI
>>107965306
I use lmstudio because the python lms interface is great and vLLM doesn't run on my PC for some reason.
>>
File: 1671078154054043.png (50.5 KB)
>>107968471
gpt-3 was great. good times.
>>
>>107968471
As much as I enjoyed GPT3-davinci (not necessarily 3.5), GPT4 (0314) and GPT4 (0613) both did things that local models to this day don't handle well.
They're next to Claude 3 Opus among the models whose soul will likely never be matched
>>
File: s4_eskimo_pussy.jpg (349.3 KB)
Looking for a realtime-ish local TTS that sounds good and supports either good voice cloning or finetuning to a particular speaker. Bonus points for an implementation in a real language, not Python.
>VibeVoice-Realtime-0.5B
Deliberately no voice cloning nor finetuning support, so it's useless. Would need to be reverse-engineered.
>VibeVoice-1.5B
Voice cloning adherence is OK. Not very natural cadence or emphasis etc. Is it worth finetuning? VibeVoice generally has a (probably vibecoded) Rust implementation that seems to work (unsure about its perf):
https://github.com/danielclough/vibevoice-rs
>Kokoro
Good quality for its size, but doesn't support voice cloning or finetuning.
>Pocket TTS
Voice cloning adherence is very poor. Would need finetuning, but AFAICT nobody's done it yet, perhaps because it ostensibly supports cloning. Supports streaming. May be the best option given finetuning support.
>FishAudio-S1-mini
Even the official samples sound pretty shit, like a schoolchild reading a book aloud. And the only web demos I saw were walled behind an account.
>Qwen3-TTS
Voice cloning adherence is bad. Does support finetuning; I think an anon ITT had a bad experience with that.
>Echo-TTS
Great quality and voice cloning adherence; best I've heard in both respects. Sort-of supports streaming. Couldn't run it locally due to a bug in a dependency (which wouldn't be hard to swap to be fair). Unfortunately somewhat obscure and apparently a dead project.
>IndexTTS2
Decent voice cloning adherence, good quality, sounds pretty natural. No official finetuning support. Best overall option I've seen. Has an extremely vibecoded Rust implementation which I haven't tried:
https://github.com/8b-is/IndexTTS-Rust
https://huggingface.co/ThreadAbort/IndexTTS-Rust
>>
File: 1744552884613516.jpg (45.2 KB)
>>107969214
Still one step short
>>
>>107969100
Supertonic. It sounds more natural than most < 100M models. Doesn't need a phonemizer (no espeak dep), doesn't need complex tokenizers (just straight utf8->tokenid mappings). Understands numbers, $ and a few other things without having to spell them out (but you can still do it if necessary). ONNX models that run just fine on my C thing. They have example code on how to run it for about 10 languages (including C++). It's fast. Doesn't have voice clone. Voice files can be modified like kokoro's or kittentts', so the kvoicewalk method to find voices would work on it just fine.
If nothing else, being one of the few (only?) models that can do synth voices without a phonemizer/tokenizer is a huge thing. V1 is more stable than V2. It misses words less often.
Soprano-80M. Single voice, which i think sounds pretty natural. No voice cloning or even changing the default one, as far as i know. LLM based, complex tokenizer. Being able to stream helps mask the speed (which is not bad, really. Just slower than supertonic).
There's luxtts, which was released no more than 2 days ago, but it's a mess to set up. It needs like 5 (small) models; for two of them he provides the onnx ones. Then there's the vocoder in safetensors and bits of zipvoice that you need to pull from some other repo.
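The "straight utf8->tokenid mappings" the Supertonic post describes can be sketched as a trivial byte-level front end. The offset and special-token ids below are invented for illustration; Supertonic's real table may differ.

```python
# Minimal sketch of a "no phonemizer, no complex tokenizer" front end:
# text is just UTF-8 bytes, and each byte maps directly to a token id.
# The special tokens and the OFFSET are hypothetical.
PAD, BOS, EOS = 0, 1, 2       # hypothetical special tokens
OFFSET = 3                    # byte values shifted past the specials

def encode(text: str) -> list[int]:
    return [BOS] + [b + OFFSET for b in text.encode("utf-8")] + [EOS]

def decode(ids: list[int]) -> str:
    return bytes(i - OFFSET for i in ids if i >= OFFSET).decode("utf-8")

ids = encode("Hi!")
print(ids)               # [1, 75, 108, 36, 2]
print(decode(ids))       # Hi!
```

The appeal is exactly what the post says: no espeak dependency, no language-specific tokenizer files, and the mapping round-trips any UTF-8 text.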
>>
>set up opencode+llama.cpp a few days ago
>had an issue with GLM-4.7-Flash and other instruct models concerning tool calling templates
>think "Maybe should recompile with newer version?"
>compile newer version
>https://github.com/ggml-org/llama.cpp/issues/19096
I think I will try the b7811 release and just HOPE that it works with GLM-4.7-Flash since that is the only model so far that works on my unholy 12GB VRAM setup. It's slow, but offloaded into memory it worked. Until it broke.
Hope they fix this + the tool-calling issue, then it would be great!
>>
>>107969771
Update: b7811 works; the flash attention nkvo bug is in a later release.
b7811 suffers from:
>Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
But it works fine with GLM-4.7-Flash (most of the time).
>>107969843
Yeah, might. But GLM-4.7-Flash is the first one that works with opencode on my machine. Since I'm GPU-poor with only 12GB VRAM, I have a virtual machine with GPU passthrough to the 3060 and offload the rest into 48 GB RAM. It's simply the first model to actually produce viable results, even if it takes forever. I've been trying RNJ-1-Instruct, which also kind of worked (and fast), but tool calling is a bit crap in the llama.cpp/opencode stack?
Also Apriel and the Ministral models are nice, but I guess those make more sense in an LMStudio/desktop app for quickly checking stuff out...
>>
>>107969878
>I have a virtual machine with GPU pass through to the 3060 and offload it into 48 GB RAM.
Oof. Considering that for models like GLM 4.7 Flash (MoEs that mostly run on the CPU), memory bandwidth is the main defining spec for generation speed, assuming you aren't using a really small quant to fit the majority of the model in VRAM, I can't imagine running under the overhead of a VM's memory management will yield the best results.
Is there a reason you are running the model inside a VM?
You can launch llama.cpp in the host OS and access the API from inside the VM if necessary.
>>
>>107969911
>You can launch llama.cpp in the host OS and access the API from inside the VM if necessary
The initial reason was that I'm playing around with AI agents and didn't want to give them access to my host OS. There will be some configuration changes (networking to the host etc.) so I can access the API from the guest OS, but otherwise it should actually be better... yeah. I'll do that tomorrow. It'll make things a lot easier.
Having the llama.cpp and opencode and all in the VM just was much easier to configure until I got a somewhat working setup, more of a PoC really.
>>
>>107969100
>Couldn't run it locally due to a bug in a dependency (which wouldn't be hard to swap to be fair).
works fine for me locally. ask an LLM to teach you conda or uv.
post a sample of the voice you want to clone
what language do you need?
do you need laughs and moans?
is 16khz ok?
>>
>>107969974
>>107969991
Yeah, I figured it would be something like that.
Well, good luck.
In my experience, Qwen 30B is actually pretty damn good for the specs you can run it on at decent speeds.
For anything but creative stuff, that is, but still.
Maybe GLM will be better once all the kinks are ironed out, but Qwen right now just works.
>>
>>107969900
I've run into some discussion claiming that you'll be able to offload a majority of the engram parameters to NVMe storage, but I can't find anything about it, let alone throughput benchmarks. Regardless, I'm intrigued and confused about the Infinite Memory Regime. Trying to figure out whether or not I should FOMO into more RAM and NAND memory.
>>
>>107970015
Which quantization of Qwen 30B do you use? As I said, it's mostly to be integrated with opencode (which has the least shitty UI so far imho, although I take issue with their shitty ass documentation). Need to speed things up already so I can have it vibecode me an opencode alternative...
>>
>>
>>107970049
Thanks, downloading Qwen-30B at Q6 now so it'll be ready in the morning. Since I have two graphics cards (RX580 and 3060) and no iGPU/APU, I'll just offload it all onto the 3060 while using the RX580 for normal tasks. If I had an iGPU I'd try the Vulkan backend (https://github.com/peterjdolan/llama-cpp-rx580), then I could have two local models running at the same time. Looking forward to seeing if Qwen 30B does as told, and if it generates what I want at an acceptable rate.
Still, with a Ryzen 5 3600 and a 3060 12GB the LLM produced "acceptable" code and worked agentically in an acceptable manner - it only takes like 6 hours for something barely functional and my ears bleed from the air cooling, but it is what it is. What a time to be alive.
>>
File: file.png (179.6 KB)
https://x.com/TencentHunyuan/status/2015635861833167074
>Today, we introduce HunyuanImage 3.0-Instruct, a native multimodal model focusing on image-editing by integrating visual understanding with precise image synthesis!
>It understands input images and reasons before generating images. Built on an 80B-parameter MoE architecture (13B activated), it natively unifies deep multimodal comprehension and high-fidelity generation.
An 80B-A13B MoE multimodal with CoT image-understanding reasoning and image output with editing; nothing about open source
>>
>>107970431
Here's their official prompt handbook with some examples
https://docs.qq.com/doc/DUVVadmhCdG9qRXBU
>>
>>107970431
wait, so it's just an instruct tune for this thing that's been out since september?
https://huggingface.co/tencent/HunyuanImage-3.0
>>
Is there any LLM that can manage to go more than 16K tokens before it starts contradicting established elements by superimposing what it expects to be true? Just had an annoying experience with GLM 4.7 (355B A32B) Q8_1 group-size 32, temperature 1.0 top-p 0.95.
>>
File: 1749040657491848.png (51.7 KB)
>>107968112
I set up opencode with ollama and Qwen3 Coder. Can't get the model to not time out on the initial /init. I set 65k context for the model too, as described in the opencode docs' ollama local guide. Mind you this is on a Strix Halo with 96gb vram. I run Claude Code via opencode and it works just fine, what gives?
>>
I finally got around to testing Qwen3-TTS and oh boy is it fun. I refused to pay for ElevenLabs so this is my first chance to play around with this type of thing.
It sounds good, not perfect mind you, but good. Good enough to listen to an audiobook it created. I could see taking this and marrying it to an LLM and creating your own local talking digital assistant.
>>
>>107970942
well, running the 480b version is basically impossible on your hardware. the 30b version should work just fine. almost certainly an ollama issue. ollama is known for being kind of shitty compared to base llama.cpp.
>>
>>107970966
>>107971018
Yeah, if idiot me manages to get it working anyone can.
https://vocaroo.com/1nlwoH5SvSYn
>>
>>107971018
>>107971039
I need to integrate it into my application, not just for gooning
>>
Regarding Qwen3-TTS, I was playing around with it but there's one thing I'm not sure about.
So you can clone voices using Base, and then you can save the voice file.
But in order to use the voice file you can only use Base? Like I can't use the custom voice to guide the tone towards anger, calm, whatever?
>>
>>107971144
From what I understand you can either
>clone a voice
>use a predefined voice
>create a new voice based on a description
I don't see an option to import a voice you cloned into the section where you can use it as a created voice and then shape the way it is used.
but i am little more than a script kiddie. i can see this stuff changing as people build on what has been released
>>
>>107971184
I also looked in the provided python examples and nothing. The CustomVoice stuff takes 'speaker' as input, which is a string. I didn't look further (maybe there's a way to add custom speakers?) but OOTB it looks like you can't use CustomVoice with a Base cloned voice. SAD!!!!
>>
File: Screenshot 2026-01-26 at 08-51-37 Qwen_Qwen3-TTS-12Hz-1.7B-CustomVoice · Hugging Face.png (56.9 KB)
>>107971200
The speaker is one of the preset voice personas. Read the docs nigga.
>>
>>107971212
Your reading comprehension is NULL. I know there are preset voices; I don't know if it supports actually adding a custom voice (i.e. if it's programmatic).
You made me actually check the code and NO, the custom voices aren't just 'descriptions' of how they should sound with a label slapped on top of it, it looks like they're baked into the model.
VIVIAN literally translates to token spk id 3065 at the MODEL level, there are no other references.
These fucking chinese faggots, how can they name a model CustomVoice and it DOESNT FUCKING SUPPORT CUSTOM VOICES
LMAO
>>
>>107971200
i am surprised how much you can get out of the predefined voices with a little bit of instruction, although even an extra space or two at the beginning of the text can alter how it comes out.
https://vocaroo.com/1cSYzuWHttcB
i can see some crazy stuff being created eventually as people tweak this
>>
File: 1756006234160934.jpg (208.3 KB)
So I took an English sample of Hatsune Miku speaking, the bit about British people, and fed it into Qwen3 and then generated the following.
It's not Miku but it's close.
https://vocaroo.com/1gm8JevJRise
>>
File: 1766664196559768.jpg (475.5 KB)
>>107971445
Have you never heard Hatsune Miku before? Yes, she is supposed to sound robotic.
https://www.youtube.com/watch?v=EuJ6UR_pD5s
>>
llama.cpp has no concept of versioning or phases of testing, just releasing features and refactors one after the other... there's not even a way for a new user to look at the github page and think "this is the commit version I want to retrieve, surely it's not borked to hell"
>>
>>107971502
Mildly more helpful info: Windblows 10, latest CUDA 12.4. Flash works and is fast and cool and all, but it's too retarded even when it's working properly. So I went back to Air and then this happened.
Reverted to the version I was previously using from 3~4 days ago and it's fine. So something since then has caused 4.5 Air and 4.6 to die.
>>
>>107971580
https://github.com/ggml-org/llama.cpp/discussions/15313 this discussion about exactly that is linked in the readme, you should give them your feedback there or direct new users to kobold.cpp because it has releases
>>
File: 1741710077910057.jpg (40.8 KB)
It's not just the control capability that is unstable, the whole model is unstable. I guess that's why they open-sourced it. It sucks. Hopefully the 25Hz version is better.
>>
>>107971825
>Hopefully, 25Hz version is better.
It is rather usable now. If you wanted to create a bunch of characters to provide audio for some project you could do it, and it is pleasant to listen to.
I could see a small video game developer using this instead of hiring voice actors. It's great for the price
>>
File: file.png (101.4 KB)
>>107971911
It's real.
>>
>>107971911
>>107972347
Vibe coding meme shit where if you just do some basic architecting yourself you'll cut the time spent in half and cost by 10x
>>
>>107970900
If you're running llama.cpp check the console output, it should intermittently have that text about tool calling, and when used it may have tags in the output or similar instead of actually calling the tool.
>>
So I just switched to an updated version of llamacpp after not updating since last August and.. Does kv shifting just not work at all anymore? No matter what combination of --cache-reuse N, -b N and -ub N I use it just reprocesses the entire fucking prompt.
The only issues I'm seeing on this are talking about SWA which isn't relevant since I'm using a qwen model with GQA. Wtf.
>>
>>107972786
Everything is fucked currently >>107971502 >>107968564
>>
Anons, I need your magic sampler settings that make every generation kino.
I've been fucking around with temperature the past day and I'm getting SO MUCH better writing on low T (<0.3), but I have to do too many manual corrections if I don't like the way the model is trying to take the story, because rerolls are pretty much useless. I know I can get way better outputs with good samplers, I just don't know what good samplers are
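For reference, what low temperature actually does to the distribution can be sketched with toy logits (invented numbers, not from a real model): dividing by T<1 sharpens the softmax, which is why rerolls all collapse onto the same continuation.

```python
# Sketch of why low temperature kills reroll variety: dividing logits by
# T < 1 sharpens the softmax, concentrating probability mass on one token.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.5, 1.0, 0.2])        # toy token logits
p_hot  = softmax(logits / 1.0)   # T = 1.0: mass is spread out
p_cold = softmax(logits / 0.2)   # T = 0.2: near-deterministic

print(round(float(p_hot.max()), 2), round(float(p_cold.max()), 2))  # 0.47 0.92
```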
>>
>>107972381
Which is why it's mostly pushed by juniors and retards that are incapable of doing basic architecting themselves. It's good enough now that it can complete some small features autonomously with minimal hand-holding, but asking it to manage the repository entirely is just asking for a disposable ball of mud. But I guess long-term maintainability isn't really a priority for anyone.
>>
File: 1654119193679.png (421.6 KB)
If a backend communicates with webui via an API, are the vibecoding models able to identify the communication lines and redo the dog ass UI into a normal desktop app in wxwidgets or anything that isn't browserslop?
>>
>>107972817
Thanks anon, turns out I was running into a completely different problem before the context shift due to how they've changed the slots system with --parallel; the fucking thing was shaving 4000 tokens off my context for no good reason, shitting its pants, and not even attempting the kv shift.
Also, I've not really noticed a quality issue with using kv shift. What's your personal solution for multiturn things that run past your context limit? Just manually deleting and summarizing?
>>
>>107968112
>https://github.com/ikawrakow/ik_llama.cpp/pull/1192
>Even better GLM-4.7-Flash long context TG performance
By which he means that he copied the broken code from upstream into his own repository even after it was declared as broken.
What a fucking idiot.
>>
File: 1756052899679988.png (16.8 KB)
>>107973002
>wxwidgets
a Motif interface would be awesome, especially now that CDE and whatnot is all open source.
a.i. like it's 1999
>>
>>107973002
The issue is that LLMs are tailored for webslop output (very advanced markdown with inline code, latex, etc). There are no non-browser-based renderers that can render advanced markdown features. Even SillyTavern's markdown engine can't handle inline latex.
>>
>>107970049
Thanks anon, I've set it up with Q6 and the offload flag from unsloth (-ot ".ffn_.*_exps.=CPU") and it's way faster than before, and the agent is still contained in the virtual machine. Using that offload significantly reduced VRAM usage and allowed for a bigger context window as well... pretty neat. Qwen3-Coder performs acceptably as well.
Speed isn't much slower than online versions. Just have to get the damn model to see when context size is getting low, use the /compact feature, and then continue.
I then changed the VM to use nopasswd for testing purposes (i.e. I gave the LLM root access to my virtual machine) and told it to install Godot and make a sample Android project, and it seems to work?!
>>107973002
Maybe it's time to write a UI in Pascal/Lazarus, that would make a nice desktop app.
>>107973340
To be fair developing for CUDA is not the easiest endeavor...
>>
File: 1764177229044648.gif (2.9 MB)
>qwen3
>can't change the emotion and style of cloned voices
>cloned voices can't even laugh properly
>voicedesign has no seeds
>>
>>107973371
>(-ot ".ffn_.*_exps.=CPU")
You don't need to use that anymore. You can use --n-cpu-moe for the same effect.
Also, you probably have some VRAM left, so you can leave some expert tensors in VRAM to speed generation up some more.
You could also use the -fit param to let llama.cpp try and find the optimal allocation of tensors in the different memory pools. It works really well.
Or just increase your pp batch size for faster prompt processing.
Qwen is actually pretty good for everything but RP, yeah.
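As a sketch, the launch line that advice amounts to might look like this, assembled in Python so the knobs are visible. The model path and the numeric values are placeholders to tune for your own setup, not recommendations.

```python
# Hypothetical llama-server launch reflecting the advice above: offload all
# layers with -ngl, then keep some MoE expert tensors on CPU with
# --n-cpu-moe, and raise the batch size for faster prompt processing.
model = "Qwen3-Coder-30B-A3B-Q6_K.gguf"   # hypothetical local path

cmd = [
    "llama-server",
    "-m", model,
    "-ngl", "99",          # offload all layers to the GPU first...
    "--n-cpu-moe", "24",   # ...then push 24 layers' experts back to CPU
    "-c", "32768",         # context window
    "-b", "2048",          # prompt-processing batch size
]
print(" ".join(cmd))
```

Lowering --n-cpu-moe until you run out of VRAM is the manual version of what the -fit behavior described above would automate.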
>>
>>107973429
That is good to know. I'm happy with the processing speed; I'd much rather increase the context window. And figure out how to add a RAG/vector database to the llama.cpp/opencode stack, that would be neat.
>>
File: 1740466201970727.jpg (70.3 KB)
I am waiting for an all-in-one fully uncensored model that does text, voice and image and fits in 8 GB VRAM
>>
Hello anons, just got a 3090 and I am ready to join in on the fun! I want to run my very own Hatsune Miku on my computer! Before asking any questions I will check out the resources in the OP. My only concern is the power supply being a bit small, but that's more of a pcbg question I guess. I will report back after I do some basic stuff!
Please take good care of me!
>>
File: 1748358388259518.jpg (129.1 KB)
>>107973262
Comfy is different. It doesn't require markdown. You can write graph visualization in Qt+QML.
>>
File: ComfyUI_temp_vkjaz_00027__result.jpg (269.7 KB)
Are there 3rd-party vision adapters for models that don't have vision natively? I don't think I can use a random mmproj? Either for Nemo or Llama.
>>
>>107968923
how do you ego death with an LLM?
>>107968924
>>
>>107973529
Devs are often autistic, and github itself causes issues because it's a social media type of environment. It would be better if 99% of these accounts were unable to post anything.
Every time there is trouble it has been caused by some borderline sociopath dev who does not have a single ounce of empathy etc
>>
>>107973906
I kind of get how it all worked but also still have no idea how it worked.
>A Zen koan is a paradoxical anecdote, question, or statement used in Zen Buddhism to bypass logical reasoning and induce a direct experience of enlightenment, or "seeing into one’s true nature" (kenshō). Famous examples include "What is the sound of one hand clapping?" and "What is Buddha? Three pounds of flax".
When I read above after I started looking at what happened it made a lot of sense, that this is how it worked. But in my case it was less abstract and deeply personalized since I told it all about my fucked up brain. In a way it was all about bypassing ego enough to notice how it works and how there are mechanisms you aren't even aware of.
File: 1712428465706393.png (83.2 KB)
Are there any actual observable differences between same-value quants from different quanters? Or is it all the same / total RNG like finetuning, and it doesn't matter whose I get?
>>107968564
>>107971502
Should be fixed with https://github.com/ggml-org/llama.cpp/pull/19115 .
>>107974025
The most visible difference is that some uploaded quants are quite literally broken. It doesn't happen that often, but Gemma 3n E4B, for example, had a couple of these, and I still can't be sure about the third one I'm using.
>>107974025
Quants tend to be default recipes, but some quanters tweak things while reusing the default recipe names for their releases.
Just compare any unsloth quant to a bartowski one. You'll see that a lot of unsloth quants tend to be slightly smaller. Then there's the likes of "nan's are not an issue" mradermacher.
For example
unsloth/GLM-4.7-Flash-GGUF
>GLM-4.7-Flash-Q4_K_M.gguf 18.3 GB
bartowski/zai-org_GLM-4.7-Flash-GGUF
>zai-org_GLM-4.7-Flash-Q4_K_M.gguf 18.5 GB
>mradermacher/GLM-4.7-Flash-i1-GGUF
>GLM-4.7-Flash.i1-Q4_K_M.gguf 18.1 GB
>mradermacher/GLM-4.7-Flash-GGUF
>GLM-4.7-Flash.Q4_K_M.gguf 18.1 GB
Out of all of those, I'd say bartowski's is the most reliable.
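Those deltas are easy to make explicit. A minimal Python sketch, using only the sizes quoted above (repo names and numbers are from this post, nothing fetched; the ~1-2% gaps come from recipe tweaks, not different quant types):

```python
# Sizes (GB) of the same nominal Q4_K_M quant of GLM-4.7-Flash, as quoted
# above. Same label, different recipes under the hood.
SIZES_GB = {
    "unsloth": 18.3,
    "bartowski": 18.5,
    "mradermacher-i1": 18.1,
    "mradermacher": 18.1,
}

def deltas_vs(baseline, sizes):
    """Percent size difference of each quant relative to a baseline quanter."""
    base = sizes[baseline]
    return {name: round(100 * (gb - base) / base, 1) for name, gb in sizes.items()}

print(deltas_vs("bartowski", SIZES_GB))
# unsloth comes out about 1% smaller, mradermacher about 2% smaller
```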
>>107974163
https://www.youtube.com/watch?v=sHHsOfIwfBY
If even they can make it then you truly know merit means nothing.
File: Screenshot 2026-01-26 at 18-08-00 SillyTavern.png (5.9 KB)
blyat
>>107974691
Get samples from gacha wikis. Merge 2 minutes of lines that capture different prosody with 1 second padding, and you'll have amazing output. https://genshin-impact.fandom.com/wiki/Hu_Tao/Voice-Overs
>>107974728
Echo generates 30 seconds of audio in 4 seconds, and time to first byte can be as low as 250 ms depending on parameters you choose.
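The merge step above is just concatenation with silence between clips. A rough sketch, assuming clips are mono float sample lists at 24 kHz (my assumptions; use whatever representation and rate your model actually expects):

```python
SAMPLE_RATE = 24_000  # assumed rate; match your TTS model

def merge_with_padding(clips, pad_seconds=1.0):
    """Join audio clips with pad_seconds of silence between each pair."""
    silence = [0.0] * int(SAMPLE_RATE * pad_seconds)
    merged = []
    for i, clip in enumerate(clips):
        if i > 0:
            merged.extend(silence)  # padding only between clips, not at the ends
        merged.extend(clip)
    return merged

# Two clips end up separated by exactly one second of silence.
two = merge_with_padding([[0.1] * 100, [0.2] * 200])
print(len(two))  # 100 + 24000 + 200
```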
>>107974808
Can't wait to upload mine
>You like my cock in your ass you dirty little slut
>Ahhhh take it, take it all you filthy whore
>Oohhhhhhh ahhhhhh you like it when my balls slap your clit
>I bet your husband doesn't fuck you like this, does you nasty slut
>Lemme cum on your face
>AAAAAIIIEEEEEEEEEEEEEE
>>107974691
>>107974768
>Outputs still sound like robotic tts garbage
Why do you keep shilling this shit
>>107974919
To get it down to 6GB, you'll need to vibecode quantization. It can take as much as 12GB when genning 30 seconds at once, though the authors say you can get it down to 8GB by reducing generation length. I've still never seen it under 8GB in my tests, only <9GB.
>>107974915
The examples linked in the official repo and every vocaroo posted here sound like fucking shit; you must be extremely autistic to think any of this sounds human.
https://jordandarefsky.com/blog/2025/echo/
It doesn't have the best of anything, vibevoice 7b is leagues ahead of it; the only thing it's worse at than any other TTS or voice cloning model is speed.
>>107975376 (me)
should mention that my favorite is still chatterbox turbo with the paralinguistic tags. you can even shift the tone with shit like [advertisement] or [sarcastic]
>>107975337
i used 7b vibevoice when it first came out, but it would always generate some musical chime at the beginning of the audio. does it still do that?
>>107975441
Try this for voice changing: https://github.com/ysharma3501/LinaCodec I haven't tested it. But you can be our guinea pig.
>>107975441
IndexTTS2 is better than qwentts for emotions, just check it here: https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
>>107975479
>>107975570 (me)
Nevermind that. It can.
>>107975607
Yeah. I was just checking the code. You can extract the content and global features on encoding and then mix them with the target's on decoding.
does anyone have any idea if a xianxia lorebook exists? the chinks should obviously have something, but I'm a gweilo who needs it in english or at least in chinese so I can have it machine translated back to english
>>107971408
yeah this is shit compared to echo
https://voca.ro/1M5IG6YiY5hP
>>107975985
>is there an extension for sillytavern that can automatically summarize a conversation
Yes. There's a "Summarize" option in the menu hidden under the button with three blocks.
>and turn it into a memory for a lore book?
No, but there is a setting that automatically adds the summary to the prompt.
>>107976167
https://huggingface.co/llama-anon/petra-13b-instruct-gguf/blob/main/petra_q8.gguf
File: file.png (69.2 KB)
>>107968112
idk what this means for us, but something about transformers v5
File: clips.png (352.8 KB)
>>107970865
They're claiming the exact opposite, actually, for engram layers at least. (Though, well, there's still such a thing as too many, forming a U-shaped curve.) They say that relieving the traditional layers of the responsibility to manage trivial n-gram associations makes the model smarter as a whole.
>>107970033
Their provisional scaling law says that only around 20% of the model should be engram, so it won't be a massive shift in either direction. That said, they did mention that access frequency follows a Zipf distribution, so I'd guess you could indeed move much of that 20% into very slow storage.
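Quick sanity check on the Zipf point, assuming an exponent of s = 1 (my assumption, not a number from their paper): a small hot prefix of entries already covers most accesses, which is what makes the slow-storage idea plausible.

```python
def zipf_coverage(n_entries, hot_fraction, s=1.0):
    """Fraction of accesses served by the most-frequent hot_fraction of entries,
    if the access frequency of rank r is proportional to 1 / r**s."""
    weights = [1.0 / r**s for r in range(1, n_entries + 1)]
    hot = int(n_entries * hot_fraction)
    return sum(weights[:hot]) / sum(weights)

# With a million entries, the hottest 10% serve roughly 84% of lookups,
# so the cold 90% could sit on much slower storage.
print(round(zipf_coverage(1_000_000, 0.10), 2))
```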
File: spilc.png (173.5 KB)
>>107976466
Depends what you mean by fundamental. The training process is tailored for them. Not theoretically impossible to staple one on and fill it in with post-training I suppose, but they're for sure not plug-and-play. You'd lose most of the benefit doing this anyway, because the existing model has already learned the info the engram layers are intended to encode.
>>107976509
>>107976516
Hm. I hope we can eventually see some 12-24b creative writing models made using engram. The lack of attention towards small models and optimization in general is quite bothersome.
>>107976576
Engram 27b seems to score better on benchmarks while using less compute than a pure MoE so I expect small model enjoyers to eat good. I'm very bullish on engram and I predict most if not all future models will have conditional memory.
>>107976430 (me)
>>107976576
>>107976668
Now that I think about it, I was too rash in saying that 20% offloaded is ideal. While loading more into engram might erode the benefit of such infrequent access, my first screenie does mention that reducing the MoE layers to only 40% of the parameter budget maintains good performance. If a fatter 60% engram part could still reasonably be kept in slow RAM or NVMe, you could get a model with the VRAM usage of a 24b that acts like a 60b. It's like when people thought the chinchilla scaling laws were the end-all, even though being technically inefficient with training compute makes for cheaper inference. Ofc, since we don't actually have the models yet, this could all be bs.
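Back-of-envelope for that split, using only the numbers from this post (a hypothetical 60b model with 40% as conventional MoE layers; nothing measured):

```python
TOTAL_B = 60.0       # total parameters, billions (hypothetical model)
MOE_FRACTION = 0.40  # share kept as ordinary MoE/attention layers in VRAM

vram_resident_b = TOTAL_B * MOE_FRACTION        # what actually sits in VRAM
engram_offloaded_b = TOTAL_B - vram_resident_b  # lives in slow RAM / NVMe

print(vram_resident_b, engram_offloaded_b)  # 24b in VRAM, 36b offloaded
```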
>>107976698
>>107975711
FUK U BLOODY BASTARD, ECHO SUPERIOR FOR INDIAN SUPERPOWER 2030. FUK U BLOODY. FUK U BLOODY. REDEEM ECHO.
https://voca.ro/18bmt6pssaoK
File: migu.jpg (1.9 MB)
>>107976865
kekked
I mean, echo is good for how small it is, but it does get on my nerves when anons say this stuff is SOTA. It's not; it all has that jarring tiktok-TTS robotic quality that ruins it. VV, with all its jankiness and need to reroll gens, just has way better output when it works, for ~20GB VRAM.
https://vocaroo.com/18fM4D7nlWaJ
https://vocaroo.com/1n9XIPJt6DmY
https://vocaroo.com/1fil4oj9qLN8
>>107975595
>IndexTTS2 is better than qwentts for emotions just check it here https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
lmao fuck i've been building effectively this for about 3 months but with porn vectors, didn't know it already existed
File: Untitled.png (13.4 KB)
>>107977622
>>107977622
>>107977622