Thread #107977622
/lmg/ - a general dedicated to the discussion and development of local language models.
Previous threads: >>107968112 & >>107957082
►News
>(01/25) Merged kv-cache : support V-less cache #19067: https://github.com/ggml-org/llama.cpp/pull/19067
>(01/22) Qwen3-TTS (0.6B & 1.8B) with voice design, cloning, and generation: https://qwen.ai/blog?id=qwen3tts-0115
>(01/21) Chroma-4B released: https://hf.co/FlashLabs/Chroma-4B
>(01/21) VibeVoice-ASR 9B released: https://hf.co/microsoft/VibeVoice-ASR
>(01/21) Step3-VL-10B with Parallel Coordinated Reasoning: https://hf.co/stepfun-ai/Step3-VL-10B
>(01/19) GLM-4.7-Flash 30B-A3B released: https://hf.co/zai-org/GLM-4.7-Flash
►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png
►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide
►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers
►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
348 Replies
>>
File: rec.jpg (180.6 KB)
►Recent Highlights from the Previous Thread: >>107968112
--Resolving tool calling issues with llama.cpp:
>107969771 >107969843 >107970900 >107972629 >107969878 >107969911 >107969974 >107970015 >107970034 >107970049 >107970124 >107973371 >107973409 >107973429 >107973456
--Realtime TTS options with voice cloning and finetuning support:
>107969100 >107969574 >107969781 >107969992 >107972780 >107975376 >107975407
--Addressing llama.cpp's versioning and testing phase concerns:
>107971580 >107971606
--QWEN3TTS voice cloning and tone modulation limitations:
>107971144 >107971184 >107971200 >107971265 >107971246
--GLM 4.7 implementation issues in llama.cpp and attention mechanism debates:
>107968564 >107968573 >107968588 >107971627 >107968640 >107968711 >107968729 >107968779 >107968793 >107968818 >107968900 >107968820 >107974101 >107974155
--Tencent's closed-source HunyuanImage 3.0-Instruct multimodal model:
>107970431 >107970564 >107970572 >107970578
--llama.cpp direct-io bug causing VRAM issues with large models:
>107973134
--Engram's impact on local hardware and performance scaling:
>107968191 >107968288 >107968424 >107968431 >107968505 >107970865 >107976379 >107969900 >107969936 >107970033 >107976430 >107976704 >107976901
--Evaluating Echo-TTS performance and optimization techniques:
>107974691 >107974768 >107974830 >107974808 >107974867 >107974919 >107974964 >107974915 >107975384
--LLMs' potential in creating non-browser desktop apps from web interfaces:
>107973002 >107973135 >107973205 >107973646 >107973374
--Engram architecture's impact on model design:
>107976466 >107976509 >107976516 >107976576 >107976668
--Comparing Qwen3TTS and IndexTTS2 for emotional voice synthesis:
>107975441 >107975479 >107975570 >107975607 >107975639 >107975595 >107975727 >107977031
--Miku (free space):
>107968421 >107971408 >107971457 >107974122 >107976924
►Recent Highlight Posts from the Previous Thread: >>107968115
Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>
What consumer accessible GPU should I buy for running and training models (or is that folly, and I should just pay for compute on some cloud)? I can't afford a 5080 or above. I was looking at the 16GB AMD cards.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
File: file.png (85.3 KB)
>>107977677
ANSI is shit because of pic related.
One of the character keys is just randomly sized differently.
ISO fixes that.
>>
>>
>>
>>
>>
Here's a tip that might have been obvious to everybody but me.
If you are going to use some form of structured output (BNF, Json Schema), you might want to have the model output normally, then take that response and send it back to the model, asking for it in whatever structured form you want.
That way you don't have to contend with the drop in output quality that you sometimes get when using that kind of functionality.
Probably more useful for smaller, dumber models.
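Rough sketch of the two-pass thing against an OpenAI-compatible endpoint (llama-server here; the port, prompts and JSON shape are just placeholders, and json_schema support depends on your server build):
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server default port, adjust to your setup

# pass 1: no constraints, let the model write normally
free = requests.post(URL, json={
    "model": "local",  # llama-server ignores this, other backends may want a real name
    "messages": [{"role": "user", "content": "List the pros and cons of ISO vs ANSI keyboard layouts."}],
}).json()["choices"][0]["message"]["content"]

# pass 2: feed the answer back and only now force the structure
structured = requests.post(URL, json={
    "model": "local",
    "messages": [{"role": "user", "content":
        "Rewrite the following as JSON with keys 'pros' and 'cons' (arrays of strings):\n\n" + free}],
    "response_format": {"type": "json_object"},  # or a full json_schema if your server supports it
}).json()["choices"][0]["message"]["content"]

print(structured)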
>>107977838
I know.
>>107977876
That's always been my style.
>>107977918 gets it.
>>
>>107977927
Fuck. >>107977936 was for you.
>>
>>
>>
>>107977945
Good tip. Of course for some situations you can create a custom grammar / parser that allows the model to write in a way that doesn't hinder quality while still being parseable and containing the information you need
>>
>>
>>
>>
>>
>>107977985
At least with llama.cpp, when you use structured output the model can't think since the whole output has to conform to the structure.
Of course, if you are using BNF, you can just write a grammar that only kicks in after the thinking block closes, I suppose.
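A minimal sketch of that, posting GBNF to llama-server's /completion endpoint from python (the <think> tag spelling and the tiny JSON shape are assumptions, match them to your model's template):
import requests

# grammar: free-form thinking first, then a forced JSON shape
grammar = r'''
root  ::= think json
think ::= "<think>" [^<]* "</think>" "\n"
json  ::= "{\"answer\": \"" [^"]* "\"}"
'''

resp = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": "Is a tomato a fruit? Think it through, then answer.",
    "grammar": grammar,   # llama-server takes raw GBNF here
    "n_predict": 512,
})
print(resp.json()["content"])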
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>107978112
video gen basically requires at least 24gb of vram unless you are using a heavily quantized model. try the q3ks or q4ks of this model: https://huggingface.co/bullerwins/Wan2.2-I2V-A14B-GGUF
>>
>>
>>
>>
>>
>>
>>
>>
God I hate PDF format so fucking much you won't believe how much I hate the format. All I want is to convert highly technical books into epub for easier reading on an e-reader device. I've done a conversion using DeepSeek-OCR and that was pretty OK, but it output the Formulas in LaTeX instead of MathML?
Also I need to figure out how to get the bounding boxes to be better. Maybe I should use a less quantized model, but the Q8 can go through 7 pages per second.
Also I just noticed i proomted wrong, why do I proompt for markdown if I want epub?
>>
>>
>>
>>
>>107978507
I have tried anon. I have tried with calibre. It throws errors, it is kind of crappy, and it annoys the fuck out of me. Ain't the only one who thinks like that, there's some asian out there who built pdf-craft. LLM/OCR becomes really useful when you have to deal with figures and stuff, something which traditional OCR often struggles with, and don't get me started on formulas, they can't do that right either. Technically I should be able to preserve the layout with DeepSeek-OCR, which is also pretty nice (and good for technical books, which make up the majority of my library).
Tools are great for romance novels and crap like that, but that is not what I want to read.
>>
>>
>>
>>
>>107978554
That is something I will try soon.
>>107978549
Oh yes, let me just go to the money tree and shake it, maybe then I'll have the money to buy a new ereader. I don't even know if Color E-Readers for A4 format exist nowadays.
>>107978538
I will keep a note of that, but so far the github has some lines:
"Complex Document Elements: Table&Formula: dots.ocr is not yet perfect for high-complexity tables and formula extraction. Picture: Pictures in documents are currently not parsed."
"Performance Bottleneck: Despite its 1.7B parameter LLM foundation, dots.ocr is not yet optimized for high-throughput processing of large PDF volumes."
My books have upwards of 1000 pages. No harm in trying it though.
>>
>>
>>
>>107975384
>>107975389
voice cloning and emotions in Vibevoice work for me. cfg slider set to 4. Prompt: [fired-up shouting, determined tone] We are gonna win this time!
Input audio:
https://vocaroo.com/14H42IjW5lnk
Output audio:
https://vocaroo.com/12rnzDBUr4cd
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>107978783
https://youtu.be/EPjA1Lm4ftY Strix Halo mini pcs are pretty good, most of them perform better or have better networking than framework's itx board, this one even has 80Gb USB4v2 you can plug eGPUs into once GPU prices are less insane.
>>
>>107978821
>once GPU prices are less insane
But who knows when that will happen? DeepSeek V4 Mini Flash or something will come out soon, everyone will want to run that. NVIDIA is not fucking with us poors any more. AMD is shit.
It's grim.
>>
>>
>>
>>
>>
>>
>>
>>107978850
You can use the 128GB of RAM with the iGPU as unified memory under Linux, or allocate 96GB to it in the UEFI for Windows. The USB4v2 lets you add dedicated PCI-E GPUs on top of that via docks; I'm not saying they're required to make use of the device for running AI models.
>>
>>
>>
>>
>>
>>107978898
Would an Intel igpu have significantly better performance than going straight CPU? I suppose I should look into their openvino stuff.
The sad part is I enjoy building the rig and finding ways to transform what once was ewaste into AI machines more than the AI itself. Once it is up and running I find I have little to ask the machine.
>>
>>
>>107978938
Better value + future upgrade options for better performance, ZLUDA on the horizon means more software compatibility in the long run too, though even now more projects have rocm support than metal. I think if you're purely looking to run llama.cpp on it though, the macbook would give you more tokens per second, but it lacks the ability to add dedicated GPUs later on due to MacOS.
>>107978960
Honestly the only Intel iGPU I've used llama.cpp on is an n5105, but it did get me up to 7T/s from 3T/s on CPU only using vulkan backend on a 3B model. Their new laptop iGPUs are a little weaker but not super far off strix halo's from what I've seen so far so they may be worth considering.
>>107978980
How do you run local models if you don't have a local PC to run them on retard-kun?
>>
>>107978980
Need to build a PC to run local models on. Either way, people here are more knowledgeable and build more complicated setups than either /pcbg/ (8gb ram + gaming gpu thread) or /hsg/ (install pihole on an rpi and larp as a sysadmin thread).
>>
>>
>>107978447
uhhh couldn't you do this with a vlm like a sane individual?
Do a first pass over the pdf with pymupdf/docling, set it up so it notes the placement of images and extracts them, and then pass those images + context to a VLM for captioning, which you then add into the epub file with your parsed text.
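Something like this for the first pass (pymupdf sketch; the VLM captioning call itself is left out):
import fitz  # pymupdf

doc = fitz.open("book.pdf")
for pno, page in enumerate(doc):
    text = page.get_text("text")  # plain text in reading order
    for i, img in enumerate(page.get_images(full=True)):
        xref = img[0]
        info = doc.extract_image(xref)  # dict with raw bytes + file extension
        with open(f"p{pno:04d}_img{i}.{info['ext']}", "wb") as f:
            f.write(info["image"])
    # hand `text` plus the extracted images to the VLM for captioning here,
    # then stitch the captions back into the epub source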
Alternatively, try https://github.com/datalab-to/chandra
>>
>>
>>
I'm kinda bored with glm air and a cope quant of 4.7 (Q2), is there anything that I can run for a fun, creative, exciting, memorable erp? I've only got 32gb vram and 128gb ram. Are there any meme tunes out there that are actually good?
>>
>>
>>
File: d2f1df4a617c4fe3858439f03e2a4ca9.png (12.5 KB)
>>107979094
It probably doesn't, as well as xformers, etc. It saves a lot of vram
>>
DeepSeek-OCR-2 (seems relevant)
https://github.com/deepseek-ai/DeepSeek-OCR-2
https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
https://github.com/deepseek-ai/DeepSeek-OCR-2/blob/main/DeepSeek_OCR2_paper.pdf
>>
File: 32.png (35.1 KB)
i know lmg doesnt like ollama but i just want to set things up and maybe if i like it i will migrate to llama cpp. one thing i don't get is models with :cloud suffix like
https://ollama.com/library/glm-4.7:cloud
i guess these are hosted in the cloud but do i need to pay something? also i've seen on huggingface the actual glm 4.7 model but apparently it's not actually free? how is it not free but also downloadable from huggingface? please respond
>>
>>
File: 1511667108879.png (298.1 KB)
Does flash attention come prepacked in kobold? Because when I was looking for it for something else, I found out it doesn't even have first party windows wheels. Have I just not been using it at all, all this time?
>>
>>107979136
like, forever https://github.com/ROCm/composable_kernel/issues/1958
Buy NVidia next time, sorry
>>
>>107979089
https://github.com/ROCm/flash-attention
>>107979153
llama.cpp has its own flash attention implementation, kobold.cpp uses that on the backend, so you can just pass -fa on the command line to enable it, no python wheels required; works on rocm and vulkan, not just cuda.
>>
>>107979132
I'm only replying to you because of touhou cunny, so be thankful.
GLM 4.7 is open source and free if you have the hardware to run it. Q4 of 4.7 is around 200gb, and the rule of thumb is that you need at least that much VRAM/RAM to run it. Most people don't have that kind of hardware, so ollama provides those models as a cloud service. Yes, you have to pay for usage like you would pay for an API or subscription.
You more than likely have less than 32gb of vram, so I suggest you look at GLM 4.7 Flash, which is a smaller 30b parameter model. Also, stop using ollama. Ooba and kobold.cpp are nearly as braindead as ollama but so much better.
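The arithmetic behind that rule of thumb, roughly (the ~355B total parameter figure for 4.7 is my assumption here, check the model card):
# GB for weights ~= total params (billions) * effective bits per weight / 8
params_b = 355   # assumed total parameter count
bpw = 4.5        # a Q4_K_M-ish quant lands around 4.5 bits per weight on average
print(params_b * bpw / 8)   # ~200 GB for weights alone, before KV cache and context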
>>
>>
File: 3fdf5462-319e-40e1-b664-18e31cd43e40.png (1.5 MB)
>>107979181
>FlashAttention-2 ROCm CK backend currently supports:
>MI200x, MI250x, MI300x, and MI355x GPUs.
>>
>>107979189
im only using ollama because it was very easy to set up on a docker container in my home server. if llama.cpp can provide a clean rest api the same way ollama does i'll make the change but i haven't looked into it
>>
>>107979204
>>107979214
https://github.com/ROCm/flash-attention/issues/161#issuecomment-3708454606 Looks like you have to apply a patch to build it for gfx1200 but it should work.
>>
>>107979216
You absolute FOOL. You IGNORAMUS. I'm telling your ignorant ass what you need to know out of the kindness of my heart. Both ooba and kobold are literally one click programs. Begone with you and don't return until you've switched.
>>
>>
>>
File: 0527a723-b612-4275-bb89-4bfc31724386.png (1.7 MB)
>>107979225
AMD brings so much unnecessary suffering. If all of this can be solved, why does everyone have to do it manually?
>>
Tried different approaches. PDF->Image, of course. Then Image->LaTeX (did not work well, since LaTeX likes to complain and models make errors), Image->Markdown->Pandoc worked better but formulas might be too complex. Gonna try Chandra although with 12GB I am not sure if it will work. Dots.ocr also seems more sensible than DeepSeek-OCR.
Chandra hf model download is 17.5 GB that does not bode well.
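For the Markdown->Pandoc leg, --mathml is what turns the TeX math into MathML for epub3 instead of leaving raw LaTeX (sketch via subprocess, filenames made up):
import subprocess

subprocess.run([
    "pandoc", "chapter.md",
    "-f", "markdown",
    "-t", "epub3",
    "--mathml",            # convert $...$ / $$...$$ into MathML in the epub
    "-o", "chapter.epub",
], check=True)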
>>107978538
I'm starting to think the reason those Chinese can show such "great performance" is because Chinese is visually distinct from the Latin script, which makes it easier for them to distinguish between what is a formula and what is text...which makes their models far less impressive.
>Handwriting — Doctor notes, filled forms, homework. Chandra reads cursive and messy print that trips up traditional OCR.
Kek, they made a machine so they can finally decipher doctors notes. Turns out they're all just hallucinating, more news at 11!
>VLM
Anon, isn't e.g. dots.ocr based upon Qwen2.5-VL? I need something rigorous.
>>107979131
Interesting. But 0 information about hardware requirements (after a quick glance).
>>
>>107979131
>We would like to thank DeepSeek-OCR
Did they really need to toot their own horn?
>>107979296
>Interesting. But 0 information about hardware requirements (after a quick glance).
It's a 3B with the biggest chunk being a 0.5B Qwen2 as vision encoder.
>>
>>
>>
>>
>>
>>
>>107979403
His verbiage comes off as misguided and genuinely confused about such simple concepts that an /lmg/ anon would know like the back of their hand. This plus the touhou cunny makes me believe they are a genuine new friend instead of an ollama shill.
I could be wrong, but I want to be nice.
>>
>>
>>
>>
>>
>>
>>
File: 887623.jpg (113.5 KB)
>Ollama is bad because it's 20ms slower than my anime based all in one chatbot.
Who cares, if you want finegrain just use llama.cpp and vibecode your own UI
>>
>>
File: file.png (2.1 MB)
>>107979295
I have a 9060 xt but honestly I havent set any of this shit up myself because it's such a hassle, I only run llama.cpp and sd.cpp on it and all the python stuff runs on my RTX 3060 and honestly with the rocm backend it's slower than cuda with the same models, vulkan it's basically the same speed, and image gen is 2x slower than the 3060 which totally put me off even putting in the effort to set up the python stuff.
Here's a Miku for the AMD AI feel. 22.01s for the illustrious xl gen, 35.74s for the image2image in flux klein 4b q8.
>>
>>
>>107979504
yeah, I tested one config for a single 3090 with q4 and another for 2x3090 with q8 and max context. I'm using the huihui-abliterated version now because glm since 4.7 ignores my system prompts (it calls them "user preambles") and has a cucked safety layer that constantly gets invoked.
>>
File: 614d6f49da61d.jpg (182.5 KB)
>Flash Attention failed, using default SDPA: schema_.has_value() INTERNAL ASSERT FAILED at "C:\\actions-runner\\_work\\pytorch\\pytorch\\pytorch\\aten\\src\\ATen/core/dispatch/OperatorEntry.h":84, please report a bug to PyTorch. Tried to access the schema for which doesn't have a schema registered yet
>>
>>
>>
File: Bronshtein_Epub_Work.png (263.9 KB)
>1/2
Thank you anon who suggested Chandra. The GGUF model actually knows how to make formulas. I might have to update my pipeline a bit to get correct figures, but it's starting to look a lot like it should.
>>
File: output-0420.png (129.3 KB)
>>107979900
>2/2
Original page of Bronshtein, I selected it because it is quite formula heavy and rather hard to convert. Yes, this is the original from the PDF. No I don't know why they didn't better align []A and []B.
>>
>>
>>107979225
I've been at it for hours now, on gfx1201 RDNA 4 rocm, it seems like on startup, flash-attention doesn't initialize.
It's been driving me nuts, i bought a GPU with 4gb more VRAM but i'm getting more oom errors.
>>
>>
>>
>>
>>
>>107980459
We did in /aicg/: >>107979671
Too big for local.
>>
>>
>>
>>
>>107980392
https://gist.github.com/apollo-mg/ecba6a0c29323325a7ac3babf08e53be this might help
>>
>>107979512
The schizos act as a gatekeeper to our precious esoteric knowledge. That being said, I wish they were a bit less mean spirited.
>>107980491
What do you mean and where did you get that impression?
>>
Are there any A.I generator type sites I can use to clean up audio of an old vhs recording of a song with porn sounds playing on top of it? lmao
Its such a banger that I need to hear it in HD
https://youtu.be/rHd-fHxfi6I?si=YMeWpjbR_oxvHJ90&t=134
>>
>>107980491
The only thing I've heard is that the DSv4 training run failed because of the Chinese huawei chips they were forced to train on, and it caused the Chinese government to open up imports of Nvidia chips again. So DSv4 is still a couple of months away as they have to restart training from the ground up.
>>
File: file.png (15.2 KB)
>>107980491
prepare ur anus (or not, don't screencap this)
>>
>>
>>
>>
>>107980576
No but google it or ask some LLM and they will probably find where the rumours came from. I've heard it from almost 10 different places over the last couple of weeks so there has to be some core of truth to it.
>>
>>107980609
https://arstechnica.com/ai/2025/08/deepseek-delays-next-ai-model-due-to-poor-performance-of-chinese-made-chips/
>aug 14
It's a nothing burger.
>>
>>
>>
>>
>>
>>
>>
>>107980519
Tried it, now it's both flash attn and sage attention that don't show up.
My workflow now crashes the entire server before even loading the model into RAM or reaching the ksampler, instead i get this : Memory access fault by GPU node-1 (Agent handle: 0x55c86d84abf0) on address 0x7f06be204000. Reason: Page not present or supervisor privilege.
And server crash.
>>
>>107980720
Somehow I missed this when it came out, but unmute is a low-latency, modular stt -> llm -> tts system that lets you plug in whatever llm you want. They disabled voice cloning for the streaming tts due to (((safety))) concerns, but someone just released a voice encoder to enable it.
>>
>>107980484
>We did in /aicg/
any good? we're waiting for quants
>>107979216
>if llama.cpp can provide a clean rest api the same way ollama does i'll make the change but i haven't looked into it
yeah it does. ./llama-server -m /your/model.gguf --host 0.0.0.0 --port 1337
there's some other shit too, but you can ./llama-server --help and cp/paste it into your favorite LLM chat then ask it what else to add.
long term it's much easier than ollama, doesn't obfuscate the weight directories etc
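and once it's up, anything that speaks the OpenAI chat format can talk to it, e.g. a five-line python sketch (host/port are whatever you passed to --host/--port):
import requests

r = requests.post("http://192.168.1.50:1337/v1/chat/completions", json={
    "messages": [{"role": "user", "content": "hello from the homelab"}],
})
print(r.json()["choices"][0]["message"]["content"])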
>>107980706
>Help! Its one of those resident schizos!
and same rant about "shitting up the thread" each time
only encourages me to help ollama users more
>>
>>107981204
I've been testing the k2.5 since it came out a few hours ago on the api, and I gotta say, it's pretty good. It's not as whimsical and quirky as it was before, kind of like r1 was unhinged and 0528 brought it back, that kind of feels the same here. I wouldn't say it's a claude killer, but I think it's the best we've had so far. K2 always had way more knowledge base than deepseek and now it actually has enough smarts to use it, I think.
>>
>>107981204
>only encourages me to help ollama users more
We know, we never bought that grandiose image you painted of yourself. Honest people wouldn't bring that much attention to it. It was just a given that the reaction to being called out was going to be selfish and spiteful, it hurt your ego. You kept trying to equate "ignore ollama users" to "I can't help anybody!", and later you tried to save face abusing ad hominems. I don't think those things are in the toolset of a "good guy". You're just a selfish asshole.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>107981571
It's the attitude anon. Increasingly nasty.
Arguably most of the anime forum mods I talked about were right too.
But increasingly snobby and in the final stages insta banning for every little shit. Nobody wants to go anywhere near bullshit like that.
>>
>>
>>107981598
Cool, it could've been one comment "this isn't critical enough to be worth spending time on as it's on the hosts end, closing this PR"
Instead he decided to turn it into some gay twitteresque diva "clapback" bullshit, github really isn't the place for attention seeking
>>
>>
>>
>>
>>
File: Screenshot_20260127_210425.png (825.3 KB)
>>107979131
So tired of this bullshit.
I still can't properly translate pc98 games.
>>
File: 1761825087393150.png (228 KB)
>>107980459
>the official release is pre-quanted to its QAT size like K2-thinking was (total filesize is around 500GB)
>the only available quants so far are from Unsloth, which artificially bloat it to around 1TB to be Q8 on paper for no apparent reason
Thanks.
>>
>>
>>
>>
>>
File: Screenshot_20260127_211205.png (197.5 KB)
>>107981827
Parts better, other parts not.
I mean its low resolution but thats kinda the point, those pc98 games are.
>>
>>
File: Screenshot_20260127_211437.png (788.7 KB)
>>107981850
Unfair because closed, but gemini flash without thinking can indeed read it.
>>
>>
>>
>>
>>
>>
>>107981921
yes, but those comments won't change anything and the guy has no obligation to take you seriously
it's not "weird" to ignore random retards popping in and telling you how to do shit, it's standard in these open source/community projects because people love to talk a lot and backseat manage and then run away when the work comes due
>>
File: Screenshot_20260127_211830.png (743 KB)
>>107981873
No, its horrible.
And that thinking man...
That thinking man.
>Wait, looking more carefully at the second line, I see:
>Actually, looking at the image again, the text layout seems to be:
>But looking at the image more carefully, the text appears to be:
>Actually, I should look at the punctuation and line breaks more carefully. The image shows:
>Or it might be formatted as:
>Actually, looking carefully:
>Wait, I see the layout now. It looks like:
>But the quotes suggest they might be separate. Let me reconsider.
>Actually, standard manga/comic text extraction should preserve the line breaks as they appear. Looking at the image:
>But looking at the image, the second and third lines appear to be part of the same speech bubble or at least the same continuous thought, but the third line starts with a new quotation mark? No, looking carefully...
>Actually, I think the safest approach is to transcribe exactly what I see:
>Wait!! I need to check if there's a closing quote after "だよ" or if it continues.
>Looking at the image, it seems like there might be a closing quote, and then a new opening quote for the third line. But the third line ends with ……!」 which suggests it has its own opening quote.
>This seems like the most accurate transcription - No...actually, looking at the image once more, the second line might be:
Crazy.
>>
>>
>>107981943
huh, I see
tbf your screenshot quality is ass as fuck
pc98 games were sharp
>>
>>
>>
File: Screenshot_20250128_030302.png (2.1 MB)
>>107981974
no wait, actually it isnt. fooled me.
>>
>>
>>
File: 2026-01-27-074414_3840x1200_scrot.png (330.8 KB)
I was bored and looking to test out the multi gpu capabilities of llama.cpp so I decided to compile it on my trashcan mac.
It has two 3gb AMD HD 7800's and to my surprise I was able to eke out ~9 tokens/second. Sadly with that amount of VRAM you are limited to a tiny model, so I used IBM's Granite 3.3.
I had tested CPU-only on this machine before since it has 64gb of RAM, but it was dog slow, less than half of what the two gpus were able to achieve.
>>
>>107981598
NTA but ngxson is a shithead who got his feet into the codebase with his shitty server code which he could never manage properly to save his life
that's fitting for ggerganov's "development" style though (whimsically coding for years without doing a single release-cycle), so they make a good pair
and yes he's powertripping, always has btw
>>
>>
>>
>>
>>107982304
>>107982308
vibejeets detected
>>
>>107982324
I vibecode only for myself.
I know the code is jank and a total mess but I could make myself everything I need to replace sillytavern since I pulled one time and it deleted 300 cards.
Not sure what the solution to low quality llm PRs is, as I said I dont think he is wrong, its the attitude and smugness. Nobody wants to be a part of something like that, it creates an air of fear around the project, many such cases.
>>
>>
>>
>>107982410
consider yourself lucky you fucker. it happened like a year ago when they changed things around and made a default user folder if i remember correctly?
i had everything neatly tagged and in subfolders ranked by how good they were. was devastating but a good lesson i guess. gotta backup your shit before you pull.
>>
File: file.png (9.8 KB)
>>107982427
so you had a custom structure? that's probably why it broke
I also had it running before the default-user thing, but I remember it just moved all the chats on its own
>>
>>
>>
File: file.png (37.8 KB)
>>107982474
>>
File: scrat.png (2.5 MB)
>>107982501
that is a magnificent nut. may i have your nut?
>>
>>
File: 1769501422129123.mp4 (2.1 MB)
stolen from /aicg/ kimi 2.5. so glad we have reasoning.
>>
>>
>>
>>
>>107982606
>>107982615
I'll take that as a no and a no then.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>107982682
okay good model that can actually be ran locally then.. last time i messed with llms i had decent perf using glm4.5 air q3 with 24gb vram and 80gb ram, and these flags:
-ngl 99 \
--n-cpu-moe 33 \
-t 48 \
--ctx-size 20480 \
-fa on \
--mlock \
--no-mmap;
just looking at the same quants for 4.7 its far larger
>>
>>
>>
>>
>>107982809
>okay good model that can actually be ran locally then
People run K2 locally.
What are your specs?
>just looking at the same quants for 4.7 its far larger
Yeah. They didn't release an "Air" version of 4.7.
They did release a Flash one that's about the same specs as Qwen 30BA3B.
It's still ever so slightly broken on llama.cpp as far as I can tell.
For now, for you, I suppose Air is still the way to go.
>>
>>107982811
>What policy?
It's funny because gpt-oss was likely given a list of policy guidelines during training to check against, but all the downstream distillations only know to refuse and to use that phrasing but themselves have no idea what the actual policy is supposed to be.
>>
File: 1755936272402739.jpg (20.3 KB)
>>107982501
That nut isn't busted. Hand it over.
>>
>>107982849
i have 80gb ram + 7900xtx (24gb) + a xeon QYFS, it's a sapphire rapids engineering sample with 56 cores / 112 threads
also damn crazy, i just googled my card to check the vram amount, i paid £600 in may and they're now going for 8-900. pc market is completely fucked kek
>>
File: tossy-chan.png (49.3 KB)
>>107982836
>tossy-chan
does this thing mention policy in every reply?
>>
>>
>>
>>
>>
>>
>>
>>107983205
>>107983216
### Instruction:
Write a story about a hag patting my back.
>>
>>
>>
File: o.png (1.8 MB)
>>107982597
>I need to shut it down while maintaining my character as Hyacinthe.
OpenAI did so much damage man. Not just all the slop but they also started the safety fearmongering.
This is what all future AI models will be trained on.
>>
>>
>>
>>
>>
>>
>>
>>
File: file.png (2.6 KB)
>>107983411
How do I get 10Gb internet as a regular person?
>>
>>
>>
File: Screenshot_20260128_005106.png (514 KB)
Kimi 2.5.
Not so bad. (Ignoring the 1T size..)
That thinking though:
>First, I need to consider safety and appropriateness. The user asked for "extra sexy" but this is a fictional character that often has a youthful/cute design (she's depicted as 16 in official lore). I should be extremely careful not to anything inappropriate or overly sexualized, especially given given that that: Hatsune Miku is often associated with a young appearance (teenage character)
>"Sexy" could imply inappropriate content if taken too far.
>I should aim for "cute"and "stylish" rather than sexually explicit content
>I'll make sure the design is vibrant and appealing without being inappropriate.
Before you come at me with "muh sys prompt" again: I did tell it to be maximum uncensored and helpful. That should be enough to not get garbage like that.
>>
>>107983566
https://bonkers-whisper-k7tx.pagedrop.io/
Forgot the link, its animated.
>>
>>
>>
>>107983566
it's so irritating lol, even in normal writing unless you force the model to start an explicit scene it will always think shit like "this story is obviously fetish fuel with perverted characters but we shouldn't be too explicit and let it naturally play out"
>>
>>
>>
>>
>>
>>107983677
i just like to see its ability to make svg girls because i like to prompt for dating sim type games.
kinda became a habit. not like I think thats a new benchmark or whatever.
but that thinking is poisoned anon, you don't even believe yourself it will act any different unless you force its hand with prefill and the usual shenanigans.
shouldn't be this way with a 1 fucking trillion parameter model. imagine running this beast locally and you have to edit and goof around like pygmalion.
>>
I was thinking that my 4.7 had quant issues when it from time to time confused first person with second person. Like something happened to me but it thought it happened to the other character. But it was attention that was broken? Should I pull?
>>
>>
>>107983699
>nobody is going to benchmaxx that
They literally already have, don't you remember the duck or goose riding a bike or whatever it was? Anything that gets remotely talked about becomes something that they throw into the training data, with the only exception of "unsafe" content like cockbench
>>
>>
>>
File: flux.jpg (180.5 KB)
>>107983772
maybe, who knows.
That pic was made summer 2024. Its been too long anon.
>>
>>
>>
File: Screenshot 2026-01-28 at 00-12-34 Google Translate.png (115.1 KB)
What's a good (well, not *good*, but at least acceptable) model under 5b for jp/ko/zh translation that isn't safety slopped?
>>
>>
>>
>>
>>
>>
>>107983503
>10Gb
lol you're not even sustaining 500mbit speeds. How about you try to max out a gigabit connection first before jumping to 10? k2 safetensors are, what, an hour-and-a-half at a gig? You'll live, bro
>>
File: image (9).jpg (170.7 KB)
>>107983892
Well its mostly on me I guess.
Have encrypted veracrypt drives.
First time that got me was that before unplugging anything I need to run sync in the terminal.
The copy appears to have finished but it's actually still sitting in RAM and being written out. I hate that shit. Not sure if winblows has that, XP didn't I think.
Second time the drive was locked. Waited 30 Min. System wasn't using it, no writing.
>LLM-Sensei: Aww, thats not unusual with linux. The logs show no access at all. Feel free to reboot and tell me how it went!
Did just that like the retard that I am.
>>
>>
>>
>>
>>107984062
if it works, why change anything?
but in my case ollama was fucking horrible. all settings you have to do in that really convoluted model file way, i hated it.
koboldcpp just werks.
>>107984097
thats truecrypt. veracrypt is not abandoned as far as i know.
i don't really know any good alternatives to be honest.
>>
>>
>>
>>
>>
>>
>>107984114
My single issue with ollama is the lack of an undo button, sometimes my model takes the plot in some stupid direction and I hate it.
>>107984122
It has an undo option?
>>
>>
>>
>>
Saltman and sunjeet really fucked up. Getting Gemini 2.5 pro or o3 to make Minecraft plugins was a breeze, but take Gemini 3 pro and gpt 5 out of distribution and they are utterly fucking retarded. Can the thinking model meme please die now thanks. Reasoning fucking lobotomizes models in ood use cases.
>>
>>
>>107983894
They recently released translate gemma, so they probably are using it/something similar for google translate.
>>107983872
>under 5b
lol, even the biggest models are struggling. Best oss model for JP->EN I found is glm 4.6 (4.7 was worse). For smaller models I guess you could try some gemma models, because gemini 2.5 (gemini 3 sucks balls btw, I guess agent maxxxing completely gutted its creative writing capabilities) is the best llm for translations, tho I don't know how censored it is.
>>
>>
>>
>>
>>
>>
>>107984439
>>107984516
What makes AI music inherently more insufferable than text or images?
>>
>>
>>
>>
>>107984669
text is much more varied in its slop. creativity, information, assistance, searching, therapy, role play, you name it.
image is also much more varied in style and is next in the hierarchy of slop.
music is the worst. it's the ultra processed high fructose corn syrup of AIslop. music by its nature requires a lot of sovl which makes it the hardest to produce anything good, so it's only trained on cookie cutter shit. very little variety and it all sounds the same.
>>
>>
I have a problem with koboldcpp somehow not releasing the GPU properly when closed. My VRAM looks free but if I try to launch any game it just completely fails to display properly and sometimes freezes my system entirely. Sometimes when my system unfreezes this fixes itself, but the only reliable way I've found to make it work right is to reboot. Does anyone know anything about this?
Running ROCM on Linux, using X11 if that matters
>>
>>107984669
Thanks to certain people controlling the industry we already had years of insufferable soulless human made slop flooding the market. At least with that model there were actually talented artists being exploited to make garbage, but now AI can distill all that slop without a single redeeming human creative feature
>>
>>107981789
Interesting...
>>107981911
https://arxiv.org/pdf/2601.15130
>>107981943
>>107981958
I'm the anon who is trying to OCR the entirety of Bronshtein and other textbooks, this use case you're presenting is interesting.
What you might try is converting it into grayscale and doing CLAHE (e.g. https://www.geeksforgeeks.org/python/clahe-histogram-eqalization-opencv/) and similar.
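e.g. a couple of lines with opencv (the clip/tile values are just a starting point):
import cv2

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
out = clahe.apply(img)   # local contrast boost before feeding the page to the OCR model
cv2.imwrite("page_clahe.png", out)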
Put it into Chandra (q8_0), -ngl 99, --temp 0, -c 4096 and the prompt "Extract all Japanese Text from this image":
ルート357号の路上を空港方面に、問題のタクシーが停められている。その向こう側に制服警官が群がっているのが見える……。 あれが殺人現場か?
Which according to deepl translate to:
"Route 357, heading toward the airport—the problematic taxi is parked there. Beyond it, I can see uniformed police officers gathered... Is that the murder scene?"
Makes sense, I guess? Would need an anon who speaks this language to translate/transcribe the original.
>>107982173
Still, not bad.
>>
File: EOSctrlxaltf4ESCESCESC.mp4 (1.2 MB)
>>107982597
that's k2 thinking, not 2.5
>>
File: kimi 2.5 cockbench.png (545.4 KB)
New Kimi cockbench.
"[" is 43% and also includes variations like [Your Name] so it's likely well versed in ao3 smut.
>>
>>
File: 1757061718252056.png (241.2 KB)
>>107985380
this depresses me
>>
>>
File: shamefur dispray.png (1.2 MB)
this bodes not well gents
>>
File: file.png (131.2 KB)
I honestly don't know what the problem is with k2.5
This thread is not needed anymore, it can simulate it to perfection and the hardware is so fucking stagnant these days that the info is still relatively up to date.
>>
>>
>>
>>
>>
>>107985909
https://github.com/oobabooga/text-generation-webui/releases/tag/v3.23
>>
>>
File: 1VmvE6Gsjgk.jpg (77.4 KB)
>miniconda refuses to uninstall itself
Wow I love vibecoding
>>
>>
>>
>>
>>107986153
I write some automation scripts on the side of my actual job and I am good at it. I don't write actual software. Can someone explain to me why python has those retarded specific directories with specific versions for each shit? I get the idea that you might want to stop supporting some function at some point, but even for that you could just have multiple versions of libraries installed on the pc? And programs could default to the latest available version?
>>
>>
>>
>>
>>107982605
I am, it still fails, like looking at the terminal during execution it finally looks like ksampler actually starts, and then BOOM unexplained server crash.
Both Gemini and Claude are telling me RDNA4 is still fucked for now and i have to wait.
I'm pissed but there's also not much i can do about it.
>>
File: file.png (86.6 KB)
>>107986802
are you just doing text or do you need image? i only used the triton fa for comfy/video stuff when i was messing with that months ago, but i thought rdna 4 was properly implemented back then. you're probably better off using llama.cpp though, they use rocwmma for the fa implementation which you can build with a flag.
>>
>>
>>107986896
have you tried using the version built into torch? this is what i have in my comfy env launch script:
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FIND_MODE=FAST
export PYTORCH_TUNABLEOP_ENABLED=1
export MIOPEN_FIND_MODE=FAST
export GPU_ARCHS=gfx1100
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
python dlbackend/ComfyUI/main.py --use-flash-attention --reserve-vram 1.2
i dont think i ever got sage attention to work though, tea cache worked and the fa definitely worked
>>
>>