//g/
This general is focused on archiving, but related topics are welcome too.

Storage technology and file sharing:
Hardware, software, services, shadow libraries, backups, home servers, and networks such as tape drives, HDDs, file systems, archive.today, IPFS, Arweave, BitTorrent, etc.

Development:
Example topic: web archiving is much harder in 2026 compared to 2016. Too many websites are walled off by systems such as Cl0udflare, making it impossible for services such as web.archive.org, archive.is, and megalodon.jp to capture their webpages. That's a big chunk of important data that easily disappears with no web archive captures. We have to develop solutions to this, such as using the SingleFile extension and other stuff.

In-depth history:
Examples: get into the "minutia and trivia" about the history of websites and all the little changes, or, talk about more important web history events and future events such as sites closing.

Analysis:
Examples: analyzing files and folders that you obtained from scraping or data hoarding, or, what you're sad was lost and not archived, what you're glad was archived.

Questions:
Ask whatever questions about any of this.
>>
File: KPC-Blog-Tape-Library.jpg (150.2 KB)
Inspirations for this general:


/dhg/ - Data Hoarding General
>Links
>Rentry: https://rentry.org/dhg
>
What is /dhg/
>In this thread we discuss and create technology and software for data-hoarding, archiving, scripts, and more.
>
>gallery-dl - scrape images, manga, videos and more from many websites
>https://github.com/mikf/gallery-dl
>
>Hydrus Network
>https://hydrusnetwork.github.io/hydrus/
>
>Stash
>https://github.com/stashapp/stash
>
>SmartImage
>https://github.com/Decimation/SmartImage


/dapp/ P2P Decentralized Applications General
>Share your favourite dapps here.
>
>Examples:
>
>brig https://brig.readthedocs.io/
>ipfs https://ipfs.io/
>ZeroNet https://zeronet.io/
>Arweave https://github.com/ArweaveTeam/arweave
>Gitopia https://gitopia.org/
>BitTorrent
>
>Leave your suggestions below.
>
>These components collectively make up the future internet known as web3.


/dshag/ thread
>Data scraping, hoarding and analytics general thread.
>What are you scraping, hoarding or analyzing frens? Also post some pics so I can post them from next time, anime also works
>>
I wanted to make this about more topics than just archiving and data hoarding as I don't think that attracts many posters.

Also, /asdiq/ sounds like "ass dick". Haha, hope this general never dies. At least it isn't exactly another AI slop general.
>>
>>108914628
Nice idea. I've always felt that archival is going to become more and more important with the passage of time, especially in the face of rising storage costs, increasing surveillance and corporate greed.
>>
With IPFS gateways, I can have whatever URL path at https://site.com/ipfs/[cid]/[path] or https://[cid].ipfs.site.com/[path]

This is great, and directly helpful for archiving, but is there a way to have the URL contain a question mark? Not possible with ipfs gateways. Possible with a .onion site, but I don't want to use that anymore.

Do I really have to pay for some domain name so I can run this?:
https://site.com/memento/20260203040506/https://othersite.com/index.php?id=123

(Using ipwb.)
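
Rough sketch of why the question mark is the problem, assuming the gateway looks files up purely by URL path: everything after the first "?" gets parsed as the query string and never reaches the path lookup. The URL is the example from this post; nothing here talks to a real gateway.

```python
from urllib.parse import urlsplit

# Example memento-style URL from the post above.
url = "https://site.com/memento/20260203040506/https://othersite.com/index.php?id=123"

parts = urlsplit(url)

# The path stops at the first "?"; the rest is the query string.
print(parts.path)   # /memento/20260203040506/https://othersite.com/index.php
print(parts.query)  # id=123
```

So a server that only maps paths to files would look for ".../index.php" and ignore "id=123", unless something in front of it handles the query.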
>>
>>108914674
Yup. Corporate greed makes grabbing some websites basically impossible. More reasons that archival becomes more important:

We live in the enshittification era of the Internet. Both web.archive.org and archive.org/details/ are enshittified procensorship hellholes that shouldn't be trusted. We need more alternatives and support for existing alternatives.

A year after BitTorrent was created, there were maybe tens or hundreds of terabytes of torrents. Decades later, that's ballooned into a much bigger and much harder to manage size if you want to capture a large part of it. Same can be said of other stuff. Many things drop off and are forever lost.

The world creates much more data each year than it did the year before. So far, it's an ever increasing trend. I learned that from reading about Filecoin (kinda sucks); I hope they finally got this FilBeam thing working:
https://docs.filecoin.cloud/reference/filoz/synapse-sdk/filbeam/toc/
>>
>>108914675
some chatgpt solutions:

>Encode the archived URL so it fits into path
>Instead of raw ?, encode the full target URL (base64, percent-encode, or use a path-safe encoding) and have your ipwb or handler decode it. This avoids needing special host handling. Example path: /memento/20260203040506/https%3A%2F%2Fothersite.com%2Findex.php%3Fid%3D123

>Use a free TLS proxy (ngrok / localtunnel / Cloudflare Tunnel)
>Cloudflare Tunnel (free) with a free workers.dev or *.trycloudflare.com address can front your local ipwb server and accept queries. Ngrok has paid TLS subdomains for custom domains; free subdomains rotate.
>>
How hard is it to have a hard drive and a pi running on a crt 24/7 ish simulating say Boomerang AMC reruns but instead of shitty old TV my favorite phonepost doomscrolls?
>>
>>108914954
Sounds fairly easy once you have all the hardware and connectors to the CRT TV.

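Name the collection of images like this:
img001.jpg
img002.jpg
img003.jpg
...
(PNG or GIF probably also works, but the img%03d.jpg pattern below only matches one extension, so keep them all in the same format)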

Then
ffmpeg -framerate 1/6 -i img%03d.jpg -c:v libx264 -r 30 -pix_fmt yuv420p out.mp4

Then play the "out.mp4" video. Done, slideshow of images at 6 seconds per image.

Reminds me of my time copying VHS tapes to DVDs. I could say more about that.
>>
>>108914940
Percent encoding method didn't work (I think I knew this months ago but forgot). https://archive.is/hFLPb is proof that it fails.

A file named
"https%3A%2F%2Fsite.com%2Findex.php%3Fpage%3Dpost%26s%3Dview%26id%3D12345679"

Becomes this in a gateway (double percent encoded):
https://[cid].ipfs.ipfs-02.hypha.coop/memento/20260527051814/https%253A%252F%252Fsite.com%252Findex.php%253Fpage%253Dpost%2526s%253Dview%2526id%253D12345679

We need it to be /memento/20260203040506/https://othersite.com/index.php?id=123 (or single percent encoded?) so archive.today can index it to othersite.com and not just *.hypha.coop
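
Quick sketch of the double encoding, using Python's urllib to reproduce the two strings above. This only shows the string transformation; it assumes the gateway percent-encodes the filename one more time, which is what the result looks like, not something I've confirmed in gateway code.

```python
from urllib.parse import quote, unquote

original = "https://site.com/index.php?page=post&s=view&id=12345679"

# First pass: the filename, with every reserved character percent-encoded.
once = quote(original, safe="")

# Second pass: encoding the already-encoded name again turns each "%" into "%25".
twice = quote(once, safe="")

print(once)
print(twice)

# One round of decoding only gets back to the single-encoded form.
print(unquote(twice) == once)
print(unquote(once) == original)
```

A crawler that decodes the gateway URL once ends up with the percent-encoded filename, not with the original URL and its real "?".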
>>
>>108914940
>localtunnel
This would be fuckin dope if it worked with no walls:
>https://theboroer.github.io/localtunnel-www/
>$ sudo npm install -g localtunnel
>$ ipfs daemon &
>$ ipwb replay 20260527051814-https---rule34.xxx-index.php-page-post-s-view-id-13656708.cdxj &
>$ lt --port 2016

I got the random tranny porn web capture to show up in clearweb at
https://tidy-meals-feel.loca.lt/memento/20260527051814/https://rule34.xxx/index.php?page=post&s=view&id=13656708

BUT ONLY after clicking/copy-pasting on some verification shit. Works flawlessly if using a .onion site:
https://archive.is/ysIMX

but I said I didn't want to use that anymore.
>>
>>108915771
It's sad that the Tor2clearweb gateways have all gone extinct. I could have used those. I'm now trying to use this thing:
https://localxpose.io/apps/nginx

Works:
>$ sudo npm install -g loclx

Fails:
>$ loclx tunnel http --to http://localhost:2016
>bash: loclx: command not found
>$ sudo npx loclx tunnel http --to http://localhost:2016
>sh: line 1: loclx: command not found

Works?
>$ npm config set prefix "$HOME/.local"; npm install -g loclx
>>
Archive-related news:

Deathwatch
>https://wiki.archiveteam.org/index.php/Deathwatch#2026-05
>May: Bucknell University Press will close at end of the 2025-26 school year.[61]
>May: The Primary School will close at end of the 2025-26 school year.[62]
>May: Sterling College will close at the end of the 2025-26 school year.[63]
>May: Trinity Christian College will close at the end of the 2025-26 school year.[64]
>2026-05-31: University of Houston Digital History will close.
>2026-05-31: Tistory will remove all uploaded videos.
>2026-05-31: plus a, a site documenting theater, will shut down.[65]
>2026-05-30: https://minelli.fr/[66]
>2026-05-30: ruru-jinro.net, ruru-jinro is an online Japanese werewolf game server that has been operating since May 2009. It is scheduled to close on May 30th (JST).[67]
>2026-05-29: Tele2 will be discontinued by it parent company Odido[68]
>2026-05-28: NIKKEI COMPASS will close service.[69]

Silicon Valley VCs Invest in Head-Mounted Cameras on Workers in India For Training AI
>https://web.archive.org/web/20260527022137/https://gizmodo.com/silicon-valley-vc-backs-startup-that-gathers-ai-datasets-from-head-mounted-cameras-on-workers-in-india-2000761062
>Human Archive believes its technology "will become foundational infrastructure for automating manual labor."
>A video went viral in India about a month ago appearing to show a vast number of garment workers wearing tiny, head-mounted cameras while they worked in a dreary-looking factory. A widespread hunch was the technology the video depicted was a system for what’s known as egocentric data collection—gathering first-person footage of people in action to train AI models, in order to replace the workers with robots. But it wasn’t totally clear if the video was real, let alone if the footage would or could be used to replace the workers.
>>
Is there a localhost to Internet thing which doesn't suck? Hoping one exists that doesn't require a login/verification. Such things did in fact exist in the past: see Tor2web and >>108915771 before it required said verification.

Otherwise, I'll have to make account(s) and pay for it.

>>108915974
>Works?
Nope:
>$ ~/.local/lib/node_modules/loclx/bin/loclx tunnel http --to http://localhost:2016
>Error: unauthenticated access
>>
This is an email from 1995-12-26 10:46. It has the subject line "Red Neck".

This image was deleted off of https://archive.org/details/ because that website is run by petty fucks.

Full/original image in ar://:
- meta: https://thuanannew1.store/raw/B99wT2us-zAEYox4b1tSVGpgwYGw_N5V5XRNlKjQUvM
- data: https://bienchecung.store/raw/LaM_OMXzH7bxlANb9_K_IF8u9F7F-kg3KfpN0W66q0k
>>
>>108914639
Forgot about this general which I first saw months ago:

/AAD/ - Archiving And Donating computer resources general
>>108890811

Most recent thread in that general died on 2026-05-24:
https://desuarchive.org/g/thread/108890811/

Last post was:
>Another bump. I just wanted to say that I can't live without the Wayback machine anymore. I'm working on a project that often involves dead links and it would have been far more difficult to complete without it, maybe impossible. Whatever happened to "if it was uploaded to the Internet, it's there forever" or however the saying went?
>>
>>108914639
>https://ipfs.io/
Sadly, since May 13, 2026 all of the https://ipfs.io/ipfs/[cid] links redirect to
>title: IPFS Service Worker Gateway | HEAD@[7 hex characters]
>url: https://[cid].ipfs.inbrowser.link/
which is an inferior IPFS gateway.
>>
A month or two ago, I bought a used 4-TB HDD for 15 USD per terabyte. I have reason to believe that it was only lightly used. I catted it all out to /dev/null and saw no storage medium errors. U jelly?
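
Minimal sketch of the same read-through idea in Python: read every byte, count them, and report the first I/O error if one happens. The function name and the temp-file demo are made up for illustration; for a real drive you'd point it at the device path and need read permission. A clean pass only means every sector was readable this time.

```python
import os
import tempfile

def read_through(path, chunk_size=1024 * 1024):
    """Read every byte of path; return (bytes_read, error_or_None)."""
    total = 0
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except OSError as exc:
        # A medium error on a real disk would surface here as an I/O error.
        return total, exc
    return total, None

# Demo on a small temporary file instead of a real block device.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 3_000_000)
demo_path = tmp.name

n, err = read_through(demo_path)
os.remove(demo_path)
print(n, err)
```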
>>
>>108914628
>hoard a bunch of shit in 2012
>it just lies on the NAS for over a decade, providing zero value to anybody
idk man, the zombie apocalypse just aint coming
>>
>>108917635
Breakdown of what you have?

I could think of value that it has such as
- deleted YouTube videos
- torrents which are dead now
>>
>>108914940
>solutions
Another one would be to enable port forwarding in the router. I don't want to do that.

>Cloudflare Tunnel
This NetworkChuck idiot spergs out about how wonderful that is even though you have to put in credit card info for their free tier:

"EXPOSE your home network to the INTERNET!! (it's safe)"
https://www.youtube.com/watch?v=ey4u7OUAF3c
>>
File: 1682642570004.png (200.9 KB)
>>108920500
Went with ngrok, but it's not working in a weird way.

Worked:
>$ # made an account, use a password manager if you're not a fucking retard
>$ pass generate me@email.com@ngrok.com 28
>$ pass show me@email.com@ngrok.com | xsel -ib
>$ # Run the ngrok program
>$ wget https://bin.ngrok.com/c/bNyj1mQVY4c/ngrok-v3-stable-linux-amd64.tgz
>$ sudo tar -xvzf ~/Downloads/ngrok-v3-stable-linux-amd64.tgz -C /usr/local/bin
>$ ngrok config add-authtoken $str # https://dashboard.ngrok.com/get-started/setup/linux
>$ ngrok http 80 # or port 2016 or port 8080

Failure:
Nothing shows up at https://directed-snoring-available.ngrok-free.dev/

Debug:
Running "ngrok diagnose" says this at the end
>Report written to /tmp/ngrok-diagnose1685308347/diagnose.json
>ERROR: Error establishing ngrok connection:
>ERROR: No tunnel servers were reachable via TCP.
>ERROR: (ERR_NGROK_8007)
>ERROR: https://ngrok.com/docs/errors/(err_ngrok_8007)

(Doing this just to have web archive captures from "?"-containing-URLs show up in clearweb.)
>>
File: fXH16.png (47.9 KB)
web.archive.org has excluded approximately 2049 websites (pic related).

archive.today has excluded approximately 3 websites.

>>108921219
Seems to be an old binary executable which connects to addresses which aren't there anymore: command "ngrok diagnose" said
>dial tcp 54.176.167.82:443 (connect.ngrok-agent.com): i/o timeout
>dial tcp 52.53.56.252:443 (connect.ngrok-agent.com): i/o timeout
>dial tcp 54.193.166.121:443 (connect.ngrok-agent.com): i/o timeout
>dial tcp 52.53.75.151:443 (connect.ngrok-agent.com): i/o timeout
>dial tcp 204.236.189.107:443 (connect.ngrok-agent.com): i/o timeout
>dial tcp 52.9.131.203:443 (connect.ngrok-agent.com): i/o timeout

The update command checks a 404'd page:
>$ ngrok update
>[ https://update.ngrok-agent.com/check = HTTP 404 ]
>$ ngrok --version
>ngrok version 3.39.5

Fairly useless info at https://ngrok.com/docs/errors/err_ngrok_8007 - it should say "Maybe those server addresses aren't being used by ngrok anymore."

Could run it via docker instead. Run "docker pull ngrok/ngrok" and so on.
>>
File: JlGGK.png (97.9 KB)
If you load a blog.csdn.net webpage in web.archive.org, it'll redirect to https://www.csdn.net/ ( example: https://archive.is/ezhfP ). Therefore, archive.today can't get a copy of it.

Solution = use SingleFile+ipfs(+ipwb+Tor):
https://archive.is/JlGGK

>>108921969
>Could run it via docker instead
Latest docker image is also version 3.39.5. This thing still isn't working:
>$ # https://dashboard.ngrok.com/get-started/setup/docker
>$ docker run --net=host -it -e NGROK_AUTHTOKEN=$str ngrok/ngrok:latest http --url=directed-snoring-available.ngrok-free.dev 8080
Going to directed-snoring-available.ngrok-free.dev with noscript:
>You are about to visit directed-snoring-available.ngrok-free.dev, served by [IPv6 address]. This website is served for free through ngrok.com. You should only visit this website if you trust whoever sent the link to you. (ERR_NGROK_6024)
which is boilerplate.

I could try installing the software on another computer and see if that works.
>>
This jeet talks about how sites are angry because people/bots are using web.archive.org to bypass the original sites' rate limiting or other annoying restrictions:

"AI Companies Are Killing The Internet Archive..."
https://www.youtube.com/watch?v=WsYXXFT9SiM

Inb4 all the original sites request that their website be removed from Wayback Machine (WBM), then WBM complies because they're procensorship.

>>108923830
>https://archive.is/JlGGK
>Original: https://blog.csdn.net/qq_33472553/article/details/143965935
>28 May 2026 06:23:27 UTC
Not entirely correct. The .cdxj file was originally this
>[...]"original_uri": "https://web.archive.org/web/20250629102618/https://blog.csdn.net/qq_33472553/article/details/143965935 [...]
I had to remove all references to web.archive.org; otherwise, ipwb wouldn't work.

I forgot to change 20260528055120 to 20250629102618 in the .cdxj, oops.
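
Sketch of doing that cleanup with a script instead of by hand. Big assumption: each .cdxj line is "<key> <14-digit timestamp> <JSON object>" with an "original_uri" field, which is what the snippet above suggests; the sample line is made up, and the function name is hypothetical. It strips the Wayback prefix from original_uri and reuses the Wayback timestamp. It leaves the key at the start of the line alone, which probably needs the same treatment.

```python
import json
import re

# Matches a Wayback-style prefix and captures its 14-digit timestamp.
# The optional two-letter modifier covers forms like "id_" or "if_".
WAYBACK_PREFIX = re.compile(r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/")

def unwrap_cdxj_line(line):
    """Return the line with original_uri unwrapped and the timestamp replaced."""
    key, timestamp, raw_json = line.rstrip("\n").split(" ", 2)
    record = json.loads(raw_json)
    match = WAYBACK_PREFIX.match(record.get("original_uri", ""))
    if match:
        record["original_uri"] = record["original_uri"][match.end():]
        timestamp = match.group(1)
    # The key (first field) is left untouched in this sketch.
    return f"{key} {timestamp} {json.dumps(record)}"

# Made-up sample line in the assumed format.
sample = (
    "org,archive,web)/web/20250629102618/https://blog.csdn.net/qq_33472553/article/details/143965935 "
    "20260528055120 "
    '{"original_uri": "https://web.archive.org/web/20250629102618/'
    'https://blog.csdn.net/qq_33472553/article/details/143965935"}'
)
fixed = unwrap_cdxj_line(sample)
print(fixed)
```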
>>
File: XjYsi.png (31.1 KB)
>>108914639
Saw another one:

>https://web.archive.org/web/20260528065757/https://desuarchive.org/g/thread/79634154/
>Anonymous Mon 11 Jan 2021 02:56:41 No.79634154 View ViewReport
>
>/web archiving general/
>
>what snap app would you recommend to save a web2.0 (or 3.0) with all the markup CSS JavaScript and html5 things working ?
>
>>like YouTube dl but for the whole in browser page
>>freezing ffox or chromium state and storing the page perpetually would theoritically do the same but I'm a math grad so I can't do shit
>
>discuss your web crawlers, what zone of the web do you consider more important, unexplored or simply most lulz worthy and how do you explore and store it.
>
>>data hoarders and lifehack savers welcomed
Someone replied:
>https://github.com/Y2Z/monolith
Description of that:
>CLI tool and library for saving complete web pages as a single HTML file

So that's basically SingleFile-CLI, but a different project.
>>
>>108923986
>tab rehab
LOL
>>
>>108924009
tabhab
>>
This is a screenshot of a website about optical illusions. This file has a timestamp of 2007-09-04!

This image was deleted off of https://archive.org/details/ because that website is run by absolute turds.

An amount of ETH was spent to upload this pic to a better archival system; here's the TX:
https://superstone.site/raw/IvNMUsM-eqJkV-Ru20P6OMEUAhAmIjq5iWPZcIAjEFY
>>
File: 1502190998919.png (36.5 KB)
First time hearing that PDF files could be uploaded to 4chan was when the 4chan hack happened, done by syjak(s).

Here's a .pdf that was posted to >>>/tg/:
https://desu-usergeneratedcontent.xyz/tg/image/1515/63/1515633945730.pdf

It's "The Library of Babel, by Jorge Luis Borges (1941)". Only 8 pages, go read it. That essay or short story is related to archiving.
>>
>>108921969
It might be true about the exclusions, but 9/10 times a page I am looking for, if it is archived anywhere, will be on the wayback machine and not on archive.today. It's a pretty wide variety of sites I am looking at, too. I think it's because wayback uses a crawler, I'm not sure archive.today does or whether it's all just manual. If they have a crawler it seems much worse than the wayback one.
>>
>>108926406
First, we must understand that
- archive.today uses a "frozen page" system (similar to SingleFile and Monolith)
- web.archive.org uses a WARC-based system (WARCs are used by or created by grab-site, GNU Wget, InterPlanetary Wayback / ipwb which also uses IPFS, etc.)
- both use browsers to capture web data, we know that at least web.archive.org uses non-browser tools as well

The last time Wayback Machine MAYBE used its own freestanding crawler was years ago, like a decade ago. Ever since then they only get web data from:
- Save Page Now (SPN): users go to their site and manually, one-by-one, submit URLs to be saved
- ArchiveTeam: they run a distributed virtual machine system that people use to mass download websites that are scheduled to be shut down or something. This "ArchiveTeam Warrior" software uses grab-site internally or something (I know the VM is based on Alpine Linux). I stopped caring to do that anymore due to my own petty dislike of ArchiveTeam's IRC channels; they have Reddit-tier mods, so fuck them.
- Common Crawl: terabytes (or petabytes?) of web crawl data! They should have grabbed websites harder because I know of so many websites/pages/forums that they're missing.

Open up a Wayback Machine (WBM) capture and click the "About" button/link on the timeline thing at the top. It'll tell you the "Why?": whether the capture was created by an ArchiveTeam project, by SPN, or by something else.

All the fucking AI sloppers scraped the web in an unethical way. What they should have done, especially in the early days of AI:
1. Used grab-site to download many webpages; this creates many .warc.gz files
2. Donate that data to archive.org (the pages may also be allowed to show up in WBM)

It's that simple, but they were greedy fucks. Also WBM and Internet Archive (IA) sucks balls so fuck them as well.
>>
>>108926979
Interesting anon, I appreciate the explanation. Yeah sucks that AI providers couldn't donate all their scraping back but I would have been surprised if they did, desu.
>Also WBM and Internet Archive (IA) sucks balls
Because of the exclusions/censorship?
>>
archive.today is 100% manual, one-by-one user submitted

wayback machine maybe ran its own crawler years ago, but they mainly get data from other people's and other organizations' crawls + manual one-by-one user substitutions

WARC-based archival systems are ultimately better than frozen page archival systems. There are pros and cons:
- pro: WARC has much better one-to-one correspondence to the original web raws and server headers
- con: WARC is usually based on CLI tools, and sometimes it's impossible for those to grab pages running some Cuckflare or Anubis anti-bot/anti-archival thing
- pro: Frozen can have better archival fixity
- con: Frozen has no server headers saved
- con: Frozen doesn't work in any of the WARC / web replay systems without extra programming and working on it
- con: Frozen has no or little record of the functionality of JavaScript and WebAssembly

I could maybe or probably go on and on about this stuff. Oh, one time I was talking to this retarded furfag who said that he'd rather have the web page raws and not the .warc.gz files. What a dumb bastard. The furry like persistently argued that having all the uncompressed data as .html, .css, .js, etc. was better; he said he didn't care about server headers and the metadata found in WARC files. One thing to realize is if it was all uncompressed as .html/.css/.js/etc. then many times things would simply not work. You need the WARCs + a WARC replay system to correctly have the web data replay and not be broken. And sometimes it's necessary like for https://dropbox.com/example?fileId=892182189892898 where the MEANINGFUL filename is only in the server header otherwise you have some random filename like "892182189892898" if you only have the web raws.

So if you opened https://dropbox.com/example?fileId=892182189892898 in the replay system (a custom Electron/Chrome browser or something) it would say "where do you want to download file '2026-04-26-091018_1280x1024_scrot.png'?" or whatever the filename is.
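
Small sketch of where the meaningful filename comes from, assuming the server sends it in a Content-Disposition response header (the usual place for a download name; the post above doesn't name the header). The header values and the helper name are made up for the example. With the headers saved, as in a WARC, the name is recoverable; with only the raw payload you fall back to the ID.

```python
from email.message import Message

def filename_from_headers(headers, fallback):
    """Return the filename given in the response headers, else the fallback."""
    msg = Message()
    for name, value in headers.items():
        msg[name] = value
    # get_filename() reads the filename parameter of Content-Disposition.
    return msg.get_filename(fallback)

# Hypothetical headers as they might be stored in a WARC response record.
with_headers = {
    "Content-Type": "image/png",
    "Content-Disposition": 'attachment; filename="2026-04-26-091018_1280x1024_scrot.png"',
}
# Raw-file-only case: no Content-Disposition was kept.
without_headers = {"Content-Type": "image/png"}

print(filename_from_headers(with_headers, "892182189892898"))
print(filename_from_headers(without_headers, "892182189892898"))
```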
>>
>>108927013
>Because of the exclusions/censorship?
Yes. Internet Archive is literally and figuratively ran by trannies. What do trannies do? They erase history and support censorship. The https://archive.org/details/ section of their shitty website operates much like JewTube: if they dislike you then they will delete you account or most of its items.

They mass delete thousands of archive.org/details/ items for completely asinine or hypocritical reasons. One time the upper-level IA turd(s) said in a blog post that the Wayback Machine is the jewel of IA. Like I was saying, WBM is WARC-based. The IA jannies deleted multiple archive.org/details/ items which were fully WARC grabs of entire websites. They are discriminatory against anyone who isn't their ArchiveTeam buttbuddies. So that's terabytes of web data -- which they basically said they value the most, especially it it's .warc files -- lost, because it was uploaded by non-ArchiveTeam accounts.

WARCs uploaded to IA by non-ArchiveTeam accounts used to be ingested into WBM; that ended in some year, maybe 2015, because they distrusted "Internet randos" and didn't want them modifying the WARC data to make fake web archive captures. Sounds like stupid elitism; WBM could have a thing where you click on "About" in the capture and it says "from a non-ArchiveTeam WARC". archive.today chads keep winning (>>108923876): they have a thing which works with memento URLs.

>>108927102
Rather "+ manual one-by-one user submissions"

About the furry who stupidly wanted an HTTrack-style copy of the website: at least he was interested.
>>
File: 1763211459008515.jpg (1.3 MB)
Man, saucenao can't even find pictures from Pixiv anymore. The walls are getting bad bros.
>>
>>108927201
Rather "archive.org/details/ section of their shitty website operates much like JewTube: if they dislike you then they will delete your account"

Rather "IA jannies deleted multiple archive.org/details/ items which were full WARC grabs"

Rather "they value the most, especially if it's .warc files"

>dumb furry
We were talking about a 1-TB torrent of a full-site WARC of a shutdown website. That torrent data was uploaded to archive.org/details/ and subsequently deleted.

>>108916448
/AAD/ is like the pro-Archive.org general.

/asdiq/ is or can be the anti-Archive.org general. Archival underground or something.

I wish either general would have a million posters like the Friendly GNU/Linux Thread and other >>>/g/ generals have. Unfortunately, I don't think that will happen as I think there's little interest in archiving and we archivists will remain in the minority. All based on what I've observed of forum posters' and people's interest in archiving.
>>
>>108914628
>storage tech and questions
I found an old file container that I think is TrueCrypt, I know the pass is less than 10 characters but can't remember it. How do I brute force mounting it? I'm pretty sure of the first 3 characters so that only makes the length at most 7 chars so should not take too long since I also know it only has English alphabet characters.
>>
>>108929045
>only has English alphabet characters
All lower case? If so, then that's 26^7 which is 8,031,810,176 permutations. 8 billion different permutations would be done shortly, as long as the thing handling the password doesn't make you wait 1 second between each attempt.

(Made me think of the total amount of Monero hashes I've checked/mined over the months: 272,989,758,847, which is 273 billion, but the system was designed so that it takes a while to calculate each one.)
>>
>>108929172
I think both upper and lower unfortunately but if I'm right and it's a TrueCrypt container it shouldn't take long to try one password. Would love a GPU brute-forcer program if there is one.
>>
Ways to upload to arweave from a CLI, using an API or something?
>>
>>108925288
>The Library of Babel, by Jorge Luis Borges (1941)
>Like all men of the Library, I have traveled in my youth; I have wandered in search of a book, perhaps the catalogue of catalogues; now that my eyes can hardly decipher what I write, I am preparing to die just a few leagues from the hexagon in which I was born. Once I am dead, there will be no lack of pious hands to throw me over the railing; my grave will be the fathomless air; my body will sink endlessly and decay and dissolve in the wind generated by the fall, which is infinite. I say that the Library is unending.
Me when I die in the universe-sized library.

>>108929045
>found an old file container that I think is TrueCrypt
An older segment of data from your past I assume. I'm guessing you created that and forgot part of the password.

I have someone else's PS4 HDD. It's multiple TB in size, and I basically can't decrypt it as I don't have the PlayStation 4 to get the keys out of. More on that:
> https://desuarchive.org/g/thread/108785672/#108861085
> >decrypted only UFS2 fs table
> Command in that script that does that:
> >$ sudo cryptsetup -r create -c aes-xts-plain64 -d ${TOOLKIT_PATH}/keys/${KEY_ES} -s 256 ps4hdd_es ${DEVICE}
> where $DEVICE is /dev/sdx
Ah, I remember now, no one's found out how to decrypt such external PS4 HDDs. They do know how to decrypt internal PS4 HDDs (and of course you can move data from the external one to the internal one).

>>108929245
52^7 = 1,028,071,702,528 = about 1 trillion. Sounds doable in not too long.
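
The arithmetic as a sketch, in case anyone wants to plug in their own numbers. It also counts the shorter lengths, since the real password could have fewer than 7 unknown characters. The guesses-per-second rate is a made-up placeholder, not a benchmark; swap in whatever your hardware actually reports.

```python
def keyspace(alphabet_size, max_unknown, min_unknown=1):
    """Count candidates for every unknown-suffix length from min to max."""
    return sum(alphabet_size ** n for n in range(min_unknown, max_unknown + 1))

lower_exact = 26 ** 7   # exactly 7 unknown lowercase letters
mixed_exact = 52 ** 7   # exactly 7 unknown upper/lowercase letters
mixed_up_to_7 = keyspace(52, 7)  # 1 to 7 unknown upper/lowercase letters

print(f"{lower_exact:,}")
print(f"{mixed_exact:,}")
print(f"{mixed_up_to_7:,}")

# Hypothetical rate, only to show how to turn a keyspace into a time estimate.
ASSUMED_GUESSES_PER_SECOND = 100_000
days = mixed_up_to_7 / ASSUMED_GUESSES_PER_SECOND / 86_400
print(f"{days:.1f} days at the assumed rate")
```

Including the shorter lengths barely changes the total, because the longest length dominates the sum.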
>>
>>108929245
Install Python and Hashcat.

In a command line, navigate to the hashcat directory:
cd hashcat

Get the hash for your TrueCrypt file:
python "tools/truecrypt2hashcat.py" "tcdir/tcfile.tc" > "tcfile.hash"

List backend devices so you can choose which one you want:
hashcat --backend-info

Then you can run Hashcat on your chosen GPU (mine is 2):
hashcat --backend-devices 2 -a 3 -m 29321 -1 ?l?u --increment --increment-min=3 --increment-max=10 "tcfile.hash" "ABC?1?1?1?1?1?1?1"

Replace ABC with whatever the first 3 characters are if you know them.
You can replace ?1 (custom charset 1, defined here as uppercase or lowercase) with ?l (just lowercase) or ?u (just uppercase) if you think a certain character is one or the other for sure and reduce the search.
You can check hash mode 29313 for RIPEMD160, 29323 for SHA512, or 29333 for Whirlpool. The last digit is the max number of chained ciphers it will check. If you only have one (like AES) then you can put it as 1 (like 29321) and it should be a bit faster, though this won't crack it if you have two or three.

If you used an old version of TrueCrypt before XTS was implemented, Hashcat doesn't support that.
>>
File: t3hL67xHNes-maxres.jpg (156.4 KB)
>>108925288
PDF:
>each book is of four hundred and ten pages; each page, of forty lines, each line, of some eighty letters which are black in color
Web incarnation:
>https://libraryofbabel.info/search.cgi

These mostly-meaningless books aren't so easy to archive. They compress to about 800,000 bytes at the smallest. Uncompressed size: about 1,333,000 bytes. It would be easier if each book was smaller than 100,000 bytes when compressed. Here's a Library of Babel book titled "swj cftauthd gwlb" (AKA "fuckai") and an edited/abridged version of it:
https://hupsoapsoap.store/raw/bFV6_CKqQMMYDbS-1l1VsJ_Qozq_frnbFpO2-Yu-92k
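
Back-of-envelope check on those sizes. Assumption not stated in this post: the site's books use a 29-symbol alphabet (26 letters, space, comma, period). If every character is effectively random, no compressor can do much better than log2(29) bits per character, which lands right around the 800,000 byte figure.

```python
import math

PAGES, LINES_PER_PAGE, CHARS_PER_LINE = 410, 40, 80  # from the quote above
ALPHABET_SIZE = 29  # assumption: 26 letters + space + comma + period

chars_per_book = PAGES * LINES_PER_PAGE * CHARS_PER_LINE
bits_per_char = math.log2(ALPHABET_SIZE)

# Lower bound on compressed size if every character is uniformly random.
min_compressed_bytes = chars_per_book * bits_per_char / 8

print(f"{chars_per_book:,} characters per book")
print(f"{bits_per_char:.3f} bits per character")
print(f"{min_compressed_bytes:,.0f} bytes minimum")
```

The character count comes out a bit under the 1,333,000 bytes mentioned above; line breaks and page separators would be one plausible source of the difference, but that's a guess.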
>>
>>108931795
>libraryofbabel.info
Site history:

2018: had its own forums
https://web.archive.org/web/20180511224119/https://libraryofbabel.info/
https://web.archive.org/web/20180504232955/http://www.libraryofbabel.info/forum/?page_id=14

2019: storage device failure = forums lost, I guess, hopefully not
https://web.archive.org/web/20190513174601/https://libraryofbabel.info/
>My apologies; Due to hardware failure libraryofbabel.info had some downtime. I am still working to restore the forums. -JB

2020: uses Reddit as forums
https://web.archive.org/web/20200519031603/https://libraryofbabel.info/
https://web.archive.org/web/20200626041708/https://www.reddit.com/r/BabelForum

I remember posting on those forums back when they weren't Plebbit.

>image
YouTube ID is https://www.youtube.com/watch?v=t3hL67xHNes

>ebook filesizes
In the Babel Image Archives section: while the IDs for the images are always approximately 940 KB in size, the images themselves are often smaller than 100 KB. Picrel is an example of that.
>>
File: ngrok.png (92.7 KB)
>>108921219
Running "ngrok http 8080" now magically works for some reason.

However, ngrok was a big WASTE OF TIME. Same crap happened that happened with >>108915771 except this time it says:
>ERR_NGROK_6024 - You are about to visit directed-snoring-available.ngrok-free.dev, served by [IPv6 address]. This website is served for free through ngrok.com. You should only visit this website if you trust whoever sent the link to you.
>[click button to see this webpage]
>>
>>108930781
>An older segment of data from your past I assume. I'm guessing you created that and forgot part of the password.
Yeah it's my own old encrypted file container. I don't remember what's in it but I don't want to delete it without knowing.

I looked at the desuarchive link you gave and the 4dOp-QA4VK4 video it contained, seems to be unrelated BitLocker stuff? My container was made with TrueCrypt or VeraCrypt (I think TrueCrypt since it is an old file).
If I misunderstand and you were trying to tell me of a program to brute force the container please tell me again.
>>
File: fuckai.studio.mp4 (2.8 MB)
>>108932083
And yes clicking the button did show the webpage, but this doesn't help in my goal of web archiving. Maybe I will buy some .xyz or .space domain name (some cheap TLD)...

Interesting site:
https://dnhub.io/bulk-tld-check

You put in some text and it sees if there's any sites for that. So you can put in "fuckai" and it'll show
https://fuckai.studio
https://fuckai.lol
https://fuckai.se
etc.
>>
>>108932181
I know basically nothing about TrueCrypt. The desuarchive thread I linked was just to show the investigation into PS4 jailbreaking and is unrelated to your problem or project.

I posted about it because I had a similar problem, but no solution in my case. The PlayStation 4 formats HDDs with some crap that makes things difficult; cryptsetup is involved, also unrelated to TrueCrypt.

See this anon's advice, he seems to be the genius with a solution: >>108931323
>>
>>108932219
oh shit i didn't see >>108931323 for some reason.
will look at it but it looks too hard for me. had hoped for some gui program for babby's first data recovery.
>>
File: IMG_20200803_153438.jpg (242.9 KB)
242.9 KB
>>108932083
If it's suggesting client-side changes then that's certainly a waste of time. Otherwise, messing with the reverse proxy local server: could make it send those headers server-side = removes that verification page. Here's hoping.

>>108916412
This is a photo from 2020-08-03 in USA. It shows a bench / outdoor table wrapped in plastic wrap due to Covid-19.

This image was deleted off of https://archive.org/details/, and not by the uploader.

Photo with metadata retained (has no GPS info):
https://liluandinhcao.store/raw/K8uCucJk8fTrud-Yry9sgH6ka3KhNDNFlx2sfFYNP58
>>
>>108932238
Are you using Windows or Linux? How large is the file?

Things should be easier now. You can talk to a chatbot at duck.ai and it'll probably help you. Or ask questions in this thread.

You can ask the LLM chatbot to analyze that command or ask it for where to get those files and so on. Here's another breakdown of what part of that does:

https://explainshell.com/explain?cmd=hashcat+--backend-devices+2+-a+3+-m+29321+-1+%3Fl%3Fu+--increment+--increment-min%3D3+--increment-max%3D10+%22tcfile.hash%22+%22ABC%3F1%3F1%3F1%3F1%3F1%3F1%3F1%22
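
For a rough sense of how big that search is, here's a minimal sketch that counts the candidates for the mask in that command. It assumes hashcat's --increment tries the mask cut down to each length from --increment-min to --increment-max (left to right) and that -1 ?l?u means 26 lowercase plus 26 uppercase letters. The function name is made up for illustration, and how long the run takes depends on your hash speed, which this doesn't estimate.

```python
# Sketch: count candidate passwords for the mask "ABC?1?1?1?1?1?1?1"
# with -1 ?l?u, --increment-min=3 and --increment-max=10.
# Assumption: --increment truncates the mask from the left, so length 3 is
# just "ABC", length 4 is "ABC?1", and so on up to the full mask.

CUSTOM_CHARSET_1 = 26 + 26  # ?l (a-z) plus ?u (A-Z)


def mask_keyspace(fixed_prefix: str, wildcard_slots: int,
                  min_len: int, max_len: int, charset_size: int) -> int:
    """Total candidates over every incremented mask length."""
    full_len = len(fixed_prefix) + wildcard_slots
    first = max(min_len, len(fixed_prefix))  # shorter lengths not handled here
    last = min(max_len, full_len)
    total = 0
    for length in range(first, last + 1):
        # Every position past the fixed prefix is one ?1 wildcard.
        total += charset_size ** (length - len(fixed_prefix))
    return total


total = mask_keyspace("ABC", 7, 3, 10, CUSTOM_CHARSET_1)
print(f"{total:,} candidates")
```

If you remember more of the password (a longer fixed prefix, or digits only at the end), shrinking the wildcard part is what cuts this number down the most.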

I'm curious to know what lies forgotten in said encrypted data.
>>
>>108932599
>If it's suggesting client-side changes then that's certainly a waste of time. Otherwise, messing with the reverse proxy local server: could make it send those headers server-side = removes that verification page. Here's hoping.
Request headers are what the browser sends to the server. I don't think an nginx reverse proxy behind ngrok can inject request headers in a way that helps, since ngrok sees the request before nginx does. Therefore, it's just client-side crap. It would be better if it wanted response headers to be changed (to remove the verification wall), as that can be controlled by the reverse proxy server.
>>
>>108934397
Windows, size is only just under 1 GB. Might just be old school work. But it could be important so I must know.
Thanks for the tip about LLM. And explainshell.com was neat.
>>
File: 7RfaT.png (38.3 KB)
38.3 KB
Years ago, I had an HDD, model WDC WD5000AVVS-6 (Disk /dev/sdb: 465.8 GiB, 500107862016 bytes, 976773168 sectors). I may still have that 500GB hard drive, or a .img file of it via "sudo cat /dev/sdb > raw.img".

It's a DirecTV HR22 Receiver 500GB hard drive which had XFS as its file system. It maybe needs a decryption key to read all of its data. I assume that key is in the device it was in (not in the HDD). In the year 2021, I was able to carve JPGs/PNGs out of it with Binwalk.

Output of command
$ sudo xfs_ncheck /dev/sdb2 > xfs.txt
is here:
- meta: https://sieucapchinhtri.store/raw/FjM_K255ApAY1eCct840-hsJ3k9o64PN_dHbFPQ6DgA
- data: https://ong3.xyz/raw/5cKzEEO6xtnLoZW7Yly3myBoZJyvoSK9DZC8ngKsnaY

It lists inodes and paths, such as
> 1397141 network/apg_data/extgdb_objs_go_hd/0000078b/extgdb_go_0078bb5c
> 14025 viewer/segments/Rcrd-03-18-2015-1027-08-10405-ch501-min65535-src2.mpg/0000000006023020544
> 1142995 network/apg_data/extgdb_objs_go_hd/00000ad7/.
> 1142996 dms_data/encrypted.tmp
> 41197 viewer/recordingsMessageCache/recording1094/event0000001433334281085
> 4390 viewer/indexfile/Rcrd-03-03-2015-1900-00-5-ch21-min65535-src2.mpg/meta_man.xma
> 63322816 backup/viewer/indexfile/Rcrd-03-22-2015-1900-00-26-ch254-min65535-src2.mpg/meta_man.xmi
> 54526098 network/font_cache/Direct Gothic_Medium_28_2_1_33_0_2_1_1_5A_2_C0_2_5A_0_26

I was unable to extract or access those files with the usual methods; xfs_ncheck could see them, so there must be a way? I don't have the full HDD or .img now, but I have 250-MB sections of the .img...
>>
>>108936132
>it's just client-side crap
Yeah, opening devtools in Brave Browser and running this in the console
>(async () => {
>const res = await fetch('https://directed-snoring-available.ngrok-free.dev/[...]', {
>method: 'GET', headers: {
>'ngrok-skip-browser-warning': '1'
>}, credentials: 'omit', mode: 'cors' }); const html = await res.text();
>console.log(html); })();
results in the webpage with no wall.

Running the same thing with a line commented out or changed to some other header
>//'ngrok-skip-browser-warning': '1'
results in the verification-walled ngrok webpage.
>>
File: 00053296.png (59.3 KB)
59.3 KB
>>108937519
"Data mining" info: I have

File "wdc_wd5000avvs-63zwb0_skip0_count500000_bs512" (256 MB, .gz?) = start of the drive, head bytes of the .img:
$ sudo fdisk -l wdc_wd5000avvs-63zwb0_skip0_count500000_bs512~
Disk wdc_wd5000avvs-63zwb0_skip0_count500000_bs512~: 801.77 MiB, 840714240 bytes, 1642020 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Device Boot Start End Sectors Size Id Type
wdc_wd5000avvs-63zwb0_skip0_count500000_bs512~1 64 1060289 1060226 517.7M 82 Linux swap / Solaris
wdc_wd5000avvs-63zwb0_skip0_count500000_bs512~2 1060296 32531624 31471329 15G 83 Linux
wdc_wd5000avvs-63zwb0_skip0_count500000_bs512~3 32531632 976768064 944236433 450.2G 83 Linux
$ # 256,000,000-byte file deleted off of archive.org/details/


File "..." = middle partition, XFS according to years-old records of mine:
$ lsblk -f
NAME FSTYPE LABEL UUID FSAVAIL FSUSE% MOUNTPOINT
[...]
sdb
─sdb1 swap [SWAP]
─sdb2 xfs a870119e-1278-4c3c-a075-5793bde4788d
─sdb3
[...]
$


File "wdc_wd5000avvs-63zwb0_skip60500000_count500000_bs512" = start offset of 30,976,000,000 bytes (31 GB), filesize of 256 MB

Carve files out by using foremost:
>$ foremost -t all -o /pathTo/EmptyDir/ wdc_wd5000avvs-63zwb0_skip0_count500000_bs512~

Foremost got
- thousands of files out of the first partition (a SWAP partition), such as picrel
- zero files out of part of the third partition (unknown filesystem)
>>
File: 00052408.png (55.9 KB)
55.9 KB
>>108938074
Ugh, I'd have to look through the files (each one increments by 500,000)
"wdc_wd5000avvs-63zwb0_skip0_count500000_bs512"
"wdc_wd5000avvs-63zwb0_skip500000_count500000_bs512"
...
"wdc_wd5000avvs-63zwb0_skip60500000_count500000_bs512"

then find where the offset of 542,871,552 bytes is for partition 2

then find where the offset of 16,656,195,584 bytes is for partition 3

all just to try again at a problem I couldn't solve years ago. Maybe foremost would succeed where Binwalk failed. The better program for carving out files seems to be foremost. (Plus, maybe LLMs could help me: they weren't a thing in 2021.)
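
The lookup above can be sketched like this. It assumes the pieces are raw, uncompressed dd-style slices cut with bs=512 and count=500000, and that "skipN" in each file name is the starting sector of that piece. If the pieces are slices of a compressed file instead, none of this applies. The function name is made up.

```python
# Sketch: find which 500,000-sector piece holds a given byte offset of the
# original drive image, and where inside that piece it sits.
# Assumes raw slices cut with bs=512 count=500000 skip=N.

SECTOR_SIZE = 512
SECTORS_PER_PIECE = 500_000
NAME_TEMPLATE = "wdc_wd5000avvs-63zwb0_skip{skip}_count500000_bs512"


def locate(byte_offset: int) -> tuple[str, int]:
    """Return (piece file name, byte offset inside that piece)."""
    sector = byte_offset // SECTOR_SIZE
    # Round the sector down to the start of the piece that contains it.
    skip = (sector // SECTORS_PER_PIECE) * SECTORS_PER_PIECE
    inner = byte_offset - skip * SECTOR_SIZE
    return NAME_TEMPLATE.format(skip=skip), inner


for label, offset in [("partition 2", 542_871_552),
                      ("partition 3", 16_656_195_584)]:
    name, inner = locate(offset)
    print(f"{label}: {name} at byte {inner:,}")
```

The two offsets are just the fdisk start sectors (1060296 and 32531632) times 512.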
>>
File: 00051928.png (48.0 KB)
48.0 KB
>>108938207
foremost also carved zero files out of this file, which is in the partition 2 section
"wdc_wd5000avvs-63zwb0_skip2000000_count500000_bs512"

Worse, I suspect that those split files are from a file named something like
"wdc_wd5000avvs-63zwb0.img.gz"

and in that case, I'd have to download 20 to 30 GB of the files, decompress it, then struggle more to get this locked-down stuff to work in ways I want it to.
>>
File: 00048240.png (29.9 KB)
29.9 KB
>>108938346
So would that all be worth it to, say, possible see someone's DVR recordings from 2015 (.mpg video files or something)? I can at least get some images and HTML files out of it. I can't get all the files that it claims to contain as per >>108937519
>>
There's a trend of normie content winning out over everything else. This victory of globohomo (global homogenization) means that the videos, for example, that people watch are mostly completely sanitized politically correct private-equity-funded YouTube videos.

This trend is fueled by the censorship that happens on all major platforms (including archive.org/details/).

Some videos which aren't globalist homogenization:
https://web.archive.org/web/20260503041834/https://chanii.ddns.net/b/res/76.html
https://web.archive.org/web/20260504173705/https://chanii.ddns.net/b/res/577.html

That website appears to have gone offline for good around May 6, 2026. Last post I know of was >>779 at 05/04/26 (Mon) 21:56:37 UTC. One of the MP4s/WebMs said something like: "A man lives three lives. First, the loss of innocence. Second, the loss of naivety. And lastly, the loss of life itself."
>>
155.4 KB
>>108939910
>Some videos which aren't global homogenization:
Or, just go to >>>/wsg/ and >>>/gif/ if you like the 4chan way of watching videos and talking about them.
>>
File: Tr4Yt.png (138.8 KB)
138.8 KB
Aaron Swartz is the Jewish co-founder of Reddit.

He mass downloaded JSTOR in a ridiculous way: sent like a billion requests per second. Folks, this is why when downloading a website, you don't exceed a download concurrency of 4.
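
A minimal sketch of what "don't exceed a concurrency of 4" looks like in practice, using a thread pool capped at 4 workers. The function names are made up, and fetch_with_urllib is only one possible fetcher (no retries, no delays between requests, no robots.txt handling).

```python
# Sketch: cap a bulk download at 4 requests in flight at once.
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

MAX_CONCURRENCY = 4  # the limit suggested above


def fetch_with_urllib(url: str) -> bytes:
    """Example fetcher: download one URL and return its body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls: Iterable[str],
                 fetch_one: Callable[[str], bytes] = fetch_with_urllib,
                 max_workers: int = MAX_CONCURRENCY) -> dict[str, bytes]:
    """Fetch every URL, never running more than max_workers at a time."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as url_list.
        bodies = list(pool.map(fetch_one, url_list))
    return dict(zip(url_list, bodies))
```

Usage would be something like download_all(list_of_urls). One failed request raises and stops the whole batch in this sketch, so a real scraper would want per-URL error handling.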

That resulted in him getting into legal trouble which was very ridiculous. Later on, his cause of death was suicide by hanging.

Saw this thing today about him:
https://web.archive.org/web/20260530083519/https://desuarchive.org/g/thread/30706801/
https://web.archive.org/web/20130515063511/http://aaronsw.archiveteam.org/

It was encouraging people to download www.jstor.org web data as a fuck you to the tards who ran JSTOR.
>>
File: 1358218820141s.jpg (13.4 KB)
13.4 KB
>>108939995
>Aaron Swartz
Wonder what happened to the files that he specifically downloaded. I guess those files on his devices were deleted or lost.

I was reading that /g/ thread from 2013: apparently JSTOR put many public domain documents behind paywalls. The absurdity...
>>
File: s-l960.jpg (144.9 KB)
144.9 KB
Anyone else notice how YouTube has been deleting full album videos in favor of individual music tracks? Why? Probably due to money/greed.

I can still listen to full music albums as single files in IPFS: used it yesterday to relisten to some "The Residents" albums (picrel):
https://archive.is/http://148.113.164.86:8080/ipfs/*

>>108931943
The Library of Babel has more books than atoms in the universe. I was reflecting on that short story; one of the things I have to say: they're not as meaningless as you think. Some of them are in fact completely meaningless, others are completely coherent, but written in unknown languages and encodings (or encrypted).

Number of atoms on/in Earth: 10^40
Number of atoms in the universe: 10^80
Number of books in The Library of Babel: 10^4677
Number of images in the Babel Image Archives: 10^961755
(Numbers according to https://www.youtube.com/watch?v=Sd0tB3tR3yQ video)
>>
File: 1779733571208315.jpg (54.4 KB)
54.4 KB
>>108940183
YouTube uses DistroKid and stuff like that to autogenerate band pages. I've even seen it smush together bands that have nothing to do with each other aside from similar names. It even did that with singers and comedians that have similar names. It's probably deleting everything not approved by the copyright holder even more aggressively than usual in order to push the official versions. Which is very bad considering that some rare and not so rare songs are georestricted or just not made publicly available at all (at least officially).

>Library of Babel
I've stumbled upon this a bunch of times while browsing the indie web. I'm not knowledgeable about Borges, so it just felt like a gimmick to me (even if the concept itself is very interesting). Did anyone navigate it productively, like actually finding things that are comprehensible, even meaningful? I wonder if some day it will be connected to some kind of LLM, as silly as the idea might seem now.
>>
File: 1772787101010905.jpg (88.0 KB)
88.0 KB
Also, interesting thread OP, thanks! It feels quite esoteric, even without the Borges stuff.
>>
File: browsehex.gif (33.2 KB)
33.2 KB
>>108938361
Rather "possibly see someone's DVR recordings from 2015"

>>108939995
>when downloading a website, you don't exceed a download concurrency of 4.
On the flip side, some sites are dumb: "Downloading at a concurrency of 10?! I'm being heckin DDoSed! I'm under attack!"

>>108940263
Didn't know that about YouTube autogenerated music data. I've seen such things happen on non-YouTube sites: "smush together bands that have nothing to do with each other aside from similar names".

Sad to see the non-official music uploads from regular YouTube users being deleted (along with their channels and the comments on the videos). Another move from "YouTube: broadcast yourself" to "YouTube: broadcast your corporation".

>Did anyone navigate it productively, like actually finding things that are comprehensible, even meaningful?
Statistically, it's basically impossible to find sensical short-or-medium/long-length sentences, especially if not looking at "random English words" book pages. I've personally found sorta meaningful small bits of texts in it. Such as
>fedbgayxsshifr
in the "swj cftauthd gwlb" book >>108931795, but this is like schizo-levels of pattern recognition and meaning assignments.

>>Library of Babel
>I wonder if some day it will be connected to some kind of LLM, as silly as the idea might seem now.
That's one way of looking through it. 10^4677 = a number with 4678 decimal digits.

Terabyte (TB) = trillion bytes, petabyte (PB) = quadrillion bytes, exabyte (EB) = quintillion bytes, zettabyte (ZB) = 10^21 bytes, yottabyte (YB) = 10^24 bytes, ronnabyte (RB) = 10^27 bytes, quettabyte (QB) = 10^30 bytes. All of these units are inadequate to describe the Library of Babel if it was a static complete set of files. It has a data size of 10^4647 QB.

>felt like a gimmick to me
Important thing that should be done if not already done: make libraryofbabel.info website software free and open source, standard identifiers on everything = long-lasting dynamic database/system
>>
File: shelves.png (888.1 KB)
888.1 KB
>>108940430
>standard identifiers

In libraryofbabel.info:
Each page in each book can be uniquely identified with 3266 characters (3,266 bytes). Character set: lower case alphanumeric. 36^3253 (number with 5063 decimal digits) is larger than 10^4677. Of course, each individual page is smaller than 100 KB. All findable books are basically 1 MB in size (can compress down to about 800 KB). The IDs look like this:
>Book Location:[3253 characters here]-w1-s3-v21:111
where w=wall, s=shelf, v=volume, and :[number]=page.

In libraryofbabel.app:
Uses a different system for IDs. Going to a random page in that site -- https://libraryofbabel.app/ref/@ce90f1c41a06d76e571dbc22325168650156952a561f548e3d771454dc100b91.1.1.30.178 -- I see "Room 1acbbhjh...cd98hpet / Wall 1 / Shelf 1 / Book 30 / Page 178" with a link to https://libraryofbabel.app/fullref/@ce90f1c41a06d76e571dbc22325168650156952a561f548e3d771454dc100b91.1.1.30.178 = the ID looks like this:
>[1.3 megabytes of lower case alphanumeric text].1.1.30.178
first number after "."=wall, second number=shelf, third number=volume, fourth number=page. The ID in the URL is
>ce90f1c41a06d76e571dbc22325168650156952a561f548e3d771454dc100b91.1.1.30.178
same thing, except for the first part, which is 64 hexadecimal characters instead. 16^64 = a measly 10^77. Seemingly, the address space of libraryofbabel.app's URLs can only map to less than one quintillionth of all of the books in the Library of Babel megastructure.

There's apparently other Library of Babel websites, but for now the most glaring problem is the lack of a standardized system (books in libraryofbabel.app have !,?,- = not the case with libraryofbabel.info) and the lack of standardized IDs.

>>108931795
>.txt file of "The Library of Babel" by Jorge Luis Borges, formatted into the style that he described
>in the other [very small closet], satisfy one's fecal necessities
But then what do any of those people eat? Story doesn't have to be completely fleshed out, still makes me wonder
>>
>>108940430
>has a data size of 10^4647 QB
Actually, if each Library of Babel book is 1 MB in size, then the true size is 10^4653 QB (10^4683 bytes).

>>108940671
This other site says:
>https://babel.zwyx.dev/intro
>There are 29^1,312,000 books in the Library of Babel — a number with 1,918,667 digits.
So what's the correct total number of books? "Each book contains 410 pages. Each page, 40 lines. Each line, 80 characters. Each character can be a space, a letter, a comma, or a period." That's 1,312,000 characters per book, so yeah, charset^slots = 29^1312000. >>108940183 I guess that dumb YouTuber messed up on the calculations in his Sd0tB3tR3yQ video.
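
Quick way to check these digit counts without building the giant numbers, as a sketch: the number of decimal digits of base^exponent is floor(exponent * log10(base)) + 1. The helper name is made up.

```python
# Sketch: check the book-count arithmetic using logarithms.
import math

PAGES, LINES, CHARS_PER_LINE = 410, 40, 80
CHARSET = 26 + 3  # 26 letters plus space, comma, period


def decimal_digits(base: int, exponent: int) -> int:
    """Number of decimal digits in base**exponent."""
    return math.floor(exponent * math.log10(base)) + 1


slots = PAGES * LINES * CHARS_PER_LINE
print("characters per book:", slots)
print("digits in 29**slots:", decimal_digits(CHARSET, slots))
print("digits in 10**4677: ", decimal_digits(10, 4677))
```

Floating-point logs are fine here because none of the products land near a whole number; for an edge case you'd want exact integer math instead.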

And that site uses yet another system of IDs:
>https://babel.zwyx.dev/random
>Book ID = [24,000 ASCII non-whitespace characters]
>Location in the library = Room number of 46,793 digits, wall, shelf, volume
Book ID can also be downloaded as a 19,610-byte PNG image.
>>
File: MistyTasteOfMoonshine.jpg (111.0 KB)
111.0 KB
>>108938361
Not sure I'll do this soon. Both things take some work (either combining the split files and so on or finding the HDD and sticking it internally into a computer of mine).

I can say that I used "xfsprogs_3.2.1_amd64.deb" in Lubuntu in 2021 to gain some info about what was in that drive. Here's that file:
- meta: https://ong3.xyz/raw/tpNRY30SwyTHzLqzgbrX55ETXUfsaTwu-HW9TDCKn-s
- data: https://xacminh.store/raw/gvCNNKxHXTq2DTPigyaMwva81FcjCO3YHbWXwLG1UiA

Both files "xfsprogs_3.2.1_amd64.deb" and "xfsprogs_3.2.1_i386.deb" were deleted off of https://archive.org/details/, and not by the uploader, same as >>108925208

Attached: another pic carved out of that "XFS hard drive" (or "DVR hard drive").
>>
Is there any way to recover domains that the wayback machine nuked from their archive?
I think it's bullshit that someone can just buy up an expired website and then request wayback to remove it from their index.
>>
>>108942646
There's ways to get captures of excluded websites, but one problem is that not enough people get and share WARCs. What sites are you looking for?

Well-known methods:
- Look for the site at https://archive.is/site.com = can sometimes find captures; searching that will show all URLs for that site, searching https://archive.is/*.site.com shows all subdomains of it
- Look for URLs at https://megalodon.jp/[URL] = very rarely it'll be here. I don't know how to search megalodon.jp (ウェブ魚拓) like how you can search archive.is in the previous bullet point.

Lesser-known methods:
- Search the indexes of WARCs in archive.org that were downloaded around the time of that website's existence under a certain webmaster; search each index for "site.com" using grep
- Do the same search but wherever else you might find WARCs (like in some torrents)
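
The index search in the lesser-known methods can be sketched as a grep-like scan that also handles gzipped indexes. It's a plain substring match, nothing smarter, and the function name is made up. Depending on the index, the first column may hold a reversed-domain key rather than the URL, so searching for the plain domain relies on the original URL also being on the line.

```python
# Sketch: grep-like search of a WARC index file (.cdx or .cdx.gz) for a site.
import gzip
from pathlib import Path
from typing import Iterator


def search_index(index_path: str, needle: str) -> Iterator[str]:
    """Yield every index line that mentions needle (e.g. "site.com")."""
    path = Path(index_path)
    # Pick the opener by extension so plain and gzipped indexes both work.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if needle in line:
                yield line.rstrip("\n")
```

Usage would be something like: for hit in search_index("some-index.cdx.gz", "site.com"): print(hit). Each hit should name the .warc.gz that holds the capture, which tells you which file to go download.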

Problems: not enough people grab WARCs of websites and the non-ArchiveTeam non-normal-user WARCs in archive.org are un-downloadable; for example:
>$ curl -I https://web.archive.org/web/20260218003722/https://www.meridian.space/blog/introducing-pay-per-byte-a-new-era-for-filecoin-retrieval
says:
>x-archive-src: CC-MAIN-2026-08-1770395863965.96-0037/CC-MAIN-20260217225554-20260218015554-00755.warc.gz
that's
>https://archive.org/download/CC-MAIN-2026-08-1770395863965.96-0037
>Files marked with lock are not available for download [even if you're logged in, and all the important files are marked with a lock symbol]
WARC indexes are the .cdx.gz/.cdx files and, if they were created by grab-site, "wpull.log". grab-site and Wget can create WARC files; each .warc.gz is around 5 GB in size and contains thousands of webpages.
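
Going from the x-archive-src header to the item page is just string splitting. A sketch, assuming the header value is always "item name/file inside the item" like in the example above (function name is made up):

```python
# Sketch: turn an x-archive-src header value into the archive.org item URL.
# Assumes the value has the form "<item name>/<WARC file inside the item>".


def item_download_url(x_archive_src: str) -> str:
    """Return the archive.org/download URL for the item holding the WARC."""
    item, _, _warc_name = x_archive_src.partition("/")
    return f"https://archive.org/download/{item}"


src = ("CC-MAIN-2026-08-1770395863965.96-0037/"
       "CC-MAIN-20260217225554-20260218015554-00755.warc.gz")
print(item_download_url(src))
```

Whether the files behind that URL are downloadable is a separate question, as shown by the lock symbols mentioned above.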

The lesser-known method was used to successfully obtain this webpage which was removed from web.archive.org and not in archive.is:
- title: "Twilight hugging Moondancer by MrPoniator on DeviantArt"
- url: https://www.deviantart.com/mrponiator/art/Twilight-hugging-Moondancer-544149993
- screenshot: attached
>>
>>108943227
>each .warc.gz is around 5 GB in size
That's the grab-site default for Web ARChive files (WARC files). WARCs created by GNU Wget can be whatever size; Wget doesn't have a default size limit, I think. Wget-created WARCs also have indexes in CDX files, I think.

>lesser-known method [WARCs] was used to successfully obtain this webpage which was removed from web.archive.org and not in archive.is
Oh, and the original / live / source webpage was deleted in around the year 2020. Source code of the page when rendered in ReplayWeb.page (WARC replay software) includes this text:
http://localhost:5471/w/id-1e7546e7fe91/20190903230859/https://www.deviantart.com/mrponiator/art/Twilight-hugging-Moondancer-544149993

The HTML file at /ipfs/[CID]/i_localhost.htm renders as plain HTML with no CSS and JS. Realizations:
- if running ReplayWeb.page with that WARC open (the .warc.gz is in some torrent): then you can maybe go to that http://127.0.0.1:5471/ link and it will render with the .css and .js
- I can replay that page via ipwb and get archive.is to capture it. It would be better if I used SingleFile to capture 127.0.0.1:5471/... then replay that SingleFile-created-HTML with ipwb

Torrent which has that WARC of "great interest" (probably dead now):
magnet:?xt=urn:btih:3850e42c8449a43e2959db46ad4985ded54408aa
>>
File: JP2WW.png (75.5 KB)
75.5 KB
This mega.nz folder was deleted:
https://mega.nz/folder/papA0DIa#crI_OpajKXo_r_ZL1jJ5dg
https://archive.is/2024.09.08-220759/https://mega.nz/folder/papA0DIa%23crI_OpajKXo_r_ZL1jJ5dg

I currently have a copy of it at
/mnt/sshfs/zd/b/z2/data/0221061/https(u003a)(u002f)(u002f)mega.nz(u002f)folder(u002f)papA0DIa(u0023)crI_OpajKXo_r_ZL1jJ5dg/

Size: 58 GB. Contents: hashes and metadata of millions of images that were/are on the web (tumblr).
>>
Is the future of Discord archival (without getting your account banned) hopeless?
https://github.com/Tyrrrz/DiscordChatExporter/issues/1497
>>
>>108943386
Sample:
https://xacminh.store/raw/jp73AEGImAJIkB8i_w8HW_zcI6YfNHFOYc2CSz2OLfc

Pic related.

>>108943528
I hope not. With the amount of people who use that dogshit, there's sure to be important things.
>>
File: 1775971945296130.jpg (336.7 KB)
336.7 KB
>>108943227
>- Look for the site at https://archive.is/site.com = can sometimes find captures; searching that will show all URLs for that site, searching https://archive.is/*.site.com shows all subdomains of it
This is a really cool feature. Does archive.org have something like this as well?

Or while we're on the topic, any other site/search engine, archive related or not, that lets you list subdomains like this? I know many search engines use wildcards *, but I've never considered their uses beyond very basic searches. Something that could list all the pages on a website might prove useful both for archiving and beyond.
>>
>>108943620
>With the amount of people who use that dogshit, there's sure to be important things.
Tell me about it. As the most egregious example I've experienced to date, there's this fan-port of a game where the only way to get it is to make a thread in their Discord server proving you own the original (because they don't want to get sued or whatever) and wait for the lead developer (who made a pinned thread saying he's not only overwhelmed with the thousands of requests, but is currently on vacation) to DM it to you.
>>
>>108943528
I had no such problem with dht.
>>
>>108943641
>Does archive.org have something like this as well?
Yeah, but maybe not the subdomain thing; example:
>https://web.archive.org/web/*/https://ipfs.nftstorage.link/*
>https://web.archive.org/web/timemap/json?url=https%3A%2F%2Fipfs.nftstorage.link%2F&matchType=prefix&collapse=urlkey&output=json&fl=original%2Cmimetype%2Ctimestamp%2Cendtimestamp%2Cgroupcount%2Cuniqcount&filter=!statuscode%3A%5B45%5D..&limit=10000
shows everything captured from that site under http:// and https:// (picrel = traditional art as an NFT from that site)

>any other site/search engine, archive related or not, that lets you list subdomains like this?
This paid search engine -- https://www.shodan.io/ -- can maybe enumerate all known subdomains of sites. (Shodan does stuff like list exposed APIs which shouldn't be exposed, or just general intelligence about which IP addresses are running some service and at what port. I can already do that for free, but on a small and targeting two specific services.) I used some FOSS software that got subdomains from websites in the past; it worked OK, not great.

>Something that could list all the pages on a website might prove useful both for archiving and beyond.
Sites used to have a sitemap.xml that listed all paths (pages, files) in the site. One way to get "all the URLs" of a site is to, for example, look at the WARC indices of ArchiveTeam uploads to IA when they were saving Imgur before it enshittified further. You can get ~millions of Imgur links using that method. (I could go into more detail on this topic.)
>>
>>108943842
>I had no such problem with dht.
As in Discord History Tracker? How long ago did you last use it?
>>108945112
If you only want results archived with a status in the 200 range, change
>filter=!statuscode%3A%5B45%5D..
to
>filter=statuscode%3A2..
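
A sketch of building that timemap URL in code, with the filter swap as a switch. The parameter names and values are copied from the URL posted earlier in the thread; the function name and the only_2xx flag are made up. urlencode may percent-encode a few more characters than the hand-written URL does, which should be equivalent.

```python
# Sketch: build the web.archive.org timemap query for everything under a site.
from urllib.parse import urlencode

TIMEMAP_ENDPOINT = "https://web.archive.org/web/timemap/json"


def timemap_url(site_prefix: str, only_2xx: bool = False,
                limit: int = 10000) -> str:
    """Return the query URL; only_2xx swaps the status-code filter."""
    # Default filter excludes 4xx/5xx; the alternative keeps only 2xx.
    status_filter = "statuscode:2.." if only_2xx else "!statuscode:[45].."
    params = {
        "url": site_prefix,
        "matchType": "prefix",
        "collapse": "urlkey",
        "output": "json",
        "fl": "original,mimetype,timestamp,endtimestamp,groupcount,uniqcount",
        "filter": status_filter,
        "limit": limit,
    }
    return f"{TIMEMAP_ENDPOINT}?{urlencode(params)}"


print(timemap_url("https://ipfs.nftstorage.link/", only_2xx=True))
```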
>>
>>108943386
Couting up the records in those SQL files:
- #-d, p, s, z = dunno the exact row count because I Zstandard-compressed them; however we can think about the average rows per byte: 0.00328900696299. Doing calculations with the uncompressed sizes = 89029400 rows
- everything else = 115780158 rows

Total = 204,809,558 records. I could share these compressed SQL files (picrel is from one of the URLs).

That's more than 200 million tumblr URLs >>108943641 >>108945112 as you were talking about getting a site map. 204809558 rows =
- 58 GB uncompressed
- 17 GB compressed (level 19 .zst) and each file compresses to <2 GB

>>108945112
Rather "on a small scale and targeting two specific services"
>>
>>108946689
>Couting up
Counting up
>>
File: blombooru.png (1.1 MB)
1.1 MB
I'm really liking Blombooru for local image/video archiving.

https://github.com/mrblomblo/blombooru

It's not as full-featured as the big booru software but super easy to get running. Performance has been good so far too. I'm using one instance for memes and general images, and a separate instance (with auth) for pron.
>>
File: 1755353303311666.png (19.7 KB)
19.7 KB
Bump.

Also, thoughts on this extension? https://addons.mozilla.org/en-US/firefox/addon/view-page-archive/

I haven't had much luck using it to search for stuff, if something isn't on archive.org or some archive.today mirror, then it might as well never have existed. But it seems interesting nonetheless. Pic related is the list of archives it tries to search.
>>
File: dvd3a1.png (675.3 KB)
675.3 KB
I've noticed that starting like a month ago, archive.today can no longer capture direct image links if the target image is larger than archive.today's browser height. Pretty sure that archive.today's browser viewport is 1024x768. So, high resolution pictures where you can click to zoom in, larger than 640x480 or 1024x768 = fails to capture it, eventually says "Not Found (yet?)"; I've seen this happen multiple times.

I know of one workaround: it involves making the highres image show up in a certain webpage.

For an example, attached is a photo of disc 3A of Azumanga Daioh: The Animation ("KIBA-9799") which was deleted off of archive.org/details/ by non-uploader. Adding it (https://junnew.site/raw/e2pxrafWfRp0t00kcjhuctRYV2ZhSh2mmyuWa2X6H3k = metadata) to archive.today:

Fails
https://archive.ph/?url=https://lamsachmay.store/raw/euuCXvWBN_qbXe4WhLVPba9mhj0PkDmequClMZGcFUE

Works
https://archive.ph/?url=https://web.archive.org/https://lamsachmay.store/raw/euuCXvWBN_qbXe4WhLVPba9mhj0PkDmequClMZGcFUE
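
As a sketch, the "Works" link is just the direct image URL with web.archive.org/ stuck in front, passed to archive.ph's ?url= parameter. The function name is made up, and the URL is passed through unencoded like in the example, so image URLs with their own "#" or "&" might need extra care.

```python
# Sketch: build the archive.ph submit link that routes a direct image URL
# through web.archive.org, mirroring the "Works" link above.


def archive_ph_via_wayback(image_url: str) -> str:
    """Return the archive.ph ?url= link wrapped in a web.archive.org prefix."""
    return f"https://archive.ph/?url=https://web.archive.org/{image_url}"


print(archive_ph_via_wayback(
    "https://lamsachmay.store/raw/euuCXvWBN_qbXe4WhLVPba9mhj0PkDmequClMZGcFUE"))
```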
>>
>>108947281
Concerns:

Time it takes to tag everything.
I have more than 1 million images and nearly 1 million videos. I have only 1 me, and not a million idiots who are willing to be abused by booru.org when using the deletionist websites that booru.org owns (such as rule34.xxx); so, not a million idiot users who will tag my stuff. Need some AI / image recognition thing unless I want to spend months manually doing things (I don't).

Search systems heavily lend themselves to centralization.
Search systems, especially if large scale, are almost entirely a server-side only thing, and only one server has it (localhost or a remote server). We need many servers with the same data if server-side only (like Elasticsearch) or a large scale client-side search thing. I've always wondered if there's a thing that can effectively search millions of tagged images using only client-side tools: HTML, JS, maybe also WebAssembly. There's Arweave, a decentralized and distributed permanent storage network; theoretically, there could be multiple online services that use GraphQL to search it, but last I checked there's only one: Goldsky's.

Example query using Goldsky:
>https://arweave-search.goldsky.com/graphql?query=query%20just_values{transactions(first:9,tags:[{name:%22IPFS-Hash%22,values:%0A%22[CID here]%22}]){edges{node{id%20tags{name%20value}}}}}
>https://web.archive.org/web/20250208164940/https://megalodon.jp/2025-0209-0148-57/https://archive.is:443/r2ade
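
A sketch of generating that Goldsky query URL for any CID. The GraphQL text is copied from the example query above (minus the stray newline); the function name is made up, and it does no validation, so a CID containing a double quote would break the query.

```python
# Sketch: build the Goldsky GraphQL GET URL that looks up Arweave
# transactions tagged with a given IPFS-Hash.
from urllib.parse import quote

GOLDSKY_ENDPOINT = "https://arweave-search.goldsky.com/graphql"


def ipfs_hash_query_url(cid: str, first: int = 9) -> str:
    """Return the search URL for transactions whose IPFS-Hash tag is cid."""
    # %-formatting is used because the GraphQL text is full of braces.
    query = (
        "query just_values{transactions(first:%d,"
        'tags:[{name:"IPFS-Hash",values:"%s"}])'
        "{edges{node{id tags{name value}}}}}" % (first, cid)
    )
    return f"{GOLDSKY_ENDPOINT}?query={quote(query, safe='')}"


print(ipfs_hash_query_url("EXAMPLE_CID"))  # placeholder, not a real CID
```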
>>
File: ip2lg.png (26.9 KB)
26.9 KB
>>108948621
>Perma.cc
I vaguely remember using this in the past. I had to make an account to save some web page if I remember correctly. Could only make 3 to 10 captures per year. We need more of "a web we can return to" and perma.cc isn't helping me with that:
>https://perma.cc/docs/perma-link-creation
>Memento
>Memento is a framework for accessing archived versions of web resources. Like many other web archiving services, Perma has implemented Memento. As a result, all public Perma Records are available via the Memento framework.
OK, where?! Typically that's at web root, but I see nothing under this or at the live page:
https://web.archive.org/web/*/https://perma.cc/memento/*

Nothing under that path, also proven by checking archive.is. Perhaps I can find captures labeled "perma.cc" at mementoweb.org though I have like no experience using mementoweb.org (hope I can at least search it like https://archive.is/site.com = all captures under a site or specific subdomain).

I know for sure in the past I got perma.cc to save some https://xbooru.com/ webpage. Looks like all perma.cc uploads are also uploaded to archive.org/details/; however, like I said, IA is untrustworthy. I no longer see that perma.cc xbooru.com capture in IA. I did see it in the past, probably deleted off of IA now. I can't search or find the capture in perma.cc as I no longer have the perma.cc ID which looks like ABCD-1234.
>>
>>108950185
>perma.cc memento, where?
existed in the past, not anymore
>https://archive.is/2026.05.31-182404/https://web.archive.org/web/20250727181358/https://groups.google.com/g/memento-dev/c/XHB4IezBiqA
>Tomorrow (Weds Feb 4, 2020), Perma.cc will begin the process of deploying completely reimplemented support for timegates, timemaps, and memento-related headers on Perma Link/memento playbacks. Our timemaps, timegates, and memento-related headers have been broken since early last summer; we apologize for the frustration, and that it took us so long to address.
>
>We expect the full re-indexing to take several hours, possibly up to a full day. During this time, Perma will initially return 404 for all timemap and timegate queries; partial results will be exposed in real time as the index is re-built. I'll post to this list again when it's complete.
>
>Timemaps will subsequently be available at:
>https://perma.cc/timemap/link/<url> (replacing https://perma-archives.org/warc/timemap/*/<url>)
>https://perma.cc/timemap/json/<url> (newly available)
>https://perma.cc/timemap/html/<url> (non-standard, browser-friendly format, replacing https://perma-archives.org/warc/*/<url>)
>
>and timegates will be available at:
>https://perma.cc/timegate/<url> (replacing https://perma-archives.org/warc/timegate/<url>)
>
>We will 301 redirect from the old routes to the new ones.
>
>More details are below, if anyone is interested or might be tripped up by our changes.
>
>We hope this whips Perma's Memento support into shape. We've run the validator (http://mementoweb.org/tools/validator) against our staging server (https://perma-stage.org), and it looks good, but I can now easily tweak the output as needed.
>>
>>108951098
One of those works:
>https://perma.cc/timemap/html/https://0.0g.gg/?a1a6d33cb095785d#-J5Ft7PXfHJiCwRAqfvBN4ffpFsg2otNg2wwf8JnkfwKt
>Query Results
>No captures found for https://0.0g.gg/?a1a6d33cb095785d
but it looks like it only does exact URL matches, not a list of links captured for a site or subdomain.

A URL that was captured:
>https://archive.org/details/perma_cc_M55W-652M
>https://perma.cc/timemap/html/https://www.dailymail.co.uk/news/article-5182895/Man-ordered-remove-Santa-beard-violating-burqa-ban.html
>https://perma.cc/timemap/json/https://www.dailymail.co.uk/news/article-5182895/Man-ordered-remove-Santa-beard-violating-burqa-ban.html
>Query Results
>1 capture of https://www.dailymail.co.uk/news/article-5182895/Man-ordered-remove-Santa-beard-violating-burqa-ban.html
>Captured At Perma Link / Memento Url
>Nov. 21, 2021, 6:54 p.m. https://perma.cc/M55W-652M
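
For anyone scripting these lookups, here is a minimal sketch of how the timemap and timegate URLs quoted above are put together. The function names are made up for illustration; the route layout is taken from the quoted Perma.cc announcement. It also strips the #fragment, since browsers never send fragments to the server, which is why the 0.0g.gg query above came back without one.

```python
from urllib.parse import urldefrag

PERMA_BASE = "https://perma.cc"
TIMEMAP_FORMATS = ("link", "json", "html")  # formats named in the announcement


def strip_fragment(target_url: str) -> str:
    """Drop the #fragment, which is never sent in an HTTP request."""
    return urldefrag(target_url).url


def perma_timemap_url(target_url: str, fmt: str = "json") -> str:
    """Build https://perma.cc/timemap/<fmt>/<url> (hypothetical helper)."""
    if fmt not in TIMEMAP_FORMATS:
        raise ValueError(f"unknown timemap format: {fmt!r}")
    return f"{PERMA_BASE}/timemap/{fmt}/{strip_fragment(target_url)}"


def perma_timegate_url(target_url: str) -> str:
    """Build https://perma.cc/timegate/<url> (hypothetical helper)."""
    return f"{PERMA_BASE}/timegate/{strip_fragment(target_url)}"
```

This only builds exact-URL queries, same as the web form; it does not get around the missing "all captures under a site" search.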
>>
>>108951484
>not a list of links captured for a site or subdomain
Unless perma.cc improves their software, I somehow log in with my forgotten email+password, or it shuts down, I may never find that webpage capture I got perma.cc to make years ago.

perma.cc has a Contingency Plan which includes:
>https://perma.cc/contingency-plan
>3. Publish a map file. Perma.cc will publish to third-party websites a text file mapping each Permalink back to the original archived link. The file will also map Permalinks to the corresponding third-party archives, and to any new locations where the archive has been moved during the phaseout period.

Just give me the map file right now: with only "Permalink <-> original link"!

Other than that, I looked at everything in https://archive.is/offset=1400/perma.cc and didn't see my capture. I did see:
>failed capture: https://perma.cc/4MAQ-UWJU -> https://rejouer.perma.cc/replay-web-page/w/id-8a4e972226e2/mp_/[URL here]#
>https://perma.cc/2R8K-3G6U "Perma | Whites to Lose Majority Status in U.S. by 2042 - WSJ"
>https://perma.cc/NZ6D-YMEH "Perma | Sam Heughan Felt Betrayed on 'Outlander' Over 'Unnecessary' Penis Shot - Business Insider"
>https://perma.cc/67WT-R4GD "Perma | Korean streamer suddenly dies on stream - YouTube"
>https://perma.cc/5ZG2-BXUL "INDUSTRIA GAMER IS EXACTLY THE SAME, A MAN THING. A WOMAN'S PLACE IS DOING A WOMAN'S THING, FUTILE AND IMBECILE THINGS THAT EVEN A RETARD COULD DO."
>>
I need to modify a qBittorrent .fastresume file to fix the upload-download ratio stats. Those files are the metadata files which say how many pieces of a torrent you have and so on. They're stored at
~/.local/share/qBittorrent/BT_backup/*.fastresume

Reason: I have 191.4 GiB of a torrent with a total size of 1954.8 GiB. After checking finished it downloaded 362.0 MiB more of it (didn't intend for this to happen). Right now, I don't want to download any more of it. It's messing up my share ratio because it's basing it on 362.0 MiB being total downloaded, not 0 MiB downloaded. 0 MiB being the total downloaded stat = share ratio works the same as if all of it was downloaded.

Don't know what to change in the .fastresume right now.

( Relatedly, I do know how to edit .fastresume files to fix a different problem: https://desuarchive.org/g/thread/108656842/#108776690 )
>>
>>108952524
qBittorrent v5.1.2 seems to already have some fix for that, if you're within a certain ratio/fraction. Inside the ratio:
- does NOT do: 11.84 GiB uploaded / 362.0 MiB downloaded = 33.492 share ratio
- does this, basically: 11.84 GiB uploaded / 1.909 TiB total size = 0.0061 share ratio
- actual Share Ratio stat is off by one order of magnitude for some reason: 0.06 share ratio

Outside of said ratio:
- Downloaded: 3.0 MiB (~3,145,728 B); Uploaded: 167.7 MiB (~175,846,195 B); Total Size: 165.2 MiB
- Result: 54.71 share ratio
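
The arithmetic above, redone as a quick check. The numbers are the ones posted in this thread (191.4 GiB is the amount of the torrent already on disk, from >>108952524). The last comparison is only a guess: 11.84 GiB / 191.4 GiB also comes out to about 0.06, so the displayed stat may be dividing by the data on disk rather than the total size. Not confirmed against qBittorrent's source.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def share_ratio(uploaded_bytes: float, denominator_bytes: float) -> float:
    """Plain uploaded / denominator; which denominator the client uses is the open question."""
    return uploaded_bytes / denominator_bytes


uploaded = 11.84 * GIB

# Naive ratio: uploaded / downloaded-this-session
ratio_vs_session_download = share_ratio(uploaded, 362.0 * MIB)

# Uploaded / total torrent size (1954.8 GiB, i.e. about 1.909 TiB)
ratio_vs_total_size = share_ratio(uploaded, 1954.8 * GIB)

# Guess: uploaded / data already on disk (191.4 GiB)
ratio_vs_data_on_disk = share_ratio(uploaded, 191.4 * GIB)

# The small torrent, using the exact byte counts from its .fastresume
ratio_small_torrent = share_ratio(175_914_042, 3_215_388)

for name, value in [
    ("vs session download", ratio_vs_session_download),
    ("vs total size", ratio_vs_total_size),
    ("vs data on disk (guess)", ratio_vs_data_on_disk),
    ("small torrent", ratio_small_torrent),
]:
    print(f"{name}: {value:.4f}")
```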

>Don't know what to change in the .fastresume right now.
In ~/.local/share/qBittorrent/BT_backup/$i.fastresume
>:save_path16:/some/path9:seed_modei0e12:seeding_timei8174059e19:sequential_downloadi0e10:share_modei0e15:stop_when_readyi0e13:super_seedingi0e16:total_downloadedi3215388e14:total_uploadedi175914042e8:trackersll43:
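notice how "total_downloadedi3215388e" = 3,215,388 bytes and "total_uploadedi175914042e" = 175,914,042 bytes. Bencoded integers look like "i<number>e"; the "14" and "8" right after them are the length prefixes of the next keys ("total_uploaded" and "trackers"), so leave those alone. Change the first one to
>total_downloadedi0e14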
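
If you'd rather not hand-edit a binary-ish file, here is a minimal sketch that does the same replacement on the raw bytes. Assumptions: qBittorrent is closed while you edit (otherwise it will likely overwrite the file), you made a backup first, and the key appears exactly once as "16:total_downloaded". The function name and example path are made up. It is a byte-level regex, not a real bencode parser, so in theory it could match inside other data; it refuses to write anything unless exactly one match is found.

```python
import re
from pathlib import Path

# "16:" is the bencode length prefix of the 16-character key "total_downloaded".
_TOTAL_DOWNLOADED = re.compile(rb"(16:total_downloadedi)-?\d+(e)")


def zero_total_downloaded(data: bytes) -> bytes:
    """Return the fastresume bytes with total_downloaded set to 0."""
    matches = _TOTAL_DOWNLOADED.findall(data)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one total_downloaded key, found {len(matches)}")
    return _TOTAL_DOWNLOADED.sub(rb"\g<1>0\g<2>", data)


def patch_fastresume(path: str) -> None:
    """Write a .bak copy next to the file, then patch it in place."""
    p = Path(path)
    original = p.read_bytes()
    p.with_name(p.name + ".bak").write_bytes(original)
    p.write_bytes(zero_total_downloaded(original))


# Example (hypothetical file name):
# patch_fastresume("/home/user/.local/share/qBittorrent/BT_backup/abcdef.fastresume")
```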
>>
>>108949975
Then good news! Blombooru has an optional AI image tagger.

I'm running it on piece of crap hardware so I can't test it out, but it looks like you just import a CSV of the tag set you want to use, then let the automated tagger iterate over your image collection.
>>
>>108949975
>>108953911
Also, you raise an interesting idea. It would be possible to share suggested tags between booru instances by allowing image-hash queries against instances. Those queries could then return any tags associated with that image from a given booru.
You'd need to federate or centralize, but it'd be reasonably light touch. Queries only go out when an image is added to a booru, and a response only comes back if that image hash matches one in the other booru's database. It only allows discovery of booru content in as far as confirming a specific image exists there (but it never serves it).
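
A toy sketch of that idea, with everything in memory and no networking. The class and function names are invented for illustration and have nothing to do with Blombooru's actual code. It uses SHA-256 of the file bytes, so it only matches byte-identical files; re-encoded or resized copies would need some kind of perceptual hash instead.

```python
import hashlib
from typing import Dict, Iterable, Optional, Set


class BooruInstance:
    """Hypothetical booru that answers tag lookups by image hash only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._tags_by_hash: Dict[str, Set[str]] = {}

    def add_image(self, image_bytes: bytes, tags: Iterable[str]) -> str:
        digest = hashlib.sha256(image_bytes).hexdigest()
        self._tags_by_hash.setdefault(digest, set()).update(tags)
        return digest

    def lookup_tags(self, digest: str) -> Optional[Set[str]]:
        """Return a copy of the tags if the hash is known, else None.

        The image itself is never returned, only its tags.
        """
        tags = self._tags_by_hash.get(digest)
        return set(tags) if tags is not None else None


def suggest_tags(digest: str, peers: Iterable[BooruInstance]) -> Set[str]:
    """Union of the tags every peer has for this exact image hash."""
    suggestions: Set[str] = set()
    for peer in peers:
        tags = peer.lookup_tags(digest)
        if tags:
            suggestions |= tags
    return suggestions
```

In a real setup lookup_tags would be an HTTP endpoint and suggest_tags would run once, when an image is added.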
>>
My faithful old WD Blue 1TB is at 105,952 power-on-hours according to SMART. No errors, but it can't last forever. Pensive emoji.
>>
>>108954667
That's 12 years. I think how much you use it also factors into the annualized failure rate (AFR) of HDDs. You say it's been powered on for that long, but how much do you read or write to it? I have multiple old 1TB HDDs sitting around, bit-rotting away.

>>108937861
Does ngrok allow me to pay to remove the verification wall with cryptocurrencies? If yes, it better be cheap.

Update: 10 USD per month, fuck that. Plus the only payment method is credit card: also bad. Source:
https://ngrok.com/pricing -> https://dashboard.ngrok.com/billing/choose-a-plan -> https://dashboard.ngrok.com/billing/checkout?plan=hobbyist_monthly
>>
3.4 MB
>>108940430
>basically impossible to find sensical short-or-medium/long-length sentences, especially if not looking at "random English words" book pages
Challenge: find 3 or 4 consecutive English words. This Library of Babel book titled "dncvutswesgslocann l.r m" has one page of English words:
- uncompressed version: https://archive.is/https://140.235.158.66/ipfs/*
- compressed version: https://webthree.site/raw/_5YMz3PNJCd2VaINqK-L-kjC-p2VFRq5RhzhfYxBfqE

Skimming through the "random text" part (not said page) and not using any tools, I found it difficult to find two consecutive English words. Findings (7): "sex is|try.ok|sexbag|illcpgok|loli lupcp|sixneo|fap ew". Duck.ai only allows jpg/png/gif and pdf uploads. It can only analyze 15 PDF pages, so asking "find meaningful sequences of words in this" won't help unless it can analyze the 410 pages packed into 15 pages. For PDFs, duck.ai probably has both a page limit and a text length limit.

This book was deleted off of archive.org/details/ by someone other than the uploader.

>>108940671
>the address space of libraryofbabel.app's URLs can only map to less than one quintillionth of all of the books in the Library of Babel megastructure
Issue addressed at
>https://libraryofbabel.app/about
>the room identifiers are represented as SHA-256 hashes. The actual room identifiers get incredibly long, up to roughly a million characters. This makes them much too long to be used in URLs, so instead a 'bookmark' is created and stored for the room in the form of a hash when it is first visited. This hash is used in place of the room identifier in the page URL.
>
>However, there are orders of magnitude more possible room identifiers than there are possible hashes. In theory, eventually when enough rooms are discovered there will be collisions (2 inputs map to the same output hash) and they will start to be overwritten as new, undiscovered rooms are accessed for the first time. In reality, this limit will never be reached
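
To put a number on "this limit will never be reached": a rough birthday-bound estimate for 256-bit hashes, assuming SHA-256 output behaves like uniformly random values (a standard simplification, not something the site states). The function name is made up.

```python
import math

HASH_BITS = 256  # SHA-256 output size


def collision_probability(n_rooms: int, bits: int = HASH_BITS) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits))."""
    space = 2 ** bits
    exponent = -(n_rooms * (n_rooms - 1)) / (2 * space)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is tiny.
    return -math.expm1(exponent)


# A quintillion (10**18) bookmarked rooms: the probability is vanishingly small.
print(collision_probability(10 ** 18))

# Around 2**128 bookmarked rooms a collision becomes a realistic possibility.
print(collision_probability(2 ** 128))
```

So collisions only start to matter around 2^128 visited rooms, which nobody is going to store bookmarks for.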

Vid unrelated
>>
If you use Wayback Machine, you'll sometimes get this error for days:
>web.archive.org refused to connect.
because they rate limit you. (It's not an HTTP error like 4xx or 5xx: those only happen after you successfully connect to a site.)
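
If you're scripting against it, a rough way to tell the two cases apart with Python's standard library. classify_fetch_error and probe are made-up names; the exception classes are the real urllib ones. Whether a refused connection actually means rate limiting is my reading of the situation, not something the code can prove.

```python
import urllib.error
import urllib.request


def classify_fetch_error(exc: BaseException) -> str:
    """Separate 'never connected' failures from HTTP status errors."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http-status-{exc.code}"
    if isinstance(exc, urllib.error.URLError):
        if isinstance(exc.reason, ConnectionRefusedError):
            return "connection-refused"
        return "connection-failed"
    if isinstance(exc, ConnectionRefusedError):
        return "connection-refused"
    return "other"


def probe(url: str, timeout: float = 15.0) -> str:
    """Fetch a URL and report 'ok' or the class of failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except Exception as exc:  # broad on purpose: we only want to classify
        return classify_fetch_error(exc)


# Example: probe("https://web.archive.org/")
```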

In these cases, you can still use the site in Tor Browser or with torsocks in a CLI. But how do you use a thing like torsocks but for I2P instead? First run "i2prouter start" to start the daemon then run:
>$ TZ=UTC http_proxy=http://localhost:4444 wget --no-verbose -O- http://artixmirror.i2p/ | grep Onion
>[...]
>2026-06-01 11:16:14 URL:http://artixmirror.i2p/ [4565] -> "-" [1]
> <li><a href="http://artixhnbzrty77wcrnv4a5ylx7ujro7w5ueopb6un6uxmc36lhnz2oid.onion">Onion Service</a></li>

A use case: daily usages (URLs saved) are used up for Tor users using Wayback Machine. Oh wait, for this to work, you need to use an I2P outproxy from a CLI; not sure how to do that, but with the above method you can get WARCs of eepsites (.i2p):
>$ TZ=UTC http_proxy=http://localhost:4444 wget -p -r --level=1 --span-hosts --adjust-extension --convert-links --warc-max-size=700000000 --warc-cdx -e robots=off --warc-file=w http://fluttershy.i2p/ 1>wget_out.txt 2>wget_err.txt
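
Same trick from Python instead of wget, as a minimal sketch. It assumes the I2P router's HTTP proxy is listening on localhost:4444 like in the commands above; make_i2p_opener and fetch_eepsite are made-up names. It only covers plain http:// eepsites and doesn't solve the outproxy question either.

```python
import urllib.request

I2P_HTTP_PROXY = "http://localhost:4444"  # same proxy as the wget examples


def make_i2p_opener(proxy: str = I2P_HTTP_PROXY) -> urllib.request.OpenerDirector:
    """Build an opener that sends http:// requests through the I2P HTTP proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    return urllib.request.build_opener(handler)


def fetch_eepsite(url: str, timeout: float = 120.0) -> bytes:
    """Fetch one page from an eepsite; needs 'i2prouter start' to have been run."""
    opener = make_i2p_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read()


# Example: html = fetch_eepsite("http://artixmirror.i2p/")
```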
>>
File: UsaJg.png (14.3 KB)
>>108943297
ReplayWeb.page says it can "Load Web Archive" (replay) a webpage from a HAR file.

My excitement disappeared when I tried to open a 5-MB local file
>https://dn720001.ca.archive.org/0/items/fa6cadb204274becc890306411d68a/i_localhost.har
in
>./ReplayWeb.page-2.4.6.AppImage
then got this error:
>An unexpected error occured: TypeError: Cannot read properties of undefined (reading 'size')
>>
In the end, evil triumphs over good. All life and the universe itself will die to a heat death or something.

However, things are abysmal on much shorter timescales as well. In 10,000 years, all the information that humans have now will be gone. The English language will be gone or unrecognizable by then.
>>
File: 1750816208885901.jpg (101.7 KB)
>>108955763
That's someone else's problem. I care about archiving because it's genuinely useful.
>>
>>108955763
So what? Keep going. Spread good things around. Out of spite, if you have to.
>>
>>108943297
>Torrent which has that WARC of great interest (probably dead now):
>[magnet link]
Whoa, I saw one peer from Asia with 100% of this 897.92-GiB torrent today. That folder / infohash was created in 2022. It's hundreds of gigabytes of Web ARChive file, with outlinks! So it's not just WARCs of a single website only.

This isn't just "data hoarding for no reason", I find it helpful and am accessing a file in it right now.

Question is: can ReplayWeb.page only import 3 out of the 5 GB of the .warc.gz file? Or, does it require at least 5 GB of free space on the computer which is running the ReplayWeb.page AppImage?
>>
>>108955763
The information will be gone because people don't really value information. Not because the English language will be gone or very different. In case anyone was making that connection.

If 10,000 years later, humans are more technologically advanced than they are now (or around the same level of advancement as now), then they can easily translate Industrial-Revolution-era English into whatever their language is then.

>>108958491
>can ReplayWeb.page only import 3 out of the 5 GB of the .warc.gz file?
Don't know, didn't test that (yet?).

>require at least 5 GB of free space on the computer which is running the ReplayWeb.page AppImage?
I hope it just processes the .warc and leaves the data storage and retrieval to wherever the .warc.gz is stored. It could be stored over sshfs where there's not enough space locally to store 5 new gigabytes.

Update: it does work like that! >>108943227 stats on one of the 5.38-GB warcs: around 208817 records total.

However, going to http://localhost:5471/w/id-2cc15264811f/20190903230859mp_/https://... = "localhost refused to connect"
>>
File: 1733015312812.png (16.2 KB)
>>108958636
Does ReplayWeb.page seriously not have a WebUI / thing: where the webpage replays can be accessed in any browser on any device in the LAN? I need that so I can get SingleFile to capture the replayed page.

>>108951890
Thought I found it a while ago (with the help of a third-party service), but it was some WARC created in 2026. The one I'm looking for was from 2018 or 2022. Not this one:
https://archive.is/2026.05.31-223302/https://web.archive.org/web/timemap/json?url=https://rejouer.perma.cc/&matchType=prefix&collapse=urlkey&output=json&fl=original,mimetype,timestamp,endtimestamp,groupcount,uniqcount&limit=100000
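
For reference, the Wayback Machine part of that link is a prefix query against the timemap endpoint. A small sketch that rebuilds it for any URL prefix; the parameter names and values are copied from the link above, the function name is made up, and I haven't checked what limits the endpoint enforces.

```python
from urllib.parse import urlencode

WAYBACK_TIMEMAP = "https://web.archive.org/web/timemap/json"
FIELDS = "original,mimetype,timestamp,endtimestamp,groupcount,uniqcount"


def wayback_prefix_query(prefix_url: str, limit: int = 100000) -> str:
    """Build a 'list every captured URL under this prefix' timemap query."""
    params = [
        ("url", prefix_url),
        ("matchType", "prefix"),   # match everything starting with prefix_url
        ("collapse", "urlkey"),    # one row per distinct URL
        ("output", "json"),
        ("fl", FIELDS),
        ("limit", str(limit)),
    ]
    # Keep ':', '/' and ',' readable, as in the original link.
    return WAYBACK_TIMEMAP + "?" + urlencode(params, safe=":/,")


print(wayback_prefix_query("https://rejouer.perma.cc/"))
```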
>>
File: 1590964986871.jpg (47.6 KB)
>>108958675
In my Apache server:
>ReplayWeb.page could not be loaded due to the following error:
>SecurityError: Failed to register a ServiceWorker for scope ('https://10.0.0.54/path/replay/') with script ('https://10.0.0.54/path/replay/sw.js?serveIndex=1'): An SSL certificate error occurred when fetching the script.

In server /usr/local/nginx/sbin/nginx (usr local nginx, gotta remember that path):
>Sorry, the ReplayWeb.page system must be loaded from an HTTPS URL (or localhost), but was loaded from: 10.0.0.73:8000.
>Please try loading this page from an HTTPS URL

Per this guide:
https://replayweb.page/docs/embedding/

Will have to do whatever steps (I forgot but did them in the past) to get this computer to trust a LAN IP address's HTTPS cert.
>>
>>108958919
I stopped being dumb and loaded it from:
http://localhost:8000/replay.html

However, now I get this error:
>Unexpected Loading Error: "https://replayweb.page/docs/examples/tweet-example.wacz"

>>108958675
>Thought I found it a while ago
referring to
https://archive.org/details/daily_perma_cc_2026-02-22 -> 9KE9-G3V9.warc.gz
>>
>>108958968
CORS policy error. The docs say to put this in the HTML:
><script src="https://cdn.jsdelivr.net/npm/replaywebpage@2.4.6/ui.js"></script>
>
><replay-web-page source="https://replayweb.page/docs/examples/tweet-example.wacz"
>url="https://oembed.link/https://twitter.com/webrecorder_io/status/1565881026215219200"></replay-web-page>

You want to change that to this after downloading that WACZ file to the same folder as where "replay.html" is:
><replay-web-page style="height:9999px;" source="tweet-example.wacz"
>url="https://oembed.link/https://twitter.com/webrecorder_io/status/1565881026215219200"></replay-web-page>

It's a capture of this Shitter post:
>Want to help us make the best open-source web archiving tools to empower anyone to create, portable high-fidelity web archives?
>
>We are looking for a senior dev to focus on our web crawling tools, including R&D to push the limits of web archiving!
>
>https://webrecorder.net/jobs
>7:55 PM · Sep 2, 2022
>>
File: samepath.png (77.6 KB)
>>108960189
Point of that post was that I got it working, but it seems to only work on WACZ files and not WARC files. I wish these smelly little nerds would just make the AppImage have its web replays accessible in global localhost (not localhost in the AppImage).

That would be easier. I'm having problems otherwise. For example: >>108958636
>can ReplayWeb.page only import 3 out of the 5 GB of the .warc.gz file?
Yes. If you copied the first 2.19 GB of a 5-GB .warc.gz to another file then it can load everything in that. It shows no errors or has silent errors. However, trying to load that partial file at https://replayweb.page/ fails with this error:
>Loading
>file://part.warc.gz
>...
>
>An unexpected error occured: NotReadableError: The requested file could not be read, typically due to permission problems that have occurred after a reference to a file was acquired.

Even though it has 777 permissions and is owned by my normal user. This shit is glitching.
>>
>>108943297
>>108958491
Found this anime orgy pic in that torrent. It's from this URL which isn't in web.archive.org:

20190903215637 -> https://i.pximg.net/c/1920x960_80_a2_g5/background/img/2019/01/19/05/05/32/12048318_801d09ef8901765a35dd3cfb2699fc3f.png -> HTTP 200, picrel

Probably also not in archive.is, but that site isn't loading for me now. The live version of that link as of now = HTTP 403 Forbidden.

>>108960404
>If you copied the first 2.19 GB of a 5-GB .warc.gz to another file then [ReplayWeb.page AppImage] can load everything in that
The ReplayWeb.page AppImage also doesn't care about headers / starting bytes in a certain way: nice. So you can run:
>$ zcat pbooru.com-2019-09-03-7322d8b3-00000.warc.gz | tail -c+2523460591 > p2.warc
and the resulting "p2.warc" will load up just fine in the AppImage.
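
The same zcat | tail step as a Python sketch, in case you want it inside a script. copy_decompressed_tail is a made-up name. Like tail -c+N, start_byte is 1-indexed, so N means "skip the first N-1 decompressed bytes". It streams, so nothing close to 5 GB ends up in RAM.

```python
import gzip
import shutil


def copy_decompressed_tail(src_gz: str, dst: str, start_byte: int) -> None:
    """Equivalent of: zcat src_gz | tail -c+start_byte > dst"""
    if start_byte < 1:
        raise ValueError("start_byte is 1-indexed and must be >= 1")
    with gzip.open(src_gz, "rb") as src, open(dst, "wb") as out:
        # Seeking forward in a gzip stream decompresses and discards the
        # skipped bytes, so this takes time but little memory.
        src.seek(start_byte - 1)
        shutil.copyfileobj(src, out, length=1024 * 1024)


# Example, with the numbers from above:
# copy_decompressed_tail("pbooru.com-2019-09-03-7322d8b3-00000.warc.gz", "p2.warc", 2523460591)
```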
>>
File: timemap.png (161.9 KB)
>>108960404
>but <REPLAY-WEB-PAGE> seems to only work on WACZ files and not WARC files
True. I looked into this more. As if it wasn't annoying enough that you have or can have the Gzip-compressed version and the uncompressed version of whatever.warc.gz, I have to make a .wacz version of the WARC. First, install py-wacz:
>https://pypi.org/project/wacz/
>$ pip install wacz

>>108960578
>ReplayWeb.page AppImage also doesn't care about starting bytes in a certain way
wacz 0.5.0 does care though. Need a way to edit an uncompressed .warc file without vim trying to make a backup file. Anyways, you gotta run this so wacz version 0.5.0 can make the Web Archive Collection Zipped file:
>$ wacz create -o /mnt/usb/myfile.wacz ~/Downloads/p4.warc

But hey, it finally works. Now the problem is that SingleFile 1.22.81 can't get it. (I forgot where I got this image from.)

>>108960997
>> mfw anon thinks a domain checker is the missing piece for web archiving
I never said that. I said it's an interesting site. It revealed some things I didn't know about.
>>
How do I go about ripping/encoding a DVD I got? I have a drive for it already.
>>
>>108961059
Back when I used Windows a decade ago, I used these programs to rip many DVDs (mainly movies): to MKV and ISO. Software:
- MakeMKV
- AnyDVD, or was it called "AnyDVD Pro"?, I used a cracked version (warez)

I don't remember ever doing much DVD ripping with Linux. Maybe "$ sudo cat /dev/sr0 > file.iso" works, or open it in VLC first so that decodes it.
>>
>>108961043
The REPLAY-WEB-PAGE html element uses iframe and shadow root, both things which might be "impossible" for stuff such as SingleFile version 1.22.81 to capture.

Ugh.

>>108958636
Somewhat obvious, but if you open a .warc(.gz) in
>https://replayweb.page/?source=file%3A%2F%2Fpbooru.com-2019-09-03-7322d8b3-00000.warc.gz
Then it first needs to load the 5 GB file into RAM. Doesn't really matter since that too would use said iframe and shadow root DOM.
>>
File: we got a discord.jpg (131.5 KB)
>>108961016
>Implying I have any say in the matter
giw
>>
>>108961140
Possible solution: use ipwb (InterPlanetary Wayback) instead of ReplayWeb.page. As I remember, ipwb lacked archival fixity, so unlike ReplayWeb.page, it couldn't show non-Base64-encoded images in webpages. At least ipwb doesn't use iframe + shadow DOM / web components like ReplayWeb.page. Plus, all the images in the webpage I'm thinking of are broken/not-grabbed in ReplayWeb.page's replay as well. Both ipwb and the other correctly render CSS and probably also JS.

I'm running this:
>$ ipwb --daemon /dns/10.0.0.76/tcp/5001/http index /pathTo/00000.warc.gz > /pathTo/00000.warc.cdxj
It took roughly 3 minutes to get started and now it says:
>Processing WARC records in 00000.warc.gz: 409/104415

It's helpful to make this modification to "/usr/local/lib/python3.12/dist-packages/ipfshttpclient/client/__init__.py" beforehand:
https://github.com/ipfs-shipyard/py-ipfs-http-client/commit/c191872706e1118d2cd76ea326a2a8d580899353

Picrel shows the entirety of the edit which is useful for ipwb's usage of ipfshttpclient.
>>
109.2 KB
>>108961477
>giw
What does this mean?

>>108961111
>or was it called "AnyDVD Pro"?
AnyDVD HD 8.4.8.0; this screenshot of it was deleted off of archive.org/details/ by someone other than the uploader, so I added it to ar://:
https://kingoffireland.store/raw/oTEbjOKt2ld91N_owwYtvmkSEsDvE1Ch3l0nGkMDR6E
>>
>>108961140
SingleFile dev complaining about shadow roots:
>https://news.ycombinator.com/item?id=20232628
>On my side, the criticism I could make to web components is that there is no standard to serialize their shadow roots and, therefore, they are not deserializable without using JavaScript. I have been maintaining SingleFile [1], a web extension to save complete web pages, for 9 years and this is the first time I have had to include JavaScript code [2] to attach and display the shadow root of the web components (e.g. embed tweets) included in the saved page.

He also said:
>Thanks! You can find web components in a lot of unexpected places. For example, this page [1] contains more than 10K web components... The good news is that once the Pandora's box is open, I had the idea to code SingleFileZ [2] which also requires JavaScript to be enabled but frankly uses it!
10,000 shadow DOMs in one webpage? Disgusting.

>>108962297
"got it wrong" maybe.
>>
File: kxsHC.png (39.4 KB)
>>108923876
>>108960959
IA trannies are aware of this issue in the face of popular news outlets wanting to be excluded from the WBM. They will probably cave to their demands, just like WBM always does, and excessively does. (I'd once again like to take this opportunity to say fuck archive.org.)

As of today or yesterday, you can go to any https://archive.org/... page and see their message about it (picrel):
>Keep the news in the Wayback Machine. Sign Fight for the Future's letter. [ https://www.savethearchive.com/NewsLeaders ]
following that link, you can see:
>https://archive.is/2026.06.02-012236/https://www.savethearchive.com/NewsLeaders
>Are you a journalist? Join Rachel Maddow and sign the journalist letter here.
>A project by Fight for the Future
>Tell New York Times, The Atlantic, and USA Today to keep the crucial work of journalists in the Wayback Machine!
>The news isn’t getting preserved in the Wayback Machine anymore because major media outlets are blocking it.
>This petition is a demand for them to stop.
>>
Why'd you or someone else delete your posts? Promoting some thing in a spammy/AI way? I found >>108960981 to be especially suspect. Also the thing you said about csv files sounded sorta dumb.

>>108962370
that pic is login-walled in IA (would add it to ar:// but it's not flat-out deleted, yet). maybe I'll add it anyways...
>>
File: 1754254849340668.png (85.9 KB)
>>108962297
>>108962320
"giw" is more likely "god i wish" since anons sometimes abbreviate "god i wish that were me" to "giwtwm"

>>108962393
>Why'd you or someone else delete your posts?
NTA, but it could be unrelated. I had all my posts nuked once after pretending to be an LLM despite the fact that the posts before that were perfectly normal replies.
