The first thing I set up when I started managing my own Kubernetes cluster more than a year ago was this Warrior; I had completely forgotten about it until this post.
It has been active for over a year, steadily working the recommended project. It downloaded over 3TB in the last 6 days (a node reboot restarted the pod, and the stats are not persistent), so a rough extrapolation over the full year is about 180TB. Happy to help the good cause of the ArchiveTeam!
Edit: typo
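For anyone checking the extrapolation in the comment above, here is a minimal sketch of the arithmetic. It assumes the 6-day rate held steady over a 365-day year, which the comment itself only calls "rough"; the function name is made up for illustration.

```python
def extrapolate_tb(observed_tb: float, observed_days: float, total_days: float) -> float:
    """Scale an observed download volume linearly to a longer period."""
    return observed_tb / observed_days * total_days


# 3 TB seen in 6 days, projected over an assumed 365 days of uptime.
estimate = extrapolate_tb(3, 6, 365)
print(f"{estimate:.1f} TB")
```

That lands a little above the "about 180TB" quoted, which is consistent with rounding down a rough estimate.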
ch71r22 38 days ago [-]
For anyone else interested in running this, it only took a couple seconds to launch their docker-compose.yml
I noticed from the docker overlay filesystem that the container was spraying files all over the disk. (Ephemeral, destroyed on container shutdown, sure, but I wanted to reduce write-wear on my ssd...)
I tried setting it up with /tmp as a tmpfs (ramdisk) but it then refused to start...
Anyone know any broad-spectrum docker incantations to force all overlay writes to RAM, for a container?
michaelt 38 days ago [-]
> I wanted to reduce write-wear on my ssd
Modern SSDs are pretty good at things like wear levelling.
For example [1] reports that a bunch of 256 GB SSDs lasted to 2000+ terabytes written, and a handful up to 7000 terabytes written. So you could saturate a 100 megabit internet connection for 5 years before even a small SSD would wear out. And an SSD 4x the size has 4x the life.
If you're running on a Raspberry Pi with a microSD card for storage, feel free to keep worrying though :)
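A quick sanity check of the "100 megabit for 5 years" figure, as a sketch only. It assumes every downloaded byte is written to the SSD exactly once, decimal units (1 TB = 10^12 bytes), and no write amplification; the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def terabytes_written(link_mbit_per_s: float, years: float) -> float:
    """TB written if a saturated link is dumped straight to disk for `years`."""
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8
    return bytes_per_second * years * SECONDS_PER_YEAR / 1e12


def years_to_reach(endurance_tb: float, link_mbit_per_s: float) -> float:
    """Years of saturated downloading needed to write `endurance_tb` terabytes."""
    bytes_per_year = link_mbit_per_s * 1_000_000 / 8 * SECONDS_PER_YEAR
    return endurance_tb * 1e12 / bytes_per_year


print(f"{terabytes_written(100, 5):.0f} TB written in 5 years at 100 Mbit/s")
print(f"{years_to_reach(2000, 100):.2f} years to reach 2000 TB written")
```

Under those assumptions, five years at 100 Mbit/s comes out just under 2000 TB, so the claim holds for the 2000+ TB drives mentioned in [1].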
> the container was spraying files all over the disk
Right, that's basically the point...the Warrior downloads files, compresses them, and uploads them for archival. This necessarily requires staging the files somewhere between download and upload.
> Anyone know any broad-spectrum docker incantations to force all overlay writes to RAM, for a container?
Why would you want this? This sounds like a terrible footgun.
myself248 38 days ago [-]
The Warrior doesn't resume old jobs after a power cycle, so what's the point of committing anything at all to non-volatile storage?
rustyminnow 38 days ago [-]
They say exactly why they want it... "I wanted to reduce write-wear on my ssd"
j4ah4n 38 days ago [-]
I think you'll just need to mount it at the right place, with the right permissions.
You can put the entire docker directory in a ramdisk, the same way you would when moving it to a secondary hard disk. Risky though, as a reboot would wipe everything.
Isn't there substantial risk involved in having who knows what scraped from your IP?
tech234a 37 days ago [-]
Yes but many projects are usually restricted to specific websites. A few projects, such as the URLs project, are generally unrestricted.
honestSysAdmin 38 days ago [-]
[dead]
badlibrarian 39 days ago [-]
Many of these sites are already captured and archived by proper entities as required by federal law. More is better, I guess, except when it isn't. Duplication of effort is a huge problem in the humanities in general and with archiving in particular.
The whole concept needs to be rethought. Captures from these tools show up under "ArchiveTeam" which is currently pumping thousands of copies of the Google Home Page into the Wayback Machine every week. Or at least trying to.
Like so many things about archive.org, when you dig in you start to find wonder and craziness at every turn.
myself248 39 days ago [-]
> by proper entities as required by federal law.
What federal law do you suppose is guiding the mass deletions? That doesn't look like archiving to me. Now that the foxes are running the henhouse, how reliable do you suppose their own archives are?
badlibrarian 39 days ago [-]
Some of the mass deletions are merely a new administration setting up shop. Policies from the previous administration don't belong on the current whitehouse.gov. They wind up here instead https://bidenwhitehouse.archives.gov/
We pay half a billion in tax dollars for the National Archives, and nearly a billion to the Library of Congress to preserve these records. Others are managed as part of Presidential Libraries.
Thousands of employees, dozens of facilities, billions of dollars.
Meanwhile archive.org doesn't have air conditioning and preserves physical material within the blast radius of an oil refinery. They let vagrants sleep on their steps yet seem surprised when they set the utility pole outside on fire.
I didn't say it didn't need to be done. I said the whole process needs to be rethought with professional supervision. Setting up more volunteer K8s clusters so that more copies of the Google Home Page can be captured with the wrong user agent isn't going to save democracy.
toomuchtodo 38 days ago [-]
Archive.org is outside of the reach of the US government, and is globally distributed. When the US government deletes or darks data (as it has recently done across wide swaths of the federal government website properties), you have no recourse. This means your argument about the resources that go into the US government as a data custodian are meaningless: the outcome is what is material, which is the archival and long term custody & availability of the data sets in scope. Arguably, the Internet Archive has recently proven better at this job than the US government (unsurprising).
You're angry at a high value non profit operating on a limited budget. It's weird. I recommend focusing on more important issues than "it is icky around the richmond facility, the power goes out once in a while, and they use ambient air and convection for system cooling which I don't like."
If you want to save democracy, the Internet Archive doesn't do that itself. It protects the historical record. If you want to save democracy, that's a different conversation.
I would classify the end of term web archive (which archive.org is, in its typical fashion, taking far too much credit for) as an example of entities doing things right.
And saying "archive.org is outside the reach of the US government" -- hell, it's not even outside the reach of the RIAA or the book company with the little penguin on the cover.
We should have proper supervised federal archiving and archive.org should be far better run, too.
And I don't know what Archive Team is but maybe they could update their site to provide some information on the people involved. And perhaps update their understanding of what's possible with docker containers while they're at it.
Because the counterpoint to a radicalized Musk screwing around with government databases isn't an opposing group of anonymous radicals screwing around with commercial databases.
gumball-amp 38 days ago [-]
I'm interested in why you are saying that the Internet Archive is taking too much credit for the end of term web archive. The website you link to demonstrates that it's run by the Internet Archive, although various partners have joined it since it began.
Is that not correct?
> And I don't know what Archive Team is but maybe they could update their site to provide some information on the people involved.
You don't need to reveal your identity, but looking through your comments, it looks like you originally spun up this account to criticize the Internet Archive. I'll just note that accusing others of being "anonymous radicals" falls a little flatter when you're anonymous yourself.
(Relevant disclosure: I've worked with IA and Brewster Kahle, and defended him here before.)
badlibrarian 38 days ago [-]
> Is that not correct?
It's not run by the Archive. It's a collaboration. They didn't even do all the crawling, and the Library of Congress keeps a copy.
As for Archive Team, their site declares "Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths."
Dedication is great. And radicalization in response to copyright and preservation certainly deserves some leeway. But a little professionalism wouldn't hurt and the 2600-era roleplay isn't fooling anyone.
toomuchtodo 38 days ago [-]
Agree the US government should contribute in some capacity. Agree they should be robustly funded to do this. But checks and balances are also important, and when a node goes rogue or dark, the system must be fault tolerant and operate when degradation occurs. I previously said "I trust Brewster and the rest of the IA gang more than the US government to safeguard the Internet Archive." [1] I feel this assertion has been proven out over the last few weeks.
ArchiveTeam stands on its own as an independent, community driven volunteer digital archival and preservation effort. If you don't understand why, what, and how they operate, look closer and be more curious [2].
If the checks and balances of NARA and LOC (6,000 employees, $1.5 billion in annual funding) is Brewster Kahle asking for $10 on pages serving pirated Nintendo games, then we're in a bit of trouble, aren't we?
toomuchtodo 38 days ago [-]
On the contrary, the fact that a single person's charitable digital archive can stand toe to toe with a global superpower's archival efforts is a sign that success is possible. We may see things differently though, and that's fine. Would I want to fund NARA and LOC more? Or the Internet Archive? I prefer the latter. Checks and balances. I have donated $10 on your behalf (in addition to my annual donations).
(lots of good people at NARA and the LOC, but they are subject to the whims of the US electorate, which is not great; the Internet Archive is not)
badlibrarian 38 days ago [-]
I think archive.org deserves more funding and I also think they need to decide if they're an archive, a library, or a pirate site. Since each has a different set of costs, legal risk, and projected longevity.
For the record my opinion is that they need to focus on archival and with a few tweaks could make it safe for more users to upload more material. Going legit archive (as their name implies) instead of hiding behind the DMCA and playing high-stakes poker with copyright law would also make it possible for more entities to provide direct support.
I also disagree that NARA and LoC is subject to whims of the electorate. The Library of Congress is set up to serve, well, Congress. Who funds it. Lotta barriers to cross there, even in these weird times.
I'll take that risk over one guy with limited governance who seems genuinely surprised that he keeps getting hacked and sued. There's a chance the whole thing goes away because he couldn't resist serving up free Frank Sinatra records and got hit with a $621 million lawsuit after he thrice refused to take the stuff down.
toomuchtodo 36 days ago [-]
> I also disagree that NARA and LoC is subject to whims of the electorate.
I trust these folks far more than the "not affiliated with archive.org" Archive Team and their wget scripts that somehow jam data via backdoor into the web.archive.org database.
If someone pulls nonsense at archives.gov, whistles will be blown and the press will respond. When nonsense goes on at archive.org, I see hagiographies, third party apologists, and people who lack the qualifications to get a job filing paper in a University library mismanaging simple archival projects.
taurknaut 38 days ago [-]
> They let vagrants sleep on their steps yet seem surprised when they set the utility pole outside on fire.
Tbf I have let many people sleep on my doorstep and none of them tried to set my building on fire. One of them even sang for me; he had a killer baritone. Overall it seems like a fairly harmless thing.
badlibrarian 38 days ago [-]
I wasn't speaking metaphorically. Fire set to pole. Site went down.
johnklos 38 days ago [-]
You really do live up to your name!
You imply that archive.org is somehow doing something wrong by letting "vagrants" sleep on their steps. I'd assert that people who are compassionate are more trustworthy than people who think punishing others should be normalized. I'd definitely prefer my backups in the hands of compassionate people.
The problem is that the people who want to see others be punished can't be trusted to, you know, not do that. Removing information about climate change, about vaccines, about trans care, et cetera, very well could happen at the hands of those who get off on punishing others.
You say the National Archives already does this. What happens when the current administration fires everyone and replaces them with non-professionals?
So I really don't know why you'd be in here talking ish about ArchiveTeam.
badlibrarian 38 days ago [-]
> I'd definitely prefer my backups in the hands of compassionate people.
I prefer them in the hands of competent people, in a building with climate control.
Heard about the time these compassionate folks tried to run a bank and got shut down in the Obama era?
> Unwillingness to open accounts within the field of membership, make loans, and establish operations in the low-income community where the credit union was chartered to serve
How do I as a non-US citizen get access to information from those "proper entities"? Is it even possible for US citizens? This is often a surprise for some visitors of this fine website, but there's a large world outside the US where "federal law" does not apply.
badlibrarian 39 days ago [-]
We fund the Library of Congress (largest library in the world) and the National Archives (NARA), who make all of this stuff public. Other governments do similar things. It's all on the web.
There are other agencies and data sources to be monitored of course but I'm not seeing a lot of nuance in those efforts yet.
jfkrrorj 38 days ago [-]
[flagged]
badlibrarian 38 days ago [-]
Suspicion warranted, but citation needed. For now my money is on archives.gov over archive.org.
And loc.gov over a website that prioritizes making Pac-Man and Donkey Kong playable in the browser yet leaked the drivers licenses and passports of its patrons, and whose public policy on their wonky javascript UX is "don't read books on a phone."
Tijdreiziger 38 days ago [-]
> leaked the drivers licenses and passports of its patrons
I write on behalf of Internet Archive to inform you about a security incident that involved personal information about you. We regret that this incident occurred and take the security of personal information seriously.
On October 20, 2024, we discovered suspicious activity involving our customer service platform. Specifically, between October 17, 2024 and October 20, 2024, an unauthorized actor obtained access to our customer service platform, which contained information about requests from certain Internet Archive users.
As soon as we learned of the incident, we took action to contain the incident, including by terminating the unauthorized access, and then launched an investigation to determine the nature and scope of the access.
We have determined that the personal information involved in this incident included your name and government ID information such as a driver’s license or passport.
As noted above, we took action to contain the incident and investigate it, including by temporarily taking the customer service platform offline.
You should always remain vigilant for incidents of fraud and identity theft, including by regularly reviewing and monitoring your accounts. If you discover any suspicious or unusual activity on your accounts or suspect identity theft or fraud, be sure to report it immediately to your financial institutions. If you have been a victim of fraud, you can report it to your local police.
Please know that we regret any inconvenience or concern this incident may cause you. Please do not hesitate to contact us at info@archive.org if you have any questions or concerns.
Sincerely,
Internet Archive
jfkrrorj 38 days ago [-]
[flagged]
mplewis 38 days ago [-]
Hey man, you only created your account one day ago and you're going to town flooding this site with bad takes. What's the deal?
https://github.com/ArchiveTeam/warrior-dockerfile/blob/maste...
[1] https://www.reddit.com/r/chia/comments/mukiwz/are_we_overthi...
Demonstrated here https://stackoverflow.com/questions/39193419/docker-in-memor...
https://web.archive.org/web/20250122000033/www.google.com
https://blog.archive.org/2024/05/08/end-of-term-web-archive/
https://web.archive.org/collection-search/EndOfTerm2024PreEl...
(no affiliation)
https://eotarchive.org/partners/
https://eotarchive.org/about/
[1] https://news.ycombinator.com/item?id=41984664
[2] https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence
National Archives Workers Unsure If Marco Rubio Has Secretly Been Their Boss for Weeks - https://www.404media.co/national-archives-workers-unsure-if-... - February 6th, 2025
Please report back when the transition from the 11th Archivist to the 12th Archivist causes any data or paper loss.
https://www.archives.gov/about/history/archivists
https://www.archives.gov/about/organization/senior-staff
https://ncua.gov/newsroom/press-release/2016/internet-archiv...
https://www.archives.gov/presidential-records/research/archi...
Source?
https://www.newsweek.com/internet-archive-hacked-zendesk-197...
-----
Subject: Notice of Data Security Incident
January 6, 2025