61% of responses had at least one type of insufficiency. Over one-third of answers included incorrect information, making it the most common problem we observed. The inaccuracies ranged from relatively minor issues (such as broken web links to outside resources) to egregious misinformation (including incorrect voter registration deadlines and the false claim that election officials are required to provide curbside voting). Every model hallucinated at least once, offering information that was entirely constructed by the model, such as descriptions of a law, a voting machine, and a disability rights organization that do not exist.
Shadow libraries and AI training
For those just learning about LibGen because of the reporting on Meta and other companies training LLMs on pirated books, I’d highly recommend the book Shadow Libraries: Access to Knowledge in Global Higher Education (open access).
I just read it while working on the Wikipedia article about shadow libraries, and it’s a fascinating history.
I fear the already fraught conversations about shadow libraries will take a turn for the worse now that they’re overlapping with the incredibly fraught conversations about AI training.
Bluesky’s “user intents” proposal is a good one, and it’s weird to see some people flaming them for it as though it were equivalent to welcoming in AI scraping, rather than an attempt to add a consent signal that lets users communicate their preferences about the scraping that is already happening.
I think the weakness of this, and of Creative Commons’ similar proposal for “preference signals,” is that they rely on scrapers respecting the signals out of some desire to be good actors. We’ve already seen some of these companies blow right past robots.txt, or simply pirate the material they want to train on.
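The signals we already have show the problem. A robots.txt file can ask AI crawlers to stay out, using the user-agent tokens that OpenAI, Common Crawl, and Google have documented for their crawlers, but honoring it is entirely voluntary. A minimal sketch (check each company’s current documentation for the up-to-date tokens):

```
# robots.txt -- ask known AI training crawlers to stay out.
# Compliance is voluntary: a scraper that ignores this sees no barrier at all.

User-agent: GPTBot          # OpenAI's training crawler
Disallow: /

User-agent: CCBot           # Common Crawl, whose dumps feed many training sets
Disallow: /

User-agent: Google-Extended # Google's AI-training opt-out token
Disallow: /

# Everyone else (browsers, ordinary search indexing) is unaffected.
User-agent: *
Allow: /
```

User intents and preference signals would have the same character: they communicate, they don’t enforce.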
I do think that they are good technical foundations, and there is the potential for enforcement to be layered atop them.
Technology alone won’t solve this issue, nor will it provide the levers for enforcement, so it’s somewhat reasonable that these proposals don’t attempt to.
But it would be nice to see some more proactive recognition from groups proposing these signals that enforcement is going to be needed, and perhaps some ideas for how their signals could be incorporated into such a regime.
“Wait, not like that”: Free and open access in the age of generative AI
Fighting bots is fighting humans
One advantage to working on freely-licensed projects for over a decade is that I was forced to grapple with this decision long before mass scraping for AI training: you either leave your content open to everyone, bots included (option 1), or you try to restrict access to actual human beings (option 2).
In my personal view, option 1 is almost strictly better. Option 2 is never as simple as “only allow actual human beings access,” because determining who’s a human is hard. In practice, it means putting barriers in front of the website that make it harder for everyone to access: gathering personal data, CAPTCHAs, paywalls, and so on.
This is not to say a website owner shouldn't implement, say, DDoS protection (I do). It's simply to remind you that "only allow humans to access" is just not an achievable goal. Any attempt at limiting bot access will inevitably allow some bots through and prevent some humans from accessing the site, and it's about deciding where you want to set the cutoff. I fear that media outlets and other websites, in attempting to "protect" their material from AI scrapers, will go too far in the anti-human direction.
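To make that cutoff concrete, here’s a deliberately naive sketch of the kind of heuristic filter sites actually deploy (my own illustration, not anything from the post; the token list and rate threshold are made up). Every rule in it fails in both directions.

```python
# A deliberately naive bot filter, to illustrate the tradeoff:
# every heuristic below lets some bots through and locks some humans out.

KNOWN_BOT_TOKENS = ("bot", "crawler", "spider", "scrapy")  # made-up list

def is_probably_bot(user_agent: str, requests_last_minute: int) -> bool:
    ua = user_agent.lower()

    # Heuristic 1: user-agent matching. Misses any scraper that spoofs a
    # browser user-agent (trivial to do), and flags legitimate tools like
    # feed readers or accessibility crawlers.
    if any(token in ua for token in KNOWN_BOT_TOKENS):
        return True

    # Heuristic 2: rate limiting. Misses slow, distributed scrapers, and
    # flags humans behind shared corporate or campus NATs, who all appear
    # to the server as one very busy "user".
    if requests_last_minute > 60:  # arbitrary cutoff: this line IS the tradeoff
        return True

    return False

if __name__ == "__main__":
    # A spoofed browser string sails through; an honest bot gets caught.
    print(is_probably_bot("Mozilla/5.0 (Windows NT 10.0)", 5))          # False
    print(is_probably_bot("ExampleBot/1.0 (+https://example.com)", 5))  # True
```

Tighten either rule and you block more scrapers and more people at once; loosen them and both slip through. Choosing the numbers is choosing where on that tradeoff you sit.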
What I learned from this experiment is that flooding the internet with an infinite amount of what could pass for journalism is cheap and even easier than I imagined, as long as I didn’t respect the craft, my audience, or myself. I also learned that while AI has made all of this much easier, faster, and better, generative AI did not invent the practice. It simply adds to a vast infrastructure of tools and services built by companies like WordPress, Fiverr, and Google, designed to convert clicks to dollars at the expense of quality journalism and information, polluting the internet we all use and live in every day.
And I think that’s why talk of AI bums me out. AI isn’t a solution looking for a problem—it’s a market strategy made by people who aren’t even thinking about problems. Tech companies want to make sure they own the future, even when they’re not sure what the future is going to be. It’s so, so boring, and it’s making me very sleepy.