Micro feed

Thoughts tagged "artificial intelligence"

Short thoughts, notes, links, and musings by Molly White. RSS

always a little weird to see in politicians' footnotes

URL fragment containing utm_source=chatgpt.com

Two-frame meme where a man hits the foot of a strongman with a club while another man looks on. The strongman then hits the onlooker, not the man with the club. The man with the club is labeled “Meta”, and the onlooker is labeled “LibGen”

Shadow libraries and AI training

For those just learning about LibGen because of the reporting on Meta and other companies training LLMs on pirated books, I’d highly recommend the book Shadow Libraries: Access to Knowledge in Global Higher Education (open access).

I just read it while working on the Wikipedia article about shadow libraries, and it’s a fascinating history.

I fear the already fraught conversations about shadow libraries will take a turn for the worse now that they’re overlapping with the incredibly fraught conversations about AI training.

BlueSky’s “user intents” is a good proposal, and it’s weird to see some people flaming them for it as though this is equivalent to them welcoming in AI scraping (rather than trying to add a consent signal to allow users to communicate preferences for the scraping that is already happening).

I think the weakness with this and Creative Commons’ similar proposal for “preference signals” is that they rely on scrapers to respect these signals out of some desire to be good actors. We’ve already seen some of these companies blow right past robots.txt or pirate material to scrape.

I do think that they are good technical foundations, and there is the potential for enforcement to be layered atop them.

Technology alone won’t solve this issue, nor will it provide the levers for enforcement, so it’s somewhat reasonable that they don’t attempt to.

But it would be nice to see some more proactive recognition from groups proposing these signals that enforcement is going to be needed, and perhaps some ideas for how their signals could be incorporated into such a regime.

new lino print

An oval, black ink print with a raccoon baring its teeth in the center, and “ignore all previous instructions” written in angular capitals around the border

CC BY 4.0

thanks, ChatGPT

Assistant: Certainly! Here’s a smoother transition to tie your points together: [TODO: smooth transition]

Fighting bots is fighting humans

One advantage to working on freely-licensed projects for over a decade is that I was forced to grapple with this decision far before mass scraping for AI training.

In my personal view, option 1 is almost strictly better. Option 2 is never as simple as "only allow actual human beings access" because determining who's a human is hard. In practice, it means putting a barrier in front of the website that makes it harder for everyone to access it: gathering personal data, CAPTCHAs, paywalls, etc.

This is not to say a website owner shouldn't implement, say, DDoS protection (I do). It's simply to remind you that "only allow humans to access" is just not an achievable goal. Any attempt at limiting bot access will inevitably allow some bots through and prevent some humans from accessing the site, and it's about deciding where you want to set the cutoff. I fear that media outlets and other websites, in attempting to "protect" their material from AI scrapers, will go too far in the anti-human direction.

Here's the problem: establishing that AI training requires a copyright license will not stop AI from being used to erode the wages and working conditions of creative workers. The companies suing over AI training are also notorious exploiters of creative workers, union-busters and wage-stealers.

Telling creative workers that they can solve their declining wages with more copyright is a denial that creative workers are workers at all. It treats us as entrepreneurial small businesses, LLCs with MFAs negotiating B2B with other companies. That's how we lose.
On the other hand, if we address the problems of AI and labor as workers, and insist on labor rights – like the Writers Guild did when it struck last summer – then we ally ourselves with every other worker whose wages and working conditions are being attacked with AI.
Our path to better working conditions lies through organizing and striking, not through helping our bosses sue other giant multinational corporations for the right to bleed us out.

One flaw of the LLMs I've used: they will never give you harsh criticism. While it would be nice to think all my writing is just that good, I know there are no circumstances where someone will ask for feedback and it will say “throw the whole thing out and start again.”

Assistant: You've got a compelling narrative and some valuable insights here.

Rubber duck editing with LLMs

Here are some prompts I'm using as I experiment with LLMs¹ to see if they can help refine a draft of a talk I'm writing.

M: Here is a draft of a talk I wrote:
<draft>
It feels a bit scattered and disorganized to me. Please help me clean it up.

LLM: You've got a compelling narrative and some valuable insights here. However, it could be more tightly organized and concise for better impact. Here’s a revised and more structured version of your talk:
<clip>

M: Rather than rewriting it, can you just identify some flaws with the existing piece and recommendations for how to improve it? I don't want to reduce the length much, but I am happy to cut some pieces and add different lines of thought.

LLM: Certainly! Here are some observations and recommendations for how to refine your existing piece:
### Introduction
**Flaw:**
- Initial repetition
- Possibly too personal and less engaging for the audience who may not relate directly to your journey.
**Recommendation:**
- Streamline the introduction and reduce redundancy.
- Kickstart with a slightly broader hook that can capture the audience's interest more universally before diving into your personal connection.<clip>

M: Is the order of the talk good, or should it be rearranged?

LLM: The structure of your talk is generally logical, but it could benefit from some reordering to enhance clarity and impact. Here’s a suggested structure:
<clip>

M: Should I remove the discussion of [topic]?

LLM: The discussion about [topic] in your talk serves several purposes: it establishes your credibility, connects to your audience through a relatable experience, and sets the stage for your interest in research and documentation. However, the details can be streamlined to serve these purposes more efficiently. Here are some considerations:
<clip>

M: Thank you. Can you rewrite the talk to incorporate this feedback? Maintain approximately the same length, and do not change the voice or tone. If there are places where you think new ideas need to be incorporated, feel free to mark them with [TKTK] and a brief summary of what should be added if you are not confident you can match the tone or message.

Unsurprisingly, I still found the LLM was not at all helpful when it was trying to write or rewrite the talk. It tends to slip in to that very inauthentic “LLM voice” I so dislike, and the output is inevitably bland, uninteresting, and... well, ChatGPT-sounding. I didn't really expect this to be much good, and it wasn't, but I like to try it once in a while just to see if it spits out anything interesting.

While much of this comes down to the limitations of LLMs, I also think that most writers² just aren't going to like any kind of writing that substantially isn't their own, especially when the topic is personal to them or where they are trying to express their own unique ideas.

However, when I asked “Is the order of the talk good, or should it be rearranged?”, the LLM provided an outline that really helped me to clarify the natural structure that had formed in the draft of the talk I had already written. From there, I was able to tighten things up quite a bit, and by the end it really resolved for me my feeling that the talk was a little bit meandering.

I also liked the format of the replies when I asked it to identify flaws, and when I asked if I should remove a portion. The flaws/recommendations layout was very clear and helped me think more about the potential weaknesses it had identified (some of which I agreed with, and some of which I didn't).

Similarly, when I asked it if I should remove a portion of the talk, it responded with a list of reasons I should keep it, and a list of potential upsides to removing it, before providing its suggestion. I ended up agreeing that it was worth keeping, though that was certainly a portion of the talk that needed some polish.

Altogether, the exercise reinforced my previous opinion: LLMs are terrible writers, but decent editorial rubber ducks.

In this case, GPT-4o. ↩
With the exception of high schoolers trying to finish a forced essay so they can go play video games. ↩

Page