Default No to AI Training on Your Stories

Fair use in the age of AI: Credit, compensation, and consent are required.

Tony Stubblebine
The Medium Blog

--

This is an evolving issue. I’m putting out our new stance, along with everything we currently understand about how AI trains on stories you publish online. Our goal is to get feedback from you — our writers, readers, editors, and curators.

We’re in an exciting moment for artificial intelligence. Companies have recently made awe-inspiring advances in AI’s ability to generate text and images. I’m not a hater, but I also want to be plain-spoken: the current state of generative AI is not a net benefit to the Internet.

These AI advances were made by training on publicly available text and images on the Internet — text like the stories you write on Medium.

Unfortunately, the AI companies have nearly universally violated fundamental principles of fairness: they are making money from your writing without asking for your consent, and without offering you compensation or credit. There’s a lot more one could ask for, but these “3 Cs” are the minimum.

From our experience, the reality of AI is even worse than an issue of fairness. You would hope that these AI innovations would lead to a better Internet. For example, AI as a writing aid has the potential to empower new voices who may have previously gone unheard for lack of writing ability.

But in practice, the overwhelming experience of our readers, editors, and curators is that they are seeing more AI-generated writing that amounts to nothing more than spam: plausible sentences that are unreliable in substance and fact.

To give a blunt summary of the status quo: AI companies have leeched value from writers in order to spam Internet readers.

In response, we’ve already made clear that Medium is a home for human writing. Then, we built a human-led recommendation system to protect our readers from purely AI-generated text (human curators spot it easily).

Now, we’re adding one more dimension to our response. Medium is changing our policy on AI training. The default answer is now: No.

We are doing what we can to block AI companies from training on stories that you publish on Medium and we won’t change that stance until AI companies can address this issue of fairness. If you are such an AI company, and we aren’t already talking, contact us.

This is a deep issue, and that’s why I want to explain it. In particular, our end game is to win concessions on credit, compensation, and consent from AI companies on behalf of Medium writers. But how much of each to ask for is still unclear, so it’s crucial that we hear from our writers. When we surveyed our authors, 92.2% of you said that you want us to take active measures against AI companies until these issues of fairness can be sorted out.

How we’re blocking AI training

Anyone even remotely technical will know that our new “disallow” policy is going to be difficult to enforce. We agree. But there are some things we have done and can do in the future.

We’ve updated our Terms of Service to be clearer about disallowing spiders without prior written consent, and we’ve started adding explicit blocks to our robots.txt file so that AI companies know our position.

Unfortunately, the robots.txt block is limited in major ways. As far as we can tell, OpenAI is the only company providing a way to block the spider it uses to find content to train on. So we are joining Reuters, the New York Times, CNN, and many other companies in making a site-wide block against OpenAI. We should all give some credit to OpenAI for leading here. They’re addressing one of the fairness issues: consent.
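For reference, here’s what that block looks like, following OpenAI’s published GPTBot documentation (the exact contents of our robots.txt may differ):

    # Block OpenAI's GPTBot crawler from every page on the site
    User-agent: GPTBot
    Disallow: /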

The other problem with this approach is that, right now, the block is site-wide. What we need for our writers is a fine-grained approach that works at the level of individual writers and individual stories. A more robust protocol would likely look like a search engine sitemap, allowing a site to explicitly say what is available for AI training and what isn’t. Medium would happily give writers tools to set these permissions. But first, we need some sort of standard.
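To make that concrete, here is a purely hypothetical sketch of what a per-story permissions file could look like. The ai:training extension is invented for illustration; no such standard exists today:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical sketch: "ai:training" is an invented extension, not a standard -->
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:ai="https://example.com/ai-permissions">
      <url>
        <loc>https://medium.com/@writer/story-that-allows-training</loc>
        <ai:training>allow</ai:training>
      </url>
      <url>
        <loc>https://medium.com/@writer/story-that-opts-out</loc>
        <ai:training>deny</ai:training>
      </url>
    </urlset>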

I’m saying this because our end goal isn’t to block the development of AI. We are opting all of Medium out of AI training sets for now. But we fully expect to opt back in when these protocols are established.

We don’t think we can block companies other than OpenAI perfectly, but we can continually ramp up our efforts, as we do with a similar group of actors who are leeching value: spammers.

This all has some element of theater, and yes, it may be that we are entering a “get your popcorn” moment where corporations, celebrities, and rich people fight in public. See: Sarah Silverman, Barry Diller, Stack Overflow, Reddit.

But I do want to say one thing that is very different about Medium’s position. We don’t own what’s published here, and we don’t sell your data. This is not a case where we’re trying to get paid. Rather, this is a case where we are trying to make sure our writers get paid. One of our writers called this “negotiating as a service”, and that’s essentially right.

Why copyright is (probably) not a protection

A lot of you are wondering if we have enough power here to get meaningful concessions from any AI companies. It may be that opting out is relatively meaningless. Even if AI companies don’t train on what you publish on Medium, they will find other places to train. But I think the answer is that we do have some power and so I’m walking through the available levers, starting with copyright.

The problem is that copyright law doesn’t cover this use case. Cory Doctorow summed it up well: Copyright law prevents someone from republishing or reselling your work, but it “doesn’t prevent someone from doing a statistical analysis of text you publish online.”

The AI training is essentially that: consuming your writing, analyzing it, and putting the analysis into a big mathematical model. The technical term for the resulting model is a Large Language Model (LLM).

There’s also a limitation for Medium: we can’t use copyright law to protect you, because if there are copyright violations, they are violations of your rights, not ours. We take seriously that when you publish on Medium, you retain ownership and rights to your writing. But a side effect of that is that we don’t have legal standing when there is a copyright violation. (This also comes up when spammers plagiarize your writing somewhere else — you can file DMCA complaints with their hosting provider, but we can’t.)

This idea that AI training does not violate copyright is not settled legal precedent, and some publishers are currently suing to prove an alternate view of the copyright issues: “The authors alleged that their writing was included in ChatGPT’s training dataset without their permission, arguing that the system can accurately summarize their works and generate text that mimics their styles.”

This does leave individual Medium writers in a pickle. It’s almost certainly impractical for you to sue, even if the copyright violation were clear, and clarity is not a settled issue yet. Suing is costly, victory is uncertain, and even in victory, you won’t see compensation for years.

There is one copyright caveat: third parties don’t have a license to sell your content. If an AI company spiders a story you publish online, then there is likely (unless the above lawsuit succeeds) no copyright violation. However, if they bought your story from a reseller who had spidered it, then that is a copyright violation. Under the default license on Medium stories, you retain the exclusive right to sell your work. In an act of incredible audacity, some of these AI companies appear to have violated this right. This seems to be a core part of the Sarah Silverman lawsuit:

The suits allege, among other things, that OpenAI’s ChatGPT and Meta’s LLaMA were trained on illegally-acquired datasets containing their works, which they say were acquired from “shadow library” websites like Bibliotik, Library Genesis, Z-Library, and others, noting the books are “available in bulk via torrent systems.”

Other ways platforms and media companies may block AI companies

Using our Terms of Service and robots.txt amounts to a soft ban on AI, but that is likely enough to work on the largest AI companies. If not, we can follow up with a cease and desist. Here’s how our lawyer described the enforceability of scraping bans.

The LinkedIn-hiQ Labs dispute that put a spotlight on scraping was finally settled by a consent judgment between the parties in December of last year. The consent judgment includes a broad prohibition against hiQ scraping or accessing LinkedIn’s platform in violation of the LinkedIn User Agreement. Just before the consent judgment, a California district court had ruled that hiQ did indeed breach LinkedIn’s User Agreement in scraping the LinkedIn platform. The upshot of the CA district court decision and the consent judgment is that terms of use prohibiting scraping are enforceable.

That leaves smaller and less compliant AI companies. This is where the parallel to spammers makes sense. No media company has ever had a perfect track record against spammers. But we all have an active and ever-changing strategy to resist their work.

Medium’s approach would start by blocking their spiders. But I wouldn’t be surprised if it eventually led to inserting and rewriting content in order to poison their results. See, get your popcorn.

Of course, that path is ugly and nobody wants this outcome. Our goal is to get to a world where consent, credit, and compensation are normalized.

Why Medium wants to work with a coalition

There’s another big piece to this, which is that Medium is not alone. We are actively recruiting for a coalition of other platforms to help figure out the future of fair use in the age of AI.

I’ve talked to <redacted>, <redacted>, <redacted>, <redacted> and <redacted>. These are the big organizations that you could probably guess, but they aren’t ready to publicly work together. I’m not sure what is going to tip us over toward a coalition, but I’m sure that we’ll be more successful if we do.

I think we have a mixed role in a coalition. On the one hand, Medium is one of the few social media platforms that is fully incentivized to pass a negotiated settlement directly to our writers. If an AI company were to pay us money, our intention would be to pass 100% of that money on to Medium writers.

On the other hand, as a commercial entity, I don’t think anyone is looking to us to say what is right for the entire Internet.

So my hopes for an effective coalition are pinned on organizations like Creative Commons and Wikipedia. They have the clout to cut through the commercial goals of other coalition members. Having them involved would let us work toward concessions that would work for the entire Internet, including people who self-host.

I’m hopeful that this coalition will come together and that we’ll have something like an ai.txt file and some standardized ways for exchanging value.
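No one has defined that file yet, but to give a flavor of it, an ai.txt could borrow robots.txt syntax. This is an invented illustration, not a proposal anyone has adopted:

    # Hypothetical ai.txt; no such standard exists today
    User-agent: *
    Disallow-training: /
    Allow-training: /stories-opted-in/
    Credit-required: yes
    Compensation-contact: licensing@example.com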

Here are some plausible solutions to consent, credit, and compensation

Of the concessions, consent seems like the most likely to arrive first. OpenAI has offered one version of consent, in the form of a site-wide opt-out. Based on the number of companies that have opted out, we now know that consent protocols will be eagerly adopted.

However, we don’t want to be handling different consent protocols for every single AI company. There needs to be a standard.

Plus, for platforms like Medium, consent has to be put in the hands of the individual creators. So eventually we need a standardized protocol that operates at the granularity of a page.

But what about credit and compensation? I’ll start with credit.

Credit is a bit like Google search. We let Google spider our site — in fact, we optimize for this. But the only reason we do this is because Google sends a lot of readers to your stories. For the vast majority of writers, this is a fair exchange of value.

Ideally, when an AI generates text that was trained on your stories, it would also give you credit by linking back to you and sending you readers. Along these lines, Jeff Jarvis has proposed an expanded rights framework in the form of creditright:

“This is not the right to copy text but the right to receive credit for contributions to a chain of collaborative inspiration, creation, and recommendation of creative work. Creditright would permit the behaviors we want to encourage to be recognized and rewarded. Those behaviors might include inspiring a work, creating that work, remixing it, collaborating in it, performing it, promoting it. The rewards might be payment or merely credit as its own reward.”

The current popular AIs were built in a way that makes this impossible. The text they generate is literally the result of an analysis of billions of inputs, and they can’t trace an output back to its primary inputs in order to give you credit.

However, this is not the only way to build AI text generators. We have worked with miso.ai in the past specifically because they produce “citation-driven” results as a way to get around the AI hallucination problem. Google’s new AI-assisted search also appears to work this way.

If an AI company won’t build a citation-driven system, then that leaves only compensation as a way to negotiate for access.

Basically, that’s where we are with the current crop of AI companies: They need to pay for access or get cut off. But it also may be that they will evolve to choose credit instead. I’d want both, but I think Medium (and the rest of the Internet) has shown through our relationship with Google that we’d accept just credit.

Should we negotiate and accept compensation on behalf of our writers?

This is a real question and we’ve received some indication that some AI companies are willing to pay for access to Medium articles. I’ll believe it when I see it, but I also don’t want to pretend that nobody has ever broached the idea that we could negotiate a deal.

The case for rejecting a deal feels strong and urgent to many of you, because you are seeing more and more AI-generated content coming into your publications, leaking into your feeds, and flooding the rest of the Internet.

It’s a fair moral question: Why would we let AI companies leech off our stories, just to turn around and spam our favorite corners of the Internet?

The other hurdle to negotiated compensation is that it’s hard to determine the value of an individual piece and so there are many outcomes where a flat distribution won’t make sense.

My own prediction is that if negotiated compensation is meaningful, the best way to distribute it would be to add it to our existing pool of author payments and distribute it according to the same incentives.
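As a toy sketch of that idea, assuming (purely for illustration) that an AI payment is split pro rata by each writer’s share of existing earnings; Medium’s real payment incentives are more nuanced than this:

    # Toy sketch: split an AI licensing payment pro rata by each writer's
    # share of the existing author-payment pool. Purely illustrative;
    # Medium's actual payment incentives are more nuanced than this.
    def distribute_ai_payment(ai_payment: float, earnings: dict[str, float]) -> dict[str, float]:
        total = sum(earnings.values())
        return {writer: ai_payment * amount / total for writer, amount in earnings.items()}

    # Example: a $10,000 payment split across three writers
    print(distribute_ai_payment(10_000.0, {"ana": 300.0, "ben": 100.0, "cam": 100.0}))
    # -> {'ana': 6000.0, 'ben': 2000.0, 'cam': 2000.0}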

I want to leave you with two concrete outcomes to react to.

  1. How would you feel about a deal that offered you the ability to opt out of allowing AI companies to train on your writing, but offered a 10% boost in earnings on Medium to people who opted in? Would you opt out? Opt in? Leave Medium?
  2. Would you allow a search engine to train on your writing in order to generate AI-summarized answers that credited you? Google Bard seems headed in this direction. Would your answer change if the amount of traffic the search engine sent you dropped by half?
