Fair Use in the Age of AI


Executive summary

01. The transformative-use test remains the primary battleground for training data claims.
02. Market-substitution analysis is evolving as generative output quality approaches human work.
03. Recent litigation draws a sharp line between ingestion for training and output that competes with the source.

I · The doctrine

The four factors, briefly

The Fair Use doctrine codified at 17 U.S.C. § 107 is a balancing test, not a checklist. Courts weigh four factors: (1) the purpose and character of the use, including whether it is commercial and whether it is transformative; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

Before generative AI, the doctrine was elastic but legible. After Campbell v. Acuff-Rose Music, transformative use rose to near-dispositive prominence: the more transformative the new use, the less weight courts gave to commercial character, factor two, and even the amount used. After Authors Guild v. Google, the digitization of entire libraries for a search function was held transformative because the purpose of the copy differed fundamentally from the purpose of the original. The question for AI is whether that elasticity survives contact with models that ingest millions of works and produce outputs that can substitute for the source material in the source’s own market.

The answer is unfolding across at least fifteen active US cases and a growing European docket. Courts have begun to draw lines that were not visible when the cases were filed. Those lines tend to fall between training (the one-time ingestion of works to build a model) and output (the generation of content by the trained model), and between factual abstractions (style, structure, patterns) and expressive reproduction (verbatim or near-verbatim output). The following analysis walks those lines in order.

II · Training

Factor-by-factor: training data

Factor one — purpose and character. Training is unambiguously commercial when performed by a for-profit lab. Whether it is transformative is the battleground. The Authors Guild v. HathiTrust and Google Books line of cases supports transformation when the use serves a materially different purpose from the original — indexing, search, or research. Critics argue that AI training does not transform in the relevant sense because the model’s downstream purpose (generating content in the same medium) is not materially different from the purpose of the source works. The counter-argument is that training extracts abstractions — statistical correlations between features — that are not themselves expressive works and therefore cannot infringe. That argument succeeded when applied to scanning books for indexing; whether it survives when applied to scanning books to generate competing books is the open doctrinal question.

Factor two — nature of the work. Courts have consistently weighted factor two lightly. Training sets are indiscriminate. They include factual works (weak copyright protection), highly creative works (strong protection), code, news, and everything in between. Because defendants copy everything, plaintiffs can always point to highly creative works in the corpus. But because factor two has historically been the least consequential factor, this rarely moves the needle. Expect factor two to be mentioned in every opinion and decisive in none.

Factor three — amount and substantiality. AI training ingests whole works. In traditional doctrine, copying the whole weighs strongly against fair use. Defendants invoke Sony Corp. of America v. Universal City Studios and the Google Books holding that whole-work copying is permissible when the whole is necessary for the transformative purpose. For AI training, the whole work is argued to be necessary because partial ingestion would produce a defective model. This is the strongest doctrinal argument for training fair use, and it has persuaded at least one district court considering the question.

Factor four — market impact. This is where training fair use tips. If the output markets of the trained model substitute for the input markets from which the training data came, factor four becomes dispositive. The Getty, NYT, and Andersen complaints all argue this directly: the trained model produces images that compete with licensed stock photography; the trained model produces text that competes with licensed news; the trained model produces art that competes with the artists whose work trained it. The defendants argue that this conflates potential market harm (always speculative) with demonstrated market harm (rarely provable). The Supreme Court in Harper & Row v. Nation Enterprises called factor four “undoubtedly the single most important element of fair use,” and in Warhol v. Goldsmith the Court, though it decided the case on factor one, stressed that a use which shares the original’s purpose and substitutes for it weighs heavily against the copier. Warhol, handed down in May 2023 while the first AI training complaints were still being filed, is reshaping the fair-use analysis in real time.

III · Output

Fair use applied to generation

The NYT v. OpenAI complaint changed the fair-use conversation permanently. With 100+ examples of verbatim reproduction in the appendix, the Times shifted the question from “could the model infringe?” to “here are the actual reproductions — explain.” That recalibration matters: memorization is not merely statistical abstraction. If a language model can produce the opening paragraph of a New York Times article word-for-word when prompted with the headline, the model has, in a literal sense, stored that article.

OpenAI’s response — that the verbatim outputs required adversarial prompting and do not represent ordinary use — is a double-edged sword. On one edge, it weakens the claim that ordinary commercial operation of the product infringes; on the other edge, it concedes that the model memorizes copyrighted training data, which undermines the “mathematical abstraction” framing that the same defendants use when defending training. If the model is a mathematical abstraction, it cannot reproduce its training corpus verbatim. If it can reproduce its training corpus verbatim, it is not merely a mathematical abstraction. Courts are noticing the tension.

The emerging consensus in the academic literature (Samuelson, Sag, Gervais, and others) is that training and generation deserve separate fair-use analyses. Training is a one-time ingestion for research-adjacent purposes; generation is a repeatable, commercial act that produces specific output. A defendant could theoretically win on training (as transformative) and lose on generation (as substitutive). That bifurcation is now being actively briefed. The Thomson Reuters v. Ross Intelligence summary-judgment ruling in 2025 signaled judicial openness to this kind of purpose-and-market analysis: the court held that training a research product on Westlaw headnotes for the purpose of building a competing research product was not transformative and not fair use, while noting that Ross’s tool was not generative AI. Other courts are beginning to cite it.

IV · Detection

Fair use and detection tools

AI content detectors present a surprising fair-use wrinkle. Detection services often ingest protected training data, in the form of samples of AI output paired with known source works, to build their classifiers. Those paired samples are arguably derivative works themselves. Whether training a detector is fair use is an under-theorized question, but one that will be litigated as detection becomes a regulated compliance technology under the EU AI Act and US state laws.

The policy argument for detection as fair use is strong. Detection serves a public interest in authenticity. It is defensive rather than substitutive — a detector does not compete with the source material, it identifies it. And the output is narrow: a classification score, not a substituted work. All three arguments line up with transformative-use doctrine. But the doctrinal argument is messier. Factor three is weak: detectors ingest whole works. Factor four is uncertain: does a detector harm the market for the works it detects? Arguably no, arguably yes. A robust detection regime may actually increase the market for licensed content by making unlicensed content easier to identify and penalize.

The first detection-fair-use case will likely come from an AI lab, not a publisher. A model provider whose output is consistently flagged by a third-party detector may sue the detector over its training methodology. That case is on a three-to-five-year horizon, but it is worth tracking: the detection industry is concentrating, and concentrated industries attract litigation.

V · Jurisdictions

How other legal systems are treating the question

The US fair-use analysis is not the only doctrinal lens. Three other jurisdictions matter for global AI content law.

United Kingdom. UK copyright law does not have fair use; it has fair dealing and a narrower list of statutory exceptions, none of which cleanly cover AI training. A 2022 government proposal to create a text-and-data-mining exception covering commercial training was withdrawn in early 2023 after pressure from publishers and other rights-holders. The current UK position is therefore precarious for AI labs: training on copyrighted works without a license is arguably infringement, subject to limited research exceptions. Getty Images’ parallel UK action against Stability AI, which is distinct from the US case, turns on this narrower UK framework.

European Union. The EU Directive on Copyright in the Digital Single Market contains a text-and-data-mining exception (Article 4) that permits training on lawfully accessed content unless the rights-holder has opted out. The EU AI Act’s Article 53 reinforces this by requiring providers of general-purpose AI models to comply with Article 4 opt-outs and to publish training-data summaries. The practical effect: rights-holders can opt out of EU training, but the opt-out signal must be machine-readable, which favors large publishers over individual creators.
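To make the “machine-readable” requirement concrete, the sketch below shows how a crawler might check a reservation expressed through robots.txt before collecting a page. Treating robots.txt as the opt-out channel is an assumption made for illustration; whether any particular signal satisfies Article 4 is not something this sketch settles. The crawler names, the URL, and the `may_collect` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a rights-holder. "ExampleTrainingBot"
# is an invented crawler name used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit this agent to fetch the URL."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://publisher.example/articles/story.html"

# The training crawler is named in a Disallow rule; other agents fall
# through to the wildcard rule.
training_allowed = may_collect(ROBOTS_TXT, "ExampleTrainingBot", URL)
search_allowed = may_collect(ROBOTS_TXT, "ExampleSearchBot", URL)

print("training crawler allowed:", training_allowed)
print("search crawler allowed:", search_allowed)
```

The asymmetry the paragraph describes shows up here: a publisher with an engineering team can maintain per-crawler rules like these across its whole site, while an individual creator posting on third-party platforms usually cannot control that file at all.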

Japan. Japan’s copyright law has the most permissive training regime among major jurisdictions. Article 30-4, amended in 2018, expressly permits use of copyrighted works “for data analysis purposes” without the rights-holder’s consent, provided the use does not unreasonably prejudice the rights-holder’s interests. Japan is consequently emerging as a preferred jurisdiction for model training, though the “unreasonable prejudice” caveat is actively contested in the courts.

VI · Where this is heading

Synthesis

Three forecasts, in descending order of confidence:

01. Statutory licensing becomes inevitable. The music industry built ASCAP and BMI because negotiating and litigating work by work was more expensive than collective blanket licensing. AI training will follow the same path, whether by legislation, by class-action-driven settlement, or by a de facto licensing market anchored by the largest labs. Getty’s active licensing pilots and the nascent collective-licensing initiatives at the Authors Guild are early proofs of concept.
02. Technical watermarking becomes a compliance floor. Expect providers to implement content-provenance metadata (C2PA and successors) not because courts require it but because regulators will. The EU AI Act Article 50 already mandates machine-readable labeling for synthetic content; China’s Interim Measures do the same. US state laws are moving in the same direction. Within three years, any commercial AI provider that cannot produce C2PA-compliant output will be effectively locked out of regulated markets.
03. The era of unbridled scraping is ending. Provenance, chain-of-custody, and licensable-data-only training will become standard representations in enterprise contracts. We already see this in SOC 2 audits and enterprise procurement questionnaires. The boilerplate clause “we represent that our training data is lawfully obtained” is being upgraded to specific, auditable commitments. Labs that cannot meet those commitments will lose enterprise deals regardless of whether they win their copyright cases.

What none of this resolves is the individual creator problem. A solo artist whose style has been absorbed into a model trained on scraped social-media data has no practical remedy in the current framework. Class actions are slow and narrowly certified; statutory licensing benefits the well-organized; regulatory regimes favor large rights-holders. The gap between legal recognition of the problem and practical relief for individual creators is the untended wound of AI content law — and the source of most of the political pressure that will eventually reshape the doctrine itself.

SD Frivolous Editorial

The SD Frivolous editorial team combines legal practitioners, journalists, and technologists focused on AI content law. Analysis is peer-reviewed by counsel before publication.

Legal Disclaimer

This analysis is journalism and commentary, not legal advice. Laws governing AI content change rapidly. Consult qualified counsel for specific legal questions.