Copyright, AI Training, and Innovation
The rapid evolution and adoption of generative artificial intelligence (AI) have sparked numerous debates, ranging from its impact on employment to broader societal implications. One of the most contentious aspects is the intersection of AI and copyright law, with 39 copyright lawsuits underway against AI companies. The U.S. Copyright Office recently weighed in with a report focused on the implications of using copyrighted materials to train generative AI models. The study examines the technical processes underlying AI training as well as potential legal implications, including a fair-use assessment and the role of licensing. While the Copyright Office acknowledges the transformative potential of AI training, the report’s tentative conclusions suggest an overly cautious interpretation of fair use and a limited evaluation of the transformative aspects of developing AI models. Because the report remains in pre-publication form, its policy recommendations are not yet final, leaving room for additional analysis to ensure that copyright law does not unnecessarily impede technological progress.
Fair Use Flexibility and Technological Progress
Courts use the four-factor test set out in Section 107 of the Copyright Act to determine whether a particular use of copyrighted material is fair and therefore non-infringing. The first factor considers the purpose and character of the use, including whether it is commercial or for nonprofit educational purposes and whether it serves a new purpose or adds new meaning or function to the original work (with such “transformative” uses weighing heavily in favor of fair use). The second factor focuses on the nature of the work or works used, providing more protection to creative works than to simple factual compilations. The third factor evaluates “the amount and substantiality of the portion used in relation to the copyrighted work as a whole.” Finally, the fourth factor considers the economic impact on the original work’s market. While judges consider and weigh all four factors, the Copyright Office report places the greatest emphasis on the first and fourth factors.
The Transformative Nature of AI Training
With respect to the first factor, while the report acknowledges that the use of copyrighted materials in training AI may be transformative, it also recognizes the potential for infringement depending on the specifics of a given model and its intended purpose. The report understates the inherently transformative nature of machine learning. AI training exemplifies the concept of “fair learning”—using works to extract unprotected elements rather than to appropriate copyrightable expression. The Copyright Office remains skeptical of this position, suggesting that AI models retain expressive elements (e.g., word selection) and noting that image models are trained on aesthetic images, which implies expressive activity.
But the report’s definition of expressive purpose may be too broad. Although AI models process expressive content, their primary training purpose is to recognize statistical patterns and extract features, not to reproduce original content for public consumption. Typically, reconstructing original copyrighted works from the internal parameters of a large language model (LLM) is challenging, and AI systems are designed to minimize such outcomes. AI training transforms data into knowledge rather than replicating the original expression, a purpose that fundamentally differs from the one for which the works were created.
Machine Learning, Human Learning
AI models, particularly LLMs and image generation models, do not directly generate human-readable copies of training data. Rather, they use the data to “learn” patterns, relationships, styles, and statistical correlations. The analogy to human learning in this context does not refer to unlimited copying or other infringing human activities. Instead, it refers to the cognitive process of learning from existing works to generate new, original works—a process very much in line with copyright law’s true purpose: to promote progress.
An AI model’s “knowledge” is stored in its parameters, specifically its weights and biases, which are numerical representations rather than compressed versions of the original works. It is not possible to “view” a copyrighted image or text by looking at the internal parameters of a trained AI model because the original work has been processed and used to adjust the model’s vast network of connections. Consequently, the original work is no longer directly accessible within the model in its expressive form. This is more akin to a human reading many books to understand grammar, plot structure, or artistic styles rather than memorizing and reproducing books verbatim. The outputs of these models typically bear no resemblance to any specific training example, thereby demonstrating the transformative nature of the underlying process.
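To make this concrete, consider a minimal, hypothetical sketch in Python: a toy next-character predictor trained on two short sentences. The corpus, architecture, and hyperparameters are purely illustrative and reflect no production system. The point is simply that after training, the model’s entire “knowledge” is a matrix of floating-point weights from which the training text cannot be read off directly.

```python
# Minimal sketch (hypothetical toy example): a next-character predictor
# whose learned "knowledge" is nothing but a matrix of floats.
import numpy as np

corpus = "the cat sat on the mat. the dog sat on the log."
chars = sorted(set(corpus))
idx = {c: i for i, c in enumerate(chars)}
V = len(chars)

# One-hot encode each character paired with the character that follows it.
X = np.eye(V)[[idx[c] for c in corpus[:-1]]]
Y = np.array([idx[c] for c in corpus[1:]])

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (V, V))  # the model's entire parameter set

for _ in range(500):  # plain softmax regression trained by gradient descent
    logits = X @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(len(Y)), Y] -= 1.0  # gradient of the cross-entropy loss
    W -= 0.5 * (X.T @ probs) / len(Y)   # gradient descent step

# After training, the parameters are just numbers; the corpus is not stored.
print(W.shape)   # (V, V), a grid of floats
print(W[0, :4])  # a few raw weights, unreadable as text
```

Even in this deliberately tiny case, inspecting `W` reveals only numerical weights encoding statistical regularities of the corpus, which mirrors, at a vastly smaller scale, why the original works are not directly accessible inside a trained model.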
However, it is essential to recognize that the Copyright Office approaches this analogy with caution, noting that fair use does not protect “all human acts done for the purpose of learning,” such as copying all the books in a library. The Office also highlights AI’s ability to potentially make perfect copies “at superhuman speed and scale,” the main way in which AI learning differs from human learning. Moreover, while AI developers continuously employ sophisticated techniques to prevent the direct reproduction (or “regurgitation”) of training data, these measures do not fully eliminate the possibility of replication. Developers generally view direct reproduction as an unintended technical challenge (or “bug”) that is actively being mitigated through evolving tactics like data curation, regularization during training, and reinforcement learning from human feedback.
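One such tactic, data curation through deduplication, can be sketched in a few lines of Python. The snippet below is a hypothetical illustration, not any developer’s actual pipeline: it drops byte-identical documents before training, since duplicated text is widely understood to increase the risk of verbatim memorization, and production systems typically layer fuzzier near-duplicate detection (such as MinHash) on top.

```python
# Minimal sketch (hypothetical): exact deduplication of training documents,
# one common data-curation step believed to reduce verbatim memorization.
import hashlib

def dedupe(docs):
    """Drop byte-identical documents, keeping the first occurrence of each."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# A toy corpus with one exact duplicate; only two documents survive.
docs = ["once upon a time ...", "an original essay", "once upon a time ..."]
print(dedupe(docs))
```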
A core aspect of fair use analysis evaluates whether the new use is “transformative,” meaning it serves a new, different purpose and does not simply generate copies of the works for public consumption. In this respect, courts have found intermediate copying to be fair use when it serves an analytical or functional purpose, such as in the context of search engines caching websites or software reverse engineering. AI training can be viewed similarly as an analytical process that uses copyrighted works as data inputs to build a new computational tool. The act of training an AI model does not directly substitute for the original copyrighted work in its original market, as its purpose is fundamentally different.
An AI model’s output could potentially infringe if it is substantially similar to a copyrighted work; however, that is a separate legal question, and one courts are well equipped to handle. The training process itself is about building a tool, not about distributing copies of the inputs, a key distinction in the argument that AI training represents fair use.
Market Dilution: A Flawed Fourth-Factor Framework
The Copyright Office’s analysis of the fourth fair use factor—“the effect of the use upon the potential market for or value of the copyrighted work”—is particularly problematic. The report suggests that AI training may harm the market for copyrighted works in several ways, including through market dilution and lost licensing opportunities. Not only is this a significant and unwarranted expansion of the fourth factor, but it is also economically flawed.
Fourth-factor analysis should focus on whether AI training serves as a direct market substitute for the original works, which it does not. Creators do not produce books, photographs, or music with the expectation that they will be sold for AI training purposes. In this respect, the report shifts from an examination of original works used as AI inputs (i.e., training) to the broader question of AI outputs competing with other works in the marketplace, improperly reframing the inquiry.
Moreover, the Copyright Office’s theory of market dilution raises significant economic questions. Fundamentally, the analysis incorrectly characterizes what is essentially a pecuniary externality as a relevant market harm. In economic terms, a pecuniary externality occurs when an activity affects others only through changes in relative prices. Economists view this as normal market behavior that does not create a true economic inefficiency warranting regulatory action. Designing a better widget may reduce the value of existing widgets, but this is the nature of competition—not a market failure that requires government intervention.
The alleged “market dilution” described in the report, where AI-generated content might compete with human-created works (e.g., flooding the market with AI-generated romance novels), represents precisely this type of pecuniary externality. AI-generated content that competes with existing works is simply market competition, not copyright infringement. The impact of AI-powered competition will be felt by every other work in the market, regardless of its presence in the training data. Yet, the theory of market dilution suggested by the Copyright Office views such competitive impacts as cognizable copyright harms, overlooking the need to connect alleged market harm directly to the use of a specific copyrighted work in the training data.
Copyright was never intended to protect creators from market competition or changes in consumer preferences; rather, it protects original expression from unauthorized reproduction. Changes in consumer demand, competition from other creators, and technological advancements all impact the market. The ability of AI to produce similar content may affect the market for existing works, but this is no different from any technological innovation that increases productivity and lowers costs. The Copyright Office’s theory of market dilution would transform copyright from an exclusive right of expression into a regulatory barrier that shields existing business models from technological progress, which would contradict both the letter and the spirit of copyright law while significantly impeding innovation.
Additionally, a broader economic analysis must consider the public benefits of AI. Restricting access to training data would significantly increase costs for AI developers, create barriers to entry for new companies and researchers, limit competition within the AI industry, and potentially slow innovation across numerous sectors that deploy AI technologies, such as healthcare, energy, and scientific research. These substantial public benefits should be weighed against the speculative market harms identified in the report.
Economic Impact of Licensing Requirements
Another critical aspect of the copyright and AI debate is the role of licensing. While the Copyright Office report suggests that voluntary licensing markets are developing to address potential infringement concerns, it overestimates the feasibility of such markets at scale. The sheer volume and diversity of works utilized for effective AI training make comprehensive licensing practically impossible. (Common Crawl, for example, has an open-source dataset of more than 250 billion pages, with three to five billion added each month.) This is particularly true for the “long tail” of smaller content creators working outside professional creative industries.
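A rough, purely illustrative calculation shows the scale problem. Even if each of Common Crawl’s 250 billion pages could be cleared for a hypothetical one cent in transaction costs (identifying the owner, negotiating, and paying), clearance alone would cost $2.5 billion before any royalties, and the dataset grows by billions of pages each month. The per-page figure is assumed solely for illustration, but any realistic per-work cost yields totals of the same prohibitive order.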
While licensing possibilities are indeed evolving and AI developers are entering into agreements with high-value content providers (e.g., news organizations and stock photography companies), the report underestimates the significant economic burdens of widespread licensing requirements for AI training, particularly for smaller companies, researchers, and the open-source community. Sweeping licensing requirements would stifle innovation in AI development, fostering a monopolistic environment dominated by large technology companies while encouraging opportunistic litigation and raising costs for consumers.
Acquiring individual licenses for the vast amount of data required for comprehensive AI training is not only burdensome, but also unachievable in many cases. Transaction costs pose significant challenges, especially for works created outside professional creative industries, as well as those typically not intended for monetization, or those whose ownership is diffuse and difficult to determine. These transaction costs may even exceed the training value of the works, making direct licensing infeasible. As both creative industries and AI technologies evolve, data needs and licensing markets will continue to change. The Copyright Office rightly recommends letting the market evolve as interested parties develop voluntary licensing agreements.
Voluntary licensing may be feasible for valuable content licensed in high volumes (e.g., popular music, stock photography) or in fields with limited copyright owners. However, practical challenges remain in many areas, and a growing market for licensing some content does not guarantee that voluntary licensing is feasible or scalable for all AI training needs. It is essential to remember that licensing is not a substitute for fair use; rather, it serves as a method of compensation for non-fair use.
Conclusion
The debate over copyright and AI raises important questions—not just about legal technicalities, but about the continued evolution of AI as well. How the controversy is resolved will determine whether AI continues to be a catalyst for progress or whether copyright becomes a barrier to innovation. As noted in the Copyright Office’s report, fair use is a flexible doctrine designed to address emerging technologies. However, the report understates the doctrine’s resilience in accommodating transformative uses across technological revolutions while downplaying the transformative nature of AI models. Historically, the fair use doctrine has proved sufficiently flexible to address the challenges posed by technological progress, such as the introduction of home video recording and the creation of the World Wide Web. Recognizing AI training as predominantly non-infringing under existing copyright law would continue to promote creativity and innovation in the adoption of AI technologies. The courts are well positioned to apply established fair-use principles to the specific factual contexts of AI development as they have done with previous technological innovations.