On behalf of the R Street Institute (R Street), we respectfully submit these comments in response to the United States Patent and Trademark Office (USPTO) Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation.1

R Street is a free-market think tank that takes a pragmatic approach to public policy challenges.2 R Street has written broadly about the importance of progress and competition in the development and application of artificial intelligence (AI).3 Given the rapidly advancing state of AI across many different domains, questions surrounding the intersection of intellectual property and algorithmic inputs/outputs are of vital importance.

Below is an area we would like to highlight as needing further study and consideration by the USPTO as AI applications continue to be deployed across the economy.

Clarify the fair-use exemption for AI training data
Imagine a hypothetical startup focused on creating a natural language processing application. One readily available source of human dialogue the company might consider using to train the application would be the last 50 years of Hollywood scripts, many of which are scrapable from various online databases. Such an endeavor, however, would stand on legally dubious grounds. These scripts remain copyrighted works, and there are no clear legal guidelines established to delineate what is allowable as fair use in machine learning (ML) training data and what is not.4 More likely, this startup would avoid this potential legal minefield and consider what other, less risky datasets might be available.

This is the ambiguous state of copyright enforcement in ML today, and it is problematic. As legal scholar Amanda Levendowski has argued, the de facto privileging of frequently low-quality data that exist in the public domain (such as the Enron emails) has inadvertently biased the many AI applications that are built on them.5

This reality may also have important and underexplored implications for the state of competition in AI. While large incumbent firms typically have available vast reams of consumer data that can be used to improve the performance of algorithmic tools, startups and smaller firms are more reliant on datasets scraped from the internet to help offset this advantage.6 If startups can acquire a sufficient amount of relevant data, they can often launch a new product or service, which begets new data that they can use to maintain and improve their services. This virtuous cycle of sorts can help these firms compete with larger, more established ones.

An enormous number of copyrighted works are scrapable from the internet. These works could provide new arbitrage opportunities for scrappy startups willing to find and leverage interesting datasets. Indeed, considering the massive amount of data that might be included in these efforts, the full scope of what is possible is admittedly difficult to grasp. Yet the data in these works are currently underexploited, in part because of the legal ambiguities surrounding their use in ML.

Google has already showcased one use case for which this type of data might be leveraged. In 2016, a research division within Google used a corpus of 11,000 free e-books to show the potential improvements that could be made to a conversational AI program.7 This effort sparked considerable controversy among groups like the Authors Guild, which considered it a violation of the authors’ intended purpose and arguably a copyright violation.8 Because this instance involved a research paper and was not used for commercial purposes, no suit was pursued. Notably, however, the original ‘BookCorpus’ dataset is no longer publicly hosted.9

If they choose, large incumbent firms like Google have the resources to fight these lengthy legal battles, given their significant legal teams. Startups and smaller companies, however, are far less likely to have these resources on staff. In practice, this means the current ambiguity surrounding the fair-use exemption disproportionately hurts smaller firms.

Given the existing legal ambiguity and the significant potential benefits to be reaped, further study and clarification of the legal status of training data in copyright law should be a top priority when considering new ways to boost the prospects of competition and innovation in the AI space.

We appreciate the opportunity to comment on the USPTO’s Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation and look forward to further participation in these discussions.

Respectfully submitted,
Caleb Watney

Technology Policy Fellow

R Street Institute