
Lawsuits Galore: Authors Sue Meta, OpenAI Over Novel Infringement!

A look at the burgeoning legal challenges facing AI companies, particularly Meta and OpenAI, brought by a group of prominent authors and creators. The lawsuits contend that these companies unlawfully harvested copyrighted works to train AI systems.

Art Evolves

Sep 28, 2023 • 6 min read

💡

EVERY IMAGE IN THIS ARTICLE WAS MADE WITH AI

Courts Take Aim at Artificial Intelligence Giants: Decisive Battles for Language Model Training Legality

In a mounting campaign, authors are intensifying their push to halt artificial intelligence companies from utilizing their copyrighted content to train AI systems. This time, the crosshairs are fixed on Meta and OpenAI as they face proposed class-action lawsuits.

On Tuesday, a group of distinguished authors, including Michael Chabon, filed a lawsuit against Meta in a California federal court. They have accused the company of copyright infringement, asserting that Meta harvested extensive volumes of books from the internet to create works that allegedly infringe upon their copyrights. This lawsuit mirrors the one filed against OpenAI on September 8, alleging that both companies have financially benefited from the unauthorized and unlawful collection of authors' literary works. The authors are seeking a court order mandating the destruction of AI systems trained on copyrighted materials.

This lawsuit represents the most recent action in a series of legal battles launched by content creators, all centered on the legality of training large language models. OpenAI is currently contending with a proposed class action initiated by author Paul Tremblay, alongside another suit brought by Sarah Silverman, which also includes Meta as a defendant. Notably, artists have pursued copyright infringement claims against AI art generators like Stability AI, Midjourney, and DeviantArt as well.

As compelling evidence of AI systems being exposed to authors' books, the lawsuit highlights instances where ChatGPT generated both summaries and in-depth analyses of the novels' themes upon request. The lawsuit asserts that such outcomes could only be achievable if the foundational GPT model had undergone training using the authors' literary works.

The complaint argues, 'When ChatGPT is instructed to produce content in the style of a specific author, GPT generates text by drawing upon patterns and associations it acquired through the analysis of that author's body of work within its training dataset.' This assertion closely echoes the claims made in Tremblay's lawsuit, upon which the complaint largely draws.

As these massive language models inherently rely on data extracted from copyrighted materials, the lawsuit against Meta contends that the responses generated by ChatGPT are, in essence, 'infringing derivative works'.

The authors contend that OpenAI and Meta constructed their training datasets by 'gathering text data from web scraping.' In the complaint, it is noted that in June 2018, OpenAI disclosed that GPT-1, the initial version of its extensive language model, was fed with a compilation of more than 7,000 novels sourced from BookCorpus.

The lawsuit highlights the contentious nature of the BookCorpus dataset, originally compiled in 2015 by an AI research team supported by Google and Samsung. It was created with the explicit aim of training language models like GPT by extracting content from Smashwords, a platform that offers self-published novels to readers free of charge. The lawsuit emphasizes that this dataset includes a substantial portion of copyrighted novels, all incorporated without the authors' consent, acknowledgment, or remuneration.

The lawsuit asserts that subsequent iterations of OpenAI's large language models were also trained using unlawfully acquired literary works. In a 2020 paper introducing GPT-3, OpenAI disclosed that the training dataset was drawn from 'two internet-based book corpora' referred to as 'Books1' and 'Books2.' While OpenAI never explicitly listed the books within the dataset, the authors contend that 'Books1' primarily consists of material from the Project Gutenberg archive, a digital collection of books with expired copyrights that has become popular among AI firms. They further allege that 'Books2' is sourced from shadow library websites, including Library Genesis, Z-Library, and Bibliotik, as these align closely with the nature and size described by OpenAI for the dataset.

OpenAI ceased the practice of disclosing dataset sources, citing 'the competitive landscape and the safety considerations associated with large-scale models like GPT-4,' as communicated by the company in the past year.

The lawsuit alleges that Meta, in a manner similar to OpenAI, does not provide information regarding the sources of the books within its dataset employed for LLaMA's training. Although Meta mentioned that these works were sourced from the 'Books3 section of The Pile,' a dataset openly accessible for large language models, it refrains from offering additional details regarding the dataset's content.

However, as stated in the complaint, this information is accessible through alternative sources. The lawsuit contends that the 'Books3' dataset consists of books sourced from Bibliotik. Public statements from the individual who compiled the 'Books3' dataset have affirmed that it encompasses the entirety of Bibliotik's collection, comprising a total of 196,640 books.

Leading the charge in these class-action lawsuits, which aim to represent authors across the United States whose content contributed to AI system training, are acclaimed figures such as Michael Chabon, renowned for works like 'The Mysteries of Pittsburgh,' 'Wonder Boys,' and 'The Amazing Adventures of Kavalier & Clay.' Joining the legal battle are distinguished authors like David Henry Hwang and Matthew Klam, along with others known for their contributions to literature and screenplays. The lawsuits level a range of allegations, including direct copyright infringement, vicarious copyright infringement, violations of the Digital Millennium Copyright Act, unjust enrichment, and negligence.

The outcome of the litigation will be heavily influenced by two cases that legal experts consider pivotal. On one hand, there's precedent supporting the practice of copying works to generate non-infringing text responses. This precedent emerged from the 2005 lawsuit filed by the Authors Guild against Google for digitizing a vast library of books to create a search function. In that case, a federal judge dismissed the copyright infringement claims, deeming Google's use of copyrighted works fair use. A key factor in this ruling was that Google allowed users to access only snippets of text, refraining from providing full book content.

Conversely, authors can cite a recent Supreme Court ruling in Andy Warhol Foundation for the Visual Arts v. Goldsmith, where the Court rejected a fair use defense. In this decision, the justices emphasized the significance of potential commercial overlap in their evaluation. They concluded that fair use is less likely to be upheld when the original work and its derivative serve the 'same or strikingly similar purpose' and when the secondary use carries commercial implications.

Ed Klaris, an intellectual property lawyer and a professor at Columbia Law School, suggests that, in light of these two cases, the courts are poised to give substantial consideration to the nature of usage.

Additionally, the complaint describes a scenario in which users can instruct ChatGPT to generate screenplays in the style of specific books or authors. For instance, when prompted to create a screenplay in the style of 'The Dance and The Railroad,' ChatGPT produced a script that closely emulated Plaintiff Hwang's distinctive writing style. The resulting screenplay featured a Chinese laborer working on the Central Pacific Railroad, with a narrative that revolves around the 'belief in the power of art to sustain their spirits,' as outlined in the complaint.

The forthcoming decision by the U.S. Copyright Office regarding the copyrightability of AI-generated works, particularly when companies claim ownership under the work-for-hire doctrine, could open up new possibilities for studios. Studios might, for example, acquire the rights to a book and employ AI to craft the screenplay. Such a shift could disrupt the market for authors. Notably, accomplished authors like Stephen Chbosky ('The Perks of Being a Wallflower'), Emma Donoghue ('Room'), and Gillian Flynn ('Gone Girl') have traditionally been the ones to adapt their novels into screenplays.

Klaris anticipates that, when it comes to assessing fair use, the courts are likely to issue rulings favorable to creators. He underscores the authors' and artists' assertions that AI companies are negatively impacting their economic stakes by generating competing works based on their original material. As a result, Klaris suggests that AI companies will inevitably be compelled to establish a structured licensing framework.

OpenAI didn’t respond to a request for comment. Meta declined to comment.

Chabon v. Meta by THR
