New York Times Sues OpenAI and Microsoft for Copyright Infringement

The NYTimes lawsuit has the potential to significantly shape copyright and AI policy

By Christina Catenacci

Aug 2, 2024

Key Points:

The Times has sued both OpenAI and Microsoft, alleging copyright infringement, trademark dilution, and unfair competition by misappropriation

OpenAI has responded to the Complaint on its website stating, “We support journalism, partner with news organizations, and believe The New York Times lawsuit is without merit”

A decision in this case may provide much-needed clarification regarding the use of copyrighted works in the development of generative AI tools

On December 27, 2023, the New York Times (The Times) sued OpenAI and Microsoft (Defendants) for copyright infringement in the United States District Court in New York. In its Complaint, The Times explained that its work is made possible through the efforts of a large and expensive organization that provides legal, security, and operational support, as well as editors who ensure their journalism meets the highest standards of accuracy and fairness.

In fact, The Times has evolved into a diversified multi-media company with more than 10 million subscribers and with readers, listeners, and viewers around the globe. But according to The Times, the joint efforts of the Defendants have harmed it, including through lost advertising revenue and fewer subscriptions. The Times alleges that OpenAI unlawfully used its works to create artificial intelligence products. The Times argued in its Complaint that unauthorized copying of The Times works without payment to train Large Language Models (LLMs) is a substitutive use that is “not justified by any transformative purpose”.

The Times has sued the Defendants as follows:

Copyright infringement against all Defendants: by building training datasets containing millions of copies of The Times works (including by scraping copyrighted works from The Times’s websites and reproducing them from third-party datasets), the Defendants have directly infringed The Times’s exclusive rights in its copyrighted works. Also, by storing, processing, and reproducing the training datasets containing millions of copies of The Times works to train the GPT models on Microsoft’s supercomputing platform, Microsoft and the OpenAI Defendants have jointly directly infringed The Times’s exclusive rights in its copyrighted works

Vicarious copyright infringement against Microsoft and OpenAI: Microsoft controlled, directed, and profited from the infringement perpetrated by the OpenAI Defendants. Microsoft controls and directs the supercomputing platform used to store, process, and reproduce the training datasets containing millions of The Times works, the GPT models, and OpenAI’s ChatGPT offerings. The Times alleges that Microsoft profited from the infringement perpetrated by the OpenAI Defendants by incorporating the infringing GPT models trained on The Times works into its own product offerings, including Bing Chat

Contributory copyright infringement against Microsoft: Microsoft materially contributed to and directly assisted in the direct infringement that is attributable to the OpenAI Defendants. The Times alleged that Microsoft provided the supercomputing infrastructure and directly assisted the OpenAI Defendants in: building training datasets containing millions of copies of The Times works; storing, processing, and reproducing the training datasets containing millions of copies of The Times works used to train the GPT models; providing the computing resources to host, operate, and commercialize the GPT models and GenAI products; and providing the Browse with Bing plug-in to facilitate infringement and generate infringing output. The Times said that Microsoft was fully aware of the infringement and OpenAI’s capabilities regarding ChatGPT-based products

Digital Millennium Copyright Act–Removal of Copyright Management Information against all Defendants: The Times included several forms of copyright-management information in each of The Times’s infringed works, including: copyright notice, title and other identifying information, terms and conditions of use, and identifying numbers or symbols referring to the copyright-management information. However, The Times claimed that without The Times’s authority, the Defendants copied The Times’s works and used them as training data for their GenAI models. The Times believed that the Defendants removed The Times’s copyright-management information in building the training datasets containing millions of copies of The Times works, both from works that were scraped directly from The Times’s websites and from works reproduced from third-party datasets. Moreover, The Times asserted that the Defendants created copies and derivative works based on The Times’s works, and by distributing these works without their copyright-management information, the Defendants violated the Copyright Act.

Unfair competition by misappropriation against all Defendants: by offering content that is created by GenAI but is the same as or similar to content published by The Times, the Defendants’ GPT models directly compete with The Times content. The Defendants’ use of The Times content encoded within models and live Times content processed by models produces outputs that usurp specific commercial opportunities of The Times. In addition to copying The Times’s content, the Defendants altered the content by removing links to the products, thereby depriving The Times of the opportunity to receive referral revenue and appropriating that opportunity for themselves. The Times now competes for traffic and has lost advertising and affiliate referral revenue

Trademark dilution against all Defendants: in addition, The Times has registered several trademarks and argued that the Defendants’ unauthorized use of The Times’s marks on lower quality and inaccurate writing dilutes the quality of The Times’s trademarks by tarnishment. The Times asserts that the Defendants are fully aware that their GPT-based products produce inaccurate content that is falsely attributed to The Times, and yet continue to profit commercially from creating and attributing inaccurate content to The Times. The Defendants’ unauthorized use of The Times’s trademarks has resulted in several harms, including damage to The Times’s reputation for accuracy, originality, and quality, which has caused and will continue to cause it economic loss.

The Times has asked for statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity. Additionally, The Times has requested that there be a jury trial.

What can we take from this development?

This has the makings of a landmark copyright case and could go a long way toward shaping copyright and AI policy for years to come. In fact, some have referred to this case as “The biggest IP case ever”.

In response to the Complaint, OpenAI made a public statement on its website in January 2024 stating, “We support journalism, partner with news organizations, and believe The New York Times lawsuit is without merit”. The company set out its position as follows:

“Our position can be summed up in these four points, which we flesh out below:

We collaborate with news organizations and are creating new opportunities

Training is fair use, but we provide an opt-out because it’s the right thing to do

“Regurgitation” is a rare bug that we are working to drive to zero

The New York Times is not telling the full story”

Interestingly, OpenAI has stated that training AI models using publicly available internet materials is “fair use”, as supported by long-standing and widely accepted precedents. It stated, “We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness”. However, the Defendants in this case may run into problems with this argument because The Times’ copyrighted works are behind a paywall. The Defendants are familiar with what this means—it is necessary to pay in order to read (with subscriptions) or use (with proper licensing).

It is concerning that OpenAI refers to regurgitation (word-for-word memorization and presentation of content) as a bug that they are working on, but then says, “Because models learn from the enormous aggregate of human knowledge, any one sector—including news—is a tiny slice of overall training data, and any single data source—including The New York Times—is not significant for the model’s intended learning”.

Essentially, OpenAI has downplayed the role that The Times’s works play in the training process, yet it has not addressed The Times’s arguments that the Defendants have ingested millions of copyrighted works without consent or compensation and have been outputting The Times works practically in their entirety.

Another point of interest is that, in the Complaint, The Times stated that it reached out to OpenAI in order to build a partnership, but the negotiations never resulted in a resolution. However, OpenAI has stated in its website post that the discussions with The Times had appeared to be progressing constructively. It said that the negotiations focused on a high-value partnership around real-time display with attribution in ChatGPT, where The Times would gain a new way to connect with their existing and new readers, and their users would gain access to The Times reporting. It stated,

“We had explained to The New York Times that, like any single source, their content didn't meaningfully contribute to the training of our existing models and also wouldn't be sufficiently impactful for future training. Their lawsuit on December 27—which we learned about by reading The New York Times—came as a surprise and disappointment to us”.

Clearly, there are two different sides to this story, and the court will need to sort out what took place in order to make a determination.

Ultimately, this case will have a significant impact on the relationship between generative AI and copyright law, particularly with respect to fair use. In particular, a decision in this case may provide much-needed clarification regarding the use of copyrighted works in the development of generative AI tools, such as OpenAI’s ChatGPT and Microsoft’s Bing Chat (Copilot), both of which are built on top of OpenAI’s GPT models.