Intellectual Property and Generative Artificial Intelligence: Regulating Model Training

Carolina Rambaldi
English Law and French Law student at King’s College London and Paris-Panthéon-Assas University, Master 2 in European Law and Market Regulation

Amid rapid technological advances, the convergence of artificial intelligence (AI) and intellectual property rights has become a critical issue, sparking numerous legal disputes. Disagreements over the methods used to train generative artificial intelligence (GAI) models are intensifying. These models, which emulate human creativity, are trained on vast amounts of existing content.

This approach has triggered a series of lawsuits, with OpenAI—prominent for its ChatGPT model—frequently at the centre. While several major media groups, such as Axel Springer, Dotdash Meredith, The Financial Times, The Associated Press, and Le Monde, have collaborated with tech companies to regulate the use of their protected content,[1] others have chosen a more adversarial route. For instance, eight newspapers owned by Alden Global Capital have accused OpenAI of infringing their intellectual property by incorporating their works into training datasets without permission.[2] Companies like MidJourney and Meta AI are also facing similar actions.[3] However, the lawsuit filed by The New York Times against OpenAI and Microsoft on December 27, 2023, stands out, presenting concrete examples that, according to the newspaper, demonstrate that the actions of OpenAI and Microsoft require prior authorisation from copyright holders.[4]

The New York Times emphasises the need to protect its rights to uphold independent journalism, a cornerstone of democracy.[5] It argues that if news organisations lose control over their content, their ability to fund essential production investments will suffer, limiting resources for investigative and public-interest reporting. This could leave critical stories untold, to the detriment of society. The New York Times also criticises OpenAI’s transformation: founded in 2015 as a nonprofit, OpenAI restructured in 2019 to create a profit-driven subsidiary supported by a multi-billion-dollar investment from Microsoft.[6] Although this structure limits returns for investors and redirects excess profits back to the original nonprofit entity, concerns have been raised. OpenAI is now valued at around $80 billion,[7] and its planned restructuring as a public benefit corporation—a profit-oriented entity committed to the public good—could attract major investors such as Apple and Nvidia, a leading chip manufacturer.[8] The New York Times highlights a shift away from OpenAI’s original values, which once prioritised transparency and safety. This restructuring would help OpenAI attract new investments to compete with well-funded rivals like Google and Anthropic while addressing the high costs of developing advanced AI. Yet experts, including OpenAI co-founder Elon Musk, warn of a possible concentration of power and a shift toward profit over safety and ethics.[9] The tech industry, led by companies like OpenAI and Microsoft, appears to be racing toward increasingly powerful AI systems, where caution may be left behind.[10]

In its lawsuit, The New York Times raises multiple claims against OpenAI and Microsoft. First, it alleges direct copyright infringement, asserting that OpenAI incorporated protected works from the newspaper into its training datasets without authorisation. Microsoft is accused of secondary infringement, both vicarious (having controlled and benefited from OpenAI’s actions) and contributory (having technically facilitated these infringements). The New York Times also cites a violation of the Digital Millennium Copyright Act due to the removal of copyright management information. Claims of unfair competition and trademark dilution are also included, with the newspaper arguing that the unauthorised use of its trademarks in AI-generated content weakens its distinctiveness and harms its commercial reputation. In response to these violations, the newspaper is seeking billions of dollars in damages and a permanent injunction.[11]

A historical comparison can be drawn with New York Times Co. v. Tasini (2001), where the newspaper was accused of including freelance authors’ articles in databases without authorisation.[12] At that time, the newspaper argued that removing the content could compromise the integrity of digital databases. Today, however, by demanding the removal of GPT models containing its works, The New York Times appears to be taking the opposite stance.

This illustrates that the complexity of this issue goes beyond mere copyright protection, raising broader challenges related to technological innovation. Generative AI models rely on vast datasets to generate innovative outputs, with significant implications for sectors such as research, finance, law, and education. The key issue is to reconcile creators’ rights with the technological progress of AI within an appropriate legal framework.

This article addresses two regulatory approaches to generative AI training. First (I), we will examine copyright exceptions by comparing the U.S. fair use doctrine with European text and data mining (TDM) rules. Second (II), we will analyse the shift towards a transparency obligation, driven by legislative initiatives in both Europe and the United States.

I – Copyright exceptions: between Fair Use and Text and Data Mining

In its public statement on January 8, 2024, titled OpenAI and Journalism, OpenAI argued that the use of protected works to train its models falls under the fair use exception.[13] The company emphasised the transformative nature of data usage in training its AI models. However, the court’s decision on this point is highly anticipated. The fair use doctrine, codified in the Copyright Act of 1976, permits limited use of protected works without prior authorisation, based on four key criteria courts use to determine whether a use qualifies as fair use or constitutes copyright infringement.[14]

  1. The purpose and character of the use: This criterion examines whether the use is transformative, meaning it alters the original work to create something new. Transformative use, particularly for non-commercial purposes, is more likely to qualify as fair use. Commercial use is subject to stricter scrutiny.
  2. The nature of the protected work: Creative works, such as novels or films, enjoy stronger protection, while factual works, like textbooks or scientific articles, are more likely to fall under the fair use exception.
  3. The amount and substantiality of the use: This criterion assesses the proportion of the work used. Reproducing an entire work makes fair use harder to justify, though using small portions can still be problematic if they are essential to the work.
  4. The effect of the use on the market: This criterion examines the impact of the use on the potential market for the original work. If the use diminishes demand for the work or competes directly with it, fair use will be harder to justify.

A notable example of this doctrine’s application is the Google Books case.[15] In 2004, Google launched Google Book Search, a service that digitised out-of-print books in partnership with several libraries. Some works were fully digitised, while others were only available as excerpts. Accused of copyright infringement by the Association of American Publishers and the Authors Guild in 2005, Google argued that its service was transformative, as it increased the visibility of works without harming their market. The court ruled in Google’s favour, determining that the purpose of the service—facilitating book search and discovery—did not negatively impact book sales. This case illustrates how the fair use exception could apply to training generative AI models. Just as Google Books digitised significant amounts of content to create a new product, training generative AI models could be seen as transformative use.

In its lawsuit against OpenAI, The New York Times claims that training AI models like ChatGPT on its protected works constitutes unauthorised reproduction. OpenAI has previously argued that its AI models merely analyse concepts without reproducing the actual texts. The company likens this approach to human learning, where assimilating concepts from protected content does not constitute copyright infringement.[16] Additionally, OpenAI maintains that the training process only extracts unprotected elements, such as ideas or facts. This defence recalls the 2019 CJEU decision in Pelham,[17] where the Court ruled that the use of a modified, unrecognisable sound sample did not require authorisation. Similarly, if copyrighted material used to train GAI models is altered to the point of being unrecognisable, this may not constitute copyright infringement.

This debate highlights the legal uncertainty surrounding the application of fair use to AI, an approach based on interpretative criteria left to judicial discretion. In Europe, copyright exceptions are governed by the DSM Directive (2019/790),[18] which regulates text and data mining (TDM). However, these provisions do not specifically address the training of GAI models. Article 3 of the Directive allows research organisations and heritage institutions to conduct TDM for scientific research without prior authorisation. Article 4 extends this exception to commercial uses, provided rights holders have not explicitly opted out. These exceptions remain limited to reproduction rights and do not permit public disclosure of extracted data. Additionally, access to protected works must be lawful, raising questions about the availability of content online without legal restrictions.

Recital 18 of the DSM Directive specifies that these exceptions apply to AI operating for purely statistical purposes, with copies retained only for the duration necessary for data mining. Consequently, some argue that this Directive was not designed to regulate GAI models, which require massive datasets for training.[19] Additionally, questions remain regarding the compatibility of the TDM exception with the European three-step test, which mandates that exceptions neither impair the normal exploitation of the work nor unjustifiably harm rights holders.[20]

Nevertheless, the European Commission confirmed the applicability of these exceptions in a statement by Thierry Breton on March 31, 2023, while the European Union’s Artificial Intelligence Act, adopted in May 2024, goes further.[21] Article 53(1)(c) builds on the data mining exception, allowing providers to use protected works unless rights holders have explicitly opted out. This provision may therefore apply to GAI model training, with its scope extended by Article 2 to cover any AI model use within the EU, regardless of the provider’s or developer’s location.

II – Trend Towards a Data Disclosure Requirement

To ensure that opposition rights, particularly through opt-out mechanisms, are respected, the European AI Act imposes transparency obligations. Article 53(1)(d) requires AI providers to publish a sufficiently detailed summary of the content used to train their models, according to a template provided by the AI Office. However, questions remain about how effectively this requirement will be applied: Will the summary be detailed enough to allow the identification of copyright-protected content? In response to these uncertainties, France tasked the Higher Council for Literary and Artistic Property (CSPLA) in April 2024 with defining the specific information that AI providers must disclose.[22]

A parallel development is underway in the United States, where the Generative AI Copyright Disclosure Act,[23] introduced in Congress on April 9, 2024, mandates similar transparency. Any entity developing or modifying training datasets must submit a detailed summary of the data used to the U.S. Copyright Office before the commercialisation of models. For online datasets, a simple URL is sufficient, and a public registry will centralise this information.

Unlike the European regulation, which applies broadly to all AI providers, the U.S. legislation differentiates between companies creating datasets and those modifying them, offering a more nuanced approach.[24] Another significant difference is the U.S. disclosure requirement, which must be met at least 30 days before commercialisation and applies retroactively to models already released before the law’s enactment. This measure addresses concerns raised by the Federal Trade Commission, which, in its June 29, 2023, analysis, highlighted the competitive advantage gained by companies with unrestricted data access in the past, creating barriers for new entrants. The FTC has called for measures to restore fair competition.[25] The practical implementation of these provisions remains to be seen, particularly regarding retroactivity, where the “machine unlearning” process appears especially complex. This underlines the importance of rigorously regulating the future use of data by GAI models.


[1] Benjamin Mullin, ‘OpenAI and News Corp Strike Deal Over Use of Content’ New York Times (22 May 2024)

[2] Benjamin Mullin, ‘Newspapers Sued Microsoft and OpenAI Over AI Copyright Infringement’ New York Times (30 April 2024) 

[3] Stéphanie Carre, ‘Intelligence artificielle générative : entre adoption d’un règlement européen et nouvelle action américaine contre la violation massive du copyright du New York Times’ (Dalloz actualité, 15 février 2024)

[4] Benjamin Mullin, ‘New York Times Sues OpenAI and Microsoft Over Copyright Infringement’ New York Times (27 December 2023) 

[5] Complaint, The New York Times Company v Microsoft Corporation and OpenAI (SDNY, 27 December 2023)

[6] Dan Milmo, ‘OpenAI Planning to Become For-Profit Company, Say Reports’ The Guardian (26 September 2024)

[7] ‘OpenAI, l’entreprise créatrice de ChatGPT, valorisée désormais à 80 milliards de dollars’ Le Figaro (18 February 2024)

[8] Aaron Tilley, ‘OpenAI in Talks with Apple for Funding to Develop ChatGPT’ Wall Street Journal (18 October 2024)

[9] Dan Milmo, ‘Why Is OpenAI Planning to Become a For-Profit Business and Does It Matter?’ The Guardian (26 September 2024)

[10] Nidhi Subbaraman, ‘OpenAI Restructuring Is a “Natural Consequence” of an AI Arms Race’ (Cornell University, 13 October 2023)

[11] Graeme Massie, ‘New York Times Sues Microsoft and OpenAI over Copyright Infringement’ The Independent (27 December 2023)

[12] ‘NYT v. OpenAI: The Times’s About-Face’ (Harvard Law Review Blog, 2 April 2024)

[13] ‘OpenAI and Journalism’ (OpenAI, 8 January 2024)

[14] Copyright Act 1976, 17 USC §§ 101-810 (1976)

[15] Authors Guild v Google Inc [2015] 804 F 3d 202 (2nd Cir)

[16] Anthropic, ‘Response to the Copyright Office’s Notice of Inquiry on Copyright and Artificial Intelligence [Docket No. 2023-6]’ (2023);  Google LLC, ‘Comments in Response to Notice of Inquiry, “Artificial Intelligence and Copyright”’, 88 Fed. Reg. 59942 (COLC-2023-0006) (30 October 2023)

[17] Pelham GmbH v Hütter and Schneider-Esleben (C-476/17) [2019] ECLI:EU:C:2019:624

[18] Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market [2019] OJ L130/92

[19] Anne-Laure Caquet, ‘L’intelligence artificielle générative : l’Union européenne relaie le droit d’auteur au rang des exceptions’ (Village de la Justice, 24 mai 2024)

[20] ibid

[21] Thierry Breton, ‘Communiqué du 31 mars 2023’

[22] Anne-Laure Caquet, ‘L’intelligence artificielle générative : l’Union européenne relaie le droit d’auteur au rang des exceptions’ (Village de la Justice, 24 mai 2024)

[23] Betty Jeulin, ‘Analyse du projet de loi américain sur la divulgation des données d’entraînement des IA génératives’ (Dalloz actualité, 27 mai 2024)

[24] ibid

[25] ibid
