Why Facebook-parent Meta may face same 'AI copying' problem as ChatGPT-maker OpenAI, Microsoft
Facebook parent Meta's newest AI model, Llama 3.1, has been found to replicate passages from well-known books, including Harry Potter, far more frequently than anticipated, according to a new report, which notes that many of these works remain under copyright. Researchers claim that the AI has memorised roughly 42% of the first Harry Potter book and can accurately reproduce 50-word sections of it about half the time. The study, conducted by experts from Stanford, Cornell, and West Virginia University, examined how five leading AI models processed the Books3 dataset, which includes thousands of copyrighted titles.
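The report does not spell out the researchers' exact code, but the basic idea behind such a memorisation test can be sketched roughly: feed the model the opening tokens of a passage and check whether it continues the text verbatim. The snippet below is a simplified Python illustration using the Hugging Face transformers library; the model name, token counts and greedy-decoding check are assumptions for illustration, and the study itself reportedly measured reproduction probabilities rather than a single generated continuation.

```python
# Simplified, illustrative memorisation probe (not the study's actual method).
# Assumptions: a Hugging Face causal LM is available; MODEL_NAME is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # placeholder; any causal LM works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def reproduces_verbatim(passage: str, prefix_len: int = 50, target_len: int = 50) -> bool:
    """Prompt with the first `prefix_len` tokens of a passage and check whether
    greedy decoding reproduces the next `target_len` tokens exactly."""
    ids = tokenizer(passage, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_len].unsqueeze(0)            # prompt: shape (1, prefix_len)
    target = ids[prefix_len:prefix_len + target_len]  # ground-truth continuation

    output = model.generate(prefix, max_new_tokens=target_len, do_sample=False)
    continuation = output[0][prefix_len:prefix_len + target_len]
    return bool((continuation == target).all())
```

Running such a probe over many 50-token windows of a book and counting how often the check succeeds would give a rough memorisation rate in the spirit of the figures quoted above.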
"Llama 3.1 70B—a mid-sized model Meta released in July 2024—is far more likely to reproduce Harry Potter text than any of the other four models, the researchers found.
"Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3," the researchers wrote.
Meta's Llama 3.1 was found to retain large portions of well-known books, including The Hobbit, 1984, and Harry Potter and the Sorcerer's Stone. In contrast, an earlier version, Llama 1, memorised only around 4% of Harry Potter, suggesting that the newer model preserves significantly more copyrighted content.
Why Meta's models are reproducing exact text
Researchers suggest several reasons why Meta's AI models may be copying text verbatim. One possibility is that the same books were repeatedly used during training, reinforcing memorisation rather than generalising language patterns.
Others speculate that the training data could have included excerpts from fan websites, reviews, or academic papers, leading the model to inadvertently retain copyrighted content. Additionally, adjustments to the training process may have amplified the issue without developers realising the extent of its impact.
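As a rough illustration of the first hypothesis, a common way labs try to reduce verbatim memorisation is to deduplicate the training corpus so the model does not see the same passage many times. The sketch below is a generic exact-duplicate filter, shown only to illustrate the idea; it is not based on Meta's actual data pipeline.

```python
# Generic exact-duplicate filter for a training corpus (illustrative only,
# not Meta's pipeline). Keeps the first copy of each document, keyed by hash.
import hashlib

def dedup_exact(documents: list[str]) -> list[str]:
    seen: set[str] = set()
    unique_docs: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs
```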
What this means for Meta
These findings intensify concerns about how AI models are trained and whether they might be violating copyright laws. As authors and publishers push back against unauthorised use of their work, this could become a major challenge for tech companies like Meta.
In December 2023, The New York Times sued OpenAI and Microsoft for copyright infringement, alleging that their AI models, including ChatGPT, were trained on copyrighted articles without permission. According to the Times, OpenAI “can generate output that recites Times' content verbatim, closely summarizes it, and mimics its expressive style.” It said the AI company essentially stole its intellectual property.
"Llama 3.1 70B—a mid-sized model Meta released in July 2024—is far more likely to reproduce Harry Potter text than any of the other four models, the researchers found.
"Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3," the researchers wrote.
Meta’s Llama 3.1 has been noted for retaining large portions of well-known books, including The Hobbit, 1984, and Harry Potter and the Sorcerer’s Stone. In contrast, earlier versions, such as Llama 1, only memorized around 4% of Harry Potter. This suggests that the newer model is preserving significantly more copyrighted content.
Why Meta's models are reproducing exact text
Others speculate that training data could include excerpts from fan websites, reviews, or academic papers, leading the model to inadvertently retain copyrighted content. Additionally, adjustments to the training process may have amplified this issue without developers realizing the extent of its impact.
What this means for Meta
These findings intensify concerns about how AI models are trained and whether they might be violating copyright laws. As authors and publishers push back against unauthorised use of their work, this could become a major challenge for tech companies like Meta.
Earlier this year, The New York Times sued OpenAI and Microsoft for copyright infringement, alleging that their AI models, including ChatGPT, were trained on copyrighted articles without permission. According to the Times, OpenAI, “can generate output that recites Times' content verbatim, closely summarizes it, and mimics its expressive style.” It said that the AI company essentially stole their intellectual property.