Meta AI can reproduce nearly half of first Harry Potter book verbatim

A recent study has revealed that Meta's LLaMA 3.1 AI model can reproduce significant portions of copyrighted texts, including the first Harry Potter book.

The research was conducted by a group of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University.

They tested five leading open-weight models—three from Meta and one each from Microsoft and EleutherAI—for their ability to reproduce text from Books3, a collection of books commonly used to train large language models (LLMs).


Meta AI has memorized 42% of first Harry Potter book


The study found that LLaMA 3.1 70B, a mid-sized model released by Meta in July 2024, was more likely to reproduce text from Harry Potter and the Sorcerer's Stone than any of the other four models tested.

The research estimates that this particular model has memorized as much as 42% of the first Harry Potter book well enough to recall 50-token excerpts at least half the time.

Reproducing one specific 50-token continuation that reliably points to memorization of the text itself, not merely to statistical prediction of plausible-sounding prose.


Study's methodology and limitations


The researchers divided 36 books into overlapping 100-token passages and used the first 50 tokens as a prompt to estimate how likely the model was to reproduce those exact next 50 tokens.

They considered a passage "memorized" if the model had more than a 50% chance of reproducing it verbatim.
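
The estimate described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: it assumes per-token log-probabilities have already been obtained from a model, and that the probability of a whole 50-token continuation is the product of its per-token conditional probabilities (the chain rule). The function names and the 10-token stride are hypothetical; the article says only that the passages overlap.

```python
import math
from typing import List, Sequence, Tuple


def split_passages(
    tokens: Sequence[int], length: int = 100, stride: int = 10
) -> List[Tuple[Sequence[int], Sequence[int]]]:
    """Slide a window over a book's tokens and split each window into
    a (prompt, target) pair: first half is the prompt, second half the
    text the model would need to reproduce. The stride is an assumption."""
    half = length // 2
    return [
        (tokens[i:i + half], tokens[i + half:i + length])
        for i in range(0, len(tokens) - length + 1, stride)
    ]


def continuation_probability(token_logprobs: Sequence[float]) -> float:
    """Probability of the exact continuation: summing log-probabilities
    is equivalent to multiplying the per-token probabilities."""
    return math.exp(sum(token_logprobs))


def is_memorized(token_logprobs: Sequence[float], threshold: float = 0.5) -> bool:
    """A passage counts as memorized if the chance of reproducing it
    verbatim exceeds the threshold (50% in the study)."""
    return continuation_probability(token_logprobs) > threshold


def memorized_fraction(
    all_logprobs: Sequence[Sequence[float]], threshold: float = 0.5
) -> float:
    """Share of passages in a book that pass the memorization test."""
    if not all_logprobs:
        return 0.0
    hits = sum(is_memorized(lp, threshold) for lp in all_logprobs)
    return hits / len(all_logprobs)
```

Under this product rule the 50% threshold is demanding: the average per-token probability has to exceed 0.5^(1/50), or roughly 98.6%, across all 50 tokens. A continuation whose tokens each have probability 0.99 passes, while one at 0.98 per token does not.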

The study provides evidence that significant portions of Harry Potter and the Sorcerer's Stone were copied into LLaMA 3.1's weights, but it doesn't definitively explain why or how this occurred.


More likely to reproduce widely read books


The study also found that LLaMA 3.1 70B was more likely to reproduce text from other popular books, such as The Hobbit and George Orwell's 1984, than from lesser-known ones.

This suggests that the model's higher memorization rate applies to widely read texts in general, not just Harry Potter.

James Grimmelmann, a Cornell law professor who worked with some of the paper's authors, noted "there are really striking differences among models in terms of how much verbatim text they have memorized."


Findings complicate the ongoing AI copyright debate


The findings give both sides of the AI copyright debate something to work with. Critics of the industry can point to evidence that, for some models and some books, memorization is not a fringe phenomenon.

However, the study found significant memorization for only a few popular books, such as Harry Potter.

This could pose challenges for law firms pursuing class-action lawsuits against AI companies: because the results diverge so widely from one text to another, it is unclear whether authors can be grouped together in a single mass lawsuit.