
OpenAI Faces New Copyright Claims Over GPT-4o Data

IMAGE CREDITS: NPR

A new study from the AI Disclosures Project is raising tough questions about the data used to train OpenAI’s GPT-4o language model. The findings suggest the model demonstrates a strong ability to recognise paywalled, copyrighted content from O’Reilly Media books, content that should not have been used without permission.

The research comes from a team led by technologist Tim O’Reilly and economist Ilan Strauss, who are calling for greater transparency in the AI sector. Through their project, they aim to push for clearer corporate disclosures and responsible data practices. Their latest working paper argues that, just as financial markets rely on strong disclosure standards, the AI world needs similar mechanisms to protect creators and maintain trust.

To test whether OpenAI’s models were trained on copyrighted materials, the researchers used a legally obtained dataset of 34 proprietary O’Reilly books. They applied a technique known as the DE-COP membership inference attack to see whether OpenAI’s language models could distinguish between original book content and LLM-generated paraphrases.
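The core of a DE-COP-style test is a multiple-choice quiz: the verbatim passage is hidden among paraphrases, and the model is asked to pick the original. Accuracy well above chance suggests the passage appeared in training data. The sketch below illustrates that scoring loop only; the `model_pick` callable and other names are hypothetical stand-ins, not the study's actual code, which would query the model via an API.

```python
import random

def decop_accuracy(model_pick, passages, paraphrase_fn, n_paraphrases=3, seed=0):
    """DE-COP-style quiz: hide each verbatim passage among paraphrases
    and record how often the model identifies the original.
    Accuracy well above chance (1 / (n_paraphrases + 1)) hints that
    the passage was seen during training."""
    rng = random.Random(seed)
    hits = 0
    for text in passages:
        # Build the answer options: the original plus several paraphrases.
        options = [text] + [paraphrase_fn(text, i) for i in range(n_paraphrases)]
        rng.shuffle(options)
        correct = options.index(text)
        if model_pick(options) == correct:
            hits += 1
    return hits / len(passages)
```

In the study, the equivalent of `model_pick` would present the shuffled options to GPT-4o; here it is left as a placeholder so the quiz mechanics stand on their own.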

Their results were revealing. GPT-4o showed what the researchers call a “strong recognition” of paywalled O’Reilly content, achieving an AUROC score of 82%, which indicates a high likelihood that it had encountered the material before. In contrast, OpenAI’s earlier GPT-3.5 Turbo model scored just above 50%, suggesting no significant recognition.

Even more telling, GPT-4o performed better at recognising non-public book content than publicly accessible material, an 82% versus 64% AUROC comparison. GPT-3.5 Turbo reversed that trend, recognising public content more readily than private samples. Meanwhile, GPT-4o Mini, a smaller variant of the latest model, showed no measurable recognition of either dataset, with an AUROC score hovering around 50%.
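The AUROC figures quoted above measure how cleanly the model's recognition scores separate "member" passages (likely in the training data) from "non-member" ones: 50% is chance, 100% is perfect separation. A minimal, self-contained way to compute it is as a pairwise rank statistic; the scores below are made up for illustration.

```python
def auroc(member_scores, nonmember_scores):
    """AUROC = probability that a randomly chosen member passage
    receives a higher recognition score than a randomly chosen
    non-member passage (ties count as half a win)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))
```

A result near 0.5, like GPT-3.5 Turbo's and GPT-4o Mini's, means the two groups of scores are statistically indistinguishable, which is why the researchers read 82% as a strong recognition signal.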

The report points to possible data sourcing from LibGen, a known repository of pirated books that includes all the O’Reilly titles used in the test. Although the researchers can’t confirm how the data was obtained, they note that the presence of this content on LibGen raises serious concerns. They also acknowledge that newer models are better at distinguishing human-written text from machine-generated text, but argue that this does not reduce the credibility of their method.

The researchers took care to account for “temporal bias”, where older content might appear more novel to a model trained on newer data. To control for this, they tested both GPT-4o and GPT-4o Mini, which were trained on data from the same period. Despite the careful controls, GPT-4o still displayed clear recognition of non-public content.

While the report focuses on OpenAI and O’Reilly Media, it warns this is likely a broader, systemic problem in how large language models are built. If companies continue to train on copyrighted materials without compensating creators, the result could be a long-term decline in content quality and diversity across the web. When creators can’t earn revenue from their work, fewer of them can afford to produce it.

To address these issues, the AI Disclosures Project calls for stronger accountability and legal frameworks. They argue that requiring AI developers to disclose their training data sources—and making them liable for using protected content—could lead to the creation of new markets for data licensing and fair compensation for creators.

Regulatory developments may soon help enforce these ideas. The EU AI Act, which includes provisions for transparency in AI model training, could be a first step. If these rules are clearly defined and properly enforced, they could push the industry toward higher standards and more ethical practices.

Despite the serious findings, the report also notes positive developments. A market is starting to form around licensed data. Companies like Defined.ai are helping developers access high-quality training material with the consent of data providers. These services strip out personal data and ensure legal compliance, showing that fair and transparent AI training is possible.

The study ends by stating it provides empirical evidence that OpenAI likely trained GPT-4o using non-public, copyrighted data. By analysing responses to 34 O’Reilly Media books, the researchers offer one of the clearest signals yet that AI training practices need urgent reform.
