OpenAI mistakenly deleted potential evidence in a New York Times copyright lawsuit

Lawyers for The New York Times and the Daily News, which are suing OpenAI for allegedly using their work to train its AI models without permission, say OpenAI engineers mistakenly deleted data potentially relevant to the case.

Earlier this fall, OpenAI agreed to provide two virtual machines so consultants for The Times and Daily News could search its AI training sets for their copyrighted content. (Virtual machines are software-based computers that exist within another computer’s operating system, often used for testing purposes, backing up data, and running applications.) In a letter, the publishers’ lawyers say they and the experts they hired have spent more than 150 hours since November 1 poring over OpenAI’s training data.

But on November 14, OpenAI engineers erased all of the publishers’ search data stored on one of the virtual machines, according to the letter, which was filed late Wednesday with the US District Court for the Southern District of New York.

OpenAI attempted to recover the data, and was mostly successful. However, because the folder structure and file names were “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build (OpenAI) models,” according to the letter.

“The news plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time,” attorneys for The Times and Daily News wrote. “Plaintiffs learned only yesterday that the recovered data is unusable and that an entire week’s worth of experts’ and lawyers’ work must be re-done, which is why this supplemental letter is being filed today.”

Plaintiffs’ counsel say they have no reason to believe the deletion was intentional. But they argue the incident confirms that OpenAI is “in the best position to search its own datasets” for potentially infringing content using its own tools.

An OpenAI spokesperson declined to comment.

In this case and others, OpenAI has maintained that training models on publicly available data — including articles from The Times and Daily News — is fair use. In other words, in creating models like GPT-4o, which “learn” from billions of examples drawn from e-books, articles, and more to generate human-sounding text, OpenAI believes it is not required to license or pay for those examples — even when it makes money from those models.

However, OpenAI has signed licensing agreements with a growing number of news publishers, including the Associated Press, Business Insider owner Axel Springer, the Financial Times, People’s parent company Dotdash Meredith, and News Corp. OpenAI has declined to make the terms of these agreements public, but one content partner, Dotdash, is reportedly paid at least $16 million annually.

OpenAI has neither confirmed nor denied that it trained its AI systems on any specific copyrighted works without permission.
