OpenAI mistakenly deleted potential evidence in New York Times copyright lawsuit (updated)

Lawyers for The New York Times and the Daily News, which are suing OpenAI for allegedly scraping their work to train its AI models without permission, say OpenAI engineers mistakenly deleted data potentially relevant to the case.

Earlier this fall, OpenAI agreed to provide two virtual machines so that consultants for The Times and Daily News could search for the publishers' copyrighted content in its AI training sets. (Virtual machines are software-based computers that exist within another computer’s operating system, often used for testing, backing up data, and running applications.) In a letter, lawyers for the publishers say they and the experts they hired have spent more than 150 hours since November 1 searching OpenAI's training data.

But on November 14, OpenAI engineers erased all of the publishers' search data stored on one of the virtual machines, according to the aforementioned letter, which was filed with the U.S. District Court for the Southern District of New York late Wednesday.

OpenAI attempted to recover the data and was mostly successful. However, because the folder structure and file names were “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI's] models,” according to the letter.

“News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time,” counsel for The Times and Daily News wrote. “The news plaintiffs learned only yesterday that the recovered data is unusable and that an entire week’s worth of its experts’ and lawyers’ work must be re-done, which is why this supplemental letter is being filed today.”

The plaintiffs’ counsel make clear that they have no reason to believe the deletion was intentional. But they say the incident underscores that OpenAI is “in the best position to search its own data sets” for potentially infringing content using its own tools.

An OpenAI spokesperson declined to provide a statement.

But late on Friday, November 22, OpenAI’s counsel filed a response to the letter sent by lawyers for The Times and Daily News on Wednesday. In their response, OpenAI’s lawyers unequivocally denied that OpenAI deleted any evidence, and instead suggested that the plaintiffs were to blame for a system misconfiguration that led to the technical issue.

“Plaintiffs requested a configuration change to one of the many machines OpenAI provided to search training data sets,” OpenAI’s counsel wrote. “However, implementing the change requested by Plaintiffs resulted in the removal of the folder structure and some file names on a single hard drive, a drive that was intended to be used as a temporary cache… In any event, there is no reason to believe that any files were actually lost.”

In this case and others, OpenAI has maintained that training models using publicly available data, including articles from The Times and Daily News, is fair use. In other words, when creating models like GPT-4o, which “learn” from billions of examples of e-books, articles, and more to generate human-sounding text, OpenAI believes it is not required to license or pay for those examples, even if it makes money from those models.

However, OpenAI has signed licensing agreements with a growing number of new publishers, including the Associated Press, Business Insider owner Axel Springer, the Financial Times, People’s parent company Dotdash Meredith, and News Corp. OpenAI has declined to make the terms of these agreements public, but one content partner, Dotdash, is reportedly being paid at least $16 million per year.

OpenAI has neither confirmed nor denied that it trained its AI systems on any specific copyrighted works without permission.

Update: Added OpenAI’s response to the allegations.
