Harvard and Google to release 1 million public-domain books as a dataset for AI training

AI training data carries a high price tag, one that mostly deep-pocketed technology companies can afford. That’s why Harvard plans to release a dataset of 1 million public-domain books, spanning genres, languages, and authors including Dickens, Dante, and Shakespeare, that are no longer protected by copyright due to their age.

The new dataset is not yet available, and it is not clear when or how it will be released. However, it contains books drawn from Google’s book-scanning project, Google Books, so Google will be involved in releasing “this treasure trove at scale.”

Harvard first teased its Institutional Data Initiative (IDI) back in March, explaining its plans to create a “trusted channel for AI legal data.” However, not many had heard of it until its official launch today, which came with confirmation that IDI has financial support from Microsoft and OpenAI.

Greg Lippert, executive director of IDI, says the dataset is designed to “level the playing field” by opening up this massive collection to anyone, from research labs to AI startups, who wants to train their own large language models (LLMs).
