AI models from Apple, Salesforce, Anthropic, and other major players were trained on thousands of YouTube videos without the creators’ consent, and possibly in violation of YouTube’s terms, according to an investigation from Proof News, co-published with Wired.
The companies trained their models in part by using “The Pile,” a dataset the nonprofit EleutherAI assembled as an effort to give individuals and companies that can’t compete with Big Tech access to a valuable training dataset, though it has since been used by those larger companies as well.
The collection includes books, Wikipedia articles, and much more. It also includes YouTube captions collected via YouTube’s Captions API, extracted from 173,536 YouTube videos across more than 48,000 channels. That includes videos from big YouTubers like MrBeast and PewDiePie, as well as popular tech commentator Marques Brownlee. On X, Brownlee wrote:
Apple has sourced data for their AI from several companies.
One of them scraped tons of data/transcripts from YouTube videos, including mine.
Apple technically avoids “fault” here because they’re not the ones doing the scraping.
But this is going to be a long-running, evolving problem.
It also includes channels from various mainstream and online media brands, including videos written, produced, and published by Ars Technica and its staff, as well as videos from other Condé Nast brands such as Wired and The New Yorker.
Ironically, one of the most notable videos swept up in the dataset was a short film produced by Ars Technica whose shaggy-dog story was itself written by AI. Proof News’ article also notes that the dataset includes videos of a parrot, so the AI models trained on it are parroting a parrot parroting human speech, in addition to parroting other AI.
As AI-generated content continues to proliferate on the web, it will become increasingly difficult to build training datasets that don’t already contain AI-produced content.
To be clear, some of this is not new information. The Pile is regularly referenced and cited in AI circles and has previously been identified as a dataset tech companies use for training. It has been named in multiple lawsuits brought by intellectual property owners against AI and tech companies. The defendants in those suits, including OpenAI, argue that this type of scraping constitutes fair use. The lawsuits have not yet been resolved in court.
Proof News, however, did some digging to surface specifics about the use of YouTube captions and built a tool you can use to search the collection for individual videos or channels.
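Proof News’ tool is an interactive lookup, but the same kind of check is easy to sketch locally against a metadata listing of the scraped videos. The snippet below is a minimal illustration only: the file name `youtube_subtitles_index.csv` and its `channel_name`, `video_id`, and `title` columns are hypothetical stand-ins, not the dataset’s actual layout.

```python
import csv

def find_channel_entries(index_path: str, channel_query: str):
    """Return rows whose channel name contains the query (case-insensitive).

    Assumes a hypothetical CSV index with channel_name, video_id, and title
    columns; the real dataset's metadata layout may differ.
    """
    matches = []
    with open(index_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if channel_query.lower() in row["channel_name"].lower():
                matches.append(row)
    return matches

if __name__ == "__main__":
    # Example: list any indexed videos from a given channel.
    for row in find_channel_entries("youtube_subtitles_index.csv", "Ars Technica"):
        print(row["video_id"], row["title"])
```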
The work shows just how far-reaching the data collection is, and it draws attention to how little power intellectual property owners have to track how their work is used once it is publicly accessible on the Internet.
It’s worth noting, though, that this data was not necessarily used to train models that generate competing content reaching end users. For example, Apple may have trained on the dataset for research purposes, or to strengthen autocomplete for typing text on its devices.
Creators’ reactions
Proof reached out to many of these creators, as well as the companies involved, for comment. Most creators were surprised that their content had been used in this way, and those who responded made statements criticizing EleutherAI and the companies that used its dataset. For example, David Pakman of The David Pakman Show said:
Nobody came to me and said, “We’d like to use this”… This is my livelihood, and I put time, resources, money, and staff into creating this content. There’s really no shortage of work.
Julia Walsh, CEO of the production company Complexly, which is responsible for SciShow and other Hank and John Green educational content, said:
We are disappointed to learn that our thoughtfully designed educational content has been used in this way without our consent.
There is also the question of whether scraping this content violates YouTube’s terms, which forbid accessing videos by “automated means.” EleutherAI founder Sid Black has said he built a script that retrieves the captions through YouTube’s API, the same way a web browser does.
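For a sense of what such a script involves: a browser that displays subtitles requests them from YouTube’s public `timedtext` endpoint, and a script can do the same. The sketch below is only an illustration of that general approach, not EleutherAI’s actual code; the endpoint’s parameters and response format are assumptions and have changed over time, so many videos may return nothing.

```python
import requests

def fetch_captions(video_id: str, lang: str = "en") -> str:
    """Request a caption track for one video from YouTube's public timedtext
    endpoint (the endpoint a browser hits when it renders subtitles).

    Illustrative sketch only: the required parameters and response format are
    assumptions and may not work for every video or caption language.
    """
    resp = requests.get(
        "https://www.youtube.com/api/timedtext",
        params={"v": video_id, "lang": lang},
        timeout=10,
    )
    resp.raise_for_status()
    # Returns caption XML when a track is available, otherwise an empty body.
    return resp.text

if __name__ == "__main__":
    # Example call with an arbitrary public video ID.
    print(fetch_captions("dQw4w9WgXcQ")[:500])
```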
Anthropic is one of the companies that trained on the dataset, and for its part, it claims there is no violation here. Spokesperson Jennifer Martinez said:
The Pile includes a very small subset of YouTube subtitles… YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors.
A Google spokesperson told Proof News that Google has “taken action over the years to prevent abusive, unauthorized scraping” but did not offer a more specific response. This isn’t the first time AI and tech companies have faced accusations of training models on YouTube videos without permission. Notably, OpenAI (the company behind ChatGPT and the video generation tool Sora) is believed to have scraped YouTube data to train its models, although those allegations have not been proven.
In an interview with Nilay Patel of The Verge, Google CEO Sundar Pichai suggested that the use of YouTube videos to train OpenAI’s Sora may have violated YouTube’s rules. Granted, this usage is different from scraping captions via the API.