
Apple, Anthropic and other companies used YouTube videos to train AI



More than 170,000 YouTube videos are part of a massive dataset that has been used to train AI systems for some of the biggest technology companies, according to an investigation by Proof News that was co-published with Wired. Apple, Anthropic, Nvidia and Salesforce are among the technology companies that have used “YouTube Captions” data that was scraped from the video platform without permission. The training dataset is a collection of subtitles taken from YouTube videos belonging to more than 48,000 channels; it does not include images from the videos.

Videos from popular creators like MrBeast and Marques Brownlee appear in the dataset, as do clips from news outlets like ABC News, BBC and The New York Times. More than 100 videos from The Verge appear in the dataset, along with many other videos from Vox.

“Apple obtained data for its AI from several companies,” Brownlee, known by his handle MKBHD, wrote in a post on X. “One of them copied tons of data/transcripts from YouTube videos, including mine.” He added: “This will be an evolving issue for a long time.”

YouTube did not immediately respond to The Verge's request for comment.

As part of its investigation, Proof News also released an interactive search tool. You can use the search feature to see if your content, or that of your favorite YouTuber, appears in the dataset.

The caption dataset is part of a larger collection of material from the nonprofit EleutherAI called The Pile. The open-source collection also contains datasets drawn from books, Wikipedia articles, and more. Last year, an analysis of a dataset called Books3 revealed which authors’ works were used to train AI systems, and the dataset has been cited in lawsuits filed against companies that used it to train AI.

AI companies are rarely transparent about the data that goes into their AI systems. How YouTube content in particular is being used has been a pressing question in recent months. In March, when OpenAI unveiled its powerful video generation tool Sora, CTO Mira Murati repeatedly dodged questions about whether the system was trained on YouTube videos.

“I’m not going to go into detail about the data that was used, but it was publicly available or licensed data,” she told The Wall Street Journal at the time. When pressed by the newspaper specifically about YouTube content, Murati said she was “not sure about that.”

In previous interviews, YouTube CEO Neal Mohan has said that using video content to train AI – including transcripts – would violate the platform’s terms.




