AI Controversy: Tech Companies Use Unconsented Data to Train AI

In their quest to build smarter AI, some of the world’s richest tech companies have been quietly taking data from creators’ content without consent.

A recent investigation by Proof News revealed a major AI controversy: tech giants like Anthropic, Nvidia, Apple, and Salesforce have been using subtitles from 173,536 YouTube videos, taken from over 48,000 channels, to train their AI models. This was done despite YouTube’s strict rules against unauthorised data harvesting from its platform.

The YouTube Subtitles dataset

The dataset, known as YouTube Subtitles, includes transcripts from educational and online learning channels such as Khan Academy, MIT, and Harvard. Major media outlets like The Wall Street Journal, NPR, and the BBC also had their content used without permission. Even popular TV shows like “The Late Show with Stephen Colbert” and “Last Week Tonight with John Oliver” had their videos utilised in this manner.

High-profile creators affected

Proof News found that material from YouTube megastars, including MrBeast and Marques Brownlee, was used in training these AI models. Some of these creators expressed frustration and disappointment upon discovering that their content had been used without their consent.

David Pakman, host of “The David Pakman Show,” a popular politics channel, had nearly 160 of his videos included in the dataset. Pakman, whose channel is a significant source of his livelihood, emphasised the need for compensation, especially since some media companies have struck deals to be paid for the use of their work to train AI.

More on the AI controversy: the use of data

Dave Wiskus, CEO of Nebula, a streaming service partially owned by creators, called this unauthorised use of content “theft” and “disrespectful.” He highlighted the irony that these AI models, trained on creators’ work without permission, could potentially replace those very creators in the future.

Investigation reveals that a dataset used for gen AI training by Apple & others contains copyrighted YouTube transcripts accessed without permission. More info:

- The Pile dataset contains transcripts of 170k YouTube videos
- Used by Apple, Anthropic, Nvidia, Salesforce & more… pic.twitter.com/RE0UjhumA3
— Ed Newton-Rex (@ednewtonrex) July 16, 2024

The mechanics behind the dataset

The YouTube Subtitles dataset contains plain text from video subtitles, often with translations into languages like Japanese, German, and Arabic. According to a research paper by EleutherAI, the dataset is part of a larger compilation called the Pile, which also includes material from the European Parliament, English Wikipedia, and Enron Corporation employees’ emails.

Tech giants like Apple, Nvidia, and Salesforce have used the Pile to train their AI models, as indicated in their research papers and posts. These companies, valued in the hundreds of billions to trillions of dollars, have benefited from this publicly available data.

A call for regulation

Amy Keller, a consumer protection attorney, is sounding the alarm. She says tech companies have “run roughshod” over creators’ rights, and it’s time for stricter rules. Are these companies stealing creators’ hard work? Keller thinks so and emphasises that creators don’t have a say in how their content is used.

Proof News also found that YouTube Subtitles included over 12,000 videos that have since been deleted from YouTube. This brings up another question: What happens to our data once AI has it?

The future of AI and creators after this AI controversy

Many creators are uncertain about the future, fearing that AI could generate content similar to theirs or produce outright copycats. David Pakman experienced this firsthand when he encountered a TikTok video that was labelled as a Tucker Carlson clip but was actually his own script read by an AI voice clone of Carlson.

This situation highlights a troubling pattern in the AI industry. One organisation scrapes copyrighted work, claiming ignorance of its future use, while another, often a large tech company, trains commercial AI models on that scraped work. This practice exploits creators and raises serious questions about the balance between technological advancement and ethical responsibility.

The AI controversy underscores the urgent need for clearer regulations and fair compensation for creators. As AI technology continues to evolve, it is crucial to develop ethical and legal frameworks that protect the rights and livelihoods of content creators.

This AI controversy serves as a stark reminder that innovation should not come at the expense of ethical responsibility and respect for creators’ work.

Follow Conn3cted for the latest news on technology.
