OpenAI reportedly used YouTube videos to train GPT-4 AI model

This move, if proven true, could pose legal challenges for the AI firm, which is already entangled in multiple lawsuits regarding the use of copyrighted data.

New reports suggest that OpenAI employed data from YouTube videos, amounting to over a million hours, to train its latest AI model, GPT-4.

OpenAI allegedly turned to transcripts of YouTube videos after exhausting its existing supply of text for training AI models.

Separately, a recent report highlighted mini chatbots in OpenAI's GPT Store that reportedly violated the company's guidelines.

According to The New York Times, OpenAI developed an automatic speech recognition tool named Whisper to transcribe YouTube videos and use the transcripts to train its models after running short of fresh text. OpenAI publicly released Whisper in September 2022 and said it was trained on 680,000 hours of "multilingual and multitask supervised data collected from the web".

Unnamed sources familiar with the matter claimed that OpenAI employees discussed whether the plan might breach YouTube's rules and expose the company to legal consequences. Notably, Google, which owns YouTube, prohibits the use of the platform's videos for applications external to the platform.

Despite the concerns, OpenAI allegedly proceeded with the plan, transcribing over a million hours of YouTube videos to feed the text into GPT-4. The report also alleges direct involvement from OpenAI President Greg Brockman, who reportedly assisted in data collection from videos.

Google spokesperson Matt Bryant responded by calling the reports of OpenAI's activity unconfirmed and noting that the company's robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content.

OpenAI spokesperson Lindsay Held said that the company draws its training data from various sources, including publicly available data and partnerships for non-public data. Held added that the AI firm is exploring the potential use of synthetic data for training future AI models.
