New reports suggest that OpenAI used transcripts of more than a million hours of YouTube videos to train its latest AI model, GPT-4.
OpenAI allegedly turned to transcribed YouTube videos after exhausting its existing supply of text for training AI models.
This move, if proven true, could pose legal challenges for the AI firm, which is already entangled in multiple lawsuits regarding the use of copyrighted data. Recently, a report shed light on mini chatbots in OpenAI's GPT Store that reportedly violated the company's guidelines.
According to The New York Times, OpenAI developed an automatic speech recognition tool named Whisper to transcribe YouTube videos and use the resulting text to train its models after running short of fresh text data. OpenAI publicly launched Whisper in September 2022, and the firm stated that it was trained on 680,000 hours of "multilingual and multitask supervised data collected from the web".
Unnamed sources familiar with the matter claimed that OpenAI employees discussed whether the plan might breach YouTube's guidelines and expose the company to legal consequences. Notably, Google, which owns YouTube, prohibits the use of its videos for applications external to the platform.
Despite the concerns, OpenAI allegedly proceeded with the plan, transcribing over a million hours of YouTube videos to feed the text into GPT-4. The report also alleges direct involvement from OpenAI President Greg Brockman, who reportedly assisted in data collection from videos.
Google spokesperson Matt Bryant responded to the reports, calling them unconfirmed and noting that both the company's robots.txt files and its Terms of Service prohibit unauthorized scraping or downloading of YouTube content.
OpenAI spokesperson Lindsay Held said that the company draws on a range of sources, including publicly available data and partnerships for non-public data. Held added that the AI firm is exploring the use of synthetic data to train future AI models.