According to the revised terms of service, all forms of New York Times content, including text, photographs, images, audio and video clips, metadata and compilations, are off-limits for AI training. Even automated tools like web crawlers, which access and collect the publication's content, now require written permission from the New York Times before use. Non-compliance with the newly implemented restrictions can result in fines and penalties.
However, the New York Times has not modified its robots.txt file, which instructs search engine crawlers about which URLs they can access, possibly due to the $100 million deal between the New York Times and Google last February. The agreement allows Google to feature the materials of the New York Times on some of its platforms for the next three years. (Related: Google unveils plan to use AI to completely destroy journalism.)
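For context, a robots.txt file is a plain-text file served at a site's root that tells crawlers which paths they may or may not access. The rules below are a hypothetical illustration, not the New York Times' actual file; the user-agent names shown are ones crawler operators have publicly documented.

```
# Hypothetical robots.txt example (not the actual nytimes.com file)
# Block OpenAI's crawler from the entire site
User-agent: GPTBot
Disallow: /

# Allow Google's search crawler everywhere
User-agent: Googlebot
Allow: /

# All other crawlers: block a specific section only
User-agent: *
Disallow: /private/
```

A site that wanted to bar AI-training crawlers while staying in search results could use directives like these; the New York Times, as noted above, had not made such a change at the time.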
This means Google may still collect public data from the web, including from the New York Times, to train Bard and its Cloud AI products, while the updated terms of service effectively apply only to other companies, such as Microsoft and OpenAI, the maker of ChatGPT.
News organizations seek transparency to protect journalism copyright in AI training
This move by the New York Times may prompt other publications to create their own policies against using journalism materials in AI training.
For instance, the News Media Alliance and the European Publishers' Council, along with Agence France-Presse, the European Pressphoto Agency, Gannett, Getty Images, the National Press Photographers Association, the National Writers Union, the Associated Press and The Authors Guild, led the call, through an open letter, for revised regulations on the use of copyrighted journalism material in training generative AI and large language models.
Media organizations explained in the letter that AI-driven technologies can create and disseminate content, such as news articles and other forms of media, without crediting the original creators. This practice, the letter contends, severely undermines the foundations of the media industry's business models.
"In addition to violating copyright law, the resulting impact is to meaningfully reduce media diversity and undermine the financial viability of companies to invest in media coverage, further reducing the public’s access to high-quality and trustworthy information," the letter said.
The letter emphasized the urgent need for a framework that empowers media companies to engage in collective negotiations with AI model operators. By doing so, media organizations would secure fair compensation for the use of their intellectual property while maintaining the integrity of their content.
The rise of OpenAI's ChatGPT and Google's Bard, which rely on generative AI, has led to an explosion of AI-generated content across online platforms. However, the companies behind these tools do not disclose the datasets used to train their models, even though earlier iterations of these technologies were known to use vast amounts of information scraped from the internet, including content from news websites.