Apple wants to scrape content for Apple Intelligence training — but few publishers have agreed terms to let it happen

Real Hacker Staff, August 29, 2024

To work properly, generative AI (like Apple Intelligence) needs to be trained on large amounts of data. That data can come from a range of sources, but if you want as few people as possible to be cross about your artificial intelligence model, you'll need to make sure that the data used is ‘ethical’.

That means you need permission from those sources so that you can use the data without concern. Otherwise, you end up in a mire of copyright infringement claims and legal trouble, as many AI companies have. Or, if you’re Google, you petition governments so that copyright law doesn’t apply to you, and you can use whatever data you want to train your AI without needing to ask permission or pay.

Because you’re special.

Apple reckons, however, that it does need to pay for the data it uses to train Apple Intelligence. It has run into a problem of its own, though: some of the largest publishers on the internet want nothing to do with the AI that it has created.

Publishers give a resounding ‘No’

According to Wired, many publishers and platforms, including the New York Times and even Facebook, have used a mechanism that stops Apple from scraping their content to train Apple Intelligence. It’s called, imaginatively, ‘robots.txt’: a text file on a website that tells crawlers, Apple’s scraping bot among them, which content to stay away from.
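For readers curious what such a block looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It rests on some assumptions that are not in the Wired report as summarized here: the user-agent token `Applebot-Extended` is the name Apple documents for opting out of AI training, and the robots.txt contents, the `example.com` URL and the `is_allowed` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Apple's AI-training agent, allow everyone else.
# "Applebot-Extended" is assumed to be the token Apple uses for training opt-outs.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical article URL used only for this example.
ARTICLE_URL = "https://example.com/news/story.html"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Applebot-Extended", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, ARTICLE_URL) else "blocked"
        print(f"{agent}: {verdict}")
```

Note that robots.txt is a request rather than an enforcement mechanism: it only works if the crawler chooses to honour it.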

Apple’s offer to pay for the data hasn’t stopped 25% of websites from blocking the scraper, a number that could rise as the official launch of Apple Intelligence gets closer. The publishing world is becoming increasingly aware and wary of AI, particularly when you remember that the New York Times is currently suing OpenAI, the company behind ChatGPT, over the use of NYT content to train its models.