New AI standards group wants to make data scraping opt-in
The first wave of major generative AI tools were largely trained on “publicly available” data: basically, anything and everything that could be scraped from the Internet. Now, sources of training data are increasingly restricting access and pushing for licensing agreements. With the hunt for additional data sources intensifying, new licensing startups have emerged to keep the source material flowing.
The Dataset Providers Alliance, a trade group formed this summer, wants to make the AI industry more standardized and fair. To that end, it has just released a position paper outlining its stances on major AI-related issues. The alliance is made up of seven AI licensing companies, including music copyright-management firm Rightsify, Japanese stock-photo marketplace Pixta, and generative-AI copyright-licensing startup Calliope Networks. (At least five new members will be announced in the fall.)
The DPA advocates for an opt-in system, meaning that data can be used only after consent is explicitly given by creators and rights holders. This represents a significant departure from the way most major AI companies operate. Some have developed their own opt-out systems, which put the burden on data owners to pull their work on a case-by-case basis. Others offer no opt-outs whatsoever.
The DPA, which expects members to adhere to its opt-in rule, sees that route as the far more ethical one. “Artists and creators should be on board,” says Alex Bestall, CEO of Rightsify and the music-data-licensing company Global Copyright Exchange, who spearheaded the effort. Bestall sees opt-in as a pragmatic approach as well as a moral one: “Selling publicly available datasets is one way to get sued and have no credibility.”
Ed Newton-Rex, a former AI executive who now runs the ethical AI nonprofit Fairly Trained, calls opt-outs “fundamentally unfair to creators,” adding that some may not even know when opt-outs are offered. “It’s particularly good to see the DPA calling for opt-ins,” he says.
Shayne Longpre, the lead at the Data Provenance Initiative, a volunteer collective that audits AI datasets, sees the DPA’s efforts to source data ethically as admirable, although he suspects the opt-in standard could be a tough sell, because of the sheer volume of data most modern-day AI models require. “Under this regime, you’re either going to be data-starved or you’re going to pay a lot,” he says. “It could be that only a few players, large tech companies, can afford to license all that data.”
In the paper, the DPA comes out against government-mandated licensing, arguing instead for a “free market” approach in which data originators and AI companies negotiate directly. Other guidelines are more granular. For example, the alliance suggests five potential compensation structures to make sure creators and rights holders are paid appropriately for their data. These include a subscription-based model, “usage-based licensing” (in which fees are paid per use), and “outcome-based” licensing, in which royalties are tied to profit. “These could work for anything from music to images to film and TV or books,” Bestall says.
“Looking to standardize compensation structures is potentially a good thing,” says Bill Rosenblatt, a technologist who studies copyright. “The Dataset Providers Alliance is in a very good position to put terms out there.” As Rosenblatt sees it, AI companies need incentives to adopt licensing. While the legal reasons (fear of lawsuits, regulation mandating licenses) are the most obviously compelling, Rosenblatt says it’s also important for would-be licensors to make the process as easy and convenient as possible. Standardizing payment models, he argues, helps smooth the road for mainstream adoption.
The DPA also endorses some uses of synthetic data (data generated by AI), arguing that it will “constitute the majority” of training data in the near future. “Some copyright holders probably won’t like it,” Bestall says. “But it’s inevitable.” The alliance advocates for “proper licensing” of the pre-training information used to create synthetic data and for transparency about how that synthetic data is made. It also calls for regular “evaluation” of the synthetic data models to “mitigate biases and ethical issues.”
Of course, the DPA needs to get the industry’s power players on board, which is easier said than done. “There are standards emerging for how to license data ethically,” Newton-Rex says. “But not enough AI companies are adopting them.”
Still, the very existence of the DPA suggests that AI’s Wild West days are coming to an end. “Everything is changing so fast,” Bestall says.
This story originally appeared on wired.com.