
What you should know about tech companies using AI to teach their own AI

OpenAI, Google, and other technology companies train their chatbots on massive amounts of data collected from books, Wikipedia articles, news, and other sources on the Internet. However, they hope to be able to use so-called synthetic data in the future.

That's because tech companies may be exhausting the high-quality text the internet has to offer as they develop artificial intelligence. And the companies face copyright lawsuits from authors, news organizations and computer programmers for using their works without permission. (In one such lawsuit, The New York Times sued OpenAI and Microsoft.)

The companies believe that synthetic data will help reduce copyright problems and increase the supply of AI training material. Here's what you should know about it.

What is synthetic data?

It is data generated by artificial intelligence.

So tech companies want to use AI to build AI?

Yes. Instead of training AI models with text written by humans, tech companies like Google, OpenAI and Anthropic hope to train their technology with data generated by other AI models.

Is AI-generated data as reliable as data created by humans?

Not exactly. AI models get things wrong and make things up. They have also been shown to pick up the biases that appear in the internet data they were trained on. So when companies use AI to train AI, they can end up reinforcing their own mistakes.

Is this how AI systems are built today?

No. Tech companies are experimenting with it. But because of the potential shortcomings of synthetic data, it does not yet play a major role in the way AI systems are built.

How do companies plan to make synthetic data better?

The companies believe they can refine the way synthetic data is created. OpenAI and others have been exploring a technique in which two different AI models work together to generate synthetic data that is more useful and reliable.

One AI model generates the data. Then a second model assesses the data, much as a human would, and decides whether it is good or bad, accurate or not. As it turns out, AI models are better at judging text than at writing it.

“If you give the technology two things, it can pretty much decide which one looks best,” said Nathan Lile, the chief executive of the AI startup SynthLabs.

The idea is that this will provide the high-quality data needed to train an even better chatbot.
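
As a rough illustration of that division of labor, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: generator_model and judge_model are toy functions, not any company's real API, and the crude length heuristic inside the judge only marks where a second model's verdict would go.

```python
# Toy sketch of the two-model idea: one model drafts, a second model judges.
# Both functions are stand-ins for real language models.

def generator_model(prompt: str) -> list[str]:
    """First model: drafts two candidate answers for the prompt."""
    return [
        f"Short answer to: {prompt}",
        f"A longer, more detailed answer to: {prompt}",
    ]

def judge_model(candidate_a: str, candidate_b: str) -> str:
    """Second model: compares two candidates and returns the better one.
    A crude length heuristic stands in for a real model's judgment here."""
    return candidate_a if len(candidate_a) >= len(candidate_b) else candidate_b

prompt = "Why is the sky blue?"
draft_a, draft_b = generator_model(prompt)
best = judge_model(draft_a, draft_b)  # this pair becomes a training example
print(best)
```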

Does that work?

Sort of. It all comes down to that second AI model: how well can it judge text?

Anthropic has been the most vocal about its efforts to make this work. It fine-tunes the second AI model using a “constitution” curated by the company’s researchers, which teaches the model to select text that supports certain principles, such as liberty, equality and fraternity, or life, liberty and personal security. Anthropic’s method is known as “Constitutional AI.”

Here's how two AI models work together to generate synthetic data using a process like Anthropic's:
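
In code terms, the loop might look something like the sketch below. It is a toy illustration only: the constitution text and both model functions are hypothetical stand-ins, not Anthropic's actual principles or system.

```python
import random

# Toy sketch of a Constitutional-AI-style loop: a generator model drafts
# candidates, a judge model picks the one that best fits a principle from
# the constitution, and each chosen pair joins the synthetic training set.

CONSTITUTION = [
    "Prefer the response that best supports liberty and equality.",
    "Prefer the response that best supports life and personal security.",
]

def generate_candidates(prompt: str, n: int = 2) -> list[str]:
    """First model: drafts n candidate responses to the prompt."""
    return [f"Candidate {i + 1} responding to: {prompt}" for i in range(n)]

def choose_per_constitution(candidates: list[str], principle: str) -> str:
    """Second model: picks the candidate that best satisfies the given
    principle. A real judge would be a language model fine-tuned on the
    constitution; a random pick stands in for it here."""
    return random.choice(candidates)

def synthetic_example(prompt: str) -> dict:
    candidates = generate_candidates(prompt)
    principle = random.choice(CONSTITUTION)
    chosen = choose_per_constitution(candidates, principle)
    # The (prompt, chosen response) pair becomes training data for the
    # next, hopefully better, model.
    return {"prompt": prompt, "response": chosen, "principle": principle}

print(synthetic_example("Explain why free and fair elections matter."))
```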

Still, humans are needed to ensure the second AI model stays on track. This limits how much synthetic data this process can generate. And researchers are divided over whether a method like Anthropic's will continue to improve AI systems.

Does synthetic data solve the copyright problem?

Not necessarily. The AI models that generate synthetic data were themselves trained on human-generated data, much of which was copyrighted. So copyright holders can still argue that companies like OpenAI and Anthropic used copyrighted text, images and videos without permission.

Jeff Clune, a computer science professor at the University of British Columbia who previously worked as a researcher at OpenAI, said AI models could eventually become more powerful than the human brain in some ways. But they will do so because they have learned from the human brain.

“To paraphrase Newton, AI sees further by standing on the shoulders of massive human data sets,” he said.