How the ChatGPT watermark works and why it could be defeated

OpenAI’s ChatGPT introduced a way to automatically create content, but plans for a watermarking feature that would make that content easier to spot are making some people nervous. This is how the ChatGPT watermark works and why there may be a way to bypass it.

ChatGPT is an incredible tool that online publishers, affiliates and SEOs love and fear at the same time.

Some marketers love it because they’re discovering new ways to use it to create content descriptions, outlines, and complex articles.

Online publishers fear that AI content will flood search results and displace professional articles written by humans.

Consequently, news of a watermarking feature that would make ChatGPT-authored content recognizable is awaited with both concern and hope.

Cryptographic Watermark

A watermark is a semi-transparent mark (a logo or text) embedded in an image. The watermark signals who the original author of the work is.

It can be seen mostly in photos and increasingly in videos.

Watermarking text in ChatGPT involves cryptography: a pattern of words, letters, and punctuation is embedded in the text as a secret code.

Scott Aaronson and ChatGPT watermark

An influential computer scientist named Scott Aaronson was hired by OpenAI in June 2022 to work on AI safety and alignment.

AI safety is a research area concerned with investigating ways in which AI could harm people and finding ways to prevent these types of negative disruptions.

The scholarly journal Distill, with authors associated with OpenAI, defines AI safety as follows:

“The goal of long-term artificial intelligence (AI) safety is to ensure that advanced AI systems are reliably aligned with human values - that they are reliably doing things that humans expect of them.”

AI alignment is the area of artificial intelligence research concerned with making sure the AI is aligned with its intended goals.

A Large Language Model (LLM) such as ChatGPT can be used in ways that run counter to OpenAI’s stated alignment goal, which is to create AI that benefits humanity.

So the reason for watermarking is to prevent misuse of AI in a way that harms humanity.

Aaronson explained the reason behind the ChatGPT output watermark:

“Of course, this could be helpful for preventing academic plagiarism, but also, for example, the mass production of propaganda…”

How does the ChatGPT watermark work?

ChatGPT watermarking is a system that embeds a statistical pattern, a code, into the wording and even punctuation.

Content created by artificial intelligence is generated with a fairly predictable pattern of word choice.

Text written by humans and text generated by AI both follow statistical patterns.

Changing the pattern of words used in generated content is one way to “watermark” the text so a system can easily tell if it’s the product of an AI text generator.

The trick that makes the watermark unnoticeable to readers is that the distribution of words still appears random, just like normal AI-generated text.

This is called pseudo-random distribution of words.

A pseudo-random sequence of words or numbers looks statistically random but is not truly random: it is produced by a deterministic process that can be reproduced by anyone who knows how it was generated.
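
To make “pseudo-random” concrete, here is a minimal Python sketch (not OpenAI’s implementation) of a keyed pseudo-random function: the same key and input always produce the same score, yet without the key the output looks like chance. The key and function names here are illustrative assumptions.

```python
import hashlib
import hmac

def prf_score(key: bytes, text: str) -> float:
    """Map text to a deterministic but random-looking number in [0, 1)."""
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()
    # Interpret the first 8 bytes of the HMAC as an integer, scaled to [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64

secret_key = b"known-only-to-the-provider"  # hypothetical secret key
print(prf_score(secret_key, "the quick brown"))  # looks random...
print(prf_score(secret_key, "the quick brown"))  # ...but is fully reproducible
```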

ChatGPT watermarks are not currently in use. However, OpenAI’s Scott Aaronson has said that watermarking is planned.

ChatGPT is currently in preview, allowing OpenAI to detect “misalignment” through real-world usage.

Presumably, the watermark may be introduced in a final version of ChatGPT, or even earlier.

Scott Aaronson wrote about how watermarks work:

“My main project so far has been a tool for statistically watermarking the output of a text model like GPT.

Basically, we want every time GPT generates a long piece of text, there’s an otherwise unnoticeable secret signal in its wording that you can later use to prove that it came from GPT.”

Aaronson went on to explain how ChatGPT watermarks work. But first, it’s important to understand the concept of tokenization.

Tokenization is a step in natural language processing where the machine takes the text in a document and breaks it down into units such as words, parts of words, and punctuation marks.

Tokenization transforms text into a structured form that can be used in machine learning.
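
As an illustration, OpenAI’s open-source tiktoken library exposes tokenizers of the kind GPT models use. A small sketch (assuming tiktoken is installed; the encoding name is one of several that OpenAI publishes):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's published encodings

tokens = enc.encode("Watermarks hide in word choice.")
print(tokens)                              # integer token IDs
print([enc.decode([t]) for t in tokens])   # the text fragment behind each ID
```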

The process of text generation is the machine guessing which token comes next based on the previous tokens.

This is done using a mathematical function that determines the probability of what the next token will be, called the probability distribution.

Which token comes next is predicted, but the final choice among the likely candidates is made at random.
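
A rough sketch of that step: the model assigns each candidate token a probability, and the generator draws one token at random in proportion to those weights. The candidate tokens and probabilities below are invented for illustration.

```python
import random

# Toy probability distribution over candidate next tokens (invented numbers).
next_token_probs = {" cat": 0.5, " dog": 0.3, " axolotl": 0.2}

tokens, weights = zip(*next_token_probs.items())
next_token = random.choices(tokens, weights=weights, k=1)[0]
print(next_token)  # repeated runs can yield different tokens
```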

The watermark itself is what Aaronson describes as pseudo-random: there is a mathematical reason for a particular word or punctuation mark being there, yet the text still appears statistically random.

Here is the technical explanation of the GPT watermark:

“For GPT, each input and output is a set of tokens, which can be words, but also punctuation marks, parts of words or more – there are about 100,000 tokens in total.

At its core, GPT constantly generates a probability distribution about the next token to be generated, depending on the chain of previous tokens.

After the neural network generates the distribution, the OpenAI server actually samples a token according to that distribution – or a modified version of the distribution, depending on a parameter called “temperature”.

However, as long as the temperature is non-zero, the choice of the next token will usually be random: you could run it with the same prompt over and over and get a different completion (i.e., a different set of output tokens) each time.

Instead of randomly selecting the next token, the idea with the watermark is to select it pseudo-randomly using a cryptographic pseudo-random function whose key is known only to OpenAI.”
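
To show what the “temperature” parameter does to that distribution, here is a minimal sketch of the standard temperature-scaled softmax used in sampling; the raw scores (logits) are invented for illustration.

```python
import math
import random

def sample_with_temperature(logits: dict, temperature: float) -> str:
    """Scale raw model scores by temperature, then sample a token."""
    if temperature == 0:
        # Zero temperature: deterministic, always the most likely token.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    probs = {tok: math.exp(s) / total for tok, s in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

toy_logits = {" cat": 2.0, " dog": 1.5, " axolotl": 0.5}  # invented scores
print(sample_with_temperature(toy_logits, 0.7))  # non-zero: varies run to run
print(sample_with_temperature(toy_logits, 0.0))  # zero: always " cat"
```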

The watermark looks completely natural to those reading the text, as the word choice mimics the randomness of all other words.

But this randomness contains a distortion that can only be recognized by someone who has the key to deciphering it.

This is the technical explanation (g here is the number that the secret pseudo-random function assigns to each possible n-gram of tokens):

“To illustrate, in the specific case that GPT had a set of possible tokens that it judged to be equally likely, you could just pick whichever token maximized g. The choice would look uniformly random to someone who didn’t know the key, but someone who did know the key could later sum g over all the n-grams and see that it was anomalously large.”
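
Putting those pieces together, here is a toy sketch of the scheme the quote describes: among equally likely candidates, pick the token whose keyed pseudo-random score g (computed over the preceding tokens plus the candidate) is largest; a detector holding the key averages g over the text’s n-grams and flags an anomalously high total. This is an illustration of the idea only, not OpenAI’s code, and the key and all names are invented.

```python
import hashlib
import hmac

KEY = b"secret-watermark-key"  # hypothetical; known only to the provider

def g(prev_tokens: tuple, candidate: str) -> float:
    """Keyed pseudo-random score for an n-gram ending in `candidate`."""
    msg = "\x1f".join(prev_tokens + (candidate,)).encode("utf-8")
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def pick_watermarked(prev_tokens: tuple, candidates: list) -> str:
    """Among equally likely candidates, choose the one that maximizes g."""
    return max(candidates, key=lambda c: g(prev_tokens, c))

def detection_score(tokens: list, n: int = 3) -> float:
    """Average g over all n-grams; watermarked text scores anomalously high."""
    scores = [g(tuple(tokens[i:i + n - 1]), tokens[i + n - 1])
              for i in range(len(tokens) - n + 1)]
    return sum(scores) / len(scores)

# Unwatermarked choices average ~0.5 under g; choices made by
# pick_watermarked skew toward 1.0, which only a key holder can measure.
print(pick_watermarked(("the", "quick"), [" brown", " red", " lazy"]))
```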

Watermarks are a privacy-first solution

I’ve seen discussions on social media where some people suggested that OpenAI record every output it generates and use that for detection.

Scott Aaronson confirmed that OpenAI could do that, but that it would pose a privacy problem. A possible exception is law enforcement situations, which he did not elaborate on.

How the ChatGPT or GPT watermark can be defeated

Something interesting that does not seem to be widely known yet is that Scott Aaronson pointed out there is a way to bypass the watermark.

He didn’t merely say that it might be possible to defeat the watermark; he said outright that it can be defeated.

“Well, all of that can be defeated with enough effort.

For example, if you used another AI to paraphrase the output of GPT – well, we won’t be able to tell.”

It seems the watermark could be defeated, at least as of November 2022, when the above statements were made.

There is no indication that the watermark is currently in use, and if it is deployed, it is unknown whether this loophole will have been closed.

Citation

Read Scott Aaronson’s blog post here.
