Will this dataset be used for Google's AI search?
Google has published a research paper on a new type of dataset for training a language model to retrieve sentences that accurately answer questions in an open dialogue.
We do not know if Google uses this dataset. However, the researchers claim that models trained on it outperform models trained on other datasets.
Many research papers, such as the one published for LaMDA, do not mention the specific contexts in which the technology might be used.
For example, the LaMDA research paper (PDF) vaguely concludes:
“LaMDA is a step towards practical and safe open-ended dialog systems, which in turn can unlock a wide range of useful applications.”
This research paper states that the problem the researchers solve is how to create a dataset for training a machine to carry on an open-ended dialogue by selecting a sentence from a web page.
Why this dataset is important
What is interesting about this research is that the researchers conclude that it could be used to factually ground generative AI outputs, such as those seen in Google’s new Search Generative Experience.
Given that the research paper was presented at an information retrieval conference (Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval), it is fairly safe to assume that this algorithm is related to information retrieval, i.e., search.
One last thing to note is that research on this new type of dataset was published back in 2022 but appears to have gone unnoticed… until now.
What Google wanted to achieve with the new dataset
The researchers explain what they are focusing on:
“In this paper, we focus on open dialogues: two parties take turns conversing on any number of topics, with no restrictions on topic switching and the manner of discussing each topic.
Furthermore, the dialogue is not tied to a specific document, in contrast to the setting used in some previous work…
The task we address is to retrieve sentences from a document corpus that contain information useful for generating the next turn in the dialogue (either automatically or by humans).
We note that dialogue turns can involve questions, queries, arguments, statements, etc.”
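To make that task concrete, here is a minimal sketch (my illustration, not code from the paper) of what sentence retrieval for dialogue looks like: given the turns of the conversation so far, score every candidate sentence in a corpus and return the most useful ones. The score function is a stand-in for whatever relevance model is used.

```python
from typing import Callable, List, Tuple

def retrieve_for_next_turn(
    dialogue_turns: List[str],           # the conversation so far, oldest first
    corpus_sentences: List[str],         # sentences pulled from web documents
    score: Callable[[str, str], float],  # placeholder relevance model (assumption)
    top_k: int = 50,
) -> List[Tuple[str, float]]:
    """Return the top_k corpus sentences most useful for generating the next turn."""
    query = " ".join(dialogue_turns)  # naive: concatenate all turns into one query
    scored = [(s, score(query, s)) for s in corpus_sentences]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```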
A new type of dataset for language model training
The problem the researchers are solving is how to retrieve a sentence from a web page as the answer to an open-ended question, a type of question that requires more than a yes or no answer.
The research paper explains that a suitable conversational dataset for training a machine to do this did not exist.
They explain that existing datasets are used for two purposes:
- Datasets for evaluating dialogue responses generated by an AI, but not for training a model to actually retrieve the relevant information for a response.
- Datasets for search engine retrieval or for question answering that focus on a single passage as the answer.
They explain the shortcomings of existing data sets:
“…in most of these datasets, the returned search results are not considered part of the dialogue.
…in both conversational passage retrieval and conversational QA datasets, there is a user asking questions or queries that reflect explicit intent with information needs, as opposed to natural dialogue where intent may only be implicitly represented, e.g., in affirmative statements.
In summary, existing conversation datasets do not combine natural human-human conversations with relevance annotations for sentences retrieved from a large document corpus.
That’s why we created such a dataset…”
How the new dataset was created
The researchers created a dataset for training an algorithm to retrieve a sentence that serves as the correct answer in an open-ended dialogue.
The dataset consists of Reddit conversations matched to answers from Wikipedia, and human annotations (relevance ratings) of these question and answer pairs.
Reddit data was downloaded from Pushshift.io, an archive of Reddit conversations (Pushshift FAQ).
The research paper explains:
“To cover a broader spectrum of this task, where any type of dialogue can be used, we created a dataset that includes open dialogues from Reddit, candidate sentences from Wikipedia for each dialogue, and human annotations for the sentences.
The dataset includes 846 dialogs created from Reddit threads.
For each dialogue, 50 sentences were retrieved from Wikipedia using an unsupervised initial retrieval method.
These sentences were judged by crowdworkers for their relevance, that is, whether they contained information that was useful for generating the next turn in the dialogue.”
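The excerpt doesn’t name the unsupervised retrieval method, but a standard choice for an initial retrieval pass is BM25. Here is a minimal sketch of how candidate sentences per dialogue could be pulled with the rank_bm25 package (the tiny corpus and the whitespace tokenizer are simplifications of mine, not the paper’s setup):

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

# Stand-in corpus; the real one is sentences split from Wikipedia pages.
wikipedia_sentences = [
    "Domestic chickens have been around for about 10,000 years.",
    "The egg was laid by a bird that was not a chicken.",
    "Amniotic eggs appeared hundreds of millions of years ago.",
]

tokenized = [s.lower().split() for s in wikipedia_sentences]  # naive tokenizer
bm25 = BM25Okapi(tokenized)

# Use the dialogue so far as the query and keep the top 50 candidates.
dialogue = "what came first, the chicken or the egg?"
candidates = bm25.get_top_n(dialogue.lower().split(), wikipedia_sentences, n=50)
print(candidates)
```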
The dataset they created is available on GitHub.
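The exact file format is documented in the GitHub repository; purely as an illustration, each record plausibly bundles a dialogue, its candidate sentences, and the crowdworker judgments, roughly like this (the field names are my assumptions, not the repository’s actual schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Candidate:
    sentence: str   # a sentence retrieved from Wikipedia
    relevant: bool  # crowdworker judgment: useful for generating the next turn?

@dataclass
class DialogueExample:
    dialogue_id: str
    turns: List[str]  # a Reddit thread converted into dialogue turns
    candidates: List[Candidate] = field(default_factory=list)  # ~50 per dialogue

example = DialogueExample(
    dialogue_id="reddit-0001",
    turns=["What came first, the chicken or the egg?"],
    candidates=[Candidate("The egg laid by a bird that wasn't a chicken.", True)],
)
```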
Example of a dialogue question:
“What came first, the chicken or the egg?”
An example of an irrelevant answer:
“Domestic chickens have been around for about 10,000 years. Eggs have been around for hundreds of millions of years.”
An example of a correct web page sentence to use as an answer is:
“More simply put by Neil deGrasse Tyson:
‘Which came first: the chicken or the egg? The egg, laid by a bird that wasn’t a chicken.’”
Retrieval methodology
For the retrieval part, they cite previous research on language models and other methods and opt for a weak supervision approach.
They explain:
“Fine-tuning retrieval models requires relevancy labels for training examples in a target task.
These are sometimes scarce or unavailable.
One approach to get around this is to automatically generate annotations and train a weakly supervised model on those annotations.
…We follow the weak supervision paradigm in our model training with a novel weak Reddit annotator for retrieval in a dialog context.”
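The paper’s weak Reddit annotator isn’t detailed in this excerpt. One common way to build weak supervision in this setting is to treat the thread’s actual next turn as a noisy relevance signal, for example labeling a candidate sentence positive when it overlaps strongly with that turn. A sketch of that general idea (the Jaccard-overlap heuristic is my assumption, not the paper’s annotator):

```python
from typing import List, Tuple

def weak_labels(
    next_turn: str,
    candidates: List[str],
    threshold: float = 0.5,
) -> List[Tuple[str, int]]:
    """Label candidates by word overlap with the dialogue's real next turn."""
    target = set(next_turn.lower().split())
    labeled = []
    for sentence in candidates:
        words = set(sentence.lower().split())
        overlap = len(words & target) / max(len(words | target), 1)  # Jaccard
        labeled.append((sentence, 1 if overlap >= threshold else 0))
    return labeled
```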
Is the dataset successful?
Google and other organizations publish many research papers showing varying degrees of success.
Some research achieves limited success and changes the state of the art only slightly, if at all.
The research papers that are of interest (to me) are those that are clearly successful and exceed the current state of the art.
Such is the case with this dataset, developed to train a language model to retrieve sentences that accurately serve as a turn in an open-ended dialogue.
They describe how a BERT model fine-tuned with this dataset becomes even more powerful.
They write:
“While RANKBERT_MS outperforms all non-fine-tuned models, the RANKBERT_MS→R model, further fine-tuned using our weakly supervised training set, improves performance.
This method achieves the highest performance, with the performance gains over all other methods being statistically significant.
This result also demonstrates the effectiveness of our weak annotator and weakly supervised training set and shows that performance can be improved without manual annotation for training.”
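The RANKBERT models above are the paper’s own. As a rough illustration of the general recipe rather than their code, a BERT cross-encoder ranker can be fine-tuned on (dialogue, sentence, label) pairs with the sentence-transformers library:

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# (dialogue context, candidate sentence) pairs with weak 0/1 labels (toy data)
train_examples = [
    InputExample(texts=["What came first, the chicken or the egg?",
                        "The egg laid by a bird that wasn't a chicken."], label=1.0),
    InputExample(texts=["What came first, the chicken or the egg?",
                        "Domestic chickens have been around for 10,000 years."], label=0.0),
]

model = CrossEncoder("bert-base-uncased", num_labels=1)  # pointwise ranker
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
model.fit(train_dataloader=loader, epochs=1)

# Higher score = more useful for generating the next dialogue turn.
print(model.predict([["What came first, the chicken or the egg?",
                      "The egg laid by a bird that wasn't a chicken."]]))
```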
Elsewhere, the researchers report:
“We show that a neural ranker fine-tuned using our weakly supervised training set outperforms all other models tested, including a neural ranker fine-tuned using the MS MARCO passage retrieval dataset.”
They also write that, as successful as this approach is, they are keen to advance the state of the art even further.
The research paper concludes:
“In future work, we aim to develop BERT-based retrieval models, trained solely with weak supervision, using a pre-trained BERT without requiring large annotated training sets like MS MARCO.
We also aim to ground generative language models with our retrieval models and to study the conversations that emerge from this grounding.”
Could this approach be applied?
Google rarely confirms whether specific research is used in search. In some cases, such as BERT, Google has confirmed that it uses it.
But in general, the standard answer is that just because Google publishes a research paper or a patent doesn’t mean it is used in the search algorithm.
However, the research paper, which dates from mid-2022, suggests that a future direction is to explore how the dataset could ground generative language models (like those behind Bard and Google’s Search Generative Experience).
A generative AI chat experience can make things up in its output, which is technically called hallucination.
Grounding means linking AI chat output to facts, typically from online sources, to prevent hallucinations.
Bing uses a system called Bing Orchestrator that checks web pages to ground GPT output in facts.
Grounding the AI output helps keep it fact-based, something this dataset may help enable, in addition to supporting the selection of sentences from web pages as part of an answer.
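Conceptually, grounding with such a retriever could work like this: draft an answer, retrieve the best supporting sentence, and only keep the answer when a source backs it up. A simplified sketch (the generate and score functions are placeholders of mine, not any product’s API):

```python
from typing import Callable, List, Optional

def grounded_answer(
    question: str,
    corpus_sentences: List[str],
    generate: Callable[[str], str],      # placeholder LLM call (assumption)
    score: Callable[[str, str], float],  # placeholder retrieval model (assumption)
    min_score: float = 0.5,
) -> Optional[str]:
    """Answer only when a retrieved sentence supports the draft answer."""
    draft = generate(question)
    best = max(corpus_sentences, key=lambda s: score(draft, s))
    if score(draft, best) < min_score:
        return None  # no supporting evidence found; refuse rather than hallucinate
    return f"{draft} (source: {best})"
```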
Read the research paper:
Abstract web page: A Sentence Retrieval Dataset for Open-Ended Dialogues
Research paper (PDF): A Sentence Retrieval Dataset for Open-Ended Dialogues
Featured image from Shutterstock/Camilo Concha