Hugging Face, the machine learning community and AI tools platform, announced the release of HuggingChat, an open-source ChatGPT clone that anyone can use for themselves or download.
Hugging Face is a company and an AI community. It provides access to free, open-source tools for developing machine learning and AI applications.
One of Hugging Face’s recently completed projects is a 176 billion parameter language model called Bloom, which is available to anyone who agrees to adhere to its Responsible AI license.
There is access to open source models in different categories such as Multimodal, Vision, Audio, Natural Language Processing and Reinforcement Learning.
Hugging Face also hosts open-source datasets and libraries and serves as a way for teams to collaborate, including a repository similar to GitHub.
Many of the services are free, available at Pro and Enterprise levels.
The HuggingChat ChatGPT clone is based on the Open Assistant Conversational AI Model.
Open Assistant itself is a project of the non-profit Large-scale Artificial Intelligence Open Network (LAION).
LAION is a global non-profit organization dedicated to providing access to cutting-edge technology as open source.
We believe that machine learning research and its applications have the potential to have a huge positive impact on our world and should therefore be democratized.
OUR MAIN GOALS
Share open datasets, code, and machine learning models.
We aim to teach the fundamentals of large-scale ML research and data management.
By making models, datasets and code reusable without having to constantly train from scratch, we want to promote efficient use of energy and computational resources to meet the challenges of climate change.”
The GitHub page for the Open Assistant chat model states:
“Open Assistant is a project that aims to give everyone access to a great chat-based big language model.
We believe that by doing this we will create a revolution in language innovation.
Just as Stable-Diffusion has helped the world create art and images in new ways, we hope Open Assistant can help improve the world by improving language itself.”
HuggingChat training record
HuggingChat was trained using the OpenAssistant Conversations Dataset (OASST1), which is very recent and includes data collected up to April 12, 2023.
The research paper for the dataset is dated April 2023 (OpenAssistant Conversations – Democratizing alignment of large language models – pdf).
This model uses the same training methodology developed by OpenAI, called Reinforcement Learning from Human Feedback (RLHF).
RLHF is a technique for creating a high-quality, human-annotated and quality-assessed dataset of questions and answers that can be used to train an AI to follow instructions.
With this release, they have achieved their goal of making the RLHF technique accessible to anyone who wants to train an AI.
The research paper states:
“In an effort to democratize large-scale alignment research, we are releasing OpenAssistant Conversations, a human-generated, human-annotated, assistant-style conversation corpus consisting of 161,443 messages spread across 66,497 conversation trees, in 35 different languages, commented with 461,292 Quality Ratings.”
The dataset is the result of a global crowdsourcing effort by over 13,000 volunteers.
Crowdsourcing was a good way to generate multilingual training data that contributed to a high-quality data set.
However, according to the researchers, the crowdsourcing approach also introduced limitations in the quality of the dataset in the form of cultural and subjective biases of the people who created and evaluated the training data.
They also warned that more engaged participants tended to contribute more, leading to an uneven distribution of their values and biases.
The researchers conclude that the dataset may not reflect the diversity of viewpoints of all contributors.
For example, they sent a poll to their Discord channel (English only) asking their open source contributors questions about their demographics (but not their ethnicity).
Linguistic bias aside, the results of the survey showed that of the 226 respondents, 201 were male, 10 were female, five identified as non-binary/other, and 10 chose not to answer.
While they don’t 100% guarantee that the record is free from malicious content, they still stand behind it as it was created with strict quality guidelines.
The researchers write:
“To ensure the quality of our dataset, we have established strict contributor guidelines that all users must follow.
These guidelines are designed to prevent harmful content from being added to our dataset and encourage contributors to generate quality responses.”
HuggingChat is available
HuggingChat is now open for users. Registration to create a login account is not required for use.
Don’t expect ChatGPT output level, the service is not at that level yet. The app page lists it as version 0.0, which should give an idea of how mature it is at this point.
Nonetheless, it’s a remarkable achievement and a first step for the open source community, and it’s absolutely free to use.
Visit the HuggingChat website here:
HuggingChat Website and User Interface