How LLMs Like ChatGPT Are Trained

OpenAI's large language models, including the models that power ChatGPT, are developed using three principal sources of information: information publicly available on the internet, information licensed from third parties, and information provided by users or human trainers.

This document is a high-level description of the publicly available information used to develop our models, how we source and use that information, and how we keep privacy laws in mind while doing so. To learn more about how we collect and use information from users of our services, or how to opt out of having your ChatGPT conversations used to help teach our models, please refer to our Privacy Policy and this help center article.

What is ChatGPT, and how does it work?

ChatGPT is an artificial intelligence service available over the internet. It can reorganize or summarize existing text and create new text, and it has been built to understand questions and instructions so that it can answer and follow them. It "reads" enormous amounts of existing text and learns how words tend to appear in context with other words, doing so across a very large sample of language. It then uses what it has learned to predict the most likely next word in response to a user's prompt, and each word after that. The technique is similar to the auto-complete features of search engines, smartphones, and email applications.
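The next-word idea above can be sketched with a toy counting model. This is only an illustration of the general principle, not how ChatGPT is actually implemented: real models learn statistical patterns with neural networks rather than raw word-pair counts, and the corpus and function names here are invented for the example.

```python
from collections import Counter, defaultdict

# Tiny corpus: record which word tends to follow each word.
corpus = (
    "she turned left she turned right she turned right "
    "she turned around she turned right"
).split()

following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the word most frequently observed after `word`."""
    return following[word].most_common(1)[0][0]

print(predict_next("turned"))  # "right" is most common in this corpus
```

A large language model does something conceptually similar, but over billions of examples and with far richer context than a single preceding word.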

For example, as part of the model's learning process (called "training"), we may present a partially obscured sentence, such as "instead of turning left, she turned ___." Before training, the model responds with arbitrary words, but as it reads and learns from many example sentences, it begins to understand sentences of this kind and can predict the next word accurately. It then repeats this process across an extremely large number of sentences.
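The untrained-versus-trained contrast can be shown with a minimal sketch, again assuming a toy count-based stand-in for a real model (the vocabulary, sentences, and function names are illustrative, not part of any actual training pipeline):

```python
import random
from collections import Counter

# Fill in the blank: "instead of turning left, she turned ___".
vocab = ["right", "around", "back", "banana", "cloud"]
counts = Counter()  # the untrained "model" knows nothing yet

def fill_blank():
    if not counts:                      # untrained: arbitrary guess
        return random.choice(vocab)
    return counts.most_common(1)[0][0]  # trained: most likely word

# "Training": read sentences and record what followed "she turned".
for sentence in ["she turned right", "she turned right", "she turned around"]:
    counts[sentence.split()[-1]] += 1

print(fill_blank())  # after training: "right"
```

Before the loop runs, any word in the vocabulary (even "banana") is as likely as any other; after reading examples, the prediction settles on what the data supports.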

Because many different words could plausibly complete this sentence (it might be "she turned right," "she turned around," or even "she turned back"), there is some randomness in how a model responds. In many cases, our models will answer the same question in different ways.

Machine learning models are long lists of numbers, called "weights" or "parameters," together with code that interprets and applies those numbers. A model does not contain a copy of the information it has learned from; rather, as a model learns something, some of the numbers that make it up change slightly to represent what it has learned. In the example above, the model read information that helped it go from guessing random incorrect words to guessing correct words more often, but all that actually happened inside the model was that its numbers changed a little: it did not copy, modify, or store the sentences it had read.
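The idea that "learning" is just small adjustments to numbers can be made concrete with a one-parameter toy model trained by gradient descent. The model, data point, and learning rate here are invented for illustration; real language models do the same kind of nudging across billions of weights:

```python
# "Learning" nudges a model's numbers slightly so its predictions get
# closer to the data; no text is ever stored in the model itself.
# Toy example: fit y = w * x to the observation (x=2, y=6).
w = 0.0          # one "weight", initially arbitrary
x, y = 2.0, 6.0  # a single training observation

for _ in range(100):
    prediction = w * x
    gradient = 2 * (prediction - y) * x  # d/dw of the squared error
    w -= 0.05 * gradient                 # a small nudge to the weight

print(round(w, 2))  # w has moved toward 3.0; only the number changed
```

After training, the model "knows" the relationship between x and y, yet all it contains is the single number w, not the data it saw.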

What kind of information does ChatGPT learn from?

As stated earlier, ChatGPT and our other services are trained on a combination of (1) public information sourced from the internet, (2) information licensed from third parties, and (3) information provided by users or human trainers. This article focuses on the first category: public information sourced from the internet. For this category, we use only publicly available information that is freely and openly accessible on the internet; we do not, for example, look at information behind paywalls or from the "dark web." We apply filters to remove information that we do not want our models to learn from or output, such as hate speech, adult content, sites that primarily aggregate personal information, and spam. We then use the remaining information to teach our models.

As discussed in the previous section, training data is not copied into or saved in a database; rather, ChatGPT learns associations between words, and that learning updates its numbers (weights). The model then uses those weights to predict and generate new words in response to a user's prompt.

It does not "copy and paste" training data. Much like a person who has read a book and then puts it down, our models do not have access to their training data once they have learned from it.

Is personal data being used to train ChatGPT?

Because a large amount of data on the internet relates to people, our training information incidentally includes personal information. We do not deliberately seek out personal information to train our models, and we use training information only to help our models learn about language and how to understand and respond to it. We do not and will not use any personal information within the training information to build profiles about people, to contact or advertise to them, to try to sell them anything, or to sell the information itself. Our models may use personal information to learn how names and addresses appear within language and sentences, or to learn about famous people and public figures; this helps our models provide more relevant responses. We also take steps to reduce the processing of personal information when training our models. For example, we exclude from training those websites that aggregate large volumes of personal information, and we train our models to reject requests for private or sensitive information about people.

In what ways does the development of ChatGPT adhere to privacy laws?

We use training information lawfully. Large language models have many applications that provide significant benefits to people, in content creation, customer service, software development, tailored education, science, and more. These benefits are made possible by the large amounts of data used to train the models. The training information is not used to identify specific individuals, and the main sources of the information are already in the public domain. For these reasons, we base our collection and use of personal information included in training information on legitimate interests under privacy laws like the GDPR, as explained in more detail in our Privacy Policy.

We have also completed a data protection impact assessment to help confirm that our collection and use of this information are carried out legally and responsibly, and we respond to objection requests and similar rights requests. Because of how our models learn language, ChatGPT's responses may sometimes include personal information about individuals whose personal information appears publicly on the internet multiple times (for example, public figures). Individuals in certain jurisdictions may object to the processing of their personal information by our models through the Privacy Portal, and may have rights of access, correction, restriction, deletion, and transfer with respect to personal information that may be included in our training information. You can exercise these rights by contacting dsar@openai.com. Please note that, under privacy laws, some of these rights are not absolute. We may decline requests where we have lawful grounds for doing so, though we always seek to protect personal information in full compliance with applicable privacy laws. If you believe you have not received a satisfactory response from us, you have the right to lodge a complaint with your local supervisory authority.
