Sundar Pichai, chief executive officer of Alphabet Inc., during the Google I/O Developers Conference in Mountain View, California, on Wednesday, May 10, 2023.
David Paul Morris | Bloomberg | Getty Images
Google’s new large language model, which the company announced last week, uses almost five times as much training data as its predecessor from 2022, allowing it to perform more advanced coding, math and creative writing tasks, CNBC has learned.
PaLM 2, the company’s new general-use large language model (LLM) that was unveiled at Google I/O, is trained on 3.6 trillion tokens, according to internal documentation viewed by CNBC. Tokens, which are words or pieces of words, are an important building block for training LLMs, because they teach the model to predict the next word that will appear in a sequence.
Google’s previous version of PaLM, which stands for Pathways Language Model, was released in 2022 and trained on 780 billion tokens.
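The "almost five times" figure follows directly from the two token counts reported above. As a quick check, here is a minimal sketch in Python; the constant and function names are illustrative, not taken from any Google documentation.

```python
# Token counts as reported in this article (per internal documentation viewed by CNBC).
PALM_TOKENS = 780e9     # original PaLM, released in 2022
PALM2_TOKENS = 3.6e12   # PaLM 2

def data_ratio(new_tokens: float, old_tokens: float) -> float:
    """Return how many times larger the new training set is than the old one."""
    return new_tokens / old_tokens

ratio = data_ratio(PALM2_TOKENS, PALM_TOKENS)
print(f"PaLM 2 uses about {ratio:.1f}x the training tokens of PaLM")
```

The ratio works out to roughly 4.6, which the article rounds to "almost five times."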
While Google has been eager to showcase the power of its artificial intelligence technology and how it can be embedded into search, emails, word processing and spreadsheets, the company has been unwilling to publish the size or other details of its training data. OpenAI, the Microsoft-backed creator of ChatGPT, has also kept secret the specifics of its latest LLM called GPT-4.
The reason for the lack of disclosure, the companies say, is the competitive nature of the business. Google and OpenAI are rushing to attract users who may want to search for information using conversational chatbots rather than traditional search engines.
But as the AI arms race heats up, the research community is demanding greater transparency.
Since unveiling PaLM 2, Google has said the new model is smaller than prior LLMs, which is significant because it means the company’s technology is becoming more efficient while accomplishing more sophisticated tasks. PaLM 2, according to internal documents, has 340 billion parameters, an indication of the complexity of the model. The initial PaLM had 540 billion parameters.
Google didn’t immediately provide a comment for this story.
Google said in a blog post about PaLM 2 that the model uses a “new technique” called “compute-optimal scaling.” That makes the LLM “more efficient with overall better performance, including faster inference, fewer parameters to serve, and a lower serving cost.”
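One rough way to see the shift toward a smaller model trained on more data is to divide each model's reported token count by its reported parameter count. The sketch below uses only the figures cited in this article. The tokens-per-parameter ratio is an illustrative metric chosen here, not one Google has published, and the variable names are hypothetical.

```python
# Reported figures from this article: (training tokens, parameters).
MODELS = {
    "PaLM": (780e9, 540e9),
    "PaLM 2": (3.6e12, 340e9),
}

def tokens_per_parameter(tokens: float, parameters: float) -> float:
    """Illustrative metric: training tokens seen per model parameter."""
    return tokens / parameters

# Compute the ratio for each model from the reported figures.
ratios = {name: tokens_per_parameter(t, p) for name, (t, p) in MODELS.items()}

for name, value in ratios.items():
    print(f"{name}: about {value:.1f} training tokens per parameter")
```

By this measure, PaLM 2 saw roughly 10.6 tokens per parameter compared with about 1.4 for the original PaLM, consistent with Google's description of a model with fewer parameters to serve.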
In announcing PaLM 2, Google confirmed CNBC’s previous reporting that the model is trained on 100 languages and performs a broad range of tasks. It’s already being used to power 25 features and products, including the company’s experimental chatbot Bard. It’s available in four sizes, from smallest to largest: Gecko, Otter, Bison and Unicorn.
PaLM 2 is more powerful than any existing model, based on public disclosures. LLaMA, the LLM that Facebook parent Meta announced in February, is trained on 1.4 trillion tokens. The last time OpenAI shared the size of its training data was with GPT-3, when the company said the model was trained on 300 billion tokens. OpenAI released GPT-4 in March, and said it exhibits “human-level performance” on many professional tests.
LaMDA, a conversation LLM that Google introduced two years ago and touted in February alongside Bard, was trained on 1.5 trillion tokens, according to the latest documents viewed by CNBC.
As new AI applications quickly hit the mainstream, controversies surrounding the underlying technology are getting more spirited.
El Mahdi El Mhamdi, a senior Google Research scientist, resigned in February over the company’s lack of transparency. On Tuesday, OpenAI CEO Sam Altman testified at a hearing of the Senate Judiciary subcommittee on privacy and technology, and agreed with lawmakers that a new system to deal with AI is needed.
“For a very new technology we need a new framework,” Altman said. “Certainly companies like ours bear a lot of responsibility for the tools that we put out in the world.”
— CNBC’s Jordan Novet contributed to this report.