Google’s AI breakthrough lets chatbots use six times less memory during conversations without sacrificing performance.


TurboQuant reconfigures data in active memory into a condensed format that AI models can then use as if it were the original data, but with a considerably smaller memory footprint. (Image credit: Google)

Google researchers have developed a technique for compressing the information artificial intelligence (AI) models hold in memory, allowing them to run with as much as six times less active memory.

Using the new system, called TurboQuant, AI models can retain the same amount of information and perform equally powerful computations while running on substantially less memory hardware, the company says.

The technique targets the key-value (KV) cache: the region of active memory where a model stores information about the conversation it is currently processing. For instance, if you ask ChatGPT about tomorrow’s weather forecast in your area, it might keep terms such as “weather” and “tomorrow,” along with your location and preliminary predictions like “It could be wet,” in the KV cache while it formulates its response. The larger an AI model’s KV cache, the more information it can track at once, and the more capable it is.
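
As a rough illustration of the idea (a toy sketch with made-up dimensions, not Google’s implementation), a KV cache can be pictured as per-layer lists of key and value vectors that grow by one entry for every token the model processes:

```python
import numpy as np

# Toy sketch of a KV cache (illustrative only; dimensions are made up).
# Each transformer layer stores a growing list of key and value vectors,
# one pair per token the model has processed so far.
NUM_LAYERS, HEAD_DIM = 4, 8
rng = np.random.default_rng(0)

kv_cache = [{"keys": [], "values": []} for _ in range(NUM_LAYERS)]

def cache_token(cache):
    """Append one token's key/value vectors to every layer's cache.

    In a real model these vectors come from learned projections of the
    layer's hidden state; random vectors stand in for them here.
    """
    for layer in cache:
        layer["keys"].append(rng.standard_normal(HEAD_DIM))
        layer["values"].append(rng.standard_normal(HEAD_DIM))

for _token in ["weather", "tomorrow", "in", "London"]:
    cache_token(kv_cache)

print(len(kv_cache[0]["keys"]))  # 4 entries per layer, one per token so far
```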

A single sentence amounts to just a few dozen tokens — the basic units of AI prompts and output text — but keeping hundreds of thousands of tokens in the KV cache for more complex tasks can demand tens of gigabytes of memory. Those demands scale linearly with the number of concurrent users, and ChatGPT is known to handle billions of queries a day.
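
To see where the “tens of gigabytes” figure comes from, here is a back-of-the-envelope calculation using the published dimensions of Meta’s Llama 3.1-8B (32 layers, eight grouped key-value heads, a head dimension of 128, 16-bit values); the 200,000-token context length is an assumed figure for illustration:

```python
# Back-of-the-envelope KV cache sizing for a Llama 3.1-8B-style model.
layers = 32          # transformer layers
kv_heads = 8         # grouped-query attention: 8 key/value heads
head_dim = 128       # dimensions per head
bytes_per_value = 2  # 16-bit (fp16/bf16) storage

# Both keys and values are cached, hence the factor of 2.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(bytes_per_token)                 # 131072 bytes = 128 KiB per token

tokens = 200_000                       # an assumed long, complex context
print(bytes_per_token * tokens / 1e9)  # ~26.2 GB for a single conversation
```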

The compression method reduces the active memory an AI model needs to perform the same computations. It does this through a process called quantization, in which stored values are represented with fewer bits.
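
Here is a minimal sketch of what quantization means in general (symmetric rounding of floating-point values to 4-bit integers plus one shared scale); TurboQuant’s actual scheme is more sophisticated than this:

```python
import numpy as np

def quantize_int4(x):
    """Symmetric 4-bit quantization: ints in [-8, 7] plus one float scale.

    Per-value storage drops from 16 (or 32) bits to 4, at the cost of
    a small rounding error.
    """
    scale = np.abs(x).max() / 7.0  # map the largest magnitude to +/-7
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Recover an approximation of the original values."""
    return codes.astype(np.float32) * scale

x = np.array([0.12, -0.53, 0.98, -0.08], dtype=np.float32)
codes, scale = quantize_int4(x)
print(codes)                     # [ 1 -4  7 -1]
print(dequantize(codes, scale))  # close to x, within one rounding step
```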

Google has used quantization in its neural networks for many years, but typically statically: the compression is performed once and stays fixed while the model runs. What sets TurboQuant apart is that it shrinks the KV cache’s memory in real time, a challenging task because the quantized data in the cache must remain accurate and up to date as the model generates output.
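
One way to picture that distinction (a hypothetical sketch, not Google’s code): a dynamic scheme has to compress each key/value vector at the moment it is written into the cache and decompress it whenever attention reads it back, all while the model is still generating:

```python
import numpy as np

# Hypothetical sketch: a KV cache that only ever holds quantized entries.
# Static quantization would compress the model's weights once, up front;
# here, each new vector is compressed as it arrives, token by token.
quantized_cache = []  # list of (4-bit codes, float scale) pairs

def append_quantized(vec):
    """Compress a key/value vector the moment it enters the cache.

    The codes fit in 4 bits each (real systems would pack two per byte).
    """
    scale = np.abs(vec).max() / 7.0
    codes = np.clip(np.round(vec / scale), -8, 7).astype(np.int8)
    quantized_cache.append((codes, scale))

def read_entry(i):
    """Decompress an entry on the fly when attention needs to read it."""
    codes, scale = quantized_cache[i]
    return codes.astype(np.float32) * scale

append_quantized(np.random.randn(128).astype(np.float32))
print(read_entry(0).shape)  # (128,)
```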

In a statement, Google representatives said TurboQuant “demonstrated significant potential for alleviating key-value bottlenecks without compromising AI model performance” in tests on Meta’s Llama 3.1-8B, Google’s Gemma, and Mistral AI models.

“This holds potentially far-reaching consequences for all use cases reliant on compression, particularly within the realms of search and AI,” they added.

Is this Google’s “DeepSeek moment”?

Google says TurboQuant can shrink the KV cache by a factor of at least six, using two techniques: PolarQuant and Quantized Johnson-Lindenstrauss (QJL).
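
Some simple arithmetic shows how aggressive that is, assuming the cached values start at the standard 16 bits each:

```python
# If cached keys/values start at 16 bits each, a 6x size reduction
# leaves an average budget of under 3 bits per stored value.
original_bits = 16
compression_factor = 6
print(original_bits / compression_factor)  # ~2.67 bits per value
```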

To understand these methods, it helps to know that the data in the AI’s active memory has already been converted into vectors: lists of numbers with a defined magnitude (a radius) and direction (an angle). Vectors can be mathematically “rotated,” meaning they are re-expressed in a different, shared coordinate system.

PolarQuant re-expresses the AI’s data from Cartesian coordinates (positions along X, Y and Z axes) into polar coordinates (a radius and angles around a central point). This rotation standardizes how the vectors’ angles are aligned, allowing them to be compressed into fewer bits with only a small amount of extra scaling information. The vectors then pass through the QJL optimization step, in which they are subtly adjusted to correct any computational errors introduced by the quantization.
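
As a loose illustration of the polar-coordinate idea (a simplified sketch, not the published PolarQuant algorithm, which also compresses the magnitude information), consecutive pairs of vector components can be rewritten as a radius and an angle, with the angle, being bounded, rounded to a handful of bits:

```python
import numpy as np

def polar_quantize_pairs(x, angle_bits=4):
    """Re-express consecutive component pairs as (radius, quantized angle).

    Simplified illustration: radii are kept in full precision here,
    while each angle is rounded to one of 2**angle_bits directions.
    """
    pairs = x.reshape(-1, 2)                       # (x0,x1), (x2,x3), ...
    radii = np.linalg.norm(pairs, axis=1)          # magnitudes
    angles = np.arctan2(pairs[:, 1], pairs[:, 0])  # directions in [-pi, pi]
    step = 2 * np.pi / 2**angle_bits
    codes = np.round(angles / step).astype(np.int32) % 2**angle_bits
    return radii, codes, step

def polar_dequantize(radii, codes, step):
    """Rebuild an approximation of the original vector from polar form."""
    angles = codes * step
    pairs = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return pairs.reshape(-1)

x = np.random.randn(8).astype(np.float32)
radii, codes, step = polar_quantize_pairs(x)
approx = polar_dequantize(radii, codes, step)
print(np.max(np.abs(x - approx)))  # error bounded by the angular rounding
```

Because every angle lives in the same bounded range, a few bits suffice to pin it down, which is harder to guarantee for raw Cartesian components of arbitrary size.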

In a post on the social media platform X, Matthew Prince, CEO of the web security firm Cloudflare, called the compression breakthrough “Google’s DeepSeek” — a reference to the Chinese company’s surprise release of an AI model that matched the performance of leading chatbots at a far lower cost.

Google’s announcement of TurboQuant on March 24 sent the stocks of memory manufacturers such as SanDisk, Western Digital and Seagate dipping. But while the innovation may prove instrumental in making AI more efficient, it is still at the laboratory stage and has not yet been widely deployed in real-world models.


Furthermore, TurboQuant compresses only the active memory used during inference, the phase in which the model generates a response to a query. A model’s training phase typically consumes up to four times more memory than inference, so the overall memory savings will be comparatively modest.

Merrill Lynch analyst Vivek Arya elaborated on this point in a memo to concerned investors, as reported by ZDNet: “The 6x enhancement in memory efficiency will likely lead to a 6x increase in accuracy (model size) and/or context length (KV cache allocation), rather than a 6x reduction in memory.”

Google formally introduced TurboQuant at ICLR 2026, held April 23 to 27 in Rio de Janeiro, and will present PolarQuant and QJL at AISTATS 2026 in Tangier, Morocco, in early May.

Source: www.livescience.com
