Google’s AI breakthrough lets chatbots use six times less memory during conversations without sacrificing performance.


TurboQuant reconfigures data in active memory into a condensed format that AI models can then use as if it were the original data, but with a considerably smaller memory footprint. (Image credit: Google)

Google researchers have developed a technique for compressing the information artificial intelligence (AI) models hold in memory, allowing them to run with as much as six times less active memory.

Using the new system, called TurboQuant, AI models can retain the same amount of information and perform equally powerful computations while running on substantially less memory hardware, the company says.

The technique targets the key-value (KV) cache: the region of active memory where a model stores information about the conversation it is currently processing. For instance, if you ask ChatGPT about tomorrow’s weather forecast in your area, it might keep terms such as “weather” and “tomorrow,” along with your location and preliminary predictions like “It could be wet,” in the KV cache while it formulates its response. The larger an AI model’s KV cache, the more information it can track at once, and the more capable it is.
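
As a rough illustration of the idea (a toy sketch with made-up dimensions, not Google’s implementation), a KV cache can be pictured as per-layer lists of key and value vectors that grow by one entry for every token the model processes:

```python
import numpy as np

# Toy sketch of a KV cache (illustrative only; dimensions are made up).
# Each transformer layer stores a growing list of key and value vectors,
# one pair per token the model has processed so far.
NUM_LAYERS, HEAD_DIM = 4, 8
rng = np.random.default_rng(0)

kv_cache = [{"keys": [], "values": []} for _ in range(NUM_LAYERS)]

def cache_token(cache):
    """Append one token's key/value vectors to every layer's cache.

    In a real model these vectors come from learned projections of the
    layer's hidden state; random vectors stand in for them here.
    """
    for layer in cache:
        layer["keys"].append(rng.standard_normal(HEAD_DIM))
        layer["values"].append(rng.standard_normal(HEAD_DIM))

for _token in ["weather", "tomorrow", "in", "London"]:
    cache_token(kv_cache)

print(len(kv_cache[0]["keys"]))  # 4 entries per layer, one per token so far
```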

A single sentence amounts to just a few dozen tokens — the basic units of AI prompts and output text — but keeping hundreds of thousands of tokens in the KV cache for more complex tasks can demand tens of gigabytes of memory. Those demands scale linearly with the number of concurrent users, and ChatGPT is known to handle billions of queries a day.
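
To see where the “tens of gigabytes” figure comes from, here is a back-of-the-envelope calculation using the published dimensions of Meta’s Llama 3.1-8B (32 layers, eight grouped key-value heads, a head dimension of 128, 16-bit values); the 200,000-token context length is an assumed figure for illustration:

```python
# Back-of-the-envelope KV cache sizing for a Llama 3.1-8B-style model.
layers = 32          # transformer layers
kv_heads = 8         # grouped-query attention: 8 key/value heads
head_dim = 128       # dimensions per head
bytes_per_value = 2  # 16-bit (fp16/bf16) storage

# Both keys and values are cached, hence the factor of 2.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(bytes_per_token)                 # 131072 bytes = 128 KiB per token

tokens = 200_000                       # an assumed long, complex context
print(bytes_per_token * tokens / 1e9)  # ~26.2 GB for a single conversation
```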

The compression method reduces the active memory an AI model needs to perform the same computations. It does this through a process called quantization, in which stored values are represented with fewer bits.
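
Here is a minimal sketch of what quantization means in general (symmetric rounding of floating-point values to 4-bit integers plus one shared scale); TurboQuant’s actual scheme is more sophisticated than this:

```python
import numpy as np

def quantize_int4(x):
    """Symmetric 4-bit quantization: ints in [-8, 7] plus one float scale.

    Per-value storage drops from 16 (or 32) bits to 4, at the cost of
    a small rounding error.
    """
    scale = np.abs(x).max() / 7.0  # map the largest magnitude to +/-7
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Recover an approximation of the original values."""
    return codes.astype(np.float32) * scale

x = np.array([0.12, -0.53, 0.98, -0.08], dtype=np.float32)
codes, scale = quantize_int4(x)
print(codes)                     # [ 1 -4  7 -1]
print(dequantize(codes, scale))  # close to x, within one rounding step
```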

Google has used quantization in its neural networks for many years, but typically statically: the compression is performed once and stays fixed while the model runs. What sets TurboQuant apart is that it shrinks the KV cache’s memory in real time, a challenging task because the quantized data in the cache must remain accurate and up to date as the model generates output.
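
One way to picture that distinction (a hypothetical sketch, not Google’s code): a dynamic scheme has to compress each key/value vector at the moment it is written into the cache and decompress it whenever attention reads it back, all while the model is still generating:

```python
import numpy as np

# Hypothetical sketch: a KV cache that only ever holds quantized entries.
# Static quantization would compress the model's weights once, up front;
# here, each new vector is compressed as it arrives, token by token.
quantized_cache = []  # list of (4-bit codes, float scale) pairs

def append_quantized(vec):
    """Compress a key/value vector the moment it enters the cache.

    The codes fit in 4 bits each (real systems would pack two per byte).
    """
    scale = np.abs(vec).max() / 7.0
    codes = np.clip(np.round(vec / scale), -8, 7).astype(np.int8)
    quantized_cache.append((codes, scale))

def read_entry(i):
    """Decompress an entry on the fly when attention needs to read it."""
    codes, scale = quantized_cache[i]
    return codes.astype(np.float32) * scale

append_quantized(np.random.randn(128).astype(np.float32))
print(read_entry(0).shape)  # (128,)
```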

In a statement, Google representatives said TurboQuant “demonstrated significant potential for alleviating key-value bottlenecks without compromising AI model performance” in tests on Meta’s Llama 3.1-8B, Google’s Gemma, and Mistral AI models.

“This holds potentially far-reaching consequences for all use cases reliant on compression, particularly within the realms of search and AI,” they added.

Is this Google’s “DeepSeek moment”?

Google says TurboQuant can shrink the KV cache by a factor of at least six, using two techniques: PolarQuant and Quantized Johnson-Lindenstrauss (QJL).
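
Some simple arithmetic shows how aggressive that is, assuming the cached values start at the standard 16 bits each:

```python
# If cached keys/values start at 16 bits each, a 6x size reduction
# leaves an average budget of under 3 bits per stored value.
original_bits = 16
compression_factor = 6
print(original_bits / compression_factor)  # ~2.67 bits per value
```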

To understand these methods, it helps to know that the data in the AI’s active memory has already been converted into vectors: lists of numbers with a defined magnitude (a radius) and direction (an angle). Vectors can be mathematically “rotated,” meaning they are re-expressed in a different, shared coordinate system.

PolarQuant re-expresses the AI’s data from Cartesian coordinates (positions along X, Y and Z axes) into polar coordinates (a radius and angles around a central point). This rotation standardizes how the vectors’ angles are aligned, allowing them to be compressed into fewer bits with only a small amount of extra scaling information. The vectors then pass through the QJL optimization step, in which they are subtly adjusted to correct any computational errors introduced by the quantization.
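
As a loose illustration of the polar-coordinate idea (a simplified sketch, not the published PolarQuant algorithm, which also compresses the magnitude information), consecutive pairs of vector components can be rewritten as a radius and an angle, with the angle, being bounded, rounded to a handful of bits:

```python
import numpy as np

def polar_quantize_pairs(x, angle_bits=4):
    """Re-express consecutive component pairs as (radius, quantized angle).

    Simplified illustration: radii are kept in full precision here,
    while each angle is rounded to one of 2**angle_bits directions.
    """
    pairs = x.reshape(-1, 2)                       # (x0,x1), (x2,x3), ...
    radii = np.linalg.norm(pairs, axis=1)          # magnitudes
    angles = np.arctan2(pairs[:, 1], pairs[:, 0])  # directions in [-pi, pi]
    step = 2 * np.pi / 2**angle_bits
    codes = np.round(angles / step).astype(np.int32) % 2**angle_bits
    return radii, codes, step

def polar_dequantize(radii, codes, step):
    """Rebuild an approximation of the original vector from polar form."""
    angles = codes * step
    pairs = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return pairs.reshape(-1)

x = np.random.randn(8).astype(np.float32)
radii, codes, step = polar_quantize_pairs(x)
approx = polar_dequantize(radii, codes, step)
print(np.max(np.abs(x - approx)))  # error bounded by the angular rounding
```

Because every angle lives in the same bounded range, a few bits suffice to pin it down, which is harder to guarantee for raw Cartesian components of arbitrary size.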

In a post on the social media platform X, Matthew Prince, CEO of the web security firm Cloudflare, called the compression breakthrough “Google’s DeepSeek” — a reference to the Chinese company’s surprise release of an AI model that matched the performance of leading chatbots at a far lower cost.

Google’s announcement of TurboQuant on March 24 sent the stocks of memory manufacturers such as SanDisk, Western Digital and Seagate dipping. But while the innovation may prove instrumental in making AI more efficient, it is still at the laboratory stage and has not yet been widely deployed in real-world models.


Furthermore, TurboQuant compresses only the active memory used during inference, the phase in which the model generates a response to a query. A model’s training phase typically consumes up to four times more memory than inference, so the overall memory savings will be comparatively modest.

Merrill Lynch analyst Vivek Arya elaborated on this point in a memo to concerned investors, as reported by ZDNet: “The 6x enhancement in memory efficiency will likely lead to a 6x increase in accuracy (model size) and/or context length (KV cache allocation), rather than a 6x reduction in memory.”

Google formally introduced TurboQuant at ICLR 2026, held April 23 to 27 in Rio de Janeiro, and will present PolarQuant and QJL at AISTATS 2026 in Tangier, Morocco, in early May.

Source: www.livescience.com
