Why DeepSeek is such a game changer: Scientists explain how its AI models work and why they were so cheap to build.

DeepSeek is a new artificial intelligence (AI) model from China. (Image credit: Thomas Fuller/SOPA Images/LightRocket via Getty Images)

Less than two weeks ago, a little-known Chinese company unveiled its latest artificial intelligence (AI) model, taking the world by surprise.

DeepSeek claims in a whitepaper posted on GitHub that its open-weight R1 model performs comparably to, or even better than, AI models developed by Silicon Valley giants such as OpenAI's ChatGPT, Meta's Llama and Anthropic's Claude. Most astonishingly, the model achieved these results while being trained and run at a fraction of the cost.

Markets reacted sharply on Monday: after DeepSeek became the most downloaded free app in Apple's App Store, roughly $1 trillion was wiped off the market valuations of leading U.S. tech companies.

Nvidia, the company that makes the high-end H100 graphics chips widely assumed to be necessary for training cutting-edge AI, lost $589 billion in market value, the largest single-day loss for any company in U.S. history. DeepSeek says it trained its AI model without those chips, relying instead on less powerful Nvidia hardware. American tech companies reacted with panic and discontent, with OpenAI even suggesting that DeepSeek had plagiarized elements of its models.

AI experts say DeepSeek has changed the fundamental dogma that underpins the industry's approach to development, demonstrating that bigger is not always better.

“The fact that DeepSeek was able to build their model at lower cost, with less compute resources, in less time, and run it locally on more affordable machines confirms that in our rush to make models bigger, we've missed the opportunity to build smarter, smaller solutions,” Kristian Hammond, a professor of computer science at Northwestern University, told Live Science in an email.

But what makes DeepSeek's V3 and R1 models so groundbreaking? According to scientists, the key is their efficiency.

What drives DeepSeek models?

“In terms of advances, DeepSeek is more evolutionary than revolutionary,” Ambuj Tewari, a professor of statistics and computer science at the University of Michigan, told Live Science. “They’re still operating within the dominant paradigm of very large models (hundreds of billions of parameters) on very large datasets (trillions of tokens) with very large budgets.”

If we take DeepSeek's claims at face value, the key innovation in the company's approach is achieving performance on par with other leading systems while using far fewer resources, Tewari said.

Central to this is a “mixture of experts” system that splits DeepSeek's models into submodels, each specialized in a particular task or data type. This is paired with a load-balancing system that, instead of applying an overall penalty that slows down an overloaded system as other approaches do, dynamically shifts tasks from overloaded submodels to less-busy ones.

“This means that while the V3 model has 671 billion parameters, only 37 billion are actually activated for any given token,” Tewari explained. A token is the basic unit of text a large language model (LLM) processes, roughly a word or word fragment.
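To make the idea concrete, here is a minimal Python sketch of mixture-of-experts routing. The expert count, dimensions and top-k value are illustrative toy numbers, not DeepSeek's actual configuration; the point is only that a router activates a few submodels per token, so most parameters sit idle on any single step.

```python
# Toy mixture-of-experts routing (illustrative sizes, not DeepSeek's).
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # submodels, each specialized for certain inputs
TOP_K = 2         # experts actually activated per token
DIM = 16          # hidden dimension of the toy model

# Each "expert" is a tiny feed-forward layer; a router scores them per token.
experts = [rng.normal(size=(DIM, DIM)) for _ in range(NUM_EXPERTS)]
router = rng.normal(size=(DIM, NUM_EXPERTS))

def moe_forward(token_vec: np.ndarray) -> np.ndarray:
    """Route one token through only TOP_K of the NUM_EXPERTS experts."""
    scores = token_vec @ router                       # router score per expert
    top = np.argsort(scores)[-TOP_K:]                 # best-scoring experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over chosen experts
    # Only the chosen experts' parameters participate in this token's compute.
    return sum(w * (token_vec @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=DIM)
out = moe_forward(token)
print(out.shape)  # (16,)
```

In this toy example only 2 of 8 experts fire for each token, the same principle by which just 37 billion of V3's 671 billion parameters are active at a time.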

Further savings come from a technique known as “inference-time compute scaling,” a dial in DeepSeek's models that ramps allocated computing up or down to match the complexity of the task at hand.
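The sketch below illustrates the general idea of scaling inference-time compute with task difficulty. The difficulty heuristic, thresholds and step counts are invented for illustration and are not taken from DeepSeek's implementation.

```python
# Toy inference-time compute scaling: spend more reasoning steps on harder prompts.

def estimate_difficulty(prompt: str) -> float:
    """Crude stand-in for a difficulty estimate (here: just prompt length)."""
    return min(len(prompt) / 500.0, 1.0)

def reasoning_budget(prompt: str, min_steps: int = 4, max_steps: int = 64) -> int:
    """Ramp the allowed reasoning steps up or down with estimated complexity."""
    difficulty = estimate_difficulty(prompt)
    return int(min_steps + difficulty * (max_steps - min_steps))

print(reasoning_budget("2 + 2 = ?"))                               # small budget
print(reasoning_budget("Prove the following theorem ... " * 40))   # larger budget
```

A real system would estimate difficulty from the model's own signals rather than prompt length, but the effect is the same: easy queries consume little compute, hard ones get more.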

This efficiency also applies to the training of DeepSeek's models, which experts believe is an unintended consequence of U.S. export restrictions. China's access to Nvidia's most advanced H100 chips is restricted, so DeepSeek says it instead built its models using H800 chips, which have lower chip-to-chip data transfer rates. Nvidia designed this “weaker” chip in 2023 specifically to comply with the export controls.

Source: www.livescience.com
