As someone who spent ~10 years working as a generalist investment analyst at various long/short hedge funds (including stints at Millennium and Balyasny), while also being something of a math and computer nerd who has been studying deep learning since 2010 (back when Geoff Hinton was still talking about Restricted Boltzmann Machines and everything was still programmed using MATLAB, and researchers were still trying to show that they could get better results at classifying handwritten digits than by using Support Vector Machines), I'd like to think that I have a fairly unusual perspective on how AI technology is developing and how this relates to equity valuations in the stock market.
For the past few years, I have been working more as a developer, and have several popular open-source projects for working with various forms of AI models/services (e.g., see LLM Aided OCR, Swiss Army Llama, Fast Vector Similarity, Source to Prompt, and Pastel Inference Layer for a few recent examples). Basically, I am using these frontier models all day, every day, in about as intense a way as possible. I have 3 Claude accounts so I don't run out of requests, and signed up for ChatGPT Pro within minutes of it being available.
I also try to keep on top of the latest research advances, and carefully read all the major technical report papers that come out from the major AI labs. So I think I have a pretty good read on the space and how things are developing. At the same time, I've shorted a ton of stocks in my life and have won the best idea prize on the Value Investors Club twice (for TMS long and PDH short if you're keeping track at home).
I say this not to brag, but rather to help establish my bona fides as someone who could opine on the subject without coming across as hopelessly naive to either technologists or professional investors. And while there are surely many people who know the math/science better, and people who are better at long/short investing in the stock market than me, I doubt there are very many who are in the middle of the Venn diagram to the extent I can claim to be.
With all that said, whenever I meet with and chat with my friends and ex colleagues from the hedge fund world, the conversation quickly turns to Nvidia. It's not every day that a company goes from relative obscurity to being worth more than the combined stock markets of England, France, or Germany! And naturally, these friends want to know my thoughts on the subject. Because I am such a dyed-in-the-wool believer in the long term transformative impact of this technology— I truly believe it's going to radically change nearly every aspect of our economy and society in the next 5-10 years, with basically no historical precedent— it has been hard for me to make the argument that Nvidia's momentum is going to slow down or stop anytime soon.
But even though I've thought the valuation was just too rich for my blood for the past year or so, a confluence of recent developments has caused me to flip back to my usual instinct, which is to be a bit more contrarian in outlook and to question the consensus when it seems to be more than priced in. The saying "what the wise man believes in the beginning, the fool believes in the end" became famous for a good reason.
The Bull Case
Before we get into the developments that give me pause, let's briefly review the bull case for NVDA shares, which is by now known by basically everyone and his brother. Deep learning and AI are the most transformative technologies since the internet, and poised to change basically everything in our society. Nvidia has somehow ended up with something close to a monopoly in terms of the share of aggregate industry capex that is spent on training and inference infrastructure.
Some of the largest and most profitable companies in the world, like Microsoft, Apple, Amazon, Meta, Google, Oracle, etc., have all decided that they must do and spend whatever it takes to stay competitive in this space because they simply cannot afford to be left behind. The capex dollars, gigawatts of electricity used, square footage of new-build data centers, and, of course, the number of GPUs have absolutely exploded and show no sign of slowing down. And Nvidia is able to earn insanely high 90%+ gross margins on its most high-end, datacenter-oriented products.
We've just scratched the surface here of the bull case. There are many additional aspects to it now, which have made even people who were already very bullish to become incrementally more bullish. Besides things like the rise of humanoid robots, which I suspect is going to take most people by surprise when they are rapidly able to perform a huge number of tasks that currently require an unskilled (or even skilled) human worker (e.g., doing laundry, cleaning, organizing, and cooking; doing construction work like renovating a bathroom or building a house in a team of workers; running a warehouse and driving forklifts, etc.), there are other factors which most people haven't even considered.
One major thing that you hear the smart crowd talking about is the rise of "a new scaling law," which has created a new paradigm for thinking about how compute needs will increase over time. The original scaling law, which is what has been driving progress in AI since AlexNet appeared in 2012 and the Transformer architecture was invented in 2017, is the pre-training scaling law: the more billions (and now trillions) of tokens we can use as training data, the larger the parameter count of the models we are training, and the more FLOPS of compute we expend on training those models on those tokens, the better the performance of the resulting models on a large variety of highly useful downstream tasks.
Not only that, but this improvement is somewhat predictable, to the point where the leading AI labs like OpenAI and Anthropic have a pretty good idea of just how good their latest models will be even before they start the actual training runs— in some cases, predicting the benchmarks of the final models to within a couple of percentage points. This "original scaling law" has been vitally important, but it has always left some doubts in the minds of people projecting the future with it.
For one thing, we seem to have already exhausted the world's accumulated set of high quality training data. Of course, that's not literally true— there are still many old books and periodicals that haven't yet been properly digitized, and even if they have been, they are not properly licensed for use as training data. The problem is that, even if you give credit for all that stuff (say, the sum total of "professionally" produced English-language written content from the year 1500 to the year 2000), it's not such a tremendous amount in percentage terms when you're talking about a training corpus of nearly 15 trillion tokens, which is the scale of current frontier models.
For a quick reality check of those numbers: Google Books has digitized around 40mm books so far; if a typical book has 50k to 100k words, or 65k to 130k tokens, then that's between 2.6T and 5.2T tokens just from books, though surely a large chunk of that is already included in the training corpora used by the big labs, whether it's strictly legal or not. And there are lots of academic papers, with the arXiv website alone having over 2mm papers. And the Library of Congress has over 3 billion digitized newspaper pages. Taken together, that could be as much as 7T tokens in total, but since much of this is in fact included in training corpora, the remaining "incremental" training data probably isn't all that significant in the grand scheme of things.
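If you want to sanity-check that arithmetic yourself, the back-of-the-envelope version looks like this (the per-book, per-paper, and per-page token counts are the same loose assumptions used above, not measured figures):

```python
# Rough token arithmetic using the estimates quoted above (all inputs are loose guesses).
books = 40e6                                              # ~40mm books digitized by Google Books
tokens_per_book_low, tokens_per_book_high = 65e3, 130e3   # 50k-100k words at ~1.3 tokens/word

low = books * tokens_per_book_low
high = books * tokens_per_book_high
print(f"Books: ~{low/1e12:.1f}T to ~{high/1e12:.1f}T tokens")

arxiv_papers = 2e6                    # papers on arXiv
tokens_per_paper = 10e3               # assumed average length
newspaper_pages = 3e9                 # Library of Congress digitized pages
tokens_per_page = 1e3                 # assumed average length

other = arxiv_papers * tokens_per_paper + newspaper_pages * tokens_per_page
print(f"Papers + newspapers: ~{other/1e12:.1f}T tokens")
print("Frontier training corpora, for comparison: ~15T tokens")
```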
Of course, there are other ways to gather more training data. You could automatically transcribe every single YouTube video, for example, and use that text. And while that might be helpful on the margin, it's certainly of much lower quality than, say, a highly respected textbook on Organic Chemistry as a source of useful knowledge about the world. So we've always had a looming "data wall" when it comes to the original scaling law; although we know we can keep shoveling more and more capex into GPUs and building more and more data centers, it's a lot harder to mass produce useful new human knowledge which is correct and incremental to what is already out there. Now, one intriguing response to this has been the rise of "synthetic data," which is text that is itself the output of an LLM. And while it seems almost nonsensical that "getting high on your own supply" would work as a way of improving model quality, it actually seems to work very well in practice, at least in the domain of math, logic, and computer programming.
The reason, of course, is that these are areas where we can mechanically check and prove the correctness of things. So we can sample from the vast universe of possible math theorems or possible Python scripts, and then actually check if they are correct, and only include them in our corpus if they are. And in this way, we can very dramatically expand our collection of high quality training data, at least in these kinds of areas.
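As a concrete illustration of that sample-then-verify loop, here is a minimal sketch for the code case; `generate_candidate` is a hypothetical stand-in for sampling programs from an LLM, and the mechanical check is just a set of asserts run in a subprocess:

```python
import subprocess
import sys
import textwrap

def generate_candidate(i: int) -> str:
    """Stand-in for sampling a candidate program from an LLM (hypothetical)."""
    # For illustration, alternate between a correct and a broken implementation.
    good = "def add(a, b):\n    return a + b"
    bad = "def add(a, b):\n    return a - b"
    return good if i % 2 == 0 else bad

CHECK = textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
    print("ok")
""")

kept = []
for i in range(4):
    candidate = generate_candidate(i)
    # Run the candidate plus a mechanical correctness check; keep it only if it passes.
    result = subprocess.run([sys.executable, "-c", candidate + "\n" + CHECK],
                            capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        kept.append(candidate)

print(f"Kept {len(kept)} of 4 candidates for the synthetic training corpus")
```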
And then there are all the other kinds of data we could be training AI on besides text. For example, what if we take whole-genome sequencing data (around 200 GB to 300 GB uncompressed for a single human being) for 100 million people? That's a lot of data obviously, although the vast majority of it would be nearly identical between any two people. Of course, comparing this to textual data from books and the internet could be misleading for various reasons:
- Raw genome size isn't directly comparable to token counts
- The information content of genomic data is very different from text
- The training value of highly redundant data isn't clear
- The computational requirements for processing genomic data are different
But it's still another large source of diverse information that we could train huge models on in the future, which is why I included it.
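To put some rough numbers on why the raw size is misleading, here is a crude calculation using the per-genome figure from above, plus the standard ~99.9% genome-to-genome similarity that "nearly identical" implies:

```python
# Raw bytes vs. roughly non-redundant content for the whole-genome thought experiment above.
people = 100e6
gb_per_genome = 250                 # midpoint of the 200-300 GB uncompressed figure

raw_gb = people * gb_per_genome
print(f"Raw data: ~{raw_gb/1e9:.0f} exabytes")

# Any two human genomes are ~99.9% identical, so the genuinely novel content per
# additional person is a tiny slice of the raw size (a very crude illustration).
novel_fraction = 0.001
print(f"Roughly novel content: ~{raw_gb*novel_fraction/1e6:.0f} petabytes "
      f"(and even that isn't directly comparable to text tokens)")
```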
So while there is some hope in terms of being able to capture more and more additional training data, if you look at the rate at which training corpora have grown in recent years, it quickly becomes obvious that we are close to hitting a wall in terms of data availability for "generally useful" knowledge that can get us closer to the ultimate goal of getting artificial super-intelligence which is 10x smarter than John von Neumann and is an absolute world-class expert on every specialty known to man.
Besides the limited amount of available data, there have always been a couple other things that have lurked in the back of the mind of proponents of the pre-training scaling law. A big one of these is, after you've finished training the model, what are you supposed to do with all that compute infrastructure? Train the next model? Sure, you can do that, but given the rapid improvement in GPU speed and capacity, and the importance of electricity and other opex in the economic calculations, does it even really make sense to use your 2 year old cluster to train your new model? Surely you'd rather use the brand new data center you just built that costs 10x the old data center and is 20x more powerful because of better technology. The problem is, at some point you do need to amortize the up-front cost of these investments and recoup it with a stream of (hopefully positive) operating profit, right?
The market is so excited about AI that it has thankfully ignored this, allowing companies like OpenAI to post breathtaking cumulative operating losses from inception while garnering increasingly eye-popping valuations in follow-up investment rounds (although, to their credit, they have also been able to demonstrate very fast growing revenues). But eventually, for this situation to be sustainable over a full market cycle, these data center costs do need to be recouped, hopefully with a profit that, over time, is competitive with other investment opportunities on a risk-adjusted basis.
The New Paradigm
OK, so that was the pre-training scaling law. What's this "new" scaling law? Well, that's something that people really just started focusing on in the past year: inference time compute scaling. Before, the vast majority of all the compute you'd expend in the process was the up-front training compute to create the model in the first place. Once you had the trained model, performing inference on that model— i.e., asking a question or having the LLM perform some kind of task for you— used a certain, limited amount of compute.
Critically, the total amount of inference compute (measured in various ways, such as FLOPS, in GPU memory footprint, etc.) was much, much less than what was required for the pre-training phase. Of course, the amount of inference compute does flex up when you increase the context window size of the models and the amount of output that you generate from them in one go (although researchers have made breathtaking algorithmic improvements on this front relative to the initial quadratic scaling people originally expected in scaling this up). But essentially, until recently, inference compute was generally a lot less intensive than training compute, and scaled basically linearly with the number of requests you are handling— the more demand for text completions from ChatGPT, for instance, the more inference compute you used up.
With the advent of the revolutionary Chain-of-Thought ("COT") models introduced in the past year, most notably in OpenAI's flagship O1 model (but very recently in DeepSeek's new R1 model, which we will talk about later in much more detail), all that changed. Instead of the amount of inference compute being directly proportional to the length of the output text generated by the model (scaling up for larger context windows, model size, etc.), these new COT models also generate intermediate "logic tokens"; think of this as a sort of scratchpad or "internal monologue" of the model while it's trying to solve your problem or complete its assigned task.
This represents a true sea change in how inference compute works: now, the more tokens you use for this internal chain of thought process, the better the quality of the final output you can provide the user. In effect, it's like giving a human worker more time and resources to accomplish a task, so they can double and triple check their work, do the same basic task in multiple different ways and verify that they come out the same way; take the result they came up with and "plug it in" to the formula to check that it actually does solve the equation, etc.
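To see why this is such a big deal for compute demand, here is a crude comparison using the standard ~2 x parameter-count FLOPs-per-generated-token rule of thumb for decoding; the model size and token counts below are arbitrary illustrative assumptions, not figures for any particular product:

```python
# Very rough decode-compute comparison: plain answer vs. chain-of-thought answer.
params = 70e9                 # assumed 70B-parameter model
flops_per_token = 2 * params  # standard ~2*N FLOPs-per-generated-token approximation

plain_answer_tokens = 500
cot_reasoning_tokens = 20_000                  # hidden "scratchpad" tokens before the answer
cot_total_tokens = cot_reasoning_tokens + plain_answer_tokens

plain_flops = plain_answer_tokens * flops_per_token
cot_flops = cot_total_tokens * flops_per_token

print(f"Plain response: ~{plain_flops:.1e} FLOPs")
print(f"COT response:   ~{cot_flops:.1e} FLOPs ({cot_flops/plain_flops:.0f}x more compute per request)")
```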
It turns out that this approach works almost amazingly well; it essentially combines the long-anticipated power of what is called "reinforcement learning" with the power of the Transformer architecture. It directly addresses the single biggest weakness of the otherwise phenomenally successful Transformer model, which is its propensity to "hallucinate".
Basically, because Transformers predict the next token one step at a time, if they start out on a bad "path" in their initial response, they become almost like a prevaricating child who tries to spin a yarn about why they are actually correct, even if they should have realized mid-stream, using common sense, that what they are saying couldn't possibly be correct.
Because the models are always seeking to be internally consistent and to have each successive generated token flow naturally from the preceding tokens and context, it's very hard for them to course-correct and backtrack. By breaking the inference process into what is effectively many intermediate stages, they can try lots of different things and see what's working and keep trying to course-correct and try other approaches until they can reach a fairly high threshold of confidence that they aren't talking nonsense.
Perhaps the most extraordinary thing about this approach, beyond the fact that it works at all, is that the more logic/COT tokens you use, the better it works. Suddenly, you now have an additional dial you can turn so that, as you increase the amount of COT reasoning tokens (which uses a lot more inference compute, both in terms of FLOPS and memory), the higher the probability is that you will give a correct response— code that runs the first time without errors, or a solution to a logic problem without an obviously wrong deductive step.
I can tell you from a lot of firsthand experience that, as good as Anthropic's Claude 3.5 Sonnet model is at Python programming— and it is indeed VERY good— whenever you need to generate anything long and complicated, it invariably ends up making one or more stupid mistakes. Now, these mistakes are usually pretty easy to fix; in fact, you can normally fix them by simply feeding the errors generated by the Python interpreter, without any further explanation, back in as a follow-up inference prompt (or, more usefully, pasting in the complete set of detected "problems" found in the code by your code editor, using something called a linter). Still, it was an annoying additional step. And when the code becomes very long or very complicated, it can sometimes take a lot longer to fix, and might even require some manual debugging by hand.
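That feed-the-traceback-back-in loop is simple enough to sketch; in the snippet below, `ask_model` is a hypothetical placeholder for whichever LLM API you happen to be calling, and the rest just runs the generated code and hands any interpreter error straight back as the follow-up prompt:

```python
import subprocess
import sys

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call that returns Python source code."""
    raise NotImplementedError("wire this up to your model provider of choice")

def run_snippet(code: str) -> tuple[bool, str]:
    """Execute generated code in a subprocess and capture any traceback."""
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, timeout=30)
    return result.returncode == 0, result.stderr

def generate_with_retries(task: str, max_attempts: int = 3) -> str:
    code = ask_model(task)
    for _ in range(max_attempts):
        ok, traceback_text = run_snippet(code)
        if ok:
            return code
        # Feed the raw interpreter error back as the follow-up prompt, nothing else.
        code = ask_model(f"{task}\n\nYour previous code failed with:\n{traceback_text}\nPlease fix it.")
    return code
```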
The first time I tried the O1 model from OpenAI was like a revelation: I was amazed how often the code would be perfect the very first time. And that's because the COT process automatically finds and fixes problems before they ever make it to a final response token in the answer the model gives you.
In fact, the O1 model used in OpenAI's ChatGPT Plus subscription for $20/month is basically the same model as the O1-Pro model featured in their new ChatGPT Pro subscription at 10x the price ($200/month, which raised plenty of eyebrows in the developer community); the main difference is that O1-Pro thinks for a lot longer before responding, generating vastly more COT logic tokens, and consuming a far larger amount of inference compute for every response.
This is quite striking in that even a very long and complex prompt for Claude 3.5 Sonnet or GPT-4o, with ~400kb+ of context given, generally takes less than 10 seconds to begin responding, and often less than 5 seconds. Whereas that same prompt to O1-Pro could easily take 5+ MINUTES before you get a response (although OpenAI does show you some of the "reasoning steps" that are generated during the process while you wait; critically, OpenAI has decided, presumably for trade-secret related reasons, to hide from you the exact reasoning tokens it generates, showing you instead a highly abbreviated summary of them).
As you can probably imagine, there are tons of contexts where accuracy is paramount— where you'd rather give up and tell the user you can't do it at all rather than give an answer that could be trivially proven wrong or which involves hallucinated facts or otherwise specious reasoning. Anything involving money/transactions, medical stuff, legal stuff, just to name a few.
Basically, wherever the cost of inference is trivial relative to the hourly all-in compensation of the human knowledge worker who is interacting with the AI system, it becomes a complete no-brainer to dial up the COT compute (the major drawback is that it increases the latency of responses by a lot, so there are still some contexts where you might prefer to iterate faster by getting lower-latency responses that are less accurate or correct).
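A toy version of that economic comparison, with every dollar figure below being a placeholder assumption rather than a real price:

```python
# Toy comparison: the cost of dialing up COT inference vs. the human time it saves.
human_all_in_cost_per_hour = 150.00   # assumed fully-loaded cost of a knowledge worker
plain_inference_cost = 0.02           # assumed cost of a quick, lower-accuracy response
heavy_cot_cost = 3.00                 # assumed cost of a long, heavily-reasoned response

minutes_saved = 30                    # assumed rework/debugging time avoided
value_of_time_saved = human_all_in_cost_per_hour * minutes_saved / 60

print(f"Extra inference spend: ${heavy_cot_cost - plain_inference_cost:.2f}")
print(f"Value of human time saved: ${value_of_time_saved:.2f}")
```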
Some of the most exciting news in the AI world came out just a few weeks ago and concerned OpenAI's new, unreleased O3 model, which was able to solve a large variety of tasks that were previously deemed to be out of reach of current AI approaches in the near term. And the way it was able to do these hardest problems (which include exceptionally tough "foundational" math problems that would be very hard for even highly skilled professional mathematicians to solve) is that OpenAI threw insane amounts of compute resources at the problems— in some cases, spending $3k+ worth of compute power to solve a single task (compare this to traditional inference costs for a single task, which would be unlikely to exceed a couple of dollars using regular Transformer models without chain-of-thought).
It doesn't take an AI genius to realize that this development creates a new scaling law that is totally independent of the original pre-training scaling law. Now, you still want to train the best model you can by cleverly leveraging as much compute as you can and as many trillion tokens of high quality training data as possible, but that's just the beginning of the story in this new world; now, you could easily use incredibly huge amounts of compute just to do inference from these models at a very high level of confidence or when trying to solve extremely tough problems that require "genius level" reasoning to avoid all the potential pitfalls that would lead a regular LLM astray.
But Why Should Nvidia Get to Capture All The Upside?
Even if you believe, as I do, that the future prospects for AI are almost unimaginably bright, the question still remains, "Why should one company extract the majority of the profit pool from this technology?" There are certainly many historical cases where a very important new technology changed the world, but the main winners were not the companies that seemed the most promising during the initial stages of the process. The Wright Brothers' airplane company, in all its current incarnations across many different firms today, isn't worth more than $10b, despite them inventing and perfecting the technology well ahead of everyone else. And while Ford has a respectable market cap of $40b today, it's just 1.1% of Nvidia's current market cap.
To understand this, it's important to really understand why Nvidia is currently capturing so much of the pie today. After all, they aren't the only company that even makes GPUs. AMD makes respectable GPUs that, on paper, have comparable numbers of transistors, which are made using similar process nodes, etc. Sure, they aren't as fast or as advanced as Nvidia's GPUs, but it's not like the Nvidia GPUs are 10x faster or anything like that. In fact, in terms of naive/raw dollars per FLOP, AMD GPUs are something like half the price of Nvidia GPUs.
Look at other semiconductor markets, such as the DRAM market: despite being very highly consolidated, with only 3 meaningful global players (Samsung, Micron, SK-Hynix), DRAM gross margins range from negative at the bottom of the cycle to ~60% at the very top of the cycle, with an average in the 20% range. Compare that to Nvidia's overall gross margin in recent quarters of ~75%, which is dragged down by the lower-margin and more commoditized consumer 3D graphics category.
So how is this possible? Well, the main reasons have to do with software— better drivers that "just work" on Linux and which are highly battle-tested and reliable (unlike AMD, which is notorious for the low quality and instability of their Linux drivers), and highly optimized open-source code in popular libraries such as PyTorch that has been tuned to work really well on Nvidia GPUs.
It goes beyond that though— the very programming framework that coders use to write low-level code that is optimized for GPUs, CUDA, is totally proprietary to Nvidia, and it has become a de facto standard. If you want to hire a bunch of extremely talented programmers who know how to make things go really fast on GPUs, and pay them $650k/year or whatever the going rate is for people with that particular expertise, chances are that they are going to "think" and work in CUDA.
Besides software superiority, the other major thing that Nvidia has going for it is what is known as interconnect— essentially, the bandwidth that connects thousands of GPUs together efficiently so they can be jointly harnessed to train today's leading-edge foundational models. In short, the key to efficient training is to keep all the GPUs as fully utilized as possible all the time— not waiting around idling until they receive the next chunk of data they need to compute the next step of the training process.
The bandwidth requirements are extremely high— much, much higher than the typical bandwidth that is needed in traditional data center use cases. You can't really use traditional networking gear or fiber optics for this kind of interconnect, since it would introduce too much latency and wouldn't give you the pure terabytes per second of bandwidth that is needed to keep all the GPUs constantly busy.
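To get a feel for the numbers, here is a crude ring all-reduce estimate of the gradient traffic each GPU must move every training step; the parameter count, precision, cluster size, and link bandwidth are all illustrative assumptions (real training stacks shard and overlap this traffic, but the raw volume is the point):

```python
# Back-of-the-envelope: gradient sync traffic per training step with a ring all-reduce.
params = 400e9                      # assumed 400B-parameter model
bytes_per_grad = 2                  # bf16 gradients
n_gpus = 8_192                      # assumed cluster size
link_bandwidth_GBps = 400           # assumed per-GPU interconnect bandwidth (GB/s)

grad_bytes = params * bytes_per_grad
# A ring all-reduce moves roughly 2*(N-1)/N of the gradient volume in and out of each GPU.
traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * grad_bytes
sync_seconds = traffic_per_gpu / (link_bandwidth_GBps * 1e9)

print(f"Gradient volume: ~{grad_bytes/1e9:.0f} GB per step")
print(f"Per-GPU all-reduce traffic: ~{traffic_per_gpu/1e9:.0f} GB -> ~{sync_seconds:.1f} s at {link_bandwidth_GBps} GB/s")
```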
Nvidia made an incredibly smart decision to purchase the Israeli company Mellanox back in 2019 for a mere $6.9b, and this acquisition is what provided them with their industry-leading interconnect technology. Note that interconnect speed is a lot more relevant to the training process, where you have to harness together the output of thousands of GPUs at the same time, than to the inference process (including COT inference), which can use just a handful of GPUs— all you need is enough VRAM to store the quantized (compressed) model weights of the already-trained model.
So those are arguably the major components of Nvidia's "moat" and how it has been able to maintain such high margins for so long (there is also a "flywheel" aspect to things, where they aggressively invest their super-normal profits into tons of R&D, which in turn helps them improve their tech at a faster rate than the competition, so they are always in the lead in terms of raw performance).
But as was pointed out earlier, what customers really tend to care about, all other things being equal, is performance per dollar (both in up-front capex cost of equipment and in energy usage, so performance per watt), and even though Nvidia's GPUs are certainly the fastest, they are not the best price/performance when measured naively in terms of FLOPS.
But the thing is, all other things are NOT equal, and the fact that AMD's drivers suck, that popular AI software libraries don't run as well on AMD GPUs, that you can't find really good GPU experts who specialize in AMD GPUs outside of the gaming world (why would they bother when there is more demand in the market for CUDA experts?), that you can't wire thousands of them together as effectively because of lousy interconnect technology for AMD— all this means that AMD is basically not competitive in the high-end data center world, and doesn't seem to have very good prospects for getting there in the near term.
Well, that all sounds very bullish for Nvidia, right? Now you can see why the stock is trading at such a huge valuation! But what are the other clouds on the horizon? Well, there are a few that I think merit significant attention. Some have been lurking in the background for the last few years, too small to make a dent considering how quickly the pie has been growing, but they are now getting ready to potentially inflect upwards. Others are very recent developments (as in, the last 2 weeks) that might dramatically change the near-term trajectory of incremental GPU demand.
The Major Threats
At a very high level, you can think of things like this: Nvidia operated in a pretty niche area for a very long time; they had very limited competition, and the competition wasn't particularly profitable or growing fast enough to ever pose a real threat, since they didn't have the capital needed to really apply pressure to a market leader like Nvidia. The gaming market was large and growing, but didn't feature earth-shattering margins or particularly fabulous year-over-year growth rates.
A few big tech companies started ramping up hiring and spending on machine learning and AI efforts around 2016-2017, but it was never a truly significant line item for any of them on an aggregate basis— more of a "moonshot" R&D expenditure. But once the big AI race started in earnest with the release of ChatGPT in 2022— only a bit over 2 years ago, although it seems like a lifetime ago in terms of developments— that situation changed very dramatically.
Suddenly, big companies were ready to spend many, many billions of dollars incredibly quickly. The number of researchers showing up at the big research conferences like NeurIPS and ICML went up very, very dramatically. All the smart students who might have previously studied financial derivatives were instead studying Transformers, and $1mm+ compensation packages for non-executive engineering roles (i.e., for independent contributors not managing a team) became the norm at the leading AI labs.
It takes a while to change the direction of a massive cruise ship; and even if you move really quickly and spend billions, it takes a year or more to build greenfield data centers and order all the equipment (with ballooning lead times) and get it all set up and working. It takes a long time to hire and onboard even smart coders before they can really hit their stride and familiarize themselves with the existing codebases and infrastructure.
But now, you can imagine that absolutely biblical amounts of capital, brainpower, and effort are being expended in this area. And Nvidia has the biggest target of any player on their back, because they are the ones who are making the lion's share of the profits TODAY, not in some hypothetical future where the AI runs our whole lives.
So the very high-level takeaway is basically that "markets find a way"; they find alternative, radically innovative approaches to building hardware, leveraging completely new ideas to sidestep the barriers that prop up Nvidia's moat.
The Hardware Level Threat
Take, for example, the so-called "wafer scale" AI training chips from Cerebras, which dedicate an entire 300mm silicon wafer to a single, absolutely gargantuan chip that contains orders of magnitude more transistors and cores (see this recent blog post from them explaining how they were able to solve the "yield problem" that had been preventing this approach from being economically practical in the past).
To put this into perspective, if you compare Cerebras' newest WSE-3 chip to Nvidia's flagship data-center GPU, the H100, the Cerebras chip has a total die area of 46,225 square millimeters compared to just 814 for the H100 (and the H100 is itself considered an enormous chip by industry standards); that's a multiple of ~57x! And instead of having 132 "streaming multiprocessor" cores enabled on the chip like the H100 has, the Cerebras chip has ~900,000 cores (granted, each of these cores is smaller and does a lot less, but it's still an almost unfathomably large number in comparison). In more concrete apples-to-apples terms, the Cerebras chip can do around ~32x the FLOPS in AI contexts as a single H100 chip. Since an H100 sells for close to $40k a pop, you can imagine that the WSE-3 chip isn't cheap.
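Those multiples fall straight out of the spec-sheet figures quoted above:

```python
# Die-area and core-count ratios from the figures quoted above.
wse3_area_mm2, h100_area_mm2 = 46_225, 814
wse3_cores, h100_sm_count = 900_000, 132

print(f"Die area: ~{wse3_area_mm2 / h100_area_mm2:.0f}x larger")               # ~57x
print(f"Core count: ~{wse3_cores / h100_sm_count:,.0f}x more cores "
      f"(though the cores are not comparable one-to-one)")
```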
So why does this all matter? Well, instead of trying to battle Nvidia head-on by using a similar approach and trying to match the Mellanox interconnect technology, Cerebras has used a radically innovative approach to do an end-run around the interconnect problem: inter-processor bandwidth becomes much less of an issue when everything is running on the same super-sized chip. You don't even need to have the same level of interconnect because one mega chip replaces tons of H100s.
And the Cerebras chips also work extremely well for AI inference tasks. In fact, you can try it today for free here and use Meta's very respectable Llama-3.3-70B model. It responds basically instantaneously, at ~1,500 tokens per second. To put that into perspective, anything above 30 tokens per second feels relatively snappy to users based on comparisons to ChatGPT and Claude, and even 10 tokens per second is fast enough that you can basically read the response while it's being generated.
Cerebras is also not alone; there are other companies, like Groq (not to be confused with the Grok model family trained by Elon Musk's X AI). Groq has taken yet another innovative approach to solving the same fundamental problem. Instead of trying to compete with Nvidia's CUDA software stack directly, they've developed what they call a "tensor processing unit" (TPU) that is specifically designed for the exact mathematical operations that deep learning models need to perform. Their chips are designed around a concept called "deterministic compute," which means that, unlike traditional GPUs where the exact timing of operations can vary, their chips execute operations in a completely predictable way every single time.
This might sound like a minor technical detail, but it actually makes a massive difference for both chip design and software development. Because the timing is completely deterministic, Groq can optimize their chips in ways that would be impossible with traditional GPU architectures. As a result, they've been demonstrating for the past 6+ months inference speeds of over 500 tokens per second with the Llama series of models and other open source models, far exceeding what's possible with traditional GPU setups. Like Cerebras, this is available today and you can try it for free here.
Using a comparable Llama3 model with "speculative decoding," Groq is able to generate 1,320 tokens per second, on par with Cerebras and far in excess of what is possible using regular GPUs. Now, you might ask what the point is of achieving 1,000+ tokens per second when users seem pretty satisfied with ChatGPT, which is operating at less than 10% of that speed. And the thing is, it does matter. It makes it a lot faster to iterate and not lose focus as a human knowledge worker when you get instant feedback. And if you're using the model programmatically via the API, which is increasingly where much of the demand is coming from, then it can enable whole new classes of applications that require multi-stage inference (where the output of previous stages is used as input in successive stages of prompting/inference) or which require low-latency responses, such as content moderation, fraud detection, dynamic pricing, etc.
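Since "speculative decoding" is doing a lot of work in that throughput figure, here is a heavily simplified sketch of the idea: a small, cheap draft model proposes a few tokens ahead, and the big model then verifies the whole batch at once, keeping the longest prefix it agrees with. Real implementations accept or reject proposals probabilistically rather than by exact match, and both model functions below are hypothetical stand-ins:

```python
import random

def draft_next_token(context: list[str]) -> str:
    """Hypothetical small, fast draft model."""
    return random.choice(["the", "cat", "sat", "on", "mat"])

def target_next_token(context: list[str]) -> str:
    """Hypothetical large, accurate target model (one call can verify a whole batch)."""
    return random.choice(["the", "cat", "sat", "on", "mat"])

def speculative_step(context: list[str], k: int = 4) -> list[str]:
    # 1) The draft model cheaply proposes k tokens ahead.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next_token(context + proposed))
    # 2) The target model checks the proposals; keep the longest prefix it agrees with.
    accepted = []
    for i, token in enumerate(proposed):
        if target_next_token(context + proposed[:i]) == token:
            accepted.append(token)
        else:
            break
    # 3) If nothing was accepted, fall back to a single token from the target model.
    return accepted or [target_next_token(context)]

context = ["the"]
context += speculative_step(context)
print(context)
```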
But even more fundamentally, the faster you can serve requests, the faster you can cycle things, and the busier you can keep the hardware. Although Groq's hardware is extremely expensive, clocking in at $2mm to $3mm for a single server, it ends up costing far less per request fulfilled if you have enough demand to keep the hardware busy all the time.
And like Nvidia with CUDA, a huge part of Groq's advantage comes from their own proprietary software stack. They are able to take the same open source models that other companies like Meta, DeepSeek, and Mistral develop and release for free, and decompose them in special ways that allow them to run dramatically faster on their specific hardware.
Like Cerebras, they have taken different technical decisions to optimize certain particular aspects of the process, which allows them to do things in a fundamentally different way. In Groq's case, it's because they are entirely focused on inference level compute, not on training: all their special sauce hardware and software only give these huge speed and efficiency advantages when doing inference on an already trained model.
But if the next big scaling law that people are excited about is for inference-level compute— and if the biggest drawback of COT models is the high latency introduced by having to generate all those intermediate logic tokens before they can respond— then even a company that only does inference compute, but which does it dramatically faster and more efficiently than Nvidia, can introduce a serious competitive threat in the coming years. At the very least, Cerebras and Groq can chip away at the lofty expectations for Nvidia's revenue growth over the next 2-3 years that are embedded in the current equity valuation.
Besides these particularly innovative, if relatively unknown, startup competitors, there is some serious competition coming from some of Nvidia's biggest customers themselves who have been making custom silicon that specifically targets AI training and inference workloads. Perhaps the best known of these is Google, which has been developing its own proprietary TPUs since 2016. Interestingly, although it briefly sold TPUs to external customers, Google has been using all its TPUs internally for the past several years, and it is already on its 6th generation of TPU hardware.
Amazon has also been developing its own custom chips called Trainium2 and Inferentia2. And while Amazon is building out data centers featuring billions of dollars of Nvidia GPUs, they are also at the same time investing many billions in other data centers that use these internal chips. They have one cluster that they are bringing online for Anthropic that features over 400k chips.
Amazon gets a lot of flak for totally bungling their internal AI model development, squandering massive amounts of internal compute resources on models that ultimately are not competitive, but the custom silicon is another matter. Again, they don't necessarily need their chips to be better and faster than Nvidia's. What they need is for their chips to be good enough, and to build them at a breakeven gross margin instead of the ~90%+ gross margin that Nvidia earns on its H100 business.
OpenAI has also announced their plans to build custom chips, and they (together with Microsoft) are obviously the single largest user of Nvidia's data center hardware. As if that weren't enough, Microsoft has itself announced its own custom chips!
And Apple, the most valuable technology company in the world, has been blowing away expectations for years now with their highly innovative and disruptive custom silicon operation, which now completely trounces the CPUs from both Intel and AMD in terms of performance per watt, which is the most important factor in mobile (phone/tablet/laptop) applications. And they have been making their own internally designed GPUs and "Neural Processors" for years, even though they have yet to really demonstrate the utility of such chips outside of their own custom applications, like the advanced software based image processing used in the iPhone's camera.
While Apple's focus seems somewhat orthogonal to these other players in terms of its mobile-first, consumer oriented, "edge compute" focus, if it ends up spending enough money on its new contract with OpenAI to provide AI services to iPhone users, you have to imagine that they have teams looking into making their own custom silicon for inference/training (although given their secrecy, you might never even know about it directly!).
Now, it's no secret that there is a strong power law distribution of Nvidia's hyper-scaler customer base, with the top handful of customers representing the lion's share of high-margin revenue. How should one think about the future of this business when literally every single one of these VIP customers is building their own custom chips specifically for AI training and inference?
When thinking about all this, you should keep one incredibly important thing in mind: Nvidia is largely an IP based company. They don't make their own chips. The true special sauce for making these incredible devices arguably comes more from TSMC, the actual fab, and ASML, which makes the special EUV lithography machines used by TSMC to make these leading-edge process node chips. And that's critically important, because TSMC will sell their most advanced chips to anyone who comes to them with enough up-front investment and is willing to guarantee a certain amount of volume. They don't care if it's for Bitcoin mining ASICs, GPUs, TPUs, mobile phone SoCs, etc.
As much as senior chip designers at Nvidia earn per year, surely some of the best of them could be lured away by these other tech behemoths for enough cash and stock. And once they have a team and resources, they can design innovative chips (again, perhaps not even 50% as advanced as an H100, but with that Nvidia gross margin, there is plenty of room to work with) in 2 to 3 years, and thanks to TSMC, they can turn those designs into actual silicon using the exact same process node technology as Nvidia.
The Software Threat(s)
As if these looming hardware threats weren't bad enough, there are a few developments in the software world in the last couple of years that, while they started out slowly, are now picking up real steam and could pose a serious threat to the software dominance of Nvidia's CUDA. The first of these concerns the horrible Linux drivers for AMD GPUs. Remember how we talked about AMD inexplicably allowing these drivers to suck for years, despite leaving massive amounts of money on the table?
Well, amusingly enough, the infamous hacker George Hotz (famous for jailbreaking the original iPhone as a teenager, and currently the CEO of self-driving startup Comma.ai and AI computer company Tiny Corp, which also makes the open-source tinygrad AI software framework) recently announced that he was sick and tired of dealing with AMD's bad drivers, and desperately wanted to be able to leverage the lower-cost AMD GPUs in their TinyBox AI computers (which come in multiple flavors, some of which use Nvidia GPUs, and some of which use AMD GPUs).
He is now making his own custom drivers and software stack for AMD GPUs without any help from AMD themselves; on Jan. 15th of 2025, he tweeted via his company's X account that "We are one piece away from a completely sovereign stack on AMD, the RDNA3 assembler. We have our own driver, runtime, libraries, and emulator. (all in ~12,000 lines!)" Given his track record and skills, it is likely that they will have this all working in the next couple of months, which would open up a lot of exciting possibilities for using AMD GPUs in all sorts of applications where companies currently feel compelled to pay up for Nvidia GPUs.
OK, well that's just a driver for AMD, and it's not even done yet. What else is there? Well, there are a few other areas on the software side that are a lot more impactful. For one, there is now a massive concerted effort across many large tech companies and the open source software community at large to make more generic AI software frameworks that have CUDA as just one of many "compilation targets".
That is, you write your software using higher-level abstractions, and the system itself can automatically turn those high-level constructs into super well-tuned low-level code that works extremely well on CUDA. But because it's done at this higher level of abstraction, it can just as easily get compiled into low-level code that works extremely well on lots of other GPUs and TPUs from a variety of providers, such as the massive number of custom chips in the pipeline from every big tech company.
The most famous examples of these frameworks are MLX (sponsored primarily by Apple), Triton (sponsored primarily by OpenAI), and JAX (developed by Google). MLX is particularly interesting because it provides a PyTorch-like API that can run efficiently on Apple Silicon, showing how these abstraction layers can enable AI workloads to run on completely different architectures. Triton, meanwhile, has become increasingly popular as it allows developers to write high-performance code that can be compiled to run on various hardware targets without having to understand the low-level details of each platform.
这些框架中最著名的例子包括 MLX(主要由苹果赞助)、Triton(主要由 OpenAI 赞助)和 JAX(由谷歌开发)。MLX 尤为引人注目,因为它提供了一个类似 PyTorch 的 API,能够在苹果硅芯片上高效运行,展示了这些抽象层如何使 AI 工作负载能够在完全不同的架构上运行。与此同时,Triton 因其允许开发者编写高性能代码而日益流行,这些代码可以编译后运行在各种硬件目标上,无需深入了解每个平台的底层细节。
These frameworks allow developers to write their code once using high-powered abstractions and then target tons of platforms automatically— doesn't that sound like a better way to do things, one that gives you a lot more flexibility in how you actually run the code?
这些框架允许开发者使用高级抽象编写一次代码,然后自动针对众多平台进行部署 —— 这听起来不是一种更好的做事方式吗?它将在实际运行代码方面为您提供更大的灵活性
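To make the "write once, target many backends" idea concrete, here is a minimal sketch using JAX (any of the frameworks above would illustrate the same point): the function below is written purely against the high-level array API, and XLA compiles it for whatever backend happens to be present, whether that is a CPU, an Nvidia GPU, or a TPU, without the author ever touching CUDA.

```python
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA for whatever backend is available
def gelu(x):
    # An ordinary activation function written against the high-level array API;
    # no CUDA, no vendor intrinsics, nothing hardware-specific.
    return 0.5 * x * (1.0 + jnp.tanh(jnp.sqrt(2.0 / jnp.pi) * (x + 0.044715 * x ** 3)))

x = jnp.linspace(-3.0, 3.0, 8)
print(jax.devices())  # CPU, GPU, or TPU devices, depending on the machine
print(gelu(x))        # identical source code in all three cases
```

The same pattern holds for Triton kernels or MLX arrays: the hardware choice becomes a deployment detail rather than something baked into the source code.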
In the 1980s, all the most popular, best selling software was written in hand-tuned assembly language. The PKZIP compression utility for example was hand crafted to maximize speed, to the point where a competently coded version written in the standard C programming language and compiled using the best available optimizing compilers at the time, would run at probably half the speed of the hand-tuned assembly code. The same is true for other popular software packages like WordStar, VisiCalc, and so on.
在 20 世纪 80 年代,所有最受欢迎、销量最佳的软件都是用人工调优的汇编语言编写的。例如,PKZIP 压缩工具就是手工打造的,以最大化速度,以至于用标准 C 编程语言编写并使用当时最好的优化编译器编译的版本,其运行速度可能只有手工调优汇编代码的一半。对于其他流行的软件包,如 WordStar、VisiCalc 等,情况也是如此。
Over time, compilers kept getting better and better, and every time the CPU architectures changed (say, from Intel releasing the 486, then the Pentium, and so on), that hand-rolled assembler would often have to be thrown out and rewritten, something that only the smartest coders were capable of (sort of like how CUDA experts are on a different level in the job market versus a "regular" software developer). Eventually, things converged so that the speed benefits of hand-rolled assembly were outweighed dramatically by the flexibility of being able to write code in a high-level language like C or C++, where you rely on the compiler to make things run really optimally on the given CPU.
随着时间的推移,编译器变得越来越好,每当 CPU 架构发生变化时(例如,英特尔发布 486,然后是奔腾,等等),那些手工编写的汇编代码往往不得不被抛弃并重写,这只有最聪明的程序员才能做到(有点像 CUDA 专家在就业市场上与 “普通” 软件开发人员相比处于不同层次)。最终,情况趋于一致,手工编写汇编代码带来的速度优势被使用 C 或 C++ 等高级语言编写代码的灵活性所大幅超越,在这些高级语言中,你依赖编译器使代码在给定的 CPU 上运行得极其优化。
Nowadays, very little new code is written in assembly. I believe a similar transformation will end up happening for AI training and inference code, for similar reasons: computers are good at optimization, and flexibility and speed of development is increasingly the more important factor— especially if it also allows you to save dramatically on your hardware bill because you don't need to keep paying the "CUDA tax" that gives Nvidia 90%+ margins.
如今,用汇编语言编写的新代码已经很少了。我相信,出于类似的原因,AI 训练和推理代码也将经历类似的转变:计算机擅长优化,而开发的灵活性和速度正日益成为更重要的因素 —— 尤其是如果这还能让你大幅节省硬件成本,因为你不再需要持续支付让英伟达获得 90% 以上利润的 “CUDA 税”。
Yet another area where you might see things change dramatically is that CUDA might very well end up being more of a high-level abstraction itself— a "specification language" similar to Verilog (the industry standard for describing chip layouts) that skilled developers can use to describe high-level algorithms involving massive parallelism (they're already familiar with it, it's very well constructed, and it's the lingua franca). But then, instead of having that code compiled for Nvidia GPUs as you normally would, it could be fed as source code into an LLM that ports it into whatever low-level code is understood by the new Cerebras chip, the new Amazon Trainium2, or the new Google TPUv6. This isn't as far off as you might think; it's probably already well within reach using OpenAI's latest O3 model, and surely will be possible generally within a year or two.
另一个可能发生巨大变化的领域是,CUDA 本身很可能最终会成为一种更高级的抽象 —— 类似于 Verilog(作为描述芯片布局的行业标准)的 “规范语言”,熟练的开发者可以用它来描述涉及大规模并行性的高级算法(因为他们已经熟悉它,它构建得非常好,是通用语言等),但随后,与通常将其编译为用于 Nvidia GPU 的代码不同,它可以作为源代码输入到 LLM 中,该工具可以将其移植为新的 Cerebras 芯片、新的 Amazon Trainium2 或新的 Google TPUv6 等所能理解的任何低级代码。这并不像你想象的那么遥远;使用 OpenAI 最新的 O3 模型,这很可能已经触手可及,并且肯定在一两年内就能普遍实现。
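For what it's worth, here is a minimal sketch of what that LLM-assisted porting workflow could look like using the OpenAI Python SDK; the model name, the prompt, and the choice of Triton as the target are all stand-ins of mine rather than an established pipeline, and you would want to verify any generated kernel against a proper test harness before trusting it.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

cuda_kernel = """
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
"""

response = client.chat.completions.create(
    model="o3-mini",  # placeholder; use whichever reasoning model you have access to
    messages=[{
        "role": "user",
        "content": "Port this CUDA kernel to a semantically identical Triton kernel "
                   "and include a small correctness test against a NumPy reference:\n"
                   + cuda_kernel,
    }],
)
print(response.choices[0].message.content)
```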
The Theoretical Threat 理论威胁
Perhaps the most shocking development, alluded to earlier, happened in the last couple of weeks. And that is the news that has totally rocked the AI world, and which has been dominating the discourse among knowledgeable people on Twitter despite its complete absence from any of the mainstream media outlets: a small Chinese startup called DeepSeek released two new models with performance levels basically on par with the best models from OpenAI and Anthropic (blowing past the Meta Llama3 models and other smaller open-source players such as Mistral). These models are called DeepSeek-V3 (basically their answer to GPT-4o and Claude 3.5 Sonnet) and DeepSeek-R1 (basically their answer to OpenAI's O1 model).
或许最令人震惊的进展,正如之前所提及的,发生在过去几周内。这一消息彻底震撼了 AI 界,并在 Twitter 上知识渊博的人群中主导了讨论,尽管主流媒体对此完全未予报道:一家名为深度求索(DeepSeek)的中国初创公司发布了两款新模型,其性能水平基本与世界顶尖的 OpenAI 和 Anthropic 模型相媲美(超越了 Meta 的 Llama3 模型及其他小型开源模型玩家如 Mistral)。这两款模型分别名为 DeepSeek-V3(基本上是对标 GPT-4o 和 Claude3.5 Sonnet 的回应)和 DeepSeek-R1(基本上是对 OpenAI 的 O1 模型的回应)。
Why is this all so shocking? Well, first of all, DeepSeek is a tiny Chinese company that reportedly has under 200 employees. The story goes that they started out as a quant trading hedge fund similar to Two Sigma or RenTec, but after Xi Jinping cracked down on that space, they used their math and engineering chops to pivot into AI research. Who knows if any of that is really true or whether they are merely some kind of front for the CCP or the Chinese military. But the fact remains that they have released two incredibly detailed technical reports, for DeepSeek-V3 and DeepSeek-R1.
为何这一切如此令人震惊?首先,深度求索(DeepSeek)是一家规模很小的中国公司,据报道员工不足 200 人。传闻他们最初是一家类似于 TwoSigma 或 RenTec 的量化交易对冲基金,但在中国对该领域进行整顿后,他们利用自身的数学和工程能力转向了人工智能研究。这些说法是否属实,或者他们是否只是中国共产党或中国军方的某种幌子,无人知晓。但事实是,他们已经发布了关于 DeepSeek-V3 和 DeepSeekR1 的两份极其详尽的技术报告。
These are heavy technical reports, and if you don't know a lot of linear algebra, you probably won't understand much. But what you should really do is download the free DeepSeek app from the App Store, log in with a Google account, and give it a try (you can also install it on Android), or simply try it out in the browser on your desktop computer. Make sure to select the "DeepThink" option to enable chain-of-thought (the R1 model) and ask it to explain parts of the technical reports in simple terms.
这些是技术含量很高的报告,如果你对线性代数了解不多,可能很难理解大部分内容。但你应该真正尝试的是,在这里的 AppStore 下载免费的 DeepSeek 应用程序,并使用谷歌账户登录安装并试用(你也可以在这里的安卓设备上安装),或者直接在浏览器中通过桌面电脑试用。确保选择 “DeepThink” 选项以启用思维链(R1 模型),并请它以简单的语言解释技术报告的部分内容。
This will simultaneously show you a few important things:
这将同时向您展示几个重要事项:
One, this model is absolutely legit. There is a lot of BS that goes on with AI benchmarks, which are routinely gamed so that models appear to perform great on the benchmarks but then suck in real world tests. Google is certainly the worst offender in this regard, constantly crowing about how amazing their LLMs are, when they are so awful in any real world test that they can't even reliably accomplish the simplest possible tasks, let alone challenging coding tasks. These DeepSeek models are not like that— the responses are coherent, compelling, and absolutely on the same level as those from OpenAI and Anthropic.
首先,这个模型绝对靠谱。AI 基准测试中充斥着大量水分,常常被操纵,使得模型在基准测试中表现优异,但在实际测试中却糟糕透顶。谷歌在这方面无疑是最大的罪魁祸首,他们不断吹嘘自己的 LLMs 有多么神奇,然而在任何实际测试中,它们表现如此之差,连最基本的任务都无法可靠完成,更不用说具有挑战性的编程任务了。这些 DeepSeek 模型则不同 —— 它们的回答连贯、有说服力,绝对与 OpenAI 和 Anthropic 的水平相当。
Two, DeepSeek has made profound advancements not just in model quality, but more importantly in model training and inference efficiency. By being extremely close to the hardware and by layering together a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible models using GPUs in a dramatically more efficient way: by some measurements, roughly 45x more efficiently than other leading-edge models. DeepSeek claims that the complete cost to train DeepSeek-V3 was just over $5mm. That is absolutely nothing by the standards of OpenAI, Anthropic, etc., which were well into the $100mm+ range for the training cost of a single model as early as 2024 (a quick sanity check on that $5mm figure follows below).
二、深度求索不仅在模型质量上取得了显著进步,更重要的是在模型训练和推理效率方面实现了重大突破。通过极度贴近硬件并结合一系列独特且巧妙的优化策略,深度求索能够以显著更高的效率利用 GPU 训练这些卓越的模型。据某些测量,其效率比其他前沿模型高出约 45 倍。深度求索宣称,训练 DeepSeek-V3 的总成本仅略高于 500 万美元。按照 OpenAI、Anthropic 等公司的标准,这简直微不足道,早在 2024 年,这些公司训练单个模型的成本就已轻松突破 1 亿美元大关。
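That $5mm number is easy enough to sanity-check with back-of-the-envelope arithmetic. The GPU-hour count and rental rate below are the ones I recall from the V3 technical report, and the comparison run is purely illustrative, so treat all of these numbers as approximate:

```python
# Reported DeepSeek-V3 budget (figures quoted from memory of the technical
# report; approximate): total H800 GPU-hours times an assumed rental rate.
h800_gpu_hours = 2.788e6
dollars_per_gpu_hour = 2.0
print(f"DeepSeek-V3: ~${h800_gpu_hours * dollars_per_gpu_hour / 1e6:.1f}M")  # ~$5.6M

# A purely hypothetical Western-style frontier run for comparison:
# 25,000 GPUs running for 90 days at $2.50 per GPU-hour.
hypothetical_run = 25_000 * 90 * 24 * 2.50
print(f"Hypothetical frontier run: ~${hypothetical_run / 1e6:.0f}M")         # ~$135M
```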
How in the world could this be possible? How could this little Chinese company completely upstage all the smartest minds at our leading AI labs, which have 100 times more resources, headcount, payroll, capital, GPUs, etc? Wasn't China supposed to be crippled by Biden's restriction on GPU exports? Well, the details are fairly technical, but we can at least describe them at a high level. It might have just turned out that the relative GPU processing poverty of DeepSeek was the critical ingredient to make them more creative and clever, necessity being the mother of invention and all.
这怎么可能呢?这家小小的中国公司是如何完全超越我们领先的人工智能实验室里所有最聪明的人才的?这些实验室的资源、人员、薪资、资本、GPU 等都是其百倍之多。中国不是应该因为拜登对 GPU 出口的限制而受到重创吗?虽然细节相当技术性,但我们至少可以从高层次上描述一下。也许正是深度求索相对匮乏的 GPU 处理能力,成为了促使他们更具创造力和智慧的关键因素,毕竟需求是发明之母。
A major innovation is their sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers (FP8) throughout the entire training process. Most Western AI labs train using "full precision" 32-bit numbers (this basically specifies the number of gradations possible in describing the output of an artificial neuron; 8 bits in FP8 lets you store a much wider range of numbers than you might expect— it's not just limited to 256 different equal-sized magnitudes like you'd get with regular integers, but instead uses clever math tricks to store both very small and very large numbers— though naturally with less precision than you'd get with 32 bits.) The main tradeoff is that while FP32 can store numbers with incredible precision across an enormous range, FP8 sacrifices some of that precision to save memory and boost performance, while still maintaining enough accuracy for many AI workloads.
一项重大创新是他们复杂的混合精度训练框架,该框架允许他们在整个训练过程中使用 8 位浮点数(FP8)。大多数西方 AI 实验室使用 “全精度” 32 位数进行训练(这基本上指定了描述人工神经元输出时可能的分级数量;FP8 中的 8 位允许存储比预期更广泛的数字范围 —— 它不仅限于像常规整数那样 256 个不同等大小的幅度,而是使用巧妙的数学技巧来存储非常小和非常大的数字 —— 尽管自然比 32 位精度低)。主要的权衡在于,虽然 FP32 可以在巨大范围内以极高的精度存储数字,但 FP8 牺牲了一些精度以节省内存并提高性能,同时仍保持足够的准确性以应对许多 AI 工作负载。
DeepSeek cracked this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights, and strategically uses high-precision calculations at key points in the network. Unlike other labs that train in high precision and then compress later (losing some quality in the process), DeepSeek's native FP8 approach means they get the massive memory savings without compromising performance. When you're training across thousands of GPUs, this dramatic reduction in memory requirements per GPU translates into needing far fewer GPUs overall.
DeepSeek 通过开发一个巧妙的系统解决了这个问题,该系统将数字分解为用于激活的小块和用于权重的块,并在网络的关键点策略性地使用高精度计算。与其他实验室先进行高精度训练再进行压缩(在此过程中会损失一些质量)不同,DeepSeek 的原生 FP8 方法意味着他们在不牺牲性能的情况下获得了巨大的内存节省。当你在数千个 GPU 上进行训练时,每个 GPU 内存需求的显著减少意味着总体上需要的 GPU 数量大大减少。
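To build some intuition for what per-tile scaling buys you, here is a toy quantize-dequantize round trip in PyTorch. To be clear, this is just a CPU simulation of the general idea of block-wise FP8 scaling (the 128-wide tiles and the use of `float8_e4m3fn` are my own stand-ins), not DeepSeek's fused GPU kernels or their exact recipe:

```python
import torch

FP8_MAX = 448.0  # largest magnitude representable in float8_e4m3

def blockwise_fp8_roundtrip(w: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Quantize and dequantize `w` in (block x block) tiles, each with its own
    scale, so one outlier value only hurts the precision of its own tile."""
    out = torch.empty_like(w)
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            tile = w[i:i + block, j:j + block]
            scale = tile.abs().max().clamp(min=1e-12) / FP8_MAX
            q = (tile / scale).to(torch.float8_e4m3fn)  # stored in 8 bits
            out[i:i + block, j:j + block] = q.to(torch.float32) * scale
    return out

w = torch.randn(256, 256) * 5.0
err = (w - blockwise_fp8_roundtrip(w)).abs().max()
print(f"max abs reconstruction error: {err.item():.4f}")
```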
Another major breakthrough is their multi-token prediction system. Most Transformer-based LLMs do inference by predicting the next token, one token at a time. DeepSeek figured out how to predict multiple tokens while maintaining the quality you'd get from single-token prediction. Their approach achieves about 85-90% accuracy on these additional token predictions, which effectively doubles inference speed without sacrificing much quality. The clever part is that they maintain the complete causal chain of predictions, so the model isn't just guessing— it's making structured, contextual predictions.
另一个重大突破是他们的多令牌预测系统。大多数基于 Transformer 的 LLM 模型通过预测下一个令牌来进行推理 —— 一次一个令牌。DeepSeek 找到了在保持单令牌预测质量的同时预测多个令牌的方法。他们的方法在这些额外令牌预测上实现了约 85-90% 的准确率,有效将推理速度提高了一倍,而几乎没有牺牲质量。巧妙之处在于他们保持了预测的完整因果链,因此模型不仅仅是在猜测 —— 它是在进行结构化的、上下文相关的预测。
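The "effectively doubles inference speed" claim is just expected-value arithmetic: if the model drafts one extra token per decoding step and that draft survives verification 85-90% of the time, each forward pass yields nearly two usable tokens. A two-line sketch of that (simplified) accounting:

```python
# Expected usable tokens per forward pass when one extra draft token is
# produced per step and accepted with probability p (rejected drafts are discarded).
for p in (0.85, 0.90):
    print(f"acceptance {p:.0%} -> {1 + p:.2f} tokens per pass")
# ~1.85-1.90 tokens per pass, i.e. close to a 2x speedup if verification is cheap.
```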
One of their most innovative developments is what they call Multi-head Latent Attention (MLA). This is a breakthrough in how they handle what are called the Key-Value indices, which are basically how individual tokens are represented in the attention mechanism within the Transformer architecture. Although this is getting a bit too advanced in technical terms, suffice it to say that these KV indices are some of the major uses of VRAM during the training and inference process, and part of the reason why you need to use thousands of GPUs at the same time to train these models— each GPU has a maximum of 96 GB of VRAM, and these indices eat that memory up for breakfast.
他们最创新的发展之一是他们称之为多头潜在注意力(MLA)的技术。这是在处理所谓的键值索引方面的一个突破,这些索引基本上代表了 Transformer 架构中注意力机制内各个标记的表示方式。虽然这在技术术语上有点过于高级,但可以说这些 KV 索引是训练和推理过程中 VRAM 的主要用途之一,也是为什么需要同时使用数千个 GPU 来训练这些模型的部分原因 —— 每个 GPU 最多有 96GB 的 VRAM,而这些索引会迅速消耗这些内存。
Their MLA system finds a way to store a compressed version of these indices that captures the essential information while using far less memory. The brilliant part is this compression is built directly into how the model learns— it's not some separate step they need to do, it's built directly into the end-to-end training pipeline. This means that the entire mechanism is "differentiable" and able to be trained directly using the standard optimizers. All this stuff works because these models are ultimately finding much lower-dimensional representations of the underlying data than the so-called "ambient dimensions". So it's wasteful to store the full KV indices, even though that is basically what everyone else does.
他们的 MLA 系统找到了一种方法来存储这些索引的压缩版本,这些版本在捕获关键信息的同时,使用的内存要少得多。其精妙之处在于,这种压缩直接内置于模型的学习方式中 —— 它不是他们需要单独进行的步骤,而是直接内置在端到端的训练管道中。这意味着整个机制是 “可微分的”,并且能够直接使用标准优化器进行训练。所有这些之所以有效,是因为这些模型最终找到的是比所谓的 “环境维度” 低得多的基础数据的表示。因此,存储完整的 KV 索引是浪费的,尽管这基本上是其他人所做的。
Not only do you end up wasting tons of space by storing way more numbers than you need, which gives a massive boost to the training memory footprint and efficiency (again, slashing the number of GPUs you need to train a world class model), but it can actually end up improving model quality because it can act like a "regularizer," forcing the model to pay attention to the truly important stuff instead of using the wasted capacity to fit to noise in the training data. So not only do you save a ton of memory, but the model might even perform better. At the very least, you don't get a massive hit to performance in exchange for the huge memory savings, which is generally the kind of tradeoff you are faced with in AI training.
不仅因为存储了远超所需的数字而浪费了大量空间,从而大幅增加了训练内存占用并降低了效率(再次减少了训练世界级模型所需的 GPU 数量),而且实际上还可能提高模型质量,因为它可以起到 “正则化器” 的作用,迫使模型关注真正重要的内容,而不是利用浪费的容量去适应训练数据中的噪声。因此,你不仅节省了大量内存,模型性能甚至可能更优。至少,在获得巨大内存节省的同时,不会对性能造成重大影响,而这通常是 AI 训练中不得不面对的权衡。
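For a rough sense of why caching a compressed latent instead of full keys and values saves so much VRAM, here is a back-of-the-envelope sketch. The dimensions are made up but plausible, and the down/up projections merely stand in for the idea of a learned, end-to-end-trainable compression; this is not DeepSeek's exact MLA formulation.

```python
import torch.nn as nn

d_model, n_heads, d_head, d_latent, seq_len = 4096, 32, 128, 512, 8192

# Standard attention caches full per-head keys AND values for every past token:
full_kv_floats = 2 * seq_len * n_heads * d_head

# An MLA-style cache stores one small latent per token; K and V are reconstructed
# from it with learned up-projections, so the compression is trained jointly with
# the rest of the network rather than bolted on afterwards.
compress = nn.Linear(d_model, d_latent, bias=False)           # learned down-projection
expand_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # learned up-projections
expand_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
latent_floats = seq_len * d_latent

print(f"full KV cache:  {full_kv_floats / 1e6:.1f}M floats per layer")
print(f"latent cache:   {latent_floats / 1e6:.1f}M floats per layer")
print(f"reduction:      {full_kv_floats / latent_floats:.0f}x")
```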
They also made major advances in GPU communication efficiency through their DualPipe algorithm and custom communication kernels. This system intelligently overlaps computation and communication, carefully balancing GPU resources between these tasks. They only need about 20 of their GPUs' streaming multiprocessors (SMs) for communication, leaving the rest free for computation. The result is much higher GPU utilization than typical training setups achieve.
他们还通过 DualPipe 算法和定制的通信内核在 GPU 通信效率方面取得了重大进展。该系统智能地重叠计算和通信,仔细平衡 GPU 资源在这些任务之间的分配。他们仅需约 20 个 GPU 的流多处理器(SMs)用于通信,其余部分则留作计算使用。结果是 GPU 利用率远高于典型的训练设置所能达到的水平。
Another very smart thing they did is to use what is known as a Mixture-of-Experts (MOE) Transformer architecture, but with key innovations around load balancing. As you might know, the size or capacity of an AI model is often measured in terms of the number of parameters the model contains. A parameter is just a number that stores some attribute of the model; either the "weight" or importance a particular artificial neuron has relative to another one, or the importance of a particular token depending on its context (in the "attention mechanism"), etc.
他们做的另一件非常聪明的事情是使用了一种被称为专家混合(Mixture-of-Experts, MOE)的 Transformer 架构,但在负载均衡方面进行了关键创新。如你所知,AI 模型的规模或容量通常通过模型包含的参数数量来衡量。参数只是一个数字,存储了模型的某些属性;无论是特定人工神经元相对于另一个神经元的 “权重” 或重要性,还是特定标记根据其上下文(在 “注意力机制” 中)的重要性等。
Meta's latest Llama3 models come in a few sizes, for example: a 1 billion parameter version (the smallest), a 70B parameter model (the most commonly deployed one), and even a massive 405B parameter model. This largest model is of limited utility for most users because you would need to have tens of thousands of dollars worth of GPUs in your computer just to run at tolerable speeds for inference, at least if you deployed it in the naive full-precision version. Therefore most of the real-world usage and excitement surrounding these open source models is at the 8B parameter or highly quantized 70B parameter level, since that's what can fit in a consumer-grade Nvidia 4090 GPU, which you can buy now for under $1,000.
Meta 最新的 Llama3 模型提供了几种规模,例如:10 亿参数版本(最小)、700 亿参数模型(最常用部署的),甚至还有一个庞大的 4050 亿参数模型。这个最大的模型对大多数用户来说实用性有限,因为你需要价值数万美元的 GPU 才能以可接受的速度进行推理,至少如果你以原始的全精度版本部署的话。因此,围绕这些开源模型的实际使用和兴奋点主要集中在 80 亿参数或高度量化的 700 亿参数级别,因为这些可以在消费级的 Nvidia 4090 GPU 上运行,而你现在可以以不到 1000 美元的价格购买到这款 GPU。
So why does any of this matter? Well, in a sense, the parameter count and precision tells you something about how much raw information or data the model has stored internally. Note that I'm not talking about reasoning ability, or the model's "IQ" if you will: it turns out that models with even surprisingly modest parameter counts can show remarkable cognitive performance when it comes to solving complex logic problems, proving theorems in plane geometry, SAT math problems, etc.
那么,这一切为何重要呢?从某种意义上说,参数数量和精度揭示了模型内部存储了多少原始信息或数据。请注意,我并非在讨论推理能力,或者模型的 “智商”:事实证明,即便是参数数量看似不多的模型,在解决复杂逻辑问题、证明平面几何定理、SAT 数学题等方面,也能展现出惊人的认知表现。
But those small models aren't going to be able to necessarily tell you every aspect of every plot twist in every single novel by Stendhal, whereas the really big models can potentially do that. The "cost" of that extreme level of knowledge is that the models become very unwieldy both to train and to do inference on, because you always need to store every single one of those 405B parameters (or whatever the parameter count is) in the GPU's VRAM at the same time in order to do any inference with the model.
但那些小型模型未必能详尽解析司汤达每部小说中的每个情节转折,而真正的大型模型则有可能做到这一点。这种极致知识水平的 “代价” 是,模型在训练和推理过程中变得极为笨重,因为为了进行任何模型推理,你始终需要同时将所有 4050 亿个参数(或无论参数数量是多少)存储在 GPU 的显存中。
The beauty of the MOE model approach is that you can decompose the big model into a collection of smaller models that each know different, non-overlapping (or at least not fully overlapping) pieces of knowledge. DeepSeek's innovation here was developing what they call an "auxiliary-loss-free" load balancing strategy that maintains efficient expert utilization without the usual performance degradation that comes from load balancing. Then, depending on the nature of the inference request, you can intelligently route the inference to whichever "expert" models within that collection are most able to answer that question or solve that task.
MOE 模型方法的美妙之处在于,您可以将大型模型分解为一组较小的模型,每个模型都掌握着不同且不重叠(至少不完全重叠)的知识片段。DeepSeek 在此的创新是开发了一种他们称之为 “无辅助损失” 的负载均衡策略,该策略在保持专家高效利用的同时,避免了负载均衡通常带来的性能下降。然后,根据推理请求的性质,您可以智能地将推理路由到那组较小模型中最能解答该问题或完成该任务的 “专家” 模型中去。
You can loosely think of it as being a committee of experts who have their own specialized knowledge domains: one might be a legal expert, the other a computer science expert, the other a business strategy expert. So if a question comes in about linear algebra, you don't give it to the legal expert. This is of course a very loose analogy and it doesn't actually work like this in practice.
你可以大致将其想象成一个由专家组成的委员会,每位专家都有自己擅长的知识领域:一位可能是法律专家,另一位是计算机科学专家,还有一位是商业战略专家。因此,如果有一个关于线性代数的问题,你不会把它交给法律专家。当然,这是一个非常粗略的类比,实际上在实践中并不会这样运作。
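As a concrete (if cartoonish) version of that committee analogy, here is what a toy top-k routing gate looks like in PyTorch. Real MoE layers add load-balancing machinery on top of this, and DeepSeek's particular contribution is doing that balancing without the usual auxiliary loss term, which this sketch makes no attempt to reproduce:

```python
import torch
import torch.nn.functional as F

def topk_route(x: torch.Tensor, gate_w: torch.Tensor, k: int = 2):
    """Toy MoE gate: score each token against every expert, keep the top-k,
    and renormalize the kept scores into mixing weights."""
    logits = x @ gate_w                        # [tokens, n_experts]
    weights, experts = logits.topk(k, dim=-1)  # best k experts per token
    weights = F.softmax(weights, dim=-1)       # mixing weights for the chosen experts
    return experts, weights

x = torch.randn(4, 512)        # 4 tokens with a model dimension of 512
gate_w = torch.randn(512, 8)   # 8 experts
experts, weights = topk_route(x, gate_w)
print(experts)                 # which experts each token gets sent to
print(weights)                 # how much each chosen expert contributes
```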
The real advantage of this approach is that it allows the model to contain a huge amount of knowledge without being very unwieldy, because even though the aggregate number of parameters is high across all the experts, only a small subset of these parameters is "active" at any given time, which means that you only need to store this small subset of weights in VRAM in order to do inference. In the case of DeepSeek-V3, they have an absolutely massive MOE model with 671B parameters, so it's much bigger than even the largest Llama3 model, but only 37B of these parameters are active at any given time— few enough (at 8-bit precision) to fit in the VRAM of two consumer-grade Nvidia 4090 GPUs (under $2,000 total cost), rather than requiring one or more H100 GPUs that cost something like $40k each.
这种方法的真正优势在于,它使得模型能够包含海量知识而不显得过于笨重,因为尽管所有专家模型的总参数数量庞大,但在任何给定时刻,只有一小部分参数是 “活跃” 的。这意味着,为了进行推理,你只需在 VRAM 中存储这一小部分权重即可。以 DeepSeek-V3 为例,他们拥有一个参数高达 6710 亿的巨型 MOE 模型,这甚至比最大的 Llama3 模型还要大得多,但在任何时刻,仅有 370 亿参数处于活跃状态 —— 足以装入两块消费级 Nvidia 4090 GPU 的 VRAM 中(总成本低于 2000 美元),而无需使用每块价值约 4 万美元的 H100 GPU。
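The VRAM arithmetic behind that claim is straightforward, assuming the active parameters are held at 8-bit precision and ignoring the KV cache, activations, and the question of where the inactive experts live; think of it as a lower bound rather than a deployment guide:

```python
# Memory needed just to hold model weights, at different precisions.
total_params = 671e9    # DeepSeek-V3 total parameter count
active_params = 37e9    # parameters actually used for any given token
for name, bytes_per_param in [("FP16", 2), ("FP8/INT8", 1)]:
    active_gb = active_params * bytes_per_param / 1e9
    total_gb = total_params * bytes_per_param / 1e9
    print(f"{name}: ~{active_gb:.0f} GB active vs ~{total_gb:.0f} GB total")
# At 8-bit precision the ~37 GB of active weights fits inside the 48 GB offered
# by two 24 GB RTX 4090s, whereas the full parameter set needs a real cluster.
```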
It's rumored that both ChatGPT and Claude use an MoE architecture, with some leaks suggesting that GPT-4 had a total of 1.8 trillion parameters split across 8 models containing 220 billion parameters each. Despite that being a lot more doable than trying to fit all 1.8 trillion parameters in VRAM, it still requires multiple H100-grade GPUs just to run the model because of the massive amount of memory used.
传闻 ChatGPT 和 Claude 均采用了 MoE 架构,有泄露消息称 GPT-4 总共拥有 1.8 万亿参数,分布在 8 个模型中,每个模型包含 2200 亿参数。尽管这比尝试将所有 1.8 万亿参数装入 VRAM 要可行得多,但由于使用了大量内存,运行该模型仍需要多块 H100 级别的 GPU。
Beyond what has already been described, the technical papers mention several other key optimizations. These include their extremely memory-efficient training framework that avoids tensor parallelism, recomputes certain operations during backpropagation instead of storing them, and shares parameters between the main model and auxiliary prediction modules. The sum total of all these innovations, when layered together, has led to the ~45x efficiency improvement numbers that have been tossed around online, and I am perfectly willing to believe these are in the right ballpark.
除了已经描述的内容外,技术论文还提到了其他几项关键优化。这些优化包括他们极其节省内存的训练框架,该框架避免了张量并行,在反向传播期间重新计算某些操作而非存储它们,并在主模型和辅助预测模块之间共享参数。所有这些创新层层叠加,共同促成了网上流传的约 45 倍效率提升的数据,我完全愿意相信这些数字大致准确。
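The "recompute instead of store" trick mentioned there is the same basic idea as activation checkpointing, which you can see in miniature with PyTorch's built-in utility; this is the generic, vanilla version, not DeepSeek's customized training framework:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(32, 1024, requires_grad=True)

# Normal forward pass: the intermediate activations inside `block` are kept in
# memory so the backward pass can use them.
loss_stored = block(x).sum()

# Checkpointed forward pass: those activations are discarded and recomputed on
# the fly during backward, trading extra compute for a smaller memory footprint.
loss_recomputed = checkpoint(block, x, use_reentrant=False).sum()
loss_recomputed.backward()
```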
One very strong indicator that it's true is the cost of DeepSeek's API: despite this nearly best-in-class model performance, DeepSeek charges something like 95% less for inference requests via its API than OpenAI and Anthropic charge for comparable models. In a sense, it's sort of like comparing Nvidia's GPUs to the new custom chips from competitors: even if they aren't quite as good, the value for money is so much better that it can still be a no-brainer depending on the application, as long as you can validate the performance level, prove that it's good enough for your requirements, and confirm that the API availability and latency are good enough (thus far, people have been amazed at how well DeepSeek's infrastructure has held up despite the truly incredible surge of demand owing to the performance of these new models).
一个非常有力的证据是 DeepSeek 的 API 成本:尽管其模型性能几乎达到顶级水平,但 DeepSeek 通过其 API 进行推理请求的收费比 OpenAI 和 Anthropic 的同类模型低约 95%。从某种意义上说,这有点像将 Nvidia 的 GPU 与竞争对手的新定制芯片进行比较:即使它们不是最好的,性价比却高得多,根据应用场景,这仍然可能是一个无需犹豫的选择,只要你能确认性能水平并证明它足以满足你的需求,且 API 的可用性和延迟足够好(到目前为止,人们惊讶于 DeepSeek 的基础设施在因这些新模型性能而引发的需求激增中表现如此出色)。
But unlike the case of Nvidia, where the cost differential is the result of them earning monopoly gross margins of 90%+ on their data-center products, the cost differential of the DeepSeek API relative to the OpenAI and Anthropic API could be simply that they are nearly 50x more compute efficient (it might even be significantly more than that on the inference side— the ~45x efficiency was on the training side). Indeed, it's not even clear that OpenAI and Anthropic are making great margins on their API services— they might be more interested in revenue growth and gathering more data from analyzing all the API requests they receive.
但与英伟达的情况不同,其成本差异源于他们在数据中心产品上赚取 90% 以上的垄断毛利率,而 DeepSeek API 相对于 OpenAI 和 Anthropic API 的成本差异可能仅仅是因为其计算效率高出近 50 倍(在推理端甚至可能远高于此 —— 约 45 倍的效率是在训练端)。事实上,尚不清楚 OpenAI 和 Anthropic 是否在其 API 服务上获得了高额利润 —— 他们可能更关注收入增长以及通过分析接收到的所有 API 请求来收集更多数据。
Before moving on, I'd be remiss if I didn't mention that many people are speculating that DeepSeek is simply lying about the number of GPUs and GPU hours spent training these models because they actually possess far more H100s than they are supposed to have given the export restrictions on these cards, and they don't want to cause trouble for themselves or hurt their chances of acquiring more of these cards. While it's certainly possible, I think it's more likely that they are telling the truth, and that they have simply been able to achieve these incredible results by being extremely clever and creative in their approach to training and inference. They explain how they are doing things, and I suspect that it's only a matter of time before their results are widely replicated and confirmed by other researchers at various other labs.
在继续之前,如果我不提一下许多人猜测 DeepSeek 在关于训练这些模型所使用的 GPU 数量和 GPU 小时数上可能撒谎,那将是我的疏忽。因为实际上他们拥有的 H100 数量远超出口限制所允许的范围,他们不想给自己惹麻烦或影响未来获取更多此类显卡的机会。虽然这确实有可能,但我认为更可能是他们在说实话,他们只是通过极其聪明和创造性的训练和推理方法取得了这些令人难以置信的成果。他们解释了他们的做法,我怀疑不久之后,他们的成果就会被其他实验室的研究人员广泛复制和验证。
A Model That Can Really Think
一个真正能思考的模型
The newer R1 model and technical report might be even more mind-blowing, since they beat Anthropic to chain-of-thought reasoning and are now basically the only ones besides OpenAI who have made this technology work at scale. But note that the O1 preview model was only released by OpenAI in mid-September of 2024. That's only ~4 months ago! Something you absolutely must keep in mind is that, unlike OpenAI, which is incredibly secretive about how these models really work at a low level, and won't release the actual model weights to anyone besides partners like Microsoft and others who sign heavy-duty NDAs, these DeepSeek models are both completely open-source and permissively licensed. They have released extremely detailed technical reports explaining how they work, as well as code that anyone can look at and try to copy.
较新的 R1 模型和技术报告可能更加令人震撼,因为它们不仅超越了 Anthropic 在思维链技术上的成就,而且目前除了 OpenAI 之外,几乎是唯一一家成功将这项技术大规模应用的公司。但值得注意的是,OpenAI 的 O1 预览模型直到 2024 年 9 月中旬才发布,距今仅约 4 个月!必须牢记的是,与 OpenAI 截然不同,后者对这些模型在底层如何运作极其保密,除了微软等合作伙伴及签署严格保密协议者外,不会向任何人公开模型权重。而 DeepSeek 的这些模型不仅完全开源,还采用了宽松的许可协议。他们发布了极为详尽的技术报告,阐述了模型的工作原理,并公开了代码,任何人都可以查看并尝试复制。
With R1, DeepSeek essentially cracked one of the holy grails of AI: getting models to reason step-by-step without relying on massive supervised datasets. Their DeepSeek-R1-Zero experiment showed something remarkable: using pure reinforcement learning with carefully crafted reward functions, they managed to get models to develop sophisticated reasoning capabilities completely autonomously. This wasn't just about solving problems— the model organically learned to generate long chains of thought, self-verify its work, and allocate more computation time to harder problems.
通过 R1,深度搜索(DeepSeek)实质上破解了人工智能领域的一个圣杯:让模型能够不依赖大规模监督数据集,逐步进行推理。他们的 DeepSeek-R1-Zero 实验展示了令人瞩目的成果:通过精心设计的奖励函数,运用纯粹的强化学习,他们成功让模型完全自主地发展出复杂的推理能力。这不仅仅是解决问题 —— 模型有机地学会了生成长链思维,自我验证其工作,并为更困难的问题分配更多的计算时间。
The technical breakthrough here was their novel approach to reward modeling. Rather than using complex neural reward models that can lead to "reward hacking" (where the model finds bogus ways to boost its reward that don't actually lead to better real-world performance), they developed a clever rule-based system that combines accuracy rewards (verifying final answers) with format rewards (encouraging structured thinking). This simpler approach turned out to be more robust and scalable than the process-based reward models that others have tried.
这里的技术突破在于他们新颖的奖励建模方法。他们没有采用可能导致 “奖励黑客行为” 的复杂神经奖励模型(即模型找到虚假方式来增加奖励,而这些方式实际上并不会带来现实世界模型性能的提升),而是开发了一个巧妙的基于规则的系统,将准确性奖励(验证最终答案)与格式奖励(鼓励结构化思维)相结合。这种更简单的方法被证明比其他尝试过的基于过程的奖励模型更加稳健和可扩展。
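To give a flavor of what a rule-based reward can look like, here is a toy sketch in the spirit of what the R1 report describes. The tag names, weights, and naive string matching are my own stand-ins; a real implementation would use proper answer extraction and verification (a math checker, unit tests for code, and so on):

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy combination of a format reward and an accuracy reward
    (illustrative only; not DeepSeek's exact recipe)."""
    reward = 0.0
    # Format reward: did the model put its reasoning inside the expected tags?
    if re.search(r"<think>.+?</think>", response, re.DOTALL):
        reward += 0.5
    # Accuracy reward: does the extracted final answer match a verifiable target?
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 1.5
print(rule_based_reward("The answer is 4.", "4"))                            # 0.0
```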
What's particularly fascinating is that during training, they observed what they called an "aha moment," a phase where the model spontaneously learned to revise its thinking process mid-stream when encountering uncertainty. This emergent behavior wasn't explicitly programmed; it arose naturally from the interaction between the model and the reinforcement learning environment. The model would literally stop itself, flag potential issues in its reasoning, and restart with a different approach, all without being explicitly trained to do this.
特别引人入胜的是,在训练过程中,他们观察到了所谓的 “顿悟时刻”,即模型在遇到不确定性时,自发地学会在过程中修正其思维过程。这种涌现行为并非明确编程设定,而是自然产生于模型与强化学习环境的互动之中。模型会自行暂停,标记其推理中的潜在问题,并以不同的方法重新开始,这一切都无需经过专门训练。
The full R1 model built on these insights by introducing what they call "cold-start" data— a small set of high-quality examples— before applying their RL techniques. They also solved one of the major challenges in reasoning models: language consistency. Previous attempts at chain-of-thought reasoning often resulted in models mixing languages or producing incoherent outputs. DeepSeek solved this through a clever language consistency reward during RL training, trading off a small performance hit for much more readable and consistent outputs.
基于这些洞察,完整的 R1 模型通过引入他们所谓的 “冷启动” 数据 —— 一小部分高质量示例 —— 在应用强化学习技术之前进行了构建。他们还解决了推理模型中的一个主要挑战:语言一致性。以往在链式思维推理方面的尝试常常导致模型混合语言或产生不连贯的输出。DeepSeek 通过在强化学习训练期间巧妙地引入语言一致性奖励机制,以轻微的性能损失为代价,换来了更加可读和一致的输出结果。
The results are mind-boggling: on AIME 2024, one of the most challenging high school math competitions, R1 achieved 79.8% accuracy, matching OpenAI's O1 model. On MATH-500, it hit 97.3%, and it reached the 96.3rd percentile in Codeforces programming competitions. But perhaps most impressively, they managed to distill these capabilities down to much smaller models: their 14B parameter version outperforms many models several times its size, suggesting that reasoning ability isn't just about raw parameter count but about how you train the model to process information.
结果令人震惊:在 2024 年 AIME—— 一项极具挑战性的高中数学竞赛中,R1 模型达到了 79.8% 的准确率,与 OpenAI 的 O1 模型持平。在 MATH-500 测试中,其准确率高达 97.3%,并在 Codeforces 编程竞赛中达到了 96.3 百分位。但或许最令人印象深刻的是,他们成功将这些能力浓缩至更小的模型中:其 14B 参数版本的表现超越了许多规模是其数倍的模型,这表明推理能力不仅仅取决于原始参数数量,更在于如何训练模型处理信息。
The Fallout 余波
The recent scuttlebutt on Twitter and Blind (a corporate rumor website) is that these models caught Meta completely off guard and that they perform better than the new Llama4 models, which are still being trained. Apparently, the Llama project within Meta has attracted a lot of attention internally from high-ranking technical executives, and as a result they have something like 13 individuals working on the Llama stuff who each individually earn more per year in total compensation than the entire training cost of the DeepSeek-V3 model that outperforms it. How do you explain that to Zuck with a straight face? How does Zuck keep smiling while shoveling multiple billions of dollars to Nvidia to buy 100k H100s when a better model was trained using just 2k H100s for a bit over $5mm?
最近在 Twitter 和 Blind(一个企业八卦网站)上的传闻是,这些模型让 Meta 措手不及,它们的表现优于仍在训练中的新 Llama4 模型。显然,Meta 内部的 Llama 项目已引起高层技术主管的极大关注,因此他们大约有 13 个人在从事 Llama 相关工作,每个人的年总薪酬超过了性能更优的 DeepSeek-V3 模型的训练总成本。你如何面不改色地向扎克伯格解释这一点?当更好的模型仅用 2000 个 H100、花费略超 500 万美元就训练完成时,扎克伯格为何还能笑着向 Nvidia 投入数十亿美元购买 10 万个 H100?
But you better believe that Meta and every other big AI lab is taking these DeepSeek models apart, studying every word in those technical reports and every line of the open source code they released, trying desperately to integrate these same tricks and optimizations into their own training and inference pipelines. So what's the impact of all that? Well, naively it sort of seems like the aggregate demand for training and inference compute should be divided by some big number. Maybe not by 45, but maybe by 25 or even 30? Because whatever you thought you needed before these model releases, it's now a lot less.
但你最好相信,Meta 和其他所有大型 AI 实验室都在拆解这些 DeepSeek 模型,研究技术报告中的每一个字以及他们发布的开源代码的每一行,拼命尝试将这些相同的技巧和优化整合到他们自己的训练和推理管道中。那么,这一切的影响是什么呢?从表面上看,训练和推理计算的总需求似乎应该除以一个大数。也许不是 45,但可能是 25 甚至 30?因为在这些模型发布之前,你认为需要的计算量,现在大大减少了。
Now, an optimist might say "You are talking about a mere constant of proportionality, a single multiple. When you're dealing with an exponential growth curve, that stuff gets washed out so quickly that it doesn't end up mattering all that much." And there is some truth to that: if AI really is as transformational as I expect, if the real-world utility of this tech is measured in the trillions, if inference-time compute is the new scaling law of the land, if we are going to have armies of humanoid robots running around doing massive amounts of inference constantly, then maybe the growth curve is still so steep and extreme, and Nvidia has a big enough lead, that it will still work out.
现在,乐观主义者可能会说:“你所说的不过是一个比例常数,一个单一的倍数。当你处理指数增长曲线时,这些因素很快就会被冲刷掉,最终不会产生太大影响。” 这话有一定道理:如果人工智能真的如我所预期的那样具有变革性,如果这项技术的实际效用以万亿计,如果推理时计算成为新的扩展法则,如果我们即将拥有大批人形机器人四处奔走,不断进行大量推理,那么也许增长曲线仍然如此陡峭和极端,而英伟达拥有足够大的领先优势,最终仍能成功。
But Nvidia is pricing in a LOT of good news in the coming years for that valuation to make sense, and when you start layering all these things together into a total mosaic, it starts to make me at least feel extremely uneasy about spending ~20x the 2025 estimated sales for their shares. What happens if you even see a slight moderation in sales growth? What if it turns out to be 85% instead of over 100%? What if gross margins come in a bit from 75% to 70%— still ridiculously high for a semiconductor company?
但英伟达的估值已经预支了大量未来几年的利好消息,才能让这个估值显得合理。当你开始把所有这些东西叠加在一起,形成一个完整的图景时,至少让我对以 2025 年预估销售额的约 20 倍购买其股票感到极度不安。如果销售增长出现哪怕轻微的放缓,会发生什么?如果增长率是 85% 而不是超过 100% 呢?如果毛利率从 75% 略微下降到 70%—— 对于一家半导体公司来说仍然高得离谱 —— 又会怎样?
Wrapping it All Up
总结一切
At a high level, NVIDIA faces an unprecedented convergence of competitive threats that make its premium valuation increasingly difficult to justify at 20x forward sales and 75% gross margins. The company's supposed moats in hardware, software, and efficiency are all showing concerning cracks. The whole world— thousands of the smartest people on the planet, backed by untold billions of dollars of capital resources— are trying to assail them from every angle.
从宏观层面来看,NVIDIA 正面临前所未有的竞争威胁汇聚,这使得其以 20 倍前瞻销售额和 75% 毛利率支撑的高估值愈发难以自圆其说。该公司在硬件、软件和效率方面所谓的护城河均显现出令人担忧的裂痕。全球范围内 —— 成千上万地球上最聪明的人,背后是无数的资本资源 —— 正试图从各个角度对其发起冲击。
On the hardware front, innovative architectures from Cerebras and Groq demonstrate that NVIDIA's interconnect advantage— a cornerstone of its data center dominance— can be circumvented through radical redesigns. Cerebras' wafer-scale chips and Groq's deterministic compute approach deliver compelling performance without needing NVIDIA's complex interconnect solutions. More traditionally, every major NVIDIA customer (Google, Amazon, Microsoft, Meta, Apple) is developing custom silicon that could chip away at high-margin data center revenue. These aren't experimental projects anymore— Amazon alone is building out massive infrastructure with over 400,000 custom chips for Anthropic.
在硬件方面,Cerebras 和 Groq 的创新架构表明,NVIDIA 的互连优势 —— 其数据中心主导地位的基石 —— 可以通过彻底重新设计来规避。Cerebras 的晶圆级芯片和 Groq 的确定性计算方法提供了引人注目的性能,而无需依赖 NVIDIA 复杂的互连解决方案。更传统地,NVIDIA 的每个主要客户(谷歌、亚马逊、微软、Meta、苹果)都在开发定制芯片,这些芯片可能会侵蚀高利润的数据中心收入。这些不再是实验性项目 —— 仅亚马逊一家就正在为 Anthropic 构建庞大的基础设施,配备超过 40 万颗定制芯片。
The software moat appears equally vulnerable. New high-level frameworks like MLX, Triton, and JAX are abstracting away CUDA's importance, while efforts to improve AMD drivers could unlock much cheaper hardware alternatives. The trend toward higher-level abstractions mirrors how assembly language gave way to C/C++, suggesting CUDA's dominance may be more temporary than assumed. Most importantly, we're seeing the emergence of LLM-powered code translation that could automatically port CUDA code to run on any hardware target, potentially eliminating one of NVIDIA's strongest lock-in effects.
软件护城河似乎同样脆弱。MLX、Triton 和 JAX 等新型高级框架正在削弱 CUDA 的重要性,而改进 AMD 驱动程序的努力可能释放出更便宜的硬件替代方案。向更高层次抽象发展的趋势,类似于汇编语言让位于 C/C++,暗示 CUDA 的主导地位可能比假设的更为短暂。最重要的是,我们正见证 LLM 驱动的代码翻译技术的兴起,它能够自动将 CUDA 代码移植到任何硬件目标上运行,这可能会消除 NVIDIA 最强大的锁定效应之一。
Perhaps most devastating is DeepSeek's recent efficiency breakthrough, achieving comparable model performance at approximately 1/45th the compute cost. This suggests the entire industry has been massively over-provisioning compute resources. Combined with the emergence of more efficient inference architectures through chain-of-thought models, the aggregate demand for compute could be significantly lower than current projections assume. The economics here are compelling: when DeepSeek can match GPT-4 level performance while charging 95% less for API calls, it suggests either NVIDIA's customers are burning cash unnecessarily or margins must come down dramatically.
或许最具破坏性的是 DeepSeek 最近的效率突破,以大约 1/45 的计算成本实现了可比的模型性能。这表明整个行业一直在过度配置计算资源。结合通过思维链模型出现的更高效推理架构,对计算的总需求可能远低于当前预测的假设。这里的经济学令人信服:当 DeepSeek 能够匹配 GPT-4 级别的性能,同时 API 调用收费减少 95% 时,这表明要么 NVIDIA 的客户在不必要地烧钱,要么利润率必须大幅下降。
The fact that TSMC will manufacture competitive chips for any well-funded customer puts a natural ceiling on NVIDIA's architectural advantages. But more fundamentally, history shows that markets eventually find a way around artificial bottlenecks that generate super-normal profits. When layered together, these threats suggest NVIDIA faces a much rockier path to maintaining its current growth trajectory and margins than its valuation implies. With five distinct vectors of attack— architectural innovation, customer vertical integration, software abstraction, efficiency breakthroughs, and manufacturing democratization— the probability that at least one succeeds in meaningfully impacting NVIDIA's margins or growth rate seems high. At current valuations, the market isn't pricing in any of these risks.
台积电将为任何资金充足的客户制造具有竞争力的芯片,这一事实自然限制了英伟达的架构优势。但更根本的是,历史表明,市场最终会找到绕过人为瓶颈的方法,这些瓶颈产生了超常利润。综合来看,这些威胁表明,英伟达在维持当前增长轨迹和利润率方面面临的道路比其估值所暗示的要崎岖得多。有五种不同的攻击向量 —— 架构创新、客户垂直整合、软件抽象化、效率突破和制造民主化 —— 至少有一种能够显著影响英伟达的利润率或增长率的可能性似乎很高。在当前的估值下,市场并未对这些风险进行任何定价。
Original article: https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda