<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Http on 潘达窝</title><link>https://daidaij.github.io/tags/http/</link><description>Recent content in Http on 潘达窝</description><generator>Hugo -- gohugo.io</generator><language>zh-cn</language><copyright>pandazhangs</copyright><lastBuildDate>Wed, 15 Apr 2026 15:33:03 +0800</lastBuildDate><atom:link href="https://daidaij.github.io/tags/http/index.xml" rel="self" type="application/rss+xml"/><item><title>统计大模型流式响应的token usage</title><link>https://daidaij.github.io/p/sse-usage/</link><pubDate>Wed, 15 Apr 2026 15:33:03 +0800</pubDate><guid>https://daidaij.github.io/p/sse-usage/</guid><description>&lt;img src="https://picsum.photos/seed/fd9c5bb3/800/600" alt="Featured image of post 统计大模型流式响应的token usage" />&lt;h1 id="llm-流式响应的-usage-怎么统计">LLM 流式响应的 usage 怎么统计？
&lt;/h1>&lt;blockquote>
&lt;p>以Openai api 为例&lt;/p>
&lt;/blockquote>
&lt;p>现在大模型流式输出应用很广泛了，很多webui 上附加的打字机效果，视觉体验确实好。但是这种场景下咋统计大模型的token 用量&lt;/p>
&lt;p>这里简单分享一下个人的拙作。&lt;/p>
&lt;hr>
&lt;h2 id="sse-响应格式的特点">SSE 响应格式的特点
&lt;/h2>&lt;p>Server-Sent Events（SSE）是一种基于 HTTP 的轻量级单向实时通信协议。在 OpenAI API 中，流式响应以 SSE 格式传输，每个分块的结构如下：&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt">1
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">data: {&amp;#34;id&amp;#34;:&amp;#34;...&amp;#34;,&amp;#34;object&amp;#34;:&amp;#34;chat.completion.chunk&amp;#34;,&amp;#34;created&amp;#34;:...,&amp;#34;model&amp;#34;:&amp;#34;...&amp;#34;,&amp;#34;choices&amp;#34;:[{&amp;#34;index&amp;#34;:0,&amp;#34;delta&amp;#34;:{&amp;#34;content&amp;#34;:&amp;#34;某&amp;#34;},&amp;#34;finish_reason&amp;#34;:null}]}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>关键特点：&lt;/p>
&lt;ul>
&lt;li>每个分块是一行以 &lt;code>data:&lt;/code> 开头的文本&lt;/li>
&lt;li>最后一个分块固定为 &lt;code>data: [DONE]&lt;/code>，标志流结束&lt;/li>
&lt;li>分块之间可能穿插空行&lt;/li>
&lt;li>响应头中 &lt;code>content-type: text/event-stream&lt;/code>，&lt;code>transfer-encoding: chunked&lt;/code>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="usage-怎么获取">usage 怎么获取
&lt;/h2>&lt;p>大多数 LLM API（OpenAI、以及各种兼容 API）在开启流式响应时，&lt;strong>不会在第一个分块里就把 usage 返回&lt;/strong>。而是会在流快结束、发送 &lt;code>[DONE]&lt;/code> 之前，插一个特殊的分块。这个分块大概长这样：&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt">1
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="err">data:&lt;/span> &lt;span class="p">{&lt;/span>&lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="s2">&amp;#34;...&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nt">&amp;#34;object&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="s2">&amp;#34;chat.completion.chunk&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nt">&amp;#34;created&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nt">&amp;#34;model&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="s2">&amp;#34;...&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nt">&amp;#34;choices&amp;#34;&lt;/span>&lt;span class="p">:[{&lt;/span>&lt;span class="nt">&amp;#34;index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nt">&amp;#34;delta&amp;#34;&lt;/span>&lt;span class="p">:{},&lt;/span>&lt;span class="nt">&amp;#34;finish_reason&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="s2">&amp;#34;stop&amp;#34;&lt;/span>&lt;span class="p">}],&lt;/span>&lt;span class="nt">&amp;#34;usage&amp;#34;&lt;/span>&lt;span class="p">:{&lt;/span>&lt;span class="nt">&amp;#34;prompt_tokens&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nt">&amp;#34;completion_tokens&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">27&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nt">&amp;#34;total_tokens&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">38&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nt">&amp;#34;prompt_tokens_details&amp;#34;&lt;/span>&lt;span class="p">:{&lt;/span>&lt;span class="nt">&amp;#34;cached_tokens&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">}}}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>可以看到：&lt;/p>
&lt;ul>
&lt;li>&lt;code>delta&lt;/code> 是空的 &lt;code>{}&lt;/code>，不是 null&lt;/li>
&lt;li>&lt;code>finish_reason&lt;/code> 变成了 &lt;code>&amp;quot;stop&amp;quot;&lt;/code>&lt;/li>
&lt;li>整块的 &lt;code>usage&lt;/code> 信息就在这儿&lt;/li>
&lt;/ul>
&lt;p>这就是核心：&lt;strong>usage 信息不是元数据单独发过来的，而是塞在流结束前的上个分块里。&lt;/strong> 。&lt;/p>
&lt;hr>
&lt;h2 id="滚动更新缓存">滚动更新缓存
&lt;/h2>&lt;p>参考： &lt;a class="link" href="https://openai.apifox.cn/api-67883981" target="_blank" rel="noopener"
>openai chat completes api 文档&lt;/a>
既然 usage 必然出现在 &lt;code>[DONE]&lt;/code> 之前，只要在收到 &lt;code>[DONE]&lt;/code> 时，解析缓存的最后一个分块就行了。&lt;/p>
&lt;p>思路：&lt;/p>
&lt;ol>
&lt;li>收到正常分块 → 放进队列，队列满了就踢掉最老的&lt;/li>
&lt;li>收到 &lt;code>[DONE]&lt;/code> → 队列里的一定有 usage，直接取出来解析&lt;/li>
&lt;/ol>
&lt;p>用队列（maxsize=2）来实现，代码如下：&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;span class="lnt">17
&lt;/span>&lt;span class="lnt">18
&lt;/span>&lt;span class="lnt">19
&lt;/span>&lt;span class="lnt">20
&lt;/span>&lt;span class="lnt">21
&lt;/span>&lt;span class="lnt">22
&lt;/span>&lt;span class="lnt">23
&lt;/span>&lt;span class="lnt">24
&lt;/span>&lt;span class="lnt">25
&lt;/span>&lt;span class="lnt">26
&lt;/span>&lt;span class="lnt">27
&lt;/span>&lt;span class="lnt">28
&lt;/span>&lt;span class="lnt">29
&lt;/span>&lt;span class="lnt">30
&lt;/span>&lt;span class="lnt">31
&lt;/span>&lt;span class="lnt">32
&lt;/span>&lt;span class="lnt">33
&lt;/span>&lt;span class="lnt">34
&lt;/span>&lt;span class="lnt">35
&lt;/span>&lt;span class="lnt">36
&lt;/span>&lt;span class="lnt">37
&lt;/span>&lt;span class="lnt">38
&lt;/span>&lt;span class="lnt">39
&lt;/span>&lt;span class="lnt">40
&lt;/span>&lt;span class="lnt">41
&lt;/span>&lt;span class="lnt">42
&lt;/span>&lt;span class="lnt">43
&lt;/span>&lt;span class="lnt">44
&lt;/span>&lt;span class="lnt">45
&lt;/span>&lt;span class="lnt">46
&lt;/span>&lt;span class="lnt">47
&lt;/span>&lt;span class="lnt">48
&lt;/span>&lt;span class="lnt">49
&lt;/span>&lt;span class="lnt">50
&lt;/span>&lt;span class="lnt">51
&lt;/span>&lt;span class="lnt">52
&lt;/span>&lt;span class="lnt">53
&lt;/span>&lt;span class="lnt">54
&lt;/span>&lt;span class="lnt">55
&lt;/span>&lt;span class="lnt">56
&lt;/span>&lt;span class="lnt">57
&lt;/span>&lt;span class="lnt">58
&lt;/span>&lt;span class="lnt">59
&lt;/span>&lt;span class="lnt">60
&lt;/span>&lt;span class="lnt">61
&lt;/span>&lt;span class="lnt">62
&lt;/span>&lt;span class="lnt">63
&lt;/span>&lt;span class="lnt">64
&lt;/span>&lt;span class="lnt">65
&lt;/span>&lt;span class="lnt">66
&lt;/span>&lt;span class="lnt">67
&lt;/span>&lt;span class="lnt">68
&lt;/span>&lt;span class="lnt">69
&lt;/span>&lt;span class="lnt">70
&lt;/span>&lt;span class="lnt">71
&lt;/span>&lt;span class="lnt">72
&lt;/span>&lt;span class="lnt">73
&lt;/span>&lt;span class="lnt">74
&lt;/span>&lt;span class="lnt">75
&lt;/span>&lt;span class="lnt">76
&lt;/span>&lt;span class="lnt">77
&lt;/span>&lt;span class="lnt">78
&lt;/span>&lt;span class="lnt">79
&lt;/span>&lt;span class="lnt">80
&lt;/span>&lt;span class="lnt">81
&lt;/span>&lt;span class="lnt">82
&lt;/span>&lt;span class="lnt">83
&lt;/span>&lt;span class="lnt">84
&lt;/span>&lt;span class="lnt">85
&lt;/span>&lt;span class="lnt">86
&lt;/span>&lt;span class="lnt">87
&lt;/span>&lt;span class="lnt">88
&lt;/span>&lt;span class="lnt">89
&lt;/span>&lt;span class="lnt">90
&lt;/span>&lt;span class="lnt">91
&lt;/span>&lt;span class="lnt">92
&lt;/span>&lt;span class="lnt">93
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">httpx&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">json&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">queue&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">stream_chat_completion&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">base_url&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sk&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">messages&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> 流式请求 OpenAI /v1/chat/completions 接口，
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> 用大小为 2 的队列缓存 SSE 分块，解析 usage。
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Args:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> base_url: API 基础地址
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> sk: 模型密钥sk
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> model: 模型名称
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> messages: 对话消息列表
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">url&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">base_url&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rstrip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/v1/chat/completions&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">payload&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;model&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;messages&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">messages&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;stream&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="kc">True&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">headers&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;Content-Type&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;application/json&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;Authorization&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Bearer &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">sk&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;请求地址: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">url&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;请求体: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">json&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dumps&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">payload&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ensure_ascii&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">indent&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;-&amp;#34;&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">50&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;开始接收 SSE 流:&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;-&amp;#34;&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">50&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="n">httpx&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">stream&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;POST&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">url&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">json&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">payload&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">headers&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">headers&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">timeout&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">60.0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">response&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;状态码: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">response&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">status_code&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;响应头: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">dict&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">response&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">headers&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;-&amp;#34;&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">50&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># 关键：用队列只保留最近 2 个分块&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">records&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">queue&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Queue&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">queue&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Queue&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">maxsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">line&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">response&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">iter_lines&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">line&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">line&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">startswith&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;data:&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">line&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">:]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">strip&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s2">&amp;#34;[DONE]&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;得到 [SSE] 流结束信号: [DONE]&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">try&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">item&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">records&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">parsed&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">json&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">loads&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">item&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">usage&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">parsed&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;usage&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">{})&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">usage&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">output_token&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">usage&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;completion_tokens&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">input_token&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">usage&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;prompt_tokens&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cached_token&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">usage&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cached_tokens&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">prompt_tokens_details&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">usage&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;prompt_tokens_details&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">{}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cached_tokens&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="n">cached_token&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="n">prompt_tokens_details&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cached_token&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">prompt_tokens_details&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;[解析后] output_token=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">output_token&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">, &amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;input_token=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">input_token&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">, cached_token=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">cached_token&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">while&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="n">records&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">empty&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">s&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">records&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_nowait&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">s&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">except&lt;/span> &lt;span class="n">json&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">JSONDecodeError&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;[解析失败] 无法解析 JSON: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># 队列满了就丢弃旧数据，只保留最新的&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="n">records&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">empty&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">records&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">block&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">records&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">put&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;[其他] &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">line&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="vm">__name__&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s2">&amp;#34;__main__&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">BASE_URL&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;https://api.example.com&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">MODEL&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;gpt-3.5-turbo&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">MESSAGES&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[{&lt;/span>&lt;span class="s2">&amp;#34;role&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;user&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;content&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;用一句话介绍 Python 3.11&amp;#34;&lt;/span>&lt;span class="p">}]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">SK&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;your-api-key&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">stream_chat_completion&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">BASE_URL&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">SK&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">MODEL&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">MESSAGES&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>跑一下，大概是这种输出：&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">请求地址: https://api.example.com/v1/chat/completions
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">请求体: {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;#34;model&amp;#34;: &amp;#34;gpt-3.5-turbo&amp;#34;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;#34;messages&amp;#34;: [{&amp;#34;role&amp;#34;: &amp;#34;user&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;用一句话介绍Python 3.11&amp;#34;}],
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;#34;stream&amp;#34;: true
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">开始接收 SSE 流:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">状态码: 200
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">响应头: {&amp;#39;server&amp;#39;: &amp;#39;openresty&amp;#39;, &amp;#39;content-type&amp;#39;: &amp;#39;text/event-stream&amp;#39;, &amp;#39;transfer-encoding&amp;#39;: &amp;#39;chunked&amp;#39;}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">得到 [SSE] 流结束信号: [DONE]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[解析后] output_token=27, input_token=11, cached_token=0
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>这样就拿到usage和主要的输入/缓存/输出token 量了，其实还可以顺带统计下api 请求次数，现在很多coding plan 是基于api 调用次数计量的 。&lt;/p>
&lt;hr>
&lt;h2 id="其实一个变量就够了">其实一个变量就够了
&lt;/h2>&lt;p>回过头来看，队列大小设成 2 有点多余。&lt;/p>
&lt;p>usage 分块是 &lt;code>[DONE]&lt;/code> 之前&lt;strong>唯一&lt;/strong>一个 &lt;code>delta&lt;/code> 为空的块，而且它和 &lt;code>[DONE]&lt;/code> 之间不会再有其他有效数据了。所以我们只需要记住最近收到的那个 &lt;code>data:&lt;/code> 行就行，不需要队列。&lt;/p>
&lt;p>单变量版：&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">last_chunk&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">line&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">response&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">iter_lines&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">line&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="n">line&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">startswith&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;data:&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">line&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">:]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">strip&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s2">&amp;#34;[DONE]&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">last_chunk&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">usage&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">json&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">loads&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">last_chunk&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;usage&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">{})&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;input=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">usage&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;prompt_tokens&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">, output=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">usage&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;completion_tokens&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">last_chunk&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">data&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>逻辑是一样的，只是把&amp;quot;最多两个&amp;quot;换成了&amp;quot;就一个&amp;quot;——反正够用。空间复杂度直接从 O(2) 变成 O(1)，虽然实际差别微乎其微，但写出来更干净。&lt;/p>
&lt;hr>
&lt;h2 id="适用场景">适用场景
&lt;/h2>&lt;p>想了一下，大概是这么几类场景：&lt;/p>
&lt;p>&lt;strong>API 代理或网关&lt;/strong>。转发流的时候想顺便记个 token 用量，没必要把整个响应缓存下来再处理，拿到 usage 直接记了就行。&lt;/p>
&lt;p>&lt;strong>轻量级 SDK&lt;/strong>。不想维护一个大的缓冲区，特别是并发量大的时候，内存省一点是一点。&lt;/p>
&lt;p>&lt;strong>成本统计/监控&lt;/strong>。实时统计每个请求的消耗，而不是事后去日志里翻。&lt;/p>
&lt;p>&lt;strong>调试工具&lt;/strong>。快速跑一个请求，看看某个模型实际消耗了多少 token，做个对比什么的。&lt;/p>
&lt;p>基本上，只要你需要在流式响应的同时拿到 usage，而又不打算存完整的流，这个思路都能用。&lt;/p>
&lt;hr>
&lt;p>其实核心就一句话：&lt;strong>usage 信息是在 &lt;code>[DONE]&lt;/code> 的上个分块里一起过来的&lt;/strong>。理解这个要点，用一个变量（或两个元素的队列）就能拿到，完全不需要缓存整个流。当然实际开发中怎么用，看你心情。追求简洁就一个变量，想留点冗余空间方便以后加日志就队列，两种写法都行，理解原理最重要。&lt;/p></description></item></channel></rss>