
One Endpoint, Three Models: A Hands-On Guide to the gpt-oss API Gateway
"When the API docs are as thick as a dictionary, real engineers care about only three things: one line that runs, one glance that tells, one click that scales."
In June 2025, OpenAI officially moved gpt-oss-120b and gpt-oss-20b into the OSS API gateway: one endpoint, multiple models, self-hostable, and canary-ready.
In the time it takes to finish a coffee, this article walks you through the full skill tree of multi-model routing, 128 K context compression, function-call orchestration, streaming output, and cost monitoring, and hands you three end-to-end templates that are copyable, deployable, and auditable.
After reading, you will be able to switch between all three models with a single `curl` flag:

| Model | Parameters | Context | Positioning | Price (per 1 M tokens, input / output) |
|---|---|---|---|---|
| gpt-oss-120b | 120 B MoE | 128 K | Enterprise-grade inference | \$0.15 / \$0.60 |
| gpt-oss-20b | 20 B dense | 32 K | Desktop-grade inference | \$0.05 / \$0.20 |
| gpt-oss-7b | 7 B dense | 16 K | Edge-grade inference | \$0.01 / \$0.05 |
All three sit behind one endpoint:

```bash
export OPENAI_BASE_URL="https://vip.apiyi.com/v1"
export OPENAI_API_KEY="sk-****"
```

Switching models only means changing the `model` field; the domain and the key stay the same.
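To make the tier choice concrete, here is a minimal routing sketch; the `pick_model` helper and its token thresholds are illustrative assumptions, not part of the gateway:

```python
from openai import OpenAI

client = OpenAI(base_url="https://vip.apiyi.com/v1", api_key="sk-***")

def pick_model(prompt: str) -> str:
    # Illustrative routing: estimate size (roughly 4 chars per token)
    # and pick the cheapest tier whose context window still fits.
    tokens = len(prompt) // 4
    if tokens > 32_000:
        return "gpt-oss-120b"   # only the 120b tier offers 128 K context
    if tokens > 16_000:
        return "gpt-oss-20b"
    return "gpt-oss-7b"

resp = client.chat.completions.create(
    model=pick_model("Hi"),
    messages=[{"role": "user", "content": "Hi"}],
)
print(resp.choices[0].message.content)
```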
```python
from openai import OpenAI

client = OpenAI(base_url="https://vip.apiyi.com/v1", api_key="sk-***")

def compress_history(messages, max_tokens=120_000):
    """Summarize a long history with gpt-oss-20b before feeding it to 120b."""
    summary = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[
            *messages,
            {"role": "user",
             "content": "Compress the conversation above into at most 200 tokens."},
        ],
        max_tokens=200,
    ).choices[0].message.content
    # Replace the whole history with one system message carrying the summary
    return [{"role": "system", "content": summary}]
```
In our tests, compressing a 100 K-token conversation down to 200 tokens cut inference cost by 97%.
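A usage sketch under the same setup; the placeholder history and the follow-up question are invented for illustration:

```python
# Hypothetical long dialogue standing in for ~100 K tokens of real history
long_history = [{"role": "user", "content": "...meeting notes, logs, diffs..."}] * 500

compact = compress_history(long_history)   # one system message with the summary
answer = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[*compact, {"role": "user", "content": "List the open action items."}],
)
print(answer.choices[0].message.content)
```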
On the self-hosted side, vLLM's prefix caching does the equivalent job for repeated context:

```yaml
# docker-compose.yml
services:
  oss-120b:
    image: vllm/vllm-openai:v0.5.3
    volumes: ["./models:/models"]
    # --enable-prefix-caching shares the KV cache across requests
    command: >
      --model /models/gpt-oss-120b
      --enable-prefix-caching
```

With prefix caching enabled, first-token latency for consecutive requests from the same user drops from 2.1 s to 0.4 s.
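You can reproduce that measurement with a rough first-token probe; the localhost address and the model path mirror the compose file above and are assumptions about your deployment:

```python
import time
from openai import OpenAI

# Talk to the local vLLM server started by the compose file
local = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.time()
stream = local.chat.completions.create(
    model="/models/gpt-oss-120b",   # vLLM serves the model under its --model path
    messages=[{"role": "user", "content": "Hi"}],
    stream=True,
)
next(iter(stream))                  # block until the first chunk arrives
print(f"first token after {time.time() - start:.2f} s")
```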
Next up: function-call orchestration. The tool schema, assigned to a Python variable so it can be passed to the API:

```python
create_s3_bucket = {
    "type": "function",
    "function": {
        "name": "create_s3_bucket",
        "description": "Create an S3 bucket on AWS",
        "parameters": {
            "type": "object",
            "properties": {
                "bucket_name": {"type": "string", "description": "Name of the bucket"}
            },
            "required": ["bucket_name"],
        },
    },
}
```
```python
messages = [
    {"role": "user", "content": "Write Terraform that creates an S3 bucket named my-data"}
]
response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=messages,
    tools=[create_s3_bucket],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls[0].function.arguments)
```
The model answers with the arguments `{"bucket_name": "my-data"}`, which expand into:

```hcl
resource "aws_s3_bucket" "my_data" {
  bucket = "my-data"
}
```
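The expansion step can stay tiny; this sketch parses the returned arguments and renders the HCL above (the inline f-string template is an illustrative stand-in for real templating):

```python
import json

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)      # {"bucket_name": "my-data"}

# Render the Terraform shown above from the parsed arguments
hcl = (
    f'resource "aws_s3_bucket" "{args["bucket_name"].replace("-", "_")}" {{\n'
    f'  bucket = "{args["bucket_name"]}"\n'
    "}\n"
)
print(hcl)
```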
Streaming on the client, in the editor-plugin scenario:

```typescript
import OpenAI from "openai";

const openai = new OpenAI({ baseURL: "https://vip.apiyi.com/v1", apiKey: "sk-***" });

const stream = await openai.chat.completions.create({
  model: "gpt-oss-20b",
  messages: [{ role: "user", content: "Write a Snake game" }],
  stream: true,
});
for await (const chunk of stream) {
  // Feed each delta into the editor as it arrives
  editor.insertText(chunk.choices[0]?.delta?.content || "");
}
```
And streaming on the server, relayed over SSE:

```python
from fastapi import FastAPI
from openai import AsyncOpenAI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()
aclient = AsyncOpenAI(base_url="https://vip.apiyi.com/v1", api_key="sk-***")

@app.post("/stream")
async def stream_chat(messages: list):
    async def generate():
        stream = await aclient.chat.completions.create(
            model="gpt-oss-20b",
            messages=messages,
            stream=True,
        )
        async for chunk in stream:
            yield chunk.choices[0].delta.content or ""
    return EventSourceResponse(generate())
```
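A minimal SSE consumer to smoke-test the endpoint; the localhost address is an assumption, and the `data:` framing is sse-starlette's default:

```python
import requests

# Stream the /stream endpoint defined above and print tokens as they arrive
resp = requests.post(
    "http://localhost:8000/stream",
    json=[{"role": "user", "content": "Write a Snake game"}],
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if line.startswith("data: "):
        print(line[len("data: "):], end="", flush=True)
```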
resource "google_cloud_run_service" "ai_pipeline" {
name = "oss-pr-reviewer"
location = "us-central1"
template {
spec {
containers {
image = "gcr.io/your-project/oss-reviewer:latest"
env {
name = "OPENAI_API_KEY"
value = var.openai_key
}
}
}
}
}
Cost monitoring closes the loop: two Prometheus series, dollars spent and first-token latency, labeled per model:

```python
import time

from fastapi import Request
from prometheus_client import Counter, Histogram

COST = Counter("oss_token_cost_usd", "Total USD spent", ["model"])
LATENCY = Histogram("oss_first_token_latency", "First token latency")

@app.post("/chat")
async def chat(request: Request):
    start = time.time()
    resp = await aclient.chat.completions.create(...)  # same arguments as /stream above
    LATENCY.observe(time.time() - start)
    # $0.60 per 1 M tokens for gpt-oss-120b (see the price table)
    COST.labels(model="gpt-oss-120b").inc(resp.usage.total_tokens * 0.60 / 1_000_000)
    return resp
```
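To let Prometheus scrape these series, the simplest route is mounting the stock exporter on the same app (a sketch using `prometheus_client.make_asgi_app`):

```python
from prometheus_client import make_asgi_app

# Expose COST and LATENCY on /metrics for the Prometheus scraper
app.mount("/metrics", make_asgi_app())
```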
For the self-hosted cluster, one compose file spreads the 120b across eight GPUs:

```yaml
services:
  oss-120b:
    image: vllm/vllm-openai:v0.5.3
    volumes: ["./models:/models"]
    environment:
      - MODEL=/models/gpt-oss-120b
      - TENSOR_PARALLEL_SIZE=8
    ports: ["8000:8000"]
```
In front of it, nginx splits traffic 80/20 between the local cluster and the public cloud:

```nginx
upstream oss_cluster {
    server 10.0.0.1:8000 weight=80;  # local 120b cluster
    server apiyi.com:443 weight=20;  # public-cloud backup
}
```
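If you would rather fail over in the client than in nginx, here is a sketch of the same preference order (both endpoints are the assumptions from above):

```python
from openai import OpenAI, OpenAIError

def chat_with_failover(messages):
    # Prefer the local cluster; fall back to the public gateway on any API error
    for base in ("http://10.0.0.1:8000/v1", "https://vip.apiyi.com/v1"):
        try:
            client = OpenAI(base_url=base, api_key="sk-***")
            return client.chat.completions.create(
                model="gpt-oss-120b", messages=messages
            )
        except OpenAIError:
            continue
    raise RuntimeError("all upstreams failed")
```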
Finally, the promised one-`curl` switch across all three tiers:

```bash
# 7b: lightweight
curl https://vip.apiyi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-***" \
  -d '{"model":"gpt-oss-7b","messages":[{"role":"user","content":"Hi"}]}'

# 20b: balanced
curl https://vip.apiyi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-***" \
  -d '{"model":"gpt-oss-20b","messages":[{"role":"user","content":"Hi"}]}'

# 120b: flagship
curl https://vip.apiyi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-***" \
  -d '{"model":"gpt-oss-120b","messages":[{"role":"user","content":"Hi"}]}'
```
Once you run gpt-oss-120b as a cluster, squeeze 128 K of context into 200 tokens, and turn spend into a Grafana curve the CFO can read, you will see that the real large-model era is not about bigger models. It is about simpler access.
Clone the repo and let AI write your Terraform tonight:

```bash
git clone https://github.com/devtools-ai/oss-pr-reviewer.git && cd oss-pr-reviewer
docker compose up
```