Classification is the highest-volume, lowest-risk LLM workload most teams run. Ten thousand tickets, one label each, done overnight. This is exactly where you should not be paying GPT-4o prices — DeepSeek V3 or Qwen 2.5 72B clear the accuracy bar at a fraction of the cost.
Through token-hub, the only thing that changes between models is the model string. Here is a production-ready pattern.
The target
Input: 10,000 support tickets, ~300 tokens each.
Output: one label per ticket from {billing, bug, how-to, feature-request, other}.
Concurrency: 40 in-flight requests.
Budget: under $2 total.
At DeepSeek V3 prices ($0.27 in / $1.10 out per 1M tokens, the rates the cost calculation in the script bills at) that’s ~$0.0001 per ticket, so 10k tickets cost about $1. At Qwen 2.5 72B prices ($0.55/$1.65 per 1M) the math is similar.
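To make the arithmetic concrete, here is the back-of-envelope check. The per-ticket token counts are illustrative assumptions (ticket body plus system prompt in, one short label out):
# Back-of-envelope cost at DeepSeek V3 rates: $0.27 in / $1.10 out per 1M tokens.
# 312 tokens in (body + system prompt) and 5 tokens out are assumed averages.
tickets = 10_000
cost = (
    tickets * 312 * 0.27 / 1_000_000   # input:  ~$0.84
    + tickets * 5 * 1.10 / 1_000_000   # output: ~$0.06
)
print(f"${cost:.2f}")  # ~= $0.90, comfortably under the $2 budget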
The script
import asyncio
import csv
import json
import os
from dataclasses import dataclass
from typing import Iterable
import aiohttp
TOKENHUB_URL = "https://api.sandboxclaw.com/v1/chat/completions"
API_KEY = os.environ["TOKENHUB_KEY"] # sk-th_...
MODEL = "deepseek-chat"
CONCURRENCY = 40
MAX_RETRIES = 3
SYSTEM_PROMPT = (
"You are a support ticket classifier. Read the ticket and respond with "
"ONE of these labels, lowercase, no other text: "
"billing, bug, how-to, feature-request, other."
)
@dataclass
class Ticket:
id: str
body: str
@dataclass
class Result:
id: str
label: str
tokens_in: int
tokens_out: int
async def classify_one(
session: aiohttp.ClientSession,
sem: asyncio.Semaphore,
ticket: Ticket,
) -> Result:
payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": ticket.body},
],
"max_tokens": 10,
"temperature": 0,
}
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
async with sem:
for attempt in range(MAX_RETRIES):
try:
                async with session.post(TOKENHUB_URL, json=payload, headers=headers, timeout=aiohttp.ClientTimeout(total=30)) as r:
if r.status == 429:
wait = int(r.headers.get("Retry-After", "1"))
await asyncio.sleep(wait + attempt * 0.5)
continue
if r.status >= 500:
await asyncio.sleep(2 ** attempt)
continue
r.raise_for_status()
data = await r.json()
                    return Result(
                        id=ticket.id,
                        label=data["choices"][0]["message"]["content"].strip().lower(),
                        tokens_in=data["usage"]["prompt_tokens"],
                        tokens_out=data["usage"]["completion_tokens"],
                    )
except (aiohttp.ClientError, asyncio.TimeoutError):
if attempt == MAX_RETRIES - 1:
raise
await asyncio.sleep(2 ** attempt)
    # All retries exhausted without a clean response: fall back to "other"
    # with zero token counts so the row is still present in the output.
    return Result(id=ticket.id, label="other", tokens_in=0, tokens_out=0)
async def run(tickets: Iterable[Ticket]) -> list[Result]:
sem = asyncio.Semaphore(CONCURRENCY)
async with aiohttp.ClientSession() as session:
tasks = [classify_one(session, sem, t) for t in tickets]
return await asyncio.gather(*tasks)
def load_tickets(path: str) -> list[Ticket]:
with open(path, newline="", encoding="utf-8") as f:
return [Ticket(id=row["id"], body=row["body"]) for row in csv.DictReader(f)]
def main():
tickets = load_tickets("tickets.csv")
results = asyncio.run(run(tickets))
total_in = sum(r.tokens_in for r in results)
total_out = sum(r.tokens_out for r in results)
    # DeepSeek V3 list prices: $0.27 per 1M input tokens, $1.10 per 1M output.
    # If you route some tickets to another model, bill those at its rates instead.
    cost_usd = total_in * 0.27 / 1_000_000 + total_out * 1.10 / 1_000_000
with open("labels.jsonl", "w") as f:
for r in results:
f.write(json.dumps({"id": r.id, "label": r.label}) + "\n")
print(f"Classified {len(results)} tickets")
print(f"Tokens: {total_in:,} in / {total_out:,} out")
print(f"Cost : ${cost_usd:.4f}")
if __name__ == "__main__":
main()
Run:
TOKENHUB_KEY=sk-th_... python classify.py
Typical output for 10k tickets:
Classified 10000 tickets
Tokens: 3,120,000 in / 50,000 out
Cost : $0.8974
Wall time on our test box: ~6 minutes at concurrency 40.
Why this pattern works
Concurrency via semaphore. We let the event loop schedule 40 concurrent HTTP calls; aiohttp serves them over a pool of reused keep-alive connections (its default connector allows up to 100). You don’t need a thread pool or a job queue for 10k requests.
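If you would rather make the connection pool explicit than rely on the default, size the connector to match the semaphore. A minimal variant of run above:
async def run(tickets: Iterable[Ticket]) -> list[Result]:
    sem = asyncio.Semaphore(CONCURRENCY)
    # Cap the pool at the semaphore size so no idle sockets are held open.
    connector = aiohttp.TCPConnector(limit=CONCURRENCY)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [classify_one(session, sem, t) for t in tickets]
        return await asyncio.gather(*tasks)
Functionally identical at this scale; it just makes the relationship between in-flight requests and open connections explicit.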
Retries at the right places. 429 waits the Retry-After seconds. 5xx backs off exponentially. Everything else surfaces.
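One refinement worth considering, not in the script above: with 40 workers retrying in lockstep, plain 2 ** attempt sleeps can resubmit a burst all at once. A full-jitter sketch:
import asyncio
import random

async def backoff(attempt: int) -> None:
    # Full jitter: sleep a random duration up to the exponential cap, so
    # concurrent workers spread their retries instead of re-bursting together.
    await asyncio.sleep(random.uniform(0, 2 ** attempt))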
temperature: 0. Classification wants the single most likely label, not creative output. Setting temperature to 0 keeps the label as stable across retries as the backend allows, which matters for idempotency if you re-run a partial batch.
max_tokens: 10. The label is one word. Capping output at 10 tokens prevents the model from monologuing and caps worst-case cost.
Switching models mid-pipeline
If your eval shows Qwen outperforms DeepSeek on Chinese tickets, route by language:
def contains_chinese(text: str) -> bool:
    # Rough heuristic: any CJK Unified Ideograph in the body.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

def pick_model(ticket: Ticket) -> str:
    if contains_chinese(ticket.body):
        return "qwen2.5-72b-instruct"
    return "deepseek-chat"

# ... then build each payload with "model": pick_model(ticket) instead of MODEL.
No SDK swap, no separate auth flow, no second invoice. That is the gateway doing its job.
Gotchas
- Respect rate limits. The default is 600 RPM per account. At concurrency 40 with ~400 ms per call you are attempting ~6,000 RPM, so you will hit the 429 path fast. Keep concurrency around 10 by default, or ask support to raise your limit.
- Validate labels. Models occasionally return something off-list (e.g., “billing issue” instead of “billing”). Normalize and default to “other” if there is no exact match; see the first snippet after this list.
- Checkpoint. For very large batches, write results to disk as they arrive so a restart does not re-bill completed rows; the second sketch below shows one way.
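A minimal normalizer, assuming exact-match-or-other is the policy you want:
VALID_LABELS = {"billing", "bug", "how-to", "feature-request", "other"}

def normalize(raw: str) -> str:
    # Trim whitespace and trailing punctuation; anything off-list becomes "other".
    label = raw.strip().lower().rstrip(".")
    return label if label in VALID_LABELS else "other"
Call it on the content string in classify_one instead of the bare .strip().lower().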
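And a checkpointing sketch, reusing classify_one and the imports from the script. It appends each result as it completes and skips ids already on disk, so a restart only pays for unfinished rows:
async def run_checkpointed(tickets: list[Ticket], path: str = "labels.jsonl") -> None:
    # Collect ids that are already labeled on disk.
    done: set[str] = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            done = {json.loads(line)["id"] for line in f}
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        pending = [classify_one(session, sem, t) for t in tickets if t.id not in done]
        with open(path, "a", encoding="utf-8") as out:
            for coro in asyncio.as_completed(pending):
                r = await coro
                out.write(json.dumps({"id": r.id, "label": r.label}) + "\n")
                out.flush()  # survive a crash between results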