
Classify 10k support tickets cheaply

Run large-volume classification jobs through token-hub using DeepSeek or Qwen — cheap, fast, and easy to parallelize with asyncio + aiohttp.

Classification is the highest-volume, lowest-risk LLM workload most teams run. Ten thousand tickets, one label each, done overnight. This is exactly where you should not be paying GPT-4o prices — DeepSeek V3 or Qwen 2.5 72B clear the accuracy bar at a fraction of the cost.

Through token-hub the only thing that changes between models is the model string. Here is a production-ready pattern.

The target

Input: 10,000 support tickets, ~300 tokens each. Output: one label per ticket from {billing, bug, how-to, feature-request, other}. Concurrency: 40 in-flight requests. Budget: under $2 total.

At DeepSeek V3 prices ($0.27 in / $1.10 out per 1M tokens, the rates the script below uses) that's ~$0.0001 per ticket, so 10k tickets cost about $1. At Qwen 2.5 72B prices ($0.55/$1.65 per 1M) the math is similar.
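
A quick back-of-envelope check in Python (the per-ticket token counts are rough assumptions; they match the sample run further down):

tickets = 10_000
tok_in  = 312   # ~300 tokens of ticket body plus system-prompt overhead
tok_out = 18    # the label, plus the occasional stray token
cost = tickets * (tok_in * 0.27 + tok_out * 1.10) / 1_000_000
print(f"${cost:.4f}")   # $1.0404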

The script

import asyncio
import csv
import json
import os
from dataclasses import dataclass
from typing import Iterable

import aiohttp

TOKENHUB_URL = "https://api.sandboxclaw.com/v1/chat/completions"
API_KEY      = os.environ["TOKENHUB_KEY"]   # sk-th_...
MODEL        = "deepseek-chat"
CONCURRENCY  = 40
MAX_RETRIES  = 3

SYSTEM_PROMPT = (
    "You are a support ticket classifier. Read the ticket and respond with "
    "ONE of these labels, lowercase, no other text: "
    "billing, bug, how-to, feature-request, other."
)

@dataclass
class Ticket:
    id: str
    body: str

@dataclass
class Result:
    id: str
    label: str
    tokens_in: int
    tokens_out: int

async def classify_one(
    session: aiohttp.ClientSession,
    sem: asyncio.Semaphore,
    ticket: Ticket,
) -> Result:
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": ticket.body},
        ],
        "max_tokens": 10,
        "temperature": 0,
    }
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

    async with sem:
        for attempt in range(MAX_RETRIES):
            try:
                # 30s cap per request; ClientTimeout is aiohttp's documented timeout type
                async with session.post(TOKENHUB_URL, json=payload, headers=headers,
                                        timeout=aiohttp.ClientTimeout(total=30)) as r:
                    if r.status == 429:
                        # rate limited: honor Retry-After, plus a small per-attempt penalty
                        wait = int(r.headers.get("Retry-After", "1"))
                        await asyncio.sleep(wait + attempt * 0.5)
                        continue
                    if r.status >= 500:
                        # transient server error: back off exponentially
                        await asyncio.sleep(2 ** attempt)
                        continue
                    r.raise_for_status()
                    data = await r.json()
                    return Result(
                        id        = ticket.id,
                        label     = data["choices"][0]["message"]["content"].strip().lower(),
                        tokens_in = data["usage"]["prompt_tokens"],
                        tokens_out= data["usage"]["completion_tokens"],
                    )
            except (aiohttp.ClientError, asyncio.TimeoutError):
                if attempt == MAX_RETRIES - 1:
                    raise
                await asyncio.sleep(2 ** attempt)

        # retries exhausted on repeated 429/5xx: fall back to the catch-all label
        return Result(id=ticket.id, label="other", tokens_in=0, tokens_out=0)

async def run(tickets: Iterable[Ticket]) -> list[Result]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [classify_one(session, sem, t) for t in tickets]
        return await asyncio.gather(*tasks)

def load_tickets(path: str) -> list[Ticket]:
    with open(path, newline="", encoding="utf-8") as f:
        return [Ticket(id=row["id"], body=row["body"]) for row in csv.DictReader(f)]

def main():
    tickets = load_tickets("tickets.csv")
    results = asyncio.run(run(tickets))

    total_in  = sum(r.tokens_in  for r in results)
    total_out = sum(r.tokens_out for r in results)
    # DeepSeek V3 pricing: $0.27 per 1M input tokens, $1.10 per 1M output tokens
    cost_usd  = total_in * 0.27 / 1_000_000 + total_out * 1.10 / 1_000_000

    with open("labels.jsonl", "w") as f:
        for r in results:
            f.write(json.dumps({"id": r.id, "label": r.label}) + "\n")

    print(f"Classified {len(results)} tickets")
    print(f"Tokens: {total_in:,} in / {total_out:,} out")
    print(f"Cost : ${cost_usd:.4f}")

if __name__ == "__main__":
    main()
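
load_tickets expects a tickets.csv with id and body columns; two illustrative rows:

id,body
T-1001,"I was charged twice for my March invoice."
T-1002,"The export button does nothing in Firefox."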

Run:

TOKENHUB_KEY=sk-th_... python classify.py

Typical output for 10k tickets:

Classified 10000 tickets
Tokens: 3,120,000 in / 180,000 out
Cost : $1.0404

Wall time on our test box: ~6 minutes at concurrency 40 (10,000 requests at 40 in flight is roughly 250 waves, so a bit over a second per call).

Why this pattern works

Concurrency via semaphore. We let the event loop schedule 40 concurrent HTTP calls; aiohttp reuses a pool of keep-alive connections under the hood. You don't need a thread pool or a job queue for 10k requests.
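
If you want the connection pool to match the semaphore (optional: aiohttp's default TCPConnector already allows up to 100 concurrent connections), size it explicitly:

connector = aiohttp.TCPConnector(limit=CONCURRENCY)
async with aiohttp.ClientSession(connector=connector) as session:
    ...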

Retries at the right places. A 429 waits out the Retry-After header (plus a small per-attempt penalty). A 5xx backs off exponentially. Everything else surfaces as an exception.
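
If several jobs share one API key, consider adding jitter so retries don't synchronize. A sketch of a full-jitter variant of the sleep calls above (the backoff helper is hypothetical, not part of the script):

import asyncio
import random

async def backoff(attempt: int) -> None:
    # full-jitter: sleep a random duration up to the exponential cap
    await asyncio.sleep(random.uniform(0, 2 ** attempt))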

temperature: 0. Classification is deterministic; no creative output. Setting temperature to 0 makes the label stable across retries, which matters for idempotency if you re-run a partial batch.

max_tokens: 10. The label is one word. Capping output at 10 tokens prevents the model from monologuing and caps worst-case cost.
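
One guard the script skips: nothing verifies the model actually returned a valid label. A cheap normalizer you could apply where the response is parsed (LABELS mirrors the prompt; anything unexpected becomes "other"):

LABELS = {"billing", "bug", "how-to", "feature-request", "other"}

def normalize(raw: str) -> str:
    # trim whitespace and stray quotes or punctuation, then validate
    label = raw.strip().strip("\"'.`").lower()
    return label if label in LABELS else "other"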

Switching models mid-pipeline

If your eval shows Qwen outperforms DeepSeek on Chinese tickets, route by language:

import re
_CJK = re.compile(r"[\u4e00-\u9fff]")   # naive check: any CJK ideograph counts as Chinese

def contains_chinese(text: str) -> bool:
    return bool(_CJK.search(text))

def pick_model(ticket: Ticket) -> str:
    if contains_chinese(ticket.body):
        return "qwen2.5-72b-instruct"
    return "deepseek-chat"

# ... then set payload["model"] = pick_model(ticket) in classify_one.

No SDK swap, no separate auth flow, no second invoice. That is the gateway doing its job.

Gotchas