Classify 10k support tickets cheaply

Classification is the highest-volume, lowest-risk LLM workload most teams run. Ten thousand tickets, one label each, done overnight. This is exactly where you should not be paying GPT-4o prices — DeepSeek V3 or Qwen 2.5 72B clear the accuracy bar at a fraction of the cost.

Through token-hub the only thing that changes between models is the model string. Here is a production-ready pattern.

The target

Input: 10,000 support tickets, ~300 tokens each. Output: one label per ticket from {billing, bug, how-to, feature-request, other}. Concurrency: 40 in-flight requests. Budget: under $2 total.

At DeepSeek V3 prices that’s ~$0.0001 per ticket, so 10k tickets cost about $1. At Qwen 2.5 72B prices (also $0.55/$1.65 per 1M) the math is similar.

The script

import asyncio
import csv
import json
import os
from dataclasses import dataclass
from typing import Iterable

import aiohttp

TOKENHUB_URL = "https://api.sandboxclaw.com/v1/chat/completions"
API_KEY      = os.environ["TOKENHUB_KEY"]   # sk-th_...
MODEL        = "deepseek-chat"
CONCURRENCY  = 40
MAX_RETRIES  = 3

SYSTEM_PROMPT = (
    "You are a support ticket classifier. Read the ticket and respond with "
    "ONE of these labels, lowercase, no other text: "
    "billing, bug, how-to, feature-request, other."
)

@dataclass
class Ticket:
    id: str
    body: str

@dataclass
class Result:
    id: str
    label: str
    tokens_in: int
    tokens_out: int

async def classify_one(
    session: aiohttp.ClientSession,
    sem: asyncio.Semaphore,
    ticket: Ticket,
) -> Result:
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": ticket.body},
        ],
        "max_tokens": 10,
        "temperature": 0,
    }
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

    async with sem:
        for attempt in range(MAX_RETRIES):
            try:
                async with session.post(TOKENHUB_URL, json=payload, headers=headers, timeout=30) as r:
                    if r.status == 429:
                        wait = int(r.headers.get("Retry-After", "1"))
                        await asyncio.sleep(wait + attempt * 0.5)
                        continue
                    if r.status >= 500:
                        await asyncio.sleep(2 ** attempt)
                        continue
                    r.raise_for_status()
                    data = await r.json()
                    return Result(
                        id        = ticket.id,
                        label     = data["choices"][0]["message"]["content"].strip().lower(),
                        tokens_in = data["usage"]["prompt_tokens"],
                        tokens_out= data["usage"]["completion_tokens"],
                    )
            except (aiohttp.ClientError, asyncio.TimeoutError):
                if attempt == MAX_RETRIES - 1:
                    raise
                await asyncio.sleep(2 ** attempt)

        return Result(id=ticket.id, label="other", tokens_in=0, tokens_out=0)

async def run(tickets: Iterable[Ticket]) -> list[Result]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [classify_one(session, sem, t) for t in tickets]
        return await asyncio.gather(*tasks)

def load_tickets(path: str) -> list[Ticket]:
    with open(path, newline="", encoding="utf-8") as f:
        return [Ticket(id=row["id"], body=row["body"]) for row in csv.DictReader(f)]

def main():
    tickets = load_tickets("tickets.csv")
    results = asyncio.run(run(tickets))

    total_in  = sum(r.tokens_in  for r in results)
    total_out = sum(r.tokens_out for r in results)
    cost_usd  = total_in * 0.27 / 1_000_000 + total_out * 1.10 / 1_000_000

    with open("labels.jsonl", "w") as f:
        for r in results:
            f.write(json.dumps({"id": r.id, "label": r.label}) + "\n")

    print(f"Classified {len(results)} tickets")
    print(f"Tokens: {total_in:,} in / {total_out:,} out")
    print(f"Cost : ${cost_usd:.4f}")

if __name__ == "__main__":
    main()

Run:

TOKENHUB_KEY=sk-th_... python classify.py

Typical output for 10k tickets:

Classified 10000 tickets
Tokens: 3,120,000 in / 180,000 out
Cost : $1.0406

Wall time on our test box: ~6 minutes at concurrency 40.

Why this pattern works

Concurrency via semaphore. We let the event loop schedule 40 concurrent HTTP calls. aiohttp pipelines them over a handful of connections. You don’t need a thread pool or a job queue for 10k requests.

Retries at the right places. 429 waits the Retry-After seconds. 5xx backs off exponentially. Everything else surfaces.

temperature: 0. Classification is deterministic; no creative output. Setting temperature to 0 makes the label stable across retries, which matters for idempotency if you re-run a partial batch.

max_tokens: 10. The label is one word. Capping output at 10 tokens prevents the model from monologuing and caps worst-case cost.

Switching models mid-pipeline

If your eval shows Qwen outperforms DeepSeek on Chinese tickets, route by language:

def pick_model(ticket: Ticket) -> str:
    if contains_chinese(ticket.body):
        return "qwen2.5-72b-instruct"
    return "deepseek-chat"

# ... then pass MODEL = pick_model(ticket) per call.

No SDK swap, no separate auth flow, no second invoice. That is the gateway doing its job.

Gotchas

Respect rate limits. Default is 600 RPM per account. At concurrency 40 with 400ms per call you are at ~6000 RPM — you’ll hit the 429 path fast. Keep concurrency around 10 by default, or ask support to raise your limit.
Validate labels. Models occasionally return something off-list (e.g., “billing issue” instead of “billing”). Normalize and default to “other” if no exact match.
Checkpoint. For very large batches, write results to disk as they arrive, so a restart does not re-bill completed rows.