A real RAG pipeline rarely runs a single model. The embedder is chosen for cost and speed; the generator is chosen for answer quality. With token-hub, both sit behind one OpenAI-compatible endpoint and one API key, so mixing models is a configuration change rather than a migration.
The shape
```
docs ──► chunker ──► embed (DeepSeek) ──► vector store (Chroma)
                                                    │
user query ──► embed (DeepSeek) ──► retrieve ◄──────┘
                                        │
                                        │ top-k chunks
                                        ▼
                        generate (Claude 3.5 Sonnet) ──► answer
```
Embeddings are a bulk, high-volume job — cheap is what you want. DeepSeek embeddings run at about $0.04 per 1M tokens, versus $0.13 for OpenAI text-embedding-3-small. Generation is where quality pays back; Claude 3.5 Sonnet is our default for answer synthesis.
Full example, in Python
This uses LangChain’s OpenAI-compatible wrappers pointed at the token-hub base URL. No provider-specific SDK is imported.
```python
import os

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

os.environ["OPENAI_API_KEY"] = "sk-th_..."  # your token-hub key
os.environ["OPENAI_BASE_URL"] = "https://api.sandboxclaw.com/v1"

# 1. Embedder — cheap model for bulk work
embedder = OpenAIEmbeddings(model="deepseek-embedding")

# 2. Ingest once
raw_docs = [
    Document(page_content=open(p).read(), metadata={"source": p})
    for p in ["docs/quickstart.md", "docs/pricing.md", "docs/faq.md"]
]
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
chunks = splitter.split_documents(raw_docs)
store = Chroma.from_documents(chunks, embedder, persist_directory="./chroma-db")

# 3. Generator — quality model for answers
generator = ChatOpenAI(model="claude-3-5-sonnet-20241022", temperature=0.2)

# 4. Query loop
def answer(question: str, k: int = 4) -> str:
    hits = store.similarity_search(question, k=k)
    context = "\n\n---\n\n".join(h.page_content for h in hits)
    messages = [
        {"role": "system", "content":
            "You are a docs assistant. Answer using ONLY the provided context. "
            "If the context does not contain the answer, say you don't know."},
        {"role": "user", "content":
            f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    resp = generator.invoke(messages)
    return resp.content

print(answer("How do I rotate an API key?"))
```
Why mix models here
Two reasons.
Cost. An ingest run over 100k chunks of 500 tokens each is 50M tokens. At $0.04/1M that is $2. At $0.13/1M it is $6.50. Not huge absolute numbers, but the ratio persists across every re-ingest. Over a year that is the difference between “we’ll re-index weekly” and “we’ll batch it monthly.”
Quality. Generation is where you cannot skimp. Claude 3.5 Sonnet handles multi-chunk synthesis and citation behavior better than cheaper models in our evals — roughly 92% vs 84% accuracy on a 3-hop Q&A set. Those 8 points show up as fewer support tickets when users ask follow-up questions.
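To put numbers on the cost point, here is the same arithmetic as a short script. The prices are the per-1M-token figures quoted above, used purely for illustration:

```python
# Back-of-the-envelope ingest cost, using the per-1M-token prices quoted in this post.
CHUNKS = 100_000
TOKENS_PER_CHUNK = 500
PRICE_PER_MTOK = {"deepseek-embedding": 0.04, "text-embedding-3-small": 0.13}

total_tokens = CHUNKS * TOKENS_PER_CHUNK  # 50M tokens per full re-index
for model, price in PRICE_PER_MTOK.items():
    per_run = total_tokens / 1_000_000 * price
    print(f"{model}: ${per_run:.2f} per re-index, ${per_run * 52:.2f} for a year of weekly re-indexing")
```

Weekly re-indexing works out to roughly $104 a year on the cheap embedder versus $338 on the pricier one, which is why the ratio matters more than the single-run totals.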
Gotchas we hit
- Embedding dimension mismatch. If you switch embedders mid-stream, your vector store rejects inserts because the new vectors have a different dimension. Always re-embed from scratch when changing models.
- Streaming the generator. The generation call can return an iterator of delta chunks instead of a single message (`stream=True` on a raw OpenAI-style request, or the `.stream()` method on the LangChain wrapper) — useful when you want typewriter-style output for the answer, but not needed for ingest; see the sketch after this list.
- Token budgets. A top-k of 8 with 800-token chunks plus a 200-token question is ~6600 input tokens per query. At Claude Sonnet’s $3/1M input that is $0.02 per query — fine for low volume, expensive at 100 QPS. Switch to Haiku for the generation step if QPS climbs.
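A minimal sketch of the streaming variant, reusing the `store` and `generator` objects from the full example above. With the LangChain wrapper the delta iterator comes from the model’s `.stream()` method, and each chunk exposes a `content` fragment you can print as it arrives:

```python
def answer_streaming(question: str, k: int = 4) -> str:
    # Retrieval is identical to answer(); only the final generation call changes.
    hits = store.similarity_search(question, k=k)
    context = "\n\n---\n\n".join(h.page_content for h in hits)
    messages = [
        {"role": "system", "content":
            "You are a docs assistant. Answer using ONLY the provided context. "
            "If the context does not contain the answer, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    parts = []
    for chunk in generator.stream(messages):      # yields delta chunks as they arrive
        print(chunk.content, end="", flush=True)  # typewriter-style output
        parts.append(chunk.content)
    print()
    return "".join(parts)
```

The Haiku swap from the token-budget bullet is the same kind of change: a different model string when constructing `ChatOpenAI`, nothing else in the pipeline moves.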
Where to take it next
- Swap Chroma for Qdrant or pgvector if you need production scale.
- Add a re-ranker pass — call a cheap model like `gpt-4o-mini` to pick top-3 from top-10 before sending to the generator (sketched after this list).
- Wire usage tracking by reading the `usage` block from the generator’s response and logging per-user cost.
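Here is a sketch of the re-ranker pass, continuing from the objects in the full example. The `gpt-4o-mini` id assumes your gateway exposes the model under that name, and the JSON-list prompt is our own convention rather than a library feature:

```python
import json

def retrieve_reranked(question: str, n_retrieve: int = 10, n_keep: int = 3):
    """Retrieve wide, then let a cheap model pick the few chunks worth sending to the generator."""
    hits = store.similarity_search(question, k=n_retrieve)
    numbered = "\n\n".join(f"[{i}] {h.page_content[:500]}" for i, h in enumerate(hits))
    reranker = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed model id on the gateway
    resp = reranker.invoke([
        {"role": "system", "content":
            f"You rank document chunks. Reply with a JSON list of the indices of the {n_keep} "
            "chunks most relevant to the question, best first. Reply with the list only."},
        {"role": "user", "content": f"Question: {question}\n\nChunks:\n{numbered}"},
    ])
    try:
        keep = json.loads(resp.content)
    except json.JSONDecodeError:
        keep = None
    if not isinstance(keep, list):
        keep = list(range(n_keep))  # fall back to vector-store order if the model misbehaves
    return [hits[i] for i in keep[:n_keep] if isinstance(i, int) and 0 <= i < len(hits)]
```

For the usage-tracking item, recent langchain-core versions surface token counts on the response as `resp.usage_metadata`; the provider’s raw `usage` block also shows up in `resp.response_metadata`, and either is enough to log per-user cost.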
The point of the gateway here is that none of these changes touch your SDK or your key management — just edit model names.