A real RAG pipeline rarely runs a single model. The embedder is chosen for cost and speed; the generator is chosen for answer quality. With token-hub, both sit behind one OpenAI-compatible endpoint and one API key, so mixing models is a configuration change rather than a migration.
The shape
```
docs ──► chunker ──► embed (DeepSeek) ──► vector store (Chroma)
                                                    │
user query ──► embed (DeepSeek) ──► retrieve ◄──────┘
                                        │
                                        │ top-k chunks
                                        ▼
                        generate (Claude 3.5 Sonnet) ──► answer
```
Embeddings are a bulk, high-volume job — cheap is what you want. DeepSeek embeddings run at about $0.04 per 1M tokens, versus $0.13 for OpenAI text-embedding-3-small. Generation is where quality pays back; Claude 3.5 Sonnet is our default for answer synthesis.
Full example, in Python
This uses LangChain’s OpenAI-compatible wrappers pointed at the token-hub base URL. No provider-specific SDK is imported.
```python
import os

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

os.environ["OPENAI_API_KEY"] = "sk-th_..."  # your token-hub key
os.environ["OPENAI_BASE_URL"] = "https://api.sandboxclaw.com/v1"

# 1. Embedder — cheap model for bulk work
embedder = OpenAIEmbeddings(model="deepseek-embedding")

# 2. Ingest once
raw_docs = [
    Document(page_content=open(p).read(), metadata={"source": p})
    for p in ["docs/quickstart.md", "docs/pricing.md", "docs/faq.md"]
]
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
chunks = splitter.split_documents(raw_docs)
store = Chroma.from_documents(chunks, embedder, persist_directory="./chroma-db")

# 3. Generator — quality model for answers
generator = ChatOpenAI(model="claude-3-5-sonnet-20241022", temperature=0.2)

# 4. Query loop
def answer(question: str, k: int = 4) -> str:
    hits = store.similarity_search(question, k=k)
    context = "\n\n---\n\n".join(h.page_content for h in hits)
    messages = [
        {"role": "system", "content":
            "You are a docs assistant. Answer using ONLY the provided context. "
            "If the context does not contain the answer, say you don't know."},
        {"role": "user", "content":
            f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    resp = generator.invoke(messages)
    return resp.content

print(answer("How do I rotate an API key?"))
```
Why mix models here
Two reasons.
Cost. An ingest run over 100k chunks of 500 tokens each is 50M tokens. At $0.04/1M that is $2. At $0.13/1M it is $6.50. Not huge absolute numbers, but the ratio persists across every re-ingest. Over a year that is the difference between “we’ll re-index weekly” and “we’ll batch it monthly.”
Quality. Generation is where you cannot skimp. Claude 3.5 Sonnet handles multi-chunk synthesis and citation behavior better than cheaper models in our evals — roughly 92% vs 84% accuracy on a 3-hop Q&A set. Those 8 points show up as fewer support tickets when users ask follow-up questions.
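To put numbers on the cost point, here is the same arithmetic as a short script. The prices are the per-1M-token figures quoted above, used purely for illustration:

```python
# Back-of-the-envelope ingest cost, using the per-1M-token prices quoted in this post.
CHUNKS = 100_000
TOKENS_PER_CHUNK = 500
PRICE_PER_MTOK = {"deepseek-embedding": 0.04, "text-embedding-3-small": 0.13}

total_tokens = CHUNKS * TOKENS_PER_CHUNK  # 50M tokens per full re-index
for model, price in PRICE_PER_MTOK.items():
    per_run = total_tokens / 1_000_000 * price
    print(f"{model}: ${per_run:.2f} per re-index, ${per_run * 52:.2f} for a year of weekly re-indexing")
```

Weekly re-indexing works out to roughly $104 a year on the cheap embedder versus $338 on the pricier one, which is why the ratio matters more than the single-run totals.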
Gotchas we hit
- Embedding dimension mismatch. If you switch embedders mid-stream, your vector store rejects inserts because the new vectors have a different dimension. Always re-embed from scratch when changing models.
- Streaming the generator. The generation call can return an iterator of delta chunks instead of a single message (`stream=True` on a raw OpenAI-style request, or the `.stream()` method on the LangChain wrapper) — useful when you want typewriter-style output for the answer, but not needed for ingest; see the sketch after this list.
- Token budgets. A top-k of 8 with 800-token chunks plus a 200-token question is ~6600 input tokens per query. At Claude Sonnet’s $3/1M input that is $0.02 per query — fine for low volume, expensive at 100 QPS. Switch to Haiku for the generation step if QPS climbs.
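A minimal sketch of the streaming variant, reusing the `store` and `generator` objects from the full example above. With the LangChain wrapper the delta iterator comes from the model’s `.stream()` method, and each chunk exposes a `content` fragment you can print as it arrives:

```python
def answer_streaming(question: str, k: int = 4) -> str:
    # Retrieval is identical to answer(); only the final generation call changes.
    hits = store.similarity_search(question, k=k)
    context = "\n\n---\n\n".join(h.page_content for h in hits)
    messages = [
        {"role": "system", "content":
            "You are a docs assistant. Answer using ONLY the provided context. "
            "If the context does not contain the answer, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    parts = []
    for chunk in generator.stream(messages):      # yields delta chunks as they arrive
        print(chunk.content, end="", flush=True)  # typewriter-style output
        parts.append(chunk.content)
    print()
    return "".join(parts)
```

The Haiku swap from the token-budget bullet is the same kind of change: a different model string when constructing `ChatOpenAI`, nothing else in the pipeline moves.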
Where to take it next
- Swap Chroma for Qdrant or pgvector if you need production scale.
- Add a re-ranker pass — call a cheap model like `gpt-4o-mini` to pick top-3 from top-10 before sending to the generator (sketched after this list).
- Wire usage tracking by reading the `usage` block from the generator’s response and logging per-user cost.
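Here is a sketch of the re-ranker pass, continuing from the objects in the full example. The `gpt-4o-mini` id assumes your gateway exposes the model under that name, and the JSON-list prompt is our own convention rather than a library feature:

```python
import json

def retrieve_reranked(question: str, n_retrieve: int = 10, n_keep: int = 3):
    """Retrieve wide, then let a cheap model pick the few chunks worth sending to the generator."""
    hits = store.similarity_search(question, k=n_retrieve)
    numbered = "\n\n".join(f"[{i}] {h.page_content[:500]}" for i, h in enumerate(hits))
    reranker = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed model id on the gateway
    resp = reranker.invoke([
        {"role": "system", "content":
            f"You rank document chunks. Reply with a JSON list of the indices of the {n_keep} "
            "chunks most relevant to the question, best first. Reply with the list only."},
        {"role": "user", "content": f"Question: {question}\n\nChunks:\n{numbered}"},
    ])
    try:
        keep = json.loads(resp.content)
    except json.JSONDecodeError:
        keep = None
    if not isinstance(keep, list):
        keep = list(range(n_keep))  # fall back to vector-store order if the model misbehaves
    return [hits[i] for i in keep[:n_keep] if isinstance(i, int) and 0 <= i < len(hits)]
```

For the usage-tracking item, recent langchain-core versions surface token counts on the response as `resp.usage_metadata`; the provider’s raw `usage` block also shows up in `resp.response_metadata`, and either is enough to log per-user cost.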
The point of the gateway here is that none of these changes touch your SDK or your key management — just edit model names.