RAG pipeline with model mixing

Build a retrieval-augmented generation pipeline using token-hub. Embed documents with one model, retrieve from your vector store, and generate answers with Claude — all through one API key.

A real RAG pipeline rarely runs a single model. The embedder is chosen for cost and speed; the generator is chosen for answer quality. With token-hub, both sit behind one OpenAI-compatible endpoint and one API key, so mixing models is a configuration change rather than a migration.
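The same endpoint also works without LangChain. A minimal sketch with the plain openai SDK, reusing the base URL, key placeholder, and model IDs from the full example below:

from openai import OpenAI

# One client, one key; the model name decides which provider the gateway routes to.
client = OpenAI(api_key="sk-th_...", base_url="https://api.sandboxclaw.com/v1")

emb = client.embeddings.create(model="deepseek-embedding", input="hello world")
chat = client.chat.completions.create(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(chat.choices[0].message.content)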

The shape

docs ─► chunker ─► embed (DeepSeek) ─► vector store (Chroma)

user query ─► embed (DeepSeek) ─► retrieve top-k chunks
                                          │
                                          ▼
                      generate (Claude 3.5 Sonnet) ─► answer

Embeddings are a bulk, high-volume job — cheap is what you want. DeepSeek embeddings run at about $0.04 per 1M tokens, versus $0.13 for OpenAI text-embedding-3-small. Generation is where quality pays back; Claude 3.5 Sonnet is our default for answer synthesis.

Full example, in Python

This uses LangChain’s OpenAI-compatible wrappers pointed at the token-hub base URL. No provider-specific SDK is imported.

import os
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

os.environ["OPENAI_API_KEY"]  = "sk-th_..."          # your token-hub key
os.environ["OPENAI_BASE_URL"] = "https://api.sandboxclaw.com/v1"

# 1. Embedder — cheap model for bulk work
embedder = OpenAIEmbeddings(model="deepseek-embedding")

# 2. Ingest once
raw_docs = [Document(page_content=open(p).read(), metadata={"source": p})
            for p in ["docs/quickstart.md", "docs/pricing.md", "docs/faq.md"]]

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
chunks   = splitter.split_documents(raw_docs)

store = Chroma.from_documents(chunks, embedder, persist_directory="./chroma-db")

# 3. Generator — quality model for answers
generator = ChatOpenAI(model="claude-3-5-sonnet-20241022", temperature=0.2)

# 4. Query loop
def answer(question: str, k: int = 4) -> str:
    hits = store.similarity_search(question, k=k)
    context = "\n\n---\n\n".join(h.page_content for h in hits)

    messages = [
        {"role": "system", "content":
            "You are a docs assistant. Answer using ONLY the provided context. "
            "If the context does not contain the answer, say you don't know."},
        {"role": "user", "content":
            f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    resp = generator.invoke(messages)
    return resp.content

print(answer("How do I rotate an API key?"))
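On later runs you can skip step 2 and open the persisted index directly; a small sketch, assuming the same ./chroma-db directory and the embedder defined above:

# Reopen the persisted Chroma index without re-embedding the corpus.
store = Chroma(persist_directory="./chroma-db", embedding_function=embedder)
print(answer("How do I rotate an API key?"))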

Why mix models here

Two reasons.

Cost. An ingest run over 100k chunks of 500 tokens each is 50M tokens. At $0.04/1M that is $2. At $0.13/1M it is $6.50. Not huge absolute numbers, but the ratio persists across every re-ingest. Over a year that is the difference between “we’ll re-index weekly” and “we’ll batch it monthly.”
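The same arithmetic as a quick sanity check, using the per-million-token prices quoted above:

chunks, tokens_per_chunk = 100_000, 500
total_tokens = chunks * tokens_per_chunk            # 50M tokens per ingest run
for name, usd_per_million in [("deepseek-embedding", 0.04), ("text-embedding-3-small", 0.13)]:
    print(f"{name}: ${total_tokens / 1e6 * usd_per_million:.2f} per run")
# deepseek-embedding: $2.00 per run
# text-embedding-3-small: $6.50 per run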

Quality. Generation is where you cannot skimp. Claude 3.5 Sonnet handles multi-chunk synthesis and citation behavior better than cheaper models in our evals — roughly 92% vs 84% accuracy on a 3-hop Q&A set. Those 8 points show up as fewer support tickets when users ask follow-up questions.

Gotchas we hit

Where to take it next

The point of the gateway here is that none of these changes touch your SDK or your key management — just edit model names.
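For example, swapping the generator is a one-line edit to the ChatOpenAI call; the alternative model ID below is a placeholder for whatever your token-hub deployment exposes:

# Same SDK, same key, same base URL; only the model string changes.
generator = ChatOpenAI(model="claude-3-5-sonnet-20241022", temperature=0.2)
# generator = ChatOpenAI(model="another-model-id", temperature=0.2)  # placeholder ID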