RAG answer synthesis through token-hub

A retrieval pipeline has two separate jobs: retrieve relevant context, then ask a model to synthesize an answer. token-hub is useful for the generation step because your application can call the same OpenAI-compatible chat shape while the model channels evolve behind the gateway.

This scenario keeps embeddings and vector storage in your own stack and calls moonshot-v1-8k, the route that is currently public and smoke-tested, for answer synthesis.

Shape

documents -> chunker -> your embedding model -> vector store
                                                   |
user query -> your embedding model -> retrieve top-k
                                                   |
                                                   v
                                      token-hub /v1/chat/completions
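
The "retrieve top-k" step in the diagram can be sketched in plain Python as a cosine-similarity ranking. This is only an illustration: the names `cosine` and `top_k` and the in-memory list of (chunk, vector) pairs are assumptions standing in for your embedding model and vector store, which would normally do this work.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to query_vec."""
    # Hypothetical in-memory index; a real vector store performs this ranking for you.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The returned list is already in ranked order, so it can be passed directly as the `chunks` argument of the generator call.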

Minimal generator call

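import os

from openai import OpenAI

# The gateway speaks the OpenAI-compatible API, so the stock client works
# once base_url points at it.
client = OpenAI(
    base_url="https://llm.sandboxclaw.com/v1",
    api_key=os.environ["TOKENHUB_KEY"],
)


def answer(question: str, chunks: list[str]) -> str:
    """Synthesize an answer to `question` from the retrieved `chunks`."""
    # Cap the context at the five highest-ranked chunks.
    context = "\n\n".join(chunks[:5])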
    prompt = f"""Use only the context below to answer the question.

Context:
{context}

Question:
{question}
"""

    resp = client.chat.completions.create(
        model="moonshot-v1-8k",
        messages=[
            {"role": "system", "content": "Answer concisely and cite the provided context."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=600,
        temperature=0.2,
    )
    return resp.choices[0].message.content or ""
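
Because the system message asks the model to cite the provided context, it can help to give each chunk a label the model can refer to. The helper below is a hypothetical addition: the name `format_context` and the `[n]` label format are assumptions, not part of token-hub. Its output could stand in for the plain `"\n\n".join(...)` inside `answer`.

```python
def format_context(chunks: list[str], limit: int = 5) -> str:
    """Number the first `limit` chunks so answers can cite them as [1], [2], ..."""
    labeled = [
        f"[{i}] {chunk.strip()}"
        for i, chunk in enumerate(chunks[:limit], start=1)
    ]
    return "\n\n".join(labeled)
```

With labels in place, the system message could ask for citations such as "[2]", which are easy to map back to document IDs afterwards.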

LangChain adapter

If your stack already uses LangChain’s OpenAI-compatible chat wrapper, point it at token-hub:

import os

from langchain_openai import ChatOpenAI

generator = ChatOpenAI(
    model="moonshot-v1-8k",
    base_url="https://llm.sandboxclaw.com/v1",
    api_key=os.environ["TOKENHUB_KEY"],
    temperature=0.2,
)

Guardrails

Keep the retrieved context, together with the instructions and the max_tokens reserved for the answer, within the selected model’s context window.
Log the model ID, request ID, token usage, and top-k document IDs for debugging.
Run your own eval set against the candidate model before switching model channels.
When additional models are enabled for your account, treat the model string as configuration rather than hardcoding it throughout the app.
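
The guardrails above can be sketched as a few small helpers. Everything named here is an assumption for illustration: the `TOKENHUB_MODEL` environment variable, the four-characters-per-token estimate (a crude heuristic for English text, not the model's real tokenizer), the 6000-token budget, and the helper names. The log helper reads only `id`, `model`, and `usage`, which are standard fields on an OpenAI-style chat completion; if the gateway exposes its own request ID, log that as well.

```python
import os

# Assumption: the model string lives in a hypothetical TOKENHUB_MODEL variable,
# so switching channels is a config change rather than a code change.
MODEL = os.environ.get("TOKENHUB_MODEL", "moonshot-v1-8k")


def estimate_tokens(text: str) -> int:
    """Very rough estimate (~4 characters per token); not the real tokenizer."""
    return len(text) // 4 + 1


def fit_chunks(chunks: list[str], budget: int = 6000) -> list[str]:
    """Keep chunks in ranked order until the estimated token budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


def log_record(resp, doc_ids: list[str]) -> dict:
    """Collect debugging fields from an OpenAI-style chat completion response."""
    usage = getattr(resp, "usage", None)
    return {
        "model": getattr(resp, "model", None),
        "response_id": getattr(resp, "id", None),
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
        "doc_ids": doc_ids,
    }
```

A call site might pass `fit_chunks(chunks)` to the generator, use `MODEL` as the `model` argument, and write `log_record(resp, doc_ids)` to your structured logs. The budget should leave headroom for the instructions and the `max_tokens` reserved for the answer.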