RAG answer synthesis through token-hub

Use token-hub as the generation step in a retrieval pipeline while keeping embeddings and vector storage in your own stack.

A retrieval pipeline has two separate jobs: retrieve relevant context, then ask a model to synthesize an answer. token-hub fits the generation step because your application keeps sending the same OpenAI-compatible chat completions request while the model channels change behind the gateway.

This scenario keeps embeddings and vector storage in your own stack and sends answer synthesis to the moonshot-v1-8k route, which is currently public and smoke-tested.

Shape

documents -> chunker -> your embedding model -> vector store
                                                   |
user query -> your embedding model -> retrieve top-k
                                                   |
                                                   v
                                      token-hub /v1/chat/completions
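
The retrieval half of the diagram stays in your own stack, so its details are up to you. As a rough sketch of the "embed, then retrieve top-k" step, the following uses a hypothetical `toy_embed` function (plain bag-of-words counts standing in for your real embedding model) and an in-memory list standing in for the vector store. None of these names come from token-hub.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for your embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def retrieve_top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by similarity to the query and keep the best k."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]
```

In a real pipeline the embedding call and the similarity search would be handled by your embedding model and vector store; only the resulting list of chunk strings is passed on to the generation step.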

Minimal generator call

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://llm.sandboxclaw.com/v1",
    api_key=os.environ["TOKENHUB_KEY"],
)


def answer(question: str, chunks: list[str]) -> str:
    # Keep at most the first five retrieved chunks, separated by blank lines.
    context = "\n\n".join(chunks[:5])
    prompt = f"""Use only the context below to answer the question.

Context:
{context}

Question:
{question}
"""

    resp = client.chat.completions.create(
        model="moonshot-v1-8k",
        messages=[
            {"role": "system", "content": "Answer concisely and cite the provided context."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=600,
        temperature=0.2,
    )
    return resp.choices[0].message.content or ""
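
The system message asks the model to cite the provided context, which is easier when each chunk carries a label. One possible way to build the context string is sketched below. `build_context`, its parameters, and the 12000-character default are illustrative assumptions, not part of token-hub; the character budget is only a crude guard against overfilling the prompt and should be tuned to the model's actual context window.

```python
def build_context(chunks: list[str], max_chunks: int = 5, max_chars: int = 12000) -> str:
    """Label chunks as [1], [2], ... and stop before the character budget is exceeded.

    max_chars is an arbitrary placeholder, not a documented limit.
    """
    parts: list[str] = []
    used = 0
    for i, chunk in enumerate(chunks[:max_chunks], start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # drop this chunk and all lower-ranked ones
        parts.append(entry)
        used += len(entry) + 2  # +2 for the blank-line separator
    return "\n\n".join(parts)
```

Swapping this in for the plain `"\n\n".join(chunks[:5])` lets the model refer to sources as `[1]`, `[2]`, and so on, which your application can map back to the retrieved documents.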

LangChain adapter

If your stack already uses LangChain’s OpenAI-compatible chat wrapper, point it at token-hub:

import os

from langchain_openai import ChatOpenAI

generator = ChatOpenAI(
    model="moonshot-v1-8k",
    base_url="https://llm.sandboxclaw.com/v1",
    api_key=os.environ["TOKENHUB_KEY"],
    temperature=0.2,
)

Guardrails