A retrieval pipeline has two separate jobs: retrieving relevant context, then asking a model to synthesize an answer. TokenHub fits the generation step because your application keeps calling the same OpenAI-compatible chat completions endpoint while the model channels change behind the gateway.
This scenario keeps embeddings and vector storage in your own stack and calls moonshot-v1-8k, the route that is currently public and smoke-tested, for answer synthesis.
Shape
documents -> chunker -> your embedding model -> vector store
                                                  |
user query -> your embedding model -> retrieve top-k
                                             |
                                             v
                              token-hub /v1/chat/completions
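The retrieval half of the diagram lives in your own stack, so its details will vary. As a rough illustration only, the sketch below ranks stored chunks by cosine similarity against a query vector. The in-memory `store`, the `retrieve_top_k` helper, and the toy two-dimensional vectors are assumptions made for this example; a real pipeline would use your embedding model and vector store instead.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors stand in for real embeddings.
store = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("refund deadlines", [0.9, 0.1]),
]
top = retrieve_top_k([1.0, 0.0], store, k=2)
```

The returned list is what would be passed as `chunks` to the generation step.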
Minimal generator call
import os

from openai import OpenAI

# OpenAI-compatible client pointed at the TokenHub gateway.
client = OpenAI(
    base_url="https://llm.sandboxclaw.com/v1",
    api_key=os.environ["TOKENHUB_KEY"],
)


def answer(question: str, chunks: list[str]) -> str:
    # Use at most the first five retrieved chunks as context.
    context = "\n\n".join(chunks[:5])
    prompt = f"""Use only the context below to answer the question.
Context:
{context}
Question:
{question}
"""
    resp = client.chat.completions.create(
        model="moonshot-v1-8k",
        messages=[
            {"role": "system", "content": "Answer concisely and cite the provided context."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=600,
        temperature=0.2,
    )
    # message.content can be None, so fall back to an empty string.
    return resp.choices[0].message.content or ""
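The `chunks[:5]` slice in `answer` caps how many chunks are sent, but not how large they are. A minimal sketch of a size-aware selection follows. The `fit_chunks` helper and the four-characters-per-token estimate are assumptions made for illustration; neither comes from TokenHub or the OpenAI SDK, and a real tokenizer would give more accurate counts.

```python
def fit_chunks(chunks: list[str], max_context_tokens: int, chars_per_token: float = 4.0) -> list[str]:
    """Keep chunks in their given order until the estimated token budget is spent."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        # Rough estimate only (assumption): not a tokenizer.
        cost = int(len(chunk) / chars_per_token) + 1
        if used + cost > max_context_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


# Three 40-character chunks cost about 11 estimated tokens each.
sample = ["a" * 40, "b" * 40, "c" * 40]
selected = fit_chunks(sample, max_context_tokens=25)
```

The budget passed in should leave room for the instructions, the question, and the `max_tokens` reserved for the reply.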
LangChain adapter
If your stack already uses LangChain’s OpenAI-compatible chat wrapper, point it at TokenHub:
import os

from langchain_openai import ChatOpenAI

generator = ChatOpenAI(
    model="moonshot-v1-8k",
    base_url="https://llm.sandboxclaw.com/v1",
    api_key=os.environ["TOKENHUB_KEY"],
    temperature=0.2,
)
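Both snippets above pass the model ID as a string literal. One possible way to keep it in configuration instead is to read it from an environment variable. `TOKENHUB_MODEL` and `generator_model` are hypothetical names chosen for this sketch, not settings that TokenHub defines.

```python
import os

# Fallback route used when no override is configured.
DEFAULT_MODEL = "moonshot-v1-8k"


def generator_model() -> str:
    """Return the model ID from TOKENHUB_MODEL (hypothetical), or the default."""
    return os.environ.get("TOKENHUB_MODEL", "").strip() or DEFAULT_MODEL
```

The result can then be passed as `model=generator_model()` to either client, so a model switch becomes a configuration change rather than a code change.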
Guardrails
- Keep the retrieved context, together with the instructions, the question, and the max_tokens reserved for the reply, under the selected model’s context window.
- Log the model ID, request ID, token usage, and top-k document IDs for debugging.
- Use your own eval set before switching model channels.
- When additional models are enabled for your account, treat the model string as configuration rather than hardcoding it throughout the app.
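The logging guardrail could be sketched as below. `log_generation` is a hypothetical helper; it reads the `id`, `model`, and `usage` fields that an OpenAI-compatible chat completion response carries, and it treats the response `id` as the request identifier, which is an assumption about what the gateway returns. A `SimpleNamespace` stands in for a real response so the example runs without network access.

```python
import json
import logging
from types import SimpleNamespace

logger = logging.getLogger("rag.generator")


def log_generation(resp, doc_ids: list[str]) -> dict:
    """Log one structured record per generation call and return it."""
    usage = getattr(resp, "usage", None)
    record = {
        "model": getattr(resp, "model", None),
        "response_id": getattr(resp, "id", None),  # assumed to serve as the request ID
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
        "doc_ids": doc_ids,
    }
    logger.info(json.dumps(record))
    return record


# Stand-in for a real response object (illustrative values only).
fake = SimpleNamespace(
    id="chatcmpl-123",
    model="moonshot-v1-8k",
    usage=SimpleNamespace(prompt_tokens=10, completion_tokens=5),
)
record = log_generation(fake, ["doc-7", "doc-12"])
```

Emitting one JSON record per call keeps the model ID and the retrieved document IDs side by side, which helps when comparing answers before and after a model switch.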