Integration Patterns

Retrieval-Augmented Generation

Inject relevant context from your own knowledge base into the system prompt before sending the request to the LLM. This grounds the model's answers in your data rather than in what it memorized during training.

Pattern overview

  1. Embed the user query using an embedding model.
  2. Retrieve the top-k relevant chunks from your vector store.
  3. Concatenate the chunks into a context block in the system message.
  4. Send the conversation to OneInfer and return the response.

Python (simplified)

```python
import requests

BASE_URL = "https://api.oneinfer.ai"

# Exchange the API key for a short-lived bearer token.
token_resp = requests.post(
    f"{BASE_URL}/v1/ula/oauth-authentication?api_key=YOUR_API_KEY"
)
token_resp.raise_for_status()
token = token_resp.json()["access_token"]
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

def retrieve_context(query: str) -> str:
    # Replace with your actual vector store lookup (steps 1-2 of the pattern).
    return "OneInfer supports OpenAI, Anthropic, and Novita providers."

def rag_answer(user_query: str) -> str:
    context = retrieve_context(user_query)
    system_prompt = (
        "You are a helpful assistant. Use the following context to answer the user's question. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}"
    )
    response = requests.post(
        f"{BASE_URL}/v1/ula/chat/completions",
        headers=headers,
        json={
            "provider": "openai",
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_query},
            ],
            "max_tokens": 512,
            # A low temperature keeps the answer close to the supplied context.
            "temperature": 0.3,
        },
    )
    response.raise_for_status()
    return response.json()["data"]["text"]

print(rag_answer("Which providers does OneInfer support?"))
```
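
The `retrieve_context` function above is a stub; steps 1 and 2 of the pattern (embed and retrieve) still need a real implementation. As a minimal sketch of the ranking logic, here is an in-memory version with a toy bag-of-words "embedding" and cosine similarity. The `CHUNKS` list and the `embed` function are placeholders invented for this example; in production you would call a real embedding model and query your vector store instead:

```python
import math
import re

def embed(text: str) -> dict:
    # Toy bag-of-words "embedding": word -> count. A stand-in for a real
    # embedding model, used here only so the example runs end to end.
    vec = {}
    for word in re.findall(r"\w+", text.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Placeholder knowledge base; in practice these chunks live in a vector store.
CHUNKS = [
    "OneInfer supports OpenAI, Anthropic, and Novita providers.",
    "Billing is per token across all providers.",
    "Rate limits are applied per API key.",
]

def retrieve_context(query: str, k: int = 2) -> str:
    # Steps 1-2: embed the query, rank chunks by similarity, keep the top-k.
    query_vec = embed(query)
    ranked = sorted(CHUNKS, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return "\n".join(ranked[:k])

print(retrieve_context("Which providers does OneInfer support?"))
```

The joined top-k string drops straight into the `Context:` block of the system prompt. Swapping the toy `embed` for a real embedding model and `CHUNKS` for a vector-store query leaves the rest of `rag_answer` unchanged.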