HTTP API

Endpoints

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint.

POST /v1/responses

OpenAI-compatible Responses API endpoint (regular JSON and SSE streaming).

POST /v1/embeddings

OpenAI-compatible embeddings endpoint.

GET /v1/responses

WebSocket mode for the OpenAI-compatible Responses API.

GET /v1/models

Returns the discovered and configured models visible through the gateway.

GET /v1/models/{model}

Returns details for a single discovered/configured model.

GET /health

Liveness endpoint reporting container and process health.

GET /ready

Readiness endpoint for orchestration and probes.

GET /metrics

Prometheus metrics scrape endpoint.
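The operational endpoints above can be exercised with a quick probe script. This is a sketch assuming the gateway listens on localhost:8080, as in the request examples below; adjust the base URL to your deployment:

```shell
# Probe LunarGate's operational endpoints.
BASE_URL="http://localhost:8080"

# Liveness: returns 2xx while the process is up.
curl -fsS "$BASE_URL/health" || echo "health check failed"

# Readiness: orchestrators (e.g. Kubernetes probes) can poll this.
curl -fsS "$BASE_URL/ready" || echo "not ready"

# Metrics: Prometheus text exposition format; show the first few lines.
curl -fsS "$BASE_URL/metrics" | head -n 5 || true
```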

Example chat completion request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.2",
    "messages": [
      {"role": "user", "content": "Hello from LunarGate"}
    ]
  }'

Example embeddings request

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/nomic-embed-text-v2-moe",
    "input": [
      "LunarGate can proxy embeddings requests.",
      "Embeddings are useful for semantic search."
    ]
  }'

Custom request headers

Header                  Description
X-LunarGate-Provider    Force a specific provider
X-LunarGate-Model       Override the model
X-LunarGate-Route       Force a named route
X-LunarGate-SessionID   Session correlation identifier used in request metadata/logs
X-LunarGate-No-Cache    Bypass cache when set to true
X-LunarGate-No-Retry    Disable retries when set to true
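A request can combine several of the headers above to steer routing. In this sketch the header names come from the table; the provider name ("openai") and session ID are placeholder values:

```shell
# Minimal chat payload; reused from the earlier example.
payload='{"model":"openai/gpt-5.2","messages":[{"role":"user","content":"ping"}]}'

# Force the provider, skip the cache, and tag the request with a session ID.
curl -sS http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-LunarGate-Provider: openai" \
  -H "X-LunarGate-No-Cache: true" \
  -H "X-LunarGate-SessionID: demo-session-001" \
  -d "$payload" || true
```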

Response headers

Header                            Description
X-LunarGate-Request-ID            Unique request identifier
X-LunarGate-Provider              Provider that served the request
X-LunarGate-Model                 Model used for the request
X-LunarGate-Route                 Route that matched
X-LunarGate-Cache-Status          HIT or MISS
X-LunarGate-Latency-Ms            End-to-end latency in milliseconds
X-LunarGate-Overhead-Duration-Ms  Gateway processing overhead in milliseconds
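The headers in the table can be inspected by dumping response headers with curl's `-D -` flag (a sketch assuming the gateway is on localhost:8080):

```shell
# Endpoint under test; adjust for your deployment.
URL="http://localhost:8080/v1/chat/completions"

# -D - writes response headers to stdout; filter for the gateway's headers.
curl -sS -D - "$URL" \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-5.2","messages":[{"role":"user","content":"ping"}]}' \
  | grep -i '^x-lunargate-' || true
```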

Streaming

The gateway supports SSE streaming on chat completions and Responses endpoints.

Example streaming request:

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.2",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a short haiku about LunarGate."}
    ]
}'

Responses SSE request:

curl -N http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.2",
    "stream": true,
    "input": "Write a short haiku about LunarGate."
  }'

Responses WebSocket mode

Use GET /v1/responses with a WebSocket client and send response.create frames.

  • Gateway converts each response.create frame into a Responses request and always streams events back.
  • Events are sent as JSON WebSocket messages (response.created, response.output_text.delta, response.completed, and error events).
  • previous_response_id is validated against response IDs created earlier on the same WebSocket connection.
  • One request is processed at a time per connection.
  • If x-lunargate-sessionid is missing on the WebSocket handshake, the gateway generates one automatically (wsresp_<uuid>).
  • The same session ID is injected into each upstream request created from response.create frames, so collector/request logs can correlate multiple upstream requests from one WS session.

Example with wscat:

wscat -c ws://localhost:8080/v1/responses

Then send:

{"type":"response.create","model":"openai/gpt-5.2","input":"Say hello from LunarGate"}
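To set the session ID yourself instead of letting the gateway generate one, the handshake header from the bullet list above can be supplied via wscat's -H flag (the session ID value here is a placeholder):

```shell
# Placeholder session ID; any stable string works for correlation.
SESSION_ID="demo-session-001"

# Pass the session header on the WebSocket handshake.
wscat -c ws://localhost:8080/v1/responses \
  -H "x-lunargate-sessionid: $SESSION_ID" || true
```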

Compatibility notes

LunarGate normalizes some client payload variants before routing to upstream providers. This preserves OpenAI compatibility even when clients or intermediate proxies serialize text content differently.

For embeddings specifically:

  • The public endpoint is POST /v1/embeddings.
  • A common routing pattern is to match /v1/embeddings separately from /v1/chat/completions.
  • Local Ollama is a good smoke-test target for embeddings before building retrieval or RAG flows.
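Such a smoke test can be sketched as follows, reusing the model from the embeddings example above and assuming jq is installed and the model is pulled in the local Ollama instance:

```shell
# Request a single embedding through the gateway.
resp=$(curl -sS http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"ollama/nomic-embed-text-v2-moe","input":["smoke test"]}' || true)

# A healthy response carries one embedding vector; print its dimension.
printf '%s' "$resp" | jq '.data[0].embedding | length' || echo "no embedding returned"
```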