# HTTP API

## Endpoints

### POST /v1/chat/completions

OpenAI-compatible chat completions endpoint.

### POST /v1/responses

OpenAI-compatible Responses API endpoint (regular JSON and SSE streaming).

### POST /v1/embeddings

OpenAI-compatible embeddings endpoint.

### GET /v1/responses

OpenAI-compatible WebSocket mode for the Responses API.

### GET /v1/models

Returns the discovered and configured models visible through the gateway.

### GET /v1/models/{model}

Returns details for a single discovered/configured model.

### GET /health

Container and process health endpoint.

### GET /ready

Readiness endpoint for orchestration and probes.

### GET /metrics

Prometheus metrics scrape endpoint.
## Example chat completion request

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.2",
    "messages": [
      {"role": "user", "content": "Hello from LunarGate"}
    ]
  }'
```
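The same request can be issued from code. Here is a minimal Python sketch assuming the gateway runs on localhost:8080 as above; `build_chat_request` is an illustrative helper, not part of LunarGate:

```python
import json

GATEWAY_URL = "http://localhost:8080"  # assumed local gateway address

def build_chat_request(model: str, user_text: str):
    """Assemble the URL, headers, and JSON body for a chat completions call."""
    url = f"{GATEWAY_URL}/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    })
    return url, headers, body

url, headers, body = build_chat_request("openai/gpt-5.2", "Hello from LunarGate")
```

The resulting tuple can be sent with any HTTP client, for example `urllib.request.urlopen(urllib.request.Request(url, body.encode(), headers))`.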
## Example embeddings request

```shell
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/nomic-embed-text-v2-moe",
    "input": [
      "LunarGate can proxy embeddings requests.",
      "Embeddings are useful for semantic search."
    ]
  }'
```
## Custom request headers

| Header | Description |
|---|---|
| X-LunarGate-Provider | Force a specific provider |
| X-LunarGate-Model | Override the model |
| X-LunarGate-Route | Force a named route |
| X-LunarGate-SessionID | Session correlation identifier used in request metadata/logs |
| X-LunarGate-No-Cache | Bypass the cache when set to true |
| X-LunarGate-No-Retry | Disable retries when set to true |
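Because these are plain HTTP headers, per-request overrides can be assembled in client code. A small sketch; `override_headers` is an illustrative helper, not a LunarGate client API:

```python
def override_headers(provider=None, model=None, route=None,
                     session_id=None, no_cache=False, no_retry=False):
    """Build a header map with only the requested LunarGate overrides set."""
    headers = {"Content-Type": "application/json"}
    if provider:
        headers["X-LunarGate-Provider"] = provider
    if model:
        headers["X-LunarGate-Model"] = model
    if route:
        headers["X-LunarGate-Route"] = route
    if session_id:
        headers["X-LunarGate-SessionID"] = session_id
    if no_cache:
        headers["X-LunarGate-No-Cache"] = "true"
    if no_retry:
        headers["X-LunarGate-No-Retry"] = "true"
    return headers

# Force the openai provider and skip the cache for one request.
headers = override_headers(provider="openai", no_cache=True)
```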
## Response headers

| Header | Description |
|---|---|
| X-LunarGate-Request-ID | Unique request identifier |
| X-LunarGate-Provider | Provider that served the request |
| X-LunarGate-Model | Model used for the request |
| X-LunarGate-Route | Route that matched |
| X-LunarGate-Cache-Status | HIT or MISS |
| X-LunarGate-Latency-Ms | End-to-end latency in milliseconds |
| X-LunarGate-Overhead-Duration-Ms | Gateway-added processing overhead in milliseconds |
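These headers are useful for client-side logging and debugging. A hedged sketch of extracting them from a response header map, case-insensitively since HTTP header names are not case-sensitive; `gateway_metadata` is an illustrative helper, not part of LunarGate:

```python
def gateway_metadata(response_headers):
    """Extract LunarGate observability headers (case-insensitively) into a dict."""
    wanted = {
        "request_id": "X-LunarGate-Request-ID",
        "provider": "X-LunarGate-Provider",
        "model": "X-LunarGate-Model",
        "route": "X-LunarGate-Route",
        "cache_status": "X-LunarGate-Cache-Status",
        "latency_ms": "X-LunarGate-Latency-Ms",
        "overhead_ms": "X-LunarGate-Overhead-Duration-Ms",
    }
    lowered = {k.lower(): v for k, v in response_headers.items()}
    return {name: lowered.get(header.lower()) for name, header in wanted.items()}

# Works regardless of the casing the HTTP stack reports.
meta = gateway_metadata({
    "x-lunargate-request-id": "req_123",
    "X-LunarGate-Cache-Status": "HIT",
    "X-LunarGate-Latency-Ms": "42",
})
```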
## Streaming

The gateway supports SSE streaming on the chat completions and Responses endpoints.

Example streaming request:

```shell
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.2",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a short haiku about LunarGate."}
    ]
  }'
```

Responses SSE request:

```shell
curl -N http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.2",
    "stream": true,
    "input": "Write a short haiku about LunarGate."
  }'
```
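On the wire, OpenAI-compatible streaming delivers `data:` SSE lines. A minimal client-side parsing sketch, assuming the OpenAI chat-completions chunk shape and `[DONE]` end-of-stream sentinel; `parse_sse_events` is an illustrative helper:

```python
import json

def parse_sse_events(lines):
    """Yield JSON payloads from OpenAI-style 'data:' SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and event/comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # OpenAI-style end-of-stream sentinel
            return
        yield json.loads(payload)

# Reassemble streamed text from sample chat-completions chunks.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
text = "".join(ev["choices"][0]["delta"].get("content", "")
               for ev in parse_sse_events(sample))
```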
## Responses WebSocket mode

Use GET /v1/responses with a WebSocket client and send response.create frames.
- The gateway converts each response.create frame into a Responses request and always streams events back.
- Events are sent as JSON WebSocket messages (response.created, response.output_text.delta, response.completed, and error events).
- previous_response_id is validated against response IDs created earlier on the same WebSocket connection.
- One request is processed at a time per connection.
- If x-lunargate-sessionid is missing on the WebSocket handshake, the gateway generates one automatically (wsresp_<uuid>).
- The same session ID is injected into each upstream request created from response.create frames, so collector/request logs can correlate multiple upstream requests from one WS session.
Example with wscat: connect with `wscat -c ws://localhost:8080/v1/responses`, then send response.create frames as JSON text messages.
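The per-connection bookkeeping described above can be modeled in a few lines. A minimal illustration of the documented wsresp_<uuid> fallback and the previous_response_id validation rule; `WsSession` is an illustrative sketch, not a LunarGate API:

```python
import uuid

class WsSession:
    """Illustrative model of per-connection state for Responses WebSocket mode."""

    def __init__(self, session_id=None):
        # Mirror the documented fallback: auto-generate wsresp_<uuid> when
        # x-lunargate-sessionid is absent on the handshake.
        self.session_id = session_id or f"wsresp_{uuid.uuid4()}"
        self.seen_response_ids = set()

    def validate_frame(self, frame):
        """previous_response_id must reference a response created on this connection."""
        prev = frame.get("previous_response_id")
        if prev is not None and prev not in self.seen_response_ids:
            raise ValueError(f"unknown previous_response_id: {prev}")

    def record_response(self, response_id):
        self.seen_response_ids.add(response_id)

session = WsSession()
session.record_response("resp_1")
session.validate_frame({"previous_response_id": "resp_1"})  # accepted
```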
## Compatibility notes
LunarGate normalizes some client payload variants before routing to upstream providers. That helps preserve OpenAI compatibility even when upstream or intermediate clients serialize text content differently.
For embeddings specifically:
- The public endpoint is POST /v1/embeddings.
- A common routing pattern is to match /v1/embeddings separately from /v1/chat/completions.
- Local Ollama is a good smoke-test target for embeddings before building retrieval or RAG flows.