2026-05-26 · Kalmantic

TL;DR — An inference endpoint is a URL that runs a trained AI model and returns its output to you over HTTP. "OpenAI-compatible" means the endpoint accepts the same request shape as OpenAI's Chat Completions API, so any client built for OpenAI can swap to it with only a base_url and api_key change.

What is an inference endpoint?

If you've ever wired up ChatGPT in code, you've used one. But the term gets used loosely. This post pins it down and explains how it fits into the AI coding stack you're probably building.

The 30-second version

An inference endpoint is a URL that runs an AI model for you and returns the result. You send it a request (POST /v1/chat/completions with a prompt), and it sends back a response (a single JSON body or a stream of tokens, optionally including tool calls). You don't run the model. The endpoint provider does.
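As a rough sketch of that request/response exchange, the snippet below builds the JSON body for POST /v1/chat/completions and pulls the reply out of a non-streaming response. The field names follow OpenAI's Chat Completions schema; the example.invalid base URL, the key, the model name and the sample response are placeholders I made up, and nothing is sent over the network.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) a POST /v1/chat/completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_json):
    """Return (text, tool_calls) from a non-streaming response body."""
    message = response_json["choices"][0]["message"]
    return message.get("content"), message.get("tool_calls") or []


# Placeholder values: swap in a real base URL, key and model, then send
# the request with urllib.request.urlopen(request).
request = build_chat_request(
    "https://example.invalid/v1", "sk-placeholder", "some-model", "Say hi."
)

# Hand-written sample in the shape a provider would return.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "Hi!"}}],
    "usage": {"prompt_tokens": 9, "completion_tokens": 2, "total_tokens": 11},
}
text, tool_calls = extract_reply(sample_response)
print(request.full_url, text, tool_calls)
```

The request is only constructed here, so the example runs offline; the response parsing is the part you would reuse against a real endpoint.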

Synonyms you'll see in the wild:

inference API (same thing)
inference server (the software running behind the URL)
model endpoint (often used for fine-tuned variants)
LLM API (umbrella term)

Where the term came from

Machine learning has two phases:

Training — feed data, adjust weights. Slow, expensive, done once or rarely.
Inference — given a trained model, generate output for a new input. Fast (relatively), cheap (relatively), done billions of times.

An inference endpoint is the production-facing surface for phase 2. The name "inference", rather than "generation" or "prediction", is a holdover from statistics, where it means "deriving the unknown from the known."

What's inside an inference endpoint

┌────────────────────────────────────────────────────────┐
│ HTTP / gRPC / SSE surface                              │
│   accepts: prompt, params (temperature, max_tokens…)   │
│   returns: tokens (streamed) + usage stats             │
├────────────────────────────────────────────────────────┤
│ Routing / load-balancing layer                         │
│   may pick a server, a region, a replica               │
├────────────────────────────────────────────────────────┤
│ Inference server (vLLM, TGI, TensorRT-LLM…)            │
│   batches requests, manages KV cache, runs decode      │
├────────────────────────────────────────────────────────┤
│ GPU(s) holding model weights in VRAM                   │
└────────────────────────────────────────────────────────┘

You only ever see the top layer. Everything below is the provider's problem.
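To make the streamed half of that top layer concrete, here is a minimal sketch of reading a token stream. It assumes the OpenAI-style server-sent events format: each event is a `data:` line carrying a JSON chunk with the new text under `choices[0].delta.content`, and the stream ends with `data: [DONE]`. The sample lines are hand-written, not captured from a real provider.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a trailing chunk that only carries usage stats
        delta = choices[0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]


sample_lines = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: {"choices":[],"usage":{"total_tokens":12}}',
    "data: [DONE]",
]
streamed = "".join(iter_stream_text(sample_lines))
print(streamed)
```

Against a live endpoint you would feed this function the decoded lines of the HTTP response body instead of a list.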

What "OpenAI-compatible" means

OpenAI's Chat Completions schema (/v1/chat/completions with messages, tools, stream, temperature, etc.) won enough mindshare that nearly every inference provider now mimics it. So when someone says "OpenAI-compatible API", they mean: you can point any client built for OpenAI at our URL with a different base_url and api_key, and most things will just work.

This is a huge deal because:

Tool authors (Cursor, Aider, Claude Code's compat mode, Cline, Continue) only need to support one schema.
Switching providers is a config change, not a code change.
You can A/B test providers in production by routing 1% of traffic to a new base URL.
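The last two points can be sketched together. Below, switching providers is just a different dictionary entry, and a stable hash of the request ID sends roughly 1% of traffic to the new one. The provider names, URLs and keys are hypothetical, and hashing the request ID is only one way to split traffic.

```python
import hashlib

# Hypothetical provider entries; only base_url and api_key differ.
PROVIDERS = {
    "current": {"base_url": "https://api.current.example/v1", "api_key": "KEY_A"},
    "candidate": {"base_url": "https://api.candidate.example/v1", "api_key": "KEY_B"},
}


def pick_provider(request_id, candidate_percent=1.0):
    """Deterministically send about candidate_percent% of requests to 'candidate'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    name = "candidate" if bucket < candidate_percent * 100 else "current"
    return name, PROVIDERS[name]


# Simulate 100,000 requests and count where they land.
counts = {"current": 0, "candidate": 0}
for i in range(100_000):
    name, _config = pick_provider(f"req-{i}")
    counts[name] += 1
print(counts)
```

Hashing rather than calling a random number generator means the same request ID always lands on the same provider, which makes a bad response easier to trace back.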

The two main alternative schemas worth knowing:

Anthropic Messages (/v1/messages with system, messages[], tools) — slightly different shape, similar power. Claude Code speaks both.
Responses API (/v1/responses) — newer OpenAI surface for stateful conversations. Still rare in third-party tools.
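To show how small the difference in shape is for plain text, here is a sketch that converts a Chat Completions body into an Anthropic Messages-style body. It relies on two differences: system prompts move from the messages array to a top-level `system` field, and `max_tokens` is required. The default of 1024 is an arbitrary assumption, the model name is passed through unchanged even though real model IDs differ between providers, and tool definitions and tool results are left out because their shapes differ more.

```python
def to_anthropic_messages(openai_body, default_max_tokens=1024):
    """Convert a text-only Chat Completions body to a Messages-style body."""
    system_parts = []
    messages = []
    for msg in openai_body["messages"]:
        if msg["role"] == "system":
            system_parts.append(msg["content"])  # hoisted to top-level "system"
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    body = {
        "model": openai_body["model"],
        "messages": messages,
        "max_tokens": openai_body.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        body["system"] = "\n\n".join(system_parts)
    if "temperature" in openai_body:
        body["temperature"] = openai_body["temperature"]
    return body


converted = to_anthropic_messages({
    "model": "placeholder-model",
    "messages": [
        {"role": "system", "content": "Be terse."},
        {"role": "user", "content": "Hi"},
    ],
    "temperature": 0.2,
})
print(converted)
```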

Inference endpoint vs gateway vs router

These terms blur. Here's how I separate them:

An inference endpoint runs a model. (Together, Fireworks, OpenAI, Anthropic.)
A gateway sits in front of inference endpoints, adds auth/logging/rate-limiting. (Helicone, parts of Portkey.)
A router picks which inference endpoint to call. (OpenRouter, jusInfer.)

Some products do all three. jusInfer is mostly a router with gateway features. OpenRouter is an aggregator that happens to expose a uniform endpoint. Anthropic is just an endpoint.
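One way to picture the three roles is as functions that wrap each other. This is a toy sketch, not how any of the named products is built: the endpoints are fakes that echo their own name, the router picks by model name, and the gateway only checks a key and logs the call.

```python
from typing import Callable, Dict, List

Endpoint = Callable[[dict], dict]  # takes a request body, returns a response body


def make_fake_endpoint(name: str) -> Endpoint:
    """Stand-in for a real inference endpoint; reports which one answered."""
    def endpoint(body: dict) -> dict:
        return {"served_by": name, "model": body["model"]}
    return endpoint


def make_router(routes: Dict[str, Endpoint], default: str) -> Endpoint:
    """Router: picks which endpoint handles the call (here, by model name)."""
    def router(body: dict) -> dict:
        target = routes.get(body["model"], routes[default])
        return target(body)
    return router


def make_gateway(upstream: Endpoint, valid_keys: set, log: List[str]):
    """Gateway: checks auth and logs, then forwards the call unchanged."""
    def gateway(api_key: str, body: dict) -> dict:
        if api_key not in valid_keys:
            return {"error": "unauthorized"}
        log.append(body["model"])
        return upstream(body)
    return gateway


request_log: List[str] = []
router = make_router(
    {"small": make_fake_endpoint("provider-a"),
     "large": make_fake_endpoint("provider-b")},
    default="small",
)
gateway = make_gateway(router, {"key-1"}, request_log)
result = gateway("key-1", {"model": "large"})
print(result, request_log)
```

The gateway never decides where a call goes and the router never checks who is calling, which is the separation the definitions above draw.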

What to look for when choosing one

Property	Why it matters
Compatibility	Can your existing tools point at it without code changes?
Latency	Time to first token and tokens per second. Look at p95, not the average.
Cost	Per-token, and what your real workload actually pays after batching/caching.
Catalog	Which models are available? Are the open-weights versions current?
Reliability	Uptime over the last 90 days. Most providers publish a status page; check it.
Streaming	Server-sent events should work end-to-end. Test before committing.
Tool use	Your agent makes tool calls. The endpoint must support them.
Cost guardrails	Per-user caps, alerts on usage spikes. Often missing — verify.
Auth model	API keys vs JWT vs OAuth. Rotation story matters at scale.
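On the latency row, a small worked example shows why p95 tells you more than the average. The samples are made up (eighteen fast requests and two slow ones), and the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up time-to-first-token samples in milliseconds.
ttft_ms = [180] * 18 + [2500, 3000]

mean_ms = statistics.mean(ttft_ms)
p50_ms = percentile(ttft_ms, 50)
p95_ms = percentile(ttft_ms, 95)
print(mean_ms, p50_ms, p95_ms)
```

For this sample the mean is 437 ms, the median is 180 ms and p95 is 2500 ms. The average describes no request anyone actually experienced, while p95 shows what the slowest requests feel like.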

Inference endpoints for coding agents specifically

Coding agents differ from chat in three ways:

Long context. Whole files, sometimes whole repos. 32k–200k input tokens are normal.
Tool use. Read file → propose edit → run test. Many round trips per task.
Code quality matters more than fluency. A model that writes confident-but-broken Python is worse than one that's terse and correct.
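The "many round trips" point is easier to see in code. This sketch runs the usual agent loop against a fake endpoint: the model asks for a tool, the client runs it and sends the result back, and the loop repeats until the model answers in plain text. The message shapes follow OpenAI's tool-calling format; `fake_endpoint`, the `read_file` tool and the in-memory file are stand-ins I invented for illustration.

```python
import json


def read_file(path):
    """Toy tool: serves file contents from an in-memory dict."""
    return {"app.py": "print('hello')"}.get(path, "")


TOOLS = {"read_file": read_file}


def fake_endpoint(messages):
    """Stand-in for a real endpoint: asks for one tool call, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {"name": "read_file",
                             "arguments": json.dumps({"path": "app.py"})},
            }],
        }
    return {"role": "assistant", "content": "app.py prints hello."}


def run_agent(call_endpoint, user_prompt, max_round_trips=5):
    """Loop until the model answers without requesting a tool."""
    messages = [{"role": "user", "content": user_prompt}]
    for round_trip in range(1, max_round_trips + 1):
        reply = call_endpoint(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"], round_trip
        for call in tool_calls:
            args = json.loads(call["function"]["arguments"])
            result = TOOLS[call["function"]["name"]](**args)
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
    raise RuntimeError("agent did not finish within max_round_trips")


answer, round_trips = run_agent(fake_endpoint, "What does app.py do?")
print(answer, round_trips)
```

Even this trivial task takes two calls to the endpoint, and each call resends the growing message history, which is why latency and long-context handling compound for agents.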

The implication: pick an endpoint that's been benchmarked on coding tasks, not generic chat. The cheapest chat model might be terrible at tool use. The fastest endpoint might lag on long context.

This is why a coding-specific router like jusInfer exists — we route per-call based on which underlying endpoint is best for that specific step. A trivial autocomplete goes to a small fast model; a multi-file refactor goes to a frontier model. You don't pick; we do.
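To illustrate the idea of per-step routing, here is a deliberately simple heuristic. It is not jusInfer's actual routing logic: the tier names, the step kinds and the token thresholds are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str           # e.g. "autocomplete", "edit", "refactor"
    input_tokens: int   # estimated prompt size
    needs_tools: bool   # whether the step requires tool calls


def choose_tier(step: Step) -> str:
    """Pick a hypothetical model tier for one agent step."""
    if step.kind == "autocomplete" and step.input_tokens < 2_000 and not step.needs_tools:
        return "small-fast"  # trivial completion: optimize for latency and cost
    if step.input_tokens > 32_000 or step.kind == "refactor":
        return "frontier"    # long context or multi-file work: optimize for quality
    return "mid"


for step in (
    Step("autocomplete", 300, False),
    Step("edit", 4_000, True),
    Step("refactor", 60_000, True),
):
    print(step.kind, "->", choose_tier(step))
```

A production router would presumably also weigh measured latency, price and benchmark results per endpoint; the point here is only that the decision is made per call, not per session.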

Try one

If you want to try an inference endpoint in 30 seconds:

curl https://api.jusinfer.com/v1/chat/completions \
  -H "Authorization: Bearer jinf_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jusInfer-auto",
    "messages": [{"role":"user","content":"Write a Python function to reverse a string."}],
    "stream": false
  }'

Mint a key at jusinfer.com/developer (free signup, $0.05 starting credit).
