TL;DR — If your custom agent harness uses the OpenAI Python or Node SDK, change two values — base_url to https://api.jusinfer.com/v1 and api_key to a jinf_ token. No code changes beyond the client constructor. Tool-call shapes are normalized server-side.
Your custom agent harness — point it at a cheaper endpoint in one line
Some teams build their own coding-agent harness instead of using Claude Code, OpenCode, or Cursor. The reasons vary — domain-specific tools, privacy, in-house benchmarks, or just a strong opinion about how an agent loop should work. Names you see in the wild: in-house "OpenClaw" / "nemoClaw" style projects, Hermes-based agent harnesses, custom CrewAI / LangGraph orchestrations, internal forks of Aider.
All of them have the same property: if the harness speaks OpenAI Chat Completions, it speaks jusInfer. This post is the universal setup, plus the three things to verify before you ship.
The universal config
99% of custom agents use one of two HTTP client patterns. Here are both.
Pattern 1 — OpenAI Python SDK
from openai import OpenAI
client = OpenAI(
base_url="https://api.jusinfer.com/v1",
api_key=os.environ["JUSINFER_API_KEY"], # jinf_… from /developer
)
# Everything else stays the same.
resp = client.chat.completions.create(
model="jusInfer-auto", # or any provider/model id you want pinned
messages=conversation,
tools=tool_schemas,
stream=True,
)
Pattern 2 — Raw fetch / requests
import requests, os
resp = requests.post(
"https://api.jusinfer.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {os.environ['JUSINFER_API_KEY']}",
"Content-Type": "application/json",
},
json={
"model": "jusInfer-auto",
"messages": conversation,
"tools": tool_schemas,
"stream": True,
},
stream=True,
)
That's the whole integration. The hard part — model selection per call, upstream provider routing, failover — happens server-side.
Mapping your existing config
Most custom harnesses have a config block like:
LLM_PROVIDER = "anthropic" # or "openai", "together", etc.
LLM_MODEL = "claude-sonnet-4-5"
LLM_BASE_URL = "https://api.anthropic.com"
LLM_API_KEY = "sk-ant-..."
To switch to jusInfer, only three lines change:
LLM_PROVIDER = "openai" # we speak OpenAI shape regardless of underlying model
LLM_MODEL = "jusInfer-auto" # or pin a specific provider/model id
LLM_BASE_URL = "https://api.jusinfer.com/v1"
LLM_API_KEY = "jinf_..."
If your harness has hard-coded "anthropic" / "openai" / "together" branches in code, the simplest fix is to add a "jusInfer" branch that uses the OpenAI client path. Or just use the OpenAI branch and point its base URL at us.
What works for any custom harness
| Feature | Status |
|---|---|
messages shape (system/user/assistant/tool) | ✅ |
| Streaming via SSE | ✅ |
| Tool calls (OpenAI shape) | ✅ |
| Tool calls (Anthropic shape) | ✅ we normalize |
response_format: json_object | ✅ |
| Parallel tool calls | ✅ when the underlying model supports it |
| Vision inputs (image_url) | ✅ auto-routes to a vision-capable model |
| Logprobs | ⚠️ pass-through if upstream supports; ignored if not |
| Fine-tunes | ❌ we don't host fine-tunes; use the provider directly for those |
Three things to verify before shipping
1. Tool-call shape on YOUR specific model id
If you pin a specific model (e.g. nousresearch/hermes-4-405b or qwen/qwen3-coder-480b), test that tool calls land in the shape your agent expects. We normalize but edge cases exist — log one tool call end-to-end and inspect the JSON.
2. Cost per task on YOUR workload
Don't trust the homepage promise. Run 10 real tasks through your harness, check the usage tab. If it's not 50%+ cheaper than your prior setup, something's miscoded (often: you're sending the same prompt twice because of a retry bug in your harness).
3. Failure mode when upstream stutters
We retry transparently when an upstream model 502s or rate-limits. Test this: kill your network for 5 seconds mid-stream. Your agent should see a 502 UPSTREAM_ERROR only if all our retries failed (rare). Otherwise the stream resumes from the same response with no duplicate text.
Per-user accounting in a multi-user agent
If your harness serves multiple users and you need per-user spend attribution:
- Mint one
jinf_key per user viaPOST /v1/keys(JWT-authed; humans sign in first). - Each user's calls go through their own key.
- The usage dashboard breaks down spend per key.
- Set per-user soft caps via
POST /v1/tenant/members/:id/capso a single user can't blow the team budget.
For a single shared key (single tenant) we still attribute by user_id if your agent passes it in headers — see API reference.
A note on the names
Got asked about specific custom harnesses recently — "OpenClaw", "nemoClaw", "Hermes agents". These aren't standardized products in the sense Cursor or Claude Code are. They're patterns or in-house projects. If you're building one of them or its equivalent, this guide covers you. If "OpenClaw" or "nemoClaw" is something more specific you want a tailored guide for, email hello@jusinfer.com with a link to your repo and we'll write one.
Setup checklist
- Sign up at jusinfer.com/login.
- Mint a
jinf_key at /developer. - Set base URL + key in your harness config.
- Run one task end-to-end. Verify response on the Usage tab.
- Set per-user caps if multi-tenant.
- Set rate limits in the Tenant tab if you expect burst traffic.
Related reading
- OpenAI-compatible drop-in (every popular harness, step by step)
- Hermes models for coding agents
- Inference endpoints for coding agents — what's different
- API reference
Raw markdown: /blog/custom-agent-harness-openai-compatible.md