---
title: Hermes models for coding agents — what they're good at, what they're not
description: Hermes 3 and Hermes-style instruction-tuned models punch above their weight on tool use. Here's where they fit in a coding-agent stack and how to route them via an OpenAI-compatible endpoint.
tldr: Hermes models (NousResearch, Llama-based) excel at tool use and instruction-following. Use them for the mid-tier of a coding-agent stack — bulk tool-calling traffic where frontier models are overkill. Route via any OpenAI-compatible gateway like jusInfer.
date: 2026-05-26
author: jusInfer
cluster: conceptual
tags: hermes-models, hermes-agents, nousresearch, open-source-llm, tool-use, coding-agents
---

# Hermes models for coding agents

The [Hermes](https://nousresearch.com/) family of open-weights models from NousResearch, Hermes 3 (Llama-based) and the newer Hermes 4, is among the strongest open model lines for **tool use** and **function calling**. That makes it genuinely interesting for coding agents, which spend most of their time calling tools (read file, propose edit, run test). This post is an honest look at when to use these models and how to route them via an OpenAI-compatible endpoint like jusInfer.

## What "Hermes" actually is

Hermes models are post-trained by NousResearch from open base models: Hermes 3 from Llama 3.1 (8B, 70B, and 405B), and the larger Hermes 4 releases from Llama 3.1 70B and 405B. The recipe emphasizes:

- **Steerability** — system prompts actually shift behavior, not just style.
- **Tool use / function calling** — JSON-schema-constrained outputs are reliable.
- **Reasoning without RLHF refusal patterns** — fewer "as an AI language model…" deflections.
- **Long context** — the larger variants handle 128k+ comfortably.

They're not chat-tuned the way Claude or GPT-5 are. The vibe is more "competent technical assistant who does what you ask" and less "warm conversational partner." For coding agents that's a feature.

## Where Hermes fits in a coding-agent stack

Coding agents have roughly three workload tiers:

| Tier | Examples | Best model class |
|---|---|---|
| Heavy reasoning | architecture decisions, debugging complex stack traces, multi-file refactors | Frontier (Sonnet 4.5, GPT-5, Hermes 4 405B) |
| Tool execution | "read file X", "propose edit Y", "run command Z" | Mid-tier with strong tool use (Hermes 3 70B, Qwen3 Coder, DeepSeek V4) |
| Trivial completions | type annotations, lint fixes, single-line patches | Small fast (Llama 4 8B, Qwen3 8B, Kimi K2.6) |

Most of the Hermes line lives in the middle tier; the 405B variant is the exception that can stretch into heavy reasoning. Hermes is almost never *the best* model on any single benchmark, but it's reliably *good enough* for the bulk of an agent's tool-calling traffic, at 5-10× lower cost than the frontier tier.
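One way to read the table is as a routing rule. The following is a minimal sketch, assuming your harness can label each step with a hypothetical `TaskKind`. The frontier and small-model ids are placeholders, and `nousresearch/hermes-3-70b` is an assumed id, so check your gateway's model list before relying on any of them.

```python
from enum import Enum


class TaskKind(Enum):
    """Hypothetical labels a harness might attach to each agent step."""
    HEAVY_REASONING = "heavy_reasoning"
    TOOL_EXECUTION = "tool_execution"
    TRIVIAL_COMPLETION = "trivial_completion"


# Placeholder model ids for illustration only.
MODEL_BY_TIER = {
    TaskKind.HEAVY_REASONING: "frontier-model-placeholder",
    TaskKind.TOOL_EXECUTION: "nousresearch/hermes-3-70b",  # assumed id
    TaskKind.TRIVIAL_COMPLETION: "small-fast-model-placeholder",
}


def pick_model(kind: TaskKind) -> str:
    """Return the model id for a step; unmapped kinds fall back to the mid tier."""
    return MODEL_BY_TIER.get(kind, MODEL_BY_TIER[TaskKind.TOOL_EXECUTION])
```

Falling back to the mid tier reflects the claim above that tool execution is the bulk of the traffic; a harness that sees mostly reasoning steps might prefer a different default.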

## When to pick Hermes specifically

**Pick it when:**
- You need open weights (audit, on-prem, compliance reasons)
- You're tool-use-bound (most agent steps are function calls, not reasoning)
- You want consistent behavior across system-prompt instructions
- Your budget says "no Claude/GPT default routing"

**Don't pick it when:**
- You need state-of-the-art reasoning (use Sonnet/GPT-5 for that subset of calls)
- You're vision-bound (Hermes vision support is patchy)
- You need 200k+ context with quality (the larger Hermes models handle long context, but quality degrades faster than on frontier models)

## Using Hermes via jusInfer

jusInfer routes to Hermes-style models when the request shape matches their strengths. You don't need to pick a model yourself, but if you want to, you can pin one:

```sh
curl https://api.jusinfer.com/v1/chat/completions \
  -H "Authorization: Bearer jinf_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nousresearch/hermes-4-405b",
    "messages": [
      {"role": "system", "content": "You are a careful code editor."},
      {"role": "user", "content": "Read package.json and tell me which deps are unused."}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the workspace.",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string"}},
          "required": ["path"]
        }
      }
    }]
  }'
```

We normalize the tool-call response to OpenAI shape regardless of the underlying provider's format. So your agent code doesn't change when we route to Hermes vs Claude vs GPT.
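Because the response comes back in OpenAI shape, the agent side can be one small dispatch loop. The sketch below assumes the standard chat-completions layout (`choices[0].message.tool_calls`, with `arguments` as a JSON string). The `sample_response` is hand-written for illustration, not captured model output, and `read_file` is a hypothetical local tool.

```python
import json

# Hand-written sample in the OpenAI chat-completions shape (not real output).
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "read_file",
                    "arguments": "{\"path\": \"package.json\"}",
                },
            }],
        },
        "finish_reason": "tool_calls",
    }]
}


def read_file(path: str) -> str:
    """Hypothetical local tool: return the file's text."""
    with open(path, encoding="utf-8") as f:
        return f.read()


TOOLS = {"read_file": read_file}


def run_tool_calls(response: dict, tools: dict) -> list[dict]:
    """Execute each requested tool call and build the follow-up `tool` messages."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        try:
            args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
            output = tools[fn["name"]](**args)
        except Exception as exc:  # report failures back to the model as text
            output = f"error: {exc}"
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": output,
        })
    return results
```

Append the assistant message and the returned `tool` messages to the conversation, then send the next request. Nothing in the loop depends on which model produced the call.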

## Pinning Hermes for a whole agent

If you want every call from a specific harness to go to Hermes, override the auto-router by setting the model name at the client level. For each tool:

- **Aider:** `aider --model openai/nousresearch/hermes-4-405b`
- **Continue (`config.json`):** `"model": "nousresearch/hermes-4-405b"`
- **Cline:** set the model in Cline Settings → API Provider → OpenAI Compatible → Model
- **OpenCode:** `default_model: nousresearch/hermes-4-405b` in `~/.opencode/config.json`
- **Cursor:** add `nousresearch/hermes-4-405b` to Custom Models, select it per feature
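For a harness you write yourself, pinning amounts to fixing the `model` field wherever requests are built. This is a minimal standard-library sketch, assuming the base URL and model id from the curl example above. The helper name `build_chat_request` is made up for illustration, and the function only constructs the request without sending it.

```python
import json
import urllib.request

BASE_URL = "https://api.jusinfer.com/v1"      # taken from the curl example
PINNED_MODEL = "nousresearch/hermes-4-405b"   # every request uses this id


def build_chat_request(messages: list[dict], api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request with the pinned model."""
    body = json.dumps({"model": PINNED_MODEL, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. Keeping the model id in one constant means switching the whole harness to another model is a one-line change.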

## On "Hermes agents" as a term

People sometimes say "Hermes agents" loosely to mean *any agent built on top of a Hermes model.* There isn't a single project called "Hermes Agents" with capital letters — it's a class of agents that use NousResearch's models as their reasoning backend. Common patterns:

- **Hermes + custom harness** — engineering teams who fine-tune their own agent loop and want open weights underneath.
- **Hermes + LangGraph / CrewAI** — multi-agent orchestration frameworks that accept any OpenAI-compatible endpoint.
- **Hermes + on-prem deployment** — companies that can't send data to closed providers and need open weights they can self-host (or route via a gateway like jusInfer that supports open-weights inference at the edge).

All three patterns work with jusInfer. Point your harness at our base URL with a Hermes model id and it routes to a Together / Fireworks / Cloudflare-hosted Hermes endpoint behind the scenes.

## Benchmarks worth checking

Don't trust generic benchmarks for coding. Look at these:

- **SWE-Bench Verified** — real GitHub issues, real repo edits, judged on whether the fix passes the test suite. Hermes 4 405B is competitive with frontier here.
- **HumanEval / MBPP** — fast unit-test passes. Hermes scores well; less interesting for agentic use.
- **BFCL (Berkeley Function Calling Leaderboard)** — function-call accuracy. Hermes punches above weight.
- **Aider polyglot benchmark** — Paul Gauthier's real-edit benchmark across languages. Hermes generally lands in upper-middle of open models.
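Public leaderboards are a starting point, and a small check against your own traces is often more telling. The sketch below scores exact-match function-call accuracy over OpenAI-shaped tool calls. It is a deliberately simple stand-in, not a reimplementation of BFCL's scoring, and the function names are hypothetical.

```python
import json


def normalize(call: dict) -> tuple[str, str]:
    """Reduce an OpenAI-shaped tool call to (name, canonical JSON arguments)."""
    fn = call["function"]
    args = json.loads(fn["arguments"])
    return fn["name"], json.dumps(args, sort_keys=True)


def call_accuracy(expected: list[dict], actual: list[dict]) -> float:
    """Fraction of cases where the tool name and arguments match exactly."""
    if not expected:
        raise ValueError("no test cases")
    hits = 0
    for want, got in zip(expected, actual):
        try:
            hits += normalize(want) == normalize(got)
        except (KeyError, TypeError, json.JSONDecodeError):
            pass  # a malformed or missing call counts as a miss
    return hits / len(expected)
```

Record the calls you expect for a few dozen real agent steps, replay them against each candidate model, and compare the scores. Exact matching is strict, so treat a low score as a prompt to inspect the mismatches, not as a verdict.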

## Related reading

- [OpenAI-compatible drop-in](/docs/openai-drop-in/)
- [Inference endpoints for coding agents — what's different](/blog/inference-endpoint-coding-agents/)
- [The cheapest LLM API for coding agents in 2026](/blog/cheapest-llm-api-for-coding-2026/)
- [API reference](/docs/api-reference/)

---

*Raw markdown: [/blog/hermes-models-and-coding-agents.md](/blog/hermes-models-and-coding-agents.md)*
