All models run on your own hardware. No requests or prompts are sent to external services unless you explicitly configure a cloud provider in LiteLLM.
Model categories
- General purpose
- Code
- Fast
General-purpose models handle a wide range of tasks: writing, summarization, question answering, and reasoning.
Use these when you don’t have a specific requirement — they handle most everyday workloads well.
| Model | Best for |
|---|---|
| Llama 3 | Balanced performance across most tasks |
| Mistral | Fast responses, strong instruction following |
| DeepSeek | Reasoning-heavy tasks and analysis |
Select a model for your task
Pass the model name in your request using the standard OpenAI-compatible format. LiteLLM maps the name to the correct Ollama model and routes to an available node.
Model routing and load balancing
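As a sketch of that request format, the snippet below builds an OpenAI-style chat completion request against the proxy. The host, port, and path are assumptions for illustration; substitute the address of your own LiteLLM instance.

```python
import json
import urllib.request

# Hypothetical LiteLLM proxy endpoint; replace with your deployment's address.
LITELLM_URL = "http://localhost:4000/v1/chat/completions"

# Standard OpenAI-compatible body; "model" selects which Ollama model to use.
payload = {
    "model": "llama3",  # or "mistral" / "deepseek", per the table above
    "messages": [
        {"role": "user", "content": "Summarize this paragraph in one sentence."}
    ],
}

request = urllib.request.Request(
    LITELLM_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted here so the
# example stays self-contained without a running proxy.
```

Only the `model` field changes when you switch models; the rest of the request stays the same.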
LiteLLM manages routing automatically:
- Load balancing — if a model runs on multiple nodes, LiteLLM distributes requests across them.
- Failover — if a node becomes unavailable, requests fail over to another node running the same model.
- Priority routing — ai-max and ai-mini-x1 receive GPU-bound requests first; dell-micro picks up lightweight overflow.
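This behavior is driven by the proxy's model list: registering the same `model_name` against several backends is what enables load balancing and failover across them. A minimal sketch of what that configuration might look like, assuming the nodes are reachable by their names on Ollama's default port (both assumptions):

```yaml
model_list:
  # Same model_name on multiple nodes -> LiteLLM load-balances across them.
  - model_name: llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://ai-max:11434       # GPU node (assumed address)
  - model_name: llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://ai-mini-x1:11434   # GPU node (assumed address)
  - model_name: llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://dell-micro:11434   # lightweight overflow (assumed address)

router_settings:
  # If one deployment is unreachable, retry on another with the same model_name.
  num_retries: 2
```

Your actual config will differ; this only illustrates how one public model name can map to several backends.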
LiteLLM API call format
LiteLLM exposes an OpenAI-compatible API. Any client that works with the OpenAI SDK works with Jarvis.
Next steps
Inference
Learn how to send requests, stream responses, and get good results.
API reference
Full reference for the models API endpoint.