Skip to content

Ollama

MetadataValue
Categoryai
Capabilitieshttp, shell
Websitehttps://ollama.com
  • api — Ollama REST API — fast inference path, requires server to be running
  • cli — Ollama CLI — can start the server, pull/delete models, management ops

Run local AI models — inference, tool calling, and model management — entirely on your machine. No API keys, no cloud, no cost per token.

Install Ollama (if not already installed):

Terminal window
brew install ollama
brew services start ollama

Pull a model to use:

Terminal window
ollama pull qwen3.5:9b-q8_0 # recommended: 11GB, tools + vision + thinking, 256K ctx
ollama pull glm-4.7-flash # coding specialist: 19GB, best local SWE-bench

The skill auto-starts the Ollama server if it is not running — no manual setup required.

ConnectionWhat it does
api (default)REST API at localhost:11434 — fast inference, most operations
cliOllama CLI binary — server start/stop, model pulls, management
ToolConnectionDescription
statuscliCheck if server is running; start it if not
chatapi / cliMulti-turn chat with tool calling and thinking mode
generateapiOne-shot text generation (faster for simple prompts)
list_modelsapi / cliList all downloaded models with size and metadata
pull_modelcli / apiDownload a model from the Ollama registry
delete_modelapi / cliDelete a model to free disk space
psapiShow models currently loaded in memory
show_modelapiShow model details: arch, context length, template

chat normalizes Ollama’s tool call format to the canonical AgentOS shape:

{ id, name, input }

Pass tools as { name, description, input_schema } — the same format used by the Anthropic skill.

Models that support extended reasoning (qwen3, glm-4.7-flash, etc.) can be activated with thinking: true. The reasoning trace is returned in the thinking field, separate from content.

  • chat with connection: cli collapses message history into a single prompt — suitable for single-turn only
  • pull_model defaults to the CLI connection for reliable progress on large downloads (19GB models)
  • ps shows what is currently warm in unified memory — useful before running a new model to estimate reload time
  • The generate operation skips chat overhead — use it for classification, extraction, or quick single-turn tasks where speed matters