Open Source AI

Open Source AI You Can Keep Running: Ollama, NVIDIA NIM, OpenRouter, and GLM 5.2

The useful version of "open source AI they can't shut off" is not a slogan. It is a stack decision.

Pat Simmons' video is a quick-start tour through three layers: run a model on your own machine with Ollama, test stronger hosted open models through NVIDIA NIM, and route cheap or specialized models through OpenRouter. The practical lesson is simple: do not make every workflow depend on one frontier-model account.

JQ AI SYSTEMS take: Build a resilience stack. Use local models for private, repeatable, low-risk work. Use hosted open-model routes when your laptop is too slow. Use frontier models where the task actually needs them.

Video credit: Pat Simmons. This post uses the supplied transcript as commentary and checks the setup paths against Ollama, NVIDIA, OpenCode, OpenRouter, Z.ai, and Claude Code documentation.

Source Note

Credit for the walkthrough goes to Pat Simmons. You can also find him on X and at PerSimmons Studio.

The video frames the moment around restricted frontier-model access. I am going to keep the analysis narrower and more useful: what can a builder set up this week so private notes, code search, draft work, and low-risk agent tasks do not depend entirely on one cloud vendor?

The Main Idea

There are three different things people often mix together:

Layer What it gives you What can still go wrong
Local Ollama Model files and inference on your own machine. Your hardware limits speed, context, and model size.
NVIDIA NIM Free development access to hosted models and a path to self-hosted NVIDIA inference. Still a service. Limits, terms, model availability, and uptime can change.
OpenRouter One API key for many open and closed models, plus useful Claude Code routing. Still cloud. You are buying routing flexibility, not full ownership.

That distinction matters. The "can't shut off" layer is local. The hosted layers are still useful because they reduce single-provider dependency and make model routing easier.

Here are the useful links from the stack, grouped by what you are trying to do.

Need Start here Why it matters
Run local models Ollama, Ollama download, Ollama model library The fastest beginner path to local model files, local API calls, and private experiments.
Use a local model in Claude Code Ollama Claude Code integration, Claude Code settings, Claude Code env vars Lets a local model run inside an agentic coding harness for file reading, editing, and tool use.
Test local starter models Gemma 4 on Ollama, Google Gemma + Ollama docs, Qwen3-Coder on Ollama Gemma is a good private multimodal starting point. Qwen3-Coder is a stronger coding-agent test.
Use stronger hosted open models NVIDIA Build, NVIDIA model catalog, NVIDIA NIM Good for experiments where your own machine is too slow or too small.
Run an open-source coding agent OpenCode, OpenCode providers, OpenCode models, OpenCode config Useful when you want a coding-agent harness that can connect to many providers, including local or hosted routes.
Route many models through one key OpenRouter, OpenRouter Claude Code integration, Z.ai on OpenRouter Best for swapping models per task without rebuilding your agent setup.
Try GLM 5.2 Z.ai GLM 5.2 release, GLM 5.2 docs, GLM 5.2 on Hugging Face Relevant for cheaper coding-agent and long-context experiments, usually via hosted routing.
Search private notes Obsidian, Obsidian Local REST API with MCP Lets local or agent workflows search notes without uploading your whole vault to a chatbot.

Method 1: Ollama

If you want actual resilience, start with Ollama. It runs models on your own computer and exposes a local API. Ollama's own site now positions it as a way to run apps and agents with open models, including Claude Code and similar tools.

The path is beginner-friendly:

  1. Install Ollama from the download page.
  2. Pick a model from the model library.
  3. Start with a small model for chat, notes, and private drafting.
  4. Move to coding-oriented models like Qwen3-Coder when you want agent work.
  5. Use Ollama's Claude Code integration if you want to launch Claude Code against a local model.
# Basic model chat
ollama run gemma4

# Coding-oriented local model
ollama run qwen3-coder:30b

# Launch Claude Code through Ollama's Anthropic-compatible route
ollama launch claude --model qwen3-coder

This is the part of the stack that can keep working if a hosted model is paused, a pricing plan changes, or an account gets rate-limited. The tradeoff is that your laptop is now the data center. Small models are fine. Large models will be slow or impossible without serious memory.

Where local Ollama is strongest

  • Searching private notes and local files.
  • Drafting and rewriting non-sensitive text.
  • Classifying documents or CSVs.
  • Running small coding-agent tasks in a sandbox.
  • Keeping an offline fallback for travel, outages, or gated frontier access.

Where it is weak

  • Big UI builds where design taste matters.
  • Long-horizon refactors with many hidden dependencies.
  • High-stakes legal, medical, finance, security, or production changes.
  • Workflows that need very large context on modest hardware.

Method 2: NVIDIA NIM

NVIDIA's Build site and model catalog give developers API access to many models. NVIDIA also describes NIM as optimized inference microservices that can run on NVIDIA GPUs in workstations, data centers, or cloud.

In the video, Pat uses NVIDIA-hosted models through a coding-agent workflow. The benefit is obvious: you can test beefier models without buying a GPU. The caveat is just as important: this is not ownership. It is a generous hosted development route that can have limits, outages, model changes, or policy changes.

For OpenCode users, the relevant research links are:

My recommendation: use NIM as a testing lane. If a workflow proves valuable, then decide whether to keep using hosted APIs, rent GPUs, or buy hardware.

Method 3: OpenRouter

OpenRouter is the practical model-routing layer in this stack. It gives you one account and one API key for many models. The official Claude Code integration says that setting ANTHROPIC_BASE_URL to https://openrouter.ai/api lets Claude Code speak its native protocol directly to OpenRouter.

This is why the setup is useful for builders. You can test GLM 5.2, Qwen, DeepSeek, Anthropic, OpenAI, Gemini, and other models through one harness, then decide which one belongs on which task.

The important caveat: OpenRouter is still cloud. It is a flexibility layer, not a local-ownership layer. Do not send private client data or secrets through it unless your policies, contracts, and risk tolerance allow it.

Claude Code Config

The video includes a project-local .claude/settings.local.json approach for routing Claude Code through OpenRouter to GLM 5.2. Claude Code's official docs explain that settings files can configure environment variables, and the LLM gateway docs describe ANTHROPIC_BASE_URL as the common address variable for gateways.

Put this in your project-local .claude/settings.local.json. Paste your OpenRouter API key into ANTHROPIC_AUTH_TOKEN. Do not commit this file.

{
  "env": {
    "ANTHROPIC_BASE_URL": "https://openrouter.ai/api",
    "ANTHROPIC_AUTH_TOKEN": "",
    "ANTHROPIC_API_KEY": "",
    "API_TIMEOUT_MS": "3000000",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "z-ai/glm-5.2",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "z-ai/glm-5.2",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "z-ai/glm-5.2",
    "ANTHROPIC_SMALL_FAST_MODEL": "z-ai/glm-5.2",
    "CLAUDE_CODE_SUBAGENT_MODEL": "z-ai/glm-5.2"
  }
}
Security note: settings.local.json should stay local. Add .claude/settings.local.json to your ignore rules if the project is under Git, and never paste API keys into shared docs, screenshots, or agent transcripts.

What To Route Where

The stack is most useful when you stop asking "which model is best?" and start asking "which model is good enough for this part of the workflow?"

Workflow Best first route Why
Private notes and Obsidian search Ollama local model Privacy matters more than frontier quality. Keep the vault local when possible.
Small code cleanup Ollama Qwen3-Coder or OpenRouter GLM 5.2 Cheap iteration, easy rollback, and enough intelligence for scoped edits.
Landing page experiments NVIDIA NIM through OpenCode or OpenRouter GLM 5.2 Better models help, but you still need visual review.
Large refactors Frontier model or GLM 5.2 with strict tests Use cheaper models for exploration, but keep human review and tests on production code.
Agent research over public sources OpenRouter or NIM Hosted speed matters more than local ownership when the data is already public.
Client data or sensitive business docs Local first Do not optimize for model quality before privacy, consent, and data handling.

Safety And Privacy

Running a model locally does not automatically make the workflow safe. A local agent can still delete files, leak secrets through tools, or make bad edits. Treat local models like junior employees with a lot of keyboard access.

  • Keep local APIs private. Do not expose Ollama or Open WebUI publicly without authentication, VPN, and a clear reason.
  • Separate private and public tasks. Notes and client files should stay local unless you intentionally approve a hosted route.
  • Use sandboxes. Local code agents should work in project folders, branches, or throwaway copies.
  • Log what matters. Keep prompts, commands, files changed, tests run, and model/provider used.
  • Use review gates. Cheaper model routing is not a substitute for code review, data review, or deployment approval.

Starter Checklist

If I were setting this up for a founder or operator this week, I would do it in this order:

  1. Install Ollama.
  2. Run one small model and one coding model from the Ollama library.
  3. Test a private task: summarize notes, search a folder, or clean a transcript.
  4. Try Ollama with Claude Code on a harmless repo.
  5. Create an OpenRouter key and test GLM 5.2 with the local settings file above.
  6. Try NVIDIA Build or NIM when local speed is not enough.
  7. Write down your routing rule: local for private, OpenRouter/NIM for cheap public experiments, frontier models for high-stakes review.

The end state is not "use open source for everything." The end state is choice. Your workflow should not collapse just because one provider changes access, price, model names, or rollout policy.

Sources

Common questions

Can open source AI really not be shut off?
Only the local part is truly resilient. If you download a model and run it through Ollama on your own computer, that workflow can keep working without a provider account. NVIDIA NIM and OpenRouter are useful fallback routes, but they are still hosted services with terms, rate limits, and availability risk.
What is the easiest way to start running AI locally?
Install Ollama, pull a model such as Gemma 4 or Qwen3-Coder, and test simple chat, document, or code tasks. Ollama also documents a Claude Code integration using the ollama launch claude command.
What is NVIDIA NIM useful for?
NVIDIA NIM is useful when you want to test stronger hosted models through NVIDIA infrastructure before renting GPUs or buying hardware. It is good for experimentation, but not the same as owning the model locally.
Why use OpenRouter?
OpenRouter gives one API key for many models and documents a direct Claude Code integration through ANTHROPIC_BASE_URL. It is useful for model routing, cost tests, and quickly trying GLM, Qwen, DeepSeek, and closed models without rebuilding your tooling.
Should I replace Claude or ChatGPT with local models?
No. Use local and open models for privacy, cheap iteration, notes, drafts, code search, and fallback workflows. Keep frontier models for tasks where the higher quality and reliability justify the cost.
Share
X LinkedIn Reddit
Build Yours

Want a system
like this one?

Book a free 30-minute call. We map your situation, identify the highest-impact automation, and figure out if we are a fit.

Book Free 30-min Call