The useful version of "open source AI they can't shut off" is not a slogan. It is a stack decision.
Pat Simmons' video is a quick-start tour through three layers: run a model on your own machine with Ollama, test stronger hosted open models through NVIDIA NIM, and route cheap or specialized models through OpenRouter. The practical lesson is simple: do not make every workflow depend on one frontier-model account.
Source Note
Credit for the walkthrough goes to Pat Simmons. You can also find him on X and at PerSimmons Studio.
The video frames the moment around restricted frontier-model access. I am going to keep the analysis narrower and more useful: what can a builder set up this week so private notes, code search, draft work, and low-risk agent tasks do not depend entirely on one cloud vendor?
The Main Idea
There are three different things people often mix together:
| Layer | What it gives you | What can still go wrong |
|---|---|---|
| Local Ollama | Model files and inference on your own machine. | Your hardware limits speed, context, and model size. |
| NVIDIA NIM | Free development access to hosted models and a path to self-hosted NVIDIA inference. | Still a service. Limits, terms, model availability, and uptime can change. |
| OpenRouter | One API key for many open and closed models, plus useful Claude Code routing. | Still cloud. You are buying routing flexibility, not full ownership. |
That distinction matters. The "can't shut off" layer is local. The hosted layers are still useful because they reduce single-provider dependency and make model routing easier.
Link Map
Here are the useful links from the stack, grouped by what you are trying to do.
| Need | Start here | Why it matters |
|---|---|---|
| Run local models | Ollama, Ollama download, Ollama model library | The fastest beginner path to local model files, local API calls, and private experiments. |
| Use a local model in Claude Code | Ollama Claude Code integration, Claude Code settings, Claude Code env vars | Lets a local model run inside an agentic coding harness for file reading, editing, and tool use. |
| Test local starter models | Gemma 4 on Ollama, Google Gemma + Ollama docs, Qwen3-Coder on Ollama | Gemma is a good private multimodal starting point. Qwen3-Coder is a stronger coding-agent test. |
| Use stronger hosted open models | NVIDIA Build, NVIDIA model catalog, NVIDIA NIM | Good for experiments where your own machine is too slow or too small. |
| Run an open-source coding agent | OpenCode, OpenCode providers, OpenCode models, OpenCode config | Useful when you want a coding-agent harness that can connect to many providers, including local or hosted routes. |
| Route many models through one key | OpenRouter, OpenRouter Claude Code integration, Z.ai on OpenRouter | Best for swapping models per task without rebuilding your agent setup. |
| Try GLM 5.2 | Z.ai GLM 5.2 release, GLM 5.2 docs, GLM 5.2 on Hugging Face | Relevant for cheaper coding-agent and long-context experiments, usually via hosted routing. |
| Search private notes | Obsidian, Obsidian Local REST API with MCP | Lets local or agent workflows search notes without uploading your whole vault to a chatbot. |
Method 1: Ollama
If you want actual resilience, start with Ollama. It runs models on your own computer and exposes a local API. Ollama's own site now positions it as a way to run apps and agents with open models, including Claude Code and similar tools.
The path is beginner-friendly:
- Install Ollama from the download page.
- Pick a model from the model library.
- Start with a small model for chat, notes, and private drafting.
- Move to coding-oriented models like Qwen3-Coder when you want agent work.
- Use Ollama's Claude Code integration if you want to launch Claude Code against a local model.
# Basic model chat
ollama run gemma4
# Coding-oriented local model
ollama run qwen3-coder:30b
# Launch Claude Code through Ollama's Anthropic-compatible route
ollama launch claude --model qwen3-coder
This is the part of the stack that can keep working if a hosted model is paused, a pricing plan changes, or an account gets rate-limited. The tradeoff is that your laptop is now the data center. Small models are fine. Large models will be slow or impossible without serious memory.
Where local Ollama is strongest
- Searching private notes and local files.
- Drafting and rewriting non-sensitive text.
- Classifying documents or CSVs.
- Running small coding-agent tasks in a sandbox.
- Keeping an offline fallback for travel, outages, or gated frontier access.
Where it is weak
- Big UI builds where design taste matters.
- Long-horizon refactors with many hidden dependencies.
- High-stakes legal, medical, finance, security, or production changes.
- Workflows that need very large context on modest hardware.
Method 2: NVIDIA NIM
NVIDIA's Build site and model catalog give developers API access to many models. NVIDIA also describes NIM as optimized inference microservices that can run on NVIDIA GPUs in workstations, data centers, or cloud.
In the video, Pat uses NVIDIA-hosted models through a coding-agent workflow. The benefit is obvious: you can test beefier models without buying a GPU. The caveat is just as important: this is not ownership. It is a generous hosted development route that can have limits, outages, model changes, or policy changes.
For OpenCode users, the relevant research links are:
- OpenCode: open-source coding agent.
- OpenCode providers: connect providers and API keys.
- OpenCode models: configure provider and model selection.
- OpenCode config: project and global config locations.
- AI SDK NVIDIA NIM provider: useful because NIM exposes an OpenAI-compatible API shape.
My recommendation: use NIM as a testing lane. If a workflow proves valuable, then decide whether to keep using hosted APIs, rent GPUs, or buy hardware.
Method 3: OpenRouter
OpenRouter is the practical model-routing layer in this stack. It gives you one account and one API key for many models. The official Claude Code integration says that setting ANTHROPIC_BASE_URL to https://openrouter.ai/api lets Claude Code speak its native protocol directly to OpenRouter.
This is why the setup is useful for builders. You can test GLM 5.2, Qwen, DeepSeek, Anthropic, OpenAI, Gemini, and other models through one harness, then decide which one belongs on which task.
The important caveat: OpenRouter is still cloud. It is a flexibility layer, not a local-ownership layer. Do not send private client data or secrets through it unless your policies, contracts, and risk tolerance allow it.
Claude Code Config
The video includes a project-local .claude/settings.local.json approach for routing Claude Code through OpenRouter to GLM 5.2. Claude Code's official docs explain that settings files can configure environment variables, and the LLM gateway docs describe ANTHROPIC_BASE_URL as the common address variable for gateways.
Put this in your project-local .claude/settings.local.json. Paste your OpenRouter API key into ANTHROPIC_AUTH_TOKEN. Do not commit this file.
{
"env": {
"ANTHROPIC_BASE_URL": "https://openrouter.ai/api",
"ANTHROPIC_AUTH_TOKEN": "",
"ANTHROPIC_API_KEY": "",
"API_TIMEOUT_MS": "3000000",
"ANTHROPIC_DEFAULT_OPUS_MODEL": "z-ai/glm-5.2",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "z-ai/glm-5.2",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "z-ai/glm-5.2",
"ANTHROPIC_SMALL_FAST_MODEL": "z-ai/glm-5.2",
"CLAUDE_CODE_SUBAGENT_MODEL": "z-ai/glm-5.2"
}
}
settings.local.json should stay local. Add .claude/settings.local.json to your ignore rules if the project is under Git, and never paste API keys into shared docs, screenshots, or agent transcripts.
What To Route Where
The stack is most useful when you stop asking "which model is best?" and start asking "which model is good enough for this part of the workflow?"
| Workflow | Best first route | Why |
|---|---|---|
| Private notes and Obsidian search | Ollama local model | Privacy matters more than frontier quality. Keep the vault local when possible. |
| Small code cleanup | Ollama Qwen3-Coder or OpenRouter GLM 5.2 | Cheap iteration, easy rollback, and enough intelligence for scoped edits. |
| Landing page experiments | NVIDIA NIM through OpenCode or OpenRouter GLM 5.2 | Better models help, but you still need visual review. |
| Large refactors | Frontier model or GLM 5.2 with strict tests | Use cheaper models for exploration, but keep human review and tests on production code. |
| Agent research over public sources | OpenRouter or NIM | Hosted speed matters more than local ownership when the data is already public. |
| Client data or sensitive business docs | Local first | Do not optimize for model quality before privacy, consent, and data handling. |
Safety And Privacy
Running a model locally does not automatically make the workflow safe. A local agent can still delete files, leak secrets through tools, or make bad edits. Treat local models like junior employees with a lot of keyboard access.
- Keep local APIs private. Do not expose Ollama or Open WebUI publicly without authentication, VPN, and a clear reason.
- Separate private and public tasks. Notes and client files should stay local unless you intentionally approve a hosted route.
- Use sandboxes. Local code agents should work in project folders, branches, or throwaway copies.
- Log what matters. Keep prompts, commands, files changed, tests run, and model/provider used.
- Use review gates. Cheaper model routing is not a substitute for code review, data review, or deployment approval.
Starter Checklist
If I were setting this up for a founder or operator this week, I would do it in this order:
- Install Ollama.
- Run one small model and one coding model from the Ollama library.
- Test a private task: summarize notes, search a folder, or clean a transcript.
- Try Ollama with Claude Code on a harmless repo.
- Create an OpenRouter key and test GLM 5.2 with the local settings file above.
- Try NVIDIA Build or NIM when local speed is not enough.
- Write down your routing rule: local for private, OpenRouter/NIM for cheap public experiments, frontier models for high-stakes review.
The end state is not "use open source for everything." The end state is choice. Your workflow should not collapse just because one provider changes access, price, model names, or rollout policy.
Sources
- Pat Simmons video: How to Run Open Source AI They Can't Shut Off
- Pat Simmons on YouTube
- Pat Simmons on X
- PerSimmons Studio
- Ollama
- Ollama download
- Ollama model library
- Ollama Claude Code integration
- Gemma 4 on Ollama
- Google: run Gemma with Ollama
- Qwen3-Coder on Ollama
- NVIDIA Build
- NVIDIA model catalog
- NVIDIA NIM for developers
- AI SDK: NVIDIA NIM provider
- OpenCode
- OpenCode providers
- OpenCode models
- OpenCode config
- OpenRouter
- OpenRouter Claude Code integration
- Z.ai models on OpenRouter
- Z.ai GLM 5.2 release
- Z.ai GLM 5.2 docs
- GLM 5.2 on Hugging Face
- Claude Code settings docs
- Claude Code environment variables
- Claude Code LLM gateway docs
- Obsidian
- Obsidian Local REST API with MCP