Local AI hardware is not one buying decision. It is a routing decision. A tiny model for private notes, a 32B coding model, a 70B reasoning model, and a GLM-5.2 long-horizon coding run do not belong on the same machine.
This guide maps the hardware to the work: which boxes make sense for Ollama and LM Studio, which model families fit Mac or Windows, when a sidecar machine is enough, and when you should stop shopping and rent a GPU instead.
Source Note
This is a buyer's guide, not a benchmark leaderboard. It uses official product pages, hardware docs, cloud GPU pricing pages, local runtime documentation, and hands-on video walkthroughs as source material. Prices were checked in late June 2026 and can move quickly, especially for GPUs and cloud rentals.
The practical recommendation is intentionally conservative: start on hardware you already own, prove one weekly workflow, then buy or rent the smallest reliable machine that runs that workflow well.
Link Map
| Layer | Best links | What to use them for |
|---|---|---|
| Local runtimes | Ollama, Ollama GPU docs, LM Studio, LM Studio server | Start local inference, test models, and expose a local API endpoint to tools. |
| Mac sidecar | Mac mini US, Mac mini Portugal, Mac Studio US | Quiet always-on local assistant, private files, small/medium local models. |
| NVIDIA local | RTX 5090, RTX PRO 6000 Blackwell, DGX Spark | CUDA-first inference, larger local models, agent boxes, and business workloads. |
| Compact shared memory | Framework Desktop, Ryzen AI Max+ 395, GMKtec EVO-X2 | Interesting high-memory mini-workstation experiments where CUDA is not required. |
| Cloud rental | RunPod, Lambda, Scaleway GPU pricing, Hetzner GPU servers | Rent first for huge models, fine-tuning, benchmarks, and uncertain workloads. |
| Model families | Qwen3, Gemma, Llama, DeepSeek-R1, GLM-5.2 | Choose the model family before choosing the machine. |
The Hardware Rule
The simple rule is memory first, then software support, then raw speed.
Local models are usually constrained by memory before they are constrained by ambition. If the model does not fit in RAM or VRAM at a usable quantization, the rest of the spec sheet does not matter. After memory, the next question is software support. NVIDIA CUDA is still the safest path for many AI developer tools. Apple Silicon is excellent for quiet unified-memory inference. AMD shared-memory machines are interesting, but for many agent builders the software stack is still less mature than NVIDIA's.
| Model tier | Practical hardware target | What to run |
|---|---|---|
| 1B to 4B | Modern laptop, Mac mini, mini PC, 16GB RAM | Fast private notes, summaries, classification, lightweight helpers |
| 7B to 9B | 16GB minimum, 24GB to 32GB better | Useful Ollama/LM Studio starter models |
| 12B to 14B | 32GB RAM or 24GB+ Apple unified memory | Better coding help, RAG answers, local agent drafts |
| 30B to 32B | 48GB to 64GB RAM, RTX 4090/5090, Mac Studio | Serious local coding and reasoning experiments |
| 70B | High-memory Mac Studio, RTX PRO 6000, DGX Spark, cloud | High-quality local reasoning with compromises |
| 200B+ / huge MoE | Cloud, hosted API, or multi-GPU infrastructure | GLM-5.2, full DeepSeek-style models, frontier open-weight experiments |
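The tiers above follow from simple arithmetic. As a rough sketch (not a benchmark), weight memory is approximately parameter count times bits per weight divided by eight. The 4-bit default and the 20% overhead for context cache and runtime buffers are my assumptions for illustration, and the function names are hypothetical. Real usage varies with context length, quantization format, and how much memory the operating system keeps for itself.

```python
def estimate_memory_gb(params_billions: float,
                       bits_per_weight: float = 4.0,
                       overhead: float = 0.2) -> float:
    """Rough memory needed to load a dense model, in GB (1 GB = 1e9 bytes).

    bits_per_weight: 16 for FP16, about 4 for common 4-bit quantizations.
    overhead: assumed fraction added for context cache and runtime buffers.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9


def fits(params_billions: float, memory_gb: float,
         bits_per_weight: float = 4.0) -> bool:
    """True if the estimate fits in the given RAM, VRAM, or unified memory."""
    return estimate_memory_gb(params_billions, bits_per_weight) <= memory_gb


# Print the estimate for the dense sizes discussed in the table.
for size in (8, 14, 32, 70):
    print(f"{size}B at 4-bit: about {estimate_memory_gb(size):.1f} GB")
```

Under these assumptions a 32B model fits a 24GB GPU with little headroom, while a 70B model needs a machine in the 48GB-and-up class, which matches the table. Treat the output as a first filter, then confirm with the actual model file size in your runtime.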
Quick Picks With Prices
If you just want the answer, this is the current shortlist.
| Pick | Approx price | Best for | Buy / rent link |
|---|---|---|---|
| Current computer + Ollama or LM Studio | Free software, hardware already owned | Learning local AI before spending money | Ollama / LM Studio |
| Mac mini M4 Pro, 48GB unified memory | US M4 Pro line from about $1,599; Portugal 48GB example around EUR 2,199 | Quiet sidecar agent box for Qwen, Gemma, Llama, DeepSeek distills | Apple US / Apple PT / Worten example |
| Mac Studio | M4 Max from about $1,999; M3 Ultra from about $3,999 before upgrades | Quiet high-memory Mac local AI workstation | Apple US / Apple PT |
| RTX 4090 / RTX 5090 Windows or Linux desktop | RTX 5090 official Spain listing around EUR 2,099 when available; street prices vary | CUDA-first local coding agents and faster inference | RTX 5090 specs / NVIDIA Spain marketplace |
| RTX PRO 6000 Blackwell 96GB | NVIDIA marketplace listing around $13,250, often stock-limited | Pro single-GPU local AI box, 70B-class work, business inference | NVIDIA marketplace |
| NVIDIA DGX Spark / ASUS Ascent GX10 class | DGX Spark about $4,699; ASUS Ascent GX10 around $3,999 to $4,100 | Deskside AI appliance with NVIDIA software stack | DGX Spark / ASUS Ascent GX10 |
| Ryzen AI Max+ 395 / Framework Desktop class | Framework mainboard examples: $969, $1,659, $3,149 by memory tier; GMKtec EU example around EUR 1,949.99 | Compact shared-memory local AI experiments | Framework Desktop / Framework mainboard / GMKtec EU |
| Cloud GPU rental | RunPod 4090 examples from about $0.34/hr; Lambda B200 about $6.69/GPU/hr; Scaleway L40S from about EUR 1.47/hr | Testing before buying, huge models, fine-tuning, burst workloads | RunPod / Lambda / Scaleway L40S / Hetzner GPU |
Best Hardware By Model Family
Ollama starter models
Ollama is the easiest way to learn the muscle memory of local models. Start with the machine you already own, then upgrade when one workflow is proven.
- Best hardware: current laptop, Mac mini M4 24GB, Mac mini M4 Pro 48GB, or Windows desktop with 32GB RAM.
- Best models: Gemma small, Qwen 4B/8B, Llama 1B/3B/8B, DeepSeek-R1 distills.
- Best use cases: private notes, transcript cleanup, first drafts, classification, local code explanations.
- Buy advice: spend zero first. Install Ollama or LM Studio, run three real tasks, then decide.
Gemma
Google's current Gemma documentation lists Gemma 4 model sizes as E2B, E4B, 12B, 31B, and 26B A4B. Treat Gemma as a strong practical family for small assistants, private writing, lightweight multimodal workflows, and efficient local tests.
| Gemma tier | Hardware | Use case |
|---|---|---|
| E2B / E4B | Normal laptop, Mac mini, mini PC | Fast drafts and private notes |
| 12B | 32GB RAM, Apple Silicon 24GB+, RTX 4060 Ti 16GB or better | Better writing, RAG, assistant workflows |
| 26B A4B / 31B | Mac mini M4 Pro 48GB, Mac Studio, RTX 4090/5090, 64GB+ RAM | More serious local assistants and vision tests |
Qwen
Qwen3 is one of the most useful open model families to cover because it spans tiny dense models, serious 14B/32B dense models, and MoE models like Qwen3-30B-A3B and Qwen3-235B-A22B.
- Qwen 4B/8B: laptop, Mac mini, or mini PC. Great for first Ollama tests.
- Qwen 14B: 32GB RAM or 24GB+ Apple unified memory.
- Qwen 30B-A3B / 32B: Mac mini M4 Pro 48GB, Mac Studio, RTX 4090/5090, or a Ryzen AI Max+ 395 machine with 128GB.
- Qwen 235B-A22B: cloud or multi-GPU infrastructure. Do not buy a Mac mini for this.
- Best use cases: multilingual workflows, coding, structured reasoning, agent planning, private business assistants.
Llama
Llama remains important because support is broad across local runtimes, examples, and deployment tools.
- Llama 1B/3B: edge devices, low-end laptops, quick classification.
- Llama 8B: Mac mini, 16GB+ machines, small sidecar agents.
- Llama 70B: high-memory Mac Studio, RTX PRO 6000, DGX Spark, or cloud.
- Llama 4 Scout/Maverick: serious NVIDIA hardware or hosted/cloud routes first.
- Best use cases: general local assistants, support workflows, long-context experiments, local app backends.
DeepSeek
Separate the distilled DeepSeek-R1 models from the full DeepSeek-R1 and V3 models. The distills are normal local targets. The full models are not.
- DeepSeek-R1 distill 1.5B/7B/8B: laptop, Mac mini, small local box.
- DeepSeek-R1 distill 14B/32B: Mac mini M4 Pro, RTX 4090/5090, Mac Studio.
- DeepSeek-R1 distill 70B: RTX PRO 6000, DGX Spark, high-memory Mac Studio, cloud.
- Full DeepSeek-R1/V3: cloud or multi-GPU infrastructure.
- Best use cases: reasoning, coding, math, careful analysis, local fallback for hard tasks.
GLM-5.2
GLM-5.2 is the model people will be tempted to misunderstand. It is open-weight and powerful, but it is not a normal home-local model. The public model page describes it as a very large MoE model. The practical route for most builders is hosted inference, not a single consumer box.
- Best route: Z.ai (see the Z.ai docs), other hosted providers, OpenRouter-style routing, or rented high-end cloud GPUs.
- Local warning: do not buy a Mac mini, RTX 4090, or RTX 5090 expecting to comfortably host the full model locally.
- Best use cases: long-horizon coding, agentic software work, model-routing tests, Claude Code-style experiments.
- Practical pairing: use GLM-5.2 hosted for hard work, then use local Qwen/DeepSeek/Gemma models for private drafts, routine agents, and cheap fallback tasks.
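The pairing above is really a small routing policy. Here is a minimal sketch of one way to write it down. The `Task` fields, the route labels, and the choice to let privacy override difficulty are all my assumptions for illustration, not a Z.ai or Ollama API.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    private: bool        # touches client files, private notes, or transcripts
    long_horizon: bool   # multi-step coding or agentic software work


def route(task: Task) -> str:
    """Return 'local' or 'hosted' for a task.

    Assumed policy: private work always stays local, hard long-horizon
    work goes to the hosted large model, everything else stays local
    because it is cheap and routine.
    """
    if task.private:
        return "local"
    if task.long_horizon:
        return "hosted"
    return "local"


examples = [
    Task("refactor a public service", private=False, long_horizon=True),
    Task("draft notes from a client call", private=True, long_horizon=False),
    Task("classify support tickets", private=False, long_horizon=False),
]
for task in examples:
    print(f"{task.name}: {route(task)}")
```

Writing the policy as code forces the awkward question early: what happens to a task that is both private and hard. In this sketch it stays local and accepts a weaker model, but that is a business decision, not a technical one.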
Mistral, Phi, and small specialist models
Smaller specialist models are still important because always-on agents are often bottlenecked by reliability and cost, not maximum intelligence.
- Best hardware: current laptop, Mac mini, mini PC, small NVIDIA card.
- Best use cases: extraction, classification, routing, short summaries, command helpers, light automation.
- Buy advice: if the task is narrow, try a smaller specialist model before moving to a 30B or 70B model.
Best Hardware By Use Case
| Use case | Best hardware | Models to start with | Why |
|---|---|---|---|
| Private notes and writing | Existing laptop, Mac mini M4, Mac mini M4 Pro | Gemma small, Qwen 4B/8B, Llama small | Privacy and convenience matter more than raw speed. |
| Coding assistant beside main computer | Mac mini M4 Pro 48GB or RTX 4090/5090 desktop | Qwen 14B/32B, DeepSeek distills, Gemma 31B | You want a stable sidecar that can run while your main machine stays clean. |
| Always-on background agents | Dedicated sidecar with 64GB+ RAM, wired Ethernet, UPS | Qwen 8B/14B, Gemma 12B, Llama 8B, DeepSeek 14B | Reliability, logs, backups, and permissions matter more than headline benchmarks. |
| Private RAG over files | 64GB+ RAM, fast NVMe, optional GPU | Qwen 14B/30B, Gemma 12B/31B, Llama 8B | Indexing and retrieval need storage and memory, not just a huge GPU. |
| Voice and meeting notes | Apple Silicon for quiet use; NVIDIA desktop for batch work | Whisper-class transcription plus Qwen/Gemma cleanup | Transcription and cleanup are easier to verify than open-ended reasoning. |
| Vision and multimodal tests | Mac Studio, RTX 5090, RTX PRO 6000, DGX Spark, cloud | Gemma multimodal, Llama 4 where supported | Image inputs raise memory and runtime demands. |
| Long-horizon GLM coding workflows | Hosted GLM-5.2 first, cloud GPU for experiments | GLM-5.2 hosted, local Qwen/DeepSeek fallback | The full model is too large for normal local hardware. |
| Fine-tuning | Cloud GPU first; RTX PRO 6000 or multi-GPU only if recurring | Task-specific small/medium models | Renting avoids buying the wrong expensive machine. |
Ollama and LM Studio Setups
I would treat Ollama and LM Studio as two different entry points, with Open WebUI and llama.cpp as optional layers around them.
- LM Studio: best if you want a desktop app, a model catalog, and a friendly way to run a local server.
- Ollama: best if you want terminal control, a simple runtime, and something agents or tools can call.
- Open WebUI: best if you want a local browser interface over Ollama or API-compatible endpoints.
- llama.cpp: best when you want lower-level control, quantization awareness, or maximum portability.
For your first sidecar agent box, I would install Ollama, one Qwen model, one Gemma model, one DeepSeek distill, and Open WebUI only if you need a browser workspace. Do not install twenty models. You will learn more by testing those three models on the same five tasks.
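Running the same tasks across a handful of models is easy to script. The sketch below assumes Ollama is running locally on its default port and uses its `/api/generate` endpoint with streaming turned off. The model tags and task prompts are placeholders I made up, so substitute whatever `ollama list` shows on your machine. `run_matrix` and `main` are hypothetical helper names.

```python
import json
import urllib.request

# Default local Ollama endpoint; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """Send one non-streaming generate request and return the response text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]


def run_matrix(models, tasks, generate=ollama_generate):
    """Run every task against every model; returns {(model, task): answer}."""
    results = {}
    for model in models:
        for task in tasks:
            results[(model, task)] = generate(model, task)
    return results


def main():
    # Placeholder tags and prompts (assumptions); use your own real tasks.
    models = ["qwen3:8b", "gemma3:4b", "deepseek-r1:8b"]
    tasks = [
        "Summarize this meeting transcript in five bullets: ...",
        "Explain what this shell command does: tar -xzf backup.tar.gz",
    ]
    for (model, task), answer in run_matrix(models, tasks).items():
        print(f"--- {model} | {task[:40]}\n{answer}\n")
```

Call `main()` with Ollama running to get side-by-side answers. The `generate` parameter is injectable so the same loop can point at another local endpoint later, or at a stub while you are setting things up.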
Mac vs Windows vs Cloud
| Platform | Strength | Weakness | Best buyer |
|---|---|---|---|
| Mac mini / Mac Studio | Quiet, compact, unified memory, good sidecar experience | CUDA ecosystem is not available | Builders who want a private, always-on, low-noise local AI box |
| Windows/Linux NVIDIA desktop | CUDA, broad AI tooling, high inference speed | Power, heat, noise, driver complexity, GPU pricing volatility | Developers running coding agents, vision, batch jobs, and CUDA-first tools |
| Ryzen AI Max / shared-memory mini workstation | Compact high-memory experiments | Software support is still less mature than NVIDIA CUDA | Experimenters who want a small box and accept rough edges |
| Cloud GPU | No upfront hardware, huge GPU options, easy to scale up temporarily | Ongoing cost, data governance, setup, idle waste | Anyone testing huge models, fine-tuning, or proving demand before buying hardware |
Buy vs Rent
The buying rule is simple: buy stable daily capacity, rent uncertain peak capacity.
- Buy a Mac mini if a private assistant or local agent will run every day and silence matters.
- Buy an NVIDIA desktop if local coding, CUDA support, and speed matter more than noise.
- Buy RTX PRO 6000 or DGX Spark only if this is business infrastructure, not a weekend curiosity.
- Rent RunPod, Lambda, Scaleway, or Hetzner when the workload is irregular, huge, or still unknown.
- Do not buy hardware for full GLM-5.2 local hosting until the workflow has already paid for itself on hosted or rented infrastructure.
The most expensive mistake is not buying the wrong GPU. It is buying a GPU before you know what job it will do every week.
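The buy-versus-rent rule can be checked with a break-even estimate before any money moves. This is a deliberately crude sketch: it ignores electricity, resale value, setup time, and the fact that a rented GPU and an owned box are rarely the same hardware. The function names are hypothetical, and the $2,500 desktop price in the example is an assumption for illustration. The $0.34/hr figure is the RunPod 4090 example quoted earlier.

```python
def break_even_hours(purchase_price: float, rental_rate_per_hour: float) -> float:
    """Hours of rental that would cost as much as buying the hardware."""
    if rental_rate_per_hour <= 0:
        raise ValueError("rental_rate_per_hour must be positive")
    return purchase_price / rental_rate_per_hour


def break_even_weeks(purchase_price: float, rental_rate_per_hour: float,
                     hours_per_week: float) -> float:
    """Weeks until buying beats renting at a steady weekly usage."""
    if hours_per_week <= 0:
        raise ValueError("hours_per_week must be positive")
    return break_even_hours(purchase_price, rental_rate_per_hour) / hours_per_week


# Assumed $2,500 desktop versus a $0.34/hr rental, used 10 hours per week.
hours = break_even_hours(2500, 0.34)
weeks = break_even_weeks(2500, 0.34, 10)
print(f"Break-even after about {hours:.0f} rented hours ({weeks:.0f} weeks)")
```

If the break-even point is years away at your real weekly usage, the workload is still peak capacity and renting is the safer call. If it is a few months away and the workflow runs every day, buying starts to make sense.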
What I Would Buy First
If this were my stack, I would do it in this order.
- First week: use the existing computer with Ollama and LM Studio. Test Qwen, Gemma, and a DeepSeek distill on real tasks.
- First purchase: Mac mini M4 Pro 48GB if I want a quiet sidecar, or RTX 5090/4090 desktop if I need CUDA.
- Before any pro workstation: rent a RunPod or Lambda GPU for the biggest model I think I need.
- Only after proof: buy Mac Studio, RTX PRO 6000, DGX Spark, or a dedicated server if the workflow is now business-critical.
For JQ AI SYSTEMS clients, I would almost never start with a giant machine. I would start with a repeatable workflow: private briefing, local document Q&A, transcript cleanup, code review, or background research. Once the workflow is real, hardware selection becomes obvious.
Security for Always-On Agent Boxes
A local agent box is still a computer with permissions. Treat it like infrastructure.
- Run local model servers only on LAN, VPN, or a private network.
- Use Tailscale SSH or a similar private network for remote access.
- Do not expose Ollama, LM Studio, or Open WebUI directly to the public internet.
- Use separate folders for agent work, with backups and clear logs.
- Do not give agents email, browser control, file deletion, or payment access until review gates exist.
- Encrypt disks if the box holds client files, private notes, prompts, transcripts, or indexes.
Local does not automatically mean safe. It means you control more of the risk.
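One quick sanity check for the first three rules is to look at the address your model server binds to. The sketch below uses Python's standard `ipaddress` module to classify a bind address. The category names and the `is_acceptable` policy are my assumptions. The 100.64.0.0/10 range is the shared address space Tailscale assigns from. A bind address of `0.0.0.0` listens on every interface, so whether it is reachable from the internet depends on your router and firewall, which this check cannot see.

```python
import ipaddress

# Shared address space (CGNAT); Tailscale node addresses come from this range.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def classify_bind(host: str) -> str:
    """Classify an IP a server binds to: loopback, private, cgnat,
    all-interfaces, or public."""
    addr = ipaddress.ip_address(host)
    if addr.is_unspecified:          # 0.0.0.0 or :: means every interface
        return "all-interfaces"
    if addr.is_loopback:             # 127.0.0.0/8 or ::1
        return "loopback"
    if addr in CGNAT:
        return "cgnat"
    if addr.is_private:              # RFC 1918 ranges and similar
        return "private"
    return "public"


def is_acceptable(host: str) -> bool:
    """Assumed policy: loopback, private LAN, and VPN-style addresses pass;
    all-interfaces and public addresses need a human review."""
    return classify_bind(host) in {"loopback", "private", "cgnat"}


for host in ("127.0.0.1", "192.168.1.20", "100.101.102.103", "0.0.0.0", "8.8.8.8"):
    print(f"{host}: {classify_bind(host)}, acceptable={is_acceptable(host)}")
```

This only inspects the address you type in. It does not scan ports or read firewall rules, so treat a passing result as one checked box, not proof that the box is safe.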
Sources
- Ollama
- Ollama GPU docs
- LM Studio
- LM Studio local server docs
- llama.cpp
- Open WebUI
- Apple Mac mini US
- Apple Mac mini Portugal
- Worten Mac mini 48GB example
- Apple Mac Studio US
- Apple Mac Studio Portugal
- NVIDIA RTX 5090
- NVIDIA Spain RTX 5090 marketplace
- NVIDIA RTX PRO 6000 Blackwell marketplace
- NVIDIA DGX Spark
- ASUS Ascent GX10
- AMD Ryzen AI Max+ 395
- Framework Desktop
- Framework Desktop mainboard
- GMKtec EVO-X2 EU
- Qwen3 model family
- Google Gemma docs
- Meta Llama
- DeepSeek-R1
- DeepSeek-V3
- Z.ai GLM-5.2
- GLM-5.2 docs
- GLM-5.2 on Hugging Face
- RunPod pricing
- RunPod RTX 4090
- RunPod RTX 5090
- Lambda pricing
- Scaleway L40S GPU
- Scaleway L4 GPU
- Hetzner GPU servers
- Hetzner GEX131
- Tailscale SSH