A split-brain illustration with the left side featuring science and research symbols, and the right side containing technology and data icons, surrounded by interconnected lines, gears, and interface elements.

Practical On-Prem AI: What OpenAI’s GPT-OSS 20B & 120B Mean for Utilities

Running strong reasoning models locally for safer, auditable field operations

Written by Ted Tschopp

Symphony of Circuits and Synapses: an image by Ted Tschopp

Updated: Aug 10, 2025

Bottom Line Up Front: OpenAI has released powerful, open-weight AI models designed specifically for local, on-premises use, ideal for utility companies aiming to boost field productivity, enhance safety, and streamline operations. These models, GPT-OSS-20B and GPT-OSS-120B, can run on standard hardware, making them perfect for embedding directly into field devices like substation controllers, rugged tablets, or drones. With customized local training, they can autonomously generate crew briefings, validate safety checks, intelligently triage equipment alarms, and translate plain-language requests into structured, auditable procedures—always ensuring humans retain final oversight. This offers utilities a practical way to keep critical data secure, improve offline resilience during outages, and reduce human error in complex field workflows.

OpenAI recently released two open-weight language models: gpt-oss-20b and gpt-oss-120b (model card, usage policy). “Open-weight” means the model files you run (the weights) are published under the permissive Apache 2.0 license (plus OpenAI’s GPT-OSS usage policy). You can download them, run them locally, and fine-tune them, even for commercial use. They’re designed for reasoning and tool use, and they expose their chain-of-thought (the step-by-step reasoning tokens) to developers for debugging and research. Note: CoT should be treated as a developer/debug trace; avoid surfacing raw CoT to end users.

At a high level, OpenAI’s pitch is this: the models are small enough to run on common hardware, cheap enough to experiment with, and good enough at reasoning to be useful stand-alone or as building blocks inside equipment, apps, agents, and research projects. On OpenAI’s own benchmark table, the larger 120B model lands near o4-mini (one of OpenAI’s closed models) on several reasoning scores.

What exactly did OpenAI ship?

Both models use a Mixture-of-Experts (MoE) transformer. MoE turns on only a few “expert” subnetworks per token, so you get the punch of a large model while actually activating far fewer parameters each step. For gpt-oss-120b, about 5.1B parameters are active per token; for gpt-oss-20b, about 3.6B. Context length is up to 131,072 tokens (≈128k).
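To make the MoE idea concrete, here is a toy sketch of top-k expert routing (illustrative only: the expert count, layer sizes, and gating scheme are invented for the example and are not GPT-OSS’s actual architecture):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts.

    Illustrative only -- expert count, hidden sizes, and the gating scheme
    are invented for this example, not GPT-OSS's real configuration.
    """
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # only k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

tokens = torch.randn(5, 64)
print(TinyMoE()(tokens).shape)                          # torch.Size([5, 64])
```

The payoff is in the routing loop: each token touches only its top-k experts, so the total parameter count can be large while the per-token compute stays small.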

OpenAI quantized most of the model to a 4.25-bit format (MXFP4) so the 120B fits on a single 80 GB GPU, and the 20B can run with as little as 16 GB of memory. That makes local experimentation plausible for many labs and even power users.
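A rough back-of-the-envelope check on those memory targets (a sketch; the total parameter counts and the assumption that nearly all weights sit at ~4.25 bits are approximations):

```python
# Rough weight-memory estimate for the two checkpoints. Approximate: assumes
# most parameters are stored at ~4.25 bits (MXFP4), as OpenAI describes.
def weight_gb(total_params_billion, bits_per_param=4.25):
    bytes_total = total_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

for name, params_b in [("gpt-oss-20b", 21), ("gpt-oss-120b", 117)]:
    print(f"{name}: ~{weight_gb(params_b):.0f} GB of weights before activations/KV cache")

# gpt-oss-20b:  ~11 GB -> fits a 16 GB budget with headroom
# gpt-oss-120b: ~62 GB -> fits a single 80 GB GPU with headroom
```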

Reasoning and tools are first-class. Post-training focuses on chain-of-thought (CoT) reasoning—similar to how OpenAI trained its o-series models—and on agentic tool use (web browsing, Python execution, and function calling). You can dial the reasoning effort (“low/medium/high”) depending on whether you want speed or depth. The models expect OpenAI’s harmony chat format; if you skip it, performance is degraded.
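For illustration, here is a hedged sketch of calling a locally served copy through an OpenAI-compatible endpoint. It assumes you already have a local server such as Ollama or vLLM hosting the model; the base URL, the model tag, and the “Reasoning: high” system-prompt convention follow common setups and may differ in yours, and the serving stack is expected to apply the harmony chat template for you.

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g., Ollama or vLLM) is already
# serving gpt-oss-20b at this address; adjust base_url and model for your setup.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="gpt-oss:20b",  # hypothetical local model tag; match whatever your server registers
    messages=[
        # gpt-oss reads the reasoning level from the system prompt
        # ("low" / "medium" / "high"); higher means deeper but slower.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Summarize the pre-checks before re-energizing a feeder."},
    ],
)
print(response.choices[0].message.content)
```

The same call shape should work against vLLM’s OpenAI-compatible server; only the base URL and model name change.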

On OpenAI’s own chart, gpt-oss-120b is competitive on reasoning benchmarks (e.g., AIME, MMLU, GPQA) and shows strong results on tool-use and health evaluations. The 20B trails but remains respectable for its size. Remember that these are vendor-reported numbers; independent testing often finds both strengths and gaps.

Training footprint. OpenAI reports that pre-training the 120B consumed about 2.1 million H100-hours (the 20B needed roughly 10× fewer). That figure covers training compute only, separate from the harder-to-estimate costs of data collection and research time. It puts the compute cost to build the model at roughly $5–7 million.
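The dollar range is simple arithmetic once you assume an hourly rate; the rates below are assumptions based on typical 2025 cloud H100 pricing, not OpenAI figures:

```python
h100_hours = 2.1e6                 # OpenAI's reported pre-training compute for the 120B
for rate in (2.50, 3.30):          # assumed $/H100-hour; actual rates vary by provider
    print(f"${rate:.2f}/hr -> ${h100_hours * rate / 1e6:.1f}M")
# $2.50/hr -> $5.2M
# $3.30/hr -> $6.9M
```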

Safety: how OpenAI justified open weights

What is “High capability”? In OpenAI’s Preparedness Framework, high capability denotes a model that can independently accomplish or reliably accelerate hazardous tasks with minimal expert oversight; crossing this threshold would trigger stricter controls and would preclude an open-weight release. OpenAI reports results for (a) the default model and (b) an adversarially fine-tuned variant intended to simulate a determined attacker. Both were scored against the same preparedness rubric, so the “High capability” threshold is comparable across variants.

What this does not mean. Not crossing the threshold is not a guarantee of safety in the wild. Model behavior can change under fine-tuning/jailbreaks, and capabilities can compound with external tools (search, coding, lab protocols). Misuse risk remains non-zero.

Open models create tricky safety trade-offs because once weights are public, anyone can fine-tune them—including for malicious purposes—and you can’t revoke access later. To address this, OpenAI says it ran the default 120B model and an adversarially fine-tuned version through its Preparedness Framework across bio, cyber, and AI self-improvement categories. Their stated results:

  • The default model does not reach OpenAI’s “High capability” threshold in those tracked areas.
  • Even after worst-case fine-tuning (their attempt to mimic a determined attacker), the 120B still did not reach “High capability” on bio or cyber.
  • Releasing it would not meaningfully advance the open-model frontier in hazardous biological capabilities, in part because other open models (e.g., DeepSeek, Qwen 3 Thinking, Kimi K2) already approach the fine-tuned model’s scores on relevant tests.

These claims, plus external expert review, are what OpenAI points to in saying this is a responsible open-weights release, given today’s landscape.

Useful, but limited—and the real risk is copying

Pitfalls & fixes

GPT-OSS is clearly good at its target: STEM-style reasoning and tool use. But early testers have raised several complaints:

  • Guardrails feel heavy in some cases (refusals around copyrighted text or sensitive topics).
  • Hallucinations still occur, and world knowledge seems thin outside STEM/coding.
  • Configuration matters: many people may be running the models without the expected “harmony” format or proper tool wiring, which depresses results and fuels confusion.

OpenAI’s core strategic concern, though, is distillation and fast-following. By publishing strong open weights and detailed reasoning traces, OpenAI may make it easier for competitors to distill the behavior into smaller, cheaper models that run on widely available hardware. That could shorten the lead time between top U.S. labs and fast-moving teams elsewhere, such as in China, and many will argue OpenAI should not have released GPT-OSS because the primary effect will be to accelerate others more than it benefits Americans. OpenAI’s counterargument, summarized above, is that the release sets a higher safety bar for open models and enables useful research and products without significantly raising worst-case risk.

What this means in plain terms

  1. These are strong, accessible reasoning engines. If you need local or on-prem agent behavior—web browsing, calling functions, writing and running code—GPT-OSS is built for that. The 20B is attractive for laptops/workstations; the 120B is aimed at single-GPU servers. Use the harmony format and set reasoning level appropriately, or you will leave capability on the table.
  2. They are not general-knowledge encyclopedias. Don’t expect the breadth of world knowledge seen in the largest proprietary chat models. For broad Q&A, you’ll still want retrieval (search/RAG) or a larger hosted model.
  3. Safety is improving but not solved. OpenAI’s own testing says the release does not cross its risk thresholds—even when fine-tuned to try—but that doesn’t eliminate misuse risk. Open-weight releases redistribute who can build powerful systems, and that’s the heart of the debate.
  4. The geopolitics are real. The critics’ worry is that publishing capable, well-engineered open weights will accelerate competitors via distillation and reverse-engineering of training methods. Whether that effect is large or marginal is an empirical question we’ll learn more about in the coming months.

Brainstormed ideas for use at an electric utility

At an electric utility, GPT-OSS can run as a field copilot on a crew laptop, a rugged tablet, or a substation computer. With a small local “knowledge pack” (switching procedures, safety manuals, equipment prints, GIS snippets, and standard forms), the 20B model can work offline or over weak links to:

  • generate tailboard briefings;
  • turn plain-language requests into step-checked switching sequences (never auto-execute; require two-person review);
  • prefill work orders and as-built notes from voice dictation;
  • triage AMI/SCADA alarms into short “what to check next” lists;
  • call local tools (camera/thermal reader, barcode/asset lookup, maps) and stitch the results into a single report;
  • translate and summarize storm updates; and
  • validate PPE/permit prerequisites against the job plan.

Running open weights on-device keeps customer and grid data local, and you can fine-tune with your own tags and terminology. Use the harmony format and guardrails so the agent cannot issue switching steps without explicit human approval; log structured reasoning traces for QA, but don’t show raw CoT to the crew. For heavier reasoning or fleet-wide analytics, hand off to the 120B back at the depot or data center.
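As a minimal sketch of the “knowledge pack” pattern, the snippet below ranks local documents by crude keyword overlap and assembles an offline briefing prompt. Everything here is hypothetical: the directory name, the scoring, and the prompt wording are invented for the example, and a production deployment would more likely use an embedding index.

```python
from pathlib import Path

# Hypothetical local knowledge pack synced to the tablet before the crew leaves the yard:
# switching procedures, safety manuals, equipment prints, standard forms, etc.
KNOWLEDGE_PACK = Path("knowledge_pack")

def retrieve(query: str, top_n: int = 3) -> list[str]:
    """Crude offline retrieval: rank local text files by keyword overlap with the query."""
    if not KNOWLEDGE_PACK.is_dir():
        return []
    terms = set(query.lower().split())
    scored = []
    for doc in KNOWLEDGE_PACK.glob("*.txt"):
        text = doc.read_text(encoding="utf-8", errors="ignore")
        score = sum(text.lower().count(t) for t in terms)
        scored.append((score, doc.name, text[:2000]))   # keep excerpts small for the prompt
    return [f"[{name}]\n{excerpt}"
            for score, name, excerpt in sorted(scored, reverse=True)[:top_n] if score > 0]

def briefing_prompt(job: str) -> str:
    """Assemble an offline prompt for a tailboard briefing; the model runs locally on-device."""
    context = "\n\n".join(retrieve(job)) or "(no matching documents found)"
    return (
        "Using ONLY the excerpts below, draft a tailboard briefing for this job. "
        "Flag any missing permits or PPE as OPEN QUESTIONS for the foreman.\n\n"
        f"JOB: {job}\n\nEXCERPTS:\n{context}"
    )

print(briefing_prompt("Replace cutout on feeder 12, pole 47-B"))
```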

Utilities can also embed the 20B model directly into edge gear: substation gateways, feeder automation controllers, reclosers, line-sensor hubs, DER/EVSE controllers, or inspection drones. In this pattern, GPT-OSS sits beside an allow-listed tool layer that speaks DNP3 / IEC 61850 / Modbus / REST. Operators express plain-language intents (e.g., “prepare to re-energize feeder 12; run the clearance checklist”), and the agent expands that request into bounded command sequences: run pre-checks (permits/PPE/zone-of-protection), query sensors, execute only permitted actions, and emit an auditable log with evidence (photos/IR frames/PMU snapshots). On-device reasoning provides low latency and offline operation during storms or backhaul loss, reduces cognitive load, and standardizes procedures.

Some things to keep in mind:

  • Keep human-in-the-loop controls: two-person confirmation for state-changing steps
  • Simulate against a digital twin before execution
  • Gate every tool behind RBAC, interlocks, and time-boxed workflows
  • Log structured reasoning traces for QA
  • Never display raw Chain of Thought to crews
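As a sketch of how those guardrails might sit between the model’s proposed tool calls and the device, the snippet below gates every call through an allow-list, a two-person approval check, and a structured audit log. The tool names, approval flow, and protocol hand-off are hypothetical, not a real DNP3/IEC 61850 integration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical allow-list: only these tools exist as far as the agent is concerned.
READ_ONLY_TOOLS = {"read_breaker_status", "read_line_sensor", "run_clearance_checklist"}
STATE_CHANGING_TOOLS = {"open_breaker", "close_breaker"}   # always gated behind humans

@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)
    def record(self, **event):
        event["ts"] = datetime.now(timezone.utc).isoformat()
        self.entries.append(event)                          # structured trace for QA review

def dispatch(tool: str, args: dict, approvals: set[str], log: AuditLog) -> str:
    """Gate every model-proposed tool call before it reaches the protocol layer."""
    if tool not in READ_ONLY_TOOLS | STATE_CHANGING_TOOLS:
        log.record(tool=tool, args=args, outcome="rejected: not on allow-list")
        return "REJECTED: unknown tool"
    if tool in STATE_CHANGING_TOOLS and len(approvals) < 2:
        log.record(tool=tool, args=args, outcome="held: needs two-person confirmation")
        return "HELD: requires two-person confirmation"
    log.record(tool=tool, args=args, approvals=sorted(approvals), outcome="executed")
    return f"OK: {tool} executed"                           # real code would call DNP3/61850 here

log = AuditLog()
print(dispatch("read_breaker_status", {"feeder": 12}, approvals=set(), log=log))
print(dispatch("open_breaker", {"feeder": 12}, approvals={"op_martin"}, log=log))
print(dispatch("open_breaker", {"feeder": 12}, approvals={"op_martin", "op_chen"}, log=log))
```

The design point is that the model only ever proposes actions; the dispatcher, not the model, decides what reaches the protocol layer.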

Bottom line

  • What OpenAI delivered: two open-weight reasoning models, licensed for broad use, that run on modest hardware, show strong STEM/tool performance, and expose full chain-of-thought for developers.
  • Why some say it’s “already old news”: GPT-5 arrived right after, and many tests suggest these models excel in narrow reasoning but lag in general knowledge and polish compared to top closed models.
  • The strategic dispute: OpenAI argues the safety bar is high enough and the benefits to research and products are substantial. Critics argue the main outcome will be faster copying and catch-up by others, so the release was a mistake. Reasonable people can disagree; both the capabilities and the risks will become clearer as the ecosystem builds on GPT-OSS.

Use GPT-OSS if…

  • You need on-prem/private agent workflows (tools, code exec, browsing) with tight data control.
  • You have ~16 GB (20B) or ~80 GB (120B) available and want predictable local costs.
  • You’re building STEM/coding workflows where CoT reasoning shines.

Prefer Global Models if…

  • You need the broadest world knowledge and polished chat UX out-of-the-box.
  • You lack the hardware/ops to host 20B/120B reliably at your target latency.
  • You prioritize top-end performance on general QA, creative tasks, or multilingual breadth.

How to get it running

As of Aug 10, 2025. Prices/configs change frequently—verify on vendor pages.

What the models need. OpenAI quantizes most of the Mixture-of-Experts (MoE) weights to a 4.25-bit format (MXFP4). That is what lets the 120B model run within roughly 80 GB of memory and the 20B model run within 16 GB. In practice you still want some headroom for activations and tooling, but those are the right ballpark numbers.

Quick sizing tips.

  • Disk space: Keep at least 2× the checkpoint size available for swaps and temporary files (≈25–30 GiB for 20B, ≈120–150 GiB for 120B with logs, caches, and extra safeties). The model card lists ~12.8 GiB and ~60.8 GiB checkpoints for 20B and 120B respectively.
  • Thermals and power: Laptops will run 20B but will throttle sooner under sustained load; desktops hold clocks better for 120B. (See the desktop-class configs in the table below for a smoother experience.)
  • Software setup: To use multiple GPUs on Windows towers, choose an inference stack that supports tensor/model parallelism.
| Vendor | Device (chip / GPU) | Form factor | Status | RAM / unified memory (vendor-published) & GPU VRAM (per card, if relevant) | Good for 20B | Good for 120B | Price (USD) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Apple | Mac mini (M4) | Small desktop | Shipping | 16 GB base → 24/32 GB configurable | Yes | | From $599 (M4) |
| Apple | Mac mini (M4 Pro) | Small desktop | Shipping | 24 GB base → 48/64 GB configurable | Yes | | From $1,399 (M4 Pro) |
| Apple | iMac 24-inch (M4) | All-in-one | Shipping | 16 GB base → up to 32 GB configurable | Yes | | |
| Apple | MacBook Air (M4, 13/15-inch) | Laptop | Shipping | 16 GB base → 24/32 GB configurable | Yes | | From $999 (13-inch); from $1,199 (15-inch) |
| Apple | MacBook Pro 16-inch (M4 Max) | Laptop | Shipping | 36/48 GB base → 128 GB configurable | Yes | Yes (at 128 GB) | From $3,999 for 20B; from $4,999 for 120B |
| Apple | Mac Studio (M4 Max) | Small desktop | Shipping | Up to 128 GB unified memory | Yes | Yes | From $1,999 for 20B; from $3,499 for 120B |
| Apple | Mac Studio (M3 Ultra) | Small desktop | Shipping | Up to 512 GB unified memory | Yes | Yes | From $3,999 |
| Apple | Mac Pro (M2 Ultra) | Tower | Shipping | Up to 192 GB unified memory | Yes | Yes | From $6,999 (tower) |
| Dell | Precision 5690 (Intel Core Ultra + RTX Ada) | Mobile workstation | Shipping | Up to 64 GB LPDDR5x 7467 MT/s (onboard); GPU options up to RTX 5000 Ada (16 GB; VRAM varies by card) | Yes | No | Starts around $2,549 |
| Dell | Precision 5860 Tower (Xeon W + RTX/Radeon Pro) | Tower | Shipping | Up to 2 TB DDR5 ECC RDIMM; up to RTX 6000 Ada 48 GB per GPU (single or dual) | Yes | Yes (dual 48 GB via sharding) | From $2,829 (base config); ~$20,000 for a 120B-capable config |
| Dell | Precision 7875 Tower (Ryzen Threadripper Pro + RTX Ada) | Tower | Shipping | Up to 2 TB DDR5 ECC RDIMM; up to RTX 6000 Ada 48 GB per GPU (single or dual) | Yes | Yes (dual 48 GB via sharding) | From $3,789 (base config); ~$20,000 for a 120B-capable config |
| Apple | iPhone 16e | Phone | Shipping | A18; 8 GB RAM (confirmed by dev tools/teardowns) | No | No | From $599 |
| Apple | iPhone 16 Pro Max | Phone | Shipping | A18 Pro; 8 GB RAM (all iPhone 16 models) | No | No | From $1,199 |
| Apple | iPhone 17 Pro Max (2025) | Phone | Rumored | A19 Pro; 12 GB RAM expected | No | No | TBD (likely near $999) |
| Apple | iPhone 18 Pro Max (2026) | Phone | Rumored | A20 Pro (2 nm); 16 GB RAM rumored | Yes | No | TBD (likely near $999) |
| Apple | iPhone 19 Pro Max (2027) | Phone | Rumored / Projected | A21-class; ≥16 GB RAM; 20th-anniversary model | Yes | No | TBD (likely near $1,999) |
| Apple | iPad Pro (M4), 11-/13-inch | Tablet | Shipping | M4; 8 GB (256/512 GB) or 16 GB (1–2 TB) | Yes (16 GB configs only)‡ | No | From $1,599 (11″) |
| Apple | “Next” iPad Pro (M5, 2025) | Tablet | Rumored | M5; fall 2025 window; memory tiers likely similar to M4 (final RAM TBD) | Yes (16 GB configs only)‡ | No | From $1,599 (11″) |
| Apple | iPad Pro (M6, ~H1 2027) | Tablet | Rumored | M6; early-2027 refresh; 16–32 GB | Yes | No | From $999 (11″) |

Notes & sources (workstations):

  • “Good for 20B” assumes you can allocate ~16 GB to the model. Any recent Mac with at least 16 GB unified memory qualifies; on Windows/Linux, a 16 GB-VRAM GPU or enough system RAM for CPU inference will work, with lower throughput. ([OpenAI][computer_table_1], [GitHub][computer_table_2])
  • “Good for 120B” means you either have a single ~80 GB GPU (as OpenAI’s target) or you shard the model across multiple GPUs (e.g., two 48 GB RTX 6000 Ada cards in a Dell Precision tower). On Apple Silicon, a single machine with ≥128 GB unified memory (e.g., Mac Studio M4 Max 128 GB or MacBook Pro M4 Max 128 GB) provides enough headroom for on-device runs, with speed depending on your Metal-capable runtime. ([OpenAI][computer_table_1], [Apple][computer_table_14])
  • Apple memory capacities come from the official Tech Specs or buy pages (Mac mini M4/M4 Pro, iMac M4, MacBook Air M4, MacBook Pro M4 Pro/Max, Mac Studio M4 Max/M3 Ultra, Mac Pro M2 Ultra). Prices listed are the current Apple Store “from” prices or example configuration prices on the cited pages. ([Apple Support][1], [Apple][3])
  • Dell RAM maxima, GPU options, and starting prices are from Dell’s configurator and documentation for Precision 5690/5860/7875. ([Dell][11])
  • Key: Shipping = Apple has released the device. Rumored = multiple credible reports. Projected = trend-based forecast from release cadence and chip roadmaps.
  • Reminder of OpenAI’s on-device targets: gpt-oss-20b ~16 GB memory; gpt-oss-120b ~80 GB (single GPU or equivalent unified memory / sharded GPUs). ([OpenAI][1])
  • Pricing: For the MacBook Pro line, Apple’s own pages confirm the 128 GB unified memory option, and Apple/MacRumors list typical configure-to-order adders. The ~$4,799 figure shown is a clean configuration math example using Apple’s base 16-inch price (+ chip + memory). Final price varies with storage and display options. ([Apple Support][2], [Apple][3], [MacRumors][4])
  • Fit vs. GPT-OSS targets: OpenAI’s guidance is ~16 GB for 20B and ~80 GB for 120B. Mobile devices generally do not expose that memory headroom to apps, so they’re listed here for completeness, not as recommended hosts.

