AI for Energy

Case Studies

AI Agents and Models for Energy Plants

The Challenge of an Operator

In a power plant, resolving an equipment issue is far more complex than simply fixing a broken component. Operators and technicians must diagnose the problem, review engineering drawings, assess operational impacts, prepare a work plan, isolate the equipment safely, perform maintenance, verify the results, and document the entire process before the equipment can return to service.

Understanding Complex Industrial Systems

A critical part of this workflow involves analyzing multiple engineering representations—including P&IDs, logic diagrams, electrical schematics, single-line diagrams, physical layouts, and operational time-series data—to understand equipment behavior and system-wide dependencies. Therefore, an AI assistant designed for power plant operations must possess a strong capability to interpret and reason over diverse engineering diagrams and multimodal industrial information.

Data Beyond the Web

Moreover, the challenge extends far beyond understanding complex engineering documents. Much of the knowledge required to operate and maintain a power plant exists as tacit expertise accumulated by experienced operators and engineers over decades. In addition, many critical data sources, procedures, and operational records are proprietary and never leave the organization. As a result, the information needed to solve real-world problems is often unavailable on the public internet, making it significantly more difficult for generic AI models to acquire the necessary domain knowledge.

PowerBench (Thermal): Building the Evaluation Foundation for Power Plants

Because no benchmark existed to measure AI capabilities in power plant operations, we built a comprehensive evaluation suite covering three tiers of industrial capability: Knowledge, Reasoning, and Decision.

Knowledge — what an operator should know.

Engineering Knowledge — university- and graduate-level thermodynamics, fluid dynamics, heat transfer, thermal-fluid engineering.
Static Diagram Understanding — schematics, P&IDs, electrical diagrams, control logic diagrams.

Reasoning — how an operator analyzes what is happening.

Operational Reasoning — root-cause analysis, impact assessment, response planning, consequence prediction.
Dynamic Diagram Understanding — inferring equipment relationships, process behavior, effects of interventions.

Decision — what an operator actually does, end-to-end.

End-to-end Industrial Operations — full operator-task replication on real plant tasks: document retrieval, drawing navigation, equipment relationship analysis, procedure execution, problem-solving sessions. Expert-rubric assessed against captured operator workflows.
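
For readers who want to work with the taxonomy programmatically, the three tiers and five categories listed above can be written down as a small data structure. This is only an illustrative sketch: the `Category` class, the `POWERBENCH_THERMAL` constant, and the `categories_by_tier` helper are hypothetical names chosen for this example, not part of any published PowerBench tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """One PowerBench (Thermal) category, as described in the text above."""
    name: str
    tier: str                 # "Knowledge", "Reasoning", or "Decision"
    covers: tuple[str, ...]   # topics named in the category description


# Hypothetical encoding of the taxonomy; names and topics are copied from the text.
POWERBENCH_THERMAL = (
    Category("Engineering Knowledge", "Knowledge",
             ("thermodynamics", "fluid dynamics", "heat transfer",
              "thermal-fluid engineering")),
    Category("Static Diagram Understanding", "Knowledge",
             ("schematics", "P&IDs", "electrical diagrams",
              "control logic diagrams")),
    Category("Operational Reasoning", "Reasoning",
             ("root-cause analysis", "impact assessment",
              "response planning", "consequence prediction")),
    Category("Dynamic Diagram Understanding", "Reasoning",
             ("equipment relationships", "process behavior",
              "effects of interventions")),
    Category("End-to-end Industrial Operations", "Decision",
             ("document retrieval", "drawing navigation",
              "equipment relationship analysis", "procedure execution",
              "problem-solving sessions")),
)


def categories_by_tier(categories):
    """Group category names by tier, preserving the order they were listed in."""
    grouped = {}
    for category in categories:
        grouped.setdefault(category.tier, []).append(category.name)
    return grouped


grouped = categories_by_tier(POWERBENCH_THERMAL)
for tier, names in grouped.items():
    print(f"{tier}: {', '.join(names)}")
```

Grouping by tier makes the imbalance visible at a glance: Knowledge and Reasoning each have two categories, while Decision rests on a single end-to-end category.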

Today, every model we evaluate, both ours and frontier-class, scores 0% on the Decision tier. Closing that gap is what our deployments are for: assembling the operator-task corpus and the grading rubric required to score the tier credibly, and training the model that closes the gap.

A Specialized Foundation Model for Energy

Using the datasets and benchmarks we developed, we trained Gravity-16B-A3B, one of our latest lightweight STEM-specialized models. Despite its compact size and significantly lower training cost, the model achieved state-of-the-art results within the power plant domain and performed comparably to leading frontier models such as Claude Opus 4.6 and DeepSeek V4 Pro across multiple evaluation categories.

PowerBench composite (Knowledge + Reasoning + Decision, equal-weighted) vs. inference cost. With Decision at 0% across all evaluated models today, the composite caps in the 60s: even perfect Knowledge and Reasoning scores would average to at most 66.7% once the zero Decision score is included. Frontier-model leads on Knowledge and Reasoning do not extend to end-to-end industrial operations.
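
The ceiling on the composite follows directly from the equal weighting. As a minimal sketch, assuming each tier is scored as a percentage from 0 to 100 and the composite is the plain mean of the three tier scores (the function name `composite` and the dictionary keys are illustrative, not taken from the benchmark's own code):

```python
from statistics import fmean

TIERS = ("knowledge", "reasoning", "decision")


def composite(scores):
    """Equal-weighted mean of the three tier scores, each a percentage (0-100)."""
    missing = [tier for tier in TIERS if tier not in scores]
    if missing:
        raise ValueError(f"missing tier scores: {missing}")
    return fmean([scores[tier] for tier in TIERS])


# Hypothetical best case: perfect Knowledge and Reasoning, 0% on Decision.
ceiling = composite({"knowledge": 100.0, "reasoning": 100.0, "decision": 0.0})
print(f"{ceiling:.1f}")  # prints 66.7
```

Under these assumptions, no gain on Knowledge or Reasoning can lift a model past two thirds of the scale; only a non-zero Decision score can.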

The Next Frontier

Specialization at 16B parameters validates the thesis that frontier-class industrial capability does not require frontier-scale compute. We are now scaling this approach with Gravity Flash, a specialized model designed to deliver frontier-class performance at sub-frontier cost. First internal PowerBench results are expected in Q3 2026.

The advantage we are building toward is not in parameter count. It is in the corpus required to score and close the Decision tier. Every plant we deploy in generates the operator workflows, decisions, and corrections that the open web does not contain — and that no public benchmark release ever will. Each deployment makes the next one easier to evaluate, and the model harder for a generic frontier lab to catch.