September 26, 2025

New GenAI System Built to Accelerate HPC Operations Data Analytics

AI continues to play a key role in scientific research – not just in driving new discoveries but also in how we understand the tools behind those discoveries. High-performance computing (HPC) has been at the heart of major scientific breakthroughs for years. However, as these systems grow in size and complexity, they’re becoming harder to make sense of.

The limitations are clear. Scientists can see what their simulations are doing, but often can’t explain why a job slowed down or failed without warning. The machines generate mountains of system data, but most of it is hidden behind dashboards made for IT teams, not researchers. There’s no easy way to explore what happened. Even when the data is available, working with it takes coding, data engineering, and machine learning skills that many scientists don’t have. The tools are slow, static, and hard to adapt.

Scientists at Sandia National Laboratories are trying to change that. They’ve built EPIC (Explainable Platform for Infrastructure and Compute), an AI-driven platform designed to augment operational data analytics by bringing the emerging capabilities of generative AI foundation models to HPC operations.

Researchers can use EPIC to see what is happening inside a supercomputer using plain language. Instead of digging through logs or writing complex commands, users can ask simple questions and get clear answers about how jobs ran or what slowed a simulation down.
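
To make that concrete, here is a purely hypothetical sketch of the question-to-query pattern the article describes. The `node_telemetry` table, the sample readings, and the generated SQL are all invented for illustration; EPIC’s actual schema and prompts are not published in the article.

```python
# Hypothetical illustration of the question-to-query pattern the article
# describes. The table schema, data, and SQL are invented; they are not
# EPIC's actual interface.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE node_telemetry (
                    job_id INTEGER, node_id TEXT,
                    cpu_temp_c REAL, power_watts REAL)""")
conn.executemany(
    "INSERT INTO node_telemetry VALUES (?, ?, ?, ?)",
    [(48213, "n001", 71.5, 410.0),
     (48213, "n002", 93.2, 455.0),   # a hot node: a plausible slowdown cause
     (48213, "n003", 70.9, 405.0)])

question = "Why did job 48213 run slower than usual?"

# The kind of SQL a text-to-SQL model might generate for that question:
generated_sql = """
    SELECT node_id, cpu_temp_c, power_watts
    FROM node_telemetry
    WHERE job_id = 48213
    ORDER BY cpu_temp_c DESC
"""

for row in conn.execute(generated_sql):
    print(row)   # the overheating node surfaces at the top
```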

“EPIC aims to augment various data driven tasks such as descriptive analytics and predictive analytics by automating the process of reasoning and interacting with high-dimensional multi-modal HPC operational data and synthesizing the results into meaningful insights,” the researchers write.

The people behind EPIC were aiming for more than just another data tool. They wanted something that would actually help researchers ask questions and make sense of the answers. Instead of building a dashboard with knobs and graphs, they tried to design an experience that felt more natural: something closer to a back-and-forth conversation than a command-line prompt. Researchers can stay focused on their line of inquiry without jumping between interfaces or digging through logs.

What powers that experience is AI working in the background. It draws from many sources, such as log files, telemetry, and documentation, and brings them together into a coherent picture. Researchers can follow system behavior, identify where slowdowns happen, and spot patterns, all without needing to code or call in support. EPIC helps make complicated infrastructure feel more understandable and less overwhelming.
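
The article describes this fusion only at a high level. One common way to implement it is retrieval-style indexing, where snippets from logs, telemetry, and documentation are embedded into a single searchable store and the most relevant ones are handed to a language model as context. The sketch below illustrates that pattern; the model choice and the sample snippets are assumptions, not EPIC’s implementation.

```python
# Minimal retrieval sketch: pool snippets from logs, telemetry, and docs
# into one embedding index, then fetch the most relevant context for a
# question. Illustrative only; EPIC's actual pipeline is not published here.

from sentence_transformers import SentenceTransformer, util

snippets = [
    "syslog: node n002 thermal throttle engaged at 14:02",        # log
    "telemetry: job 48213 mean CPU temp 93C on n002",             # telemetry
    "docs: sustained temps above 90C trigger frequency capping",  # documentation
]

model = SentenceTransformer("all-MiniLM-L6-v2")
index = model.encode(snippets, convert_to_tensor=True)

question = "Why did job 48213 slow down?"
scores = util.cos_sim(model.encode(question, convert_to_tensor=True), index)[0]

# Hand the top-scoring snippets to a language model as context.
for score, text in sorted(zip(scores.tolist(), snippets), reverse=True):
    print(f"{score:.2f}  {text}")
```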

To make that possible, the team behind EPIC developed a modular architecture that links general-purpose language models with smaller models trained specifically for HPC tasks. This setup allows the system to handle different types of data and generate a range of outputs, from simple answers to charts, predictions, or SQL queries. 
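
A minimal sketch of that hub-and-spoke pattern appears below, assuming a lightweight zero-shot classifier as the routing step. The route labels and model names are hypothetical stand-ins; the paper describes the architecture, not this exact code.

```python
# Minimal sketch of a router that sends questions to task-specific models.
# Route labels, model names, and the zero-shot classifier choice are
# assumptions for illustration, not EPIC's implementation.

from transformers import pipeline

# Candidate task categories, loosely mirroring the outputs the article
# mentions: plain answers, charts, predictions, and SQL queries.
ROUTES = {
    "general question answering": "llama-3-8b-hpc-qa",     # hypothetical fine-tune
    "sql generation":             "llama-3-8b-hpc-sql",    # hypothetical fine-tune
    "telemetry forecasting":      "hpc-timeseries-model",  # hypothetical model
    "chart generation":           "llama-3-8b-hpc-viz",    # hypothetical fine-tune
}

# A small zero-shot classifier stands in for EPIC's routing engine.
router = pipeline("zero-shot-classification",
                  model="facebook/bart-large-mnli")

def route_question(question: str) -> str:
    """Return the name of the model that should handle this question."""
    result = router(question, candidate_labels=list(ROUTES))
    best_label = result["labels"][0]   # highest-scoring category
    return ROUTES[best_label]

if __name__ == "__main__":
    q = "How much power did the climate run on rack 12 draw last night?"
    print(route_question(q))
```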

By fine-tuning open models instead of relying on massive commercial systems, they were able to keep performance high while reducing costs. The goal was to give scientists a tool that adapts to the way they think and work, not one that forces them to learn yet another interface.
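
As an illustration of how cheap such an adaptation can be, the snippet below attaches LoRA adapters to an open 8B model so that only a small fraction of its weights are trained. The hyperparameters and target modules are placeholders, not Sandia’s recipe.

```python
# Sketch of the kind of low-cost fine-tune the article alludes to: adapting
# an open 8B model with LoRA adapters instead of training a large model
# from scratch. Hyperparameters and dataset are placeholders, not EPIC's.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the weights
```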

In testing, the system performed well across a range of tasks. Its routing engine could accurately direct questions to the right models, reaching an F1 score of 0.77. Smaller models, such as Llama 3 8B variants, handled complex tasks like SQL generation and system prediction more effectively than larger proprietary models. 
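
For reference, a routing F1 score like the reported 0.77 is typically computed by comparing the router’s predicted task categories against ground-truth labels. The tiny example below shows the mechanics with fabricated labels; the paper reports only the aggregate number.

```python
# How a multi-class routing F1 score is typically computed.
# The labels below are fabricated for illustration only.

from sklearn.metrics import f1_score

true_routes = ["sql", "qa", "forecast", "sql", "qa", "viz"]
pred_routes = ["sql", "qa", "forecast", "qa",  "qa", "viz"]

# Macro-averaged F1 weights every task category equally, which matters
# when some task types are much rarer than others.
print(f1_score(true_routes, pred_routes, average="macro"))
```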

EPIC’s forecasting tools also proved reliable. It produced accurate estimates for temperature, power, and energy use across different supercomputer workloads. Perhaps most importantly, the platform delivered these results at a fraction of the cost and compute overhead typically expected of such systems. For researchers working on complex systems with limited support, that kind of efficiency can make a significant difference.
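
The article doesn’t specify the forecasting models, but the task itself, predicting the next telemetry reading from recent history, is straightforward to sketch. The example below trains a gradient-boosted regressor on a synthetic power trace; the data and model choice are assumptions for illustration.

```python
# Illustrative sketch of telemetry forecasting: predict a node's next
# power reading from its recent history. Synthetic data and the model
# choice are assumptions; the article does not specify EPIC's models.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic power trace (watts): a daily cycle plus noise, one sample/minute.
t = np.arange(2000)
power = 300 + 50 * np.sin(2 * np.pi * t / 1440) + rng.normal(0, 5, t.size)

# Turn the series into supervised pairs: last 10 readings -> next reading.
LAGS = 10
X = np.stack([power[i:i + LAGS] for i in range(len(power) - LAGS)])
y = power[LAGS:]

split = int(0.8 * len(X))
model = GradientBoostingRegressor().fit(X[:split], y[:split])

preds = model.predict(X[split:])
mae = np.abs(preds - y[split:]).mean()
print(f"Mean absolute error: {mae:.1f} W")  # small relative to the ~300 W signal
```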

“There is an unmistakable gap between data and insight mainly bottlenecked by the complexity of handling large amounts of data from various sources while fulfilling multi-faceted use cases targeting many different audiences,” emphasized the researchers.


Closing that last mile between raw system data and real insight remains one of the biggest hurdles in high-performance computing. EPIC offers a glimpse of what’s possible when AI is woven directly into the analytics process rather than bolted on as an afterthought. It can help reshape how scientists interact with the tools that power their work. As models improve and systems scale even further, platforms like EPIC could help ensure that understanding keeps pace with innovation.

Related Items

MIT’s CHEFSI Brings Together AI, HPC, And Materials Data For Advanced Simulations

Feeding the Virtuous Cycle of Discovery: HPC, Big Data, and AI Acceleration

Deloitte Highlights the Shift From Data Wranglers to Data Storytellers

