June 26, 2025

Are Data Engineers Sleepwalking Towards AI Catastrophe?

Since the earliest days of big data, data engineers have been the unsung heroes doing the dirty work of moving, transforming, and prepping data so highly paid data scientists and machine learning engineers can do their thing and get the glory. As the agentic AI era dawns on us, it opens up a host of new data engineering opportunities–as well as potentially catastrophic pitfalls.

Frank Weigel, the former Google and Microsoft executive who was recently hired by Matillion to be its new chief product officer, openly wondered to a reporter recently whether the agentic AI era was on a glideslope for disaster.

“Basically, we see there’s a huge problem coming for data engineering teams,” Weigel said in an interview during the recent Snowflake Summit. “I’m not sure everybody is fully aware of it.”

Here’s the issue, as Weigel explained it:

The explosion of source data is one aspect of the problem. Data engineers who are accustomed to working with structured data are now being asked to manage, prep, and transform unstructured data, which is more difficult to work with, but which ultimately is the fuel for most AI (i.e. words and pictures processed by neural networks).

Data engineers are already overworked. Weigel cited a study that indicated 80% of data engineering teams are already overloaded. But when you add AI and unstructured data to the mix, the workload issue becomes even more acute.

Agentic AI provides a potential solution. It’s natural that overworked data engineering teams will turn to AI for help. There’s a bevy of providers building copilots and swarms of AI agents that, ostensibly, can build, deploy, monitor, and fix data pipelines when they break. We are already seeing agentic AI have real impacts on data engineering teams, as well as the downstream data analysts who ultimately are the ones requesting the data in the first place.

But according to Weigel, if we implement agentic AI for data engineering the wrong way, we’re potentially setting a trap for ourselves that will be tough to get out of.

The problem that he’s foreseeing would stem from AI agents that access source data on their own. If an analyst can kick off an agentic AI workflow that ultimately involves the AI agent writing SQL to obtain a piece of data from some upstream system, what happens when something goes wrong with the data pipeline? AI agents might be able to fix basic problems, but what about serious ones that demand human attention?

“You will have autonomous AI agents that run entire business functions,” Weigel said. “But equally, they start to have a huge need for data. And so if the data team already was overloaded before, well, it’s now going to be like looking down the abyss and saying ‘How the heck can we do anything? How am I going to have a human data engineer answer a question from an AI agent?’”

Once human data engineers are out of the loop, bad things can start happening, Weigel said. They potentially face a situation where the volume of data requests–which originally were served by human data engineers but now are being served by AI agents–is beyond their capability to keep up.

The accuracy of data will also suffer, he said. If every AI agent writes its own SQL and pulls data directly out of its source, the odds of getting the wrong answer go up considerably.

“We’re now back in the dark ages, where we were 10 years ago [when we wondered] why we need data warehouses,” he said. “I know that if person A, B, and C ask a question, and previously they wrote their own queries, they got different results. Right now, we ask the same agent the same question, and because they’re non-deterministic, they will actually create different queries every time you ask it. And as a result, you now have the different business functions all getting different answers, insisting of course that it’s right.

Matillion CPO Frank Weigel

“You have lost all the governance and control of why you established a central data team,” Weigel continued. “And for me, that’s the angle that I think a lot of data orgs haven’t really thought about. When I get a demo of an AI agent, they never talk about that. They just have the agent access the data directly. And sure, it can. But the problem is, it shouldn’t really.”

The answer to this dilemma, according to Weigel, is twofold. First, it’s important to keep data warehouses, as they serve as a repository for data that has been vetted, checked, and standardized.
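The divergence Weigel describes is easy to reproduce on a toy scale. The sketch below is an illustration only, not anything Matillion ships: the `orders` table, the `revenue` view, and the two ad hoc queries are all hypothetical, and an in-memory SQLite database stands in for both the source system and the warehouse. It shows how two reasonable-looking queries for "revenue" can disagree, while a single vetted definition returns the same number to every caller.

```python
import sqlite3

# In-memory database standing in for a source system (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        (1, 100.0, 'complete'),
        (2,  50.0, 'refunded'),
        (3,  25.0, 'complete');
    -- Hypothetical governed definition of revenue: completed orders only.
    CREATE VIEW revenue AS
        SELECT amount FROM orders WHERE status = 'complete';
""")


def scalar(sql: str) -> float:
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Two plausible queries an agent might write for "what is our revenue?"
adhoc_a = scalar("SELECT SUM(amount) FROM orders")  # silently includes the refund
adhoc_b = scalar("SELECT SUM(amount) FROM orders WHERE status <> 'refunded'")

# Every caller that goes through the vetted view gets the same definition.
governed = scalar("SELECT SUM(amount) FROM revenue")

print("ad hoc A:", adhoc_a)
print("ad hoc B:", adhoc_b)
print("governed:", governed)
```

Here the first ad hoc query counts the refunded order and the second does not, so two agents asking the same business question would report different totals. Routing both through the shared view removes that ambiguity, which is the role Weigel assigns to the warehouse.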

It’s also critical to keep humans in the loop, according to Weigel. And to keep humans in the loop, human data engineers must somehow be prevented from becoming completely overwhelmed by the unstructured data requests and the new AI workflows. To accomplish that, he said, they essentially must become superhuman data engineers, augmented with AI.

Matillion is building its agentic AI solutions around this strategy. Instead of setting AI agents loose to write their own SQL against source data systems, Matillion is using AI agents as supporting cast members whose goal is to assist the human data engineer in getting the work done.

This on-demand team of virtual data engineers is dubbed Maia, which the company announced earlier this month. The agents, which run in the Matillion Data Productivity Cloud (DPC), are able to assist data engineers with a range of tasks, including creating data connectors, building data pipelines, documenting changes, testing pipelines, and analyzing failures.

“We need to supercharge the data engineering function, and we need to let them match the AI capabilities,” he said. “Instead of just a copilot concept, it has become a component, a selection of different data engineers that have different tasks. They can do different things.”

Maia acts as the lead agent that controls various sub-agents. The company has three or four such data engineering sub-agents today, Weigel said, and it will have more in the future. Maia, which is built using a collection of large language models (LLMs), including Anthropic’s Claude, can even correct itself when it does something wrong.

Matillion is close to shipping a preview of Maia

“It’s really fascinating,” Weigel said. “When you see it work, it will break down the problem into the steps. Then it will start doing it. It will look at the data and decide whether it’s going on the right track. It might roll back. ‘That wasn’t quite right.’ And so it really is like a data engineer in its task and thinking, including looking at the data. It will ask the human at certain points if it wants input.”

Despite the potential for agentic autonomy, that is not part of the Matillion plan, as the company sees the human engineer as a critical backstop that can’t be eliminated from the equation.

Another important backstop that could help Matillion customers avoid agentic AI pitfalls: No AI generation of SQL.

While LLMs like Claude have gotten really, really good at writing SQL, Matillion will not hand the reins over to AI for this critical component. The ETL vendor has been automatically generating SQL as part of its data pipeline solution for Snowflake, Databricks, and other cloud data warehouses for years, and it’s not about to start from scratch.

“The secret in Matillion is we’ve abstracted that layer so we’re much closer to the user intent,” Weigel said. “So the user is building that data pipeline intent with predefined building blocks that ultimately write SQL. But it’s Matillion that writes SQL, not the user.”

This approach also avoids the problem of getting spaghetti SQL code that can’t be updated and modified over time, which is a possibility with AI-generated code.

“We have this abstraction of this intermediate representation of these components that in turn issues SQL,” Weigel said. “And so our agent doesn’t have to generate whatever code you need. Instead, it’s about picking the right component and configuring the right component and then sequencing them together.”
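The idea of an intermediate representation that "in turn issues SQL" can be sketched in a few lines. The code below is a hypothetical illustration, not Matillion's component model or API: the `Filter` and `Aggregate` classes and the `compile_pipeline` function are invented names. The point it demonstrates is that when an agent only picks, configures, and sequences predefined components, the SQL comes from a fixed template and is identical on every run.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Filter:
    """Hypothetical component: keep rows matching a condition."""
    condition: str  # e.g. "status = 'complete'"


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical component: group rows and compute an expression."""
    group_by: str
    expression: str  # e.g. "SUM(amount) AS total"


Component = Union[Filter, Aggregate]


def compile_pipeline(source: str, steps: List[Component]) -> str:
    """Deterministically render a sequence of components as nested SQL.

    Illustration only: values are interpolated without escaping, which a
    real tool would never do.
    """
    sql = f"SELECT * FROM {source}"
    for i, step in enumerate(steps):
        alias = f"step_{i}"
        if isinstance(step, Filter):
            sql = f"SELECT * FROM ({sql}) AS {alias} WHERE {step.condition}"
        elif isinstance(step, Aggregate):
            sql = (
                f"SELECT {step.group_by}, {step.expression} "
                f"FROM ({sql}) AS {alias} GROUP BY {step.group_by}"
            )
        else:
            raise TypeError(f"unknown component: {step!r}")
    return sql


# The agent's job is reduced to choosing and ordering components.
pipeline = [
    Filter("status = 'complete'"),
    Aggregate("region", "SUM(amount) AS total"),
]
sql_first = compile_pipeline("orders", pipeline)
sql_second = compile_pipeline("orders", pipeline)
print(sql_first)
```

Because the same component list always compiles to the same statement, the non-determinism of the language model is confined to the choice of components, which a human engineer can review far more easily than free-form SQL.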

It’s easy to fall for “shiny object” syndrome in the tech world. With all the advances in generative AI, it’s tempting to let those shiny new copilots loose to try to replicate the job of the overworked, under-appreciated data engineer, at a fraction of her cost.

But if replacing data engineers with AI also means replacing much of the governance and control the data engineer brings, that could spell disaster for companies. “I think data engineering teams aren’t maybe fully aware of the potential doom that’s there,” Weigel said.

Instead, companies should be looking to super-charge those overworked data engineers using AI, which Weigel said is the best hope for surviving the AI data deluge.
