
July 21, 2025

Top 10 Big Data Technologies to Watch in the Second Half of 2025


With the tech industry in the midst of its mid-summer lull, now is the perfect time to take stock of how far we’ve come this year and to look at where big data tech might take us for the remainder of 2025.

Some may not like the term “big data,” but here at BigDATAwire, we’re still partial to it. Managing vast amounts of diverse, fast-moving, and always-changing data is never easy, which is why organizations of all stripes spend so much time and effort building and implementing technologies that can make data management at least a little less painful.

Amid the drumbeat of ever-closer AI-driven breakthroughs, the first six months of 2025 have demonstrated the vital importance of big data management. Here are the top 10 big data technologies to keep an eye on for the second half of the year:

1. Apache Iceberg and Open Table Formats

Momentum for Apache Iceberg continues to build after a breakthrough year in 2024 that saw the open table format become a de facto standard. Organizations want to store their big data in object stores, i.e. data lakehouses, but they don’t want to give up the quality and control they had grown accustomed to with less-scalable relational databases. Iceberg essentially lets them have their big data cake and eat it too.
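To make the idea concrete, here is a minimal sketch of what an Iceberg table looks like from PySpark, assuming the Iceberg Spark runtime is on the classpath; the catalog, bucket, and table names are illustrative.

```python
# A minimal sketch, assuming the Apache Iceberg Spark runtime JAR is available
# and a Hadoop-style catalog named "lake" backed by object storage.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")  # illustrative path
    .getOrCreate()
)

# Iceberg tables behave like relational tables sitting on top of an object store.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.sales")
spark.sql("CREATE TABLE IF NOT EXISTS lake.sales.orders (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO lake.sales.orders VALUES (1, 19.99), (2, 5.49)")

# Snapshot metadata is what enables time travel, audits, and safe concurrent
# writes, i.e. the database-like "quality and control" described above.
spark.sql("SELECT snapshot_id, committed_at FROM lake.sales.orders.snapshots").show()
```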

Just when Iceberg appeared to have beaten out Apache Hudi and Delta Lake for table format dominance, another competitor landed on the pond: DuckLake. The folks at DuckDB rolled out DuckLake in late May to provide another take on the matter. The crux of their pitch: If Iceberg requires a database to manage some of the metadata, why not just use a database to manage all of the metadata?


The folks behind Iceberg and its joined-at-the-hip metadata catalog, Apache Polaris, may have been listening. In June, word began to emerge that the open source projects are looking at streamlining how they store metadata by building out the scan API spec, which has been described but not yet implemented. The change, which could arrive with Apache Iceberg version 4, would take advantage of increased intelligence in query engines like Spark, Trino, and Snowflake, and would also allow direct data exports among Iceberg data lakes.

2. Postgres, Postgres Everywhere

Who would have thought that the hottest database of 2025 would trace its roots to 1986? But that seems to be the case in our current world, which has gone gaga for Postgres, the database created by UC Berkeley professor Michael Stonebraker as a follow-on to his first stab at a relational database, Ingres.

Postgres-mania was on full display in May, when Databricks shelled out a reported $1 billion to buy Neon, the Nikita Shamgunov-founded startup that developed a serverless and infinitely scalable version of Postgres. A few weeks later, Snowflake found $250 million to nab Crunchy Data, which had been building a hosted Postgres service for more than 10 years.

The common theme running through both of these big data acquisitions is the anticipated growth in the number and scale of AI agents that Snowflake and Databricks will be deploying on behalf of their customers. Those AI agents will need behind them a database that can be quickly scaled up to handle a variety of data tasks, and just as quickly scaled down and deleted. You don’t want some fancy new database for that; you want the world’s most reliable, well-understood, and cheapest database. In other words, you want Postgres.
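As a rough illustration of that scale-up-and-throw-away pattern (not how Databricks or Snowflake actually implement it), here is a sketch using the psycopg2 driver against an ordinary Postgres instance, with illustrative connection details and schema names.

```python
# A minimal sketch of giving an AI agent its own short-lived Postgres workspace.
# The connection string, schema name, and table are hypothetical.
import uuid
import psycopg2

conn = psycopg2.connect("postgresql://app:secret@localhost:5432/agents")
conn.autocommit = True

schema = f"agent_{uuid.uuid4().hex[:8]}"  # one scratch schema per agent run
with conn.cursor() as cur:
    cur.execute(f"CREATE SCHEMA {schema}")
    cur.execute(f"CREATE TABLE {schema}.scratch (k TEXT PRIMARY KEY, v JSONB)")
    cur.execute(f"INSERT INTO {schema}.scratch VALUES (%s, %s)", ("plan", '{"step": 1}'))
    # ... the agent reads and writes its working state here ...
    cur.execute(f"DROP SCHEMA {schema} CASCADE")  # torn down just as quickly as it was created

conn.close()
```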

3. Rise of Unified Data Platforms


The idea of a unified data platform is gaining steam amid the rise of AI. These systems, ostensibly, are built to provide a cost-effective, super-scalable platform where organizations can store huge amounts of data (measured in petabytes to exabytes), train massive AI models on huge GPU clusters, and then deploy AI and analytics workloads, with built-in data management capabilities to boot.

VAST Data, which recently announced its “operating system” for AI, is building such a unified data platform. So is its competitor WEKA, which last month launched NeuralMesh, a containerized architecture that connects data, storage, compute, and AI services. Another contender is Pure Storage, which recently launched its enterprise data cloud. Others building unified data platforms include Nutanix, DDN, and Hitachi Vantara.

As data gravity continues to shift away from the cloud giants toward distributed and on-prem deployments of co-located storage and GPU compute, expect these purpose-built big data platforms to proliferate.

4. Agentic AI, Reasoning Models, and MCP, Oh My!

We’re currently witnessing the generative AI revolution morphing into the era of agentic AI. By now, most organizations have an understanding of the capabilities and limitations of large language models (LLMs), which are great for building chatbots and copilots. As we entrust AI to do more, we give it agency. In other words, we create agentic AI.

Many big data tool providers are adopting agentic AI to help their customers manage more tasks. They’re using agentic AI to monitor data flows and security alerts, and to make recommendations about data transformations and user access control decisions.

Many of these new agentic AI workloads are powered by a new class of reasoning models, such as DeepSeek R1 and OpenAI’s o-series models, that can handle more complex tasks. To give AI agents access to the data they need, tool providers are adopting something called the Model Context Protocol (MCP), which Anthropic rolled out less than a year ago. This is a very active space, and there is much more to come here, so keep your eyes peeled.
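For a taste of what this looks like in practice, here is a minimal sketch of an MCP tool server, assuming the official `mcp` Python SDK and its FastMCP helper; the tool name and the data it returns are purely illustrative.

```python
# A minimal sketch of exposing a data-access tool to AI agents over MCP.
# Assumes the `mcp` Python SDK; the warehouse lookup is faked for illustration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("warehouse-tools")

@mcp.tool()
def row_count(table: str) -> int:
    """Return the number of rows in a (hypothetical) warehouse table."""
    # A real server would run a query here; we return canned numbers instead.
    fake_counts = {"orders": 1_204_533, "customers": 88_910}
    return fake_counts.get(table, 0)

if __name__ == "__main__":
    # Serve over stdio so an agent framework can launch this process and call the tool.
    mcp.run()
```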

5. It’s Only Semantics: Independent Semantic Layer Emerges

The AI revolution is shining a light on every layer of the data stack, and in some cases it’s leading us to question why things are built a particular way and how they could be built better. One of the layers AI is exposing is the so-called semantic layer, which has traditionally functioned as a sort of translation layer: it takes the cryptic, technical definitions of data stored in the data warehouse and translates them into the natural language understood and consumed by analysts and other human users of BI and analytics tools.


Normally, the semantic layer is implemented as part of a BI project. But with AI forecast to drive a huge increase in SQL queries sent to an organization’s data warehouse or other unified database of record (i.e. the lakehouse), the semantic layer suddenly finds itself thrust into the spotlight as a crucial linchpin for ensuring that AI-powered SQL queries are, in fact, getting the right answers.
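As a toy illustration (not any particular vendor’s product), a semantic layer boils down to mapping the business terms that humans and LLMs use onto governed SQL definitions, something like this hypothetical sketch.

```python
# A toy semantic layer: business-friendly metrics and dimensions are defined
# once, and every query (human- or AI-generated) is compiled from them.
# All metric, dimension, and table names here are hypothetical.
METRICS = {
    "revenue": "SUM(order_total)",
    "active_customers": "COUNT(DISTINCT customer_id)",
}
DIMENSIONS = {
    "region": "billing_region",
    "month": "DATE_TRUNC('month', order_date)",
}

def compile_query(metric: str, dimension: str, table: str = "fct_orders") -> str:
    """Turn a business question like "revenue by region" into vetted SQL."""
    return (
        f"SELECT {DIMENSIONS[dimension]} AS {dimension}, "
        f"{METRICS[metric]} AS {metric} "
        f"FROM {table} GROUP BY 1"
    )

print(compile_query("revenue", "region"))
# SELECT billing_region AS region, SUM(order_total) AS revenue FROM fct_orders GROUP BY 1
```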

With an eye toward independent semantic layers becoming a thing, data vendors like dbt Labs, AtScale, Cube, and others are investing in their semantic layers. As the importance of an independent semantic layer grows in the latter half of 2025, don’t be surprised to hear more about it.

6. Streaming Data Goes Mainstream

While streaming data has long been critical for some applications (think gaming, cybersecurity, and quantitative trading), the costs have been too high for wider use cases. But now, after a few false starts, streaming data appears to finally be going mainstream, and it’s all thanks to AI leading more organizations to conclude that it’s critical to have the best, most up-to-date data possible.

Streaming data platforms like Apache Kafka and Amazon Kinesis are widely used across industries for transactional, analytics, and operational use cases. We’re also seeing a new class of analytics databases, including ClickHouse, Apache Pinot, and Apache Druid, gain traction thanks to real-time streaming front ends.
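For the uninitiated, producing to a stream is only a few lines of code. Here is a minimal sketch using the kafka-python client against a local broker; the topic name and payload are illustrative.

```python
# A minimal sketch of publishing events to a Kafka topic with kafka-python.
# Broker address, topic, and event shape are all illustrative assumptions.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each event lands on the "clickstream" topic within milliseconds, where a
# real-time engine (ClickHouse, Pinot, Druid, or an AI agent) can pick it up.
for i in range(3):
    producer.send("clickstream", {"user": "u42", "event": "page_view", "ts": time.time()})

producer.flush()  # block until the broker has acknowledged the sends
```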

Whether an AI application taps directly into the firehose of data or the data first lands in a trusted repository like a distributed data store, it seems unlikely that batch data will be sufficient for any future use case where data freshness is even remotely a priority.

7. Connecting with Graph DBs and Knowledge Stores

How you store data has a large impact on what you can do with it. As one of the most structured types of databases, property graph data stores and their semantic cousins (RDF and triple stores) reflect how humans view the real world, i.e. through the connections people have with other people, places, and things.

That connectedness of data is also what makes graph databases so attractive for emerging GenAI workloads. Instead of asking an LLM to work out relevant connections across 100 or 1,000 pages of prompt, and accepting the cost and latency that necessarily entails, a GenAI app can simply query the graph database to determine relevance, and then apply the LLM magic from there.
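Here is a rough sketch of that query-the-graph-first pattern using the Neo4j Python driver; the connection details, node labels, and the way the results are folded into a prompt are illustrative assumptions rather than any vendor’s recipe.

```python
# A minimal GraphRAG-style sketch: fetch known relationships from the graph,
# then hand them to an LLM as grounded context. Names and credentials are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def neighbors_of(name: str) -> list[str]:
    """Return entities directly connected to `name`, for grounding an LLM prompt."""
    query = (
        "MATCH (p:Person {name: $name})-[r]-(other) "
        "RETURN type(r) AS rel, other.name AS other LIMIT 25"
    )
    with driver.session() as session:
        return [f"{rec['rel']} -> {rec['other']}" for rec in session.run(query, name=name)]

context = neighbors_of("Ada Lovelace")
prompt = "Using only these known relationships:\n" + "\n".join(context) + "\n\nAnswer the question..."
driver.close()
```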

Lots of organizations are adding graph tech to retrieval-augmented generation (RAG) workloads, in what’s called GraphRAG. Startups like Memgraph are adopting GraphRAG with in-memory stores, while established players like Neo4j are also tailoring their solutions toward this promising use case. Expect to see more GraphRAG in the second half of 2025 and beyond.

8. Data Products Galore

The democratization of data is a goal at many, if not most, organizations. After all, if allowing some users to access some data is good, then giving more users access to more data has to be better. One of the ways organizations are enabling data democratization is through the deployment of data products.

In general, data products are applications that are created to enable users to access curated data or insights generated from data. Data products can be developed for an external audience, such as Netflix’s movie recommendation system, or they can be used internally, such as a sales data product for regional managers.

Data products are often deployed as part of a data mesh implementation, which strives to enable independent teams to explore and experiment with data use cases while providing some centralized data governance. A startup called Nextdata is developing software to help organizations build and deploy data products. AI will do a lot, but it won’t automatically solve tough last-mile data problems, which is why data products can be expected to grow in popularity.

9. FinOps or Bust

Frustrated by the high cost of cloud computing, many organizations are adopting FinOps ideas and technologies. The core idea revolves around gaining a better understanding of how cloud computing impacts an organization’s finances and what steps can be taken to lower cloud spending.

The cloud was originally sold as a lower-cost option to on-prem computing, but that rationale no longer holds water, as some experts estimate that running a data warehouse on the cloud is 50% more expensive than running on prem.

Organizations can easily save 10% by taking simple steps, such as adopting the cloud providers’ savings plans, an expert in Deloitte’s cloud consulting business recently shared. Another 30% can be reclaimed by analyzing one’s bill and taking basic steps to curtail waste. Lowering costs further requires completely rearchitecting one’s application around the public cloud platform.
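Here is a back-of-the-envelope sketch of those staged savings. One assumption on our part: each percentage is applied to the original bill rather than compounding on the already-reduced amount.

```python
# A rough worked example of the staged FinOps savings described above.
# The monthly bill is illustrative; the percentages come from the article.
monthly_bill = 100_000  # dollars

easy_steps = 0.10 * monthly_bill       # e.g. adopting providers' savings plans
bill_analysis = 0.30 * monthly_bill    # trimming waste found by analyzing the bill
remaining = monthly_bill - easy_steps - bill_analysis

print(f"After easy steps:    ${monthly_bill - easy_steps:,.0f} per month")
print(f"After bill analysis: ${remaining:,.0f} per month")
# Getting meaningfully below this point means rearchitecting the application
# around the cloud platform itself.
```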

10. I Can’t Believe It’s Synthetic Data

As the supply of human-generated data for training AI models dwindles, we’re forced to get creative in finding new sources of training data. One of those sources is synthetic data.

Synthetic data isn’t fake data. It’s real data that’s artificially created to possess the desired features. Before the GenAI revolution, it was being adopted for computer vision use cases, where users created synthetic images of rare instances or edge cases to train computer vision models. Today, use of synthetic data is growing in the medical field, where companies like Synthema are creating synthetic data for research into treatments for rare hematological diseases.
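As a simple illustration of the idea (unrelated to the companies mentioned above), here is a sketch that uses scikit-learn to generate synthetic tabular records in which the interesting class is deliberately rare.

```python
# A minimal sketch of generating synthetic, imbalanced tabular data with
# scikit-learn. Sample counts, feature counts, and class balance are illustrative.
from sklearn.datasets import make_classification

# 10,000 synthetic rows in which the "rare" positive class is only 2%,
# mimicking the kind of edge cases that are hard to collect in the real world.
X, y = make_classification(
    n_samples=10_000,
    n_features=12,
    n_informative=6,
    weights=[0.98, 0.02],
    random_state=42,
)

print(f"Synthetic positives: {(y == 1).sum()} of {len(y)} rows")
# These artificial-but-realistic rows can augment scarce real-world data
# when training a downstream model.
```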

The potential to use synthetic data with generative and agentic AI is a subject of great interest to the data and AI communities, and is a topic to watch in the second half of 2025.

As always, these topics are just some of what we’ll be writing about here at BigDATAwire in the second half of 2025. There will undoubtedly be some unexpected occurrences and perhaps some new technologies and trends to cover, which always keeps things interesting.

Related Items:

The Top 2025 GenAI Predictions, Part 2

The Top 2025 Generative AI Predictions: Part 1

2025 Big Data Management Predictions

 
