The evolution of big data has entered a period of punctuated equilibrium over the past month, thanks to the community congregating around open table formats and metadata catalogs for storing data and enabling processing engines to access it. Now attention is shifting to another element of the stack that has been living quietly in the shadows: the semantic layer.
The semantic layer is an abstraction that sits between a company’s data and the business metrics it has chosen as its standard units of measurement. It’s a critical layer for ensuring correctness.
For instance, while various departments in a company may have different opinions about the best way to measure “revenue,” the semantic layer defines the correct way to measure revenue for that company, thereby eliminating (or at least greatly reducing) the chance of getting bad analytic output.
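To make the idea concrete, here is a minimal sketch, in Python with entirely hypothetical table, column, and schema names, of what a semantic layer encodes: a single canonical definition of “revenue” that every downstream tool compiles into the same SQL, rather than each dashboard re-deriving it its own way.

```python
# Minimal sketch of a semantic-layer metric definition. All names here
# (tables, columns, the Metric schema itself) are hypothetical and for
# illustration only; they are not any vendor's actual API.
from dataclasses import dataclass

@dataclass
class Metric:
    name: str        # the business name every tool agrees on
    table: str       # the one blessed source table
    expression: str  # the one blessed aggregation
    filters: str     # business rules baked in once, not per-dashboard

revenue = Metric(
    name="revenue",
    table="finance.orders",
    expression="SUM(order_total)",
    filters="status = 'completed' AND NOT is_test_order",
)

def to_sql(metric: Metric, group_by: str) -> str:
    """Compile the canonical definition into SQL on demand."""
    return (
        f"SELECT {group_by}, {metric.expression} AS {metric.name} "
        f"FROM {metric.table} WHERE {metric.filters} GROUP BY {group_by}"
    )

print(to_sql(revenue, "region"))
```

Because every tool compiles the same definition, two departments asking for “revenue by region” get the same number.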
Traditionally, the semantic layer has traveled with the business intelligence (BI) or data analytics tool. If you were a Tableau shop or a Qlik shop or a Microsoft PowerBI shop or a ThoughtSpot shop or a Looker shop, you used the semantic layer provided by those vendors to define your business metrics.
This approach works well for smaller companies, but it creates problems for larger enterprises that use two or more BI and analytics tools. Such an enterprise faces the task of hardwiring two or more semantic layers together, ensuring that each one pulls data from the correct tables and applies the right transformations so that reports and dashboards keep generating accurate information.
In recent years, the concept of a universal semantic layer has started to bubble up. Instead of defining business metrics in a semantic layer tied directly to a single BI or analytics tool, the universal semantic layer lives outside of those tools, providing a semantic service that any BI or analytics tool can tap into to ensure accuracy.
As companies’ cloud data estates have grown over the past five years, even smaller companies have started dealing with the increased complexity that comes from using multiple data stacks. That has helped drive interest in the universal semantic layer.
Natural Language AI
More recently, another factor has driven a surge of interest in the semantic layer: generative AI. Large language models (LLMs) like the ones behind ChatGPT are leading many companies to experiment with using natural language as an interface for a range of applications. LLMs have shown an ability to generate text in any number of languages, including English, Spanish, and SQL.
While the English generated by LLMs generally is quite good, the SQL is usually quite poor. In fact, a recent paper found that LLMs generate accurate SQL on average only about one-third of the time, said Tristan Handy, the CEO of dbt Labs, the company behind the popular dbt tool, and the purveyor of a universal semantic layer.
“A lot of the people experimenting in this space are AI engineers or software engineers who don’t actually have knowledge of how BI works,” Handy told Datanami in an interview at Snowflake’s Data Cloud Summit last month. “And so they’re just like, ‘I don’t know, let’s have the model write SQL for me.’ It just doesn’t happen to work that well.”
The good news is that it’s not difficult to introduce a semantic layer into the GenAI call stack. Using a tool like LangChain, one could simply instruct the LLM to use a universal semantic layer to generate the SQL query that fetches the data from the database, instead of letting the LLM write the SQL itself, Handy said. After all, this is exactly what semantic layers were created for, he pointed out. Using this approach increases the accuracy of natural language queries using LLMs to about 90%, Handy said.
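The pattern Handy describes looks roughly like the sketch below. It assumes a semantic layer exposed over HTTP (the endpoint URL and payload shape are hypothetical, though Cube and dbt offer comparable APIs): the LLM is only asked to pick metrics and dimensions from a governed list, and the semantic layer, not the model, writes and runs the SQL.

```python
# Sketch of "the LLM plans, the semantic layer writes the SQL." The endpoint
# URL, payload shape, and metric names are assumptions for illustration.
import json

import requests
from openai import OpenAI  # pip install openai requests

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ALLOWED = {
    "metrics": ["revenue", "order_count"],
    "dimensions": ["region", "order_month"],
}

def plan_query(question: str) -> dict:
    """Ask the LLM for a structured metric request, never raw SQL."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": "Reply with JSON of the form "
                           '{"metrics": [...], "dimensions": [...]}, '
                           f"choosing only from: {json.dumps(ALLOWED)}",
            },
            {"role": "user", "content": question},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def run_query(plan: dict) -> list:
    # Hypothetical semantic-layer endpoint: it compiles the structured
    # request into governed SQL and executes it against the warehouse.
    r = requests.post("https://semantic-layer.example.com/v1/query", json=plan)
    r.raise_for_status()
    return r.json()["rows"]

print(run_query(plan_query("What was revenue by region last quarter?")))
```

The accuracy gain comes from shrinking the model’s job: choosing from a governed vocabulary is far easier to get right than free-form SQL generation.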
“We are having a lot of conversations about the semantic layer, and a lot of them are driven by the natural language interface question,” he said.
Not Just Semantics
Dbt Labs isn’t the only vendor plying the universal semantic layer waters. Two other vendors, AtScale and Cube, have also staked claims in this space.
AtScale recently announced that its Semantic Layer Platform is now available on the Snowflake Marketplace. This support ensures that Snowflake customers can continue to rely on the data they’re generating, no matter which AI or BI tool they’re using in the Snowflake cloud, the company said.
“The semantic models you define in AtScale represent the metrics, calculated measures, and dimensions your business consumers need to analyze to achieve their business objectives,” AtScale Vice President of Growth Cort Johnson wrote in a recent blog post. “After your semantics are defined in AtScale, they can be consumed by every BI application, AI/ML application, or LLM in your organization.”
Databricks is also getting into the semantic game. At its recent Data + AI Summit, it announced that it has added first-class support for metrics in Unity Catalog, its data catalog and governance tool.
“The idea here is that you can define metrics inside Unity Catalog and manage them together with all the other assets,” Databricks CTO Matei Zaharia said during his keynote address two weeks ago. “We want you to be able to use the metrics in any downstream tool. We’re going to expose them to multiple BI tools, so you can pick the BI tool of your choice. … And you’ll be able to just use them through SQL, through table functions that you can compute on.”
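For readers wondering what that could look like in practice, here is a minimal sketch in Python using the databricks-sql-connector package; the metric name and the MEASURE() query syntax are assumptions for illustration, since the feature’s final shape wasn’t public at the time of the keynote.

```python
# Sketch: querying a governed metric from a warehouse via plain SQL, using
# the databricks-sql-connector package. Hostnames, tokens, and the MEASURE()
# syntax below are placeholders/assumptions, not confirmed product syntax.
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/your-warehouse-id",      # placeholder
    access_token="dapi-your-token",                         # placeholder
) as conn:
    with conn.cursor() as cur:
        # Hypothetical: select a metric defined once in the catalog instead
        # of re-deriving it with hand-written aggregations in each tool.
        cur.execute(
            "SELECT region, MEASURE(revenue) AS revenue "
            "FROM main.metrics.sales GROUP BY region"
        )
        for row in cur.fetchall():
            print(row)
```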
Databricks also announced that it is partnering with dbt Labs, Cube, and AtScale as “external metrics providers,” making it easy to bring in and manage metrics from those vendors’ tools inside Unity Catalog, Zaharia said.
Cube, meanwhile, last week launched a couple of new products, including a new Semantic Catalog, which is designed to give users “a comprehensive, unified view of connected data assets,” wrote David Jayatillake, the VP of AI at Cube, in a recent blog post.
“Whether you are looking for modeled data in Cube Cloud, downstream BI content, or upstream tables, you can now find it all within a single, cohesive interface,” he continued. “This reduces the time spent jumping between different data sources and platforms, offering a more streamlined and efficient data discovery process for both engineers and consumers.”
The other new product announced by Cube, which recently raised $25 million from Databricks and other investors, is an AI Assistant. The new offering is designed to “empower non-technical users to ask questions in natural language and receive trusted answers based on your existing investment into Cube’s universal semantic layer,” Jayatillake wrote.
Opening More Data
GenAI may be the biggest factor driving interest in a universal semantic layer today, but the need for it predates GenAI.
According to dbt Labs’ Handy, a 2022 Datanami Person to Watch, the rise of the universal semantic layer is happening for the same reason that the database is being decomposed into its constituent parts.
Dbt Labs originally got into the universal semantic layer space because the company saw it as “a cross-platform source of truth,” Handy said.
“It should be across your different data tools, it should be across your BI tools,” he said. “In the same way that you govern your data transformation in this independent way, you should be governing your business metrics that way, too.”
The rise of open table formats like Apache Iceberg, Apache Hudi, and Delta Lake, along with open metadata catalogs like Snowflake Polaris and Databricks Unity Catalog, shows that there’s an appetite for dismantling the traditional monolithic database into a collection of independent components linked through a federated architecture.
At the moment, all of the universal semantic layers are proprietary, unlike the table format and metastore layers, where open standards reign, Handy pointed out. Eventually the market will settle on a standard, he said, but it’s still very early days.
“Semantic layers used to be kind of a niche thing,” he said, “and now it’s becoming a hot topic.”
Related Items:
Cube Secures $25M to Advance Its Semantic Layer Platform
AtScale Announces Major Upgrade To Its Semantic Layer Platform
Semantic Layer Belongs in Middleware, and dbt Wants to Deliver It