

NASA collects all kinds of data. Some of it comes from satellites orbiting the planet. Some of it travels from instruments floating through deep space. Over the years, these efforts have built up a massive collection: images, measurements, signals, scans. It is a goldmine of information, but getting to it, and making sense of it, is not always simple.
For many scientists, the trouble starts with the basics. A file might not say when it was recorded, what tool gathered it, or what the numbers mean. Without that information, even experienced researchers can get stuck.
With AI systems, the challenges are even more complex. Machines can learn from patterns, but they still need structure. If the data is vague or missing key labels, a model cannot do much with it, or it is forced to connect dots that are simply too far apart. As a result, some of the most valuable data ends up overlooked, or the output is unreliable.
NASA has developed new tools to address the problem. These include automated metadata pipelines that process and standardize information about the agency’s vast datasets.
These automated pipelines clean up and clarify the metadata, which is the information about the data itself. Once that layer is solid, datasets become easier to find, easier to sort, and more useful to both humans and machines. The goal is to make this improved metadata available on familiar platforms like Data.gov, GeoPlatform, and NASA’s own data portals. The hope is that this shift will support faster research and better results across a wide range of projects.
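As a rough illustration of what that cleanup can involve, the Python sketch below normalizes a few common trouble spots: inconsistent timestamps, stray whitespace, and free-text instrument names. The field names and rules here are assumptions made for the example, not NASA’s actual schema or pipeline.

```python
from datetime import datetime, timezone

# Illustrative only: a toy normalization pass of the kind a metadata
# pipeline might run. Field names and rules are assumptions, not
# NASA's actual schema.
def normalize_record(raw: dict) -> dict:
    """Return a cleaned copy of a raw metadata record."""
    record = {}

    # Standardize the title: strip stray whitespace, collapse runs of spaces.
    record["title"] = " ".join(raw.get("title", "").split()) or "UNTITLED"

    # Normalize timestamps to ISO 8601 UTC so every record sorts the same way.
    start = raw.get("start_time")
    if start:
        dt = datetime.fromisoformat(start)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        record["start_time"] = dt.astimezone(timezone.utc).isoformat()

    # Map free-text instrument names onto a controlled vocabulary.
    aliases = {"modis": "MODIS", "aster": "ASTER"}
    instrument = raw.get("instrument", "").strip().lower()
    record["instrument"] = aliases.get(instrument, instrument.upper() or "UNKNOWN")

    return record


print(normalize_record({
    "title": "  Terra  MODIS   L2 ",
    "start_time": "2024-01-05T12:00:00",
    "instrument": "modis",
}))
```

Once every record passes through a step like this, downstream consumers, human or machine, can rely on one predictable shape instead of guessing at each provider’s quirks.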
Part of this effort is about opening access beyond NASA’s usual networks. Not everyone looking for data is familiar with internal tools or technical systems. That challenge is part of the reason these pipelines exist. “In NASA Earth science, we do have our own online catalog, called the Common Metadata Repository (CMR), that is particularly geared towards our NASA user community,” said Newman.
“CMR works great in this case, but people outside of our immediate community might not have the familiarity and specific knowledge required to get the data they need. More general portals, such as Data.gov, are a natural place for them to go for government data, so it’s important that we have a presence there.”
NASA’s new metadata pipelines are an attempt to make those datasets easier to find and easier to understand. The first phase of the effort centers on more than 10,000 public data collections, covering over 1.8 billion individual science records. These are being reformatted and aligned with open standards so they can be shared through platforms like Data.gov and GeoPlatform, where researchers outside NASA are more likely to search. The shift also helps AI systems: when the structure is clear and consistent, models can interpret the data and apply it without making unnecessary assumptions.
Improving structure is only part of the process. NASA is also looking closely at the quality of the metadata itself. That work is handled through the ARC project, short for Analysis and Review of CMR. The goal is to make sure records are not just formatted properly, but also accurate, complete, and consistent. By reviewing and strengthening these records, ARC helps ensure that what shows up in search results is not only visible, but also reliable enough to be used with confidence.
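The kinds of checks involved are easy to picture in code. The sketch below shows hypothetical completeness and consistency rules of the sort a review effort might automate; the required fields and validation rules are illustrative assumptions, not ARC’s actual criteria.

```python
# A hedged sketch of automated metadata quality review. The required
# fields and rules are illustrative assumptions, not ARC's actual checks.
REQUIRED_FIELDS = ["title", "abstract", "start_time", "instrument", "spatial_extent"]

def review_record(record: dict) -> list[str]:
    """Return a list of human-readable quality findings for one record."""
    findings = []

    # Completeness: every required field should be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            findings.append(f"missing required field: {field}")

    # Consistency: a temporal extent should not end before it begins
    # (ISO 8601 strings compare correctly as plain strings).
    start, end = record.get("start_time"), record.get("end_time")
    if start and end and end < start:
        findings.append("end_time precedes start_time")

    # Accuracy spot check: bounding boxes must use valid lat/lon ranges.
    bbox = record.get("spatial_extent")  # assumed (west, south, east, north)
    if bbox:
        west, south, east, north = bbox
        if not (-180 <= west <= 180 and -180 <= east <= 180
                and -90 <= south <= 90 and -90 <= north <= 90):
            findings.append("spatial_extent out of lat/lon bounds")

    return findings


print(review_record({"title": "Example", "spatial_extent": (-200, 0, 10, 20)}))
```

A record that comes back with an empty findings list is both visible and trustworthy; one with findings gets routed back for correction before it reaches a search index.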
Translating NASA’s internal metadata into formats that work across public platforms takes detailed and technical work. That effort is being led by Kaylin Bugbee, a data manager with NASA’s Office of the Chief Science Data Officer. She helps run the Science Discovery Engine, a system that supports open access to NASA’s research tools, data, and software.
Bugbee and her team are building a process that gathers metadata from across the agency and maps it to the formats used by platforms like Data.gov. It is a careful, step-by-step workflow that needs to match NASA’s unique terms with more universal standards. “We’re in the process of testing out each step of the way and continuing to improve the metadata mapping so that it works well with the portals,” Bugbee said.
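A crosswalk of this kind boils down to mapping fields from one schema onto another. The sketch below translates a CMR-style record into a handful of the DCAT-US fields that Data.gov harvests; the source field names and the mapping itself are simplified assumptions for illustration, not the team’s actual workflow.

```python
import json

# A simplified, assumed crosswalk from a CMR-style record to DCAT-US-style
# fields of the kind Data.gov harvests. The real mapping is more involved.
def cmr_to_dcat(cmr: dict) -> dict:
    return {
        "title": cmr["EntryTitle"],
        "description": cmr.get("Abstract", ""),
        "keyword": cmr.get("ScienceKeywords", []),
        "modified": cmr.get("UpdateDate", ""),
        "publisher": {"name": "NASA"},
        "identifier": cmr["ConceptId"],
        "accessLevel": "public",
    }


# Hypothetical input record, shaped loosely like a CMR collection entry.
record = {
    "EntryTitle": "Example Terra Surface Reflectance Collection",
    "ConceptId": "C000000001-EXAMPLE",
    "ScienceKeywords": ["EARTH SCIENCE", "LAND SURFACE"],
}
print(json.dumps(cmr_to_dcat(record), indent=2))
```

The hard part, as Bugbee’s comment suggests, is not the code but the vocabulary: deciding which NASA-specific term corresponds to which general-purpose field, then testing that the result renders sensibly on each portal.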
NASA is also working on geospatial data. Some of these datasets are used by other agencies for things like mapping, transportation, and emergency planning. They are known as National Geospatial Data Assets, or NGDAs.
Bugbee’s team is building a system that helps connect these files to GeoPlatform.gov, with links that send users straight to NASA’s Earthdata Search. The process builds on metadata NASA already has, which saves time and reduces the need to start from scratch. They began with MODIS and ASTER products from the Terra platform and will expand from there. The goal is to make these datasets easier to access, while keeping the structure clear and consistent across platforms that serve both public and scientific users.
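One way to picture that linking step: each GeoPlatform-facing record carries a URL that drops users into Earthdata Search for the matching collection. The sketch below builds such a link from a collection identifier; the free-text query pattern is an assumption for illustration, not the team’s documented URL scheme.

```python
from urllib.parse import urlencode

# Assumed deep-link pattern into NASA's Earthdata Search: a free-text
# query on the collection's identifier. Illustrative, not official.
EARTHDATA_SEARCH = "https://search.earthdata.nasa.gov/search"

def earthdata_link(concept_id: str) -> str:
    """Build a search URL that surfaces the given collection."""
    return f"{EARTHDATA_SEARCH}?{urlencode({'q': concept_id})}"


print(earthdata_link("C000000001-EXAMPLE"))
# -> https://search.earthdata.nasa.gov/search?q=C000000001-EXAMPLE
```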