
Analyze This: How to Prepare Your Unstructured Data in Six Steps

We live in a data-rich world where information is ours for the taking. But throwing just any data at your algorithm is a bad idea. With AI, small inconsistencies quickly become big ones. And those mistakes affect your decision-making, reputation, and bottom line. That’s why you need to prepare your data before you hand it over to your algorithms.
Here’s how to put quality data in—so you get quality data out.
Step 1: Clean Your Data
Junk data is part of life, especially with qualitative (text-based) data. Before you hit “upload” on your analytics platform, find and strip out low- and no-value data. You’ll improve your data quality and avoid wasting valuable processing credits.
Remove fields containing nothing, n/a, and gibberish. You can generally also remove very short text responses. Exceptions are when a question specifically asks for a very short response, or when users write something in a “further comments” box.
Sometimes you might want to add data, such as the term “N/A,” to an empty cell. This lets the system process those data points instead of skipping blank text fields. Also check how your system handles special characters: some systems skip text fields where special characters make up more than a certain percentage of the content. Adjust those cells if you need to.
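A minimal sketch of this kind of cleanup in Python with pandas, assuming a hypothetical CSV export with a “response” column (the file name, column name, and length threshold are placeholders, not a prescription):

```python
import pandas as pd

# Hypothetical survey export; the "response" column name is an assumption.
df = pd.read_csv("survey_export.csv")

text = df["response"].fillna("").str.strip()

# Strip no-value placeholders and (optionally) very short answers.
placeholders = {"", "n/a", "na", "none", "-"}
keep = ~text.str.lower().isin(placeholders) & (text.str.len() >= 5)
cleaned = df[keep].copy()

# Alternatively, write an explicit "N/A" into empty cells instead of dropping
# them, so the system processes those rows rather than skipping them:
# df["response"] = text.replace("", "N/A")

cleaned.to_csv("survey_cleaned.csv", index=False)
```

Keep the length threshold low (or skip it entirely) for “further comments” fields, where short answers are legitimate.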
Step 2: Combine Like Data
Scraped or exported data often arrives in multiple files. You’ll get better results by combining your data into fewer files – or even just one. When deciding how to combine your data, know:
- What you’re looking for
- If you’re comparing and contrasting, and if so what your main comparison point is
- If you’ll aggregate, then sort data
- If your data sources need to be separated, and if so whether you’ll build separate dashboards for each source
- How big the files are (NB: combining sources can create very large files that take longer to process)
For example, say you’re comparing reviews for “App A,” “App B,” and “App C” from both the Apple App Store and the Google Play Store. The review data arrives as six files: one for each app from each source.
You can combine this data in a few different ways. You could collate the Apple App Store data in one file and the Google Play Store data in another. Or save the reviews for each app across both stores into three separate files. Or you could combine all of the data into one large file.
Why one over the other? It depends on your goals. If you want to contrast the Apple App Store reviews with the Google Play Store reviews, two files make sense. If you’re comparing the apps themselves, three files might make more sense. If you only need to process the data once, a single large file is fine.
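Here’s roughly how those options might look in pandas. The file names and app/store labels below are hypothetical stand-ins for your own exports:

```python
import pandas as pd

# Hypothetical file names for the six review exports.
files = {
    ("App A", "Apple App Store"): "app_a_apple.csv",
    ("App A", "Google Play Store"): "app_a_google.csv",
    ("App B", "Apple App Store"): "app_b_apple.csv",
    ("App B", "Google Play Store"): "app_b_google.csv",
    ("App C", "Apple App Store"): "app_c_apple.csv",
    ("App C", "Google Play Store"): "app_c_google.csv",
}

frames = []
for (app, store), path in files.items():
    frame = pd.read_csv(path)
    frame["app"] = app      # tag each row so its origin survives the merge
    frame["store"] = store
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)

# Two files: contrast Apple App Store reviews against Google Play Store reviews.
for store, group in combined.groupby("store"):
    group.to_csv(f"reviews_{store.lower().replace(' ', '_')}.csv", index=False)

# Three files: compare the apps themselves.
for app, group in combined.groupby("app"):
    group.to_csv(f"reviews_{app.lower().replace(' ', '_')}.csv", index=False)

# Or one large file for a single processing pass.
combined.to_csv("reviews_all.csv", index=False)
```

Tagging each row with its app and store before merging means you can always split the combined file back apart later.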
Step 3: Add Metadata
Metadata provides information about your data. It helps you find, use, filter, sort and preserve your data. The more metadata, the better—just so long as it’s good quality.
Always add the essentials:
- Document source
- Date(s) created
- Date(s) scraped/pulled
- Author
You can also add:
- URLs
- Groupings
- Notes
- Names
- Locations
- Tags
- Other relevant info
You can upload as much or as little metadata as you want. But more metadata makes it easier to sort and filter your data.
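A rough sketch of tagging rows with metadata in pandas, building on the hypothetical combined file from the previous sketch; the column names and placeholder values are assumptions for illustration:

```python
import pandas as pd

# Builds on the hypothetical combined file from the previous sketch.
df = pd.read_csv("reviews_all.csv")

# The essentials; values here are illustrative placeholders.
df["document_source"] = df["store"]                      # e.g. "Apple App Store"
df["date_created"] = pd.to_datetime(df["review_date"])   # assumes a review_date column exists
df["date_scraped"] = pd.Timestamp("2025-06-01")
df["author"] = df.get("reviewer_name", "unknown")

# Optional extras that make filtering and sorting easier later.
df["url"] = "https://example.com/reviews"                # placeholder URL
df["tags"] = "mobile;reviews"

df.to_csv("reviews_with_metadata.csv", index=False)
```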
Step 4: Sort Your Metadata
Consistency matters. Properly format your metadata so that you can find and filter it in your system. Unformatted metadata just makes life harder for you. To get started:
- Check date formats
- Standardize formatting
- Fix misspellings or variations (e.g., Apple vs. apple vs. Apple Inc.)
Uploading multiple documents to analyze, filter and graph? The formatting must be consistent across all the files so you can sort and compare them. In the app store example above, you’d want the source field to always read “Apple App Store” or “Google Play Store” across every document.
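A quick sketch of that standardization in pandas, continuing with the same hypothetical columns:

```python
import pandas as pd

df = pd.read_csv("reviews_with_metadata.csv")

# Normalize dates to a single format (ISO 8601 here).
df["date_created"] = (
    pd.to_datetime(df["date_created"], errors="coerce").dt.strftime("%Y-%m-%d")
)

# Collapse spelling variations into one canonical label per source.
source_map = {
    "apple": "Apple App Store",
    "apple app store": "Apple App Store",
    "google": "Google Play Store",
    "google play": "Google Play Store",
}
normalized = df["document_source"].str.strip().str.lower()
df["document_source"] = normalized.map(source_map).fillna(df["document_source"])

df.to_csv("reviews_standardized.csv", index=False)
```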
Step 5: Save Your File
If you’re using Excel or Google Sheets to prepare your data, save two copies of your cleaned and prepared data: one in a native file type and one as a CSV. CSVs tend to upload faster. They’re also easy to open and edit on the fly, no matter what OS or software you use.
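In pandas, saving both copies takes two lines (the Excel write assumes the openpyxl package, or a similar engine, is installed):

```python
import pandas as pd

df = pd.read_csv("reviews_standardized.csv")

# Native spreadsheet copy plus a CSV copy for faster uploads
# and easy editing anywhere.
df.to_excel("prepared_data.xlsx", index=False)
df.to_csv("prepared_data.csv", index=False)
```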
Step 6: Tune Your Sample Sets
Planning to tune and reprocess your data multiple times? Create special tuning sets. These are small selections of your data used to “tune” your configurations. Because they’re smaller, your system can process them quickly. You’ll get timely feedback without burning through processing credits. Once you’ve tuned with your smaller sets, move on to a larger set to confirm that your results align with your expectations. Or repeat the process with a new tuning set.
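A minimal sketch of carving out a tuning set in pandas; the 5% tuning sample and 20% confirmation split are assumptions you’d adjust to your data volume and credit budget:

```python
import pandas as pd

df = pd.read_csv("prepared_data.csv")

# Small, reproducible tuning set: a 5% random sample (size is an assumption).
tuning = df.sample(frac=0.05, random_state=42)
tuning.to_csv("tuning_set.csv", index=False)

# After tuning, confirm on a larger set that excludes the tuning rows.
confirmation = df.drop(tuning.index).sample(frac=0.20, random_state=7)
confirmation.to_csv("confirmation_set.csv", index=False)
```

Fixing the random seeds keeps the samples reproducible, so each tuning pass runs against the same rows.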
Getting your data in great shape before you feed it to an algorithm will net you better results and let you get more from your tech. Use the data prep best practices above, and you’ll spend less time fixing and more time analyzing.
About the author: Paul Barba is the chief scientist at Lexalytics, an InMoment company and a provider of analytic solutions for structured and unstructured data. Paul has 10 years of experience developing, architecting, researching and generally thinking about machine learning, text analytics and NLP software. He earned a degree in Computer Science and Mathematics from UMass Amherst.