
What Can Enterprises Learn From Genome Sequencing?
The data handling requirements for genetic discovery are increasing as the tools become more sophisticated, causing data sets to become increasingly large. However, dealing with these large data sets is nothing new for practitioners in traditional science, and enterprises can learn from the strategies and processes that these disciplines have put into play.
Mario Caccamo, head of Bioinformatics at The Genome Analysis Center (TGAC) based in the UK, recently outlined the work and challenges that researchers face, explaining the explosive nature of the data and the processes that genome scientists use to wrestle with it.
“We have super exponential growth in data generation,” explained Caccamo. “To give you a concrete example, with two sequencing instruments in 2010, we generated 1.2 terabases of data. We can now generate half of that (600 gigabases) in two weeks in only one sequencing round with our current instruments.”
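To put those figures on a common footing, here is a rough back-of-the-envelope sketch. It assumes the 2010 total was spread over a full 52-week year and that one terabase equals 1,000 gigabases; neither assumption is stated in the quote, so treat the result as indicative only.

```python
# Figures quoted by Caccamo; the time windows are assumptions for illustration.
TERABASES_2010 = 1.2          # output attributed to 2010
GIGABASES_PER_ROUND = 600.0   # output of one current sequencing round
WEEKS_PER_ROUND = 2           # duration of one round, as quoted
WEEKS_PER_YEAR = 52           # assumption: the 2010 figure spans a full year

gigabases_2010 = TERABASES_2010 * 1000           # 1 terabase = 1,000 gigabases
rate_2010 = gigabases_2010 / WEEKS_PER_YEAR      # gigabases per week in 2010
rate_now = GIGABASES_PER_ROUND / WEEKS_PER_ROUND # gigabases per week today

speedup = rate_now / rate_2010
print(f"2010: {rate_2010:.1f} Gb/week, now: {rate_now:.1f} Gb/week, "
      f"ratio: {speedup:.1f}x")
```

Under those assumptions, weekly output has grown roughly thirteenfold.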
This, of course, is a familiar refrain for enterprise executives as they weigh their own data plans and what their data requirements will be down the line. But having an infrastructure to handle and store this data isn’t the same thing as having a strategy to turn it into predictive intelligence.
“This is very much hitting the point where you need a strategy to cope with more data today than you have ever generated before,” says Caccamo, explaining the growing challenges TGAC faces as their data volumes increase in size and complexity. The organization looks at these challenges holistically, explained Caccamo in describing how they have built a cultural infrastructure to support their data-driven science goals.
Caccamo explains that a technology focus drives their cultural underpinnings. “In one end, developing the state of the art intensive platform, both sequencing and computational – this is what I call the ‘hard infrastructure.’ The other focus of our activities is on what we can develop between intensive algorithms and databases – this is what we call the ‘soft infrastructure.’”
Caccamo explains that TGAC’s challenge lies in developing these components together, through what he calls a systems approach, in order to produce “predictive assumptions” that help them move toward discovery and, ultimately, an understanding of biology.
The development of skills is also part of their base equation, says Caccamo, who is sensitive to the talent acquisition challenges that organizations face in the exploding data science fields. “We really take training into this,” explained Caccamo. “Technologies and training toward developing new strategies and tools is a very important part of what we do.”
With this organizational foundation in place, explained Caccamo, they are then able to focus on their true aim: data-driven science. Their process is something that other organizations can learn from, suggests Caccamo, explaining TGAC’s view of a pipeline that starts with data on the input end and produces biological science as the output.
The process starts with huge volumes of sequencing data. Using efficient algorithms, the researchers distill that data into what can then be classified as “information.” In the case of genomic research, the information is the ~150-letter base strings that the researchers use to assemble genome maps. The research head says that once they have reached the information step in the process, they shift priorities from efficiency to quality. At this step, says Caccamo, they’re focused on turning “information” into “knowledge,” which can then be transformed into biology.
In the case of genome sequencing, the research head used the discovery process of the wheat genome as an example. The first step is extracting and sequencing the DNA into strands of information called “bases,” which are ready to be assembled into something more actionable.
These bases are then passed through sophisticated hardware (TGAC uses an array of SGI UV supercomputers for much of their assembly work) and assembled into enormous graphs. When mapped out in a graph, wheat has 10 billion nodes of sequencing information, says Caccamo. These graphs are now considered knowledge that biologists can use to obtain better crops.
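The article does not say which kind of graph TGAC builds. As a minimal illustration of how short base strings can be turned into a graph at all, the sketch below uses a de Bruijn graph, a structure commonly used for short-read assembly. The function name, the toy reads, and the value of `k` are all hypothetical and chosen only for readability; real reads would be around 150 letters long and the graph vastly larger.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-letter node to the set of (k-1)-letter nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read; each window is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy overlapping reads, invented for illustration only.
reads = ["ACGTAC", "GTACGG", "TACGGA"]
graph = build_de_bruijn(reads, k=4)

# Collect every node, including ones that only appear as successors.
nodes = set(graph) | {n for successors in graph.values() for n in successors}
edge_count = sum(len(successors) for successors in graph.values())

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
print(f"{len(nodes)} nodes, {edge_count} edges")
```

Overlapping reads collapse onto shared nodes, which is why the graph, rather than the raw pile of reads, is the more useful object to hand on to the next stage.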
Caccamo explains that having the right strategy in place is essential for the long-term success of TGAC because their data growth is exponential. He notes that as the data becomes cheaper and cheaper to process, the scope of the research tends to expand. In the example of wheat genetics, he explains that the research has expanded to examine environmental factors.
“A recent addition of the toolkit of the bioinformatician is that we can look now into what is present as well in the soil,” says Caccamo. “In this case, it’s not going to be about one species, but instead about a community of species – what we call a microbial community. That’s what we call metagenomics.”
While the concept of metagenomics is contained within the esoteric domain of the genomic community, the concept of the run-away project is not. Virtually every enterprise has experienced the resource drain that happens when a project expands beyond its original scope. In the case of big data, these run-away projects can be very costly if there aren’t strategies in place to govern the direction they take.
As enterprises ramp up their big data initiatives, TGAC’s example suggests that organizations would be wise to consider the culture and processes already in place where traditional sciences have already blazed a trail in managing and processing enormous amounts of data.