

(Connect-world/Shutterstock)
What does “open” mean in the context of AI? Must we accept hidden layers? Do copyrights and patents still hold sway? And do consumers have the right to opt out of data collection? These are the types of questions that the folks at the Open Source Initiative are trying to get to the bottom of, as part of a deep dive to define “open source AI.”
The rules around what could be considered open source in tech used to be fairly well-defined, according to Stefano Maffulli, the executive director of the Open Source Initiative. Back in the 1970s, it was generally accepted that only things generated by a human could be legally protected with a copyright or a patent. Stuff generated by a machine, such as binary code, generally could not be protected.
That began to change with the PC revolution in the 1980s and Microsoft’s massive success selling software. Following several policy changes and landmark lawsuits, people began seeking and gaining protection for things such as source code and machine-generated binary code, Maffulli says.
With the advent of massive generative AI models that are trained on public data scraped from the Internet, we find ourselves at the edge of what current copyright law can cover. In fact, according to Maffulli, we’ve likely already passed that point, and now find ourselves in dire need of new ideas and new frameworks to define what can and should be protected, and what can and should be open and accessible to all.
“When [GitHub] Copilot was announced [in October 2021], it suddenly dawned that there were new copyright issues appearing on the horizon,” Maffulli tells Datanami in a recent interview. “Then I started diving a little bit deeper into how AI [works], how machine learning, deep learning, neural networks work, and it dawned on me again that there were new artifacts, new things. And we were really at the dawn of a new era where we need new laws, we need new frameworks to understand what’s happening. And we need to do that very quickly.”
OSI ‘Deep Dive’

You can access the OSI deep dive report on open AI here
With its “Defining Open Source Deep Dive” program, the OSI is taking a disciplined, multi-pronged approach to understanding all aspects of the question of openness in AI.
It set the process in motion earlier this year with a 20-page report on AI openness in February. In early June, it posted a public call for papers and research on the topic, followed by a set of kickoff meetings in San Francisco later that month. There were two community review workshops in July, in Oregon and Switzerland, followed by a third workshop last week in Spain.
If all goes according to schedule, OSI hopes to publish the first release candidate of its paper defining open source for AI next month. The process will continue into 2024, according to the group’s website.
The group is trying to remain open to all perspectives in coming up with its definition and policy recommendations. “It largely depends on what people want to do,” Maffulli says. “At the Open Source Initiative, we’re just driving this conversation. We’re not really forcing our opinions on anyone.”
A New Age of Data
The radical openness that defined the first 40 years of the Internet served the community well and sowed the seeds of technological progress to come. The egalitarianism of the Internet’s first phase of development fostered a community that thrived on openness and an ethos of sharing.
That started to change with the dawn of the big data age and the advent of social media and smart phones. Tech firms realized they could scrape the Internet for data freely shared by users, as well as some data not freely shared but still available (such as books), to amass huge data sets. Those data sets are now being used to train massive generative AI models that have the potential to not only reshape consumers’ relationship with technology for years to come, but also separate winners from losers on the corporate and creative battlefields.
One of the big questions that OSI is struggling with is this: Does current copyright law still work in the age of AI? The answer hasn’t been settled yet, but it doesn’t look like it does.
“I think we’re at the point where we should make a decision whether we want those to be covered by copyright or whether we need to create new rights and new obligations for society,” Maffulli says. “What’s the best approach?”
There are different perspectives to these questions, and each deserves to be considered. The debate touches on several aspects of intellectual property rights, including copyrights, patents, trademarks, and trade secrets. But it’s also tied up into privacy rights, security obligations, and labor law, which adds to the complexity.
Maffulli says he understands the plight of creative workers whose past work can be harnessed to train a GenAI model that can re-create those workers’ output, potentially putting them out of work. Is there any legal recourse for them? Should they be granted legal protections? It’s tempting, he says.
“The reaction to that is to say, wait a second, you have been feeding my images, my text, into this machine and now this machine is capable of replacing me? No!” he says. “I have copyright rights on the work that I have produced. I never authorized anyone to use the archive of my work as a data mining source. Therefore, I want you to ask me for permission. I think that that’s a very fair approach, a very fair reaction.”
However, if communities and government opt to stiffen data protections, it will naturally make it harder to obtain data to train AI models. That will not only slow down the overall rate of AI innovation, but it will likely also have the side effect of entrenching the already dominant positions that OpenAI, Google, and Meta enjoy in the space, he says.
“I think the biggest threat is there will not be the possibility to have a diverse amount of players in the field,” he says. “This is a field that naturally, at every step, favors the ones with the big resources, large amounts of resources. Because the main three components are data, knowledge, and hardware.”
The tech giants already have the data, which they have been systematically scraping from the Internet for years. They have the financial resources to afford the giant GPU clusters needed to train AI models. And they naturally attract the top minds in the field as a byproduct of having massive GPU clusters and lots of data to play with.
Maffulli sounds pragmatic about the potential to enact meaningful change by strengthening copyright protections. The tech giants already have the means to bury lawsuits brought by individuals, he says. And besides, they already have all the data. In many cases, they acquired it fair and square, thanks to consumers’ tendency to click “yes” on every privacy policy dialog box they’re presented.
‘Cat’s Out of the Bag’
For years Maffulli shared his image and title liberally across the Web. Then at one point, he tried to rein it back in by deleting his image on every major site. It’s his likeness and his right, he figured. He would force the tech giants to forget they ever saw him, he thought. At some point, he realized it was likely impossible.
That experience has informed his view on what is possible to be done with data and the open future of AI. “I think it’s better off if we just let it go,” Maffulli says. “The cat is out of the bag.”
In other words, instead of trying to put the cats back in the bag, we are better off just managing the loose cats as best we can. That means stronger operational controls on data that’s already out in the open, and better guardrails to guide those cats to happy homes.
“I do think that it cannot be solved by copyright law,” Maffulli says. “It needs to be solved by having strong policy, privacy protection laws, strong control from the individual to say ‘I don’t want to be recognized. Therefore, even if you have my face in the database, it gets deactivated. You cannot use it.’”
There are pluses and minuses to open source and to copyright protections, and they must be weighed carefully. OSI’s policy is not to judge how practitioners use open source software, noting that it’s impossible to draw a line between moral and immoral uses. As the debate plays out over what open means in AI, that line is murkier than ever.
Related Items:
Why Truly Open Communities are Vital to Open Source Technology
Do Customers Want Open Data Platforms?
Open Data Hub: A Meta Project for AI/ML Work