
November 7, 2024

Meet Haoyuan ‘HY’ Li, a 2024 BDW Person to Watch

The big data revolution exposed the inadequacy of older technologies and paved the way for newer technologies. One of those technologies is Alluxio, which was developed by Haoyuan “HY” Li, one of the BigDATAwire People to Watch for 2024.

Li created Alluxio (formerly Tachyon) to serve as a virtual distributed file system for use with frameworks such as Apache Hadoop and Apache Spark. He also founded a company called Alluxio, where he is chairman and CEO.

BigDATAwire recently caught up with Li to talk about his work. Here is what he said:

BigDATAwire: You created Alluxio while working in the AMPLab at UC Berkeley. What was the source of the inspiration for the project?

HY Li: When I was doing research at Google during my undergraduate time, I saw the power of data as the foundation of many aspects of our world in the future. With that belief, I was very fortunate to have the opportunity to pursue my Ph.D. at Berkeley AMPLab under the tutelage of Professor Ion Stoica and Professor Scott Shenker. While at AMPLab, I was inspired by people around me, such as my colleagues Matei Zaharia and Ali Ghodsi.

At the time, there was an explosion in innovation at the compute layer and storage layer, which created a unique problem associated with data orchestration (including data access, management, etc). While the introduction of new technologies enabled many new applications, every new storage system became yet another data silo. The rise of cloud storage only exacerbated these challenges. I believe that data teams should be able to serve data to applications with high performance and reasonably low costs, without the need for extensive retooling.

As a result, I co-created Alluxio, a data platform that bridges the gap between compute and storage and provides high performance data access for all data driven workloads, including analytics and AI, in any environment. Alluxio holds a unique position in the data stack, neither as a compute engine nor just another storage system, but instead sitting right at the intersection of compute and storage, as a data platform. By being close to storage, we have a universal view of the workloads on the data platform across stages of a data pipeline. This is the knowledge we tap into. Being close to compute is what makes the Alluxio Data Platform smart, by tapping into a view of what the applications on the compute engines are trying to achieve. Leveraging this unique position is what differentiates Alluxio.

BDW: What is missing from the big data stack today?

Li: Companies are racing to leverage AI and machine learning in their businesses, and what they are realizing is that machine learning applications create a new set of challenges for their data platforms. Traditional data infrastructures often struggle to cope with these demands, leading to cost inefficiencies, slower innovation, and complex data engineering.

With the rise of machine learning workloads such as computer vision and LLMs, the need for a high performance data layer that serves all critical data driven applications is even greater. Alluxio provides an efficient offline model training cache capable of serving datasets of any size directly to training nodes without impacting the training performance. This enables data teams to achieve magnitudes higher training performance without the need for costly specialized storage, thereby greatly reducing development cycles and accelerating innovation.

Some examples include model training for autonomous driving applications, where Alluxio serves data efficiently to models, increasing GPU utilization and decreasing cloud costs. This ensures that model training is faster and more accurate, ultimately contributing to the development of safer autonomous vehicles.

Alluxio is also being utilized by online content communities to power their Q&A applications based on large language models. Alluxio accelerates model updates from experimentation to production, facilitating a better user experience and deeper user engagement.

BDW: You had a role in developing Spark Streaming. What’s the relationship between distributed file systems and streaming data platforms?

Li: We see streaming data applications as a type of data-driven application that a data platform such as Alluxio serves.

BDW: Outside of the professional sphere, what can you share about yourself that your colleagues might be surprised to learn – any unique hobbies or stories?

Li: Outside of work, I enjoy exploring the great outdoors through hiking and scuba diving. I love what I do, but it can be difficult to find the space to step back and appreciate the world. I’ve found scuba diving to be the perfect activity as it requires focus to ensure safety, which allows me to be fully present and appreciate the wonders of the sea world. I also enjoy long scenic hikes in nature, which provide me the opportunity for deeper self-reflection.

I also have a keen interest in world history and cultural exchange. I enjoy learning about different cultures and traditions from around the world. This curiosity has led me to travel extensively and engage with people from diverse backgrounds, enriching my understanding of the world and fostering meaningful connections.

You can meet the rest of the 2024 BigDATAwire People to Watch here.

 
