People to Watch 2016
Todd Lipcon
Software Engineer
Cloudera
Todd Lipcon is currently a Software Engineer at Cloudera. He joined the company in 2009 and led the development of Kudu, a new Hadoop storage engine that Cloudera released in 2015 to help speed analysis of fast-moving data. Prior to the Kudu project, Todd designed and developed several major features across the Hadoop ecosystem, including highly available metadata journaling (QJM) and automatic failover for HDFS. Before Cloudera, Todd was a Software Engineer at Amie Street. He received his Computer Science degree from Brown University.
Datanami: Hi Todd. Congratulations on being selected as a Datanami 2016 Person to Watch. The Hadoop market seems to be having a moment around real-time streaming with the rise of Cloudera’s Kudu and similar projects at its competitors. Why is real-time analytics emerging now?
Todd Lipcon: I think there are two factors that are causing the push towards more real-time analytics. The first is the so-called “internet of things”. I’m not a big fan of the term (computers aren’t “things”) but the trend is real: we have more and more devices (both computers and otherwise) generating data at an increasing rate. We want to use that data to make more rapid decisions that feed back into products in our everyday lives. For example, if sensors in an airplane’s engine start to indicate a potential fault, we’d obviously prefer to ground that plane before it’s midway over the Pacific.
I think the second factor is that big data analysis has become mainstream enough that enterprises are looking for the next big leg up they can get over their competition. Now that everyone is analyzing a lot of data, the next step is to do so more rapidly and turn data into business impact more quickly than ever before.
Datanami: Can you talk to us about the development of Kudu? What challenges did you face?
We started the Kudu project at Cloudera towards the end of 2012. So, even though it’s quite new to the public, it’s actually been in development for a reasonably long time compared to some other new open source projects that have been recently announced.
I started the project to address a couple of trends that I saw in our customer base and the larger big data community in 2012. The first was that people were starting to get interested in more real-time workloads: analysis that reflects up-to-the-second data. The second was that hardware was changing, with technologies like solid-state storage getting cheaper and RAM getting steadily larger. The first generation of Hadoop storage systems (HDFS and HBase) are terrific, mature pieces of software, but neither one really addresses these two trends head-on. Luckily, I was able to pitch the idea of building a new storage engine to the Cloudera leadership team and had the opportunity to kick off the Kudu project. (A short code sketch following this answer illustrates the kind of real-time workflow Kudu enables.)
One of the biggest challenges with a project like this is the long time it takes to reach what would be considered a “minimum viable product”. It probably took a year and a half of work before we really had anything resembling a distributed system, and another year on top of that until we had true multi-node support and fault tolerance. When working on such a long time scale, it’s important to keep a work-life balance and prevent burnout. We did our best to set intermediate milestones and build simple demos to track our progress and feel like we were getting closer to the eventual goal, despite having no customers or real workloads for three years. We’ve also been really fortunate to get a great team on the project: both transfers from other teams at Cloudera as well as some excellent new hires. Working with people you like is the best way to stay happy on a long-term project.
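For readers curious what the real-time workflow Todd describes looks like in practice, here is a minimal sketch using the Apache Kudu Java client: it writes a single row and then immediately scans it back for analysis. The master address, table name, and column names are hypothetical, the table is assumed to already exist, and the package and API names follow later Apache Kudu releases rather than the 2015 beta, so treat it as illustrative rather than definitive.

    import org.apache.kudu.client.*;

    public class KuduRealtimeSketch {
        public static void main(String[] args) throws KuduException {
            // Connect to a (hypothetical) Kudu master address.
            KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
            try {
                // Assumes a pre-existing table "metrics" with columns
                // (host STRING, ts BIGINT, value DOUBLE).
                KuduTable table = client.openTable("metrics");
                KuduSession session = client.newSession();

                // Insert a single row; with the session's default synchronous
                // flush mode, it is durable once apply() returns.
                Insert insert = table.newInsert();
                PartialRow row = insert.getRow();
                row.addString("host", "web01");
                row.addLong("ts", System.currentTimeMillis());
                row.addDouble("value", 0.42);
                session.apply(insert);
                session.close(); // flushes any pending operations

                // Immediately scan the table back: analysis over up-to-the-second data.
                KuduScanner scanner = client.newScannerBuilder(table).build();
                while (scanner.hasMoreRows()) {
                    for (RowResult result : scanner.nextRows()) {
                        System.out.println(result.rowToString());
                    }
                }
            } finally {
                client.shutdown();
            }
        }
    }

The point of the sketch is the contrast with first-generation Hadoop storage: because Kudu accepts low-latency row inserts that become visible to scans almost immediately, the same table can serve both the write path and up-to-the-second analysis, without the batch file-loading steps an HDFS-based pipeline would typically require.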
Datanami: Generally speaking, on the subject of big data, what do you see as the most important trends for 2016 that will have an impact now and into the future?
One of the most exciting new technologies that people should be looking at is 3D XPoint, a new persistent memory technology developed by Intel and announced late last year. This is a new type of device that will fit into a DIMM slot and offer performance that’s similar to DRAM for many applications, but at a lower price point, and with persistence across power loss. Just like we saw with NAND flash SSDs over the last decade, I think this technology will really transform the way we build and operate latency-sensitive workloads.
If we only looked at big data workloads from the last couple of years, we might not immediately think this super-low-latency storage is that relevant. But as I mentioned above, there’s a growing trend toward doing more and more analysis in real time, and this new hardware should be a great enabler for that type of workload. The Kudu team has been working with engineers from Intel so that when the hardware becomes generally available, we’ll be ready to take advantage of it. From an engineer’s perspective, that’s pretty cool!
Datanami: Outside of the professional sphere, what can you tell us about yourself – personal life, family, background, hobbies, etc.?
I grew up on the East Coast, but have lived in San Francisco since I started working on the Hadoop ecosystem at Cloudera in 2009. Outside of work, I enjoy cooking, playing piano and cello, and taking occasional camping trips around the Bay Area.
Datanami: Final question: Who is your idol in the big data industry and why?
I don’t have one single “idol”, but I’ve been fortunate to work with and learn a lot from some really great engineers. Michael Stack from the Apache HBase project has always been a model of how to run an inclusive and welcoming community. Doug Cutting, the founder of the Hadoop project, has shown me how devoting yourself to an area of work over the long term can lead to a huge and vibrant open-source ecosystem. And, of course, I think every distributed systems developer is a little bit of a Jeff Dean fanboy, thanks to the work he has published at Google.