People to Watch 2018
Tyler Akidau
Senior Staff Software Engineer (Google)
Founding Member of Apache Beam PMC
Tyler Akidau is a senior staff software engineer at Google, where he is the technical lead for the Data Processing Languages & Systems group, responsible for Google’s Apache Beam efforts, Google Cloud Dataflow, and internal data processing tools like Google Flume, MapReduce, and MillWheel. He’s also a founding member of the Apache Beam PMC. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the main author of the Dataflow Model paper, the Streaming 101 and Streaming 102 articles, and the upcoming Streaming Systems book. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.
Datanami: Congratulations on being named a Datanami Person to Watch in 2018! Apache Beam has emerged as an important unifying framework for building applications that process data in batch and in real time. Why has this shift to a new continuous processing paradigm been so difficult?
Tyler Akidau: Thanks for the nomination and the chance to chat. I think there are a few aspects to the difficulty in shifting to a new approach.
Firstly, I like how you phrased the question as “continuous processing” paradigm, not stream processing. The focus for continuous processing is so often on low-latency streaming engines, but the fact remains that batch engines continue to provide an attractive alternative for a broad set of continuous processing use cases that don’t demand ultra-low latency. We’ve seen this at Google, as our batch engine (evolved from the original MapReduce, and available in our serverless data processing offering Google Cloud Dataflow) continues to advance the state of the art, with things like pervasive auto-tuning and dynamic scaling. From a cost/performance perspective, you just can’t beat that with a streaming engine, and as you deal with larger and larger data sets, that cost difference becomes all the more important. So folks continue to use our batch engine to solve a number of continuous processing problems where resource cost is the dominating constraint. In my opinion, we as an industry still have not embraced the full system-level integration of these two approaches (high-throughput batch and low-latency streaming) in a way that completely obviates the need for there to even be a choice. Until we do, broad adoption of a truly continuous processing paradigm will remain an unattainable goal.
Beyond that, continuous processing often incorporates the element of time into the equation. That added dimension alone increases the complexity considerably, and we as an industry haven’t always had the best tools available for dealing with time. And when you throw things like out-of-order data into the mix, it just complicates things even further. We’ve tried to improve things in this realm with our approach in Apache Beam, and I think we’ve made some good progress. But there’s always more room for improvement.
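(To make the time dimension he describes concrete, here is a minimal sketch, not taken from the interview, of how the Beam Python SDK expresses event-time windowing with triggers for out-of-order data; the sample events, window size, trigger settings, and lateness bound are all illustrative.)

# Minimal, illustrative sketch of event-time windowing with triggers in Beam's Python SDK.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

# (key, value, event-time seconds) -- the third field is when the event happened,
# which may differ from when it actually arrives at the pipeline.
events = [('user1', 5, 12.0), ('user2', 3, 47.0), ('user1', 2, 61.0)]

with beam.Pipeline() as p:
    _ = (
        p
        | 'Create' >> beam.Create(events)
        # Attach event-time timestamps so windowing follows when events occurred.
        | 'Stamp' >> beam.Map(lambda e: TimestampedValue((e[0], e[1]), e[2]))
        | 'Window' >> beam.WindowInto(
            FixedWindows(60),                                  # 1-minute event-time windows
            trigger=AfterWatermark(
                early=AfterProcessingTime(30),                 # speculative early results
                late=AfterCount(1)),                           # re-fire for each late element
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=300)                              # accept data up to 5 minutes late
        | 'Sum' >> beam.CombinePerKey(sum)
        | 'Print' >> beam.Map(print))

(On a small bounded input like this, the early and late firings mostly collapse into a single on-time result; the same code pointed at an unbounded source produces speculative, on-time, and late panes as the watermark advances.)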
Talking about low-latency stream processing in particular, I think the online aspect of streaming is less forgiving than the more offline approach of classic batch processing. With batch processing, if your job fails or you realize there’s a bug in the code, there’s often time to just re-run things once you’ve fixed the issue. With stream processing, at least for the use cases where you want answers quickly, there’s less margin for error.
And lastly and somewhat relatedly, from a historic perspective, the idea of re-running your streaming pipelines to deal with errors or changes in data wasn’t always something that was cleanly supported as a first-class operation. We’re still a ways off from it being ubiquitous and effortless, but if you’ve ever seen a demo that mixes together the replayability of Apache Kafka with the savepoints feature in Apache Flink, allowing for seamless stateful backfills, you can see we as an industry are heading in the right direction. A few more significant advances like that and life will be a lot easier for folks trying to adopt continuous processing.
Datanami: Beam was accepted as a top-level project by the Apache Software Foundation in late 2016. How does working with the ASF community compare to working at Google?
The thing I’ve liked most about working with the ASF community is the passion for collaboration. I’ve had the opportunity to interact closely with not only the Apache Beam community, but folks across a number of other Apache projects as well: Calcite, Flink, Apex, Spark, etc. There’s a lot of excitement to work together, do cool things, and see them move forward. That never gets old.
But perhaps more importantly, the vision for Beam is to provide a comprehensive portability framework for data processing pipelines, one that allows you to write your pipeline once in your language of choice (currently Java, Python, or SQL, someday Go and others) and run it with minimal effort on the execution engine of your choice (open source platforms like Spark, Flink, or Apex, as well as cloud-native systems like Google Cloud Dataflow and IBM Streams). That’s a remarkably massive vision, and it’s not something we as Googlers could reasonably build by ourselves. We absolutely need the support of the broader open source ecosystem, and we’re building a better system as a result.
It has been a shift for us, as Googlers, learning how to be effective and transparent participants in a fully open source development environment. We’ve done well in many ways, but we know we can do better. Our goal going forward is to have a consistent and open approach to any and all Beam-related efforts. There will always necessarily be a business motivation behind the projects Google funds in Beam (I was very open about this when we first co-founded the project), and that’s fine. The key point is that everything we do in Beam, we need to do in partnership with the community. No initial brainstorming internally, then delivering a mostly-baked design proposal for comments from the community. Everything in Beam is designed and built with the community, in the open. And we intend to track and react to project health metrics like code review latency just as we do internally at Google, to help ensure Beam is a place that’s accessible and enjoyable for anyone to contribute to. That’s our pledge going forward. Now we just have to make sure we live up to it.
Datanami: Some people say Apache Beam is amazing technology, but question its usefulness and say it’s ahead of its time. How do you respond to that?
There’s a non-zero chance I would laugh awkwardly and compliment them on their ability to combine flattery and skepticism in such vague and non-specific terms. If I were allowed a followup, I’d ask for more details.
But more seriously, because I’ve done a lot of advocacy around stream processing and the approach Beam takes, I think people sometimes view that as the core value that Beam brings to the table. If that’s the view someone has, and they don’t have a lot of experience with a broad set of streaming use cases, I can see why they might wonder if all of the sophisticated streaming features in Beam are really necessary; there are indeed a number of use cases well served by a subset of the functionality in Beam. But everything we add in Beam is motivated by a real-world use case.
That said, while I do think Beam continues to be at the forefront of streaming semantics, that’s not really why Beam itself exists as a project; that’s just a part of it that’s near and dear enough to my heart that I enjoy talking about it a lot. As I outlined above, Beam is first and foremost a portability framework. The goal of Beam is to be the programmatic equivalent of SQL, a common lingua franca that you can use to build pipelines that then run with minimal effort on multiple platforms. Even though Beam as a project has existed for over two years, we’re still not done with the full portability vision, and we still have work to do on our open source runners. Frankly, it’s a really non-trivial thing we’re trying to accomplish, and it’s taken a lot of work to get where we are. We expect the full portability layer to finally land this year, and we’re investing more heavily in the OSS runners going forward. So not only will Beam provide portability across runners (meaning you can write once, run anywhere on both open source and cloud native platforms), but also portability across languages. And when you can start running Beam Python pipelines on-premise with Flink and cloud-native with Cloud Dataflow, or Beam Go pipelines on Spark, with a solid and robust experience comparable to that of using a native API, I think the benefits of portability will start to become even more apparent for a much wider audience.
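(As a rough illustration of that portability story, the sketch below, not taken from the interview, shows how a single Beam Python pipeline leaves the choice of execution engine to command-line options; the runner names are the standard Beam ones, though their availability, particularly FlinkRunner for Python, depends on the SDK version, and the project and bucket values are placeholders.)

# Minimal, illustrative sketch of "write once, run anywhere": the pipeline code is
# runner-agnostic, and the engine is selected purely through pipeline options.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    # Example invocations (values are placeholders):
    #   --runner=DirectRunner                                  # local testing
    #   --runner=FlinkRunner                                   # on-premise Flink, where the SDK version supports it
    #   --runner=DataflowRunner --project=my-project --temp_location=gs://my-bucket/tmp
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        _ = (
            p
            | 'Create' >> beam.Create(['beam', 'is', 'portable'])
            | 'Upper' >> beam.Map(str.upper)
            | 'Print' >> beam.Map(print))


if __name__ == '__main__':
    run()

(The same file, unchanged, is what gets handed to each runner; only the options differ.)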
Datanami: What do you hope to see from the big data community in the coming year?
From a stream processing perspective, I want to see us continue to move towards making things more robust and easy to use. Building and maintaining streaming pipelines remains a struggle in many ways. I want to see more magical things like savepoints in Flink: simple but powerful abstractions that make life easier. There’s lots of room for improvement still. I tend to give talks in open source arenas, so I don’t often advocate for Google’s Cloud Dataflow service directly. But if you’re looking for some of that magic, with a true no-ops experience, you should come try it out. Our team has built an incredible service, for both batch and streaming. It feels a bit like operating in the future. And we have more great stuff on the way this year.
From an open source perspective, I’d like to see the industry move towards a more standard way of specifying pipelines. There’s often very little value in the differences between one API or another for building pipelines; most of the magical goodness is in the engines themselves. So having to rewrite things just to try out a new or different technology is a waste of everyone’s time. If we do things right, and incorporate the API differences that are beneficial along the way, Beam will be in a good position to become that standard.
And lastly, I’m especially excited to see how the things we have coming in Beam this year can enable the big data community to start doing things that were previously just impossible. I previously alluded to the completion of the full portability framework coming this year. That’s going to set the stage for us to provide things like our nascent Go SDK (which will work on any runner built on the full portability framework, including JVM-based engines like Spark or Flink), or SDKs in languages that are often poorly represented in data processing, like Javascript (for better or worse ;-). We’ll also be able to start providing mixed-language pipelines, so that you can take that really awesome and performant data connector written in Go and use it seamlessly from within your Python pipeline, all running on a top-tier stream processing engine written in Java. There are a lot of cool practicalities that this is going to open up, and I’m excited to see the big data community take it and run with it.
Datanami: Outside of the professional sphere, what can you share about yourself that your colleagues might be surprised to learn – any unique hobbies or stories?
My parents gave me a Super Nintendo Classic for Christmas, and it’s been fun to see my daughters (ages four and six) get so excited to learn how to play those games. I think my four year old may have already surpassed my skills at Kirby Super Star. And we recently started playing Mario Kart as a family on the Nintendo Switch, which is a lot of fun and kind of adorable now that I write it down. I actually missed my flight to QCon London earlier this week because I stayed home playing Mario Kart with them too long. I ended up having to spend the next 24 hours flying on a different airline across multiple legs to get there on time. Lesson learned.