Python Now a First-Class Language on Spark, Databricks Says
The Apache Spark community has improved support for Python to such a great degree over the past few years that Python is now a “first-class” language, and no longer a “clunky” add-on as it once was, Databricks co-founder and Chief Architect Reynold Xin said at Data + AI Summit last week. “It’s actually a completely different language.”
Python is the world’s most popular programming language, but that doesn’t mean it always plays well with others. In fact, many Python users have been dismayed over the years by its poor integration with Apache Spark, including a tendency for PySpark jobs to be “buggy.”
“Writing Spark jobs in Scala is the native way of writing it,” Airbnb engineer Zach Wilson said in a widely circulated video from 2021, which Xin shared on stage during his keynote last Thursday. “So that’s the way that Spark is most likely to understand your job, and it’s not going to be as buggy.”
Scala is a JVM language, so making sense of the stack traces that come out of Spark’s JVM is arguably more natural there than it is from Python. Other negatives Python developers have faced include cryptic error messages and non-Pythonic APIs, Xin said.
The folks at Databricks who lead the development of Apache Spark, including Xin (currently the number three committer to Spark), took those comments to heart and pledged to do something about Python’s poor integration and performance with Spark. The work commenced in 2020 around Project Zen, with the goal of providing a more, ah, soothing and copacetic experience for Python coders writing Spark jobs.
Project Zen has already resulted in better integration between Python and Spark. Over the years, various Zen-based features have been released, including a redesigned pandas UDF, better error reporting in Spark 3.0, and making PySpark “more Pythonic and user-friendly” in Spark 3.1.
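For readers who haven’t seen the redesign, the Spark 3.0-style pandas UDF is declared with ordinary Python type hints rather than the older UDF-type constants. A minimal sketch (the function and column names here are illustrative, not from Databricks’ examples):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()

# Since Spark 3.0, the Series -> Series UDF style is inferred from the
# Python type hints; no special pandas_udf "type" argument is needed.
@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()
```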
The work continued through Spark 3.4 and into Spark 4.0, which was released to public preview on June 3. According to Xin, all the investments in Zen are paying off.
“We got to work three years ago at this conference,” Xin said during his keynote last week in San Francisco. “We talked about the Project Zen initiative by the Apache Spark community and it really focuses on the holistic approach to make Python a first-class citizen. And this includes API changes, including better error messages, debuggability, performance improvement–you name it. It incorporates almost every single aspect of the development experience.”
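To put the error-message point in concrete terms: since Spark 3.4, PySpark exceptions carry structured error classes that can be inspected from Python, rather than surfacing a raw JVM stack trace. A small illustrative sketch (the specific error class shown is an assumption, not from Xin’s keynote):

```python
from pyspark.sql import SparkSession
from pyspark.errors import PySparkException

spark = SparkSession.builder.appName("errors-sketch").getOrCreate()

try:
    spark.sql("SELECT no_such_function(1)")
except PySparkException as e:
    # Since Spark 3.4, PySpark errors expose a structured error class
    # and message parameters instead of only a JVM stack trace.
    print(e.getErrorClass())          # e.g. "UNRESOLVED_ROUTINE" (assumed)
    print(e.getMessageParameters())
```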
The PySpark community has developed so many capabilities that PySpark is no longer the buggy bolt-on it once was. In fact, Xin says so much improvement has been made that, in some areas, Python support has overtaken Scala’s in terms of capabilities.
“This slide summarizes a lot of the key important features for PySpark in Spark 3 and Spark 4,” Xin said. “And if you look at them, it really tells you Python is no longer just a bolt-on onto Spark, but rather a first-class language.”
In fact, there are many Python features that are not even available in Scala, Xin said, such as defining a user-defined function (UDF) in Python and using it to connect to arbitrary data sources. “This is actually a much harder thing to do in Scala,” he said.
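Xin appears to be describing the Python data source API introduced in the Spark 4.0 line, which lets a data source be written entirely in Python. A rough sketch under that assumption (the source name and rows are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceReader

class FakeUsersDataSource(DataSource):
    """A toy batch data source implemented entirely in Python."""

    @classmethod
    def name(cls):
        return "fake_users"  # the name passed to spark.read.format(...)

    def schema(self):
        return "id INT, name STRING"

    def reader(self, schema):
        return FakeUsersReader()

class FakeUsersReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as plain tuples matching the declared schema;
        # a real source could call any Python library here.
        yield (1, "alice")
        yield (2, "bob")

spark = SparkSession.builder.appName("py-datasource-sketch").getOrCreate()
spark.dataSource.register(FakeUsersDataSource)
spark.read.format("fake_users").load().show()
```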
The enhancements undoubtedly will help the PySpark community get more work done. Python was already the most popular language in Spark before the latest batch of improvements (and Databricks and the Apache Spark community aren’t done). So it’s interesting to note the level of usage that Python-developed jobs are getting on the Databricks platform, which is one of the biggest big data systems on the planet.
According to Xin, an average of 5.5 billion PySpark queries on Spark 3.3 run on Databricks every single day. The computer science PhD says that that workload, from one Spark language on one version of Spark, exceeds the query volume of any other data warehousing platform on the planet.
“I think the leading cloud data warehouse runs about 5 billion queries per day on SQL,” Xin said. “This is matching that number. And it’s just a small portion of the overall PySpark” ecosystem.
Python support in Spark has improved so much that it even gained the approval of Wilson, the Airbnb data engineer. “Things have changed in the data engineering space,” Wilson said in another video shared by Xin on the Data + AI Summit stage. “The Spark community has gotten a lot better at supporting Python. So if you are using Spark 3, the differences between PySpark and Scala Spark in Spark 3 is, there really isn’t very much difference at all.”
Databricks CEO Ali Ghodsi doesn’t do a lot of coding these days, but he still fires up the old IDE on occasion. When he tried installing PySpark recently, he found the process far easier than it used to be.
“You can just go to any terminal and say ‘Pip, install PySpark,’ and that’s it. It’ll just install the whole thing,” Ghodsi said last week. “It’s hugely different from 10 years ago. We would have to set up the servers and the daemons and all of that, and configure it and use it in local mode. Just ‘Pip, install PySpark.’”
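For the curious, that two-word install really is enough for a local sandbox, with one caveat: PySpark still drives a JVM under the hood, so a Java runtime must already be on the machine. A quick smoke test after `pip install pyspark`:

```python
# Assumes `pip install pyspark` has run and a Java runtime is installed.
from pyspark.sql import SparkSession

# local[*] runs Spark inside this Python process on all CPU cores;
# no servers or daemons to configure, unlike the setup Ghodsi recalls.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
spark.range(5).show()
spark.stop()
```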
Related Items:
Databricks Unveils LakeFlow: A Unified and Intelligent Tool for Data Engineering
Spark Gets Closer Hooks to Pandas, SQL with Version 3.2