Apache Spark vs Presto
Last updated: November 20, 2019
Apache Spark is a fast and general engine for large-scale data processing. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Write applications quickly in Java, Scala or Python. Combine SQL, streaming, and complex analytics.
Presto is a highly parallel and distributed query engine for big data, that is built from the ground up for efficient, low latency analytics.
Apache Spark vs Presto in our news:
2019 - Starburst raises $22M to modernize data analytics with Presto
Starburst, the company that’s looking to monetize the open-source Presto distributed query engine for big data (which was originally developed at Facebook), has announced that it has raised a $22 million funding round. The general idea behind Presto is to allow anybody to use the standard SQL query language to run interactive queries against a vast amount of data that can sit in a variety of sources. Starburst plans to monetize Presto by adding a number of enterprise-centric features on top, with the obvious focus being security features like role-based access control, as well as connectors to enterprise systems like Teradata, Snowflake and DB2, and a management console where users can configure the cluster to auto-scale, for example.
2015 - IBM bets on big data Apache Spark project
IBM has announced that it would devote 3500 researchers to the open source big data project Apache Spark. It also announced that it was open sourcing its own IBM SystemML machine learning technology in a move designed to help push it to the forefront of big data and machine learning. These two technologies are part of the IBM transformation strategy that includes cloud, big data, analytics and security as its pillars. As part of today’s announcement, IBM has pledged to build Spark into the core of its analytics products and will work with Databricks, the commercial entity created to support the open source Spark project. IBM isn’t just giving all of these resources away out of largesse. It wants to be a part of this community because it sees these tools as the foundation for big data moving forward. If it can show itself to be a committed member to the open source project, it gives it clout with companies who are working on big data and machine learning projects using open source tools — and that opens the door to consulting services and other business opportunities for Big Blue.
2015 - Google partners with Cloudera to bring Cloud Dataflow to Apache Spark
Google announced that it has teamed up with the Hadoop specialists at Cloudera to bring its Cloud Dataflow programming model to Apache’s Spark data processing engine. With Google Cloud Dataflow, developers can create and monitor data processing pipelines without having to worry about the underlying data processing cluster. As Google likes to stress, the service evolved out of the company’s internal tools for processing large datasets at Internet scale. Not all data processing tasks are the same, though, and sometimes you may want to run a task in the cloud or on premise or on different processing engines. With Cloud Dataflow — in its ideal state — data analysts will be able use the same system for creating their pipelines, no matter the underlying architecture they want to run them on.