Apache Spark vs Hadoop
Last updated: June 22, 2015
Apache Spark is a fast, general-purpose engine for large-scale data processing. According to the project, programs can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Applications can be written in Java, Scala or Python, and can combine SQL, streaming, and complex analytics.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
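To make the phrase "simple programming models" more concrete, here is a minimal sketch of the MapReduce idea as a word count in plain Python. It runs on a single machine and uses neither the Hadoop nor the Spark API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the list of counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Spark and Hadoop", "Hadoop scales out"]))
```

In a real cluster the map and reduce functions would run in parallel on many machines, and the framework would handle data distribution and failure recovery. This sketch only shows the shape of the code a developer writes.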
Apache Spark vs Hadoop in our news:
2015 - IBM bets on big data Apache Spark project
IBM has announced that it will devote 3,500 researchers to the open source big data project Apache Spark. It also announced that it is open sourcing its own IBM SystemML machine learning technology in a move designed to help push it to the forefront of big data and machine learning. These two technologies are part of the IBM transformation strategy, which has cloud, big data, analytics and security as its pillars. As part of the announcement, IBM has pledged to build Spark into the core of its analytics products and will work with Databricks, the commercial entity created to support the open source Spark project. IBM isn’t giving all of these resources away out of largesse. It wants to be a part of this community because it sees these tools as the foundation for big data moving forward. If it can show itself to be a committed member of the open source project, it gains clout with companies that are working on big data and machine learning projects using open source tools, and that opens the door to consulting services and other business opportunities for Big Blue.
2015 - Google partners with Cloudera to bring Cloud Dataflow to Apache Spark
Google announced that it has teamed up with the Hadoop specialists at Cloudera to bring its Cloud Dataflow programming model to the Apache Spark data processing engine. With Google Cloud Dataflow, developers can create and monitor data processing pipelines without having to worry about the underlying data processing cluster. As Google likes to stress, the service evolved out of the company’s internal tools for processing large datasets at Internet scale. Not all data processing tasks are the same, though, and sometimes you may want to run a task in the cloud, on premises, or on different processing engines. With Cloud Dataflow, in its ideal state, data analysts will be able to use the same system for creating their pipelines, no matter which underlying architecture they want to run them on.
2014 - MapR partners with Teradata to reach enterprise customers
MapR, the last independent Hadoop provider, and Teradata, a big data analytics provider, announced that they will work together to integrate and co-develop their joint products and to create a unified go-to-market strategy. Teradata will also be able to resell MapR software and professional services, and to provide customer support. In other words, Teradata will be the face of MapR to enterprises that use, or want to use, both technologies. Until recently Teradata partnered most closely with Hortonworks, but now it is sharing the love, and its analytics market leadership, with all three providers. Similarly, earlier the same week, HP announced Vertica for SQL on Hadoop, which allows users to access and explore data residing in any of the three primary Hadoop distros: Hortonworks, MapR, and Cloudera.
2014 - HP plugs the Vertica analytics platform into Hadoop - a new advantage over Amazon Redshift
HP announced Vertica for SQL on Hadoop. Vertica is an analytics platform that enables customers to access and explore data residing in any of the three primary Hadoop distros (Hortonworks, MapR, Cloudera) or any combination thereof. Large companies often use all three kinds of Hadoop because they don’t know which will be dominant. HP is one of the first big vendors to say “any flavor of Hadoop will do” by taking action, even though it has invested $50 million in Hortonworks, which is, at present, the flavor of Hadoop inside HAVEn, its analytics stack. HP’s announcement centers not only on interoperability, but also on Vertica’s power over data stored in a data lake, an enterprise data hub, or whatever you want to call it. HP now provides a seamless way to explore and exploit value in data that’s stored on the Hadoop Distributed File System (HDFS). The power, speed, and scalability of HP Vertica, combined with the ease with which Hadoop lassos big data, might persuade reluctant managers to come out from underneath their desks and take big data on.
2014 - Cloudera helps to manage Hadoop on Amazon cloud
Hadoop vendor Cloudera announced a new product called Director that will make it easier for customers to manage their Hadoop clusters on the Amazon Web Services cloud. Senior Director of Product Marketing Clarke Patterson acknowledged that this has not been easy to date while still maintaining the breadth of capabilities. Although there’s no difference between the cloud version and the on-premises version of the software, he added, the Director interface is designed to be self-service and includes cloud-specific capabilities such as instance tracking, so administrators can keep an eye on whose cloud instances are costing what.