Apache Kafka vs Apache Spark
Last updated: June 22, 2015
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization.
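The "distributed commit log" model can be illustrated with a minimal in-memory sketch: a topic is an append-only log, and each consumer group tracks its own read offset, so every group sees the full stream independently. This is a conceptual illustration only, not the Kafka client API (class and method names here are invented for the example):

```python
from collections import defaultdict

# Conceptual sketch of Kafka's core idea: a topic is an append-only
# commit log, and each consumer group advances its own offset.
# NOT the Kafka API -- an illustration of the model only.

class CommitLog:
    def __init__(self):
        self._topics = defaultdict(list)    # topic -> list of messages
        self._offsets = defaultdict(int)    # (group, topic) -> next offset

    def publish(self, topic, message):
        """Append a message to the topic's log; return its offset."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def consume(self, group, topic, max_messages=10):
        """Read from the group's current offset; each group advances independently."""
        offset = self._offsets[(group, topic)]
        batch = self._topics[topic][offset:offset + max_messages]
        self._offsets[(group, topic)] = offset + len(batch)
        return batch

log = CommitLog()
log.publish("clicks", {"user": "a", "page": "/home"})
log.publish("clicks", {"user": "b", "page": "/docs"})

# Two consumer groups each receive the full stream (publish-subscribe),
# while repeated reads within one group pick up where it left off.
print(log.consume("analytics", "clicks"))
print(log.consume("billing", "clicks"))
```

Because messages stay in the log after being read, the same data can serve many independent downstream systems, which is what makes a single cluster usable as a central data backbone.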
Apache Spark is a fast and general engine for large-scale data processing. It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Applications can be written quickly in Java, Scala, or Python, and can combine SQL, streaming, and complex analytics.
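Spark expresses computation as a chain of transformations over distributed datasets, the classic example being word count via flatMap, map, and reduceByKey. The following is a rough pure-Python sketch of that same pattern, not the Spark API itself; real Spark would partition the data across a cluster and shuffle by key:

```python
from collections import Counter

# Pure-Python sketch of the flatMap -> map -> reduceByKey pattern a
# Spark word count uses. Mimics the shape of the computation only;
# Spark distributes these steps across a cluster.

lines = ["to be or not to be", "to thine own self be true"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each word
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["to"], counts["be"])  # each word appears 3 times
```

In Spark proper, keeping intermediate datasets like `pairs` in memory across steps, rather than writing them to disk between jobs, is what enables the large speedups over Hadoop MapReduce.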
Apache Kafka vs Apache Spark in our news:
2015 - IBM bets on big data Apache Spark project
IBM has announced that it would devote 3,500 researchers to the open source big data project Apache Spark. It also announced that it was open sourcing its own IBM SystemML machine learning technology, in a move designed to push it to the forefront of big data and machine learning. These two technologies are part of the IBM transformation strategy that includes cloud, big data, analytics and security as its pillars. As part of the announcement, IBM pledged to build Spark into the core of its analytics products and to work with Databricks, the commercial entity created to support the open source Spark project. IBM isn't just giving all of these resources away out of largesse. It wants to be a part of this community because it sees these tools as the foundation for big data moving forward. If it can show itself to be a committed member of the open source project, that gives it clout with companies working on big data and machine learning projects using open source tools, which in turn opens the door to consulting services and other business opportunities for Big Blue.
2015 - Google partners with Cloudera to bring Cloud Dataflow to Apache Spark
Google announced that it has teamed up with the Hadoop specialists at Cloudera to bring its Cloud Dataflow programming model to Apache's Spark data processing engine. With Google Cloud Dataflow, developers can create and monitor data processing pipelines without having to worry about the underlying data processing cluster. As Google likes to stress, the service evolved out of the company's internal tools for processing large datasets at Internet scale. Not all data processing tasks are the same, though, and sometimes you may want to run a task in the cloud, on-premises, or on different processing engines. With Cloud Dataflow, in its ideal state, data analysts will be able to use the same system for creating their pipelines, no matter the underlying architecture they want to run them on.
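The portability idea described above, defining a pipeline once and executing it on different engines, can be sketched as a pipeline object that is just a list of transforms, handed to an interchangeable runner. All names here are invented for illustration; this is not the Cloud Dataflow SDK:

```python
# Hypothetical sketch of the Dataflow portability idea: the pipeline is
# a plain data structure (a sequence of transforms), and a "runner"
# decides how to execute it -- locally here, but the same definition
# could in principle target a different engine such as Spark.
# None of these names come from the actual Cloud Dataflow SDK.

class Pipeline:
    def __init__(self):
        self.steps = []

    def apply(self, transform):
        """Append a transform (a function over a dataset) to the pipeline."""
        self.steps.append(transform)
        return self

class LocalRunner:
    """Executes each transform eagerly on an in-memory list."""
    def run(self, pipeline, data):
        for step in pipeline.steps:
            data = step(data)
        return data

p = (Pipeline()
     .apply(lambda xs: [x.lower() for x in xs])         # normalize
     .apply(lambda xs: [x for x in xs if x != "spam"])  # filter
     )

result = LocalRunner().run(p, ["Hello", "SPAM", "World"])
print(result)  # ['hello', 'world']
```

Because the pipeline definition carries no knowledge of how it is executed, swapping `LocalRunner` for a Spark-backed runner would not require rewriting the pipeline, which is the portability Google and Cloudera are after.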