Top 23 Big Data platforms
Last updated: March 14, 2020
Big Data platforms make it possible to manage and analyse data sets so large and complex that they become difficult to process using on-hand data management tools or traditional data processing applications. They use a network of computers to solve problems involving massive amounts of data and computation. Big Data platforms can be deployed in a local data center or used from the Cloud (Big Data as a Service).
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
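One of the "simple programming models" associated with Hadoop is MapReduce. The following is a minimal single-process sketch of the idea in plain Python, counting words. It does not use the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for illustration. On a real cluster the framework would run the map and reduce steps on many machines and do the grouping between them.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data platforms", "Big Data as a Service"]))
```

Because each map call only sees one record and each reduce call only sees one key, the steps can in principle be spread across machines without changing the logic.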
Apache Spark is a fast and general engine for large-scale data processing. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Write applications quickly in Java, Scala or Python. Combine SQL, streaming, and complex analytics.
Snowflake is the only data platform built for the cloud for all your data & all your users. Learn more about our purpose-built SQL cloud data warehouse.
Cloudera helps you become information-driven by leveraging the best of the open source community with the enterprise capabilities you need to succeed with Apache Hadoop in your organization. Designed specifically for mission-critical environments, Cloudera Enterprise includes CDH, the world’s most popular open source Hadoop-based platform, as well as advanced system management and data management tools plus dedicated support and community advocacy from our world-class team of Hadoop developers and experts. Cloudera is your partner on the path to big data.
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. You can start small for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of most other data warehousing solutions.
Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
Teradata Aster features the Teradata Aster SQL-GR analytic engine, a native graph processing engine for graph analysis across big data sets. Using this next generation analytic engine, organizations can easily solve complex business problems such as social network/influencer analysis, fraud detection, supply chain management, network analysis and threat detection, and money laundering.
Unified Data Analytics Platform - One cloud platform for massive scale data engineering and collaborative data science.
Amazon EMR is a service that uses Apache Spark and Hadoop, open-source frameworks, to quickly & cost-effectively process and analyze vast amounts of data.
Presto is a highly parallel and distributed query engine for big data that is built from the ground up for efficient, low latency analytics.
Apache Impala is a modern, open source, distributed SQL query engine for Apache Hadoop.
BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and AI Platform built in.
HDInsight is a Hadoop distribution powered by the cloud. This means HDInsight was architected to handle any amount of data, scaling from terabytes to petabytes on demand. You can spin up any number of nodes at any time. We charge only for the compute and storage you actually use.
Vertica offers organizations new and faster ways to store, explore and serve more data. Vertica lets organizations store data cost-effectively, explore it quickly and leverage well-known SQL-based tools to get customer insights. By offering blazingly-fast speed, accuracy and security, it offers operational advantages to the entire organization.
Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage. Get faster insights without the overhead (data loading, schema creation and maintenance, transformations, etc.). Analyze the multi-structured and nested data in non-relational datastores directly without transforming or restricting the data.
The MapR Distribution for Apache Hadoop provides organizations with an enterprise-grade distributed data platform to reliably store and process big data. MapR packages a broad set of Apache open source ecosystem projects enabling batch, interactive, or real-time applications. The data platform and the projects are all tied together through an advanced management console to monitor and manage the entire system.
Qubole is a Big Data as a Service (BDaaS) platform running on leading cloud offerings like AWS. Qubole enables you to utilize a variety of cloud databases and sources, including S3, MySQL, Postgres, Oracle, Redshift, MongoDB, Vertica, Omniture, Google Analytics, and your on-premises data.
Build, deploy, and run data processing pipelines that scale to solve your key business challenges. Google Cloud Dataflow enables reliable execution for large scale data processing scenarios such as ETL, analytics, real-time computation, and process orchestration.
Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service designed to easily and cost effectively process big datasets. You can quickly create managed clusters of any size and turn them off when you are finished, so you only pay for what you need. Cloud Dataproc is integrated across several Google Cloud Platform products, so you have access to a simple, powerful, and complete data processing platform.
SAP HANA converges database and application platform capabilities in-memory to transform transactions, analytics, text analysis, predictive and spatial processing so businesses can operate in real-time.
IBM Netezza appliances - expert integrated systems with built in expertise, integration by design and a simplified user experience. With simple deployment, out-of-the-box optimization, no tuning and minimal on-going maintenance, the IBM PureData System for Analytics has the industry’s fastest time-to-value and lowest total-cost-of-ownership.
1010data provides a cloud-based platform for big data discovery and data sharing that delivers actionable, data-driven insights quickly and easily. 1010data offers a complete suite of products for big data discovery and data sharing for both business and technical users. Companies look to 1010data to help them become data-driven enterprises.
Latest news about Big Data platforms
2020. BackboneAI scores $4.7M seed to bring order to intercompany data sharing
BackboneAI, an early-stage startup that wants to help companies dealing with lots of data, particularly coming from a variety of external sources, announced a $4.7 million seed investment. BackboneAI is an AI platform specifically built for automating data flows within and between companies. This could involve any number of scenarios from keeping large, complex data catalogues up-to-date to coordinating the intricate flow of construction materials between companies or content rights management across an entertainment industry.
2019. Starburst raises $22M to modernize data analytics with Presto
Starburst, the company that’s looking to monetize the open-source Presto distributed query engine for big data (which was originally developed at Facebook), has announced that it has raised a $22 million funding round. The general idea behind Presto is to allow anybody to use the standard SQL query language to run interactive queries against a vast amount of data that can sit in a variety of sources. Starburst plans to monetize Presto by adding a number of enterprise-centric features on top, with the obvious focus being security features like role-based access control, as well as connectors to enterprise systems like Teradata, Snowflake and DB2, and a management console where users can configure the cluster to auto-scale, for example.
2019. HPE acquires big data platform MapR
Hewlett Packard Enterprise has acquired MapR Technologies, the distributor of a Hadoop-based data analytics platform. The deal includes MapR’s technology, intellectual property, and domain expertise in AI, machine learning, and analytics data management. The MapR portfolio will bolster HPE’s existing big data offerings, which include the BlueData software it acquired in November. BlueData’s software delivers a container-based approach for spinning up and managing Hadoop, Spark, and other environments on bare metal, cloud, or hybrid platforms. The MapR platform provides a number of capabilities for running distributed applications. The software exposes an S3 storage API, to go along with APIs for HDFS, POSIX, NFS, and Kafka.
2018. Big Data platforms Cloudera and Hortonworks merge
Over the years, Hadoop, the once high-flying open-source platform, gave rise to many companies and an ecosystem of vendors emerged. The problem with Hadoop was the sheer complexity of it. That’s where companies like Hortonworks and Cloudera came in. They packaged it for IT departments that wanted the advantage of a big data processing platform, but didn’t necessarily want to build Hadoop from scratch. These companies offered different ways of helping to attack that complexity, but over time, with all the cloud-based big data solutions, rolling a Hadoop system seemed futile, even with the help of companies like Cloudera and Hortonworks. Today the two companies announced they are merging in a deal worth $5.2 billion. The combined companies will boast 2,500 customers, $720 million in revenue and $500 million in cash with no debt, according to the companies.
2015. MapR tries to separate from Hadoop
MapR is one of several companies built on the open source Hadoop platform, and as such it has a bit of competition in the space. In an effort to create some separation from its better-heeled rivals, it announced a new product called MapR Streams. This new product handles a constant stream of data, such as feeding consumer data to advertisers to create custom offers or distributing health data to medical professionals to tailor medication or treatment options, all in near real time. Streams lets customers share data sources with people or machines that need to make use of that information in a subscription-style model. A maintenance program could subscribe to the data coming from the shop floor of a manufacturer and learn about usage, production, bottlenecks and wear and tear, or IT could subscribe to a data stream with log information looking for anomalies that signal maintenance issues or a security breach.
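The subscription-style model described above can be illustrated with a very small in-memory publish/subscribe sketch. This is not the MapR Streams API. The Stream class, its methods and the topic names are hypothetical and only show the general pattern of consumers registering interest in a named data source.

```python
from collections import defaultdict


class Stream:
    """Toy in-memory publish/subscribe hub (illustrative only)."""

    def __init__(self):
        # Maps a topic name to the list of callbacks subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callable to be invoked for every message on a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to all subscribers; return how many received it."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(message)
        return len(callbacks)


if __name__ == "__main__":
    stream = Stream()
    readings = []
    # A maintenance program subscribes to shop-floor data (hypothetical topic).
    stream.subscribe("shop-floor", readings.append)
    stream.publish("shop-floor", {"machine": 7, "wear": 0.8})
    print(readings)
```

A production streaming system would add persistence, ordering and delivery across machines, none of which this sketch attempts.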
2015. Google launched new managed Big Data service Cloud Dataproc
Google is adding another product to its range of big data services on the Google Cloud Platform: the Cloud Dataproc service, which sits between managing the Spark data processing engine or Hadoop framework directly on virtual machines and a fully managed service like Cloud Dataflow, which lets you orchestrate your data pipelines on Google’s platform. Dataproc users will be able to spin up a Hadoop cluster in under 90 seconds, significantly faster than other services, and Google will only charge 1 cent per virtual CPU/hour in the cluster. That’s on top of the usual cost of running virtual machines and data storage, but you can add Google’s cheaper preemptible instances to your cluster to save a bit on compute costs. Billing is per-minute, with a 10-minute minimum. Because Dataproc can spin up clusters this fast, users will be able to set up ad-hoc clusters when needed and because it is managed, Google will handle the administration for them.
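As a rough worked example of the pricing quoted above (1 cent per virtual CPU per hour, per-minute billing, 10-minute minimum), the Dataproc surcharge for a short-lived cluster could be estimated as follows. The function and constant names are hypothetical, rounding partial minutes up is an assumption, and the figures are the 2015 launch numbers from this news item, not current pricing. The cost of the underlying virtual machines and storage is not included.

```python
import math

# Figures quoted in the 2015 announcement; treat as historical, not current.
FEE_PER_VCPU_HOUR_USD = 0.01
MINIMUM_BILLED_MINUTES = 10


def dataproc_fee(vcpus, runtime_minutes):
    """Estimate the Dataproc surcharge in USD for one cluster run.

    Assumption: partial minutes are rounded up to the next whole minute.
    """
    billed_minutes = max(MINIMUM_BILLED_MINUTES, math.ceil(runtime_minutes))
    return vcpus * FEE_PER_VCPU_HOUR_USD * billed_minutes / 60


if __name__ == "__main__":
    # A 16-vCPU cluster that runs for 3 minutes is still billed for 10.
    print(f"{dataproc_fee(16, 3):.4f} USD")
    # The same cluster running for 90 minutes.
    print(f"{dataproc_fee(16, 90):.4f} USD")
```

Under these assumptions the surcharge for an ad-hoc cluster stays at a few cents, which is the point the announcement makes about spinning clusters up only when needed.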
2015. Hortonworks acquired dataflow solutions developer Onyara
Hortonworks, a publicly traded company selling a commercial distribution of the Hadoop open-source big data software, announced today that it has acquired Onyara, an early-stage startup whose employees developed Apache NiFi, a piece of open-source software that was first used inside the National Security Agency (NSA). Apache NiFi makes it possible to deliver sensor data to the right systems and keep track of what was happening to the data. Hortonworks, which itself spun out of Yahoo, has previously acquired XA Secure and SequenceIQ. Now Hortonworks will be selling a new subscription based on the Apache NiFi software, under the name Hortonworks DataFlow.
2015. Data transformation service Tamr raised $25.2 million
Tamr, the startup that helps companies understand and unify all of the disparate databases across a company, announced a $25.2 million Series B round today. Tamr wants to have the same impact on the enterprise that Google had on the web. Instead of having an algorithm that goes out and finds web pages, the Tamr algorithm goes out and finds databases. The problem for larger companies today is that they have all of these databases and have no idea what data they have. Not knowing what you have is a dangerous situation because data can walk out the door and the company will have no idea it happened. Tamr can create a central catalogue of all these data sources (and spreadsheets and logs) spread out across the company and give greater visibility into what exactly a company has. This has value on so many levels, but especially on a security level in light of all the recent high-profile breaches.
2015. IBM bets on big data Apache Spark project
IBM has announced that it would devote 3500 researchers to the open source big data project Apache Spark. It also announced that it was open sourcing its own IBM SystemML machine learning technology in a move designed to help push it to the forefront of big data and machine learning. These two technologies are part of the IBM transformation strategy that includes cloud, big data, analytics and security as its pillars. As part of today’s announcement, IBM has pledged to build Spark into the core of its analytics products and will work with Databricks, the commercial entity created to support the open source Spark project. IBM isn’t just giving all of these resources away out of largesse. It wants to be a part of this community because it sees these tools as the foundation for big data moving forward. If it can show itself to be a committed member to the open source project, it gives it clout with companies who are working on big data and machine learning projects using open source tools — and that opens the door to consulting services and other business opportunities for Big Blue.
2015. MapR adds Apache Drill to its Hadoop distribution
MapR announced that its Hadoop distribution now ships with Apache Drill - an open source, low latency SQL query engine for Hadoop and NoSQL. Its promise is that it makes it easier for end users to interact with data from both legacy transactional systems and new data sources, such as Internet of Things (IoT) sensors, web click-streams and other semi-structured data, along with support for popular business intelligence (BI) and data visualization tools. Apache Drill 1.0, which is now included in MapR’s distro, is free for the taking. So should a competitor, like Hortonworks, who has at least one contributor on the project, find it extremely valuable, they can engineer it into their distro as well.
2015. Google launched NoSQL database Cloud Bigtable
Google is launching a new NoSQL database, Cloud Bigtable, based on the company’s Bigtable data storage system that powers the likes of Gmail, Google Search and Google Analytics, so this is definitely a battle-tested service. Google promises that Cloud Bigtable will offer single-digit millisecond latency and 2x the performance per dollar when compared to the likes of HBase and Cassandra. Because it supports the HBase API, Cloud Bigtable can be integrated with all the existing applications in the Hadoop ecosystem, but it also supports Google’s Cloud Dataflow. This is not Google’s first cloud-based NoSQL database product. With Cloud Datastore, Google already offers a high-availability NoSQL datastore for developers on its App Engine platform. That service, too, is based on Bigtable. Cory O’Connor, a Google Cloud Platform product manager, tells me Cloud Datastore focuses on read-heavy workloads for web apps and mobile apps.
2015. MapR revamps its Hadoop platform with more real-time analytics
The latest release of MapR’s enterprise-grade distributed Hadoop data platform is built for the real time, data-centric enterprise. It leverages table replication features designed to extend access to “big and fast” data enabling multiple instances to be updated in different locations, with all the changes synchronized across them. Reacting to business as it happens with the right offer is a must. Wrong offers are not only missed opportunities but put enough of them together and they could threaten a company’s viability. That’s one of the reasons why some enterprises are ditching their RDBMS and going with MapR. It offers both a top-rated NoSQL database and Hadoop in a nicely bundled solution. MapR, unlike its competitors Hortonworks and Cloudera, is a software company whose aim is to make big data plug and play.
2015. Google partners with Cloudera to bring Cloud Dataflow to Apache Spark
Google announced that it has teamed up with the Hadoop specialists at Cloudera to bring its Cloud Dataflow programming model to Apache’s Spark data processing engine. With Google Cloud Dataflow, developers can create and monitor data processing pipelines without having to worry about the underlying data processing cluster. As Google likes to stress, the service evolved out of the company’s internal tools for processing large datasets at Internet scale. Not all data processing tasks are the same, though, and sometimes you may want to run a task in the cloud or on premise or on different processing engines. With Cloud Dataflow, in its ideal state, data analysts will be able to use the same system for creating their pipelines, no matter the underlying architecture they want to run them on.
2015. Teradata acquired app marketing platform Appoxee
Analytics company Teradata acquired (for about $20 million) Appoxee, an Israeli push-messaging startup aimed at publishers and developers that want to increase user engagement in their apps. Appoxee’s business addresses one of the bigger issues in the world of apps today: keeping users coming back to and using your app, in the face of those users downloading yet another new app instead, always moving on to the next big thing. Appoxee gives developers a way to address this using push messages: sending messages to you to remind you to finish playing a game, or to send you info about an app update, or coupons for goods in the app. It also has a platform to help build these push messaging campaigns.
2014. Teradata acquired data-archiving service RainStor
Data warehouse vendor Teradata continues to buy its way into Big Data leadership. It has made its fourth acquisition of the year, announcing on Wednesday it has bought data-archiving specialist RainStor for an undisclosed amount. RainStor builds an archival system that can sit on top of Hadoop and, it claims, compress data volumes by up to 95 percent. Taken as a whole with the company’s other acquisitions, including Hadapt and Think Big Analytics, it’s pretty clear that Teradata wants to play a bigger role in companies’ big data environments than just that of a data warehouse and business intelligence provider.
2014. Business analytics provider Palantir raises $50 Million
Palantir, the big data company, has raised another $50 million. Even at $9 billion, Palantir was already among Silicon Valley’s most valuable private technology companies, some of which have seen massive bumps in valuations recently. Palantir, co-founded by entrepreneur Peter Thiel in 2004, got its start selling its software, which looks for patterns across broad data sets, to government agencies like the CIA and NSA. The company expects to end the year with over $1 billion in revenue and is working to expand its customer base. It is selling data analysis technology to Wall Street firms that want to detect fraud and to pharmaceutical companies looking to expedite the development of new drugs. Hershey has been using Palantir’s tools to find correlations between weather patterns and consumer behavior.
2014. After IPO, Hortonworks is a $1 billion Hadoop company
Shares for Hadoop vendor Hortonworks finished their first day of trading at $26.48, so the company’s total market cap is $1.1 billion at the close of trading Friday. Hortonworks offers only open source software and makes its money on support and services. Hortonworks was founded and launched in 2011, after a group of engineers spun the company out from Yahoo, which had been driving much of the work on the open source Apache Hadoop project. But the stock rallied late in a trading day that was awful for most major stocks. No doubt Cloudera and MapR, Hortonworks’ two largest rivals in the pure-play Hadoop space, will be watching the company’s stock closely over the coming months. MapR also claims a private-market valuation of more than $1 billion, while Cloudera’s valuation is more than $4 billion.
2014. Big Data as a Service company Qubole raises $13 million
Hadoop-as-a-service startup Qubole has raised a $13 million series B round of venture capital. Qubole is hosted on the Amazon Web Services cloud, but can also run on Google Compute Engine, and acts like one might expect a cloud-native Hadoop service to act. It has a graphical user interface, connectors to several common data sources (including cloud object stores), and it takes advantage of cloud capabilities such as autoscaling and spot pricing for compute. What’s interesting about Qubole is that although it originally boasted optimized versions of Hive and other MapReduce-based tools, the company also lets users analyze data using the Facebook-created Presto SQL-on-Hadoop engine, and is working on a service around the increasingly popular and very fast Apache Spark framework.
2014. MapR partners with Teradata to reach enterprise customers
The last independent Hadoop provider MapR and big data analytics provider Teradata announced that they will work together to integrate and co-develop their joint products and to create a unified go-to-market strategy. Teradata will also be able to resell MapR software and professional services, and to provide customer support. In other words, Teradata will be the face of MapR to enterprises who use, or want to use, both technologies. Until recently Teradata partnered most closely with Hortonworks, but now it’s sharing love and its analytic market leadership with all three providers. Similarly, earlier this week, HP announced Vertica for SQL on Hadoop, which allows users to access and explore data residing in any of the three primary Hadoop distros — Hortonworks, MapR, Cloudera.
2014. HP plugs the Vertica analytics platform into Hadoop
HP announced Vertica for SQL on Hadoop. Vertica is an analytics platform that enables customers to access and explore data residing in any of the three primary Hadoop distros — Hortonworks, MapR, Cloudera — or any combination thereof. Large companies are often using all three kinds of Hadoop because they don’t know which will be dominant. HP is one of the first big vendors to say “any flavor of Hadoop will do” by taking action, though it has invested $50 million in Hortonworks which is, at present, the flavor of Hadoop inside HAVEn, its analytics stack. HP’s announcement centers not only around its interoperability, but also its power on data stored in a data lake, enterprise data hub, whatever you want to call it. HP now provides a seamless way to explore and exploit value in data that’s stored on the Hadoop Distributed File System (HDFS). The power, speed, and scalability of HP Vertica with the ease with which Hadoop lassos big data might persuade reticent managers to come out from underneath their desks and take big data on.