Apache Hive vs Presto

May 18, 2023 | Author: Michael Stromann

Apache Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. HiveQL also lets traditional MapReduce programmers plug in custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL itself.
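
As a rough sketch of the custom-mapper route described above: Hive's TRANSFORM clause streams rows to an external script as tab-separated lines on standard input and reads tab-separated rows back from standard output. The script below is a hypothetical mapper, not code from the Hive project; the two-column layout (user_id, url), the function names and the file name used in the note afterwards are all assumptions made for illustration.

```python
import sys
from urllib.parse import urlparse


def map_row(line):
    """Turn one tab-separated row (user_id, url) into (user_id, host)."""
    # Assumption: every row has exactly two columns.
    user_id, url = line.rstrip("\n").split("\t")
    host = urlparse(url).netloc or "unknown"
    return f"{user_id}\t{host}\n"


def transform(lines):
    """Apply map_row to every non-empty input line, one output row per input row."""
    for line in lines:
        if line.strip():
            yield map_row(line)


# When Hive runs this file as a script, the final line would be:
#     sys.stdout.writelines(transform(sys.stdin))
```

Assuming the file is saved as `host_mapper.py` and a table named `page_views` exists, it could be wired in with `ADD FILE host_mapper.py;` followed by `SELECT TRANSFORM (user_id, url) USING 'python3 host_mapper.py' AS (user_id, host) FROM page_views;`.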

Presto

Presto is a highly parallel, distributed query engine for big data, built from the ground up for efficient, low-latency analytics.

Apache Hive and Presto are both popular open-source distributed query engines used for processing and analyzing large datasets, but they differ in their approach, performance, and use cases.

Apache Hive is built on top of Apache Hadoop and provides a SQL-like interface, HiveQL, for querying and analyzing data stored in distributed file systems such as the Hadoop Distributed File System (HDFS). Hive is designed for batch processing and gives users who already know SQL a familiar querying experience. It was originally built on the MapReduce framework for distributed processing (later releases can also run on engines such as Apache Tez) and is optimized for handling structured and semi-structured data at scale. Hive is commonly used for data warehousing and ETL (Extract, Transform, Load) processes in big data environments.
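
To make the batch model concrete, the sketch below mimics in plain Python how a query such as `SELECT page, COUNT(*) FROM page_views GROUP BY page` can be decomposed into map, shuffle and reduce steps. It is a conceptual illustration only, not Hive's actual execution plan; the table, the column name and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    """Mapper: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Shuffle: collect all values that share a key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical rows standing in for a page_views table.
page_views = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
counts = reduce_phase(shuffle_phase(map_phase(page_views)))
print(counts)
```

In a real cluster each phase runs across many machines and intermediate results are written to disk between stages, which is one reason this style of execution favors throughput over latency.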

Presto, on the other hand, is a distributed SQL query engine designed for interactive and ad-hoc querying. It supports querying data from various sources, including HDFS, cloud storage, and databases, using a federated architecture. Presto utilizes an in-memory processing model and parallel query execution to deliver fast query response times. It provides a wide range of SQL features, including complex joins, subqueries, and aggregations, and supports dynamic scaling to handle varying workloads. Presto is well-suited for interactive analytics, data exploration, and real-time dashboards.
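
The federated model can be illustrated with table naming: Presto addresses every table as catalog.schema.table, where each catalog is backed by a connector to one data source. The sketch below only builds the text of such a query; the catalog names (`hive`, `mysql`), schemas, tables and columns are hypothetical, and submitting the statement to a cluster is left out.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a fully qualified Presto table name (catalog.schema.table)."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical catalogs: "hive" backed by files in HDFS, "mysql" backed by a database.
orders = qualified("hive", "sales", "orders")
customers = qualified("mysql", "crm", "customers")

# A single statement joins both sources; the engine reads each through its connector.
query = f"""
SELECT c.region, SUM(o.amount) AS revenue
FROM {orders} AS o
JOIN {customers} AS c ON o.customer_id = c.id
GROUP BY c.region
ORDER BY revenue DESC
""".strip()

print(query)
```

Because the join is expressed in one SQL statement, there is no need to copy the database table into HDFS first, which is the main practical appeal of querying sources in place.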

See also: Top 10 Big Data platforms

Apache Hive vs Presto in our news:

2019. Starburst raises $22M to modernize data analytics with Presto

Starburst, the company seeking to commercialize the open-source Presto distributed query engine for big data (originally developed at Facebook), has raised $22 million in a funding round. The primary objective of Presto is to let anyone use standard SQL to run interactive queries on vast amounts of data stored across diverse sources. Starburst intends to monetize Presto by adding enterprise-oriented features. These will focus on security, such as role-based access control, and on connectors to enterprise systems like Teradata, Snowflake, and DB2. Starburst also plans to provide a management console that lets users configure the cluster for automatic scaling, among other functions.

Author: Michael Stromann

Michael is an expert in IT Service Management, IT Security and software development. With extensive experience as a software developer and active involvement in multiple ERP implementation projects, he brings practical knowledge to his writing. He previously worked at SAP, where he gained a deep understanding of software development and implementation processes. Now a freelance developer, Michael continues to contribute to the IT community through guest articles published on several IT portals. You can contact Michael by email at stromann@liventerprise.com.

1	Snowflake
2	ElasticSearch
3	Hadoop
4	Apache Spark
5	Apache Hive
6	Cloudera
7	Apache Cassandra
8	Amazon Redshift
9	Teradata
10	Databricks