DBeaver is a universal database tool. It is a free multi-platform database tool for developers, database administrators, analysts and all people who need to work with databases. Connect to Presto from popular data migration, ESB, iPaaS, and BPM tools. Our drivers and adapters provide straightforward access to Presto data from popular applications like BizTalk, MuleSoft, SQL SSIS, Microsoft Flow, Power Apps, Talend, and many more. Popular Workflow & Automation Tool Integrations.
17 Oct 2020Presto is an open source distibruted query engine built for Big Data enabling high performance SQL access to a large variety of data sources including HDFS, PostgreSQL, MySQL, Cassandra, MongoDB, Elasticsearch and Kafka among others.
Update 6 Feb 2021: PrestoSQL is now rebranded as Trino. Read more about it here. If you installed PrestoSQL before, have a look at the migration guide. This tutorial was done using PrestoDB 0.242 and PrestoSQL 344.
To start off with a bit of history: Presto started 2012 in Facebook and was later released in 2013 as an open source project under the Apache Licence. It is most comparable to Apache Spark in the Big Data space as it also offers query optimization with the Catalyst Optimizer and an SQL interface to its data sources. Presto and Apache Spark have its own resource manager, but Apache Spark is generally run on top of Hadoops’ YARN resource manager. Presto on the other hand uses its own coordinator within the cluster to schedule queries among its workers.
Presto itself does not offer a database and should be only used for large analytical queries that fall into Online Analytical Processing (OLAP). Therefore Online transaction processing (OLTP) workloads should be avoided. Presto offers a large variety of connectors like for example MySQL, PostgreSQL, HDFS with Hive, Cassandra, Redis, Kafka, ElasticSearch, MongoDB among others. Further, Presto enables federated queries which means that you can query different databases with different schemas in the same SQL statement at the same time.
To read further into the inner workings and architecture behind Presto, check out the 2019 paper Presto: SQL on Everything.
Installation
Prerequesite for this tutorial is having a running Hadoop and Hive installation, you can follow the instructions in the tutorial How to Install and Set Up a 3-Node Hadoop Cluster and this Hive Tutorial. The configuration and setup scripts used for this tutorial including further configurations of the HDFS cluster can be found in this repository. This installation also requires Java version >= 11. To install Java you can type:
Note, that there are two active projects of Presto, Trino and PrestoDB. To clarify the difference between both, have a read into What is the relationship of prestosql and prestodb?. All of this article including the configuration runs on both with the releases of presto-server 0.242 and trino-server 352. In this article we will focus on Trino. First, download Trino and unpack it to a desired location. In this case it will be located in /usr/local/trino
.
Next, add TRINO_HOME
environment variable and add the TRINO_HOME/bin
directory to the PATH
environment variable in ~/.bashrc
:
If you aim to run multiple servers, this needs to be done for all servers.
Before being able to start Presto, you need to to configure Presto on your system. For this you will need to add the following files:
etc/node.properties
etc/jvm.config
etc/config.properties
etc/log.properties
etc/cataloc/hive.properties
Let’s start by creating the etc
folder where all these files will be located:
Now, create a node properties file etc/node.properties
with the contents:
Each server needs a unique node.id
. For this, you can generate a UUID with Python, by typing python -c 'import uuid; print(uuid.uuid1())'
Another way is to install the uuid package with sudo apt install uuid
and then typing uuid
. Next, create JVM config file etc/jvm.config
with the contents:
It is important to set the flag -Xmx16G
to the available RAM on the nodes (in this case 16 GB). A rule of thumb is to allocate around 80% of the available RAM to leave some for the operating system and other processes. The flag -Djdk.attach.allowAttachSelf=true
needs to be added in order to avoid the error Error injecting constructor, java.io.IOException: Can not attach to current VM. These two files (except the node.id
) are the same for all servers.
Single Node
Create config properties file etc/config.properties
with the contents:
Dbeaver Presto Ssl
The discovery.uri
specifies the URI of the discovery server. This is generally the same server where the coordinator is located, so take the host and port of this server.
Multiple Nodes
Configuration for the coordinator is the following:
If the coordinator should be also used to compute queries set node-scheduler.include-coordinator=true
.
Configuration for the workers:
Some more points and the various properties are covered in the documentation on Deploying Presto
Create log configuration file etc/log.properties
with the following contents:
The log levels are DEBUG
, INFO
, WARN
and ERROR
, default is INFO
. By default, the log files are located in data/var/log
and are generally helpful when searching for issues with failed Presto queries or crashed Presto nodes. Additionally, Presto offers a JMX Connector to monitor and debug Java Management Extensions (JMX) information from all nodes.
In order to connect to HDFS, we will use Apache Hive, which is commonly used together with Hadoop and HDFS to provide an SQL-like interface. Apache Hive was open sourced 2008, again by Facebook. Presto was later designed to further scale operations and reduce query time. Presto and the Hive connector do not use the Hive runtime, but rather act as a replacement in order to run interactive queries.
Add the Hive connector by adding the configuring the connection with etc/catalog/hive.properties
with the following contents (port 9083 by default):
For more information have a look at the documentation for the Hive Connector. Start the HiveServer2 if it is not already running with:
Start the command-line interface of Hive with Beeline and create a schema that we use for our data with:
Here, we will use the often used and abused Iris Data Set. After downloading the data set, create a table for our data with:
Then, insert the downloaded data to Hive with:
To see the freshly create table type SHOW TABLES tutorial;
. To show metadata about a table such as column names and their data types, you can type DESCRIBE tutorial.iris;
which should return the following output:
For even more information you can use DESCRIBE FORMATTED tutorial.iris;
.
The Hive connector is also used with the various cloud-based object stores like S3, GCS, Azure Blob Storage, Minio and others. To read more on this have a read in this explainer.
Now, everything is set to start Presto. To start the Presto daemon simply run on each node:
The status of the daemon and its PID can be checked with:
The deamon can be stopped with:
The server can be accessed at http://localhost:8080
, which was previously configured in etc/config.properties
. This would give you an overview of the cluster and statistics on the queries that have been run:
Presto does not install the command-line by default, therefore we need to download the command-line interface separately:
Start the Presto CLI for Hive catalog with the previously created tutorial
schema:
Here is the documentation on the Command Line Interface. Instead of using the Presto command-line interface, you can also use DBeaver which offers a Presto connection via Java Database Connectivity (JDBC). Somewhat similar to Hive you can list all available Hive schemas with:
To list all the available tables from a Hive schema, type:
Now, let’s try querying the data we previously added. Here we answer the question: What is the average sepal length and width per iris class:
Which should return the following output:
That’s nice, we can already see clear differences here between the flowers without resorting to deep learning. Also note that Presto uses the ANSI SQL Dialect. To read more about the SQL Syntax in Presto have a look at the documentation on SQL Statement Syntax and to analyze the execution plan, you can use EXPLAIN or EXPLAIN ANALZE in front of a statement or explore the Live Plan for a query in the Presto UI.
For more information have a look at the paper Presto: SQL on Everything, which explains the inner workings of Presto in much more technical detail and also explains some of the challenges that Presto tries to solve. Additionally there is Presto: The Definitive Guide, a great book that goes into much more detail on how to use and configure Presto in an optimal way.
Further resources and links are listed in the homepage of the Presto Software Foundation. It is also helpful to directly check the issues and pull requests in the prestodb/presto and prestosql/presto Github pages as they often include detailed descriptions of some of the more advanced features. Finally, Mark Litwintischik did a great performance comparision of Spark 2.4.0 versus Presto 0.214.
Image from Wikimedia CommonsPlease enable JavaScript to view the comments powered by Disqus.Related Posts
The CData JDBC Driver for Presto implements JDBC standards that enable third-party tools to interoperate, from wizards in IDEs to business intelligence tools. This article shows how to connect to Presto data with wizards in DBeaver and browse data in the DBeaver GUI.
Create a JDBC Data Source for Presto Data
Follow the steps below to load the driver JAR in DBeaver.
- Open the DBeaver application and, in the Databases menu, select the Driver Manager option. Click New to open the Create New Driver form.
- In the Driver Name box, enter a user-friendly name for the driver.
- To add the .jar, click Add File.
- In the create new driver dialog that appears, select the cdata.jdbc.presto.jar file, located in the lib subfolder of the installation directory.
- Click the Find Class button and select the PrestoDriver class from the results. This will automatically fill the Class Name field at the top of the form. The class name for the driver is cdata.jdbc.presto.PrestoDriver.
- Add jdbc:presto: in the URL Template field.
Create a Connection to Presto Data
Follow the steps below to add credentials and other required connection properties.
- In the Databases menu, click New Connection.
- In the Create new connection wizard that results, select the driver.
- On the next page of the wizard, click the driver properties tab.
Enter values for authentication credentials and other properties required to connect to Presto.
Set the Server and Port connection properties to connect, in addition to any authentication properties that may be required.
To enable TLS/SSL, set UseSSL to true.
Authenticating with LDAP
In order to authenticate with LDAP, set the following connection properties:
- AuthScheme: Set this to LDAP.
- User: The username being authenticated with in LDAP.
- Password: The password associated with the User you are authenticating against LDAP with.
Authenticating with Kerberos
In order to authenticate with KERBEROS, set the following connection properties:
- AuthScheme: Set this to KERBEROS.
- KerberosKDC: The Kerberos Key Distribution Center (KDC) service used to authenticate the user.
- KerberosRealm: The Kerberos Realm used to authenticate the user with.
- KerberosSPN: The Service Principal Name for the Kerberos Domain Controller.
- KerberosKeytabFile: The Keytab file containing your pairs of Kerberos principals and encrypted keys.
- User: The user who is authenticating to Kerberos.
- Password: The password used to authenticate to Kerberos.
Built-in Connection String Designer
For assistance in constructing the JDBC URL, use the connection string designer built into the Presto JDBC Driver. Either double-click the JAR file or execute the jar file from the command-line.
java -jar cdata.jdbc.presto.jar
Fill in the connection properties and copy the connection string to the clipboard.
Below is a typical connection string:
jdbc:presto:Server=127.0.0.1;Port=8080;
Query Presto Data
Preston Beaver Mount Pleasant Mills Pa
You can now query information from the tables exposed by the connection: Right-click a Table and then click Edit Table. The data is available on the Data tab.