In the previous incarnation of this blog, we demonstrated how Starburst Enterprise Presto (SEP) outperformed AWS EMR Presto. We recently had a third-party vendor run the same benchmark against EMR Presto, and little has changed. Over the last three months, we announced the availability of Starburst Enterprise Presto on Google Cloud Platform, Red Hat Marketplace, and Microsoft Azure Marketplace. With this offering, you can deploy a Presto cluster and begin querying data in cloud storage on any of these platforms in a matter of minutes. Beyond the simple deployment, Presto is automatically configured for your cloud computing instances, saving you both time and cloud costs. Previously, Presto was only available on AWS via EMR; in this blog post, we’ll walk through the performance benchmark comparison between Starburst’s Presto on AWS and AWS EMR Presto.
SEP Cloud Configuration
Before we dive into the benchmark, we also want to note the latest options for configuring and deploying your SEP cluster. SEP has long offered CloudFormation templates to simplify deployment on AWS, but it now offers deployment options that work across all cloud platforms. The first option is to use kubectl along with a YAML file that contains all of your SEP cluster’s configuration. The other option is to use our Mission Control web UI to create and manage multiple data clusters. These options make it simple for customers running multicloud environments to move a deployment configuration from one cloud to another. This is one more aspect to consider beyond the performance dimension we are discussing today.
Concurrency Labs executed a number of tests using data and queries from the TPC Benchmark™ DS (TPC-DS), comparing Starburst Enterprise Presto 338-e against EMR 6.0.0 Presto 0.230. All infrastructure was launched on Amazon Web Services in the Northern Virginia region (us-east-1). Tests were run against both ORC and Parquet data: both systems queried the same 1TB ORC dataset and the same 1TB Parquet dataset, stored in a single S3 bucket in that region.
Infrastructure for SEP was launched via the Starburst Enterprise Presto AWS Marketplace offering, using a CloudFormation template provided by Starburst and an AMI made available to Concurrency Labs. The cluster had 20 worker nodes and one coordinator, all on r5.4xlarge EC2 instances. For a fair and accurate comparison, Starburst’s caching was not enabled for these experiments.
The other Presto environment was launched using the Amazon EMR (Elastic MapReduce) service. Concurrency Labs launched the EMR cluster and provided the appropriate parameters for instance types, cluster size, VPC, and security groups, along with additional configuration files for Presto. This cluster also had 20 worker nodes and one coordinator running on r5.4xlarge EC2 instances.
Each experiment was run four times to control for random variations in the system, and the results below show the average query times across those four executions.
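To make the averaging step concrete, here is a minimal sketch of how per-query times from repeated runs could be reduced to a single average per query. The function name, the query labels, and the timings are all hypothetical and for illustration only; they are not Concurrency Labs’ actual tooling or measurements.

```python
from statistics import mean


def average_query_times(runs):
    """Average per-query wall-clock times across repeated runs.

    `runs` is a list of dicts, one per benchmark execution, each mapping
    a query name to its elapsed time in seconds.
    """
    queries = runs[0].keys()
    return {q: mean(run[q] for run in runs) for q in queries}


# Hypothetical timings (seconds) for two queries over four executions.
runs = [
    {"q01": 10.0, "q02": 20.0},
    {"q01": 12.0, "q02": 22.0},
    {"q01": 11.0, "q02": 18.0},
    {"q01": 11.0, "q02": 20.0},
]
averages = average_query_times(runs)
```

Averaging over several executions smooths out one-off effects such as a slow S3 request or a noisy neighbor on a shared host.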
On average, SEP was 2x faster than EMR Presto for the ORC dataset and 3x faster for the Parquet dataset. Furthermore, EMR Presto fails to complete all TPC-DS queries out of the box, whereas SEP completes all of them. The charts below plot the 99 TPC-DS queries, except for the four queries that failed on EMR Presto and one larger query that completed on both systems but distorted the chart scaling for the rest. The first two bar charts summarize the queries run over 1TB of ORC data, and the second two summarize the queries run over 1TB of Parquet data. Starburst is represented by the blue bars and EMR by the red bars. The vertical axis is time in seconds, so lower is better. Notice below that in all cases SEP has the lower time and performs better.
Starburst Presto gave us an average 2x improvement in AWS cluster cost compared to EMR Presto. While the average improvement factor is 2 for ORC data, over half of the queries ran at least 3 times faster, and some ran up to 11 times faster. The speedups for Parquet are roughly the same, except that multiple queries ran up to 12 times faster. For a more fine-grained summary of the speedup factor distribution, look at the following charts:
You’ll notice in the ORC pie chart that 17% of the queries take approximately the same time on both systems. SEP answers about 30% of the queries 2 times faster than EMR, another 23% of the queries 3 times faster, and so on. The Parquet dataset showed similar results, and we even had two queries with a speedup factor of 12.
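For readers who want to build a similar breakdown from their own measurements, the following is a rough sketch of how a speedup-factor distribution could be computed. The speedup factor is taken as EMR time divided by SEP time, and factors are bucketed by rounding to the nearest whole number; that bucketing rule is an assumption, since the charts do not state how their buckets were defined. The function name and all timings are hypothetical.

```python
from collections import Counter


def speedup_distribution(emr_times, sep_times):
    """Return the percentage of queries falling in each speedup bucket.

    Speedup factor = EMR time / SEP time for the same query.
    Buckets are the factor rounded to the nearest integer (an assumed
    rule for illustration, not necessarily the one behind the charts).
    """
    factors = [
        emr_times[q] / sep_times[q] for q in emr_times if q in sep_times
    ]
    buckets = Counter(round(f) for f in factors)
    total = len(factors)
    return {b: 100 * n / total for b, n in sorted(buckets.items())}


# Hypothetical average query times in seconds.
emr = {"q01": 10.0, "q02": 20.0, "q03": 30.0, "q04": 12.0}
sep = {"q01": 10.0, "q02": 10.0, "q03": 10.0, "q04": 6.0}
distribution = speedup_distribution(emr, sep)
```

Queries that failed on either system have no time recorded and are simply left out of the distribution, which mirrors how the failed EMR queries are excluded from the charts.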
Overall, because of the improvement in query execution times, SEP enables you to run more analytics, and with our autoscaling features you can reduce your AWS costs. Additionally, SEP is much easier to configure and offers the freedom to deploy your cluster on any cloud provider you choose, working seamlessly with your cloud provider’s native object storage. For those running entirely on AWS, EMR does have the typical integrations into the AWS ecosystem that simplify aspects of the deployment and management workflow. Although only 4 of the 99 TPC-DS queries fail out of the box on EMR, you will see an average speedup factor of 2 over ORC data and 3 over Parquet data when using SEP in place of EMR.
Need help getting set up or want to learn more? Contact us at firstname.lastname@example.org