Rev up your data queries up to 12x with Starburst Enterprise on AWS

Introduction

Over the last three months, we announced the availability of Starburst Enterprise on Google Cloud Platform, Red Hat Marketplace, and Microsoft Azure Marketplace. With this offering, you can deploy a cluster and begin querying data in cloud storage on any cloud platform in a matter of minutes. Beyond simple deployment, Starburst is automatically configured for your cloud computing instances, saving you both time and cloud costs. In this blog post, we’ll dive into a third-party vendor’s performance benchmark comparing Starburst Enterprise on AWS with OS Presto® on AWS EMR.

Starburst Enterprise Cloud Configuration

Before we dive into the benchmark, we also want to note the latest options for configuring and deploying your cluster. Starburst Enterprise has long offered CloudFormation templates to simplify deployment on AWS, but it now also offers deployment options that work across all cloud platforms. The first option is to use kubectl along with a YAML file that contains all of your Starburst Enterprise cluster’s configuration. The other option is to use our Mission Control web UI to create and manage multiple data clusters. These options make it simple to move deployment configuration between clouds for customers running multi-cloud environments. This is another aspect to consider beyond the performance dimension we are discussing today.

Benchmark Environment

Concurrency Labs executed a number of tests using data and queries from the TPC Benchmark™ DS (TPC-DS), comparing Starburst Enterprise 338-e against Presto 0.230 on EMR 6.0.0. All infrastructure was launched on Amazon Web Services in the Northern Virginia region (us-east-1). Tests were run against both ORC and Parquet data: both systems queried the same 1TB ORC dataset and the same 1TB Parquet dataset, stored in the same S3 bucket in that region.

Infrastructure for Starburst Enterprise was launched via the Starburst Enterprise offering on AWS Marketplace, using a CloudFormation template provided by Starburst and an AMI made available to Concurrency Labs. There were 20 worker nodes and one coordinator on r5.4xlarge EC2 instances. For a fair and accurate comparison, Starburst’s caching was not enabled for these experiments.

The OS Presto® environment was launched using the Amazon EMR (Elastic MapReduce) service. Concurrency Labs launched the EMR cluster and provided the appropriate parameters for instance types, cluster size, VPC, security groups, and additional configuration files for Presto. This cluster included 20 executor nodes and one coordinator running on r5.4xlarge EC2 instances.

Each experiment was run four times to control for random variation in the system, and the results below show the average query times across those four executions.
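The averaging step can be sketched in a few lines of Python. The query names and timings here are made-up placeholders, not figures from the Concurrency Labs report, and `average_times` is a hypothetical helper; the sketch only shows how four executions per query collapse into one average.

```python
from statistics import mean

# Hypothetical timings in seconds, four executions per query.
# These values are illustrative placeholders, not benchmark results.
runs = {
    "q01": [12.0, 11.0, 13.0, 12.0],
    "q02": [40.0, 44.0, 42.0, 42.0],
}

def average_times(runs_by_query):
    """Return the mean execution time per query across all of its runs."""
    return {query: mean(times) for query, times in runs_by_query.items()}

averages = average_times(runs)
for query, avg in sorted(averages.items()):
    print(f"{query}: {avg:.1f} s")
```

Averaging over repeated runs dampens one-off effects such as a slow S3 request, so a single noisy execution does not dominate the reported time for a query.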

Results

On average, Starburst Enterprise was 2x faster than OS Presto® on AWS EMR for the ORC dataset and 3x faster for the Parquet dataset. Furthermore, OS Presto® on AWS EMR fails to complete all TPC-DS queries out of the box, whereas Starburst Enterprise completes all of them. The charts below plot the 99 TPC-DS queries, except for the four queries that failed on OS Presto® on AWS EMR and one larger query that completed on both systems but distorted the chart scaling for the other queries. The first two bar charts summarize the queries run over 1TB of ORC data, and the second two summarize the queries run over 1TB of Parquet data. Starburst is represented by the blue bars and OS Presto® on AWS EMR by the red bars. The vertical axis shows time in seconds, so lower is better. Notice that in all cases Starburst Enterprise on AWS has the lowest time.

Starburst Enterprise on AWS gave us an average 2x improvement in AWS cluster cost compared to OS Presto® on AWS EMR. While the average improvement factor is 2 for ORC data, over half of the queries ran between 3 and 11 times faster. The speedups for Parquet are roughly the same, except that they reach up to 12 times faster for multiple queries. For a more fine-grained summary of the speedup factor distribution, look at the following charts:

You’ll notice in the ORC pie chart that for 17% of the queries, both systems take approximately the same time. Starburst Enterprise on AWS answers about 30% of the queries 2 times faster than OS Presto® on AWS EMR. Another 23% of queries are answered 3 times faster, and so on. The Parquet dataset showed similar findings, and two queries even reached a speedup factor of 12.
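A distribution like the one in the pie charts could be derived roughly as follows. This is a minimal sketch under assumptions: the timings are invented placeholders, `speedup_distribution` is a hypothetical helper, and rounding each speedup to the nearest whole factor is a guess at the bucketing, which the report summary here does not specify.

```python
from collections import Counter

def speedup_distribution(baseline, candidate):
    """Bucket per-query speedups (baseline time / candidate time).

    Speedups are rounded to the nearest whole factor (an assumed bucketing
    rule), and the result maps each factor to its percentage share of queries.
    Queries missing from either system, such as failed ones, are skipped.
    """
    shared = [q for q in baseline if q in candidate]
    factors = [round(baseline[q] / candidate[q]) for q in shared]
    counts = Counter(factors)
    total = len(factors)
    return {factor: 100.0 * n / total for factor, n in sorted(counts.items())}

# Hypothetical average query times in seconds (placeholders, not benchmark data).
emr = {"q01": 30.0, "q02": 60.0, "q03": 20.0, "q04": 90.0}
starburst = {"q01": 10.0, "q02": 30.0, "q03": 20.0, "q04": 45.0}

distribution = speedup_distribution(emr, starburst)
for factor, share in distribution.items():
    print(f"{factor}x faster: {share:.0f}% of queries")
```

A factor of 1 corresponds to the "approximately the same speed" slice, and each larger factor corresponds to one of the remaining slices.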

Conclusions

Overall, because of the improvement in query execution times, Starburst Enterprise on AWS enables you to run more analytics, and with our autoscaling features you can reduce your AWS costs. Additionally, Starburst Enterprise on AWS is much easier to configure and offers the freedom to deploy your cluster on any cloud provider you choose, working seamlessly with data in your cloud provider’s native object storage. For those running entirely on AWS, OS Presto® on AWS EMR does have the typical integrations into the AWS ecosystem that simplify aspects of the deployment and management workflow. While only 4 out of 99 TPC-DS queries fail out of the box on OS Presto® on AWS EMR, you will notice an average speedup factor of 2 over ORC data and 3 over Parquet data when using Starburst Enterprise on AWS in place of OS Presto® on AWS EMR.

Try Starburst Enterprise on AWS Today!

Need help getting set up or want to learn more? Contact us at aws@starburstdata.com.

Brian Olsen

Brian is a U.S. Marine turned software engineer and developer advocate working to foster the open-source Presto community. Brian spent four years as a data engineer at a cybersecurity company, working on query optimization and on maintaining data pipelines and migrations, including replacing some legacy data warehousing systems with open-source Presto. Brian is a published author in ACM and IEEE geospatial database conferences.
