
The Great Data Architecture Debate: Data Lake, Data Warehouse or the Data Lakehouse?

This is a crazy and slightly confusing time in the data architecture space. More and more companies are shifting toward data lakes, yet the traditional data warehouse continues to provide value, as it has for decades. Now, to add to that, we have this increasingly popular lakehouse concept, which can potentially string together the best of both worlds. In early February, I had the chance to host a fun debate on data lake, data warehouse or the data lakehouse with proponents of each architecture at Datanova 2021, Starburst’s annual conference. Ultimately, we were trying to determine what the best architecture will be going forward. 

Will one of these three concepts prevail? 

Will each one carve out its own niche? 

Or will the winner be something we haven’t yet imagined?

At the end of the discussion, we asked our audience to weigh in, and the results were surprisingly clear. Before I get to that, though, I’d like to pass on a few key insights from the discussion about what companies ultimately value from these solutions:

Maturity

Our advocate for data lake architectures was Aaron Colcord, Senior Director, Data Analytics Engineering at Northwestern Mutual. He argued that one of the reasons data lakes were so appealing at the start was that they championed openness. We could tolerate some of the early technical shortcomings because of the openness and cost control we gained. The added advantage now, ten years later, is that the tools used in conjunction with data lakes have matured and expanded, unlocking all kinds of capabilities – without sacrificing openness.

Ease of Use

Traditional data warehouses are nothing if not mature. Greg Taylor, Managing Director at Slalom Consulting, noted that these tried-and-tested platforms are also valuable because you don’t need a whole new set of technical skills or training to work with them. They rely on common, familiar technology, and standard tool sets connect to them easily.

I wouldn’t argue with that, and Richard Jarvis, the Chief Analytics Officer at EMIS Health, seconded that idea later in our debate. But Richard, who argued for lakehouse architectures, also talked about how you can achieve this familiarity and simplicity via other means. His group has deployed tools like Starburst Enterprise to standardize on SQL, which allows them to empower a wider talent pool across their business, granting more users easy access to data stored in more platforms.
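To make that concrete, here is a minimal sketch of what standardizing on SQL through a Trino-based layer such as Starburst Enterprise can look like from an analyst’s seat. It uses the open source Trino Python client; the host, catalog, schema, and table names are hypothetical placeholders, not details of EMIS’s actual deployment.

```python
# Minimal sketch: an analyst querying lake data with plain SQL via Trino.
# All connection details and table names below are illustrative assumptions.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.internal",  # hypothetical Starburst/Trino coordinator
    port=8080,
    user="analyst",
    catalog="datalake",   # hypothetical catalog backed by object storage
    schema="clinical",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT region, count(*) AS patient_visits
    FROM visits
    WHERE visit_date >= DATE '2021-01-01'
    GROUP BY region
    ORDER BY patient_visits DESC
    """
)
for row in cur.fetchall():
    print(row)
```

The point is less the specific query than the access pattern: anyone comfortable with standard SQL can reach data sitting in the lake without learning a new tool or a new dialect.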

Flexibility & Scalability

We also talked about the importance of scale – an essential component given the explosion of data. Richard explained that his team built their cloud analytics platform on a lakehouse architecture. During the pandemic, EMIS has done some incredible work using data to help researchers understand the spread of COVID, and how to positively impact health outcomes and improve vaccine rollout strategies. EMIS needed to grant secure access to this data to a wide range of users with very different access patterns. They required something very flexible and scalable – and the lakehouse architecture delivered.

Openness

Vendor lock-in is a common sticking point for critics of the traditional data warehouse. In the era of ORC, Parquet, and other open data formats, companies want to own their data rather than have it locked into a proprietary format. Aaron pointed out that this can be a downside of the lakehouse as well, since it can trap you in the same vendor lock-in scenario, which ultimately limits your ability to explore different tools. The counterpoint, he noted, is that data lakes introduce so many solutions that it becomes very difficult to find the right one.

Virtualization

Another thread we kept coming back to was the role of data virtualization, and how the line between some of these technologies is getting blurry. Greg talked about how data virtualization solutions allow you to use warehouses and data lakes together – giving you the power to curate data without having to move or transform it. And Richard at EMIS described how data virtualization helped him create a best-of-all-worlds scenario in which data scientists who could work with raw data could get started immediately, while those who needed curated data could wait a few hours to analyze it.
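Here is a sketch of that “query in place” idea: one SQL statement joins a curated table in a warehouse catalog with raw files in a lake catalog, without copying or transforming either side first. The catalog, schema, and column names are illustrative assumptions, not a description of any real deployment.

```python
# Minimal sketch of a federated (virtualized) query across a warehouse and a
# data lake through Trino. Connection details and names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.internal",  # hypothetical Starburst/Trino coordinator
    port=8080,
    user="data_scientist",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT c.customer_segment,
           avg(e.session_seconds) AS avg_session_seconds
    FROM warehouse.marts.customers AS c      -- curated warehouse table
    JOIN datalake.raw.web_events AS e        -- raw Parquet files in the lake
      ON c.customer_id = e.customer_id
    GROUP BY c.customer_segment
    """
)
print(cur.fetchall())
```

Because the join runs through the virtualization layer, the raw events never have to be loaded into the warehouse before anyone can analyze them alongside curated data.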

The Results: Blurred Lines, Clear Choices

My final question, after all the back and forth, was what this landscape will look like in three to five years. A convergence is already happening in the form of the lakehouse, but that doesn’t mean the days of the traditional warehouse or even the standard data lake are numbered. You don’t just throw away technology and skills that have been built up and advanced over so many years. I don’t see any of these technologies becoming obsolete in the near term, but there’s no question the pace of innovation is accelerating, with more choices for the right design pattern for the right use case and cost. Data fabrics, or data meshes as some refer to them, are also gaining ground in these architecture choices.

Now to the poll results! Did I make you wait too long? I hope not. There was an interesting upswing from before the panel to after it. When we polled our audience about which architecture they expected to be the most popular in three years, here’s the breakdown: only 14% voted for the traditional data warehouse, and just 18% opted for the data lake. Despite the concerns over vendor lock-in, a surprising 68% believed in the lakehouse. For a relatively new technology paradigm, that’s impressive. None of us can say with certainty how architectures will evolve, but I imagine we can all agree that these next few years will be interesting.

 

[Chart: audience poll results on which architecture will be most popular in three years]

ThoughtSpot has seen the rise of the data lakehouse in the market, and based on demand from some of our top customers, I’m excited to see our integration with Starburst Enterprise, based on open source Trino (formerly PrestoSQL), go live this month. Yep, that means Search and AI-driven insights based on your Starburst data lakehouse – without the need to move any data!

 

Cindi Howson

Cindi Howson is the Chief Data Strategy Officer at ThoughtSpot and host of The Data Chief podcast. Cindi is an analytics and BI thought leader and expert with a flair for bridging business needs with technology. As Chief Data Strategy Officer at ThoughtSpot, she advises top clients on data strategy and best practices for becoming data-driven, influences ThoughtSpot’s product strategy, and hosts The Data Chief podcast. Cindi was previously a Gartner Research Vice President, serving as lead author for the data and analytics maturity model and the analytics and BI Magic Quadrant, and a popular keynote speaker. She introduced new research in data and AI for good, NLP/BI search, and augmented analytics, and brought both the BI bake-offs and innovation panels to Gartner globally. She is rated a top 12 influencer in big data and analytics by Onalytica, Solutions Review, and Humans of Data. Prior to joining Gartner, she was the founder of BI Scorecard, a resource for in-depth product reviews based on exclusive hands-on testing, a contributor to InformationWeek, and the author of several books, including Successful Business Intelligence: Unlock the Value of BI & Big Data and SAP BusinessObjects BI 4.0: The Complete Reference. She served as a faculty member of The Data Warehousing Institute (TDWI) for more than a decade. Prior to founding BI Scorecard, Howson was a manager at Deloitte & Touche and a global BI standards leader for Dow Chemical. She has an MBA from Rice University.
