Trino, Data Governance, and Accelerating Data Science

Back during the Datanova conference, I had the pleasure of interviewing Martin, Dain, and David, three of the original creators of Presto. They started work on Presto in 2012 while at Facebook, spent six and a half years on it, and have spent the last two years working on a fork of Presto called Trino. Much of that time went into the codebase, building the community, and emphasizing the governance model as a pure open source project. It has become a global project with contributors and users all over the world.

Many blogs covering Trino dive into the technical engineering aspects, but I’d like to focus on the user perspective and understand what problems Trino solves for these users. It’s interesting to analyze what's going on in the industry, why data engineering and data science practices matter, and how the two can meet and align effectively. In particular, I’d like to focus on what Trino does to make data scientists and analysts effective in their jobs, while also discussing what it doesn’t do.

Trino eliminates tech debt

Trino adoption has grown tremendously in the last few years, and I enjoy learning about the ways the tech is being applied in the industry to solve various pain points. Industry analysis repeatedly shows that tech debt within data infrastructure is a major stumbling block for enterprises. It's considered one of the main impediments to AI adoption. We've seen surveys coming out of MIT Sloan and McKinsey, and my colleague Ben Lorica and I have been doing industry surveys that cover some of these issues.

Companies have been working tirelessly over the last few decades to pinpoint and address the sources of tech debt in their data infrastructure. They are struggling with the age-old phenomenon of enterprise silos, where people have their own fiefdoms and don't talk enough with each other. Now they're being confronted with a lot more pressure regarding data, along with a whole constellation of issues around compliance, privacy, and security, while also attempting to leverage their data for competitive advantage. On top of all that, we have 2020 and all the other pressures that have come along since the pandemic struck. The inability to work across the enterprise with effective data infrastructure has become a real showstopper in a lot of corners. So how should you move forward on it?

While many corporations have focused on removing data silos, these efforts are somewhat misguided, as silos are less of a problem and more of a reality of running a business. Silos have many causes, such as mergers, acquisitions, partnering agreements, and the need for flexibility to launch new business lines rapidly. Each of these typically brings a different stack, with its own data stores, into the data pipeline. The traditional approach to solving these silo problems is to pull the interesting data together into a new data store, which works well to a certain extent. The problem is that centralization creates tangible risks by adding bottlenecks in operations, regulatory compliance, security, and so on. In practice, the notion that all the data must live in one location is an unattainable ideal. Adding a new dataset can take upwards of a week, and that assumes your data engineering team is fast, the department you’re pulling data from allows it, and it doesn't take a quarter just to get through those conversations. Then, once the team has closed in on a solution, your company buys another company that has its own data store, and now you have to work out how to merge the two. This cycle only gets worse in practice.

The solution that Martin, Dain, and David came up with when creating Trino was to avoid moving the data around in the first place. Instead of centralizing everything in a new data store ahead of time, they created a single point of access that joins the data at query time. This avoids moving all the data and only moves a subset ad hoc, on request. At its core, Trino is extensible by means of its API for building connectors to data sources. A connector has to provide metadata and a way to fetch the data for a given table in an external data source. The API also supports advanced features, like the ability to interact with the optimizer by providing statistics about the data. The optimizer can then use these statistics to figure out the best order for joins, so that execution is as efficient as possible, minimizes memory utilization, and so on.
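To make the idea concrete, here is a minimal sketch in Python of joining two separate sources at query time. It is a toy, not Trino's connector API: `ToyConnector`, `join_at_query_time`, and the sample tables are hypothetical names invented for illustration. Each toy connector exposes metadata, a way to fetch rows, and a row-count statistic, and the join uses that statistic to decide which side to hold in memory.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, object]


@dataclass
class ToyConnector:
    """Hypothetical stand-in for a connector (not Trino's real API)."""
    name: str
    tables: Dict[str, List[Row]]

    def columns(self, table: str) -> List[str]:
        # Metadata: column names of a table in the external source.
        rows = self.tables[table]
        return list(rows[0].keys()) if rows else []

    def scan(self, table: str) -> Iterator[Row]:
        # Data access: stream rows from the source only when asked.
        return iter(self.tables[table])

    def row_count(self, table: str) -> int:
        # Statistics: a simple figure a planner could use to order a join.
        return len(self.tables[table])


Source = Tuple[ToyConnector, str]


def join_at_query_time(left: Source, right: Source, key: str) -> List[Row]:
    """Hash join across two sources, building on the smaller input."""
    # Use the statistics to keep the smaller table in memory.
    if left[0].row_count(left[1]) <= right[0].row_count(right[1]):
        build, probe = left, right
    else:
        build, probe = right, left

    hash_table: Dict[object, List[Row]] = {}
    for row in build[0].scan(build[1]):
        hash_table.setdefault(row[key], []).append(row)

    joined: List[Row] = []
    for row in probe[0].scan(probe[1]):
        for match in hash_table.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Two "silos" that never get copied into a central store.
crm = ToyConnector("crm", {"customers": [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]})
orders_db = ToyConnector("orders_db", {"orders": [
    {"customer_id": 1, "total": 30},
    {"customer_id": 1, "total": 12},
    {"customer_id": 3, "total": 7},
]})

result = join_at_query_time(
    (crm, "customers"), (orders_db, "orders"), "customer_id"
)
```

A real engine does far more (distributed execution, cost models, many join strategies), but the shape is the same: the data stays where it lives, and statistics from each source inform how the join is carried out.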

Another important aspect of Trino is being able to push down certain parts of the computation into the data source. When querying a distributed file system or object store, the Hive connector understands the data model and the layout of data in the Hive model. It can use information about the shape of the query to prune partitions that don't need to be considered, which can improve performance dramatically. It can also take advantage of the information stored inside ORC or Parquet files to apply filters and skip sections of the data that could not answer the query.
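The two kinds of pruning described above can be sketched with a toy example. This is not how the Hive connector or the ORC and Parquet readers are implemented. `RowGroup`, `make_group`, and `scan` are hypothetical names, and the layout (date partitions holding row groups with min/max statistics) is an assumption chosen to mirror the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A chunk of rows plus the min/max statistics stored alongside it."""
    min_amount: int
    max_amount: int
    rows: List[dict]


def make_group(rows: List[dict]) -> RowGroup:
    amounts = [r["amount"] for r in rows]
    return RowGroup(min(amounts), max(amounts), rows)


def scan(partitions: Dict[str, List[RowGroup]], day: str,
         min_amount: int) -> Tuple[List[dict], int]:
    """Return rows for `day` with amount >= min_amount, and groups read."""
    groups_read = 0
    matches: List[dict] = []
    for part_day, groups in partitions.items():
        if part_day != day:
            continue  # Partition pruning: skip the whole partition.
        for group in groups:
            if group.max_amount < min_amount:
                continue  # Statistics pruning: no row here can match.
            groups_read += 1
            matches.extend(r for r in group.rows if r["amount"] >= min_amount)
    return matches, groups_read


# Toy table partitioned by day, two row groups per partition.
partitions = {
    "2020-06-01": [make_group([{"amount": 5}, {"amount": 9}]),
                   make_group([{"amount": 40}, {"amount": 75}])],
    "2020-06-02": [make_group([{"amount": 3}, {"amount": 8}]),
                   make_group([{"amount": 50}, {"amount": 20}])],
}

matches, groups_read = scan(partitions, "2020-06-02", 10)
```

In this toy run only one of the four row groups is actually read: one partition is skipped by its key, and one row group is skipped because its maximum value is below the filter.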

When Trino connects to more sophisticated data sources, like a SQL database that has a smart storage engine, it can take advantage of their more advanced capabilities by pushing computation down into them. For complex aggregations, Trino can push part of the computation into the SQL database and let it use its own algorithms, data structures, indexes, and so on to compute the aggregation in the most efficient way, rather than pulling all the data into Trino. Once the data source performs the calculation, the resulting intermediate data is sent to Trino, which uses it for the remainder of the query. This complex dance between the engine and the connectors allows each connector to participate in the optimization process.
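A small sketch of why aggregation pushdown matters, using Python's built-in `sqlite3` module as a stand-in for the remote SQL database. The `sales` table and the two helper functions are hypothetical, and the example only illustrates the difference in how much data crosses the boundary. It says nothing about how Trino decides what to push down.

```python
import sqlite3

# In-memory SQLite database standing in for a remote SQL data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 15), ("west", 7), ("west", 3), ("west", 5)],
)


def aggregate_in_engine(conn: sqlite3.Connection):
    """No pushdown: fetch every row, then aggregate on the engine side."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals, len(rows)


def aggregate_pushed_down(conn: sqlite3.Connection):
    """Pushdown: the database aggregates and returns one row per group."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


engine_totals, engine_rows_moved = aggregate_in_engine(conn)
pushed_totals, pushed_rows_moved = aggregate_pushed_down(conn)
```

Both paths produce the same totals, but the pushed-down version transfers one row per region instead of every row in the table. On five rows that is trivial. On billions of rows it is the difference the paragraph above describes.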

Trino solves technical problems, not human problems

This enables some operations to be done either in Trino or in the connected data source, depending on the capabilities of each system. Best of all, this is transparent to the data scientist or business analyst who uses Trino. Dain Sundstrom summarized this notion of simplicity by mentioning one of their core development philosophies: "Simple things should be simple, complex things should be possible." Setting up connections to data sources and running queries is straightforward and works out of the box. Building a system that joins a trillion rows with a billion rows while running hundreds of queries a minute is possible, but it won’t necessarily be easy without some customization.

While this certainly solves some of the issues that come with centralizing analytics data, it still doesn’t entirely solve the issue of having common data standards and governance. There’s an aphorism among data science teams that you spend 80% of your time just cleaning up data, and it's mostly for this exact reason. This figure becomes important when you look at how much companies spend on their data science teams each year.

In recent years, there has been a lot more discussion about using data catalogs to solve this data governance problem. The majority of data catalogs aim to normalize the way that you look at data, but they don't actually get you to a query engine, and when they do, the catalog is the main focus while the query engine gets less emphasis. Conversely, Trino connects to all the different data stores and exposes their catalogs directly, so you can query all of them efficiently. Most cataloging doesn’t actually get you to a big, efficient distributed querying system like Trino.

Another tricky aspect of data catalogs is that implementing a global solution can get quite political when you attempt to create a normalized view over the entire organization's data. This requires agreement from a big committee to work through all the different departmental hurdles and get everyone to come to a consensus. Often, when it comes time to execute, individual teams normalize the datasets they think are the most important, which may not be the ones most useful to data scientists. This results in a pick-and-choose solution as opposed to exposing all catalogs.

Nested within the political hurdles, there is also a technical challenge with a global catalog. You have all these disparate systems, each managing its own catalog of tables and schemas in its own way, with its own constraints and capabilities. It is difficult to match up those concepts and keep data in sync across those systems in a consistent way. The view of a table may change, and the catalog is not able to keep up until a sync occurs, introducing another dimension of challenges to deal with.

Using Trino is a firm step towards eliminating such heavy dependence on data engineering teams to coordinate various governance models and eliminating tech debt. However, you get a whole new set of problems that emerge from a sort of analytics Zeno’s Paradox. Data cleaning, nuances about the metadata, data governance challenges, and getting cross-team agreement on practices and standards are still left on the table. 

A key lesson echoed by a dozen leading tech firms in the Metadata Day workshop and meetup events is that solutions must be part tech-focused, but must also consider the human elements. These systems are still, in many ways, driven by and for humans, who need context to make connections between otherwise seemingly disparate data. Humans are organized by political structures, and working within them requires navigating human emotions and coordinating multiple agendas and priorities. Trino can remove some of the technical and human obstacles, but it cannot get you the approval you need from your adjacent team’s boss who is currently out of town. In other words, proper tooling like Trino is important, but without executive support and business culture in the mix, it will only take you so far in improving the overall effectiveness of your data science team. In particular, this should resonate with data managers or any executives in charge of a data science or platform engineering team. A notable book that touches on this complex coordination between teams sharing data is Data Teams by Jesse Anderson.

Some problems, only humans can solve

We can learn a lot about how crucial management’s role is in steering teams towards success by looking at the case study Mark Grover conducted at Lyft. Grover was a product manager at Lyft assigned to respond to GDPR requirements. His case study resulted in the Amundsen open source project, which has seen a lot of enterprise adoption, as well as a Lyft-funded startup that spun out, called Stemma.ai.

Data teams at Lyft tried to provide details about dataset usage for potential audits using data catalogs. They quickly realized that data catalogs alone weren’t enough for an effective audit. They also needed to include team structure and policy information to help resolve organizational challenges when one team needed to rely on another team's data. Grover’s team found that their data science team spent approximately 30% of their time just looking up metadata for datasets, since a dataset would likely have changed in the 3 months since an individual on that team last worked with it. Lyft decided to invest in a local search application for metadata that was heavily focused on user experience. This went above and beyond what a data catalog could provide. They came to a solution that, by my calculations, saves approximately twenty million dollars per year in data science team efficiency alone. Executives at Lyft noticed and began to recognize new potential business lines. What initially set out to be a risk-mitigation project turned into serious corporate cost recovery, then into new business lines, then into other large firms (ING Bank, Edmunds, Workday, etc.) adopting their open source project and practices.

Centralizing your data requires you to get it from your operational data stores into your data warehouse or data lake. This requires coordinating with other teams and other people, which introduces a lot of lag and latency into the process. If you just want to explore your data, at a time when you may not even know what questions you're trying to ask, you don't want to wait weeks and weeks before you can even start asking them.

There's a kind of human frailty where, when we're working with complex technology, we tend to revert to the waterfall method and just suppose that we have the questions in advance. We have to think in more agile terms, where we iterate and develop questions as we work through the problem. There’s a really great article in the Harvard Business Review by Eric Colson arguing that one of the main characteristics you want to build into a data science team is curiosity, along with this idea of not needing to have all the questions up front. You have to dig through them and iterate.

There is still a lot of ground to be covered in the metadata space, but with a tool like Trino, we take an important step forward by cutting the tech debt associated with constantly changing governance models. Because Trino can tap into the data sources in the real world, where the data lives, it allows you to start exploring those questions and building a mental model of what the data tells you. You can formulate questions and get answers quickly, without having to think much about optimizing how you're going to ask those questions once you settle on them.

Paco Nathan

Known as a "player/coach", with core expertise in data science, natural language, machine learning, cloud computing; 38+ years tech industry experience, ranging from Bell Labs to early-stage start-ups. Advisor for Amplify Partners, IBM Data Science Community, Recognai, KUNGFU.AI, Primer. Lead committer PyTextRank. Formerly: Director, Community Evangelism @ Databricks and Apache Spark. Cited in 2015 as one of the Top 30 People in Big Data and Analytics by Innovation Enterprise.
