In a world where data grows at an unprecedented rate and where technology is constantly being improved and reimagined, companies find themselves with an overwhelming amount of data kept (or, rather, lost) in multiple sources.
This situation commonly leads to countless wasted hours looking for or wrangling data, pervasive poor data quality, inaccurate or incomplete reporting and lack of compliance.
The lack of visibility into the movement of data across the data supply chain can be overwhelming. Capturing data lineage sounds like a good solution to a multitude of issues, but many will find it’s much easier said than done.
Before undertaking a data lineage initiative in your organization, it’s important to address key questions.
What use cases support the need for data lineage?
Without determining why data lineage is needed, it’s hard to determine where to start and, more importantly, where to stop. In most environments, there’s not enough time or money to map your entire landscape at a detailed level of granularity, so choices need to be made and agreed upon.
Organizational decisions are typically based upon internal reporting and other analysis utilizing data. But one small change in an application or source system has the potential to dramatically impact a report and the decisions based on the information.
Source-to-target mappings captured as part of data lineage quickly illustrate how one change may affect upstream or downstream systems. Without data lineage already in place, this impact analysis could take countless hours and still not be comprehensive.
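To make the idea of impact analysis concrete, here is a minimal sketch in Python. It assumes lineage has been captured as simple (source, target) pairs; the system and report names are hypothetical and used only for illustration. Given one asset that is about to change, it walks the mappings forward to list everything downstream that could be affected.

```python
from collections import defaultdict, deque

# Hypothetical source-to-target mappings, one (source, target) pair per hop.
MAPPINGS = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "warehouse.dim_customer"),
    ("staging.orders", "warehouse.fact_sales"),
    ("warehouse.dim_customer", "report.quarterly_revenue"),
    ("warehouse.fact_sales", "report.quarterly_revenue"),
    ("warehouse.fact_sales", "report.inventory"),
]


def downstream_of(node, mappings):
    """Return every asset reachable from `node` by following mappings forward."""
    targets = defaultdict(list)
    for source, target in mappings:
        targets[source].append(target)

    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for nxt in targets[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Which assets could be affected by a change to the orders source?
impacted = downstream_of("erp.orders", MAPPINGS)
print(sorted(impacted))
```

With the mappings already recorded, the question "what does this change touch?" becomes a lookup rather than a manual investigation.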
Ensuring the quality of data is also of the utmost importance when basing decisions on data. If the data isn’t accurate, the decisions aren’t well-informed. However, when thinking about the many places data may move, it raises the question: At what point do you measure data quality?
If it’s measured at the beginning of the data lifecycle, there’s no guarantee the end result will be high-quality. But if the quality is measured at the end, how can you be sure the data was high-quality in the first place?
Any time data moves, there’s the opportunity for it to become distorted in some way. This is why it’s important to understand how and where data migrates to be able to assess its quality throughout the entire data lifecycle.
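As a rough illustration of assessing quality throughout the lifecycle rather than at a single point, the sketch below measures one simple quality metric, completeness of a field, at each stage the data passes through. The stage names, records and the choice of completeness as the metric are all assumptions made for this example.

```python
# Hypothetical snapshots of the same four records at three lifecycle stages.
STAGES = {
    "source": [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"},
        {"id": 3, "email": "c@example.com"},
        {"id": 4, "email": "d@example.com"},
    ],
    "staging": [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"},
        {"id": 3, "email": None},
        {"id": 4, "email": "d@example.com"},
    ],
    "report": [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"},
        {"id": 3, "email": None},
        {"id": 4, "email": ""},
    ],
}


def completeness(rows, field):
    """Share of rows with a non-empty value for `field` (0.0 if no rows)."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def quality_by_stage(stages, field):
    """Measure completeness of `field` at every stage, in lifecycle order."""
    return {name: completeness(rows, field) for name, rows in stages.items()}


for stage, score in quality_by_stage(STAGES, "email").items():
    print(f"{stage}: {score:.0%} complete")
```

Measuring at each hop shows where quality degrades, which a single check at the beginning or the end of the lifecycle cannot do.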
With major privacy regulations in place and new ones on the horizon, it becomes increasingly important to understand where data is and how it got there. Under the GDPR and the CCPA, consumers have the right to know what data is being held about them and where it is, as well as the right to have their personal data “forgotten” or deleted.
These regulations make documenting data lineage critical, because failing to do so could result in significant consequences for the business, including hefty penalties and a tarnished brand reputation.
When you consider how legacy systems and other technologies change over time, you’ll find data scattered across many systems and, often, not being governed. This is why finding a single consumer’s data throughout an entire organization becomes much more laborious than expected — and a big compliance concern.
We hear it time and time again: “I spend ‘x’ amount of time each week just looking for data.” Often, organizations rely solely on word-of-mouth and tribal knowledge to find or understand data. But asking a data scientist or data architect to comb through ETLs and transformation logic to get to the correct level of data isn’t always feasible.
Data lineage helps to reduce, or possibly eliminate, the time and energy spent finding the right data. When paired with a data glossary or data catalog, data lineage can turn data discovery into a “self-service” experience. Governing data and creating a data catalog help to expose the available data along with its metadata, and the data lineage then shows where that data is moving.
What level of data lineage is necessary to support business needs?
It’s helpful to think of data lineage as a map of your data. Different types of maps contain various levels of complexity. Some maps are high-level, showing the world and its continents. Others incorporate more detail and complexity, such as states, cities, highways, roads and elevation.
The same thing is true with data lineage. While it may be sufficient to simply track a source-to-target mapping for applications and data sources for the high-level flow of data, the true value could be revealed at a more granular level.
When tackling a data lineage implementation, keep these use cases in mind, along with any others that apply to your organization. Depending on the end goal of the use case, you must decide what level of data lineage mapping is necessary to achieve the maximum value for it.
Lineage can always be implemented iteratively, with the level of detail increasing over time. For example, at the start it may not be necessary to capture the ETL (extract, transform, load) logic and transformations that occur as data moves. But once the value of basic source-to-target lineage is proven, ETL detail may be added later to increase value and data understanding.
How do you start to identify data lineage?
It’s easy to become overwhelmed with the amount of data that is available and worry about tracking the lineage for every single data element. Each of the mentioned use cases has a goal in mind — whether that is decisions based on reports, reports based on quality data, compliance with regulations, or saving time and money on data discovery.
If you know your business’ desired end goal, it can inform the scope of capturing data lineage. For example, if one critical report is being used for decision-making, start there and focus the lineage only on the data needed for that report. Once the value is proven for that report, it should become easier to broaden the scope of data lineage for other use cases.
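The scoping approach described above can be sketched in a few lines of Python. Assuming the known data flows are recorded as (source, target) pairs (the asset and report names here are hypothetical), the function below starts from one critical report and walks backward, returning only the assets that feed it. Everything outside that set can be left unmapped for now.

```python
# Hypothetical data flows across the organization, as (source, target) pairs.
EDGES = [
    ("crm.customers", "warehouse.dim_customer"),
    ("erp.orders", "warehouse.fact_sales"),
    ("hr.employees", "warehouse.dim_employee"),
    ("warehouse.dim_customer", "report.quarterly_revenue"),
    ("warehouse.fact_sales", "report.quarterly_revenue"),
    ("warehouse.dim_employee", "report.headcount"),
]


def lineage_scope(report, edges):
    """Return every upstream asset that feeds `report`, directly or indirectly."""
    sources = {}
    for source, target in edges:
        sources.setdefault(target, []).append(source)

    scope = set()
    stack = [report]
    while stack:
        current = stack.pop()
        for parent in sources.get(current, []):
            if parent not in scope:
                scope.add(parent)
                stack.append(parent)
    return scope


# Limit the first lineage effort to what the critical report depends on.
scope = lineage_scope("report.quarterly_revenue", EDGES)
print(sorted(scope))
```

In this example the HR assets fall outside the scope, so they can wait until a later iteration broadens the lineage to other reports.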
Embarking on a data lineage initiative, no matter the size or scope, is a worthwhile effort. Understanding where data comes from can improve reporting, make data more trusted throughout your organization and improve business decision-making. And that’s an effort worth fighting for.