
Is Data Quality Important for Big Data?


By definition, Big Data means a large volume of data. When it comes to the topic of data quality, you may hear that large volumes of data can, in effect, dampen out any data inconsistency. This could lead to a line of thinking that the data quality of Big Data isn’t as important.

While conceptually this may be accurate, in reality it’s proving not to be the case. Data quality issues are impeding the success of many Big Data, data lake and similar projects. There are several reasons for the quality gap, including:

  • Using data well for one purpose does not mean it will be useful for all purposes. Context is important, and many business users set excessive expectations on their organizations’ data scientists.
  • Data scientists, who may be over-confident in the raw data they analyze, are learning that data that’s suitable for operations, for example, may not be suitable for deeper analysis.
  • Bringing in enormous amounts of raw data actually compounds quality issues rather than letting Big Data’s volume cover them up.

Data quality is important for any data initiative. (If you doubt this, try to find an example where quality doesn’t matter!) For example, while quality problems in a particular data type or element may be dampened by volume, inconsistency in related data elements could still ruin a solid analytical model.
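One way to see why volume alone does not dampen an inconsistency is a small worked example. The sketch below is purely illustrative and rests on assumptions that are not from any real system: every true amount is 100 dollars, and a fixed 1 percent of records were stored in cents instead of dollars. The function name and parameters are hypothetical.

```python
def mean_amount(n_records, bad_percent=1):
    """Mean of a hypothetical 'amount' field where bad_percent of records
    were stored in cents (10,000) instead of dollars (100)."""
    n_bad = n_records * bad_percent // 100   # records with the unit mix-up
    n_good = n_records - n_bad               # records stored correctly
    total = n_good * 100 + n_bad * 10_000
    return total / n_records


# The true mean is 100, but the computed mean stays near 199
# whether there are a thousand records or a million.
for n in (1_000, 1_000_000):
    print(n, mean_amount(n))
```

Under these assumptions the error is systematic rather than random, so adding more records of the same kind leaves the distortion in place.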

Data Quality is Inherently Inherited

The story “Why Data Quality is Key to Successful Digital Transformation” highlights an interview with Informatica’s CEO Anil Chakravarthy, who talks about data quality and how problems with it inhibit innovation and similar change efforts. Chakravarthy describes a common scenario where legacy systems and sources present a metadata mess:

“You may have a customer ID in one place and you may not have any ID at all in a different place. You may have the feeds, but the feeds may be incomplete or incorrect. There is a whole host of problems because many of these databases and data sources were not built with the assumption that the data would be taken off those systems and repurposed.”

Untangling the mess (for example, figuring out which data elements you need for analysis and where they originally came from) is part and parcel of validating the quality of data. And it’s particularly important with Big Data.

Data Quality is More Than Just Accuracy

In my article “Ensuring the Quality of Fit for Purpose Data,” I encouraged the C-Suite and data management professionals to consider quality with a focus on the data’s usage.

I wrote about the key data quality dimensions (completeness, timeliness, conformity, uniqueness, integrity, consistency and accuracy) and stressed the importance of aligning business goals with data understanding, quality being a part of that understanding. An organization’s business goals should drive the definition and prioritization of data quality aspects and the respective requirements. Data understanding tells you whether the quality of data meets those business requirements. If it doesn’t, your findings should also indicate whom to contact, such as the appropriate data steward, to determine the steps needed to improve the quality.
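To make a few of these dimensions concrete, here is a minimal sketch of how completeness, uniqueness and conformity might be measured. It is only an illustration: the sample records, the field names and the simple email pattern are all assumptions, and the thresholds a business would accept depend on how the data will be used.

```python
import re

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"customer_id": "C001", "email": "ann@example.com"},
    {"customer_id": "C002", "email": None},
    {"customer_id": "C002", "email": "bob@example"},
    {"customer_id": None, "email": "cy@example.com"},
]

# A deliberately simple pattern, assumed here as the conformity rule.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def populated(rows, field):
    """Values of a field that are neither missing nor empty."""
    return [r[field] for r in rows if r.get(field) not in (None, "")]


def completeness(rows, field):
    """Share of rows in which the field is populated."""
    return len(populated(rows, field)) / len(rows) if rows else 0.0


def uniqueness(rows, field):
    """Share of populated values that are distinct."""
    values = populated(rows, field)
    return len(set(values)) / len(values) if values else 0.0


def conformity(rows, field, pattern):
    """Share of populated values that match the expected format."""
    values = populated(rows, field)
    return sum(1 for v in values if pattern.match(v)) / len(values) if values else 0.0


print("completeness(customer_id):", completeness(records, "customer_id"))
print("uniqueness(customer_id):", uniqueness(records, "customer_id"))
print("conformity(email):", conformity(records, "email", EMAIL_PATTERN))
```

Whether a given score is good enough is a business decision, not a technical one, which is why the requirements come first and the measurements second.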

Data Quality Was “Big” Before Big Data

In 2012, when I wrote “Data Governance – How to Design, Deploy and Sustain an Effective Data Governance Program,” I talked about data quality and said that poor data quality was the root cause of the majority of data and information problems. I said that for any quality-related business case or proposal, it’s important to address the costs and risks associated with poor data quality.

I outlined data governance’s importance when it comes to data quality, because it ensures that:

  • Data quality standards and rules are defined and integrated into development and day-to-day operations
  • Ongoing evaluation of data quality occurs
  • Organizational issues related to changed processes and priorities are addressed
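As a rough illustration of the first two points, quality rules can be written down as named checks and run against each incoming batch. This is a sketch under stated assumptions, not a prescribed implementation: the rule names, the fields and the `evaluate` helper are all hypothetical.

```python
from typing import Callable, Dict, List

Rule = Callable[[dict], bool]

# Hypothetical rules a governance group might define; names are illustrative.
RULES: Dict[str, Rule] = {
    "customer_id_present": lambda r: bool(r.get("customer_id")),
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


def evaluate(rows: List[dict], rules: Dict[str, Rule]) -> Dict[str, List[int]]:
    """Return, for each rule, the indexes of the rows that violate it."""
    return {
        name: [i for i, row in enumerate(rows) if not rule(row)]
        for name, rule in rules.items()
    }


# A made-up batch; in practice this would run on every load.
batch = [
    {"customer_id": "C001", "amount": 25.0},
    {"customer_id": "", "amount": 10.0},
    {"customer_id": "C003", "amount": -5},
]

violations = evaluate(batch, RULES)
print(violations)
```

Keeping the rules in one named collection makes them easier to review with data stewards and to rerun whenever processes or priorities change.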

Not only do these points still hold true today, their relevance is increasing. The advent of the Internet of Things and the acceleration of artificial intelligence are bringing Big Data more and more into the operational aspects of organizations, with the accompanying impact and risk to operations and customer-facing data points.

Make Data Quality for Big (and Small) Data Your Mission

At the April Enterprise Data World conference, four of my peers joined me in announcing the new Leader’s Data Manifesto. Data quality is one of the aspects of the manifesto, and it’s something we and other data management professionals care passionately about.

How passionate are you about data quality? Read the manifesto and consider signing it as a public statement that you agree data quality is necessary (make that critical) for Big Data and all data types and initiatives.

Without a formal acknowledgment of, and an organizational commitment to, data quality, any data management, monetization or governance effort will falter.

Article contributed by John Ladley. He is a business technology thought leader and recognized authority in all aspects of Enterprise Information Management. He has 30 years’ experience in planning, project management, improving IT organizations and successful implementation of information systems. John is widely published, co-authoring a well-known data warehouse methodology and a trademarked process for data strategy planning. His books, “Making EIM Work for Business – A Guide to Understanding Information as an Asset” and “Data Governance – How to Design, Deploy and Sustain an Effective Data Governance Program,” are recognized as authoritative sources in the EIM field.