Data Governance for Sourcing Data from the Web

By FSFP

When you think of “big data,” there’s nothing bigger than the massive amount of data on the web. There is so much to read and learn from, and so much that individuals and companies want to bookmark or download to peruse and use later.

Big, too, are the endless ways organizations want to use this information. For example, suppose an organization wants to know about initial public offerings (IPOs). It could find a publicly accessible filing site where IPOs are listed and either purchase the information or download it for free. Another organization might want to know how many homeowners live in a specific city or ZIP code because it hopes to add a new retail store in the area. If it could find the information on a city government’s “open data” site, easy to download and at no cost, the data could quickly be put to use.

Using Online Data: It’s Not That Simple

Earlier this summer, I partnered with Lisa Baughman, North America Data Governance Leader from Dun & Bradstreet (D&B), on a presentation for the Data Governance and Information Quality Conference (DGIQ), “Governing the Source of Data from the Internet.”

Judging from the interest we saw from people attending our DGIQ session, it was apparent that data management professionals have many questions and concerns about this topic. For example:

  • What information is available online, and how can it help my business?
  • Can we collect data and then later repackage or resell it?
  • How does data governance play a role in data collected from the web?
  • How can I make sure data gathered online is legal and compliant?

These are important questions for a somewhat sensitive topic — and one that continues to evolve, despite the web being around for so long. While I won’t cover everything you need to know, I will give you things to consider as you work within the guidelines of what’s considered legal, compliant and data governance-friendly in your organization.

The Role of Data Governance

Web-sourced data needs governing. While data governance (DG) does not itself carry out the steps below, DG should require the business areas that need the data to do so. Here are the key points, written from the perspective of the business sponsor, that DG should check have been satisfactorily addressed. After going through this exercise a few times, business sponsors will be familiar with what is needed and be prepared to provide it.

Define what data is needed. While this may seem obvious, carefully think through and then summarize your business need for the data. This will come in handy when you represent the need to your organization and to DG. Outline specifically what you plan to do with the data, because while it may be perfectly acceptable to use it for one purpose, it could be prohibited for another.

– Inventory the data you require, and make sure it does not already exist in the enterprise.

Define whether the data is to be stored. If you’re using the data just to confirm a fact you think you know and are not storing it in your database, you may have fewer or even no requirements from a governance standpoint. For example, you believe you already have the names and addresses of key suppliers and just need to verify the information. In this instance, you’re not downloading the information, just using it for confirmation.

– Verifying information from reputable sources is a common practice and is generally acceptable to DG, with little to no risk.

Know the data provider’s policies. There are many companies that sell data today, like D&B. Each company has contractual limitations on what can be done with the data. You need to understand these limitations, as they often apply to how the data can be used. This is not easy to appreciate in many organizations. The business may think they can use all data within their firewalls for any purpose, as long as access is controlled and privacy is respected. However, this is not the case any longer. All usage of purchased data must be defined and approved in advance. This also applies to data that is from “open sites.” While these sites might not charge a fee, you may be agreeing to contractual limitations when you source data from such a site.

– Once you know the policies, keep good records of this information.

Know the data provider’s collection methods. Not all data providers sell pre-packaged data. Many offer services to go out and collect data on your behalf. It is very important to qualify these providers to ensure they are following best practices. For instance, do they anonymize their searches? (Generally, this is not a good idea as it reduces transparency.) Do they hit sites at a frequency that might cause disruption?

– Develop a standard checklist for data providers, with enough detail that anyone who reads it in the future can tell what was assessed.
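
One of the questions above, whether a collector hits sites at a frequency that might cause disruption, can be made concrete. What follows is a minimal Python sketch, not a definitive implementation, of how a collector might consult a site's robots.txt rules before fetching a page and decide how long to pause between requests. It uses only the standard library's urllib.robotparser. The user agent string, the five-second fallback delay and the function names are illustrative assumptions, and honoring robots.txt does not by itself satisfy a site's terms and conditions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent: a collector should identify itself openly
# rather than hide behind anonymized requests.
USER_AGENT = "example-data-collector"


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True if the site's robots.txt allows our user agent to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


def delay_seconds(parser: RobotFileParser, default: float = 5.0) -> float:
    """Use the site's Crawl-delay if it states one, else an assumed default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default
```

A provider's answers to the checklist can then be compared with behavior like this: does it check the rules before each fetch, and does it wait at least the stated delay between requests?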

Understand the terms and conditions of source sites. Nearly every site has terms and conditions, and data providers are no exception; theirs may be even more specific. This “fine print” needs to be understood, as it may prohibit commercial reuse or impose similar restrictions. Some terms and conditions restrict the way in which data can be obtained, e.g., by prohibiting “screen scraping.”

– Record each site’s terms and conditions (even though this can be time-consuming), and periodically recheck them if you keep taking the data.
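
Rechecking terms and conditions by eye is tedious, so a small amount of automation can help flag when a recheck is needed. The sketch below is one possible approach, offered under stated assumptions rather than as a prescribed method: it stores a fingerprint (a SHA-256 hash) of the terms text on the day it was reviewed, then compares a fresh copy against that fingerprint. The TermsSnapshot class and the function names are hypothetical, and a changed fingerprint only signals that someone should reread the terms; it says nothing about what changed or whether it matters.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TermsSnapshot:
    """Record of a site's terms and conditions as of the day they were reviewed."""
    site: str
    captured_on: date
    sha256: str


def fingerprint(terms_text: str) -> str:
    """Hash the terms text, ignoring differences in whitespace only."""
    normalized = " ".join(terms_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def snapshot(site: str, terms_text: str, captured_on: date) -> TermsSnapshot:
    """Create the record to file alongside the reviewed copy of the terms."""
    return TermsSnapshot(site, captured_on, fingerprint(terms_text))


def terms_changed(previous: TermsSnapshot, current_text: str) -> bool:
    """Return True if the current terms differ from the reviewed snapshot."""
    return previous.sha256 != fingerprint(current_text)
```

The full text that was reviewed should still be kept on file; the fingerprint is only a cheap way to notice that it no longer matches what the site publishes.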

Understand the data’s origin, because different countries have different rules. You’ll need to know the data’s provenance and pedigree – where it came from and what it went through to get to where it is today. And if you collect data from one country, it doesn’t mean you can use it in another, as different jurisdictions often have their own laws.

– Again, do not just document this information; be prepared to share it with internal partners like DG and legal.
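
To make this documentation easy to share, origin and handling history can be captured in a simple structured record. The following Python sketch is illustrative only: the ProvenanceRecord class, its fields and the use of two-letter country codes are assumptions, and the list of approved countries is meant to be filled in from decisions made by legal and DG, not inferred by code.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Where a dataset came from, what happened to it, and where it may be used."""
    dataset: str
    source: str
    origin_country: str
    # Countries where legal and DG have approved use (assumed to be
    # recorded as two-letter codes).
    approved_countries: set = field(default_factory=set)
    # Ordered notes on each step the data went through.
    history: list = field(default_factory=list)

    def add_step(self, description: str) -> None:
        """Append a handling step, such as a download or a transformation."""
        self.history.append(description)

    def approved_for(self, country: str) -> bool:
        """Return True only if use in this country has been explicitly approved."""
        return country in self.approved_countries
```

Defaulting to "not approved" unless a country is explicitly listed reflects the point above: collecting data in one country does not mean it can be used in another.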

Engage legal, compliance and risk departments at your organization. Your legal partners understand contracts, laws and regulations. But they don’t necessarily understand the nuances of how things work in the data world. If this is a new area for your company, your legal, compliance and risk partners may need to engage outside counsel to get answers. Prep them with adequate information; more is often better. And this should go without saying, but be sure to thoroughly read what you’re providing to legal, like terms and conditions, so you are not surprised by obvious concerns they bring up.

– Get internal partners involved early on in the data acquisition process, since vetting and approvals vary depending on your bureaucracy and industry.

Getting Governance Done

How can DG organize itself to make sure every concern is adequately addressed?

One way is to issue a policy on sourcing data from the web. This should lay out what can and can’t be done, and provide clear points where the DG organization must be engaged. The policy does not have to be very detailed — that can be left to separate procedures and checklists.

Unfortunately, not every DG organization has authority to issue data policies. That can be a separate discussion, but ultimately, DG does need to get this authority, and not just for sourcing data from the web. There is a wide range of data needs that are best dealt with via policies. A related issue is that some DG organizations feel uncertain about issuing policies. Typically, these organizations are deeply rooted in IT and do not take a leadership role because they expect requirements and decisions to be made by the business.

If policies can’t be enforced, an alternative is to engage with the Architectural Review Board, Project Management Office or equivalent body that either approves or runs projects where data sourcing may occur. DG can become a checkpoint at which the web sourcing approach needs to be approved.

Also, DG needs to build a strong relationship with the legal department (usually the office of the general counsel in the U.S.). This can include getting them to review contracts and to keep up to date on changes in laws and regulations that may impact sourcing data from the web. This relationship must be put in context, as in some enterprises the DG organization is perceived as being merely paralegals working for the legal department.

A Future-Proof Framework

The rate at which enterprises are bringing in external data seems to be increasing. Technology is also putting these capabilities in the hands of end users in the business, who may not understand the implications of what they are doing.

While this puts extra pressure on governance professionals, with the right policies, processes and documentation in place, the DG organization will have a thoughtful framework to support this growing and complex area.


Article contributed by Malcolm Chisholm. He brings more than 25 years’ experience in data management, having worked in a variety of sectors including finance, insurance, manufacturing, government, defense and intelligence, pharmaceuticals and retail. Malcolm’s deep experience spans specializations in data governance, master/reference data management, metadata engineering, business rules management/execution, data architecture and design, and the organization of enterprise information management.
