Computers are good at many things, but let’s focus on just two attributes — sorting and searching — and their relationship to data warehouses and data lakes.
What I like about these two attributes is their relatability: we sort and search for things every day. We might sort the books on our bookshelf in alphabetical order. Or if we have many books, we might sort first by subject and then alphabetically. And when it comes to searching on a typical day, we might look for a specific recipe in a cookbook or for some misplaced keys.
Sorting is organizing information
In the nineteenth century, there was a need for a more effective way to tabulate census data. American inventor Herman Hollerith created a machine that sorted punched manila cards that contained information about individuals. (The start of collecting customer master data!)
Sorting makes the management of people, things and information easier. But the greater the number of things, the harder it is to sort them. That’s why people created algorithms to help with sorting. The best a general-purpose sorting algorithm that works by comparing items can do is what computer science calls linearithmic time: for n items, the work grows in proportion to n log n. Sorting makes things neat.
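To make the bookshelf example concrete, here is a minimal Python sketch of sorting by subject and then alphabetically by title. The Book type, the sort_books helper and the book titles are made up for illustration. Python's built-in sorted() is a comparison sort that runs in linearithmic time in the worst case.

```python
from typing import NamedTuple


class Book(NamedTuple):
    # Hypothetical record type, for illustration only.
    subject: str
    title: str


def sort_books(books):
    """Sort by subject first, then alphabetically by title within a subject."""
    # The key is a tuple, so subject is compared first and title breaks ties.
    return sorted(books, key=lambda b: (b.subject.lower(), b.title.lower()))


# Made-up titles standing in for a messy shelf.
shelf = [
    Book("Cooking", "Weeknight Curries"),
    Book("History", "Census Machines"),
    Book("Cooking", "Bread Basics"),
    Book("History", "A Short Atlas"),
]

for book in sort_books(shelf):
    print(book.subject, "|", book.title)
```

Sorting on a tuple key mirrors the "first by subject, then alphabetically" idea without writing a custom comparison function.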
Searching helps us find specific things, among many
Let’s look at the two types of search algorithms:
- Algorithms that do not make any assumptions about the order of things you are searching for (like those misplaced keys)
- Algorithms that assume things are already sorted (the recipes in a cookbook)
Search algorithms essentially search by comparing the specific thing we are searching for against the many things that can be searched. Searching can deal with messy groups of things, since the comparison is done on the fly, but in the worst case every item has to be checked, so the best we can achieve is linear time. And you can do much better if you are searching things that are already sorted: a binary search halves the remaining candidates with each comparison and finishes in logarithmic time.
Sorting and searching work well together because a computer can do both, even at the same time. And here’s how this relates to a data warehouse and a data lake.
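The two kinds of search can be sketched in a few lines of Python. This is an illustration under simple assumptions: the function names and the sample data are invented, and the second function relies on the standard-library bisect module, which only gives correct answers when the input is already sorted in ascending order.

```python
from bisect import bisect_left


def linear_search(items, target):
    """Check items one by one; makes no assumption about their order."""
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1  # not found


def binary_search(sorted_items, target):
    """Assumes sorted_items is in ascending order; halves the range each step."""
    index = bisect_left(sorted_items, target)
    if index < len(sorted_items) and sorted_items[index] == target:
        return index
    return -1  # not found


# A messy drawer: no order, so we have to look at things one at a time.
drawer = ["tape", "keys", "coins", "pens"]
print(linear_search(drawer, "keys"))

# A cookbook index: already alphabetical, so we can jump around.
recipes = ["bread", "curry", "pie", "soup"]
print(binary_search(recipes, "pie"))
```

For a handful of items the difference does not matter. For a million sorted items, the binary search needs about 20 comparisons, while the linear scan may need all million.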
Data warehouses are neat
Data is acquired, cleansed, integrated and stored in neat little boxes, so the information can be accessed for various purposes. During this process, data can go through various stages before it’s ready for further use. With a data warehouse, we control what data is allowed in, and in what form. Think of a New York-based co-op where a committee approves each tenant who wants to move into that building, based on a certain set of standards. In a data warehouse, not just any data can get in.
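The co-op committee can be pictured as a check that runs before anything is stored. The Python sketch below is only an illustration of that idea: the CUSTOMER_SCHEMA columns and the admit and load_into_warehouse helpers are hypothetical, and a real warehouse would enforce its rules through table definitions and load pipelines rather than a dictionary.

```python
# Hypothetical schema: the columns and types a row must have to be admitted.
CUSTOMER_SCHEMA = {"customer_id": int, "name": str, "country": str}


def admit(row, schema=CUSTOMER_SCHEMA):
    """Return True only if the row has exactly the expected columns and types."""
    if set(row) != set(schema):
        return False
    return all(isinstance(row[col], typ) for col, typ in schema.items())


def load_into_warehouse(rows):
    """Split incoming rows into those that meet the standard and those that don't."""
    accepted, rejected = [], []
    for row in rows:
        (accepted if admit(row) else rejected).append(row)
    return accepted, rejected


incoming = [
    {"customer_id": 1, "name": "Ada", "country": "UK"},
    {"customer_id": "two", "name": "Bob", "country": "US"},  # wrong type
    {"customer_id": 3, "name": "Cy"},                        # missing column
]

accepted, rejected = load_into_warehouse(incoming)
print(len(accepted), "accepted;", len(rejected), "rejected")
```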
Data lakes are messy
Data comes into a data lake in its raw form. Most of the time, we do not know what the data contains, the quality of that data or the purpose for the data. Think of the data lake like a condominium, where few rules exist: in general, all it takes is a simple credit check and a deposit.
Finding data in a warehouse is very straightforward: things are organized, so we know where to look. Finding data in a data lake is not easy, because we do not have a lot to go on. Data warehouses are organized for a purpose, and the data is sorted. Data lakes are large repositories of unsorted data that are looking for a purpose.
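By contrast, landing data in a lake can be sketched as storing whatever arrives, untouched, with just enough metadata to know where it came from. In the Python sketch below, the land_raw helper, the source names and the in-memory list standing in for the lake are all assumptions for illustration.

```python
from datetime import datetime, timezone


def land_raw(lake, source, payload):
    """Store the payload exactly as received, plus minimal arrival metadata."""
    entry = {
        "source": source,
        "landed_at": datetime.now(timezone.utc).isoformat(),
        "raw": payload,  # untouched: no parsing, cleansing or validation
    }
    lake.append(entry)
    return entry


lake = []  # stand-in for object storage or a file system

# Three feeds in three different shapes; all of them are let in.
land_raw(lake, "web_clicks", '{"user": 17, "page": "/home"}')
land_raw(lake, "sensor_feed", "temp=21.5;unit=C")
land_raw(lake, "legacy_export", b"\x00\x01 unknown binary")

print(len(lake), "items landed")
```

Nothing is rejected here, which is what makes landing a feed quick. The cost shows up later, when someone has to work out what each item means.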
Other differences between warehouses and lakes
Data warehouses are traditionally modeled as dimensional repositories (the Kimball approach to building a data warehouse) or normalized as corporate information factories (the Inmon approach). Data lakes, on the other hand, are not modeled. And this is one of the primary attractions of a data lake.
We know that sorted things are easy to retrieve. When we go to a library and want to find a book, we find the Library of Congress number and walk toward the shelves that include that number. The shelves are typically organized by subject and lexicographically within the subject. This makes it easy for us to narrow our search to a specific shelf and section to find the book.
When things are not organized, it can make searching more difficult. Despite my earlier reference, my bookshelf at home is not sorted. But I have an idea of where the books are stored, which helps in my search. When I am looking for something to read, I do a quick scan of my books to find what I am looking for. You could say that I have tagged my books in my mind by subject, by size, the way the cover looks and whether it is a paperback or hardcover. I’m searching with prior knowledge, and, essentially, this is what a search engine does. Google and the other search platforms do not necessarily sort all the data they gather, but they tag it for easy retrieval.
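Tagging for retrieval can be sketched as a small index that maps each tag to the items carrying it. The Python below is a simplified illustration, not a description of how any particular search engine works. The book titles, the tags and the build_tag_index and find helpers are invented.

```python
from collections import defaultdict


def build_tag_index(items):
    """Map each tag to the set of item names that carry it."""
    index = defaultdict(set)
    for name, tags in items.items():
        for tag in tags:
            index[tag].add(name)
    return index


def find(index, *tags):
    """Return the names of items that carry all of the given tags."""
    if not tags:
        return set()
    result = set(index.get(tags[0], set()))
    for tag in tags[1:]:
        result &= index.get(tag, set())
    return result


# An unsorted shelf: the books are in no order, but each has a few tags.
books = {
    "Weeknight Curries": {"cooking", "hardcover"},
    "Graph Theory Notes": {"math", "hardcover"},
    "Bread Basics": {"cooking", "paperback"},
}

index = build_tag_index(books)
print(find(index, "cooking", "hardcover"))
```

The shelf itself is never sorted. Only the index is organized, which is why a lookup by tag does not require scanning every book.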
What’s best: data warehouse or data lake?
What is the best way to store and use corporate data? As with many questions, the answer is: it depends. Warehouses are purpose-built to serve a specific need. Data lakes are large repositories of corporate data that can serve many purposes.
Data warehouses, by design, are inflexible. We have to know many things about the data before it gets into the warehouse in a form that is usable. And data warehouses typically do not differentiate between users. For example, a business user and a data scientist can have access to similar data. But receiving a new data feed generally takes weeks to months, which may not be fast enough for a data scientist to generate insights from the new data.
Data lakes are, by design, flexible for receiving new data feeds. It is fairly simple to land a new data feed, since all data is stored in its raw form. Data lakes can also serve different classes of users. For example, a data mart can be built on top of a data lake to serve business users, while raw data can be supplied to data scientists for quick analysis.
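One way to picture a data mart on top of a lake is as a function that reads the raw records, keeps the ones it can interpret and returns a tidy summary, while the raw records stay available as they are. The Python sketch below assumes the raw zone holds one JSON document per line with hypothetical order_id, amount and region fields. The sales_by_region helper is invented for illustration.

```python
import json

# Hypothetical raw zone: whatever arrived, including one line that is not JSON.
raw_zone = [
    '{"order_id": 1, "amount": "19.99", "region": "east"}',
    '{"order_id": 2, "amount": "5.00", "region": "west"}',
    "not json at all",
    '{"order_id": 3, "amount": "7.50", "region": "east"}',
]


def sales_by_region(raw_records):
    """Build a small mart: total sales per region, skipping unreadable records."""
    totals = {}
    for line in raw_records:
        try:
            record = json.loads(line)
            region = record["region"]
            amount = float(record["amount"])
        except (ValueError, KeyError, TypeError):
            continue  # structure is applied at read time; bad records are skipped
        totals[region] = totals.get(region, 0.0) + amount
    return totals


mart = sales_by_region(raw_zone)   # tidy view for business users
print(sorted(mart))
print(len(raw_zone), "raw records still available for data scientists")
```

Because the structure is applied when the data is read rather than when it lands, a different class of user can build a different view over the same raw records.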
Knowing your organization’s specific data storage needs is critical, but I would say the ideal solution could be in the middle. Modern warehouse architecture should have a landing zone (as in the extract, load, transform, or ELT, method), and this is, essentially, a data lake. It should be easy for your users to acquire new streams and types of data, and that data should also be easy to use. This requires minimal sorting (by categorizing and tagging) and powerful search features.
Long live warehouses and lakes
Going forward, new data-storage technologies will emerge. But data warehouses and data lakes will continue to exist in some shape or form, because they’re suited for different needs. It’s the same thing with my bookshelf and home office. It may look like I have many messy piles of things, but each pile has a certain order that makes it easy to work with.
Article contributed by Kasu Sista. He has more than 25 years of experience in information technology, strategic/solution alignment and project/program management, partnering with organizations on initiatives that are focused on data governance, metadata, data quality, business intelligence and analytics.