Eventually, you see master data management (MDM) problems everywhere. A few years ago, I was turning off the Garden State Parkway in New Jersey and noticed an advertisement for an upcoming concert at a local venue. The performer was identified as “The artist formerly known as Prince.”
Prince, full name Prince Rogers Nelson, is by all accounts a talented and creative rocker. In 1995, he somehow got into a dispute with his record label, Warner Bros., and decided to change his name. He felt that they had controlled his creative output for too long and wanted a completely new identity. To achieve this goal, Prince changed his name to a symbol that he invented and which he deliberately chose not to call anything. This meant that Prince’s new name could never be pronounced or written down as words. Anyone can invent a surrogate key that has no meaning and describes no characteristic of what it is identifying. To create one that cannot be communicated is a mark of true genius.
The MDM Connection
Obviously, Prince is in a business where marketing and communication are important, and where a huge number of publications have to create content about popular music on a regular basis. The name change disconcerted the trade media, so they came up with the identifier “The artist formerly known as Prince” which soon got shortened to “TAFKAP.” True to form in MDM, Prince ended up being the subject of several identifiers. Eventually, he patched up his differences with Warner Bros. and went back to being Prince, except that some in the media now termed him “The artist formerly known as the artist formerly known as Prince” or “TAFKATAFKAP” for short.
Surrogate keys are basically made-up data elements that identify things in databases. They are not a characteristic of the real-world thing they are identifying, but simply serve to make instances of data about the thing unique.
The opposite of a surrogate key is a “natural key.” This is one or more pieces of data that uniquely identify a real-world instance of a thing, and which are actual properties of that thing. Here we can run into some problems, however. Exactly what characteristics identify a person? Are we talking about fingerprints, DNA profiles or iris patterns? No doubt we are for some specialized databases, but for most business applications, we have an identifier that is in reality a surrogate key, but which is managed externally to our business databases. A good example is Social Security number. We tend to think of these external surrogate keys as the equivalent of natural keys, but we are actually relying on the mechanisms that manage them to work perfectly. As Prince showed, that expectation may sometimes be misplaced.
Irrespective of the problems in trying to define natural and surrogate keys, they generate an awful lot of emotion in the world of data management. It is easier to talk to a group of data modelers about religion and politics than it is to discuss the relative merits of natural and surrogate keys.
Affinity for Surrogate Keys
Programmers in particular love surrogate keys. If you see a data model where every table has a single attribute primary key with the term “ID” in its name and no identifying relationships in sight, then there is a strong chance that it was made by a programmer. What is it about the programmer mind-set that makes them gravitate to surrogate keys? Having thought about this, I now believe that it is connected with the nearness of programmers to the technical objects they manipulate and their remoteness from the business world. This perspective can make surrogate keys seem vital and natural keys irrelevant, which inverts the real-world perspective of more business-oriented data analysts, leading to the clashes over keys that we have all witnessed.
A programmer has to manipulate data. The data represents real-world things, but not to a programmer. To a programmer, the data itself is a real-world thing. The data sits in databases, where it is organized into tables and columns. But it is also organized by one other concept that is the Cinderella of the data world. It is organized by records.
Pretend for just a moment that we are programmers. We know that programmers have to deal with the data stored in databases. SQL has always carried the promise of “set at a time” processing in terms of records, and this is probably where the demise of the prestige of the record originated. However, a great number of data manipulation problems are not soluble by “set at a time” processing with SQL. Individual records have to be processed, even if they are part of a set extracted from a database via SQL. And here comes a very unpleasant conclusion: a record is a thing of interest to a programmer. Therefore, a programmer, albeit operating in a different plane of abstraction, is in exactly the same position as a business user manipulating business concepts and instances. The programmer feels, and justifiably so, that a record is as vital to his or her work as a customer is to the manager of a call center. A record, then, must have a unique identifier. The surrogate key is not intended to identify the business thing that is abstracted into a database table — it is intended to identify the records that actually live and move and have their being in this table. Nothing outside of this table corresponds to the concept of a record, and so the key must be assigned within the context of the application that manages the table.
Until we get a good handle on this problem, programmers are likely to regard data analysts with the same disfavor as Prince viewed Warner Bros. — that is, as a group out to prevent them from doing what they are supposed to do. Likewise, data analysts are likely to regard programmers as little “Princes” who cannot follow the rules that are there to make things work.