Chris Date is one of the most prolific and influential writers on database topics. His ‘Introduction to Database Systems,’ now in its eighth printing, is the gold standard. Having a dog-eared copy of an early printing is a badge of honor among db types.

Chris Date is also an opinionated cuss. His introduction of new terminology (quick, what’s a RELVAR?), and his sharp opinions presented in his instantly recognizable, slightly supercilious style tend to polarize his audience. So taking pot shots at Date’s opinions is a common sport in the industry. But Date is smart as a whip, and his acolytes are legion, so if you are going at him you’d better bring your A game.

My turn.

Date first described something he called ‘The First Great Blunder’ in a paper titled ‘Introducing…The Third Manifesto’ he and his frequent co-author Hugh Darwen wrote back in ’95. The ‘Blunder’ struck me as wrong at first reading. But it was wrong in some low level conceptual way that was, that has been, difficult to articulate. And Date, who seems to be the most fond of his opinions when they skewer some conventional praxis, has put references to said Blunder in many of writings since then – not only the full book version of ‘Third Manifesto,’ but into updates of ‘Intro to Database Systems’ as well – so that it is In My Face All The Damn Time.

So the time has finally come to drive a stake through Date’s Blunder, and show that it is really a case of the First Great Blinder. And the refutation will take us through the pretty interesting territory that lies under databases and data models – the land of set theory, logic and identity – and I think we’ll end up not only with a refutation of Date, but also an interesting conceptual perspective on the world of object-relational mapping.

Ok, so what is The Blunder? The Blunder comes at the intersection of The Relational World and The Object World. Date suggests the key question in object-relational mapping is this: ‘What concept is it in the relational world that is the counterpart to the concept object class in the object world’? And he characterizes the most common answer, mapping a table (in Date-Speak a relational variable or ‘RELVAR’) to a class, as The First Great Blunder.

Date understands that that mapping is a very natural one to make. If we look at the DDL for a Person table

CREATE TABLE PERSON {
FirstName VARCHAR2(20),
LastName VARCHAR2(20),
Gender CHAR(1),
Birthdate DATE
}

and the (Java) code for a class definition

class Person {
String FirstName;
String LastName;
char Gender;
Date Birthdate
}

it looks like a no-brainer. What could be more obvious? So what problem could Date have with this?

The answer requires a longer explanation of the relational world. Date’s vocabulary is full of ‘domains’ and ‘attributes’ and ‘tuples’ and ‘relations’ and ‘relvars’ instead of ‘data types’ and ‘columns’ and ‘rows’ and ‘resultsets’ and ‘tables.’ I’ll try to use the latter vocabulary so that those who haven’t had the KoolAid can follow without a dictionary, and those of us who found the KoolAid too bitter to swallow can follow without doing the internal simultaneous translation of the non-native speaker.

Date believes the correct mapping is not from class to table, but from class to data type.

Sounds like another no-brainer. A class is a data type.

The natural answer given our two no-brainers is that a class can map to a data type or a table. So the key question to understand Date is why does he think a class can’t map to a table?

Because he believes a table is a variable, and a class is a type. And that a class has methods and no public instance variables, whereas a table has public instance variables and only optionally methods.

Because he believes the fundamental value of an instance of a table is the set of all rows – a resultset or Date’s ‘relation’ – and the fundamental value of an instance of a class is a single object.

A couple of threads are entertwined here that we need to tease apart. The first is the database side distinction between data types and tables – between ‘domains’ and ‘relvars.’ The second is the single instance to set question.

What Date has missed on the database side is that the distinction he is drawing between data types and tables is an implementation distinction, not a conceptual one. And the important conceptual distinction he needs to visit and does not is that between entities – those instances that have identity – and non-entities. His mistake – the First Great Blinder – comes from not opening up his mind wide enough. No no no no he’s ou-out side, looking in…

Taking our inspiration from Jorge Luis Borges’ short story ‘The Library of Babel‘, let’s imagine in place of each of the data types we have a table holding all possible values of that data type. So for integers, we have a table ‘Integer’ with two columns, ‘SurrogateKey’ and ‘Value’, and an infinite number of rows, one row each for each integer.

“A moment” says the perceptive gentle reader. “What data type is ‘SurrogateKey?’ And more importantly, what data type is ‘Value?'”

The type of SurrogateKey we shall leave undefined except to say it is a unique identifier with aleph-null possible values. And the type of ‘Value?’ Why, integer, of course, says the White Rabbit. In infinite regress. But that is in the nature of the definition of number. And for each possible width of VARCHAR, such as 1 or 223, we have a table such as VARCHAR_1 or VARCHAR_223 whose columns are ‘SurrogateKey’ and ‘Value’ where the datatype of Value is ‘VARCHAR(1) or ‘VARCHAR(223). And the rows contain all the possible permutations of all the valid characters for VARCHARs. So that our table VARCHAR_1’s rows might be

KEY VALUE

1 a

2 b

etc.

And EVERYDATE is a table with two columns, ‘SurrogateKey’ and ‘Value,’ and its rows contain all possible date values. (Those of you who have designed data warehouses may have just had an ‘A Ha’ moment.)

So now, at our database design level, our tables consist only of sets of foreign keys with constraints binding them to other tables.

And our Person table definition becomes

CREATE TABLE PERSON {
FirstName mystical_surrogate_key,
LastName mystical_surrogate_key,
Gender mystical_surrogate_key,
Birthdate mystical_surrogate_key,
CONSTRAINT fkFirstName FOREIGN KEY (FirstName) REFERENCES Varchar_20(SurrogateKey),
CONSTRAINT fkLastName FOREIGN KEY (LastName) REFERENCES Varchar_20(SurrogateKey),
CONSTRAINT fkGender FOREIGN KEY (Gender) REFERENCES Varchar_1(SurrogateKey),
CONSTRAINT fkBirthdate FOREIGN KEY (BirthDate) REFERENCES EveryDate(SurrogateKey),
}

Ok. So now, in place of the surrogate keys just imagine that each actual value of a datatype, each varchar, or each integer, is not a literal value, but a foreign key to a table holding all possible values of that datatype…but because those tables have no other data, only the single columns, we never instantiate them explicitly. We can make direct references from one key value, such as the integer 23, to the same key value, 23, anywhere they occur.

The long-winded point here is that the distinction between data types and tables is not a necessary one, not a base conceptual distinction. There is no difference. The distinction is a practical one, one of convenience. All there are at the bottom are sets and sets of sets. It’s turtles all the way down.

The First Great Blinder is Date’s attempt to map from one implementation to another without reference to an underlying conceptual model. This is the most common error made in integration projects. We map one tool’s export file directly to another tool’s import file, and, rather than creating a common logical domain model to guide the mapping, we let Junior Programmer do it in their cube. And inevitably we end up with mapping errors.

The more important distinction we should be making in both realms is between entities and non-entities. An entity has identity. It is unique in its space. We describe an entity with a set of attributes (which as we’ve seen are sets of sets of sets of …), just like we describe something without identity. So what is the difference? An entity has continuity over time. It has a continuous existence. This notion is fundamental to OR-Mapping, not the distinction between data type and tables.

I’ll have a lot more to say about the nature of identity in another long-promised rumination.

The second mistake Date makes is to use the single table and single class as his canonical example for the mapping. As anyone who has ever worked in the trenches of OR mapping, or model-driven architecture tools, knows, single tables and single classes are just simple cases of the much more useful, and _more fundamental in most important ways_, relations – not Date’s relations but queries written to access joined tables, and – and there does not seem to be a good word here – graphs of related classes.

What I’d like to point out here – I am wrapping this up for now, but will try to flesh this out more soon – is the important mapping is not happening at the individual class or table level. Date make an error in characterizing the mapping of the relational world to the object world with a the mapping from a single class to a relation. That’s obviously not a rich enough example. And the people who have actually designed and used the tools of OR mapping know what’s right.

So, Mr. Date, here we are. The First Great Blunder is not a blunder at all. It is a partial truth. And your preferred mapping is but another piece of the same truth. You asked one of the key questions, but you were simply wearing the First Great Blinders and couldn’t see the arena where the full truth plays out. To a hammer everything is a nail.

A class can map to a data type. A class can map to a table. Your tuple – a row of a table – can map to an object. An instance of table data – a resultset, your relation – can map to a set of objects, generally in some container managing access to them.

But in the larger arena, these are the fundamental mappings:

A resultset (Date’s relation) maps to an object graph.
A query definition (Date’s relational variable) maps to the metadata description of an object graph – let’s coin an expression and call it a class graph.
A row of a resultset (Date’s tuple) maps to a set of individual objects in an object graph connected by pointers.

The fundamental importance of the object graph has become more and more clear over time. It has emerged at the heart of Service Data Objects, which are object graphs that can be disconnected from the mothership, modified, and phoned home.

More on object graphs soon.

Advertisements