On reporting

March 11, 2008

This was originally posted as a reply to a thread in Paul Mace’s blog. I reproduce it here partly because it has themes I want to expand on and I think it would be better to have the root with the branches, and partly because, while prompted by the thread, it is a fully formed rumination and I wanted to call it out distinctly. The copy paste feels wrong, somehow, perhaps in its violation of the leave-it-where-created-and-link-to-it nature of the web.

A business succeeds by dint of its crystal balls: the primary driver of business reporting is predicting the future. That is the impetus behind the emergence of ‘business intelligence.’ (Of course, producing future is only one aspect of intelligence. But that is another discussion.)

We need a model for that forecasting: we need a subject whose future behavior we are predicting; we need historical behavior data about that subject; we need a forecasting algorithm.

The most interesting practical subjects are generally not individuals, but populations with common attributes. Even for targeted ads and dynamically customized web sites, the segments of which the individual is a member are used for the forecasting.

Our data stores hold type hierarchies – levels of increasingly specialized information in hierarchically arranged partitions: ‘person’ partitioned on ‘gender’ to ‘male’ and ‘female’ subtypes. (Models by their abstracting nature lose some fidelity with the real world: even our simple example has no slot for hermaphrodites. But that is another discussion.)

Our type hierarchies are in complex relationships with one another. Those relationships may be from any level of the hierarchy to any level of another hierarchy (or multiple other hierarchies for n-ary relations.)

Those relationships in a given domain have to be modeled as a lattice, not a tree. Even a simple data model will quickly have multiple parent entities for the same child entity. (Look at the example of XML. For real world use, the XML tree structure must frequently be augmented by XPath, which enables us to dynamically navigate multiple parent relationships, and transforms, which are frequently used simply to rearrange the data around the ‘pole’ of a different parent. )

So the business goal is to identify a segment of the set of related entities about which to predict the future, where such segment is describable by a rule whose literal values are attributes of the entities and their relationships. (Data mining is simply the identification of interesting segments by inference from behaviors of larger populations which are randomly segmented until we find a segment for which there is a predicitive correlation with the behavior).

The practical challenge with doing this is that there are a number of conventions for physicalizing these related type hierarchies. These conventions are frequently used in an ad hoc fashion. Those creating the database may not even be aware of what they are actually doing, and so may not do it in a consistent fashion. Different conventions may be followed at different levels. Different conventions may be used to related different entities.

On any given level, our type partition may be physicalized using a descriptor attribute in a single table, such as ‘gender’, with values ‘male’ and ‘female’, with the space concession of having null values for non-gender appropriate attributes (dress size), or it might be physicalized as separate tables, Person Male Female, with the performance concession of creating joins (and in some circumstances outer joins) to get at our data, or it might be physicalized as a powertype, with Gender a table with rows Male and Female…

The point is, in order to create the rule to capture the logical segment of our data we need to work at the logical level. Reporting tools do not allow us to readily bridge that gap.

The next challenge is time sequencing the data. Most data warehousing efforts are really about articulating the temporal nature of data that was not originally well structured. If we look at the fundamental data types in a database, we have integers, floats, strings and time. But time is not like the other fundamental domains. Finding everywhere the number 3 occurs in our data is not very interesting. Finding everywhere ‘Bob’ occurs may be interesting. But finding everywhere January 3rd, 2006 occurs is definitely interesting.

Time in the database is best understood as a function of event occurrences, where an event of course is an interesting state change for an entity in one of our type hierarchies. Each time an attribute of one of our entities changes, or an attribute of one of our relationships, which themselves may be entities in their own right – marriage, for example – we need to capture the history of that event occurrence. That history includes the timestamp and the agent of change.

An additional level of complexity is that entities and their relationships may participate in life cycles (or more rigorously governed state machines) where the set of prestates and poststates for a given state exist as metadata. But frequently that metadata is not physicalized, or if it is is done awkwardly.

The problem with respect to reporting is that we frequently don’t have all the event history we need, at the grain we need it, to feed into our algorithm. When we restructure our data to obtain it, the data is frequently gappy.

Selecting the appropriate algorithm as part of our model is another challenge. Many managers can generate a trend line in Excel. Few of them know what it means.

Predictions based on gappy data need algorithms beyond the skillset and toolset of most mid-level managers.

As a result of all this, what has emerged in the reporting space – the current toolset – is at the high end a set of conventional models whose use requires the data to be rigorously structured in advance – think star schema, and at the low end, graphical tools that enable us to navigate and work with the physical database, whose logical structure we must infer when we are desiging our reports.

What is needed is a tool that dynamically reverse engineers the logical model out of a database, laying bare what is really there. The logical model may have to be finished by domain experts. Not all information needed to create the model may be present. The user examines entity behaviors, whose history has been captured in detail and is described by lifecycles and state machines, and selects a behavior of interest. The user examines the related type hierarchies, and selects a segment. (A data cube is a simple tool for doing this.) Then they can choose to predict the segment’s future behavior, and are guided through the selection of a model and algorithm by the tool, or they can select to mine that behavior for interesting sub-segment correlations, and are guided by the tool to select the attributes and algorithm (k-means clustering, etc) to use for the mining.

That is the tool that would revolutionize reporting and BI.

Leave a Reply