The Heart of Data Virtualization (Part 1)

Posted by Steve Cormier

2/28/15 10:51 AM

Part 1: Data Virtualization and Business Data Modeling 

Data virtualization is a technology that’s receiving a great deal of attention, as it offers solutions that are more efficient and economical than predecessor technologies such as data warehousing/ETL.

Cisco’s acquisition of Composite, the leading DV solution, puts the weight of a major player behind DV, while vendors who have long been in the data warehouse/ETL market are working hard to add DV to their products.

The fundamental concept of DV is that rather than moving and transforming data into a new repository for access, data is left in the original data stores and transformed ‘on the fly’, eliminating much of the difficult data management and cost associated with a separate data store. Creating a logical layer on top of diverse sources and performing ‘just in time’ transformation is the core concept, or the ‘brain’, of DV.

But DV is a technical framework, and within that framework has to beat a heart of content: the meaning that the framework lives to contain. The heart of data virtualization is the business model. What does that mean? First we need some background on data processing and modeling.
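To make the ‘logical layer’ idea concrete, here is a minimal Python sketch. The source names, schemas, and the `virtual_customer_view` function are all invented for illustration (no DV product works exactly this way): two independent sources stay where they are, and a view function joins and transforms their data only at query time.

```python
import sqlite3

# Two independent "source systems", each with its own schema. In a real
# DV deployment these would be remote databases, files, or services;
# here they are in-memory SQLite databases for illustration.
bank = sqlite3.connect(":memory:")
bank.execute("CREATE TABLE accounts (cust_name TEXT, balance REAL)")
bank.execute("INSERT INTO accounts VALUES ('Jane Jones', 2500.0)")

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (name TEXT, email TEXT)")
crm.execute("INSERT INTO customers VALUES ('Jane Jones', 'jane@example.com')")

def virtual_customer_view(name):
    """A 'logical layer': joins data from both sources on the fly.

    Nothing is copied into a warehouse; each call reads the live
    sources and transforms the result just in time.
    """
    (balance,) = bank.execute(
        "SELECT balance FROM accounts WHERE cust_name = ?", (name,)).fetchone()
    (email,) = crm.execute(
        "SELECT email FROM customers WHERE name = ?", (name,)).fetchone()
    return {"name": name, "email": email, "balance": balance}

print(virtual_customer_view("Jane Jones"))
# {'name': 'Jane Jones', 'email': 'jane@example.com', 'balance': 2500.0}
```

If either source changes, the next call to the view reflects it immediately, which is the point: there is no second copy of the data to keep synchronized.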

How Modeling Came to Be 

A long time ago, people built systems based on functional objectives—a kind of assembly line of data processing.  If I wanted to process checks, I wrote a program that was based on that processing, with no broader goals.  I might have a customer’s personal information, bank account info, particular check info etc. all stuck together in a record in a table.  That made sense.  It was easier to get to all the information involved in check processing if it was all in one place.

[Figure: a single bank account record combining customer, account, and check information]

The problem, though, is that you might then have a mortgage processing requirement. Okay, no problem: we’ll create a new table/record with customer information, mortgage information, and the customer’s bank account info for eligibility requirements. Now we have two systems…

[Figure: bank account record]

[Figure: mortgage account record]

A dilemma: Jane got married and now we have two different versions of her! We have her as ‘Jones’ in the bank account system and ‘Jones-White’ in the mortgage system.

This goes on, and more and more tables/records are created, most holding much of the same information (customer, account…). But each system represents things somewhat differently. In one, each customer is given a unique ID. In another, a combination of name and address identifies the customer. A third assigns a customer ID, but a different one than in the other systems, so John Smith (id=100) in one system is Jane Jones (id=100) in another.

Pretty soon all this becomes hard to manage. Jane Jones gets angry that she received a mortgage ad meant for John Smith because of the colliding IDs in the two systems. Trying to report total bank revenue by customer is impossible.
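The update anomaly described above can be sketched in a few lines of Python (the records and values are hypothetical):

```python
# Process-oriented design: each system keeps its own copy of the
# customer's data, so one real-world change must be applied everywhere.
bank_account = {"cust_id": 100, "cust_name": "Jane Jones", "balance": 2500.0}
mortgage_account = {"cust_id": 100, "cust_name": "Jane Jones", "principal": 180000.0}

# Jane marries and changes her name -- but only the bank system is updated.
bank_account["cust_name"] = "Jane Jones-White"

# Now the two systems disagree about who customer 100 is.
assert bank_account["cust_name"] != mortgage_account["cust_name"]
```

Multiply this by dozens of systems and millions of customers, and the scale of the consistency problem becomes clear.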

So, then came the dawn of a new age. It was called Object-Oriented Design, or OOD. In OOD, the process was not king. What was king was the real world and a model of it. We would build our systems based on models of actual things, not processes.

So a customer was an object both in the real world and in our computer system. We created a table/record for the customer, and that record was dedicated only to that customer’s information, not to whatever the customer happened to be doing (writing a check, getting a mortgage, buying a car). We also created separate objects/records for cars, houses, and other things in the real world. Each would be kept separate and stored in one place. Jane Jones was only one person, and her basic information should be stored only once in our system, with tables linking her to other objects.

In database design terms, OOD was known as Normalization: different word, same basic concept. Instead of constantly repeating data all over the place, we use basic object tables connected by link tables (called ‘associative tables’ in data modeling jargon), as seen below.

(Please note: this is not exactly how the model would be done in the real world; it is intended to illustrate the concept.)

[Figure: customer and account tables connected by an associative link table]

Now looking at this, you might say, ‘Well, that account belongs to just that one customer, so why not have the customer number in the account table?’ That’s possible, but the assumption that only one person is related to an account is, if you think about it, not accurate. There could be multiple people on a mortgage account (co-owners of the house), or the bank account could be a joint account. Because we can have as many link table entries as we want, we can represent as many customers, and their relationships to an account, as we want.

Well, that’s a lot of background, isn’t it? But it’s important to understanding data virtualization and why the business data model is its heart.
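Here is a sketch of that normalized design in SQLite (table and column names are invented for illustration): one table per real-world object, plus an associative table that lets a joint account have two owners.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One table per real-world object: each fact stored once.
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account  (acct_id INTEGER PRIMARY KEY, acct_type TEXT);

    -- Associative (link) table: any number of customers per account.
    CREATE TABLE customer_account (
        cust_id INTEGER REFERENCES customer(cust_id),
        acct_id INTEGER REFERENCES account(acct_id)
    );

    INSERT INTO customer VALUES (1, 'Jane Jones-White'), (2, 'John Smith');
    INSERT INTO account  VALUES (10, 'joint checking');

    -- A joint account: two link rows pointing at one shared account row.
    INSERT INTO customer_account VALUES (1, 10), (2, 10);
""")

owners = [name for (name,) in db.execute("""
    SELECT c.name
    FROM customer c
    JOIN customer_account ca ON ca.cust_id = c.cust_id
    WHERE ca.acct_id = 10
    ORDER BY c.cust_id
""")]
print(owners)  # ['Jane Jones-White', 'John Smith']
```

Because Jane’s name lives in exactly one row, her marriage requires exactly one update, and every account she is linked to sees the new name automatically.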

In Part 2 of this blog, we’ll look at how the process-oriented design approach has reemerged in the Big Data culture, and how data virtualization helps restore OOD/Normalization order.