The Heart of Data Virtualization (Part 3)

Posted by Steve Cormier

10/13/15 11:14 PM

Part 3: Data Virtualization to the Rescue

In Part 2 of this article, we explored how creating a data warehouse meant buying hardware and software, managing it all, doing backups and version upgrades, and hiring more DBAs, and why something new was needed.

Enter the concept of data virtualization.

The idea was to create something akin to a ‘logical data warehouse’ that would do much of what a data warehouse does, including having a good object-oriented/normalized model, but that would pull the data from the original systems on demand.

Data virtualization lowers cost significantly in many areas, and because computer systems are so much faster than they used to be, pulling from the operational systems is very often possible without disrupting them. Where that isn’t the case, DV also uses sophisticated caching mechanisms (temporary data stores that hold often-used data) to lessen the strain on the operational systems.

DV doesn’t necessarily eliminate the need for a data warehouse, because a warehouse may still be necessary for certain complex data integrations from multiple systems. In many situations, though, the operational systems contain as much data as is needed for analysis, or the older data can be sent off to cheaper archival data stores with the same structure and knitted together with the current data via DV. And where there is a data warehouse, DV can integrate it with other sources, which can be a very powerful combination.
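
To make the caching idea a bit more concrete, here’s a tiny sketch in Python. It isn’t any vendor’s actual API; the function names and the five-minute expiry are invented for illustration. The point is simply that repeat queries get served from a short-lived cache instead of hitting the operational system every time:

```python
import time

_cache = {}               # query key -> (timestamp, rows)
CACHE_TTL_SECONDS = 300   # hypothetical five-minute freshness window

def fetch_from_source(query_key):
    """Placeholder for a query pushed down to the operational system."""
    # In a real DV product this would be a federated SQL call; here it's canned data.
    return [("Jane Jones", "Checking", 1200.00)]

def query(query_key):
    now = time.time()
    hit = _cache.get(query_key)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                        # serve often-used data from the cache
    rows = fetch_from_source(query_key)      # otherwise go to the source system
    _cache[query_key] = (now, rows)
    return rows

print(query("customer_balances"))  # first call hits the source
print(query("customer_balances"))  # second call comes from the cache
```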

Still, despite the great value of this new technology, without a good model, the whole thing turns to dung.

Why is that?

Well, just because you virtualize doesn’t mean you can’t virtualize garbage.

I’ll give an example.

I have Jane Jones in three different operational systems with differing representations. I can create a bad central model in the virtualization (semantic) layer that doesn’t integrate Jane into a single person representation. If I do this, all the same issues of maintenance and possible error still exist. Realize that this pattern is a definite possibility. People want to get data out of the original sources quickly, and they may not care too much how they do it. Jane may end up existing in different versions in the original systems, and also existing in different versions in several virtualized data stores drawn from the different operational systems.
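
To make that concrete, here’s a small, purely hypothetical sketch. The three source records and the matching rule (a shared tax ID) are invented for illustration; real entity resolution is far more involved. The ‘bad’ layer just republishes each source’s version of Jane, while the better layer maps them onto one canonical customer:

```python
# Three operational-system versions of the same person (invented data).
crm_row     = {"cust_name": "Jane Jones",  "tax_id": "123-45-6789", "phone": "555-0100"}
billing_row = {"customer":  "JONES, JANE", "tax_id": "123-45-6789", "balance": 250.0}
support_row = {"name":      "J. Jones",    "tax_id": "123-45-6789", "open_tickets": 2}

# Bad pattern: three separate virtual views, so three different "Janes".
bad_semantic_layer = {
    "crm_customers":     [crm_row],
    "billing_customers": [billing_row],
    "support_customers": [support_row],
}

# Better pattern: one canonical Customer entity keyed on the shared identifier,
# with source attributes mapped onto agreed business names.
def to_canonical(crm, billing, support):
    return {
        "customer_id":  crm["tax_id"],       # surviving key
        "name":         crm["cust_name"],    # agreed "golden" source for the name
        "phone":        crm["phone"],
        "balance":      billing["balance"],
        "open_tickets": support["open_tickets"],
    }

print(to_canonical(crm_row, billing_row, support_row))
```

The second pattern is where the real modeling work lives: deciding which source wins for each attribute and agreeing on the business names, before anyone builds a mart on top of it.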

One of the prime reasons to do data virtualization is to feed analytics systems. Bad data in these systems can lead to business decisions being based on wrong information, harming the organization. Imagine Bill from finance running his numbers on Jane from his virtualized data mart, Carol from sales running the same numbers on Jane from her different virtualized mart, and both going up to a higher executive who sees that the numbers don’t match. Not good. The users of the analytics systems usually aren’t computer specialists, so giving them very clean data is critical. They will depend on that data being accurate, and any discovery that it isn’t will seriously harm trust in the DV effort.

All of this is why the business model is the heart of data virtualization. Without proper modeling, data virtualization can become almost worthless—simply broadcasting bad or inconsistent data.

The model is the champion of the business rules, the wizard that pulls inconsistent data from process-oriented systems and transforms it into a clear, concise, singular representation. The model also contains the definitions of data elements for reference, so that when someone needs to know where something came from and what it means to the business, they can look to the careful, universal definitions kept in the DV system.

The Perfect, the Good, and Time to Value

Building a good central model can take some time, if the scope is large.

That’s why, at least at first, your scope should be small.

There’s a funny thing about good data modeling. If you do it right, it doesn’t matter if you start with a small scope (a small area of the organization being modeled). When you want to expand to more areas, you won’t have to do a lot of reengineering.

Remember our customer example? If you have your customer as a single object in your database, and you’re using link (associative) tables, then when you need to enter a new area (say auto loans), you just add a new link table from the customer to the auto loan account.
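
Here’s a minimal sketch of that idea using Python’s built-in SQLite module. The table and column names are invented for illustration; the point is that the new subject area arrives as new tables plus one new link table, and the existing customer model doesn’t change:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE auto_loan_account (
        account_id  INTEGER PRIMARY KEY,
        balance     REAL NOT NULL
    );
    -- Entering the new area (auto loans) = one new associative (link) table.
    CREATE TABLE customer_auto_loan (
        customer_id INTEGER REFERENCES customer(customer_id),
        account_id  INTEGER REFERENCES auto_loan_account(account_id),
        PRIMARY KEY (customer_id, account_id)
    );
""")
con.execute("INSERT INTO customer VALUES (1, 'Jane Jones')")
con.execute("INSERT INTO auto_loan_account VALUES (100, 18500.00)")
con.execute("INSERT INTO customer_auto_loan VALUES (1, 100)")

for row in con.execute("""
    SELECT c.name, a.balance
    FROM customer c
    JOIN customer_auto_loan l ON l.customer_id = c.customer_id
    JOIN auto_loan_account a  ON a.account_id  = l.account_id
"""):
    print(row)   # ('Jane Jones', 18500.0)
```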

The way real-world modeling works is very cool. The real world is already there, made up of all the objects and relationships. So, when you model, you’re just focusing on new objects and new relationships that are already established in the world.

It’s sort of like having a big map and shining a spotlight on different parts of it. Today I’ll model California; maybe tomorrow I’ll move my spotlight to Arizona. The relationship between California and Arizona is already set in reality, so moving my spotlight just reveals more territory. If I’ve modeled reality correctly in my smaller scope, the objects and relationships will be right when I move my focus/scope. Remember, though, that modeling reality accurately in the complex world of business relationships isn’t simple. It takes a very high degree of skill, and that skill is of an unusual type. A good data modeler must be both an advanced technician and someone with a thorough understanding of business relationships and accounting.

Now sometimes (frequently in fact) you have a really small budget and people who want very immediate results.

That’s difficult, because you know that if you don’t do the modeling right and integrate things properly, you will pay for it later.

So, what I tell clients in such cases is that they can get away with one, two, maybe three data stores without doing exact modeling. If the data sources are pretty simple, you can get something out there that quickly delivers great value without working out the real-world relationships perfectly.

However, BEWARE!

The quick win without proper modeling is seductive. You delivered value. Everyone’s thrilled. They want more. You feel like Elvis with thousands of screaming fans (okay, a dated reference; maybe Beyoncé). You want to please them, so you start building more data ‘marts’ (a mart is a small, dedicated data set for a particular subject area like Finance or Sales). But slowly you realize that you’re flinging data all over the place. It’s getting harder to manage, you’re duplicating entities like customer, data is coming out wrong, and…

You’re not a hero anymore. You’re the guy who looked good at first but then started delivering trash data that costs more and more.

This is the virtualizing garbage that I talked about above.

So, go ahead and give them the quick wins, but keep your eye on the future, and educate them about how important proper modeling is to long-term data health. Once you’ve established credibility with those first runs, you can probably get budget to build the proper business data model. When you’re doing the first data mart models, do everything you can to orient them toward what you might do in the proper model. Get your naming standards down and make sure you do good data element documentation. Above all, resist scope creep: pick a high-value, low-cost first target and account for the overhead of the data elements you’re going to serve. Remember that every data element is an iceberg; most of the cost of managing it is below the surface you initially see.

Conclusion

Building a successful data environment is a task that takes skill and judgment. You are building a machine with many interdependent parts. Data Virtualization is a powerful technology that can act as a central broker of information exchange throughout the enterprise.

But its power is dependent on how well it reflects the real-world business.

And how well it reflects the real-world business is entirely dependent on the business data model, the heart of data virtualization.

Topics: Analytics, Business Intelligence