Hadoop in the Enterprise Data Warehouse: Next Gen Analytics

Posted by Christian Franklin


2/6/12 4:57 PM

The Forrester Wave Report: Enterprise Hadoop Solutions was just published. The report notes that more players are jumping into the Hadoop pond and points out that it is still an immature market (a view that we share). The report is not without controversy (see "Hadoop Players Question Forrester's Take On Leaders"), but it's worth a browse - you can get a copy of it here

One item that is consistently underexplored in the big data and information analysis dialogue is the distinction between semi-structured and unstructured data, along with strategies for complementing the structured data stores that hold much of the world's transactional information. But I digress.

IBM's BigInsights scores well but gets a demerit for not yet offering an appliance version. The fact of the matter is that IBM has partners (like us) that are already working on filling that gap. But does this really matter, given Hadoop's ability to run on commodity servers?

Back to my digression - one of the appeals of Hadoop is its ability to store a broad range of data types, process analytic queries via MapReduce, and scale predictably as data volumes grow. These are very attractive characteristics for big data analytics. But the real value inflection point comes when you architect your big data platform to integrate with RDBMS-based enterprise data warehouse (EDW) solutions (e.g., the Netezza appliance, which delivers low-latency access to high volumes of data, handles data retrieval via SQL, and is optimized for price/performance across a diverse set of workloads).

Big Data

Below are some typical examples of using Hadoop:

Exploratory analysis – More and more frequently, the enterprise encounters new data sources that need to be analyzed. A common example: your marketing department launched a new multi-channel campaign and wants to integrate user responses from Facebook and Twitter with other sources of data they may have. If the Facebook and Twitter APIs are new to you, or you are not familiar with their data feed structures, it might take some experimentation to figure out what to extract from them and how to integrate it with other sources. Hadoop’s ability to process data feeds for which a schema has not yet been defined makes it an excellent tool for this purpose. So if you want to explore relationships within data, especially in an environment where the schema is constantly evolving, Hadoop provides a mechanism for exploring the data until a formal, repeatable ETL process is defined.
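As a rough illustration of exploring a feed before any schema exists, the sketch below profiles raw records to see which fields show up and what types they carry. It is a single-machine stand-in for the idea, not Hadoop code, and the sample records and field names are invented for illustration rather than taken from the real Facebook or Twitter APIs.

```python
import json
from collections import Counter, defaultdict

# Hypothetical raw feed lines; field names are invented, not real API output.
RAW_LINES = [
    '{"user": "a1", "text": "love the new campaign", "likes": 3}',
    '{"user": "b2", "text": "not for me", "channel": "twitter"}',
    '{"user": "c3", "likes": "7", "channel": "facebook"}',
    'not valid json',
]


def profile_fields(lines):
    """Count how often each field appears and which value types it carries."""
    field_counts = Counter()
    field_types = defaultdict(set)
    skipped = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # keep going: raw feeds often contain malformed lines
            continue
        if not isinstance(record, dict):
            skipped += 1
            continue
        for key, value in record.items():
            field_counts[key] += 1
            field_types[key].add(type(value).__name__)
    return field_counts, dict(field_types), skipped


counts, types, skipped = profile_fields(RAW_LINES)
for name in sorted(counts):
    print(name, counts[name], sorted(types[name]))
print("skipped lines:", skipped)
```

A profile like this shows which fields are reliably present and which arrive with inconsistent types, which is the kind of finding that feeds into the eventual ETL definition.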

Queryable archive – Big data analytics tends to bring large volumes of data into scope, and a significant percentage of this data isn't accessed on a regular basis. It may represent historical information, or granular information that has subsequently been summarized within the EDW. Putting all that data into an infrastructure optimized for price/performance probably isn't economically prudent. It is more sensible to store the less frequently accessed information on infrastructure that is optimized for price per terabyte of storage and move it to the high-performance infrastructure on demand. More specifically, Hadoop’s fault-tolerant distributed storage system running on commodity hardware could serve as a repository for the information. Unlike tape-based storage systems, which have no computational capability, Hadoop provides a mechanism to access and analyze the data where it is stored. Since moving computation is cheaper than moving data, Hadoop’s architecture is better suited to serve as a queryable archive for big data.
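The placement decision described above can be sketched as a simple rule based on how recently a dataset was accessed. This is only a minimal sketch: the 90-day cutoff, the dataset names, and the access dates are all assumptions made up for the example, and a real policy would likely weigh query frequency and cost as well.

```python
from datetime import date

# Assumed cutoff: data untouched for longer than this moves to the archive tier.
HOT_WINDOW_DAYS = 90


def assign_tier(last_accessed, today, hot_window_days=HOT_WINDOW_DAYS):
    """Return 'warehouse' for recently used data, 'archive' otherwise."""
    age_in_days = (today - last_accessed).days
    return "warehouse" if age_in_days <= hot_window_days else "archive"


# Hypothetical datasets and last-access dates, invented for illustration.
today = date(2012, 2, 6)
last_access = {
    "orders_2012_01": date(2012, 2, 1),
    "orders_2009": date(2010, 5, 17),
    "clickstream_raw_2011": date(2011, 8, 15),
}

placement = {name: assign_tier(when, today) for name, when in last_access.items()}
for name in sorted(placement):
    print(name, "->", placement[name])
```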

Unstructured data analysis – Recent studies have shown that many enterprises believe the amount of unstructured data they will have to analyze is growing very fast and in some situations could soon outpace the amount of structured data. Common examples of unstructured data analysis are gleaning user sentiment from the company’s Twitter feed or extracting insights embedded in customers' phone conversations with support personnel. RDBMS-based data warehouses provide limited capabilities for storing the complex data types that unstructured data represents. Also, performing computations on unstructured data via SQL can be quite cumbersome. Hadoop’s ability to store data in any format and analyze it using a procedural programming paradigm, such as MapReduce, makes it well suited for storing, managing and processing unstructured data.

In an EDW context, one can, and should, consider using Hadoop to pre-process unstructured data, extract key features and metadata, and load them into an RDBMS data warehouse for further analysis.
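To make that pre-process-then-load pattern concrete, here is a minimal sketch in plain Python. It imitates the map and reduce steps in memory rather than running on Hadoop, and it uses an in-memory SQLite database as a stand-in for the warehouse. The word lists, messages, and the customer_sentiment table are all hypothetical, and the keyword scoring is a deliberately naive placeholder for real sentiment analysis.

```python
import re
import sqlite3
from collections import defaultdict

# Hypothetical word lists and messages, invented purely for illustration.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"slow", "broken", "hate"}
MESSAGES = [
    ("cust-1", "Love the new release, great work"),
    ("cust-2", "Checkout is slow and the app is broken"),
    ("cust-1", "Support was fast"),
]


def map_message(customer_id, text):
    """Map step: emit one (customer_id, score) pair per sentiment word."""
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in POSITIVE:
            yield customer_id, 1
        elif word in NEGATIVE:
            yield customer_id, -1


def reduce_scores(pairs):
    """Reduce step: sum the scores emitted for each customer key."""
    totals = defaultdict(int)
    for key, score in pairs:
        totals[key] += score
    return dict(totals)


def load_features(conn, totals):
    """Load the extracted feature into a relational table (warehouse stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_sentiment "
        "(customer_id TEXT PRIMARY KEY, score INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customer_sentiment VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()


pairs = [pair for cid, text in MESSAGES for pair in map_message(cid, text)]
totals = reduce_scores(pairs)

conn = sqlite3.connect(":memory:")
load_features(conn, totals)
rows = conn.execute(
    "SELECT customer_id, score FROM customer_sentiment ORDER BY customer_id"
).fetchall()
print(rows)
```

Once the feature sits in a relational table, it can be joined to existing customer and transaction data with ordinary SQL, which is where the warehouse is strongest.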

Topics: Analytics, Big Data