all bits considered data to information to knowledge


Just-in-time Data Warehousing

In his article "Is an Enterprise Data Warehouse Still Required for Business Intelligence?", Colin White of BI-Research weighs the arguments against the Enterprise Data Warehouse (EDW) as the foundation for Business Intelligence (BI).

He's got a point. As BI becomes real-time (e.g. quant trading algorithms), there simply is not enough time to persist the data in the data warehouse and then extract it for analysis. At the same time, when operating with just the current data, there might not be enough information to base decisions on. The answer, of course, is that the two systems are complementary. The "traditional" data warehouse stores the wealth of information, and operational BI uses it to match transitory patterns in the OLTP records against those mined from the EDW. In a sense, this is very similar to how human intuition works: we are sometimes forced to make split-second decisions based on whatever insufficient data is available, drawing upon the experience accumulated up to that moment...
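To make the "match transitory patterns against those mined from the EDW" idea concrete, here is a minimal sketch in Python. It assumes the warehouse has already been mined into a simple per-key baseline (mean and standard deviation), and that a live record is "unusual" when it sits more than a few standard deviations from that baseline. The function names, the `threshold` parameter, and the sample data are all hypothetical; real operational BI would use far richer patterns.

```python
from statistics import mean, stdev

def build_baselines(history):
    """Summarize warehouse history as (mean, standard deviation) per key.

    `history` is a hypothetical stand-in for data mined from the EDW:
    a dict mapping a key (e.g. a ticker) to a list of past values.
    """
    return {key: (mean(values), stdev(values)) for key, values in history.items()}

def is_unusual(baselines, key, value, threshold=3.0):
    """Flag a live record that deviates strongly from its historical baseline."""
    if key not in baselines:
        return True  # no accumulated experience for this key
    mu, sigma = baselines[key]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Built once (or periodically) from the warehouse ...
baselines = build_baselines({"ACME": [100.0, 101.0, 99.0, 100.5, 99.5]})

# ... then consulted for every incoming OLTP record, without touching the EDW.
print(is_unusual(baselines, "ACME", 100.2))
print(is_unusual(baselines, "ACME", 120.0))
```

The point of the split is that the expensive step (mining the history) happens off the critical path, while the per-record check is a cheap in-memory lookup.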

The EDW is not going away anytime soon; the justifications for its existence remain as valid today as they were back in the days of yore. Colin White lists five key reasons:

  1. the data was not usually in a suitable form for reporting,
  2. the data often had quality issues,
  3. decision support processing degraded business transaction performance,
  4. data was often dispersed across many different systems,
  5. there was a general lack of historical information.

None of these went away; if anything, the problems only got worse as we move towards ever more distributed, voluminous, and heterogeneously formatted data. What did change are the speed and processing power, which enabled task parallelization and distributed computing at acceptable rates of performance, and the advances in software engineering that made the design, construction, and maintenance of complex software-intensive systems manageable.

A system such as IBM Watson would be impossible without all-in-memory data storage and a clever parallelization strategy that splits tasks across ~2,000 CPUs. This just might be a precursor of a just-in-time Enterprise Data Warehouse, where all the processes we currently perform sequentially would be done in parallel, with sufficient speed to make cleansing, accumulation, and analysis nearly simultaneous.
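As a toy illustration of cleansing, accumulation, and analysis overlapping rather than running as separate batch steps, the sketch below cleanses records on a pool of worker threads and updates the running totals and the current "answer" as each cleansed record arrives. Everything here is an assumption made for illustration: the record layout (`region`, `amount`), the function names, and the choice of a thread pool. It says nothing about how Watson or any real warehouse is built, and Python threads will not speed up CPU-bound cleansing; a real system would spread this work across processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

def cleanse(raw):
    """Normalize one raw record; return None if it cannot be repaired."""
    try:
        return {
            "region": raw["region"].strip().upper(),
            "amount": float(raw["amount"]),
        }
    except (KeyError, ValueError, TypeError, AttributeError):
        return None

def run_pipeline(raw_records, workers=4):
    """Cleanse in parallel; accumulate and analyze as results stream back."""
    totals = {}     # accumulation: running total per region
    largest = None  # analysis: region with the highest total so far
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields cleansed records in input order as they complete.
        for record in pool.map(cleanse, raw_records):
            if record is None:
                continue  # drop records that failed cleansing
            region = record["region"]
            totals[region] = totals.get(region, 0.0) + record["amount"]
            largest = max(totals, key=totals.get)
    return totals, largest

records = [
    {"region": " east ", "amount": "10.5"},
    {"region": "West", "amount": "3"},
    {"region": "east", "amount": 2},
    {"region": "west", "amount": "n/a"},  # bad amount, dropped
    {"amount": "1"},                      # missing region, dropped
]
totals, largest = run_pipeline(records)
print(totals, largest)
```

The analysis result is available after every record rather than only at the end of a load cycle, which is the "just-in-time" property in miniature.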