All Bits Considered: data to information to knowledge

20Jul/11

Bringing a spreadsheet to a predictive modeling fight

The data is out there. Its tidal waves are sloshing around the world 24x7, only to be deposited in the depths of climate-controlled server rooms. It keeps accumulating and never goes away. These rich deposits of data are available for mining to those with the right tools and knowledge of the terrain.

But time is running out for the individual gold prospectors: their tiny pans (spreadsheets) are no match for dredges (BI suites) sifting through tons of data day and night. The focus of data mining is shifting from craftsmanship to industrial-scale operations; data mining is becoming a strategic advantage, a lever in negotiating deals and market positioning.

And then there is predictive modeling. Long the domain of actuarial science and financial wizards (the quants), it is making its way into businesses at every level. If you are a small business owner negotiating a health plan for your employees, then unless you are using some predictive modeling you are doing it on guesswork and hunches. Your health insurer, however, does not have to guess - it knows. It has models that comb mounds of data for patterns, analyzing them across hundreds of dimensions to come up with a pricing model for the type of business you operate, your geographical location, and the demographics of your employees. In short, they know your business, and they call all the shots in the negotiations.
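To make the idea concrete, here is a minimal sketch of the kind of group-pricing model an insurer might run: predict expected claims cost per employee from a handful of group-level dimensions. The feature names and figures are invented for illustration only, not taken from any actual insurer.

```python
# Hypothetical sketch: score a prospective customer group against a model
# fitted on historical groups. All data and column names are made up.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Historical data mined from past groups: one row per insured group.
history = pd.DataFrame({
    "industry":       ["retail", "software", "construction", "software"],
    "region":         ["NE", "NE", "SW", "SW"],
    "avg_age":        [29.0, 34.5, 41.2, 31.0],
    "claims_per_emp": [1800.0, 2100.0, 3400.0, 1950.0],   # target variable
})

features = ["industry", "region", "avg_age"]
prep = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["industry", "region"])],
    remainder="passthrough",          # pass the numeric avg_age through unchanged
)
model = make_pipeline(prep, LinearRegression())
model.fit(history[features], history["claims_per_emp"])

# The "they already know your business" step: price a prospective customer.
prospect = pd.DataFrame([{"industry": "retail", "region": "NE", "avg_age": 27.5}])
print("expected claims per employee:", model.predict(prospect)[0])
```

A real model would of course use far more rows and dimensions, but the asymmetry is the same: one side of the table has a fitted model, the other has a hunch.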

If you are still using spreadsheets to analyze your business and negotiate deals with partners, it is time for an upgrade: you are fighting predictive modeling cannons with a pen-knife.

Here’s how Pitney Bowes felt in 2001:

<Pitney Bowes> was renegotiating a contract with one of Pitney Bowes’ HMO vendors. After seven years of negotiating favorable rates based on actuarial data, which showed that Pitney Bowes employees tended to be younger (and thus healthier) than average, <they were> stymied. The HMO negotiator had data showing that even though Pitney Bowes employees were younger, they were sicker. And he was using that data to justify a rate increase.

"… every time I said something this guy had an answer. He must be doing something we’re not,’" <HR director>  recalls. That something turned out to be predictive modeling.”

16Mar/11

Just-in-time Data Warehousing

In his article Is an Enterprise Data Warehouse Still Required for Business Intelligence?, Colin White of BI-Research ponders the arguments against the Enterprise Data Warehouse (EDW) as the foundation for Business Intelligence (BI).

He's got a point. As BI becomes real-time (e.g., quant trading algorithms), there simply is not enough time to persist the data in the data warehouse and then extract it for analysis. At the same time, operating on current data alone may not provide enough information to base decisions on. The answer, of course, is that the two systems are complementary: the "traditional" data warehouse stores the wealth of accumulated information, and operational BI matches the transitory patterns in OLTP records against those mined from the EDW. In a sense, this is very similar to how human intuition works, as we are sometimes forced to make split-second decisions on woefully insufficient data, drawing upon the experience accumulated up to that moment...
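A rough sketch of that complementary arrangement, under assumed names (edw_baselines, score_transaction) rather than any particular vendor's architecture: baselines mined from the EDW are cached in memory, and operational BI scores each incoming OLTP record against them on the spot.

```python
# Sketch only: historical patterns from the warehouse inform split-second
# decisions on individual operational records.
from dataclasses import dataclass

@dataclass
class Baseline:
    mean_amount: float   # historical average, mined from the EDW
    std_amount: float    # historical standard deviation

# In a real system this would be a periodic bulk extract from the warehouse.
edw_baselines = {
    "cust-1001": Baseline(mean_amount=120.0, std_amount=35.0),
    "cust-1002": Baseline(mean_amount=2400.0, std_amount=600.0),
}

def score_transaction(customer_id: str, amount: float) -> str:
    """Match a transitory OLTP record against the pattern mined from history."""
    base = edw_baselines.get(customer_id)
    if base is None:
        return "no history - route to batch analysis"
    z = (amount - base.mean_amount) / base.std_amount
    return "flag for review" if abs(z) > 3 else "ok"

# Decide now, on current data, informed by accumulated experience:
print(score_transaction("cust-1001", 900.0))   # flag for review
print(score_transaction("cust-1002", 2500.0))  # ok
```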

The EDW is not going away anytime soon; the justifications for its existence remain as valid today as they were back in the days of yore. Colin White lists five key reasons:

  1. the data was not usually in a suitable form for reporting;
  2. the data often had quality issues;
  3. decision support processing degraded business transaction performance;
  4. the data was often dispersed across many different systems;
  5. there was a general lack of historical information.

None of these went away; if anything, the problems have only gotten worse as we move towards ever more distributed, voluminous, and heterogeneously formatted data. What did change are the speed and processing power that make task parallelization and distributed computing feasible at acceptable levels of performance, and the advances in software engineering that make the design, construction, and maintenance of complex software-intensive systems manageable.

A system such as IBM Watson would be impossible without all-in-memory data storage and a clever parallelization strategy splitting tasks across ~2,000 CPUs. This just might be a precursor of a just-in-time Enterprise Data Warehouse, where the processes we currently perform sequentially would run in parallel, fast enough to make cleansing, accumulation, and analysis nearly simultaneous.
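As a toy illustration of that "just-in-time" idea (not Watson's actual design), the sketch below cleanses and analyzes partitions of incoming data in parallel worker processes instead of running the steps one after another. The record layout and the cleanse/analyze steps are invented for the example.

```python
# Sketch: partition the raw data, cleanse and analyze each partition in
# parallel, then combine the partial results.
from multiprocessing import Pool

def cleanse(record: dict) -> dict:
    # Trivial stand-in for the data-quality rules normally applied during ETL.
    return {"id": record["id"], "amount": max(0.0, float(record.get("amount", 0)))}

def analyze(partition: list) -> float:
    # Stand-in for an analytical aggregate computed over one partition.
    cleansed = [cleanse(r) for r in partition]
    return sum(r["amount"] for r in cleansed)

if __name__ == "__main__":
    raw = [{"id": i, "amount": i * 1.5} for i in range(1_000)]
    partitions = [raw[i::8] for i in range(8)]          # split the work 8 ways
    with Pool(processes=8) as pool:
        partial_sums = pool.map(analyze, partitions)    # cleanse + analyze in parallel
    print("total:", sum(partial_sums))                  # combine the partial results
```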