all bits considered data to information to knowledge


Data Virtualization vs. Data Federation

Data Virtualization takes the idea of Data Federation one step further - both abstract data sources away from the users, but Data Virtualization adds logic in an effort to present a coherent data structure to the clients. This ties directly into the Master Data Management domain: no longer will you access columns and rows; you will access logical data entities which, behind the scenes, could be composite constructs. For example, the attributes of a virtualized "Customer" might come from a dozen data sources defined as part of a "golden record".
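To make the idea concrete, here is a minimal sketch of such a composite entity. The source systems, identifiers, and field names are entirely hypothetical; a real virtualization layer would query live systems, not in-memory dictionaries.

```python
# Hypothetical source systems, each holding a slice of the Customer entity.
CRM = {101: {"name": "Alice Smith", "email": "alice@example.com"}}
BILLING = {101: {"credit_limit": 5000}}
SUPPORT = {101: {"open_tickets": 2}}

def get_customer(customer_id):
    """Assemble the logical 'golden record' view from the underlying sources.

    The client sees one coherent entity; the stitching happens behind the scenes.
    """
    record = {"customer_id": customer_id}
    for source in (CRM, BILLING, SUPPORT):
        record.update(source.get(customer_id, {}))
    return record

print(get_customer(101))
```

The client code never learns which physical system owns which attribute - which is precisely why the MDM "golden record" definition has to come first.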

In my opinion, this implies that any Data Virtualization effort must rest on a firm foundation of Master Data Management, while Data Federation can treat this step as optional.


How clean do you want your data?

Yet another presentation by an RDBMS vendor leaves me scratching my head... Master Data Management is the buzzword du jour, and everybody has "just what you need"; naturally, I am sitting through all the presentations trying not to choke on chaff.

"Clean, consistent, and accurate data" - how many semantic overlaps do we have here? Does "clean" imply "consistent"? Or maybe just "accurate"? And how do you measure cleanliness? Degrees of consistency?

The Wikipedia article on "Data Cleansing" makes a distinction between "data cleaning" and "data validation" (as it should) but for all the wrong reasons - "validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data." This would be news to the thousands of customers of SAS and Informatica struggling to validate (and re-validate) legacy systems chock-full of dirty data. Rather, the distinction should be made on context - the Wikipedia article surreptitiously switches the context from databases to software development; no matter how much validation the gatekeepers might apply, the sad fact is that once inside, the data can (and often does) become invalid.
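Batch re-validation of data already inside the system might look like the sketch below - the rows, field names, and rules are illustrative only. The point is that validation here reports violations for remediation rather than rejecting records at entry.

```python
import re

# Hypothetical legacy rows already inside the system; entry-time gatekeeping
# either never saw these values or never caught them.
rows = [
    {"id": 1, "email": "bob@example.com", "zip": "02139"},
    {"id": 2, "email": "not-an-email", "zip": "ABCDE"},
]

# Illustrative validation rules, expressed as regular expressions.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "zip": re.compile(r"^\d{5}$"),
}

def revalidate(batch):
    """Batch validation: flag each (row id, field) violation for cleanup."""
    findings = []
    for row in batch:
        for field, pattern in RULES.items():
            if not pattern.match(str(row.get(field, ""))):
                findings.append((row["id"], field))
    return findings

print(revalidate(rows))  # [(2, 'email'), (2, 'zip')]
```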

Unless there are measurable indicators quantifying these attributes of the data, the lofty goals of clean, accurate, and consistent data are just wishful thinking. A project manager on a Data Quality project must get all the stakeholders on the same page regarding what exactly each of them means by "clean data", and establish measures to gauge progress along with acceptance criteria.
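One way to turn "clean" into something measurable is to compute simple indicators per batch - completeness and rule-based consistency, say. The metric names, sample rows, and the business rule below are illustrative assumptions, not a standard.

```python
# Hypothetical sample rows; the third one violates a cross-field rule.
rows = [
    {"name": "Alice", "country": "US", "state": "MA"},
    {"name": None,    "country": "US", "state": "MA"},
    {"name": "Carol", "country": "FR", "state": "MA"},
]

def completeness(batch, field):
    """Share of rows where the given field is populated."""
    return sum(1 for r in batch if r.get(field)) / len(batch)

def consistency(batch, rule):
    """Share of rows satisfying a cross-field business rule."""
    return sum(1 for r in batch if rule(r)) / len(batch)

# Illustrative rule: a US state code only makes sense when country == "US".
state_rule = lambda r: r["country"] == "US" or not r.get("state")

print(round(completeness(rows, "name"), 2))     # 0.67
print(round(consistency(rows, state_rule), 2))  # 0.67
```

Once such numbers exist, "clean data" stops being a slogan: stakeholders can agree on thresholds, and acceptance criteria become a matter of measurement rather than opinion.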


The Single Version of Truth by Fiat

The single version of truth is an elusive goal of many corporate data initiatives. Getting people from all over the enterprise to agree on a single definition of anything - be it the attributes of a Customer or of a Product - can be a daunting task.

Jeanne Ross, Director of MIT's Center for Information Systems Research, has a simple solution, which she presented last month at the TechTomorrow conference - just declare it.

The Single Version of Truth by fiat can be a proverbial line in the sand or, better yet, Archimedes' fulcrum… but it certainly ends paralysis. Once the truth is declared, the project can move forward with a "good enough" set of data; instead of debating the merits of this or that data set and bemoaning its imperfections, the focus shifts toward "How can we make it better?"

The ultimate goal of the single version of truth is a business one - not an IT one - and should be treated as such. The "good enough" criteria should be set by the business, and any further cleansing and conformance processes should also be viewed in terms of business value.


MDM market consolidation

Informatica acquires Siperian. I had my bets on SAP (it was not a good fit for Microsoft or IBM) snapping it up or, maybe, Oracle. I wonder what the driving force was: the software's capabilities or its client base (with all this rush for EHR, HIE, HIMSS and the rest of the healthcare alphabet soup 🙂)?