all bits considered data to information to knowledge


How clean do you want your data?

Yet another presentation by an RDBMS vendor leaves me scratching my head... Master Data Management is the buzz word de  jour , and everybody has "just what you need"; naturally, I am sitting through all the presentations trying not to choke on chaff.

"Clean, consistent, and accurate data" - how many semantic overlaps do we have here? Does "clean" implies "consistent"? or maybe just "accurate"? If so, how do you measure cleanliness? degrees of consistency?

Wikipedia article on "Data Cleansing" makes a distinction between "data cleaning" and "data validation" (which it should) but for all the wrong reasons - "validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data." This would be news for the thousands of customers of SAS and Informatica struggling to validate (and re-validate) their legacy systems choke full of dirty data. Rather the distinction should be made on context - the Wikipedia article surreptitiously switches the context from database to software development;  no mater how much validation the gatekeepers might apply the sad fact is that once inside the data can (and often does) become invalid.

Unless there are measurable indicators quantifying these attributes of the data, the lofty goals of clean, accurate and consistent data are just a wishful thinking. A project manager on a Data Quality project must get all the stakeholders on the same page regarding what exactly each means by "clean data", and establish measures to gauge the progress and acceptance criteria.