all bits considered data to information to knowledge


How clean do you want your data?

Yet another presentation by an RDBMS vendor leaves me scratching my head... Master Data Management is the buzzword du jour, and everybody has "just what you need"; naturally, I am sitting through all the presentations trying not to choke on chaff.

"Clean, consistent, and accurate data" - how many semantic overlaps do we have here? Does "clean" imply "consistent"? Or maybe just "accurate"? If so, how do you measure cleanliness? Degrees of consistency?

The Wikipedia article on "Data Cleansing" makes a distinction between "data cleaning" and "data validation" (which it should), but for all the wrong reasons - "validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data." This would be news to the thousands of customers of SAS and Informatica struggling to validate (and re-validate) their legacy systems chock-full of dirty data. Rather, the distinction should be made on context - the Wikipedia article surreptitiously switches the context from database to software development; no matter how much validation the gatekeepers might apply, the sad fact is that, once inside, the data can (and often does) become invalid.

Unless there are measurable indicators quantifying these attributes of the data, the lofty goals of clean, accurate and consistent data are just wishful thinking. A project manager on a Data Quality project must get all the stakeholders on the same page regarding what exactly each of them means by "clean data", and establish measures to gauge the progress and acceptance criteria.
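What such measurable indicators might look like can be sketched in a few lines. This is only an illustration, assuming a hypothetical customer table and an agreed-upon email format rule - the fields, sample records, and the regular expression are all invented for the example, not taken from any real project:

```python
# A minimal sketch of turning "clean data" into numbers the stakeholders
# can agree on: completeness (is the field populated?) and validity
# (does a populated value match the agreed-upon format?).
import re

# Hypothetical sample records - stand-ins for a real customer table.
rows = [
    {"id": 1, "email": "a@example.com", "zip": "97210"},
    {"id": 2, "email": None,            "zip": "9721"},
    {"id": 3, "email": "not-an-email",  "zip": "02134"},
]

def completeness(rows, field):
    """Fraction of rows where the field is present and non-empty."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

# A deliberately naive email pattern - the point is that the rule is
# explicit and agreed upon, not that it is perfect.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validity(rows, field, pattern):
    """Fraction of populated values matching the agreed-upon pattern."""
    populated = [r[field] for r in rows if r.get(field)]
    return sum(1 for v in populated if pattern.match(v)) / len(populated)

print(f"email completeness: {completeness(rows, 'email'):.0%}")  # 67%
print(f"email validity:     {validity(rows, 'email', EMAIL_RE):.0%}")  # 50%
```

Once metrics like these are written down, "clean data" stops being a slogan: progress can be tracked week over week, and acceptance criteria become thresholds rather than opinions.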


Data Quality Serenity Prayer

Only with a static data source could one be sure that an established data cleansing process would yield completely repeatable results; anything more fluid than that - and all bets are off. In the real world, no data source is an island, and as data keep changing, be it by design or by accident, so must the data cleansing processes for those data.

Even in the most constrained data entry systems, users' ingenuity will bring an occasional surprise unforeseen by the system's architects. As a software system evolves, even the best-laid data quality assurance plans fail; the solution is, of course, to evolve the DQA plans with the system.

The usual advice to go for "good enough" quality needs to come with a number of qualifications, and should be applied on a case-by-case basis with a clear understanding of the ramifications - formalized and written down. At the very least, any exception to the DQA processes must be detected and analyzed.

To paraphrase the well-known Serenity Prayer 🙂

God, grant me the serenity to accept the data quality issues I cannot change,
courage to deal with the data problems I can,
and wisdom to know the difference.