all bits considered data to information to knowledge


Data Quality Serenity Prayer

Only with a static data source could one be sure that, once a data cleansing process is established, it would yield completely repeatable results; anything more fluid than that - and all bets are off. In the real world, no data source is an island, and as data keep changing, be it by design or by accident, so must the data cleansing processes for these data.

Even in the most constrained data entry systems, users' ingenuity will bring an occasional surprise unforeseen by the system's architects. As a software system evolves, even the best laid data quality assurance plans fail; the solution is, of course, to evolve the DQA plans with the system.

The usual advice to go for "good enough" quality needs to come with a number of qualifications, and should be applied on a case-by-case basis with a clear understanding of the ramifications - formalized and written down. At the very least, any exception to the DQA processes must be detected and analyzed.

To paraphrase the well-known Serenity Prayer 🙂

God, grant me the serenity to accept the data quality issues I cannot change,
courage to deal with the data problems I can,
and wisdom to know the difference


Raw data now!

You knew it all along - processed foods are bad for you. Could processed data be bad for you as well? I think so.

In the process of preparing data for a specific use you might be inadvertently destroying it for any other future use: aggregated data loses its granularity, normalized data is forced into specific relationships, and so on. The solution: leave the data as it is. Raw. Unmangled. Unadulterated. And beef up just-in-time processing capacities.
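The point about aggregation destroying granularity can be sketched in a few lines of Python (the events, names and counts here are made up for illustration):

```python
from collections import defaultdict

# Raw, unaggregated events: every future question can still be answered.
raw_events = [
    ("2011-05-11", "page_view", "alice"),
    ("2011-05-11", "page_view", "bob"),
    ("2011-05-12", "purchase", "alice"),
]

# Aggregating for one specific report...
daily_counts = defaultdict(int)
for day, event, user in raw_events:
    daily_counts[day] += 1

# ...keeps the totals but destroys granularity: from daily_counts alone
# we can no longer tell page views from purchases, or who did what.
print(dict(daily_counts))  # {'2011-05-11': 2, '2011-05-12': 1}
```

The raw events can always be re-aggregated a different way tomorrow; the daily counts cannot be un-aggregated.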

This aligns with the renaissance of key/value pairs as a preferred data storage format (the so-called NoSQL databases), an idea that has been around since the 1940s, at the very least; and with unstructured data, which grows at an exponential rate and is increasingly consumed at the source, without ETL processing.

Sure, convenience is lost with raw data - you're on your own to make sense of it. For some people that might be exactly what they were craving; and for those on a fast food diet there will be enough companies and individuals to package and sell the data, and its derivatives. But there will always be more where it came from!

Here's a video call for access to raw data by Tim Berners-Lee, the inventor of the World Wide Web (and thus, arguably, the real "Internet inventor").



Data Math

The motto of this site is: "Data to Information to Knowledge". I'd like to think that it was I who came up with it, though I did not dig around to prove it 🙂

Regardless of the origin, it does capture an important relationship:

Data + Context = Information
Information + Analysis = Knowledge
How does context transform data into information? Consider the following example.

Think of today's date in history - May 11, 2011.

Written as 5112011, it is just a number. Perhaps the annual compensation of a company's CEO? The number of atoms in 848869.25212257286350480987421015 E-17 moles of a substance? If we interpret it as, say, a date, then it might be the anniversary of the "day in 1934, (when) a massive storm sends millions of tons of topsoil flying from across the parched Great Plains region of the United States as far east as New York, Boston and Atlanta," as one "this day in history" chronicle puts it.

The context gives the data - a number, in my case - its meaning, and transforms it into Information.
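The same raw value yields different information under different contexts; here is a minimal sketch in Python (the interpretations are, of course, just the ones from the example above):

```python
from datetime import datetime

raw = "5112011"  # the raw datum: just a string of digits

# Context 1: a plain number (a CEO's annual compensation, perhaps)
as_number = int(raw)

# Context 2: a US-style date, month/day/year
as_date = datetime.strptime(raw, "%m%d%Y").date()

print(as_number)            # 5112011
print(as_date.isoformat())  # 2011-05-11
```

Nothing about the bytes changed between the two interpretations - only the context did.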

The analysis of the information takes it a step further - to Knowledge. If I am dealing with the CEO's compensation, I could use it in charting out my investment strategy: is it time to go long on the company? To short the stock?

If I am a chemist, I might think of working with more manageable volumes of a particular compound; and if I am a farmer I might ponder the consequences of sustainable agriculture. This is knowledge, a human - possibly, uniquely human - trait.

While carrying vaguely theoretical overtones, these formulas have very practical implications for data architecture and systems engineering, providing a foundation for designing and constructing Enterprise Data Models, Enterprise Dashboards and ETL architectures.

And, last but not least, keep this quote - by Albert Einstein - in mind:

Imagination is more important than knowledge


Data visualization at its finest: to Neptune and back

According to Google's former CEO Eric Schmidt, every two days humans create as much data as they did from the first written record up to the year 2003 (a cutoff as good as any); this was calculated to be about 5 exabytes. Every two days. It turns out that this mind-boggling number (boggling to the human mind, that is!) barely scratches the surface by machine standards.

The amount of business-related information processed by the world's computer servers in 2008 would fill enough books to create 20 stacks that would reach to Neptune and back, UC San Diego says in a study released on Wednesday. The unprecedented estimate says that 27 million servers distributed 9,570,000,000,000,000,000,000 bytes of data, or 9.57 zettabytes.
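A quick back-of-the-envelope comparison of the two figures (the unit definitions are standard SI; the comparison itself is mine, and the two estimates measure somewhat different things):

```python
EB = 10**18  # exabyte, in bytes
ZB = 10**21  # zettabyte, in bytes

# The Schmidt estimate: 5 exabytes every two days, annualized.
human_rate_per_day = 5 * EB / 2
human_year = human_rate_per_day * 365   # ≈ 0.91 ZB over a year

# The UCSD estimate for the world's servers in 2008.
servers_2008 = 9_570_000_000_000_000_000_000  # 9.57 ZB

print(servers_2008 / human_year)  # roughly 10x
```

So even by this crude reckoning, the servers pushed around an order of magnitude more data in a year than the annualized human-created output.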

Most of this data is neither created nor aimed at humans. As Roger Bohn, a UCSD technology management professor who co-authored the study, puts it in his blog: "This is subterranean information. We see almost none of this. These are computers talking to each other at a level that is invisible even to people inside a corporation."

This is data produced by machines, consumed by machines, and - conceivably - transformed by machines into information (e.g. cross-indexing and context-sensitive search). The final frontier will be transforming information into knowledge, a domain that so far belongs exclusively to humans.


Entropy of content

The information out there becomes ever more fragmented, and ever less coherent. Arguably, this could be a sign of “do-it-yourself” democratization of the information itself - no longer will the high priests of information be shaping the data to feed the unwashed masses (it might not be an accident that no new “sacred texts” have come into existence since the beginning of the last century…). Rather, the masses themselves are now free to mix’n’match bits and pieces of data in any way they please. The positive aspect of this democratization is that access to information has improved dramatically; the flip side of reducing information to its basic “elementary particles” is that the quality - the coherence of the data - went down just as dramatically.

Consider the following analogy: all the music out there can be represented with the same seven notes, and their arrangement can produce a Mozart symphony, a Jingle Bells tune or - depending on the composer’s abilities - just noise. The Internet brought about a new twist: now you don’t have to buy an album by your favorite band, you can buy a single tune; you can buy a book by the chapter, or choose a cutout of a famous work of art for a poster. The rules of the packaged deal, where the author alone decided the structure of his or her masterpiece - the sequence of chapters or songs, the arrangement of elements, their colors and shades - no longer apply, the implicit “not labeled for individual sale” tag removed forever.

With so much data sloshing around the world 24/7, the question is: why do people still pay for information, buying books and paintings? The answer is the same as it was millennia ago: the ability to tell a good story, to transform raw ingredients into a delightful dish, is a talent which many do not have, and are willing to pay for.


Map/Reduce two-liner

Despite being around for quite a few years, and successfully used in a number of high-profile implementations, there is a certain confusion around the map/reduce concept, especially among non-technical folks.

If you need to explain this concept (yes, still fresh in memory 🙂), here’s a two-line explanation:

1. Map: distribute data gathering tasks across the grid.

2. Reduce: collect the results, and eliminate duplicates.

This is the process in a nutshell; the devil, as usual, is in the details. The data grid refers to distributed storage and computing infrastructure, and there is nothing trivial in parsing a query into parallel data gathering tasks, or in processing the returned results.
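The two-liner above can be sketched in plain Python (the shards, names and query are made up; on a real grid the map tasks would run on separate nodes, not in a list comprehension):

```python
from functools import reduce

# Pretend each "node" in the grid holds one shard of the data.
shards = [
    ["alice", "bob", "carol"],
    ["bob", "dave"],
    ["carol", "eve", "alice"],
]

# Map: each node independently runs the data gathering task
# (here, a toy query: names no longer than four letters).
def map_task(shard, query):
    return [name for name in shard if query(name)]

mapped = [map_task(shard, lambda name: len(name) <= 4) for shard in shards]

# Reduce: collect the per-node results and eliminate duplicates
# ("bob" was found on two nodes, but appears once in the answer).
result = sorted(reduce(lambda acc, part: acc | set(part), mapped, set()))

print(result)  # ['bob', 'dave', 'eve']
```

The per-node work in the map step is what buys the parallelism; the reduce step is where the partial answers are merged back into one.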
