all bits considered data to information to knowledge


How big is Big Data?

There is no shortage of definitions for the ‘Big Data’ buzzword. Usually it is described in multiples of “V” – volume, velocity, variety (plug in your favorite data-related problem).

I believe that Big Data is defined only by our ability to process it.

There has always been Big Data, ever since it was chiseled into stone, one symbol at a time.

We were talking about big data when it was written on papyrus, vellum and paper; we invented libraries, the Dewey system, Hollerith cards and computers, all in the name of processing ever-increasing volumes of data, ever faster. Once upon a time a terabyte of data was “unimaginably big” (hence a company named “Teradata”); now a petabyte appears to be the “BIG” yardstick, only to be replaced by the exabyte, the zettabyte and so on in the near future. Instead of batch processing we are moving to real time, and, as with every bit of digital information, we are still storing numbers that to us represent text, video, sound and - yes - numbers.

Electronic data processing has come full circle – from unstructured sequential files, to structured hierarchical/network/relational databases, to NoSQL graph/document databases and Hadoop's sequential file processing.

Each round brings us closer to “analog data” – the kind that doesn’t have to be disassembled into bits and bytes to be understood and analyzed: the raw data.

Crossing the artificial chasm between digital and analog data will be the next frontier.


Ecclesiastes 1:9-11:
What has been is what will be,
and what has been done is what will be done,
and there is nothing new under the sun.

Is there a thing of which it is said,
“See, this is new”?
It has been already
in the ages before us.

There is no remembrance of former things,
nor will there be any remembrance
of later things yet to be among those who come after.


Just-in-Time vs. Just-in-Case Information

The term Just-in-Time information has been around for quite a while... It has been used in various contexts - ad-hoc BI dashboards, personal and corporate time management techniques, career/life organizing principles and so on; I am rather sure that JIT Information will enter more domains in the future as both data and information become increasingly liberated.

I blogged about the evolution of our information consumption patterns after observing how my son interacts with the outside world - smartphone, laptop, Facebook, Twitter, Google+, YouTube - an endless stream of seemingly superfluous data in constantly changing contexts... It seems like a perfect recipe for chaos. Yet somehow he and his peers manage to stay on track, graduate from school, and go about their lives. I maintain that ubiquitous, easily accessible data will be the main driver of the next evolutionary cycle, and, as Yogi Berra used to say, the future is not what it used to be...

This leads me to an observation in a more predictable and controlled environment - that of databases, relational and otherwise. I can't help but notice uncanny parallels between the evolution of electronic data storage and retrieval systems and that of paper-based storage and retrieval systems (yes, a fancy name for the ordinary "book"). The database, or - more specifically - the data warehouse, came into existence when data was scarce, and data access was slow, expensive and unreliable. Most of the data stored in a given data warehouse stays dormant for years, rarely - if ever - accessed; I would dub this model "Just-in-Case" information.

As these challenges get addressed, there is a shift underway from the centralized data warehouse to federated and then ad-hoc models. The Raw Data movement has already started - in many scenarios there is no need to hoard data; all one needs to know is where to find the relevant data (one can imagine ever higher levels of hierarchy - directories of directories of directories... meta-meta-meta+n data).

The proliferation of NoSQL databases and the concept of "Big Data" are all part of this shift towards "Just-in-Time" information: data freed from the shackles of schemas and structure... A piece of advice - cast thy data upon the waters: for thou shalt find it after many days (adapted from Ecclesiastes 11:1) 🙂



Raw data now!

You knew it all along - processed foods are bad for you. Could processed data be bad for you as well? I think so.

In the process of preparing data for a specific use you might be inadvertently destroying it for any other future use: aggregated data loses its granularity; normalized data is forced into specific relationships, and so on. The solution is to leave the data as it is. Raw. Unmangled. Unadulterated. And beef up just-in-time processing capacities.
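To make the "aggregated data loses its granularity" point concrete, here is a minimal sketch in Python. The sensor readings and the function names (daily_average, daily_range) are made up for illustration; the only point is that a question asked later can be answered from the raw rows but not from the averages.

```python
from statistics import mean

# Hypothetical raw readings: (sensor, hour of day, temperature).
raw = [
    ("a", 9, 20.0),
    ("a", 15, 30.0),
    ("b", 9, 24.0),
    ("b", 15, 26.0),
]

def group_by_sensor(rows):
    """Collect the temperatures recorded by each sensor."""
    grouped = {}
    for sensor, _hour, temp in rows:
        grouped.setdefault(sensor, []).append(temp)
    return grouped

def daily_average(rows):
    """The 'processed' view: one average per sensor."""
    return {s: mean(t) for s, t in group_by_sensor(rows).items()}

def daily_range(rows):
    """A question nobody planned for: how much did each sensor swing?"""
    return {s: max(t) - min(t) for s, t in group_by_sensor(rows).items()}

aggregated = daily_average(raw)  # both sensors look identical here
ranges = daily_range(raw)        # only the raw rows can answer this
print(aggregated)
print(ranges)
```

Both sensors average out to the same number, so anyone who kept only the aggregate could never tell them apart; the raw rows still can.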

This aligns with a renaissance of key/value pairs as the preferred data storage format (the so-called NoSQL databases), a format that has been around since the 1940s at the very least, and with the rise of unstructured data, which grows at exponential speed and is increasingly consumed at the source without ETL processing.

Sure, convenience is lost with raw data - you're on your own to make sense of it. For some people it might be exactly what they were craving, and for those on a fast food diet there will be enough companies and individuals to package and sell the data and its derivatives. But there will always be more where it came from!

Here's a video call for access to raw data by Tim Berners-Lee, the inventor of the World Wide Web.



Map/Reduce two-liner

Despite having been around for quite a few years, and having been used successfully in a number of high-profile implementations, the map/reduce concept still causes a certain amount of confusion, especially among non-technical folks.

If you need to explain this concept (yes, still fresh in memory 🙂), here’s a two-line explanation:

1. Map: distribute data gathering tasks across the grid.

2. Reduce: collect the results and combine them (for example, by eliminating duplicates or totalling values per key).

This is the process in a nutshell; the devil, as usual, is in the details. The data grid refers to distributed storage and computing infrastructure, and there is nothing trivial about parsing a query into parallel data gathering tasks and processing the returned results.
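For the more technical folks, the two lines can be sketched in a few lines of Python. This is only a toy under stated assumptions: the "grid" is a hypothetical list of in-memory shards, threads stand in for grid nodes, and the names SHARDS, map_shard, reduce_pairs and map_reduce are invented for this example. It is not how Hadoop or any real framework is implemented.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data grid: each inner list is the data held by one node.
SHARDS = [
    ["alice", "bob", "alice"],
    ["carol", "bob"],
    ["alice", "dave", "carol"],
]

def map_shard(shard):
    """Map: each node scans its own data and emits (key, 1) pairs."""
    return [(name, 1) for name in shard]

def reduce_pairs(mapped):
    """Reduce: collect the pairs from all nodes and combine them per key."""
    totals = defaultdict(int)
    for pairs in mapped:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)

def map_reduce(shards):
    # Threads are a stand-in for grid nodes working in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_shard, shards))
    return reduce_pairs(mapped)

result = map_reduce(SHARDS)
print(result)          # one entry per name, with its total count
print(sorted(result))  # the keys alone are the de-duplicated names
```

Note that "eliminating duplicates" falls out of the reduce step for free: each name appears once as a key, no matter how many nodes reported it.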
