all bits considered data to information to knowledge


Where the wild data are

There were times when all data were wild... and if it was stored at all, it was committed to a memory of an individual which tended to fade away.  To facilitate this transient storage the data was wrapped in protocols of rhymes and vivid symbolic images; the pictures were drawn, stories were told. Then, about 5,000 years ago the writing systems begun develop – ideographic, symbols and, finally, letters. The data was tamed. The letters made up words, the words made up a sentence, and the sentence, hopefully, made sense. The data was written on clay tablets, animal skins, recorded on papyrus, vellum, paper, magnetic and laser disks… We got quite skillful at butchering the data into neatly organized chunks, and devising ever more sophisticated structures to hold it – scrolls, books, databases.

And then Internet happened. They say that we are creating more data in a year than in the previous thousand years, that 90% of the world’s data have been created in the past two years (though, according to one interpretation of the Law of Information Conservation, we only engage in recycling the information redistributed from existing sources) We are swamped with information, and the old, tried, trusted and true approach of organizing information into a palatable chunks is no longer working. Facing information deluge, we are forced to go back to basics – raw data – and find ways to make use of it without forcing it into a Procrustean bed  of some structure that might have seemed like a bright idea once. Hence the resurrection of an old idea of hierarchical databases – the ones before the advent of SQL – under the guise of NoSQL movement, and much hyped Big Data... In a sense, Big Data is nothing new, it’s been around us as long as humanity itself but just as with the proverbial iceberg, most of it was hidden from our conscious use – which by no means does not mean that we haven’t used it! No, it was always there for us, seeping in from traditions, proverbs, legends – something that we use without consciously thinking about it, the gut feeling, the social norms. The modern Big Data but extends this concept to the computers.

And I believe that the data can take care of itself, if only humans stopped telling it what to do – but we do have to arrange the meeting J

Instead of thinking how to accommodate the new data source or new data formats (e.g  video, mp3 files, text of various degrees of structural and semantic complexity), the humans can let the data figure out how to interpret it by itself.

The new data format could be analyzed, its structure inferred from background information/metadata, its usage from countless examples of similar (or not) data… This will require enormous computing power but we are getting close to it with likes of crowdsourcing, probability scores and machine learning, Hadoop infrastructure, variety of NoSQL and RDBMS data working together to produce insights from the data in the wild, the data over which we have no control, unreliable, inherently “dirty” data  ( and the degree of “dirtiness” itself is a valuable piece of information!)

It is nice to have smart ontology all figured out for the information we are using but it would be hundred times nicer not to pay any attention to any given ontology, and still being able to make meaningful use of the data!


How big is Big Data?

There is no shortage of definitions for the ‘Big Data’ buzzword. Usually it is described in multiples of “V” – volume, velocity, variety (plug in your favorite data-related problem).

I believe that Big Data is defined only by our ability to process it.

There has always been Big Data, since the time when it was chiseled into stone, one symbol at the time.

We were talking about big data when it was written onto the papyri, vellum, paper; we have invented libraries, Dewey system, Hollerith cards, computers, – all in the name to process ever-increasing volumes of data, ever faster. Once upon a time a terabyte of data was “unimaginably big” (hence a company named “Teradata”), now a petabyte appears to be the “BIG” yardstick, only to be replaced with exabyte, zettabyte etc. in the near future; instead of batch processing we are moving to real-time, and, as with every bit of digital information, we are still storing numbers that to us represent text, video, sound and - yes - numbers.

The electronic data processing made a complete circle – from unstructured sequential files to structured hierarchical/network/relational database to NoSQL graph/doc databases and Hadoop sequential files processing.

Each round brings us closer to “analog data” – the ones that don’t have to be disassembled into bits and bytes to be understood and analyzed, the raw data.

Crossing the artificial chasm between digital and analog data will be the next frontier.


Ecclesiastes 9-11:
What has been is what will be,
and what has been done is what will be done,
and there is nothing new under the sun.

Is there a thing of which it is said,
“See, this is new”?
It has been already
in the ages before us.

There is no remembrance of former things,
nor will there be any remembrance
of later things yet to be among those who come after.


Data Scientists or… Psychohistorians?

Before the Big Data, social Data Science/Data Mining and Machine Learning there was … Psychohistory!

The concept was introduced in 1951 by Isaac Asimov in his monumental Sci-Fi trilogy “ The Foundation”, and is very closely correlated with this “new” phenomenon of statistical modeling of the social interactions.

Proof? The definition from Encyclopedia Galactica quoted at the beginning of the 4th Chapter of The Foundation Trilogy:

Gaal Dornick, using non-mathematical concepts, has defined psychohistory to be that branch of mathematics which deals with reaction of human conglomerates to fixed social and economic stimuli …

… Implicit in all these definitions is the assumption that the human conglomerate being dealt with is sufficiently large for valid statistical treatment. The necessary size of such conglomerate may be determined by Seldon’s First Theorem which… A further necessary assumption is that the human conglomerate be itself unaware of psychohistoric analysis in order for its reactions to be truly random…

The basis of all valid psychohistory lies in the development of the Seldon Functions which exhibit properties congruent to these of such social and economic forces as …”


Asimov correctly points out the boundary conditions  of this statistical analysis – for this to work the society must be unaware of the analysis taking place and/or how it works as this would skew the distribution curve. After all, if the people stop clicking on these links and like-me-buttons, and stop sharing their information  (or worse – start feeding in some garbage data) all these sophisticated models would go haywire.

To continue analogy, the "Mule" character represents the "Black Swan" event that invalidates the entire premise based on normal distribution.


Big Data vs. Lots of Data

A short presentation intriguingly titled "Top Five Questions to Answer Before Starting on Big Data" caught my attention. There is a lot of noise around "Big Data" phenomenon already proclaimed to be The Next Big Thing. Quite a few folks disagreed, including Stephen Few of Perceptual Edge who published paper with a title "Big Data, Big Ruse" (pdf).

Don't get me wrong - I do believe that Big Data IS a big thing, and that its introduction will bring about a proverbial paradigm shift (another arguably over-used term of the last decade). Yet many people, while talking about Big Data, have a rather vague idea what it is, and many believe that is is equal to "Lots of Data" which underwent qualitative transformation a la Karl Marx ("Merely quantitative differences, beyond a certain point, pass into qualitative changes." --Karl Marx, Das Kapital , Vol. 1.)

Sorry to contradict some aficionados of dialectical materialism but.. it ain't so. Which is exactly the point of the slide #3 in the aforementioned deck.

The current incarnation of Big Data is mostly about machine-generated data. There might be lots of nuances and exceptions to this affirmation but humans simply cannot match machine's ability to generate data 24/7. True, lots of this data is generated in response to human activity (e.g. clickstreams) but even then it is enhanced with machine-generated information (e.g. date/time stamps, geocoding etc); a single tweet could generate additional kilobytes of contextual data which can enhance the semantic value of the tweet itself - to the business, not the tweeter, of course!... Say, was it tweeted from a mobile device or a laptop? which operating system? what browser/application? what time of day/night? geographical location? time elapsed between first syllable and the last? language used?  and so on and so on.

This is what Big Data is all about. And this is why the question on slide #3 - "Do you have Big Data problem or just Lots of Data problem?" comes right after "What do you need to know?" on slide 2.

9 out of 10 times people talking about Big Data are referring to the data locked in their enterprise database, documents and web pages; some of it might even include metadata. But the machine generated component - the proverbial 800 pound gorilla in the room - flies under the radar. The enterprise data - a domain of BI -  is but a tip of the iceberg which is the Big Data.



A Brief History of Big Data

How had the data grown so BIG? The trends appear to have finally converged:

  • data storage  went from stone tablets to animal skins to paper to HDD to...
  • ability to process information went from memorization to computer aided recall
  • humans are no longer the biggest data producers in the Universe

Gil Press traces the origins of Big Data - or at least a premonition thereof  - to 1946 in a great article in Forbes - a fascinating read!

I think that (with a bit of a stretch) it could go back all the way to Marcus Tullius Cicero, who lived in first century B.C.

"Times are bad. Children no longer obey their parents, and everyone is writing a book." 

It's the book part I am referring to  🙂


The fine line between “Big Data” and “Big Brother”

The was never a lack of desire to collect as much data as possible on the part of business or governments; it was capabilities that always got in the way. With advent of "Big Data" technology the barrier had just been lowered.

Monitoring employees interactions in minute detail to analyze patterns, and get ideas on productivity improvements is not illegal per se...but it takes us one step further towards this proverbial slippery slope.

The recent article in The Wall Street Journal by Rachel Emma Silverman highlights the indisputable advantages but somehow glosses over the potential dangers:

As Big Data becomes a fixture of office life, companies are turning to tracking devices to gather real-time information on how teams of employees work and interact. Sensors, worn on lanyards or placed on office furniture, record how often staffers get up from their desks, consult other teams and hold meetings.

Businesses say the data offer otherwise hard-to-glean insights about how workers do their jobs, and are using the information to make changes large and small, ranging from the timing of coffee breaks to how work groups are composed, to spur collaboration and productivity.


[06.17.2013] Here's a blog post addressing the very same issues by Michael Walker, with benefit of hindsight after revelations on PRISM surveillance program:


Big Data Open Source Tools

Here is an ever-growing list of open source tools to make your BIG data dreams a reality, compliments of