Where the wild data are

There was a time when all data was wild... and if it was stored at all, it was committed to the memory of an individual, where it tended to fade away. To facilitate this transient storage the data was wrapped in protocols of rhymes and vivid symbolic images; pictures were drawn, stories were told. Then, about 5,000 years ago, writing systems began to develop: ideograms, symbols and, finally, letters. The data was tamed. The letters made up words, the words made up a sentence, and the sentence, hopefully, made sense. The data was written on clay tablets and animal skins, recorded on papyrus, vellum, paper, magnetic and laser disks… We got quite skillful at butchering the data into neatly organized chunks, and at devising ever more sophisticated structures to hold it: scrolls, books, databases.

And then the Internet happened. They say that we are creating more data in a year than in the previous thousand years, that 90% of the world’s data has been created in the past two years (though, according to one interpretation of the Law of Information Conservation, we only engage in recycling information redistributed from existing sources). We are swamped with information, and the old, tried and true approach of organizing information into palatable chunks is no longer working. Facing the information deluge, we are forced to go back to basics, to raw data, and find ways to make use of it without forcing it into the Procrustean bed of some structure that might have seemed like a bright idea once. Hence the resurrection of the old idea of hierarchical databases (the ones before the advent of SQL) under the guise of the NoSQL movement, and the much-hyped Big Data... In a sense, Big Data is nothing new; it has been around as long as humanity itself, but, just as with the proverbial iceberg, most of it was hidden from our conscious use, which by no means implies that we haven’t used it! No, it was always there for us, seeping in from traditions, proverbs, legends: something that we use without consciously thinking about it, the gut feeling, the social norms. Modern Big Data merely extends this concept to computers.

And I believe that the data can take care of itself, if only humans stopped telling it what to do - but we do have to arrange the meeting :)

Instead of thinking about how to accommodate a new data source or new data formats (e.g. video, MP3 files, text of various degrees of structural and semantic complexity), humans can let the data figure out how to interpret itself.

A new data format could be analyzed, its structure inferred from background information and metadata, its usage from countless examples of similar (or not so similar) data… This will require enormous computing power, but we are getting close to it with the likes of crowdsourcing, probability scores and machine learning, Hadoop infrastructure, and a variety of NoSQL and RDBMS data stores working together to produce insights from the data in the wild: the data over which we have no control, the unreliable, inherently “dirty” data (and the degree of “dirtiness” itself is a valuable piece of information!).
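To make the idea a little more tangible, here is a toy sketch in Python of data "describing itself": it guesses a type for each field from the values alone and reports how "dirty" each field is. Everything in it is my own assumption for illustration, including the names guess_type and infer_schema, the three coarse type labels, and the definition of dirtiness as the share of values that disagree with the majority type. Real systems would rely on far richer signals than this.

```python
from collections import Counter


def guess_type(value):
    """Guess a coarse type label for one raw string value."""
    text = value.strip()
    if text == "":
        return "missing"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "text"


def infer_schema(records):
    """Infer a type and a dirtiness score per field from a list of dicts.

    The inferred type is the most common guess among the values seen for
    a field; dirtiness is the fraction of those values (empty ones
    included) that do not match the inferred type.
    """
    guesses = {}
    for record in records:
        for field, value in record.items():
            guesses.setdefault(field, Counter())[guess_type(str(value))] += 1

    schema = {}
    for field, counts in guesses.items():
        # Prefer a real type over "missing" unless nothing else was seen.
        typed = {t: n for t, n in counts.items() if t != "missing"}
        best = max(typed, key=typed.get) if typed else "missing"
        total = sum(counts.values())
        schema[field] = {"type": best, "dirtiness": 1 - counts[best] / total}
    return schema


# Hypothetical "wild" records: nobody told us what the fields mean.
sample = [
    {"age": "42", "name": "Ann"},
    {"age": "n/a", "name": "Bob"},
    {"age": "37", "name": ""},
    {"age": "51", "name": "Cy"},
]

for field, info in infer_schema(sample).items():
    print(field, info)
```

Note that the dirtiness score is kept rather than thrown away with the bad values: a field that is a quarter noise tells you something about its source.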

It is nice to have a smart ontology all figured out for the information we are using, but it would be a hundred times nicer not to pay attention to any given ontology and still be able to make meaningful use of the data!


A glimpse of the future: Just-In-Time Information

The ever-shrinking attention span of the younger generation gets quite a bit of attention (pun intended) from researchers and educators. One example, "How Social Media Is Ruining Our Lives", claims that over the course of the last ten years the average attention span has dropped from 12 minutes to a staggeringly short 5 minutes.

Yet I wonder. Maybe we do not need a long attention span in the era of an informational deluge pouring through smartphones, tablets and laptops? The pervasive nature of the Internet is changing the way we collect and process information. No longer do we need to own information; we only need to know where and how to find it, and how to connect it with other bits we've already found.

Memorizing information was the staple of rote learning for centuries - people traveled to read a copy of a book in a particular library or to listen to a particular lecture. Movable type and audio/video recording changed this: books, records and movies became more readily available, in a library or purchased from a bookstore. As time passed, books became ever more affordable, but they were still self-contained: the information in a book, magazine or movie was distilled and structured to provide all the components needed. With the advent of the Internet and electronic media this began to change - it became possible to transform raw data into information just in time. The premium is now not on ownership but on the speed of finding and processing the data, the ability to evaluate and integrate it on the fly, and - what's the word - critical thinking.


Singing data blues

Data democratization is a great equalizer... Only, to paraphrase George Orwell, some are going to be more equal than others. With data available in ever greater quantities and ever less processed formats, the users are left to fend for themselves. The era of structured data is ending, and the era of raw data that comes to replace it will require an entirely different set of skills.
Consider the following analogy - a musical tune. It is composed of the same 7 notes, arranged in some harmonic patterns which qualify it as a tune. You might like it, and then again - you might not. In the latter case you might decide to improve upon it; maybe rearrange the notes, or combine it with a different tune. The results would predictably vary - some would end up with a tune of their own, but for the vast majority the result would be cacophony at best, and white noise at worst (I consider cacophony the superior result because of the internal structure it implies, whereas white noise has none).
Equal access to data does not automatically result in equal information being created - much less equal knowledge.

