All Bits Considered: data to information to knowledge


Where the wild data are

There were times when all data were wild... and if they were stored at all, they were committed to the memory of an individual, which tended to fade away. To facilitate this transient storage the data was wrapped in protocols of rhymes and vivid symbolic imagery; pictures were drawn, stories were told. Then, about 5,000 years ago, writing systems began to develop – ideographs, symbols and, finally, letters. The data was tamed. The letters made up words, the words made up sentences, and the sentences, hopefully, made sense. The data was written on clay tablets and animal skins, recorded on papyrus, vellum, paper, magnetic and laser disks… We got quite skillful at butchering the data into neatly organized chunks, and at devising ever more sophisticated structures to hold it – scrolls, books, databases.

And then the Internet happened. They say that we now create more data in a year than in the previous thousand years, that 90% of the world’s data has been created in the past two years (though, according to one interpretation of the Law of Information Conservation, we merely recycle information redistributed from existing sources). We are swamped with information, and the old, tried, trusted and true approach of organizing information into palatable chunks is no longer working. Facing the information deluge, we are forced to go back to basics – raw data – and find ways to make use of it without forcing it into the Procrustean bed of some structure that might have seemed like a bright idea once. Hence the resurrection of the old idea of hierarchical databases – the ones before the advent of SQL – under the guise of the NoSQL movement, and the much-hyped Big Data... In a sense, Big Data is nothing new; it has been around as long as humanity itself, but, as with the proverbial iceberg, most of it was hidden from our conscious use – which by no means implies that we haven’t used it! No, it was always there for us, seeping in from traditions, proverbs, legends – something that we use without consciously thinking about it: the gut feeling, the social norms. Modern Big Data merely extends this concept to computers.

And I believe that the data can take care of itself, if only humans stopped telling it what to do – but we do have to arrange the meeting 🙂

Instead of thinking about how to accommodate a new data source or a new data format (e.g., video, mp3 files, text of various degrees of structural and semantic complexity), humans can let the data figure out how to interpret itself.

A new data format could be analyzed, its structure inferred from background information/metadata, its usage from countless examples of similar (or not) data… This will require enormous computing power, but we are getting close to it with the likes of crowdsourcing, probability scores and machine learning, Hadoop infrastructure, and a variety of NoSQL and RDBMS data stores working together to produce insights from the data in the wild – the data over which we have no control, unreliable, inherently “dirty” data (and the degree of “dirtiness” is itself a valuable piece of information!).
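To make the idea of letting data describe itself a bit more concrete, here is a minimal sketch (my own illustration, not any particular product's algorithm) of inferring a column's type from raw values and scoring its "dirtiness" as the fraction of values that disagree with the dominant type:

```python
from collections import Counter

def infer_type(value: str) -> str:
    """Guess the type of a single raw value."""
    value = value.strip()
    if value == "":
        return "missing"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"

def profile_column(values):
    """Infer the dominant type of a column; the 'dirtiness' score is
    the share of values that do not conform to that type."""
    counts = Counter(infer_type(v) for v in values)
    dominant, hits = counts.most_common(1)[0]
    dirtiness = 1 - hits / len(values)
    return dominant, dirtiness

# Raw, uncontrolled data "in the wild": mostly integers, with strays
ages = ["34", "28", "n/a", "41", "thirty", "19"]
print(profile_column(ages))  # dominant type 'int', dirtiness ~0.33
```

The dirtiness score itself becomes a piece of metadata, in the spirit of the point above that the degree of "dirtiness" carries information.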

It is nice to have a smart ontology all figured out for the information we are using, but it would be a hundred times nicer not to pay any attention to any given ontology and still be able to make meaningful use of the data!


Just say NO to data moochers!

The other day I stopped by a Great Clips salon to get a haircut. I was greeted with "Hi! What's your phone number?".  Then the following dialog ensued:

- Well, that's a bit personal, don't you think?

- I need it to enter into the computer! - the lady looked a bit pensive.

- I could give you my name. It's Alex  - I didn't want to cause any trouble for her, just wanted a haircut.

- Is this the name you're usually using here? I can't find you in the computer! - she sounded annoyed.

That was it for me. I muttered my thanks, and left the premises with a firm intention to boycott Great Clips from now on.

I got my haircut from a friendly neighborhood salon down the road - no questions asked.

Everybody tracks everybody nowadays. Loyalty cards, online cookies, and single sign-on apps seem to be proliferating with the speed of electric current. And I get it - Facebook collects information in exchange for providing me with a valuable service, Safeway collects my information in exchange for a discount, etc. All this is spelled out upfront, with a clear understanding of what the transaction brings to both parties. But why would Great Clips expect me to share my personal information with them for free?

On the other hand, I am wondering whether there is such a thing as "data addiction", and if so - what are the health implications for a company that got into the habit? After all, a mix of data and predictive models can be, well, unpredictable.

Gaining insight from data is great... if the data, the assumptions, and the predictive models are all correct. And this is a big IF.


How big is Big Data?

There is no shortage of definitions for the ‘Big Data’ buzzword. Usually it is described in multiples of “V” – volume, velocity, variety (plug in your favorite data-related problem).

I believe that Big Data is defined only by our ability to process it.

There has always been Big Data, since the time when it was chiseled into stone, one symbol at a time.

We were already talking about big data when it was written on papyri, vellum, and paper; we invented libraries, the Dewey Decimal System, Hollerith cards, computers – all in the name of processing ever-increasing volumes of data, ever faster. Once upon a time a terabyte of data was “unimaginably big” (hence a company named “Teradata”); now a petabyte appears to be the “BIG” yardstick, only to be replaced by exabytes, zettabytes, etc. in the near future. Instead of batch processing we are moving to real time, and, as with every bit of digital information, we are still storing numbers that to us represent text, video, sound and - yes - numbers.

Electronic data processing has come full circle – from unstructured sequential files to structured hierarchical/network/relational databases to NoSQL graph/document databases and Hadoop sequential file processing.

Each round brings us closer to “analog data” – the raw data that doesn’t have to be disassembled into bits and bytes to be understood and analyzed.

Crossing the artificial chasm between digital and analog data will be the next frontier.


Ecclesiastes 1:9-11:
What has been is what will be,
and what has been done is what will be done,
and there is nothing new under the sun.

Is there a thing of which it is said,
“See, this is new”?
It has been already
in the ages before us.

There is no remembrance of former things,
nor will there be any remembrance
of later things yet to be among those who come after.


Big Data vs. Lots of Data

A short presentation intriguingly titled "Top Five Questions to Answer Before Starting on Big Data" caught my attention. There is a lot of noise around the "Big Data" phenomenon, already proclaimed to be The Next Big Thing. Quite a few folks disagree, including Stephen Few of Perceptual Edge, who published a paper titled "Big Data, Big Ruse" (pdf).

Don't get me wrong - I do believe that Big Data IS a big thing, and that its introduction will bring about a proverbial paradigm shift (another arguably over-used term of the last decade). Yet many people, while talking about Big Data, have a rather vague idea of what it is, and many believe that it is equal to "Lots of Data" that underwent a qualitative transformation a la Karl Marx ("Merely quantitative differences, beyond a certain point, pass into qualitative changes." --Karl Marx, Das Kapital, Vol. 1).

Sorry to contradict some aficionados of dialectical materialism, but... it ain't so. Which is exactly the point of slide #3 in the aforementioned deck.

The current incarnation of Big Data is mostly about machine-generated data. There might be lots of nuances and exceptions to this affirmation, but humans simply cannot match a machine's ability to generate data 24/7. True, much of this data is generated in response to human activity (e.g., clickstreams), but even then it is enhanced with machine-generated information (e.g., date/time stamps, geocoding, etc.); a single tweet can generate additional kilobytes of contextual data that enhance the semantic value of the tweet itself - to the business, not the tweeter, of course! Say, was it tweeted from a mobile device or a laptop? Which operating system? What browser/application? What time of day/night? Geographical location? Time elapsed between the first syllable and the last? Language used? And so on and so on.
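A toy illustration of this asymmetry - the field names below are purely hypothetical, not the real Twitter API schema - shows how a short human-typed message gets wrapped in a machine-generated context that dwarfs it:

```python
import json
from datetime import datetime, timezone

def enrich(tweet_text: str) -> dict:
    """Wrap a tweet in (hypothetical) machine-generated context:
    timestamp, device, OS, client, location, language, typing time."""
    return {
        "text": tweet_text,
        "context": {
            "timestamp_utc": datetime.now(timezone.utc).isoformat(),
            "device": "mobile",            # mobile device vs. laptop
            "os": "Android 13",
            "client": "example-android-app/9.95",
            "geo": {"lat": 40.7128, "lon": -74.0060},
            "language": "en",
            "typing_time_ms": 8400,        # first keystroke to last
        },
    }

event = enrich("Nice haircut!")
payload = json.dumps(event)
# the contextual wrapper is many times larger than the 13-char tweet
print(len(event["text"]), len(payload))
```

Multiply that ratio by billions of events per day and the "mostly machine-generated" claim above stops sounding like an exaggeration.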

This is what Big Data is all about. And this is why the question on slide #3 - "Do you have Big Data problem or just Lots of Data problem?" comes right after "What do you need to know?" on slide 2.

Nine out of ten times, people talking about Big Data are referring to the data locked in their enterprise databases, documents and web pages; some of it might even include metadata. But the machine-generated component - the proverbial 800-pound gorilla in the room - flies under the radar. Enterprise data - the domain of BI - is but the tip of the iceberg that is Big Data.



Meditations on Nature of Data: the first three days :)

Meditating on John Wheeler's "It from Bit" insight, I came up with what really could have been going on during the first three days of Creation... 🙂


1. In the beginning Information Architect created the database and the model.

2. Now the database was formless and empty, darkness was over the surface of the deep, and the Spirit of Backus-Naur was hovering over the databases.

3. And Architect said: "Let there be data", and there were data.

4. And Information Architect saw that the data were good, and he separated data from metadata.

5. Architect called the data "data," and the metadata he called "metadata." And there was evening, and there was morning - the first day.

6. And Information Architect said, "Let there be a vault between the data to separate datum from datum." And it was so.

7. So the Architect made the vault and separated the datum under the vault from the datum above it.

8. Architect called the vault "database schema." And there was evening, and there was morning - the second day...

9. And Information Architect said, "Let the data outside the schema be gathered to one place, and let storage appear." And it was so. 

10. Architect called the storage "data storage" and the gathered data he called "data logs". And Architect saw that it was good.

11. Then Information Architect said, "Let the data logs, and databases produce meaningful information according to their various kinds of context." And it was so.

12. The data combined with context produced information: databases bearing reports according to their kinds and data logs bearing reports in it according to their kinds. And Information Architect saw that it was good. 

13. And there was evening, and there was morning - the third day.



The fine line between “Big Data” and “Big Brother”

There was never a lack of desire to collect as much data as possible on the part of businesses or governments; it was capabilities that always got in the way. With the advent of "Big Data" technology, the barrier has just been lowered.

Monitoring employees' interactions in minute detail to analyze patterns and get ideas for productivity improvements is not illegal per se... but it takes us one step further down the proverbial slippery slope.

The recent article in The Wall Street Journal by Rachel Emma Silverman highlights the indisputable advantages but somehow glosses over the potential dangers:

As Big Data becomes a fixture of office life, companies are turning to tracking devices to gather real-time information on how teams of employees work and interact. Sensors, worn on lanyards or placed on office furniture, record how often staffers get up from their desks, consult other teams and hold meetings.

Businesses say the data offer otherwise hard-to-glean insights about how workers do their jobs, and are using the information to make changes large and small, ranging from the timing of coffee breaks to how work groups are composed, to spur collaboration and productivity.


[06.17.2013] Here's a blog post by Michael Walker addressing the very same issues, with the benefit of hindsight after the revelations about the PRISM surveillance program:


Just-in-Time vs. Just-in-Case Information

The term Just-in-Time information has been around for quite a while... It has been used in various contexts - ad-hoc BI dashboards, personal and corporate time management techniques, career/life organizing principles and so on; I am rather sure that JIT Information will enter more domains in the future as both data and information become increasingly liberated.

I blogged about the evolution of our information consumption patterns after observing how my son interacts with the outside world - smartphone, laptop, facebook, twitter, google+, youtube - an endless stream of seemingly superfluous data in constantly changing contexts... It seems like a perfect recipe for chaos. Yet somehow his generation manages to stay on track, graduate from school, and go about their lives. I maintain that ubiquitous, easily accessible data will be the main driver of the next evolutionary cycle, and, as Yogi Berra used to say, the future is not what it used to be...

This leads me to an observation in a more predictable and controlled environment - that of databases, relational and otherwise. I can't help but notice uncanny parallels between the evolution of electronic data storage and retrieval systems and that of paper-based storage and retrieval systems (yes, a fancy name for the ordinary "book"). A database, or - more specifically - a data warehouse, came into existence when data was scarce, and data access was slow, expensive and unreliable; most of the data stored in a given data warehouse stays dormant for years, rarely - if ever - accessed. I would dub this model "Just-in-Case" information.

As these challenges get addressed, there is a shift underway from the centralized data warehouse to federated and then ad-hoc models. The Raw Data movement has already started - in many scenarios there is no need to hoard data; all one needs to know is where to find relevant data (one can imagine ever higher levels of hierarchy - directories of directories of directories... meta-meta-meta+n data).

The proliferation of NoSQL databases and the concept of "Big Data" are all part of this shift towards "Just-in-Time" information - data freed from the shackles of schemas and structure... A piece of advice - cast thy data upon the waters: for thou shalt find it after many days (adapted from Ecclesiastes 11:1) 🙂



Ethical limits of Business Intelligence

Intelligence of all kinds - business intelligence, social intelligence - can be gleaned from the mounds of data accumulated from our daily interactions with the outside world. It can then be used to manipulate our behavior to the benefit of the data collector/analyst.

Here is how IKEA and Costco utilize information "to turn browsers into buyers, and making buyers to spend more": a new store floor layout or a combination of sound/light/olfactory stimuli to put us in "buying mode", targeted advertising, mass customization based upon data collected from purchase history, Facebook, LinkedIn, Google+... For example:

"In research yet to be published, a University of Alberta team has proven that what we smell and hear affects what we buy: When a sample group smelled the relaxing scent of lavender, 77% wanted a soothing iced tea, but when the same group smelled the arousing aroma of grapefruit, 70% reached for an energy drink. When the researchers played Mozart’s Sonata in D Major at a slow tempo, 71% wanted iced tea, but when the piano piece was sped up, 71% wanted an energy drink — an exact reversal."

Where does "legitimate use" stop and "Brave New World"/"1984" take over?

Where is the limit beyond which these "insights into consumer behavior" become an invasion of privacy?



How clean do you want your data?

Yet another presentation by an RDBMS vendor leaves me scratching my head... Master Data Management is the buzzword du jour, and everybody has "just what you need"; naturally, I am sitting through all the presentations trying not to choke on chaff.

"Clean, consistent, and accurate data" - how many semantic overlaps do we have here? Does "clean" imply "consistent"? Or maybe just "accurate"? If so, how do you measure cleanliness? Degrees of consistency?

The Wikipedia article on "Data Cleansing" makes a distinction between "data cleaning" and "data validation" (which it should), but for all the wrong reasons: "validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data." This would be news to the thousands of customers of SAS and Informatica struggling to validate (and re-validate) their legacy systems chock-full of dirty data. Rather, the distinction should be made on context - the Wikipedia article surreptitiously switches the context from databases to software development; no matter how much validation the gatekeepers might apply, the sad fact is that once inside, the data can (and often does) become invalid.

Unless there are measurable indicators quantifying these attributes of the data, the lofty goals of clean, accurate and consistent data are just wishful thinking. A project manager on a Data Quality project must get all the stakeholders on the same page regarding what exactly each means by "clean data", and establish measures to gauge progress and acceptance criteria.
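What might such measurable indicators look like? A minimal sketch (the field names, phone format, and thresholds are hypothetical stand-ins for whatever the stakeholders actually agree upon) that turns "clean data" into numbers:

```python
import re

# Hypothetical acceptance thresholds agreed upon with the stakeholders
RULES = {
    "completeness": 0.95,   # share of records with all required fields
    "validity": 0.98,       # share of values matching the agreed format
}

PHONE = re.compile(r"^\d{3}-\d{4}$")  # an assumed local phone format

def measure(records):
    """Compute completeness and validity rates over a batch of records."""
    total = len(records)
    complete = sum(1 for r in records if r.get("name") and r.get("phone"))
    valid = sum(1 for r in records if PHONE.match(r.get("phone") or ""))
    return {
        "completeness": complete / total,
        "validity": valid / total,
    }

records = [
    {"name": "Alex", "phone": "555-1234"},
    {"name": "Kim",  "phone": "none"},      # present but invalid
    {"name": "",     "phone": "555-9876"},  # incomplete
]
scores = measure(records)
print(scores)
print({k: scores[k] >= v for k, v in RULES.items()})  # pass/fail gauge
```

Crucially, the same function can be re-run on data already inside the system, not just at entry time - which is exactly the re-validation scenario the Wikipedia definition above glosses over.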


Singing data blues

Data democratization is a great equalizer... Only, to paraphrase George Orwell, some are going to be more equal than others. With data available in ever greater quantities and ever less processed formats, users are left to fend for themselves. The era of structured data is ending, and the era of raw data that comes to replace it will require an entirely different set of skills.
Consider the following analogy - a musical tune. It is composed of the same seven notes, arranged in harmonic patterns that qualify it as a tune. You might like it, and then again - you might not. In the latter case you might decide to improve upon it; maybe rearrange the notes, or combine it with a different tune. The results would predictably vary - some would end up with a tune of their own, but for the vast majority the result would be cacophony at best and white noise at worst (I consider cacophony the superior result because of the internal structure it implies, whereas white noise has none).
Equal access to data does not automatically result in equal information being created - much less equal knowledge.
