All Bits Considered: data to information to knowledge


Elementary, my dear Watson!

A new era was officially introduced on February 14, 2011, when an IBM computer named Watson took on a "uniquely human" activity - playing Jeopardy!. The machine was named after IBM's founder, Thomas J. Watson (in case anyone was wondering why it was not named after Sherlock Holmes), and it represents the next giant step toward something that was dubbed "artificial intelligence" in 1956 and had remained almost exclusively in the domain of science fiction ever since.

For a long time it has been understood that merely possessing information is not the same as being able to answer questions, let alone intelligent ones. A search engine, even the most advanced one, relies on keywords to search for information; it is up to humans to come up with a clever string of keywords, and it is ultimately a human task to decide whether the information returned constitutes an answer to the question. Watson takes it a step further - it has to figure out the question, deduce the context, and come up with the statistically most probable answer. This is very different from the Deep Blue computer, which beat chess grandmaster Garry Kasparov in 1997. The game of chess can be reduced to a set of well-defined mathematical problems in combinatorics - a very large set, to be sure, but ultimately susceptible to the number-crunching power of the computer: no ambiguity, no contextual variations. IBM's Watson had to deal with the uncertainty of human language; it had to interpret metaphors and understand the nuances of human speech.

The tables had turned again - instead of humans learning the machine's language to query for answers, it is the machine that learned to understand questions posed with all the ambiguity of human language. With clever programming algorithms the computer was able to "understand" a natural-language query and come up with a correct answer - most of the time, that is.

Does Watson use SQL to come up with the answer? The details of the implementation are a closely guarded secret, at least for now. Given the limitations imposed by the Jeopardy! rules, the narrowly focused purpose, and the relatively modest computing power (around 2,000 CPUs, albeit "connected in a very special way", according to Dr. Christopher Welty, a member of the IBM artificial intelligence group - a far cry from the 750,000 cores of the IBM Mira supercomputer being built for DOE's Argonne National Laboratory), it most probably did not use a relational database to store its data, but rather relied on proprietary data structures and algorithms to search and retrieve the information. Eventually these advances will make it into mainstream database technology, and the way we transform data into information into knowledge will change, again. The future is near.

Update: IBM will incorporate Nuance CLU speech-recognition applications into the Watson supercomputer to provide information that assists doctors as they make diagnoses.


Look Ma, no SQL!

Is Structured Query Language going the way of the dinosaurs?
First proposed back in the 1970s, relational database technologies have flourished, taking over the entire data-processing domain (with the occasional non-relational data store hiding in the long shadows of the [t]rusty mainframes). The days of glory may be over, and the reason could be ... yes, you've guessed it - a paradigm shift.

The relational databases brought order into the chaotic world of unstructured data; for years the ultimate goal was to normalize data, organize it in some fashion, chop it into entities and attributes so it could be further sliced and diced to construct information... There was a price to pay, though - the need for a set-based language to manipulate the data, namely the Structured Query Language, SQL (with some procedural and multidimensional extensions thrown in...)
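To make the set-based idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table, columns, and data are invented for illustration. One declarative statement operates on the whole set of rows at once, with no explicit loop in the application code.

```python
import sqlite3

# Invented schema and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)],
)

# A single set-based statement: group, aggregate, and sort in one pass.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 15.0), ('bob', 25.0)]
```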

The Holy Grail was to get the data to 5NF, and then create a litter of data warehouses - either dimensional or normalized - to analyze it.... Then again, maybe we could just leave the data the way it is, stop torturing it into the relational model - and gain speed and flexibility at the same time?  That's what I call a paradigm shift!

Enter MapReduce: Simplified Data Processing on Large Clusters, another idea from Google (which also inspired Hadoop, an open-source implementation of the idea).
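The idea is simple enough to sketch in a few lines. Below is a toy, single-process illustration of the MapReduce pattern - the canonical word-count example. Real implementations (Hadoop and friends) distribute the map and reduce tasks across a cluster and handle the shuffle for you; the function names here are illustrative, not any particular framework's API.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key (the framework's job)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Notice that the data never gets normalized into tables - the raw documents are processed where they sit, which is exactly the appeal.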

Google is doing it, Adobe is doing it, Facebook is doing it, and hordes of other, relatively unknown, vendors are doing it (and what a lot of tacky names - CouchDB, MongoDB, Dynomite, HadoopDB, Cassandra, Voldemort, Hypertable... 🙂)

IBM, Oracle and Microsoft have announced additional features for their flagship products: the M2 Data Analysis Platform based upon Hadoop, and Microsoft extending its LINQ (which goes beyond relational data) to include similar features... Sybase has recently announced that it has implemented MapReduce in its Sybase IQ database.

To be sure, the data still undergoes some pre-processing before it can be fully managed by these technologies, but to a much lesser degree. The technology is designed to abstract away the intricacies of parallel processing and to facilitate management of large distributed data sets; it aims to eliminate not the need for relational storage but the need for SQL to manipulate the data... The idea is to allow analytic processing of the data where it lives, without expensive ETL, and with minimal performance hit. The line is blurring between ORM, DBMS, OODBMS and the programming environment; between data and data processing...

With all that said, it might not be time to ditch your trusty RDBMS just yet... 🙂 A team of researchers concluded that databases "were significantly faster and required less code to implement each task, but took longer to tune and load the data." Database clusters were between 3.1 and 6.5 times faster on a "variety of analytic tasks."