all bits considered data to information to knowledge


Big Data Open Source Tools

Here is an ever-growing list of open source tools to make your BIG data dreams a reality, compliments of


Look Ma, no SQL!

Is the Structured Query Language going the way of the dinosaurs?
First proposed back in the 1970s, relational database technologies have flourished, taking over the entire data processing domain (with an occasional non-relational data store hiding in the long shadows of the [t]rusty mainframes). The days of glory may be over, and the reason could be... yes, you've guessed it - a paradigm shift.

Relational databases brought order into the chaotic world of unstructured data; for years the ultimate goal was to normalize data, organize it in some fashion, and chop it into entities and attributes so it could be further sliced and diced to construct information... There was a price to pay, though: the need for a set-based language to manipulate the data, namely the Structured Query Language - SQL (with some procedural and multidimensional extensions thrown in...)

The Holy Grail was to get data to 5NF, and then create a litter of data warehouses - either dimensional or normalized - to analyze the data... Then again, maybe we could just leave the data the way it is, stop torturing it into the relational model, and gain speed and flexibility at the same time? That's what I call a paradigm shift!

Enter "MapReduce: Simplified Data Processing on Large Clusters", another idea from Google (which also inspired Hadoop, an open source implementation of the idea).

Google is doing it, Adobe is doing it, Facebook is doing it, and hordes of other, relatively unknown, vendors are doing it (lots of tacky names - CouchDB, MongoDB, Dynomite, HadoopDB, Cassandra, Voldemort, Hypertable...) 🙂

IBM, Oracle and Microsoft have announced additional features for their flagship products: IBM with the M2 Data Analysis Platform based upon Hadoop, and Microsoft by extending its LINQ (which goes past relational data) to include similar features... Sybase has recently announced that it implements MapReduce in its Sybase IQ database.

To be fair, the data still undergoes some pre-processing before it can be fully managed by these technologies, but to a much lesser degree. The technology is designed to abstract away the intricacies of parallel processing and to facilitate management of large distributed data sets; it aims to eliminate not the need for relational storage but the need for SQL to manipulate the data... The idea is to allow analytic processing of the data where it lives, without expensive ETL and with a minimal performance hit. The line is blurring between ORM, DBMS, OODBMS and programming environment; between data and data processing...
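To make the idea a bit more concrete, here is a minimal single-process sketch of the map / shuffle / reduce pattern in Python. It is only an illustration, not how Hadoop or Google's system is actually implemented: the log format ("user page"), the sample data, and the function names (map_phase, shuffle, reduce_phase, run_job) are all made up for this example. It counts page hits straight from raw log lines, roughly what a SELECT page, COUNT(*) ... GROUP BY page would do once the data had been loaded into a table.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: turn one raw log line into (key, value) pairs.

    Assumes a hypothetical 'user page' line format.
    """
    user, page = record.split()
    yield page, 1


def shuffle(pairs):
    """Group all emitted values by key (a real framework does this across nodes)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence, in one process."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample data: raw, unnormalized log lines.
LOG_LINES = [
    "alice /home",
    "bob /home",
    "alice /about",
    "carol /home",
]

if __name__ == "__main__":
    print(run_job(LOG_LINES))
```

In a real cluster the map and reduce functions would run in parallel on the machines that hold the data, and the framework would take care of the shuffle, scheduling and failures; the programmer still writes little more than the two functions above.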

With all that said, it might not be the time to ditch your trusty RDBMS (just yet... :) A team of researchers concluded that databases "were significantly faster and required less code to implement each task, but took longer to tune and load the data." Database clusters were between 3.1 and 6.5 times faster on a "variety of analytic tasks."