all bits considered data to information to knowledge


Big Data vs. Lots of Data

A short presentation intriguingly titled "Top Five Questions to Answer Before Starting on Big Data" caught my attention. There is a lot of noise around the "Big Data" phenomenon, already proclaimed to be The Next Big Thing. Quite a few folks disagree, including Stephen Few of Perceptual Edge, who published a paper titled "Big Data, Big Ruse" (pdf).

Don't get me wrong - I do believe that Big Data IS a big thing, and that its introduction will bring about a proverbial paradigm shift (another arguably over-used term of the last decade). Yet many people, while talking about Big Data, have a rather vague idea of what it is, and many believe that it is simply "Lots of Data" which underwent a qualitative transformation a la Karl Marx ("Merely quantitative differences, beyond a certain point, pass into qualitative changes." --Karl Marx, Das Kapital, Vol. 1).

Sorry to contradict some aficionados of dialectical materialism, but... it ain't so. Which is exactly the point of slide #3 in the aforementioned deck.

The current incarnation of Big Data is mostly about machine-generated data. There may be plenty of nuances and exceptions to this assertion, but humans simply cannot match a machine's ability to generate data 24/7. True, much of this data is generated in response to human activity (e.g. clickstreams), but even then it is enhanced with machine-generated information (e.g. date/time stamps, geocoding etc.); a single tweet could generate additional kilobytes of contextual data which enhance the semantic value of the tweet itself - to the business, not the tweeter, of course!... Say, was it tweeted from a mobile device or a laptop? Which operating system? What browser/application? What time of day/night? Geographical location? Time elapsed between the first syllable and the last? Language used? And so on and so on.
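To make the "kilobytes of context around a hundred bytes of text" idea concrete, here is a minimal sketch of such enrichment. The field names (`device`, `os`, `geo`, `lang`) are invented for illustration; real platforms record far more, and under different names.

```python
from datetime import datetime, timezone

def enrich_tweet(text, device="mobile", os_name="iOS", lat=None, lon=None, lang="en"):
    """Attach machine-generated context to a human-generated tweet.

    Hypothetical field names -- a sketch of the enrichment pattern, not any
    real platform's schema.
    """
    return {
        "text": text,                      # the ~140 characters a human typed...
        "context": {                       # ...plus the machine-generated halo around it
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "device": device,
            "os": os_name,
            "geo": {"lat": lat, "lon": lon},
            "lang": lang,
        },
    }

record = enrich_tweet("Big Data is a big thing!", lat=40.71, lon=-74.0)
```

Note that every field in `context` is produced without any human effort - which is exactly why the machine-generated portion dwarfs the human-generated one.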

This is what Big Data is all about. And this is why the question on slide #3 - "Do you have a Big Data problem or just a Lots of Data problem?" - comes right after "What do you need to know?" on slide #2.

Nine times out of ten, people talking about Big Data are referring to the data locked in their enterprise databases, documents and web pages; some of it might even include metadata. But the machine-generated component - the proverbial 800-pound gorilla in the room - flies under the radar. The enterprise data - the domain of BI - is but the tip of the iceberg that is Big Data.



A Brief History of Big Data

How did the data grow so BIG? Several trends appear to have finally converged:

  • data storage went from stone tablets to animal skins to paper to HDD to...
  • the ability to process information went from memorization to computer-aided recall
  • humans are no longer the biggest data producers in the Universe

Gil Press traces the origins of Big Data - or at least a premonition thereof  - to 1946 in a great article in Forbes - a fascinating read!

I think that (with a bit of a stretch) it could go back all the way to Marcus Tullius Cicero, who lived in the first century B.C.

"Times are bad. Children no longer obey their parents, and everyone is writing a book." 

It's the book part I am referring to  🙂


Meditations on Nature of Data: the first three days :)

Meditating on John Wheeler's "It from Bit" insight I came up with what really could have been going on during first three days of Creation.... 🙂


1. In the beginning Information Architect created the database and the model.

2. Now the database was formless and empty, darkness was over the surface of the deep, and the Spirit of Backus-Naur was hovering over the database.

3. And Architect said: "Let there be data", and there were data.

4. And Information Architect saw that the data were good, and he separated data from metadata.

5. Architect called the data "data," and the metadata he called "metadata." And there was evening, and there was morning - the first day.

6. And Information Architect said, "Let there be a vault between the data to separate datum from datum." And it was so.

7. So the Architect made the vault and separated the datum under the vault from the datum above it.

8. Architect called the vault "database schema." And there was evening, and there was morning - the second day...

9. And Information Architect said, "Let the data outside the schema be gathered to one place, and let storage appear." And it was so. 

10. Architect called the storage "data storage" and the gathered data he called "data logs". And Architect saw that it was good.

11. Then Information Architect said, "Let the data logs and databases produce meaningful information according to their various kinds of context." And it was so.

12. The data combined with context produced information: databases bearing reports according to their kinds and data logs bearing reports according to their kinds. And Information Architect saw that it was good.

13. And there was evening, and there was morning - the third day.



New Meaning of “Investing in Your Health”: an idea for Health Insurance Exchange

As states race to implement the Health Insurance Exchanges mandated under the Affordable Care Act, I wonder whether they have chosen the wrong model - that of an overseer, an information provider and a mediator... Maybe we could have borrowed a paradigm from the stock exchange market?
Some of the major hurdles facing Health Insurance Exchange implementations include:
  • the inherent complexity of the endeavor
  • insufficient experience on the states' part in operating exchanges (as opposed to the financial industry)
  • the need to attract a sufficient number of participants to become efficient and self-sustaining
It is a common pattern in software engineering to deal with complexity by introducing an abstraction layer, and the financial industry did just that with the concepts of Exchange-Traded Funds and Mutual Funds, which ease the complexity of picking individual stocks. I believe that a Health Insurance Exchange might have much more in common with an exchange than with insurance, and that the very same concepts are applicable here.
Imagine health insurance pools structured in a way similar to mutual funds/ETFs according to some predefined criteria, each designed to cater to a certain category of consumer (again, by analogy with industry-sector funds). Coverage is then sold as units of insurance to consumers through the exchange.
The role of the Exchange operators would be that of mutual fund managers:
  • design portfolios of insurance plans and sell units of insurance to consumers (after proper validation and categorization)
  • handle fund-to-fund exchange (when situation of the customer changes)
  • process refunds and assess charges
  • provide apples-to-apples comparison
  • etc.
The insurer and the insured will be decoupled: the former will roll out insurance plans, and the latter will buy as much or as little insurance as they need, while the State's Health Insurance Exchange would provide the platform for "health insurance pool" comparison and rating, and handle financial transfers (including state/federal subsidy portions).
In a commercial twist to the idea: since the funds have an expiration date, the insurers might even pay dividends to the unit holders based upon the unused portion of the plan (insert actuarial voodoo here :).
There might even be a secondary market where investors could buy funds from customers (the original unit buyers) if they can reasonably expect a positive return due to under-utilization of the plan.
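The pool-of-units idea above can be sketched as a tiny data model. Everything here is hypothetical - the class, the pool name, and the prices are invented purely to illustrate the fund/unit analogy, not drawn from any real exchange design.

```python
from dataclasses import dataclass

@dataclass
class InsurancePool:
    """A hypothetical 'insurance fund' sold in units, like a mutual fund/ETF."""
    name: str                 # e.g. a pool targeting one category of consumer
    unit_price: float         # price of one unit of coverage
    units_outstanding: int = 0

    def buy_units(self, n: int) -> float:
        """A consumer buys n units of coverage through the exchange;
        returns the total cost of the purchase."""
        self.units_outstanding += n
        return n * self.unit_price

pool = InsurancePool("Young Families", unit_price=120.0)
cost = pool.buy_units(10)   # 10 units of coverage at $120 each
```

The point of the abstraction: the consumer interacts with the pool's units and price, never directly with the underlying insurance plans - just as an ETF buyer never picks the individual stocks.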

The fine line between “Big Data” and “Big Brother”

There was never a lack of desire to collect as much data as possible on the part of businesses or governments; it was capability that always got in the way. With the advent of "Big Data" technology, that barrier has just been lowered.

Monitoring employees' interactions in minute detail to analyze patterns and get ideas for productivity improvements is not illegal per se... but it takes us one step further down the proverbial slippery slope.

The recent article in The Wall Street Journal by Rachel Emma Silverman highlights the indisputable advantages but somehow glosses over the potential dangers:

As Big Data becomes a fixture of office life, companies are turning to tracking devices to gather real-time information on how teams of employees work and interact. Sensors, worn on lanyards or placed on office furniture, record how often staffers get up from their desks, consult other teams and hold meetings.

Businesses say the data offer otherwise hard-to-glean insights about how workers do their jobs, and are using the information to make changes large and small, ranging from the timing of coffee breaks to how work groups are composed, to spur collaboration and productivity.


[06.17.2013] Here's a blog post by Michael Walker addressing the very same issues, with the benefit of hindsight after the revelations about the PRISM surveillance program.


Data as a weapon of choice

I blogged last week about Tesla Motors' CEO fighting a shoddy review of a car his company manufactures. This week's news brought yet another story of a CEO using data to justify a very public and controversial decision - banning employees from working from home.

How did CEO Marissa Mayer decide to make such a controversial decision? According to a source, the only way Mayer is comfortable making any decision: with the help of data.


Diverging realities

The ability to filter information based upon one's personal preferences and tastes has never been greater, even as the variety of available information explodes. Therein lies the problem - rarely if ever do we choose to be exposed to something that we do not want to hear, view or read, and as a result we are surrounded with "yes-information" (by analogy with the "yes-man" definition from Merriam-Webster).

Some of these filtering criteria are based on our own conscious choices, some on insights into our subconscious preferences gathered from our social media trail… This ultimately leads to the stratification of society into "interest groups", each being fed a different information diet, each assembling its own version of reality from the information that gets through the filters.

This “freedom of association” is not a new phenomenon, to be sure; it’s the emerging totality of it that might be transmogrifying one's freedom into a self-imposed exile.

P.S. A recent article in The Wall Street Journal by Evgeny Morozov highlights another danger of so-called "smart gadgets" - surrendering our ability to make mistakes. The smart gadgets are giving a new twist to social engineering attempts, as "a number of thinkers in Silicon Valley see these technologies as a way not just to give consumers new products that they want but to push them to behave better." Of course, this implies that "better" is a well-understood universal concept; yet it has been proven - time and again - that the road to hell is paved with good intentions...


Fighting back with data!

A highly publicized fight between the NY Times' John Broder and Tesla Motors CEO Elon Musk is destined to become a case study in how data became not only a "corporate asset" but a weapon of choice in protecting the firm.

After test-driving a Tesla Model S, the NY Times journalist John Broder published a review, "Stalled Out on Tesla's Electric Highway", claiming that the car's poor performance under cold weather conditions led to it being towed before it reached its destination. Instead of issuing the usual assurances of "confidence in the quality of our product" and promises to "thoroughly review the incident", the Tesla Motors CEO fought back with data, first calling Broder's review a fake on Twitter, and then supporting his claim with extensive data in the form of graphs and tables in the article "The most peculiar test drive", posted on the corporate blog on February 13, 2013.

Turns out that Tesla Motors had established a comprehensive data collection and analysis program: distance, time spent at charging stations, the speed and geographical location of every vehicle used by the media at any given moment - tons and tons of machine-generated data - were recorded and stored in Tesla's corporate database. The data clearly show that the car never "ran out of battery", and cast doubt on several other claims made by John Broder.
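To illustrate how such machine-generated telemetry can settle a factual dispute, here is a minimal sketch. The CSV columns and sample numbers are entirely made up for illustration; Tesla's actual logging schema is not public in this form.

```python
import csv
import io

# Hypothetical telemetry log: invented column names and values, sketching the
# kind of per-vehicle record that could be replayed to rebut a review.
telemetry_csv = """timestamp,speed_mph,battery_pct,charger_minutes
2013-02-08T09:00,0,90,0
2013-02-08T10:00,65,74,0
2013-02-08T11:00,54,58,47
2013-02-08T12:00,60,41,0
"""

rows = list(csv.DictReader(io.StringIO(telemetry_csv)))

# With a complete log, "did the car ever run out of battery?" is no longer a
# matter of recollection -- it is a one-line query over the recorded data.
min_battery = min(int(r["battery_pct"]) for r in rows)
ran_out = (min_battery == 0)
```

Hand-written notes can be argued with; a timestamped log of every sample cannot, which is precisely what made this rebuttal so effective.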

P.S. Mr. Broder has since posted a response, but his "recollections", "memorized facts" and hand-written notes look terribly suspicious and woefully inadequate against the cold hard numbers presented by Elon Musk.


An excellent bit of advice from the trenches: Startup DNA

Yevgeniy Brikman, a staff engineer at LinkedIn who has had "front row seats at very successful start-ups" (LinkedIn, TripAdvisor), shares his observations and insights in a few (well, 106) slides here.

A very interesting insight into minimizing the "trial and error" cycle with dynamic languages (#35) and development methodologies/frameworks (#37): shortening the feedback loop allows for earlier maturity. Or, as he puts it, quoting Jeff Atwood (of StackOverflow fame), "Speed of iteration beats quality of iteration". Of course, without discipline and the support of well-defined processes and frameworks, speed alone could be a runaway train-wreck 🙂

The second observation that struck a chord with me is on slide #50: "If you cannot measure it, you cannot fix it". [NB: Of course, this has a long history (and an even longer attribution list) of being said at different times in various contexts by Lord Kelvin, Bill Hewlett, Tom Peters and Peter Drucker.] The advice to "Measure Everything" should not be taken too far, though:

"Not everything that counts can be counted, and not everything that can be counted counts. "  Albert Einstein

but in the context of the presentation Yevgeniy's advice ought to be taken to heart: collect server metrics, database metrics, client-side metrics, profile metrics, activity metrics, bug metrics, build metrics, test metrics etc.!
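The "measure everything" advice can be sketched in a few lines: instrument the code so that every call leaves a measurement behind. This is a minimal, assumption-laden sketch (the `metrics` store and `timed` decorator are invented names; a real system would ship these numbers to a metrics backend rather than keep them in memory).

```python
import time
from collections import defaultdict

# In-memory metrics store -- illustrative only; real systems would export
# these measurements to a monitoring backend.
metrics = defaultdict(list)

def timed(fn):
    """Decorator: record the wall-clock duration of every call to fn."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            metrics[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@timed
def build_report(n):
    # A hypothetical unit of work worth measuring.
    return sum(range(n))

build_report(10_000)
calls_recorded = len(metrics["build_report"])
```

Once the habit is in place, "If you cannot measure it, you cannot fix it" stops being a slogan - the numbers are simply there when a fix is needed.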

And the last (but not least) observation on sharing is arguably the most important one (slides #93-102).

"The best way to learn is to teach" -

Frank Oppenheimer


Application Lifecycle Management is coming of age

Software-intensive systems have a lot in common with humans – they are born, mature and die. They even sometimes come back from the dead, or simply linger around scaring the daylights out of everyone who comes in contact with them... To minimize one's chances of inadvertently releasing such monsters into the wild, one should adopt a holistic point of view – that of Application Lifecycle Management.

Wikipedia defines ALM as "... a continuous process of managing the life of an application through governance, development and maintenance. ALM is the marriage of business management to software engineering made possible by tools that facilitate and integrate requirements management, architecture, coding, testing, tracking, and release management."

I can't say it is a novel concept - every organization is already doing all of these things, either by design or by accident, with the majority falling into the giant "in-between" void. The key is tight integration between three key areas – governance, development and operations.

To address the issue, many a vendor came up with ALM tools - sometimes bundled, oftentimes integrated - into a suite. Wikipedia lists over 40 "products" ranging from full-blown suites to assemblies of specific tools, both commercial and free open source. Gartner's MarketScope mentions 20 leading vendors with ALM suite offerings, of which 8 got a "Positive" rating, while IBM's got the only "Strong Positive". Forrester's Wave for ALM lists 7 vendors in the "strong" segment, with additional marks for market presence (IBM, HP and Microsoft leading the "big" guys, and CollabNet, Atlassian and Rally Software leading the smaller-vendor pack).

The ALM offerings differ in degree of completeness, degree of coherence between the tools, and the extensibility model provided. Some of the more integrated offerings come in a variety of flavors such as SaaS or on-premises installations, with numerous options to complement either. And then there is the price tag to consider which, as with everything that purports to address enterprise-wide issues, is not insignificant – ranging from tens of thousands of dollars to a couple of millions (and then some), with additional costs for infrastructure, operations and maintenance. Still, there is solid evidence that, under the right circumstances, these investments can and do pay off. Applying ALM principles to an Enterprise Integration and/or software development project can significantly improve the quality of the delivered system and positively affect the schedule.

The ALM processes fall into 5 domains:

  1. Requirements Definition and Management
  2. Quality and Build Management (including test case management)
  3. Software Change and Configuration Management
  4. Process Frameworks and Methodology
  5. Integration Across Multiple AD (Application development) Tools

An integrated suite with a hefty price tag must address all of these domains to be worth consideration; and for the best-of-breed route, integration considerations are of paramount importance in order to realize ALM's potential. One such important consideration, for example, is integrated QA (either that or the ability to integrate with a QA suite).

So far, only two vendors offer fully integrated (all 5 domains), end-to-end, technology-neutral ALM suites – the IBM Rational Jazz Platform and HP ALM. The rest are either very much technology-specific (such as Microsoft TFS) or stop short of providing some vital functionality (e.g. the issue and project tracker Jira does not address requirements management while FogBugz does, and neither comes close to providing test management functionality; both provide a robust extensibility model to amend this with third-party integrations). I am going to elaborate on the selection criteria and the process of finding "the best-fit ALM" solution in follow-up posts. 🙂
