All Bits Considered: data to information to knowledge

14May/13

A Brief History of Big Data

How did the data grow so BIG? Several trends appear to have finally converged:

  • data storage went from stone tablets to animal skins to paper to HDDs to...
  • the ability to process information went from memorization to computer-aided recall
  • humans are no longer the biggest data producers in the Universe

Gil Press traces the origins of Big Data - or at least a premonition thereof - to 1946 in a great article in Forbes - a fascinating read!

I think that (with a bit of a stretch) it could go back all the way to Marcus Tullius Cicero, who lived in the first century B.C.

"Times are bad. Children no longer obey their parents, and everyone is writing a book." 

It's the book part I am referring to  🙂

26Apr/13

Meditations on the Nature of Data: the first three days :)

Meditating on John Wheeler's "It from Bit" insight, I came up with what really could have been going on during the first three days of Creation... 🙂

 

1. In the beginning Information Architect created the database and the model.

2. Now the database was formless and empty, darkness was over the surface of the deep, and the Spirit of Backus-Naur was hovering over the databases.

3. And Architect said: "Let there be data", and there were data.

4. And Information Architect saw that the data were good, and he separated data from metadata.

5. Architect called the data "data," and the metadata he called "metadata." And there was evening, and there was morning - the first day.

6. And Information Architect said, "Let there be a vault between the data to separate datum from datum." And it was so.

7. So the Architect made the vault and separated the datum under the vault from the datum above it.

8. Architect called the vault "database schema." And there was evening, and there was morning - the second day...

9. And Information Architect said, "Let the data outside the schema be gathered to one place, and let storage appear." And it was so. 

10. Architect called the storage "data storage" and the gathered data he called "data logs". And Architect saw that it was good.

11. Then Information Architect said, "Let the data logs and databases produce meaningful information according to their various kinds of context." And it was so.

12. The data combined with context produced information: databases bearing reports according to their kinds and data logs bearing reports according to their kinds. And Information Architect saw that it was good.

13. And there was evening, and there was morning - the third day.

 

8Apr/13

New Meaning of “Investing in Your Health”: an idea for a Health Insurance Exchange

As states race to implement the Health Insurance Exchanges mandated under the Affordable Care Act, I wonder whether they have chosen the wrong model - that of an overseer, an information provider and a mediator... Maybe we could borrow a paradigm from the stock exchange market?
Some of the major hurdles facing Health Insurance Exchange implementation include:
  • the inherent complexity of the endeavor
  • insufficient experience on the states' part in operating exchanges (as opposed to the financial industry)
  • the need to attract a sufficient number of participants to become efficient and self-sustaining
It is a common pattern in software engineering to deal with complexity by introducing an abstraction layer, and the financial industry did just that with the concepts of Exchange-Traded Funds and Mutual Funds, which ease the complexity of picking individual stocks. I believe that a Health Insurance Exchange might have much more in common with an exchange than with insurance, and that the very same concepts are applicable here.
Imagine health insurance pools structured in a way similar to mutual funds/ETFs according to some predefined criteria, each designed to cater to a certain category of consumer (again, the analogy is industry sector funds). A pool is then sold as units of insurance to consumers through the exchange.
The role of the Exchange operators would be that of mutual fund managers:
  • design portfolios of insurance plans and sell units of insurance to consumers (after proper validation and categorization)
  • handle fund-to-fund exchange (when the customer's situation changes)
  • process refunds and assess charges
  • provide apples-to-apples comparison
  • etc.
The insurer and the insured would be decoupled: the former would roll out insurance plans, the latter would buy as much or as little insurance as they need, and the state's Health Insurance Exchange would provide a platform for "health insurance pool" comparison and rating, and handle financial transfers (including state/federal subsidy portions).
In a commercial twist to the idea: since the funds have an expiration date, the insurers might even pay dividends to the unit holders based upon the unused portion of the plan (insert actuarial voodoo here :).
There might even be a secondary market where investors buy units from customers (the original unit buyers) if they can reasonably expect a positive return due to under-utilization of the plan.
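To make the mechanics concrete, here is a minimal sketch (in Python) of how such pools and units might be modeled. Every class, field and figure below is hypothetical, invented purely to illustrate the "insurance pool as mutual fund" idea; the actuarial details are left out entirely.

```python
from dataclasses import dataclass
from datetime import date

# Toy model of the "insurance pool as mutual fund" idea; all names and numbers are hypothetical.

@dataclass
class InsurancePool:
    """A portfolio of insurance plans packaged like a mutual fund / ETF."""
    name: str
    category: str            # consumer category the pool caters to (cf. industry sector funds)
    unit_price: float        # price of one unit of insurance
    expires: date            # pools expire, like a plan year
    units_outstanding: int = 0

@dataclass
class Holding:
    pool: InsurancePool
    units: int

class Exchange:
    """The exchange operator acting as a 'fund manager'."""

    def __init__(self) -> None:
        self.holdings = {}   # consumer id -> list of Holding objects

    def buy_units(self, consumer_id, pool, units, subsidy=0.0):
        """Sell units of a pool to a consumer; return the out-of-pocket cost after any state/federal subsidy."""
        pool.units_outstanding += units
        self.holdings.setdefault(consumer_id, []).append(Holding(pool, units))
        return max(units * pool.unit_price - subsidy, 0.0)

    def exchange_units(self, consumer_id, from_pool, to_pool, units):
        """Fund-to-fund exchange when the customer's situation changes; return the price difference owed (or refunded)."""
        for holding in self.holdings.get(consumer_id, []):
            if holding.pool is from_pool and holding.units >= units:
                holding.units -= units
                from_pool.units_outstanding -= units
                to_pool.units_outstanding += units
                self.holdings[consumer_id].append(Holding(to_pool, units))
                return units * (to_pool.unit_price - from_pool.unit_price)
        raise ValueError("consumer does not hold enough units of the source pool")

if __name__ == "__main__":
    exchange = Exchange()
    basic = InsurancePool("Basic Care 2014", "young-single", 120.0, date(2014, 12, 31))
    family = InsurancePool("Family Care 2014", "family", 310.0, date(2014, 12, 31))
    print(exchange.buy_units("consumer-1", basic, 12, subsidy=400.0))   # out-of-pocket cost
    print(exchange.exchange_units("consumer-1", basic, family, 12))     # top-up owed after a life change
```

The point of the sketch is the decoupling: the insurer only defines what goes into a pool, while the exchange handles unit sales, subsidies and fund-to-fund moves.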
7Mar/13

The fine line between “Big Data” and “Big Brother”

There was never a lack of desire on the part of businesses or governments to collect as much data as possible; it was the lack of capability that always got in the way. With the advent of "Big Data" technology, that barrier has just been lowered.

Monitoring employees' interactions in minute detail to analyze patterns and get ideas for productivity improvements is not illegal per se... but it takes us one step further down the proverbial slippery slope.

A recent article in The Wall Street Journal by Rachel Emma Silverman highlights the indisputable advantages but somehow glosses over the potential dangers:

As Big Data becomes a fixture of office life, companies are turning to tracking devices to gather real-time information on how teams of employees work and interact. Sensors, worn on lanyards or placed on office furniture, record how often staffers get up from their desks, consult other teams and hold meetings.

Businesses say the data offer otherwise hard-to-glean insights about how workers do their jobs, and are using the information to make changes large and small, ranging from the timing of coffee breaks to how work groups are composed, to spur collaboration and productivity.

 

[06.17.2013] Here is a blog post by Michael Walker addressing the very same issues, with the benefit of hindsight after the revelations about the PRISM surveillance program: http://www.datasciencecentral.com/profiles/blogs/privacy-vs-security-and-data-science

2Mar/13

Data as a weapon of choice

I blogged last week about Tesla Motors' CEO fighting a shoddy review of the car his company manufactures. This week the news brought yet another story of a CEO using data to justify a very public and controversial decision: banning employees from working from home.

How did CEO Marissa Mayer decide to make such a controversial decision? According to a source, the only way Mayer is comfortable making any decision: with the help of data.

22Feb/13

Diverging realities

The ability to filter information based upon one's personal preferences and tastes has never been greater, even as the variety of available information explodes. Therein lies a problem: rarely, if ever, do we choose to be exposed to something that we do not want to hear, view or read, and as a result we are surrounded by "yes-information" (by analogy with the "yes-man" definition from Merriam-Webster).

Some of these filtering criteria are based on our own conscious choices, some on insights into our subconscious preferences gathered from our social media trail… This ultimately leads to stratification of society into "interest groups", each being fed a different information diet, each assembling its own version of reality from the information that gets through the filters.

This “freedom of association” is not a new phenomenon, to be sure; it’s the emerging totality of it that might be transmogrifying one's freedom into a self-imposed exile.

P.S. A recent article in The Wall Street Journal by Evgeny Morozov highlights other dangers of so-called "smart gadgets", such as surrendering our ability to make mistakes. The smart gadgets are giving a new twist to social engineering attempts, as "a number of thinkers in Silicon Valley see these technologies as a way not just to give consumers new products that they want but to push them to behave better." Of course, this implies that "better" is a well-understood universal concept; yet it has been proven, time and again, that the road to hell is paved with good intentions...

18Feb/13

Fighting back with data!

A highly publicized fight between NY Times journalist John Broder and Tesla Motors CEO Elon Musk is destined to become a case study in how data became not only a "corporate asset" but a weapon of choice in protecting the firm.

After test-driving a Tesla Model S, the NY Times journalist John Broder published a review, "Stalled Out on Tesla's Electric Highway," claiming that the car's poor performance under cold weather conditions led to it being towed before it reached its destination. Instead of issuing the usual statement of "confidence in the quality of our product" and promises to "thoroughly review the incident," the Tesla Motors CEO fought back with data: first calling Broder's review a fake on Twitter, and then supporting his claim with extensive data in the form of graphs and tables in the article "A Most Peculiar Test Drive," posted on the corporate blog on February 13, 2013.

It turns out that Tesla Motors had established a comprehensive data collection and analysis program: distance, time spent at charging stations, and the speed and geographical location of every vehicle used by the media at any given moment - tons and tons of machine-generated data - were recorded and stored in Tesla's corporate database. The data clearly show that the car never "ran out of battery," and they make several other claims by John Broder look dubious at best.

P.S. Mr. Broder has since posted a response, but his "recollections," "memorized facts" and hand-written notes look terribly suspicious and woefully inadequate against the cold hard numbers presented by Elon Musk.

16Feb/13

An excellent bit of advice from the trenches: Startup DNA

Yevgeniy Brikman, a staff engineer at LinkedIn who has had "front row seats at very successful start-ups" (LinkedIn, TripAdvisor), shares his observations and insights in a few (well, 106) slides here (http://www.slideshare.net/brikis98/startup-dna).

A very interesting insight into minimizing the "trial and error" cycle with dynamic languages (#35) and development methodologies/frameworks (#37): shortening the feedback loop allows for earlier maturity. Or, as he puts it, quoting Jeff Atwood (of StackOverflow fame), "Speed of iteration beats quality of iteration." Of course, without discipline and the support of well-defined processes and frameworks, speed alone could be a runaway train wreck 🙂

A second observation that struck a chord with me is on slide #50: "If you cannot measure it, you cannot fix it." [NB: Of course, this has a long history (and an even longer attribution list) of being said at different times in various contexts by Lord Kelvin, Bill Hewlett, Tom Peters and Peter Drucker.] The advice to "Measure Everything" should not be taken too far, though:

"Not everything that counts can be counted, and not everything that can be counted counts." - Albert Einstein

but in the context of the presentation Evgeniy's advice ought to be taken to heart: collect server metrics, database metrics, client-side metrics, profile metrics, activity metrics, bug metrics, build metrics, test metrics, etc.!
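As a toy illustration of the "measure everything" advice (my own sketch in Python, not anything from the slides), even a few lines are enough to start recording counters and timings around existing code before graduating to a real metrics pipeline such as statsd or Graphite:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# A toy in-process metrics registry; in practice these would be shipped to a metrics backend.
counters = defaultdict(int)
timings = defaultdict(list)

def count(name: str, value: int = 1) -> None:
    """Increment a named counter (activity, bug, build, test metrics...)."""
    counters[name] += value

@contextmanager
def timed(name: str):
    """Record how long a block of code takes (server, database, client-side metrics...)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name].append(time.perf_counter() - start)

if __name__ == "__main__":
    count("build.success")
    with timed("db.query"):
        time.sleep(0.05)   # stand-in for a real database call
    print(dict(counters), {name: round(sum(t), 3) for name, t in timings.items()})
```

The value is not in the snippet itself but in the habit: once the hooks are in place, every new feature gets measured by default.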

And the last (but not least) observation on sharing is arguably the most important one (slides #93-102).

"The best way to learn is to teach" -

Frank Oppenheimer

11Feb/13

Application Lifecycle Management is coming of age

Software-intensive systems have a lot in common with humans – they are born, mature and die. Sometimes they even come back from the dead, or simply linger around scaring the daylights out of everyone who comes in contact with them... To minimize one's chances of inadvertently releasing such monsters into the wild, one should adopt a holistic point of view – that of Application Lifecycle Management.

Wikipedia defines ALM as "... a continuous process of managing the life of an application through governance, development and maintenance. ALM is the marriage of business management to software engineering made possible by tools that facilitate and integrate requirements management, architecture, coding, testing, tracking, and release management."

I can't say it is a novel concept - every organization is already doing all of this, either by design or by accident, with the majority falling into the giant “in-between” void. The key is tight integration between three key areas – governance, development and operations.

To address the issue, many a vendor has come up with ALM tools - sometimes bundled, oftentimes integrated - into a suite. Wikipedia lists over 40 “products” ranging from full-blown suites to assemblies of specific tools, both commercial and free open source. Gartner's MarketScope mentions 20 leading vendors with ALM suite offerings, of which 8 got a “Positive” rating and IBM's got the only “Strong Positive”. The Forrester Wave for ALM lists 7 vendors in the “strong” segment, with additional marks for market presence (IBM, HP and Microsoft leading the “big” guys, and CollabNet, Atlassian and Rally Software leading the smaller-vendor pack).

The ALM offerings differ in degree of completeness, degree of coherence between the tools, and the extensibility model provided. Some of the more integrated offerings come in a variety of flavors such as SaaS or on-premises installations, with numerous options to complement either. And then there is the price tag to consider which, as with everything that purports to address enterprise-wide issues, is not insignificant – ranging from tens of thousands of dollars to a couple of million (and then some), with additional costs for infrastructure, operations and maintenance. Still, there is solid evidence that under the right circumstances these investments might and do pay off. Applying ALM principles to an Enterprise Integration and/or software development project can significantly improve the quality of the delivered system and positively affect the schedule.

The ALM processes fall into 5 domains:

  1. Requirements Definition and Management
  2. Quality and Build Management (including test case management)
  3. Software Change and Configuration Management
  4. Process Frameworks and Methodology
  5. Integration Across Multiple AD (Application development) Tools

An integrated suite with a hefty price tag must address all of these domains to be worth considering; and for the best-of-breed route, integration considerations are of paramount importance in order to realize the ALM potential. One such important consideration, for example, is integrated QA (either that, or the ability to integrate with a QA suite).

So far, only two vendors offer fully integrated (all 5 domains), end-to-end, technology-neutral ALM suites – the IBM Rational Jazz Platform and HP ALM. The rest are either very much technology-specific (such as Microsoft TFS) or stop short of providing some vital functionality (e.g. the issue and project tracker Jira does not address requirements management while FogBugz does, and neither comes close to providing test management functionality; both provide a robust extensibility model to amend this with third-party integrations). I am going to elaborate on the selection criteria and the process of finding “the best-fit ALM” solution in follow-up posts. 🙂

23Jan/13

OBIEE – what’s in a name?

The unwieldy acronym OBIEE stands for Oracle Business Intelligence Enterprise Edition.

The offering is a loosely coupled assembly of a dozen-plus components (eight, by some other counts), both acquired and homegrown. Its beginnings go back 12 years to the nQuire product, which first became Siebel Analytics only to be reborn as OBIEE after Oracle's acquisition of Siebel in 2005 and then of Hyperion in 2007. The story does not end there, as Oracle continues its acquisition spree with the recent (2012) purchase of Endeca for its e-commerce search and analytics capabilities.

The current intermediate result is a solid contender for an Enterprise BI Platform, firmly placed in the top right of Gartner's Magic Quadrant along with MicroStrategy, Microsoft, IBM, SAP and SAS.

Oracle's page for Oracle Business Intelligence Enterprise Edition 11g summarizes the suite's functionality in the following terms (a direct quote, with the claims about “cost reduction” and “ease of implementation” left TBD):

• Provides a common infrastructure for producing and delivering enterprise reports, scorecards, dashboards, ad-hoc analysis, and OLAP analysis
• Includes rich visualization, interactive dashboards, a vast range of animated charting options, OLAP-style interactions and innovative search, and actionable collaboration capabilities to increase user adoption

And – by and large - it does deliver on the promises.

One of the important features for the enterprise is integration with Microsoft Office (Word, Excel and PowerPoint). What Oracle has dubbed “Spatial Intelligence via Map-Based Visualization” represents a decent integration of mapping capabilities (not quite ESRI ArcGIS, but a nice bundled option nevertheless – and no third-party components!).

Among other things to consider is tighter integration with Oracle's ERP/CRM ecosystems (no surprises here, as every vendor sooner or later tries to be everything for everybody); for organizations with a significant Oracle presence this would be an important selling point.

Having been redesigned with SOA principles in mind, OBIEE lends itself nicely to integration into a SOA-compliant infrastructure. Most organizations choose Oracle Fusion Middleware for the task due to its greater coherence with OBIEE and the rest of Oracle's stack, but it is by no means a requirement – OBIEE can run with any SOA infrastructure, including open source ones.

For mobile BI capabilities, OBIEE offers Oracle Business Intelligence Mobile (for OBIEE 11g), currently only for Apple's devices – iPad and iPhone – downloadable from the Apple iTunes App Store. Most features of OBIEE available in the corporate environment are supported on mobile devices, including geospatial data integration.

NB: Predictive modeling and data mining are not part of OBIEE per se (it cannot even access the data mining functions built into Oracle's dialect of SQL!), but they can be surfaced through it. The Oracle Advanced Analytics platform represents Oracle's offering in this market.
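For what it's worth, here is a rough sketch of what "surfacing" those in-database functions looks like outside of OBIEE: Oracle SQL exposes scoring functions such as PREDICTION and PREDICTION_PROBABILITY that apply a previously built mining model directly in a query. The connection details, table and model name below are hypothetical placeholders.

```python
import cx_Oracle  # assumes the Oracle client libraries are installed

# Hypothetical credentials, table and mining model; placeholders for illustration only.
conn = cx_Oracle.connect("analytics", "secret", "dbhost/orcl")
cur = conn.cursor()

# PREDICTION() and PREDICTION_PROBABILITY() score rows in the database
# using a mining model (here, a hypothetical CUST_CHURN_MODEL) built beforehand.
cur.execute("""
    SELECT cust_id,
           PREDICTION(cust_churn_model USING *)             AS churn_flag,
           PREDICTION_PROBABILITY(cust_churn_model USING *) AS churn_prob
      FROM customers
""")
for cust_id, flag, prob in cur:
    print(cust_id, flag, round(prob, 3))

cur.close()
conn.close()
```

The scoring happens entirely inside the database; the results can then be exposed to OBIEE like any other query or view.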

OBIEE ranks second from the bottom in difficulty of implementation (SAS holds the current record); coupled with a relative dearth of expertise on the market and below-average customer support, this should be considered when evaluating OBIEE for adoption in the enterprise.

One interesting twist in the OBIEE story is Oracle's introduction of the Exalytics In-Memory Machine in 2011 – an appliance that integrates OBIEE with other components such as Oracle Essbase and the Oracle TimesTen in-memory database. The appliance trend resurrects the idea of a self-contained system in the new context of an interconnected world, and Oracle fully embraces it with an array of products such as Exadata, Exalogic and now Exalytics. By virtue of coming fully integrated and preconfigured, it supposedly addresses the difficulties of installation and integration – at a price; it is designed to be a turnkey solution for the enterprise, but its full impact (and the validity of the claim) remains to be seen.

So, to sum it up:

Pro:

It is a solid, enterprise-class BI platform with all the standard features of a robust BI suite – reports, scorecards, dashboards (interactive and otherwise), OLAP capabilities, mobile apps, integration with Microsoft Office, and a SOA-compliant architecture. It also includes pre-defined analytics applications for horizontal business processes (e.g. finance, procurement, sales) as well as additional vertical analytical models for specific industries (to help establish a common data model).

Contra:

It is evolving through acquisitions and the integration thereof, which affects coherence and completeness of vision; there are no integrated predictive modeling and data mining capabilities; it ranks rather low on ease of deployment and use, as well as on quality of support; the talent pool is rather shallow (and therefore expensive); with all of this factored in, the TCO could potentially be higher than that of comparable offerings from other vendors.