all bits considered
data to information to knowledge

10Oct/14

Rescuing files from an Iomega NAS drive

I learned that Iomega had become part of Lenovo EMC only after my 2TB Iomega NAS died and I started my search for a way to rescue my precious files. Most of the information I found dealt with resetting/rebooting the NAS as-is, on the assumption that something was wrong with the configuration; it did not help a bit - the lights kept blinking, indicating a permanent "booting" status. It was time to bring in the heavy guns 🙂

I disassembled my four-year-old NAS, took out the Seagate 2TB SATA HDD, and got to work.

First, I needed a docking device to plug the HDD into my computer... but all I had at my disposal was a Windows 7 laptop. So I got myself a SATA-to-USB cable from Amazon - it connected the HDD but somehow failed to power it; Windows did not even recognize the partition. Then I picked up a powered dock (EZ-Dock, EZD-2335) from Fry's Electronics for $20, with a USB/eSATA interface. It worked - the drive sprang to life, and Windows prompted me to format the RAW partition!

[screenshot: format.drive - Windows offering to format the RAW partition]

Not quite what I was looking for.

A bit more searching told me that the Iomega NAS uses a Linux file system (XFS/EXT#). Following this lead I tried a variety of tools that are supposed to make a Linux file system accessible under Windows (Explore2fs, DiskInternals Linux Reader, and some others). None of them mounted the file system, though they did see it as a RAW partition.

Then I decided to use Linux to access the files. There are several ways to run a Linux OS under Windows, but I already had an Ubuntu Linux VM running in Oracle VirtualBox; it had the extensions installed - including USB support (if it did not, I would probably have used a bootable Linux flash drive).
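A quick sanity check before passing the dock through to the guest - run on the Windows host, and assuming the VBoxManage CLI is on the PATH - is to list the USB devices VirtualBox can see:

C:\> VBoxManage list usbhost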

The EZ-Dock was recognized by my Linux VM as "JMicron USB to ATA/ATAPI bridge [0100]" but the partition still refused to mount.

To find the volumes on the system I used the fsarchiver utility. I tried to mount them manually; it failed with an "unrecognizable file system" error.
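It may also help to confirm that the guest actually sees the disk as a block device at all. A minimal sketch, assuming the dock shows up as /dev/sdb (the device name on your system may differ):

[prompt]:~$ lsblk

[prompt]:~$ sudo fdisk -l /dev/sdb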

NB: to install any of these on Ubuntu (in case they are missing), use apt-get with the install option (e.g. [prompt]:~$ sudo apt-get install fsarchiver)

[prompt]:~$ sudo fsarchiver probe

[screenshot: fsarchiver probe output]

So, the drive was configured as RAID (duh! - it was a NAS to begin with... and the fsarchiver utility said as much), and this called for a running RAID subsystem to mount these logical partitions. Enter the mdadm utility. It has to be installed, as it is not a standard part of a desktop Linux distribution, and it asks a few questions along the way - I opted for "No Configuration" every time for Postfix.
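On Ubuntu the installation itself is a one-liner (the Postfix questions come from the mail transport agent that mdadm pulls in for its monitoring alerts):

[prompt]:~$ sudo apt-get install mdadm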

After the installation I ran the following command:

[prompt]:~$ sudo mdadm --assemble --scan

mdadm: /dev/md/hmnhd-TICF8C:1 has been started with 1 drive.

mdadm: /dev/md/0_0 has been started with 1 drive.

And voila - it worked! Both (logical) disks were working as a single unit... moreover, Ubuntu immediately recognized them, and I was able to access the files.
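If your distribution does not auto-mount the assembled array, a minimal manual sketch would be to check its status and mount it read-only to protect the data - assuming the array came up as /dev/md127 (check /proc/mdstat or the mdadm output above for the actual device name):

[prompt]:~$ cat /proc/mdstat

[prompt]:~$ sudo mkdir /mnt/nas

[prompt]:~$ sudo mount -o ro /dev/md127 /mnt/nas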

[screenshot: see.volumes - the assembled volumes visible in Ubuntu]

I hope my happy-ending story can help someone facing the same challenge: the NAS controller failed, but the disk is still working.

10Sep/14

Where the wild data are

There were times when all data were wild... and if they were stored at all, they were committed to the memory of an individual, which tended to fade away. To facilitate this transient storage the data was wrapped in protocols of rhymes and vivid symbolic images; pictures were drawn, stories were told. Then, about 5,000 years ago, writing systems began to develop – ideographs, symbols and, finally, letters. The data was tamed. The letters made up words, the words made up a sentence, and the sentence, hopefully, made sense. The data was written on clay tablets and animal skins, recorded on papyrus, vellum, paper, magnetic and laser disks… We got quite skillful at butchering the data into neatly organized chunks, and at devising ever more sophisticated structures to hold it – scrolls, books, databases.

And then the Internet happened. They say that we are creating more data in a year than in the previous thousand years, that 90% of the world's data has been created in the past two years (though, according to one interpretation of the Law of Information Conservation, we only engage in recycling information redistributed from existing sources). We are swamped with information, and the old, tried, trusted and true approach of organizing information into palatable chunks is no longer working. Facing the information deluge, we are forced to go back to basics – raw data – and find ways to make use of it without forcing it into the Procrustean bed of some structure that might have seemed like a bright idea once. Hence the resurrection of the old idea of hierarchical databases – the ones before the advent of SQL – under the guise of the NoSQL movement, and the much-hyped Big Data... In a sense, Big Data is nothing new; it has been around as long as humanity itself, but just as with the proverbial iceberg, most of it was hidden from our conscious use – which by no means implies that we haven't used it! No, it was always there for us, seeping in from traditions, proverbs, legends – something we use without consciously thinking about it: the gut feeling, the social norms. Modern Big Data merely extends this concept to computers.

And I believe that the data can take care of itself, if only humans stopped telling it what to do – but we do have to arrange the meeting 🙂

Instead of thinking about how to accommodate a new data source or a new data format (e.g. video, mp3 files, text of various degrees of structural and semantic complexity), we humans can let the data figure out how to interpret itself.

A new data format could be analyzed, its structure inferred from background information/metadata, its usage from countless examples of similar (or not so similar) data… This will require enormous computing power, but we are getting close to it with the likes of crowdsourcing, probability scores and machine learning, Hadoop infrastructure, and a variety of NoSQL and RDBMS data stores working together to produce insights from the data in the wild - the data over which we have no control: unreliable, inherently "dirty" data (and the degree of "dirtiness" is itself a valuable piece of information!).

It is nice to have a smart ontology all figured out for the information we are using, but it would be a hundred times nicer not to pay any attention to any given ontology and still be able to make meaningful use of the data!

28Jul/14

Agile Methods for Disposable Software

I had a conversation the other day about agile software development with a friend of mine who is, by his own admission, a “real hardware engineer”. The focus was the Agile Manifesto:

  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

According to my friend, these statements are “either trivial, or naive, or plain wrong, or all of the above”.

NB: He also coined a hilarious term, ADHDDD - Attention Deficit Hyperactivity Disorder (ADHD) Driven Development – and wrote a manifesto worth reading if only for its literary merits 🙂

Of course, as a certified Scrum Master, I beg to differ - but his attitude illustrates how a method (or its perception) can be taken ad absurdum. One of the common themes is that in the past software was "designed to last" while Agile tells the developers "don't think – code"… As any agile practitioner knows, nothing could be further from the truth. Most of the COBOL written in the 1960s is still running – but is it a good thing? Borrowing from my friend's area of expertise, vacuum tubes can work instead of ASICs – and, arguably, with greater transparency (pun intended 🙂 ) – but would you really want them to? Agile development does not preclude solid software architecture – quite the opposite, it demands it! The fundamental quality attributes of the system are defined (and designed!) before a single line of code is written, and then – in close collaboration with the stakeholders – the rest of the requirements are fleshed out.

We are living in the age of disposable software. This is a trade-off we've made to get the "latest and greatest" now - and cheap (preferably free!). Just take a look at how things have progressed since the 1980s - CPUs, storage, RAM - all plummeted in price, while packaged software prices went through the roof... (the Open Source movement only changes the cost model – instead of "paying for software" one begins "paying for support and additional features").

It might appear that we are rushing development a la the Netscape experiment of the '90s, where a barely compiled program was foisted upon unsuspecting customers to debug… but that is a superficial analogy. We have come a long way since those heady days. We have a number of tools and frameworks at our disposal to shorten the development cycles – requirements gathering, build, testing, release – until we reach the nirvana of "continuous build" (and just when we think things cannot get any better, we are ushered into the wondrous world of "continuous deployment").

Is it better – or worse – than a top-heavy process that takes time to spell out every requirement in minute detail? The answer is, of course, it depends. Working software delivered just in time for the market – albeit buggy – trumps comprehensive but obsolete software every time!

“For to him that is joined to all the living there is hope: for a living dog is better than a dead lion.”   Ecclesiastes 9:4

20Mar/14

Just say NO to data moochers!

The other day I stopped by a Great Clips salon to get a haircut. I was greeted with "Hi! What's your phone number?".  Then the following dialog ensued:

- Well, that's a bit personal, don't you think?

- I need it to enter into the computer! - the lady looked a bit pensive.

- I could give you my name. It's Alex. - I didn't want to cause any trouble for her; I just wanted a haircut.

- Is this the name you're usually using here? I can't find you in the computer! - she sounded annoyed.

That was it for me. I muttered my thanks, and left the premises with a firm intention to boycott Great Clips from now on.

I got my haircut from a friendly neighborhood salon down the road - no questions asked.

Everybody tracks everybody nowadays. Loyalty cards, online cookies, and single sign-on apps seem to be proliferating with the speed of electrical current. And I get it - Facebook collects information in exchange for providing me with a valuable service, Safeway collects my information in exchange for giving me a discount, etc. All this is spelled out upfront, with a clear understanding of what the transaction brings to both parties. But why would Great Clips expect me to share my personal information with them for free?

On the other hand, I am wondering whether there is such a thing as "data addiction", and if so - what are the health implications for a company that got into the habit? After all, a mix of data and predictive models can be, well, unpredictable.

Gaining insight from the data is great... if the data, the assumptions, and the predictive models are all correct. And this is a big IF.

25Feb/14

How big is Big Data?

There is no shortage of definitions for the ‘Big Data’ buzzword. Usually it is described in multiples of “V” – volume, velocity, variety (plug in your favorite data-related problem).

I believe that Big Data is defined only by our ability to process it.

There has always been Big Data, ever since the time when it was chiseled into stone, one symbol at a time.

We were talking about big data when it was written on papyri, vellum, and paper; we have invented libraries, the Dewey Decimal system, Hollerith cards, computers – all in the name of processing ever-increasing volumes of data, ever faster. Once upon a time a terabyte of data was "unimaginably big" (hence a company named "Teradata"); now a petabyte appears to be the "BIG" yardstick, only to be replaced by exabytes, zettabytes etc. in the near future. Instead of batch processing we are moving to real time, and, as with every bit of digital information, we are still storing numbers that to us represent text, video, sound and - yes - numbers.

Electronic data processing has come full circle – from unstructured sequential files to structured hierarchical/network/relational databases to NoSQL graph/document databases and Hadoop's sequential file processing.

Each round brings us closer to "analog data" – data that does not have to be disassembled into bits and bytes to be understood and analyzed: the raw data.

Crossing the artificial chasm between digital and analog data will be the next frontier.

 

Ecclesiastes 1:9-11:
What has been is what will be,
and what has been done is what will be done,
and there is nothing new under the sun.

Is there a thing of which it is said,
“See, this is new”?
It has been already
in the ages before us.

There is no remembrance of former things,
nor will there be any remembrance
of later things yet to be among those who come after.

9Sep/13

Data Scientists or… Psychohistorians?

Before the Big Data, social Data Science/Data Mining and Machine Learning there was … Psychohistory!

The concept was introduced in 1951 by Isaac Asimov in his monumental sci-fi trilogy "The Foundation", and it correlates very closely with this "new" phenomenon of statistical modeling of social interactions.

Proof? The definition from the Encyclopedia Galactica quoted at the beginning of the 4th chapter of The Foundation Trilogy:

“Gaal Dornick, using non-mathematical concepts, has defined psychohistory to be that branch of mathematics which deals with the reactions of human conglomerates to fixed social and economic stimuli …

… Implicit in all these definitions is the assumption that the human conglomerate being dealt with is sufficiently large for valid statistical treatment. The necessary size of such a conglomerate may be determined by Seldon’s First Theorem which… A further necessary assumption is that the human conglomerate be itself unaware of psychohistoric analysis in order that its reactions be truly random…

The basis of all valid psychohistory lies in the development of the Seldon Functions which exhibit properties congruent to those of such social and economic forces as …”

 

Asimov correctly points out the boundary conditions of this statistical analysis – for it to work, the society must be unaware of the analysis taking place and/or of how it works, as this would skew the distribution curve. After all, if people stop clicking on those links and like-me buttons, and stop sharing their information (or worse – start feeding in garbage data), all these sophisticated models will go haywire.

To continue the analogy, the "Mule" character represents the "Black Swan" event that invalidates the entire premise based on the normal distribution.

23Jul/13

On code quality

It takes approximately the same amount of time to craft good-quality code as it does to produce bad code... but with a solid code base you will be light years ahead in terms of maintainability, extensibility, agility, and virtually every other quality attribute of the system.

Code quality does not necessarily translate into quality software, but there can be no quality software without quality in the underlying code.

My, admittedly rhetorical, question is - why waste your time creating BAD code?!

12Jul/13

Supersede vs Supercede: a humble proposition

Merriam-Webster authoritatively informs us that "supersede" is the only correct spelling, and that "Supercede has occurred as a spelling variant of supersede since the 17th century, and it is common in current published writing. It continues, however, to be widely regarded as an error." Fair enough.

I merely propose to adopt the word "superCede" with a new semantic load...

"Cedere" means "to yield to, give way for", which could lead to

SUPERCEDE meaning "to give more than asked for"

e.g.

"I supercede the power!"     "I most gladly supercede my responsibility" 🙂

20Jun/13

Finding your natural habitat @work: It’s what you do in your free time that counts

When it comes to computer programming, there are two broad categories of companies - those that produce IT products and those that consume IT products (with countless variations in between).

The skill sets for each are somewhat similar, yet there is enough difference to apply different criteria when interviewing candidates:

  • A company that lives and breathes technology will be looking for candidates with similar traits. 
  • A company whose primary business is anything but technology - building materials, food, clothes - will be looking for a person with a primary interest in the business, complemented by IT savvy.

So, a question - what do you do in your free time? - helps to clarify your "ideal environment", your natural working habitat.

Additional questions might be: what are your primary sources of information? What site do you open first thing in the morning? Is it the Wall Street Journal or Slashdot?

16May/13

Big Data vs. Lots of Data

A short presentation intriguingly titled "Top Five Questions to Answer Before Starting on Big Data" caught my attention. There is a lot of noise around the "Big Data" phenomenon, already proclaimed to be The Next Big Thing. Quite a few folks disagree, including Stephen Few of Perceptual Edge, who published a paper titled "Big Data, Big Ruse" (pdf).

Don't get me wrong - I do believe that Big Data IS a big thing, and that its introduction will bring about a proverbial paradigm shift (another arguably over-used term of the last decade). Yet many people, while talking about Big Data, have a rather vague idea of what it is, and many believe that it is equal to "Lots of Data" that underwent a qualitative transformation a la Karl Marx ("Merely quantitative differences, beyond a certain point, pass into qualitative changes." --Karl Marx, Das Kapital, Vol. 1).

Sorry to contradict some aficionados of dialectical materialism, but... it ain't so. Which is exactly the point of slide #3 in the aforementioned deck.

The current incarnation of Big Data is mostly about machine-generated data. There might be lots of nuances and exceptions to this affirmation, but humans simply cannot match a machine's ability to generate data 24/7. True, much of this data is generated in response to human activity (e.g. clickstreams), but even then it is enhanced with machine-generated information (e.g. date/time stamps, geocoding etc.); a single tweet can generate additional kilobytes of contextual data which enhance the semantic value of the tweet itself - to the business, not the tweeter, of course!... Say, was it tweeted from a mobile device or a laptop? Which operating system? What browser/application? What time of day/night? Geographic location? Time elapsed between the first syllable and the last? Language used? And so on and so on.

This is what Big Data is all about. And this is why the question on slide #3 - "Do you have a Big Data problem or just a Lots of Data problem?" - comes right after "What do you need to know?" on slide #2.

Nine times out of ten, people talking about Big Data are referring to the data locked in their enterprise databases, documents and web pages; some of it might even include metadata. But the machine-generated component - the proverbial 800-pound gorilla in the room - flies under the radar. The enterprise data - the domain of BI - is but the tip of the iceberg that is Big Data.