According to a Digital Universe study, the global data supply hit 2.8 zettabytes in 2012, but only 0.5% of that is used for any kind of analysis.  This is incredibly interesting to me when we look at just how much data that is.  The entire Library of Congress (if and when it is fully digitized) would be around 5000 petabytes, according to Matt Raymond in a 2009 post at the Library of Congress site.  Or, to look at it another way, if we were to take all of that data and put it on 1 TB drives, we would have a stack of HDDs 107 km high, or a volume similar to filling the first 3 floors of the Empire State Building with hard drives!

The reality is…that is a lot of data, and the absolute truth is…most of it just sits there, doing nothing.  I was looking at an infographic that stated that cost was the primary reason small businesses do not enter the Big Data realm, yet 41% of those same companies thought that it would increase profitability.  I recently built myself an interesting experiment to sort through the 9 TB of data that I have here at home, more specifically the 2 TB of PDFs that I have accumulated in 14 years of being online.

I started by trying out the Pentaho suite earlier this year during a 30-day free trial and found that organizing and sorting my data was not as hard as I had imagined.  But I am a truly cheap bastard and was not willing to pay the license fees, so I jumped over to an open source product called Talend Open Studio for Big Data.  It is not as robust as the full suite from Talend, but for my purposes it has performed beautifully, and I look forward to integrating the setup with my new cheap NAS in the coming week.

What can I do with this setup?  I have configured it to look through the PDFs that I have and analyze them for keywords, topics, and themes, to aid in my goal of having interesting new ideas for this blog and topics of conversation with my clients every day.  I now operate my own little mini Google.  I know that I could be doing far more with this setup than I currently am…but this was a tinkering experiment, and one that didn’t cost me anything but the time it took to set up the configuration.
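
For anyone curious what keyword analysis like this boils down to, here is a minimal sketch in plain Python.  To be clear, this is not how Talend does it and not my actual configuration; it is a stand-in that assumes the text has already been extracted from a PDF.  The stop-word list, the sample text, and the function name `top_keywords` are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "into",
              "is", "it", "that", "for", "on", "with"}

def top_keywords(text, n=5):
    """Return the n most frequent words in text, ignoring stop words."""
    # Lowercase the text and pull out runs of letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    # Skip stop words and very short words, then count what is left.
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)

# Hypothetical sample standing in for text extracted from one PDF.
sample = ("Big data is data that sits unused. Analysis turns data "
          "into insight, and insight into profit.")
print(top_keywords(sample, 2))
```

Run over a whole folder of extracted text, the same counting gives a rough list of candidate topics per document, which is about as simple as this kind of analysis gets.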

Can anyone do this?  Absolutely.  The benefit, for me at least, is that I no longer have to slog through my memory to recall which book held an interesting statistic or theme for discussion.
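
The "which book was that in?" lookup can be sketched with a tiny inverted index, which maps each word to the documents that contain it.  Again, this is only an illustration under my own assumptions, not what Talend does internally: the file names, the text, and the helper names `build_index` and `search` are all hypothetical.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index[word].add(name)
    return index

def search(index, query):
    """Return sorted names of documents containing every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    # Keep only the documents that appear under every query word.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical file names and extracted text.
docs = {
    "marketing_trends.pdf": "Small business profitability and big data adoption statistics.",
    "storage_guide.pdf": "Choosing a cheap NAS for home data storage.",
}
index = build_index(docs)
print(search(index, "data profitability"))
```

Searching for several words at once returns only the documents that mention all of them, which is usually enough to find the one book you half remember.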

My next step, though, will be to pull this together and create a connection to the web so that I can access this repository of knowledge remotely, and to extend the insights beyond text to include the multitude of graphs and images contained in the PDFs so that they become searchable too.

I still say that Big Data really is not about the sheer volume, even though a stack of hard drives that reaches into the thermosphere is big…we need to actually tap into that information and make use of it.  Personally, I was tired of being a data hog…I look around and see more than 500 DVDs filled with the digital “stuff” from my time online, plus the 9 TB of HDDs that I have on the go right now.

Hopefully this helps someone out there, and I look forward to comments and suggestions on other open source, and preferably FREE, Big Data tools!

