I got curious today, mainly because I looked at how my current employer harvests potential “customers” and feeds them to us. It made me realize that Big Data needs to be brought to the forefront for every company, and that it does not need to be a massive 1,100-node cluster with 12 PB of raw data like Facebook's to be effective. So I took a swing by the Apache Hadoop page and had a quick look around.
Apache Hadoop is an open-source software framework that supports data-intensive distributed applications licensed under the Apache v2 license. It supports running applications on large clusters of commodity hardware. Hadoop derives from Google’s MapReduce and Google File System (GFS) papers.
OK, so Wikipedia, in its opening salvo, just throws a lot of technical jargon out there and never really gets to a point that everyone can understand… so it is time for a quick Krispification of the term.
What is Hadoop According to Krispy?
Apache Hadoop is an open-source Big Data framework that supports a variety of high-volume, data-intensive connections to external applications. It does this by running those applications on common hardware, like regular servers and desktop machines, connecting multiple machines together in clusters rather than expanding to mainframe capacity.
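To make that a little more concrete, here is the MapReduce idea that Hadoop is built around, boiled down to a plain-Python word count in the style of a Hadoop Streaming job. This is my own illustrative sketch, not code from the Hadoop project; in a real cluster, Hadoop would run the map phase and the reduce phase on many commodity machines at once.

```python
# wordcount_sketch.py -- MapReduce in miniature.
# Hadoop splits the input across machines, runs a "map" step on each
# chunk, groups the results by key, and runs a "reduce" step to combine
# them. The same two steps, single-machine style:

from collections import defaultdict

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    text = [
        "Hadoop runs on commodity hardware",
        "Hadoop scales by adding commodity machines",
    ]
    print(reducer(mapper(text)))
```

The point of splitting the work this way is that neither step cares which machine it runs on, which is exactly why a pile of regular servers can stand in for a mainframe.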
OK… so hopefully this helps a little for those who see technology as an inside joke meant to keep the non-geeks outside looking in (funny how technology has almost become a way to reverse bully, but that is for another day).
But who is using this, and what is it being used for? That is even more interesting, as the applications Hadoop can be put to are quite varied. I myself started using Hadoop for real-time search across 1 TB of PDFs to help me with research. I not only established the unstructured data connections, but I also built context and insights into the algorithms so that I was not relying entirely on my own memory (as good as it is… I think it is reaching capacity, or my noodle is starting to deteriorate as I approach the top of the hill that I am almost over).
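The heart of a search setup like that is an inverted index: instead of re-reading every document for each query, you map each word to the documents it appears in once, then just look words up. What follows is a deliberately simplified sketch of that idea, not the actual code behind my research tool; it assumes the PDF text has already been extracted into plain strings, and the file names are made up for illustration.

```python
# inverted_index_sketch.py -- the core trick behind full-text search.
# Build the index once, then every lookup is a cheap dictionary hit.

def build_index(docs):
    """docs: {doc_name: extracted_text}. Returns {word: set of doc_names}."""
    index = {}
    for name, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
    return index

def search(index, word):
    """Return a sorted list of documents containing the word."""
    return sorted(index.get(word.lower(), set()))

if __name__ == "__main__":
    docs = {
        "paper1.pdf": "Hadoop cluster sizing notes",
        "paper2.pdf": "MongoDB index tuning notes",
    }
    idx = build_index(docs)
    print(search(idx, "notes"))   # ['paper1.pdf', 'paper2.pdf']
```

At 1 TB of PDFs, building that index is exactly the kind of embarrassingly parallel job you hand to Hadoop, with something like MongoDB holding the finished index for quick lookups.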
Other organizations are using it for, well, just about everything. There is a list available on the PoweredBy – Hadoop page; a few excerpts:
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
- A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
- A 300-machine cluster with 2400 cores and about 3 PB raw storage.
- Each (commodity) node has 8 cores and 12 TB of storage.
We are heavy users of both streaming as well as the Java APIs. We have built a higher-level data warehousing framework using these features called Hive (see http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
We use Hadoop to analyze our virtual economy
- We also use Hive to access our trove of operational data to inform product development decisions around improving user experience and retention as well as meeting revenue targets
- Our data is stored in S3 and pulled into our clusters of up to 4 m1.large EC2 instances. Our total data volume is on the order of 5 TB
Dual quad-core Xeon L5520 @ 2.27 GHz & L5630 @ 2.13 GHz, 24 GB RAM, 8 TB (4×2 TB) of storage per node.
- Used for charts calculation, royalty reporting, log analysis, A/B testing, dataset merging
- Also used for large scale audio feature analysis over millions of tracks
We have multiple grids divided up based upon purpose.
- ~800 Westmere-based HP SL 170x, with 2×4 cores, 24GB RAM, 6x2TB SATA
- ~1900 Westmere-based SuperMicro X8DTT-H, with 2×6 cores, 24GB RAM, 6x2TB SATA
- ~1400 Sandy Bridge-based SuperMicro with 2×6 cores, 32GB RAM, 6x2TB SATA
There are many different ways to do Hadoop and Big Data analysis, and the funny thing is, you do not have to commit hundreds of man-hours and millions of dollars to this type of project to reap the benefits. Look at me: I save 10-15 hours per week of research for this blog, having invested 15 hours in building my Hadoop- and MongoDB-based search program. Even if I were a certified programmer, an ROI of less than two weeks would be awesome!