Chris J Powell

Big Data for Document Management

One of the challenges that I have always had as a collector of information is finding the time to read every PDF, doc, epub and ppt that I find on the interwebz.  I spend no less than 2 hours daily hunting down information that interests me and at least that much time reading the documents I find, but the reality is…with only 24 hours in the day…I don’t get through everything that I want to.

MongoDB is built for document-based storage: it can index, catalog and keyword search/query mountains of documents (in my case 1.5TB).  To understand MongoDB’s data model:

A MongoDB deployment hosts a number of databases. A database holds a set of collections. A collection holds a set of documents. A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.
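
To make that hierarchy a bit more concrete, here is a minimal sketch in plain Python (no MongoDB server needed) that models a deployment as nested dicts and lists. The database name "library", the collection name "files", the field names and the `find` helper are all made up for illustration; this is not MongoDB's API, just the shape of the data it describes.

```python
# Plain-Python model of the hierarchy described above:
# deployment -> databases -> collections -> documents (key-value pairs).
# Every name here ("library", "files", the fields) is hypothetical.
deployment = {
    "library": {          # a database
        "files": [        # a collection: a set of documents
            # Documents in one collection need not share the same fields...
            {"title": "Hadoop Basics", "format": "pdf", "pages": 212},
            {"title": "Scaling Out", "format": "epub", "tags": ["nosql", "big data"]},
            # ...and a common field ("pages") may hold a different type.
            {"title": "Q3 Roadmap", "format": "ppt", "pages": "unknown"},
        ]
    }
}


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion.

    A document that lacks one of the fields simply does not match,
    which is how a dynamic schema stays queryable.
    """
    return [
        doc
        for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


files = deployment["library"]["files"]
pdfs = find(files, format="pdf")                      # query by a shared field
with_tags = [doc for doc in files if "tags" in doc]   # only some docs have "tags"
page_types = {type(doc["pages"]).__name__ for doc in files if "pages" in doc}
```

With a real MongoDB instance and the PyMongo driver, the same dicts could be handed to a collection's `insert_one` method as they are, since each document is just a set of key-value pairs.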

When I started down the path of leveraging Open Source tools to build myself a Big Data haven for all of the information that I tend to hoard (yes, I freely admit that I am more than a bit of a Data Hoarder), I took several steps in the wrong direction.  I started with a purely Hadoop build which did the job, was fast and made awesome reports for analyzing the data…problem was…I had no one to report to, because my wife really doesn’t care about how many new files were saved or the correlation between Tech Journal A and O’Reilly Book B…so I went back to the drawing board.

The application that I am building is not quite finished, but I was very impressed with the solid documentation that comes with MongoDB and the level of support that the community offers.  If you are looking to make a move into Big Data application development, I strongly suggest starting with a great video by Roger Bodamer over at


