When I am talking to my clients and the topic of Big Data comes up, it surprises me that for the most part the current solution is simply “just add disks”. I have been following a few really great sources lately, and it seems that “user generated” data is not the only place organizations should be looking when building out their Data Warehouse, BI and Analytics strategies. We have entered a time when collectively we are generating a mind-boggling amount of data every day, and in many cases we simply duplicate that data without ever really connecting the dots.
I stumbled across an interesting blog post by James Serra over at sqlservercentral.com, and his take on looking first at the Data Warehouse architecture was refreshing. I had not heard of the Kimball or Inmon Methodologies before, though, so I set forth to do a little digging.
Bill Inmon: Considered by many to be the father of Data Warehousing, Bill Inmon is among the founding fathers of the modern computing world. In 2008 he retooled his “DW 2.0” concept and set forth on his latest endeavour, the “Corporate Information Factory”.
The Inmon Methodology holds that a Data Warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from it. In the data warehouse, information is stored in third normal form (3NF).
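To make that a little more concrete, here is a rough sketch of my own (a hypothetical illustration, not taken from Inmon's books or Serra's post) of what the pattern implies: the warehouse holds normalized tables, and a departmental data mart is derived from them.

```python
import sqlite3

# Hypothetical illustration of the Inmon pattern: a normalized (3NF)
# warehouse feeds a department-specific data mart.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 3NF warehouse tables: each fact lives in exactly one place.
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT, unit_price REAL);
CREATE TABLE sale     (sale_id     INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       product_id  INTEGER REFERENCES product(product_id),
                       quantity    INTEGER);
""")
cur.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
cur.execute("INSERT INTO product VALUES (10, 'Widget', 2.50)")
cur.execute("INSERT INTO sale VALUES (100, 1, 10, 4)")

# A data mart is sourced *from* the warehouse, denormalized and
# shaped for one consumer (here, a sales-reporting team).
cur.execute("""
CREATE VIEW sales_mart AS
SELECT c.name AS customer, p.name AS product,
       s.quantity * p.unit_price AS revenue
FROM sale s
JOIN customer c ON c.customer_id = s.customer_id
JOIN product  p ON p.product_id  = s.product_id
""")
print(cur.execute("SELECT * FROM sales_mart").fetchall())
# → [('Acme Corp', 'Widget', 10.0)]
```

The point of the sketch is the direction of flow: the mart never becomes a second source of truth, it is always a derived view of the one warehouse.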
Now I look at that statement and go…ahh….ok…not much meat and definitely no potatoes. This is actually one of my favourite parts of blogging. The search for a single answer creates several new questions. So I took a dive into the depths of the Corporate Information Factory and not only was there light at the end of the tunnel…I now get the need for having a Master Data Management Plan.
To sum up several hundred pages: the challenge with Data Management is that every application wants to have its own data store (the Data Mart), and in many cases COTS applications tend to push for a decentralized model of data sharing. This becomes increasingly challenging when multiple applications need to pull the same data but there is no available connection into that data from a given application (which leads to messy coding, delays and crashes). The Data Warehouse sits between the Data Mart and the “consumer” to allow a flow and to ensure that there is always access to the data required.
If you are at all involved in the Management of Data and have not registered for access to the CIF Resources Portal I strongly suggest it. It is free and the information contained is worth the 30 second registration.
Ralph Kimball: Taking a different approach to Data Management, Ralph Kimball holds the honour of being in the Database Hall of Fame, and his methodology, known as Dimensional Modelling, is widely regarded as the de facto starting point for decision support and the foundation for Business Intelligence.
The Kimball Methodology defines the Data Warehouse as the conglomerate of all the data marts within the enterprise. Information is always stored in the dimensional model.
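For contrast with the Inmon approach, here is a minimal star-schema sketch of my own (again a hypothetical illustration, not from the Kimball Group's material): a central fact table of numeric measures surrounded by descriptive dimension tables used for slicing.

```python
import sqlite3

# Hypothetical star-schema sketch of the dimensional model:
# a central fact table of measures, joined to dimension tables
# (the "points" of the star) that describe each measurement.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (date_key INTEGER REFERENCES dim_date(date_key),
                          product_key INTEGER REFERENCES dim_product(product_key),
                          revenue REAL);
""")
cur.execute("INSERT INTO dim_date VALUES (20120101, 2012, 1)")
cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
cur.execute("INSERT INTO fact_sales VALUES (20120101, 1, 10.0)")
cur.execute("INSERT INTO fact_sales VALUES (20120101, 1, 5.0)")

# Analysts slice the facts along whichever dimensions they need.
row = cur.execute("""
SELECT d.year, p.category, SUM(f.revenue)
FROM fact_sales f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY d.year, p.category
""").fetchone()
print(row)  # → (2012, 'Hardware', 15.0)
```

Where the Inmon warehouse is normalized and marts are derived from it, here the denormalized, query-friendly marts themselves make up the warehouse.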
So taking a look at the Dimensional Model will be key to unlocking this statement, but unlike the Inmon Methodology, the material is not quite as freely presented. The starting point is the Kimball Group website, but there is a wealth of information hidden behind the scenes, with hundreds of links and other concepts behind the Dimensional Model. Looking through the titles, I figured I would start with the Design Steps to gain an understanding of where the Kimball Methodology begins.
In diving into this, I very quickly realized that there is no simple answer to the question of what Dimensional Modelling is, at least none that I could glean from reading 3 or 4 white papers to kick-start my morning. In fact, the Kimball Group has the Kimball University to help explain it. I get that this is a big topic, and in dealing with Big Data there will be challenges, but I can’t help thinking back to Occam’s Razor:
It is a principle urging one to select among competing hypotheses that which makes the fewest assumptions and thereby offers the simplest explanation of the effect.
The Dimensional Model, to me, seems to overcomplicate the situation, and now that we are talking potential petabytes and zettabytes of data…surely simple must be better?
Well now that my head is spinning, I think I need to part ways and circle back to this. Expect to see a return to this topic soon as I dive into the topic further and start to wade through the hundreds of insights I found today.
Chris J Powell