The Power and Simplicity of the Data Warehouse

“In many ways, dimensional modeling amounts to holding the fort against assaults on simplicity”
– Ralph Kimball, The Data Warehouse Toolkit

 

Although there are many reasons that an organization may consider building a “data warehouse”, most of the time the overwhelming reason is performance related… a report or web page is just taking too long to run and they want it to be faster. If that data could be pre-computed, then that page would be really fast!

I’m here to tell you that speed is not the most important reason for building the warehouse.

When you’ve got a system where multiple consumers are reading the data from the transactional store and doing their own calculations, you create a whole bunch of other problems beyond just the speed issue:

  • You create the potential for multiple different and conflicting results in that system. At least a few of those consumers will inevitably do something wrong.
  • You put a considerable burden on your transactional system, recalculating that data over and over again for each custom request.
  • While those consumers run their long-running queries, that data is being simultaneously updated by a multitude of data collection and transformation processes… so you are not only reading inconsistent sets of data in the consumers, you are blocking the collection and transformation processes from doing their jobs, getting in their way and slowing them down… and sometimes even causing them to fail.
  • You’ve created a multitude of intertwined dependencies in that system. This makes it extremely difficult to improve or otherwise change the transactional system without breaking things… even the smallest change can require massive, system-wide accommodations.
  • The bottom line is this: You’ve just got a greatly over-complicated system that is much more prone to performance problems and defects. As Ralph Kimball states so eloquently, data warehouse efforts are a giant move towards simplicity. And simpler systems are better systems.

We recently launched a major warehouse initiative to, once and for all, pre-compute and store all our portfolio-level data. Although that data is already computed from data that’s been pre-aggregated at the account level, there is still considerable additional work required to aggregate that data further to the portfolio level.

That additional aggregation is, first and foremost, a major performance problem. Pulling a single portfolio’s data can take as long as 5-7 minutes for larger portfolios. That’s a serious scalability problem and an overall burden on our system.

I’m happy to report that the portfolio warehouse initiative is nearing its conclusion, and I’m confident it will do things for us far beyond the performance improvements we hoped to gain:

  • With every consumer pulling from the same warehouse for portfolio level information, we can guarantee they will get the same results… they are all “drinking water from the same well.”
  • The portfolio data can now be processed “incrementally”… i.e. rather than have to recalculate that data from the beginning of time for every request, we can reprocess only that data that has changed. This pays huge dividends on overall system performance and greatly decreases the burden on the system.
  • Our data will now be pulled from a snapshot-enabled data warehouse. This guarantees clean and consistent reads without blocking the transactional store from doing its work.
  • By having one system that reads transactional data and compiles and stores the portfolio data, we only have one system to change when we want to change something in the transactional store. This is hugely liberating to us when we want to modify those underlying systems.
  • The new published warehouse structure for portfolios is simple and easy to understand. It therefore opens up consumption of that data in new ways with less effort. Looking at data for all the portfolios in a firm in one pass is now possible, enabling cross-firm analytics that were unthinkable before. This opens a myriad of options for us that we intend to take advantage of.
  • Oh, and the speed is nice also… it’s really fast!
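The incremental idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not our actual implementation: the names (`account_daily`, `portfolio_daily`, `refresh_portfolio_daily`) are invented, and a simple sum stands in for the real portfolio-level math. The key ingredient is a high-water mark, so each run folds in only the rows that arrived since the last run rather than recalculating from the beginning of time.

```python
from datetime import date

# Hypothetical in-memory stand-ins for warehouse tables; the names are invented.
# Each account-level row carries a monotonically increasing change sequence
# number, which is what makes incremental reprocessing safe.
account_daily = [
    # (change_seq, portfolio_id, as_of, market_value)
    (1, "P1", date(2024, 1, 2), 100.0),
    (2, "P1", date(2024, 1, 2), 250.0),
    (3, "P1", date(2024, 1, 3), 101.0),
]
portfolio_daily = {}   # (portfolio_id, as_of) -> aggregated market value
last_seq = 0           # high-water mark: everything up to here is processed

def refresh_portfolio_daily():
    """Fold in only the account rows that arrived since the last run."""
    global last_seq
    changed = [row for row in account_daily if row[0] > last_seq]
    for seq, portfolio_id, as_of, value in changed:
        key = (portfolio_id, as_of)
        portfolio_daily[key] = portfolio_daily.get(key, 0.0) + value
        last_seq = max(last_seq, seq)

refresh_portfolio_daily()   # first run processes all three rows

# A late-arriving account row: the next run touches only this one row,
# not the whole history.
account_daily.append((4, "P1", date(2024, 1, 3), 260.0))
refresh_portfolio_daily()

print(portfolio_daily[("P1", date(2024, 1, 3))])   # 101.0 + 260.0 = 361.0
```

In a real warehouse the same idea is usually implemented with a change-tracking column or log sequence number on the source tables, and the re-aggregation runs inside the warehouse itself rather than in application code.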

While we are still in the final stages of implementation, we hope to bring this system fully into production over the next few months and are very excited about the possibilities… we hope you are too!

And if you’d like to read about data warehousing, I highly recommend what is, in my opinion, the bible of data warehousing:

The Data Warehouse Toolkit, by Ralph Kimball and Margy Ross



Deming’s Seven Deadly Diseases

William Edwards Deming went to Japan in the early 1950s and propagated his ideas about quality control and production processes throughout Japanese industry.

There’s a wealth of wisdom in Deming’s work, albeit much of it industrially focused, but I’m particularly fond of his “Seven Deadly Diseases” of management (with my comments):

  1. Lack of constancy of purpose
    • It’s clear that having some core concepts about what you are trying to do is helpful… simple, effective statements about what’s important to your company, what your company does, and perhaps what your department’s role is in helping to fulfill the company’s purpose.
  2. Emphasis on short-term profits
    • Encourages what Bob Lutz describes as what-can-we-get-away-with thinking.
  3. Evaluation by performance, merit rating, or annual review of performance
    • These systems reward results rather than process-improvement, which can be counter-productive, and thereby encourage workers to maintain the status quo rather than innovate… their goal is to ‘get it done’ rather than to improve how they do so.
  4. Mobility of management
    • Too much ‘reorganization’ interrupts and breaks process improvement initiatives. Probably happens so much because of #3.
  5. Running a company on visible figures alone
    • You cannot measure everything, but you must nonetheless do the things you think need to be done. Too many times we are told not to do something unless we can show it will be valuable.
  6. Excessive medical costs
    • A very interesting observation made over 60 years ago.
  7. Excessive costs of warranty, fueled by lawyers who work for contingency fees
    • Maybe not so relevant to the software industry.

It’s also worth mentioning a few other items from “A Lesser Category of Obstacles”:

  1. Neglecting long-range planning.
    • I’m a little torn on this.  In software, too much long-term planning can be a waste of time, but you certainly can’t neglect it entirely.
  2. Relying on technology to solve problems
    • I see this all the time. Figure out your problems first please… I beg you… before you start buying software you think will solve it for you… it won’t.
  3. Seeking examples to follow rather than developing solutions
  4. Excuses, such as “our problems are different”.
  5. Placing blame on the workforce, who are responsible for only 15% of mistakes, when the system designed by management is responsible for 85% of the unintended consequences.
  6. Relying on quality inspection rather than improving product quality.
    • Relying on software testing rather than changing how we build the software in the first place.

I’m often drawn back to these little pearls of wisdom and I continue to be amazed at the prevalence of many of them throughout the industry. Keep them in mind while you are trying to steer your own efforts.

Top 5 logical fallacies used in technical debates

I’m an engineer, so I get into a lot of technical debates… and plenty of debates on a wide variety of other topics as well. I can certainly come across as argumentative, but all I’m really seeking is a reasonable debate between valid arguments.

What I often get, instead, are arguments that are really based on nothing. I think it’s important for people in our field to understand the more common logical fallacies so they can recognize them when they’re being used… and avoid using them themselves.

#5: Slippery Slope

This is a form of ‘probability fallacy’… i.e. because X has happened 10 times in the past, it is claimed to be very likely to happen again in the future. For independent events, that is statistically false: if I flip a fair coin 10 times and get heads every time, it’s still 50/50 on the eleventh flip.

It’s hard to convince people that this isn’t a good argument, but it isn’t. In some cases, it’s an argument about the failings of human nature. But I still think it’s just lazy… tell me why it’s a bad idea now, rather than trying to convince me that, at some point in the future, it will cause bad things to happen. I am sometimes guilty of this one myself.

Example: But if we add that property to the class to solve that problem, what stops us from adding other properties? Eventually, the class will have 1000 properties!

#4: Appeal to the Masses

Also known as an ad populum argument. Just because 10 ba-jillion people do it doesn’t mean it’s right. We used to bleed people to cure diseases… that didn’t make it right. Tell me what’s good about your solution. What are its pros and cons? I don’t care… at all… how many people think it’s a good idea.

Example:  But everyone is using Java applets on their web pages, so we should be doing that also!

#3: Appeal to Authority

It’s very, very common in this internet age to fall victim to this one, because authority is granted to the most random of internet characters. Having a blog does not make you an authority, in any case, and it certainly isn’t a valid argument to point to someone else to validate your solution. If you want to plagiarize that person’s arguments to make your case, that’s fine… but please don’t tell me you’re right just because some guy said so.

Example:  But Scott Hanselman says that everybody should be using NuGet for everything!

#2: Straw Man

This one is used a lot, and it’s good to be able to spot it when it happens. Someone takes your side of the argument and misrepresents it in the most gross way possible in order to beat it about the head and shoulders like a birthday piñata.

Example: So you want to take all that data, jam thousands of rows in the database and then just do it all with stored procedures?!? Um, no… not what I said at all.

#1: Ad Hominem

This happens all the time: attacking the debater rather than the idea. It’s usually a desperation move, commonly used by someone unaccustomed to defending their positions. It often takes the form of Appeal to Motive… attacking the debater by questioning their motives.

Example:  You don’t like this solution because you just don’t understand it (because, presumably, I am an idiot). You just want to do it that way because it’s less work for you and more work for me (appeal to motive).

I think it’s good for people to have debating skills and to know what makes for a good argument, and what doesn’t. A good read of Robert’s Rules of Order would be nice also!

Innovation not Regurgitation

Given the rapid growth of the internet as an information resource… and our seemingly inherent human treatment of the written word as gospel… it is very easy for my fellow engineers to read a single blog entry and conclude that ‘we should be doing that!’

… and I think that’s a very serious problem.

To make matters worse, the ever-increasing number of tools and frameworks leads to a dangerous tendency to reduce our engineering responsibility to ‘toolbox assembly’. We’ll use Tool A for this and Tool B for that, and we’ll house it all in Framework C… my work is done here!

I remember sitting in a meeting a few years ago and asking “when did we get so afraid to write our own code?”

Ultimately, I’m interested in ideas… using tools and contemplating thoughts from internet blogs are useful endeavors… but they shouldn’t replace our basic creative function as engineers.

Give me innovation not regurgitation!