Build a testing database with SSDT and NuGet

My central idea was to use SSDT to build a DACPAC file defining the schema of our testing database, and to wrap that DACPAC artifact in a NuGet package that testing projects could reference.

It’s not uncommon, during early attempts to bootstrap CI and automated testing, for an engineering team to start with ‘unit’ tests that read and write to a common development database. In truth, these are integration tests rather than unit tests, since they touch more than a single layer of the architecture, but that’s not the focus of this particular post. My primary goal in adopting a testing culture is to increase the amount of reliable, automated testing. Moving towards purer unit tests will come later, and will make our automated tests faster (and probably improve our architecture)… but that is a problem for a future day.


The Power and Simplicity of the Data Warehouse

“In many ways, dimensional modeling amounts to holding the fort against assaults on simplicity”
– Ralph Kimball, The Data Warehouse Toolkit

 

Although there are many reasons an organization may consider building a “data warehouse”, most of the time the overwhelming reason is performance-related… a report or web page is just taking too long to run and they want it to be faster. If that data could be pre-computed, then that page would be really fast!

I’m here to tell you that speed is not the most important reason for building the warehouse.

When you’ve got a system where multiple consumers are reading the data from the transactional store and doing their own calculations, you create a whole bunch of other problems beyond just the speed issue:

  • You create the potential for multiple, conflicting results in that system. At least a few of those consumers will inevitably do something wrong.
  • You put a considerable burden on your transactional system, which must recalculate that data over and over again for each custom request.
  • While those consumers run their long-running queries, that data is simultaneously being updated by a multitude of data collection and transformation processes. Not only are the consumers reading inconsistent sets of data, they are blocking the collection and transformation processes from doing their job, getting in their way and slowing them down… and sometimes even causing them to fail.
  • You’ve created a multitude of intertwined dependencies in that system. This makes it extremely difficult to improve or otherwise change the transactional system without breaking things, and forces massive system-wide changes to accommodate even the smallest modification.
  • The bottom line is this: you’ve got a greatly over-complicated system that is much more prone to performance problems and defects. As Ralph Kimball states so eloquently, data warehouse efforts are a giant move towards simplicity. And simpler systems are better systems.

We recently launched a major warehouse initiative to, once and for all, pre-compute and store all our portfolio-level data. Although that data is already computed from data that’s been pre-aggregated at the account level, there is still considerable additional work required to aggregate that data further to the portfolio level.

Primarily, that additional work is a performance problem: pulling a single portfolio’s data can take as long as 5-7 minutes for larger portfolios. That’s a serious limit on our scalability and an overall burden on our system.

I’m happy to report that the portfolio warehouse initiative is nearing its conclusion, and I’m confident it will do things for us far beyond the performance improvements we hoped to gain:

  • With every consumer pulling from the same warehouse for portfolio level information, we can guarantee they will get the same results… they are all “drinking water from the same well.”
  • The portfolio data can now be processed “incrementally”… i.e. rather than having to recalculate that data from the beginning of time for every request, we can reprocess only the data that has changed (see the sketch after this list). This pays huge dividends on overall system performance and greatly decreases the burden on the system.
  • Our data will now be pulled from a snapshot-enabled data warehouse. This guarantees clean and consistent reads without blocking the transactional store from doing its work.
  • By having one system that reads transactional data and compiles and stores the portfolio data, we only have one system to change when we want to change something in the transactional store. This is hugely liberating to us when we want to modify those underlying systems.
  • The new published warehouse structure for portfolios is simple and easy to understand. It therefore opens up consumption of that data in new ways with less effort, making possible things that simply weren’t before: we can now look at data for all the portfolios in a firm in one pass and perform cross-firm analytics that were unthinkable previously. That opens a myriad of options for us that we intend to take advantage of.
  • Oh, and the speed is nice also… it’s really fast!
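
To make the incremental idea above concrete, here is a minimal sketch of one common pattern for that kind of load. Every object name here (WarehouseLoadLog, AccountLevelData, PortfolioWarehouse, and their columns) is a hypothetical stand-in, not our actual schema, and our real implementation may differ:

-- Hypothetical high-water mark from the previous load
DECLARE @LastLoadTime DATETIME2 =
    (SELECT MAX(LoadCompletedAt) FROM dbo.WarehouseLoadLog);

-- Collect only the portfolios whose underlying account-level data changed since then
SELECT DISTINCT PortfolioId
INTO #ChangedPortfolios
FROM dbo.AccountLevelData
WHERE ModifiedDate > @LastLoadTime;

-- Remove and rebuild just those portfolios' rows in the warehouse table
DELETE w
FROM dbo.PortfolioWarehouse AS w
JOIN #ChangedPortfolios AS c ON c.PortfolioId = w.PortfolioId;

INSERT INTO dbo.PortfolioWarehouse (PortfolioId, AsOfDate, MarketValue)
SELECT a.PortfolioId, a.AsOfDate, SUM(a.MarketValue)
FROM dbo.AccountLevelData AS a
JOIN #ChangedPortfolios AS c ON c.PortfolioId = a.PortfolioId
GROUP BY a.PortfolioId, a.AsOfDate;

-- Record the new high-water mark for the next run
INSERT INTO dbo.WarehouseLoadLog (LoadCompletedAt) VALUES (SYSUTCDATETIME());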

While we are still in the final stages of implementation, we hope to bring this system fully into production over the next few months and are very excited about the possibilities… we hope you are too!

And if you’d like to read about data warehousing, I highly recommend what is, in my opinion, the bible of data warehousing:

The Data Warehouse Toolkit – by Ralph Kimball and Margy Ross


The coolest SQL Server function you never heard of

Ever heard of the SQL_VARIANT_PROPERTY function? I didn’t think so.

SQL Server developers very often make the mistake of declaring their NUMERIC columns too large. When deciding how to size a column, they’ll often think, “make it way larger than I need, just to be safe”.

This works OK as long as you simply store and read these values, but if you ever have to perform math with these columns, particularly some form of division or multiplication, you may find your results mysteriously losing precision.

This is because SQL Server can only store a maximum of 38 digits of precision per number… if the declared sizes in your mathematical expression could yield a result wider than that, SQL Server is forced to downsize the result type and drop digits from the fractional part (the scale) to make it fit.
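
This is exactly where SQL_VARIANT_PROPERTY earns its keep: it reports the base type, precision and scale that SQL Server assigned to the result of an expression. A quick sketch of how you might inspect this (the variables are just for illustration):

-- Two deliberately oversized NUMERIC declarations
DECLARE @a NUMERIC(38, 10) = 1.0;
DECLARE @b NUMERIC(38, 10) = 3.0;

-- Inspect the type SQL Server infers for the division
SELECT
    SQL_VARIANT_PROPERTY(@a / @b, 'BaseType')  AS BaseType,
    SQL_VARIANT_PROPERTY(@a / @b, 'Precision') AS [Precision],
    SQL_VARIANT_PROPERTY(@a / @b, 'Scale')     AS Scale;
-- With these inputs the result comes back downsized (typically NUMERIC(38, 6)),
-- so only six decimal places survive the division.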


Reading from Snapshot Databases With Multiple Tables

If you haven’t explored using Snapshot isolation in SQL Server, I recommend you give it a look. A snapshot-enabled database allows the reader to get a clean read without blocking.
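
For reference, snapshot reads are controlled by two database-level settings. A minimal sketch, assuming a database named MyDatabase (substitute your own):

-- Allow queries to explicitly request SNAPSHOT isolation
ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Optionally, make ordinary READ COMMITTED statements use row versioning as well
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;

The first setting is what the explicit SNAPSHOT examples below rely on; the second is what makes plain READ COMMITTED reads non-blocking against a snapshot-enabled database.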

Prior to this capability, the only way to guarantee a non-blocking read from the database was to sprinkle NOLOCK hints all over your queries. Clearly this is a bad idea because you’re getting a dirty read, but it’s really much worse than that.

Enter Snapshot isolation… when querying under Snapshot isolation, your query reads the committed state of the rows as of the moment the query begins, without being blocked by outstanding transactions against those rows. This lets you get the last known committed state of those rows (clean data) without having to wait for those transactions to commit. That’s critical behavior if you want, say, a website that doesn’t have slow-loading pages.

Now, it gets interesting when you’re trying to read multiple tables in one batch from your Snapshot database. Let me show you an example.

Start with two sample tables:

create table t1 ( id int identity(1, 1), value varchar(50))
create table t2 ( id int identity(1, 1), value varchar(50))
insert into t1 values ('Test1')
insert into t2 values ('Test1')

Now, set up a couple of update statements, but don’t execute them yet:

-- Update statements
begin transaction
update t1 set value = 'Test2'
update t2 set value = 'Test2'
commit transaction

In another window, set up a read as follows:

-- Select statements
set transaction isolation level read committed
begin transaction
select * from t1
-- create some artificial time between the two queries
waitfor delay '000:00:10'
select * from t2
commit transaction

Now, execute the Select Statements code, then go back to your update statements code and execute the updates including the commit (you’ve got 10 seconds, so move quickly). Now go back to your select statements and wait for them to finish.

Here’s what you’ll get:

[Screenshot: results under READ COMMITTED isolation; t1 returns ‘Test1’ and t2 returns ‘Test2’]

Since your first query executed right away, it gets Test1, while the second query reads after 10 seconds (during which your update occurred) and sees Test2.

Now switch your test data back to its original state:

update t1 set value = 'Test1'
update t2 set value = 'Test1'

And change your select query to use Snapshot isolation:

set transaction isolation level snapshot
begin transaction
select * from t1
-- create some artificial time between the two queries
waitfor delay '000:00:10'
select * from t2
commit transaction

Now repeat the process… run your select query, switch windows and run your updates with commit, then switch back and wait for your select query to complete. Now you’ll get this:

[Screenshot: results under SNAPSHOT isolation; both t1 and t2 return ‘Test1’]

Cool! Now I get Test1 from both tables, even though I updated the data between the two individual queries. How’s that work and why does it matter?

According to MSDN, SNAPSHOT isolation specifies that ‘data read by any statement in a transaction will be the transactionally consistent version of the data that existed at the start of the transaction’.

This differs from READ COMMITTED, in which data ‘can be changed by other transactions between individual statements within the current transaction’.

This can be pretty important if you are publishing data to a multi-table warehouse and that multi-table publication should be considered as part of a ‘set’. If you use READ COMMITTED in that scenario you can get, essentially, a mismatched set of data. This is usually not a good thing.

If you’re reading a single table from your snapshot-enabled database, then using READ COMMITTED will be fine. You’ll get your nice non-blocking clean read. If you need to read multiple tables in one transaction, however, and you want those reads to be consistent with each other, you need to explicitly use SNAPSHOT isolation and start a transaction!

Yes, that’s right… you need a transaction wrapping your select statements. Transactions are not just for updates… shock and horror!

————–

More information here:

http://msdn.microsoft.com/en-us/library/ms173763.aspx

http://msdn.microsoft.com/en-us/library/tcbchxcb(v=vs.80).aspx

http://blogs.msdn.com/b/sqlcat/archive/2007/02/01/previously-committed-rows-might-be-missed-if-nolock-hint-is-used.aspx

Find stored procedures that won’t compile

Here’s a handy little code snippet: find any stored procedures that won’t compile on SQL Server.

The routine lists all procedures, marking each one ‘-- Succeeded’ or ‘-- Failed’ along with the error message for those that fail.

Note that recompiling the procedures may also throw dependency errors that can’t be caught with a TRY/CATCH block… those errors likely indicate an issue as well.

SET NOCOUNT ON

DECLARE @ProcedureName NVARCHAR(2048)
DECLARE @ProcedureBody NVARCHAR(MAX)

DECLARE procCursor CURSOR STATIC FORWARD_ONLY READ_ONLY
FOR
SELECT
    SCHEMA_NAME(o.schema_id) + '.' + o.name AS ProcedureName,
    OBJECT_DEFINITION(o.[object_id]) AS ProcedureBody
FROM sys.objects AS o
WHERE o.[type] = 'P'
ORDER BY o.[name]

OPEN procCursor
FETCH NEXT FROM procCursor INTO @ProcedureName, @ProcedureBody

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Recompile by re-running each definition as an ALTER.
    -- You might have to adjust this logic if you don't have discipline in your CREATE statements.
    SET @ProcedureBody = REPLACE(@ProcedureBody, 'CREATE PROCEDURE', 'ALTER PROCEDURE')

    BEGIN TRY
        EXECUTE (@ProcedureBody)
        PRINT @ProcedureName + ' -- Succeeded'
    END TRY
    BEGIN CATCH
        PRINT @ProcedureName + ' -- Failed: ' + ERROR_MESSAGE()
    END CATCH

    FETCH NEXT FROM procCursor INTO @ProcedureName, @ProcedureBody
END

CLOSE procCursor
DEALLOCATE procCursor