Archive for sql

Stored procedure reporting & scalability

Today’s post is a case study of sorts, about my former employer, who had an interesting architecture. It’s roots were VB6 and SQL server (version 6 I believe). They decided to put as much logic in their stored procedures as possible. The arguments being:

  • Easy to update (fix, improve) on-the-fly.
  • Hard to work with data (multiple tables, arrays, etc) in VB6 and ASP, at least compared to .NET.

Given the circumstances, I don’t think that I would go back and change history if given the chance.

The problem is that the world has changed, and their architecture just doesn’t scale well. The biggest problem is in the area of reporting (which is now primarily .NET). Since all the data manipulation and logic is in the stored procedures, the database is forced to do all the work. In a small application with not a lot of data, this probably isn’t a big deal. The report runs for a few seconds and no one really experiences a slow down.

The problem arises when there is a lot of data. Easily millions of rows in dozens of tables. The current fix is to scale vertically, meaning that they throw more hardware (AKA money) at the problem.

The real solution to the problem is to scale horizontally. There are two main ways of doing this. The first is to add more database servers. The problem is that this isn’t really all that easy. The software to do this is getting better, but the fact remains that two different systems now have to stay synchronized. I’ve read articles about businesses that have scaled their business, and a common theme is that databases are one of the hardest parts of your architecture to scale.

The second aspect of scaling is to reallocate where your work is being done. You need to start thinking of the database as a…well….a database! The first and foremost purpose of a database is to store massive amounts of data, and allow you to quickly retrieve that data.

Databases are amazingly fast when you use them simply as a place to store data. If you design your database correctly, and set up indexes that are optimized for the ways you want to retrieve your data, there should be no reason to wait for your data. SQL Server can easily handles millions or even billions of rows, and query any of them almost instantaneously. Even with multiple queries being executed concurrently or in succession.

Consider the following diagrams:

Architecture

The existing architecture is on the left, and the proposed structure is on the right. It’s not a big change, I’m simply suggesting that the business logic be moved from the stored procedures, into the code.

The first reaction that I usually receive from this suggestion is that the code is going to be ugly compared to the corresponding SQL. The reality is that the SQL is ridiculously complicated (at least in my experience). SQL is great for set based logic, but really starts to break down when trying to do object oriented or procedural operations.

The fact is that .NET is great at processing massive amounts of data. First of all, it’s incredibly fast. If you’re writing your code correctly, you’ll be amazed at how fast it can process data.

More importantly, if you’re using a programming language to manipulate your data instead of T-SQL, you can really start to break down the problem at hand. Databases are traditionally very bad at breaking a large problem into smaller problems. Sure, you can call other functions and stored procedures, but you can tell that it’s not the strongest feature of the database. A programming language like .NET lets you pass data around in any structure that you can conceive.

The other major advantage of processing the data in your code, is that you can now easily build testable code. Any code that you can easily test will have less bugs, and should be easier to maintain in the long run.

If your data is now being processed and organized in your code and not the database, you are probably ready to scale horizontally. It’s relatively easy to add more front end servers. Since they all hit a common database, there aren’t really any synchronization issues.

So does this really work? Of course. I’m not the first to think of it by any means. Take any of the largest websites on the Internet, and look at how they have designed a scalable architecture. Digg.com has always had database scalability issues with their MySql servers, so they try to minimize them as much as possible. Twitter uses ONE database server to handle thousands of requests per second. eBay.com doesn’t even do joins in their database. They would never be able to scale if they put that burden on a database that handles 26 billion SQL queries each day.

Microsoft has made matters worse by integrating the .NET CLR into SQL Server. You can now write .NET code that gets executed on your database server. This is a great tool, but it must be used with great care. This isn’t an excuse to write more code in your database.

In summary, databases are an expensive commodity, don’t abuse them! Be careful how much you do in the database, and ask yourself if you can move some processing into code, where it is more easily scaleable.

Just recently, Scott Hanselman came out with a podcast about website scaling. If you’re looking for more site scaling concepts, it’s worth a listen.

SQL Server NULL values and "Order By" order

I have a few tables that contain a column called "Order", which is used to sort by when retrieving the data. The purpose is to keep the data in a certain order when displayed to the end user.

Black Linen NULL
Navy Blue Linen NULL
Dark Green Linen NULL
Burgundy Linen NULL
Ivory Vellum NULL
Grey Felt NULL
Natural Linen NULL
White Coated Two Sides 1
White Cast Coated One Side 2
White SemiGloss Coated One Side 3
White Smooth 4
White Linen 5

The problem is that SQL Server puts null values above non-null values when doing an "order by". To reverse this behavior, this was the most elegant and efficient solution that I found:

Select FooValue
From foos
Order by (Case When [Order] Is Null Then 1 Else 0 End), [Order]

I found information about the original problem here, and the solution was from Tim in the comments. Thanks!

When should you use database constraints?

A discussion came up at work recently about the extent of constraint usage in your databases. There were basically 2 camps:

  1. Constrain everything humanly possible. If it’s an integer that wouldn’t normally be negative, add a ">= 0" constraint.
  2. Constrain primarily where it’s necessary to maintain referential integrity.

Consider the following diagram. It’s a map of the flow of data from your user, which eventually makes its way into the database.

Validation Layers 

Since we’re getting input from a user, and they’re the one that can fix invalid data, we validate data at the top layer. There’s usually no getting around this. In fact, for the best user experience on the web, you’re going to perform some JavaScript validation. Then you’ll probably validate it again on the server, in case they have JavaScript disabled.

At this point, unless there is a bug in your code, you’re sure that the data is valid. You may not know if it’s referentially valid. Validating the input a third time in the database is probably overkill. It’s also a potential performance bottleneck.

Yes, there are many times when this doesn’t apply. For example, when multiple systems are interacting with the same database, and one counts on the data in a certain format. The only way to guarantee you get data in a format you expect is to constrain it at the database level.

In general, I avoid strict constraints at the database level. The biggest reason is that it requires your to synchronize all of your validators. They all have to agree on the same set of restrictions, or the code will fail. That goes against the LEAN and Agile philosophies. When I want to allow negative numbers in my integer field, it’s much easier to simply change it in my application. This is amplified if you have to talk to a DBA to make changes.

Another reason to avoid constraints is that they can’t always understand the data like the application can. For example, should a constraint attempt to ensure that valid email addresses are entered? If you’re storing a persons age, do you constrain it so that it can’t go above 500? 200? 100?

iStock_000005716223XSmall

Now let’s assume that there is a bug in your application code, and you didn’t have a trusty constraint to stop it. You now have invalid data in your database. The good news is that you now have the potential to clean it up, or adapt your code to deal with it. The bad news is that if that value is used in a calculation that could have bad side effects, you could have big problems.

As with anything, there is no hard and fast rule for every situation, but we can at least make some general guidelines.

Think LEAN. Anything that doesn’t provide value to the customer is waste. Think Agile, requirements change, be adaptable.

LINQ, I’m not ready for you just yet

Today I was between features on the current project I’m working on, so I had some free time start researching some technologies I’ve been meaning to learn and start using. The topics at the top of my learning list are LINQ and MVC. I gave LINQ a few months to mature, so I figured it was a good time to investigate.

Boy Crying

(there is no emoticon to express my anger!)

The picture above shows how I felt when I started writing my first LINQ expression. The biggest problem was the fact that the latest version of ReSharper doesn’t support any .NET 3.0+ language features. Not only does it not support LINQ, it’s IntelliSense severely interrupts you while writing it. So much so that it makes it unusable.

I went ahead and downloaded the latest development build (Build 783). On their download page, it’s listed as "Works here". That wasn’t encouraging. It does work a little better with LINQ, but it’s still a steaming pile of you know what (dog poop for the not-so-smart). This further reinforces my love/hate relationship with ReSharper.

Anyway, I was eventually able to write some LINQ code. A great tool to get started is LinqPad, which is basically a query analyzer but with LINQ expressions. Writing LINQ is very difficult with a SQL background, because everything is backwards. You think you know what you’re doing, but you don’t.

Right now, we’re using NHibernate in the main project that I’ve been working on for the past couple of months. It’s amazing, but there are a couple of things that would be nice:

  • Better optimization of queries - It looks like LINQ does an amazing job with this.
  • Batched reads & writes - LINQ does batched writes, but lazy loading by default. Maybe not as big a deal as I think.
  • Cross session saving - I spent hours battling with some code that loaded a complex object with relationships in one session, and then saved them in another. It appears that LINQ solves this, but I’ll have to run some tests to be sure.
  • Less work generating mapping files and relationships.

One thing that is nice about LINQ to objects is the fact that it will generate all of the model classes, plus the glue that connects the model to the database. You can either use Visual Studio and drop the tables into a mapping file, or you can use SqlMetal to script the class generation.

One of the biggest questions I’m trying to answer write now, is how unit testing fits in with LINQ. We’re currently testing our data access layer by using an in memory SQLite database, which let’s us perform close to real world saves and loads. We also use interfaces for our data access methods, which makes it easy to create testable classes that can simply be supplied a database interface.

I’m also not sure if it even makes sense to put my LINQ queries in a data access layer. The code would almost seem trivial, and would just create a lack of flexibility. Ironically, it almost feels like you should use LINQ to query against your data access layer.

For now, there are more questions than answers. For now, I don’t plan on retrofitting my last project with LINQ, but I’m going to investigate if it will be a good foundation for the data access logic in my next project. Of course if I go that route, you’ll be sure to hear about it!

Disable constraints in "After Insert" trigger

I have a table that stores extra information (Users) that gets associated with the "aspnet_Membership" table in my application. Since my table references the membership table, I have a foreign key for referential integrity.

I added a trigger to the membership table so that rows automatically get inserted into my table. The problem is, the trigger violates the foreign key constraint! Here is the trigger code:

Create Trigger dbo.Trigger_CreateExtraUserRecord ON aspnet_Users
After Insert
As
Begin
    Set Nocount On
    Insert Into tfs_Users
    (MembershipUserId)
    Select UserId From inserted
End

As you can see, it’s an "After Insert" trigger, so the first insert will be done at this point (I have verified that).

It must be using some kind of transaction, and the foreign key is violated because it’s not committed.

The solution (not perfect, but it works), is to use this before the insert to disable the foreign key check:

Alter Table tfs_Users Nocheck Constraint All

And use this after:

Alter Table tfs_Users Check Constraint All

Instead of disabling all constraints, you could also specify the constraint name, which would obviously be better in most cases.

Does anyone have a better explanation for this behavior?