Deep dive into SQL Server 2008

Microsoft's 'Katmai' is packed with performance and management features enterprises will love, and even small shops will have reasons to make the upgrade

SQL Server 2008, aka "Katmai," gives SQL Server shops plenty of reasons to get excited. The best SQL Server release to date, it sports more nice new features than you can count, and the improvements extend to both performance and manageability. In a few cases, such as the Resource Governor, you'll wish Microsoft had taken the functionality a little further. But whether you manage an OLTP environment, or an OLAP environment, or both, you will most likely find Katmai compelling. It easily passes my own five-point test for upgrades.

My five-point test requires a new release to bring at least five significant improvements to my environment, or it's not worth upgrading. Each improvement has to change my life in a significant way, either by dramatically shortening the time to do common tasks or by allowing me to do something I couldn't do before. For my environment, which is a large data warehouse, these five Katmai features easily pass the test: Change Data Capture, Lookup Cache, Data Compression, PowerShell integration, and Policy-Based Management. And I could easily keep counting.

How does Oracle Database compare? See the review, "Oracle Database 11g shoots the moon."

Change Data Capture and Lookup Cache will be popular with DBAs who want to speed up ETL (Extract, Transformation, and Load) processes, as will the pipeline improvements in Integration Services, enabling it to push data much faster. Data Compression, PowerShell integration, and Policy-Based Management, not to mention Server Groups, should make a big splash in almost any environment. Just keep in mind that the benefits you get from Data Compression (which works at the page level, replacing repeating data with lookup pointers) will depend on the nature of the data in your table and how it's ordered.

Backup Compression is also new in SQL Server 2008. Although the compression itself works well, the feature has too many limitations (including supporting only Enterprise Edition) to be effective in an enterprise setting. Another compression feature, called Sparse Columns, allows you to store nulls without taking up physical space. Sparse Columns will come in handy for large tables containing many null values. It's just too bad you can't use Sparse Columns and Data Compression on the same table.

Policy-Based Management allows you to define almost any configuration or administration policy you might think of for any number of servers, and be alerted whenever those policies are violated. Although spanking new in Katmai, Policy-Based Management already seems mature.

The inclusion of PowerShell will reinvent the way DBAs manage their environments by taking complicated cursors out of management scenarios. This is the debut of PowerShell in SQL Server, and there are some rough spots. But I expect they'll be ironed out soon enough. I've even heard rumors that PowerShell will eventually replace sqlcmd as the command line interface.

Page Break

Policy-Based Management and PowerShell integration are the biggies, but Management Studio has added some other nice features, such as Server Groups, which allows you to run queries against multiple servers at once. However, I'm not fond of the new Activity Monitor. It may be a step in the right direction, but it just isn't useful in its current state.

Another new feature that isn't quite there yet is the ballyhooed Resource Governor. The Resource Governor lets you define limits on CPU and memory usage for certain workloads. This is good enough to prevent some traffic jams -- like query processes consuming too many resources on an OLTP server, for example -- but it's a far cry from what's needed to define and isolate rogue queries.

In short, SQL Server 2008 is a solid release packed with significant improvements. Not all of the new features are fully fleshed out, but some are surprisingly mature and many are extraordinarily useful. A few will allow DBAs to manage multiple servers as easily as one, and perform certain tasks and operations in a fraction of the time they take today. Read on for more detail on SQL Server Management Studio and management improvements, Resource Governor, Change Data Capture and nonlogged inserts, Data Compression, Backup Compression, index improvements, SQL Server Reporting Services, and SQL Server Integration Services.

Improved management

If there's any one new feature in SQL Server 2008 that will change the way DBAs manage their environments, it's probably Server Groups. Server groups let you run any query you like against an entire group of servers simultaneously. So instead of having to cycle through all of your servers to check job status, deploy stored procedures, and so on, you can manage or push code to any number of servers with a single query. On top of that, Server Groups can be further extended and enhanced to do some great things. It does take a paradigm shift, however. DBAs will have to start thinking in terms of groups instead of single servers.

SSMS (SQL Server Management Studio) also has much better object-level stats than ever before. This is a perfectly executed feature, and I love it. In the object details pane, you can now click on an object folder and get info for all objects in that folder. For example, click the Databases folder to immediately see the key info on all the databases on the server, including size, space used, owner, index space used, and many more. This is an incredibly useful list because you can sort it and reorder the columns and your customizations persist. You can get detailed info on all of your tables too. In this new detail view, you can see row count, data and index usage, schema, and so on. Not only is it very easy to view object-level stats, but this level of info is available for all folders, not just databases and tables. This is one feature that I wouldn't change in any way.

One of the most important features included in this release is PowerShell. PowerShell is going to change how DBAs manage their environment, because it allows you to do complicated things so much easier than in Visual Basic or T-SQL.

Page Break

Let's take an example of scripting databases. You can script a database pretty easily in SSMS, but you can't schedule it. Until now, if you want to schedule a script or code the scripting process to perform the exact same options every time without having to click through the wizard or worry about making mistakes, all you could do was code a solution in VB and compile it. But not everyone has VB skills, and writing SMO (SQL Server Management Objects) is neither easy nor intuitive. PowerShell allows you to script a database in just 21 characters of code and schedule it inside an agent job. You don't have to compile it, so you don't have to keep up with separate source code. PowerShell also makes it a lot easier to work with multiple objects -- and multiple servers -- in SQL Server.

Another good example is adding user permissions to all the schemas in a database. You can code this in T-SQL, but T-SQL requires the inclusion of unsafe dynamic SQL code inside a cursor. You can accomplish the same thing in PowerShell with just a couple of lines of code, and without introducing any unsafe constructs into your environment.

Pretty blinkies

SSMS has a new Activity Monitor that provides information DBAs formerly had to gather from multiple sources. At the top is a series of live graphs where you can see useful stats such as CPU, waiting tasks, and batch requests. Under the graphs, you can see detailed information on processes, resource waits, recent expensive queries, and data file I/O. In fact, the Activity Monitor looks exactly like the resource overview from Vista.

So the Activity Monitor is presentable enough, but using it is a mixed bag. If you want just the top-level server stats from the graphs, then you won't be disappointed. The graphs load fast enough and the information is useful. The trouble begins when you start digging into the drill-down information. For starters, on a really busy system it takes too long for these drill-down areas to populate. I tested this on a 64-bit data warehouse system with 1,500 reports running and a few dozen more ad-hoc queries going on. The drill-downs took so long to populate -- more than five minutes -- that I thought they weren't working. That's not what you need in a troubleshooting scenario.

Second, much of the information you'd expect isn't there. When the drill-down info did come up, I immediately saw that my server was maintaining about 67 percent CPU -- so far, so good. But the next pieces of information you would want -- namely, memory usage and disk usage -- aren't in the Activity Monitor at all. You can see disk throughput, but that's not really going to tell you if you're having disk problems.

In short, while there are some useful server-level stats, they're incomplete enough to render them pointless. If I have to go to another tool for my memory and disk stats, then I might as well skip the Activity Monitor and use perfmon (or other tool) to begin with.

Page Break

The processes, resource waits, and data file I/O drilldowns are more useful. The resource waits monitor shows all of the server-level resource wait types and gives you stats about each one. Similarly, data file I/O puts all of the important info at your fingertips, though it does take forever to load.

Processes is really just the same sys.sysprocesses table that DBAs are used to, but with a GUI that makes it a little more user-friendly than a manual query. You can easily filter data that you want to see, which makes for easy troubleshooting, especially when you're looking for a specific user or blocking process. One beef: Although you can rearrange the columns in the display, the order doesn't persist, so the next time you open the Activity Monitor, you'll be arranging your columns again.

The recent expensive queries drill-down is almost completely pointless. Does it give you the most recent expensive queries? Yes, it does. Does it give you important stats on these queries such as CPU, number of reads and writes, average duration, and plan count? Absolutely. Does it tie these stats to a user so that you can tell who or what is performing these offending actions? No. What is a DBA supposed to do with that?

One thing you get from the GUI drill-downs that you don't get from their manual counterparts is auto-refresh. You can configure your refresh rate, but here again, your changes don't persist. A bigger issue is being able to collect the performance data you want within the refresh period. This can be a problem on busy systems that collect a lot of performance data.

For example, if you have dozens of data files on your system, it could take longer than the 10-second default refresh rate to pull the data file I/O data. And while you can increase the refresh interval to, say, one minute, the CPU graph will be refreshed only once per minute as well. Unfortunately, you can't set different refresh intervals for different monitors. It's one size fits all.

For all that it tries to be, I have to give the Activity Monitor a thumbs-down, or at least a thumbs-sideways. In its current form, it's just not useful enough to do DBAs much good. Not only did it take too long to give me the info I needed, but the info was so sketchy that I needed another tool to complete it. If I'm going to look at CPU, I'm immediately going to want to see memory and disk. And as long as I have to go somewhere else for those other metrics, I might as well get CPU there too. Sorry Microsoft, better luck next time.

Resource Governor

Microsoft touts the new Resource Governor as one of the biggest enhancements in this release, and there's no question it will come in handy for shops that want to keep some processes from interfering with others. For example, one of the best fits for a resource governor would be on an OLTP system where you're forced to run reports against live OLTP data. You don't want the reports to get in the way of transactions, so you would push them to a group that caps their resources.

Page Break

Still, this is Microsoft's first stab at granular resource management, and while it's a nice start, it's not going to be as immediately useful as many shops hope. First of all, you can specify only two resource measures to define a resource group: CPU and memory. You can specify min and max values for each one, but that's not nearly enough to define a rogue process on a busy system. Rogue processes take many forms, and capturing them requires additional metrics (number of I/Os, amount of time, etc.) beyond CPU and memory usage.

Further, once you define a process to be in a particular resource group, it stays there. Misbehaving queries cannot be moved dynamically to a different group, as you can do in Oracle Database. In order to assign a query to a different resource group, you would have to kill the process, assign it to the group, and then restart the Resource Governor. Something tells me you won't want to do that often on a really busy system.

Finally, you have to be careful with how you set up your groups. For example, if you have three groups and they all have a maximum 50 percent of the memory defined, then there's obviously going to be a problem. So there have to be rules that define how much everyone gets; you just have to be careful about how you define your groups and monitor them to make sure they're giving you what you want.

It's difficult not to compare Microsoft's Resource Governor with Oracle's. Oracle Database has had the feature longer, and it's certainly richer and more mature. But there's one advantage to Microsoft's you just can't ignore: It's included in the license, while Oracle's is a fairly expensive add-on.

Change Data Capture and nonlogged inserts

Change Data Capture (CDC) and nonlogged inserts are two nice enhancements to the SQL Server engine that will make a world of difference for some shops. These enhancements mainly revolve around warehouses, but can be useful on some OLTP systems as well. CDC will alter the way developers write ETL processes for warehouses. CDC employs the same log reader used by replication to write the specifics of DML (Data Manipulation Language) operations to a table instead of a distribution database. This allows warehouses to be very easily loaded from tables that only contain changed data -- no more having to modify legacy applications or ETL processes with update columns in all the tables to provide this functionality. And you can track with perfect precision the changes made to data in your tables without altering your schema or performing a messy and expensive query to filter out the differences.

Retrieving the changes from the CDC tables is very easy also. Microsoft has provided a set of functions that retrieve the data, so all you have to do is call them. As you would expect, both Type 1 and Type 2 data can be stored, so you can have the exact level of change data you want.

Page Break

Nonlogged inserts allow SQL Server to do minimal logging for large inserts. By not logging every data row inserted, you can speed up your data loads by orders of magnitude in some cases. At the very least, you will see significant performance gains.

You could do nonlogged inserts in previous versions of SQL Server using what's known as a "select into" statement, but "select into" actually creates the table for you. This is great for new tables, but if you have existing tables with security and other attributes defined, you don't want to delete and re-create them. Another problem with "select into" is that you have no control where your new table goes; it always joins the default file group.

The new nonlogged insert allows you to insert the rows into an existing table, maintaining the control you have over your space, performance, and security, and still get the benefit of an ultra-fast data load. The general rule of thumb I like to follow for warehouses (and any database, come to think of it) is no moving schema. That means you don't want to delete and re-create permanent objects every day. It's error prone and introduces complexity into your system you don't need, not to mention your tables aren't available during the load because they've been deleted.

In my testing, non-logged inserts performed right on par with their "select into" cousins. As the DBA of a large warehouse that faces data loading problems every day, I love this feature. There's just one thing I would change about it: I wish there were an option I could tack onto my insert statement to throw it into non-logged mode.

Data Compression

SQL Server's first attempt at Data Compression actually looks pretty good. As I noted in my beta preview (see "Microsoft's 'Katmai' is filled to the brim"), SQL Server 2008 provides two types of compression: row and page. Row compression is true compression, in which unused spaces at the ends of columns are removed to save storage. Page compression, aka dictionary compression, normalizes the data on each page and keeps a lookup pointer. In SQL Server 2008, page compression includes row compression. If you have page compression turned on, you get row compression in the bargain.

Microsoft provides a handy compression calculation wizard that will give you a good estimate of the benefits you can expect. The wizard runs a test compression scenario against your data for each compression type (row and page) and tells you what the new size of the table should be. I tested the compression calculator against a number of data sets, and on average the calculation deviated from my final results by only 1 or 2 percent. That's pretty good, considering that the calculation is based on a relatively small amount of data.

Page Break

How much will a table compress? That depends on both the type of data and the type of compression. If your data is something like SAP data, which has a lot of trailing spaces, then row-level compression will serve you well. If you have a lot of repeating data, then table-level page compression will be your best friend.

The order of your data matters also. Because SQL Server compresses at the page level, repeating data that is clustered together will be de-duplicated. So it pays to be aware of both the nature and the order of your data.

All that said, just as with Advanced Compression in Oracle Database 11g (see the review), my results with Data Compression in SQL Server 2008 were all over the map. I got as little as 17 percent compression on some data sets, and as much as 76 percent on another. The high of 76 percent was reached on denormalized order line data in a data warehouse.

When it comes to Data Compression, SQL Server has one advantage over Oracle: SQL Server will compress existing tables in a single statement. In Oracle Database 11g, you have to create a new table, insert the data, and then delete the old table. Not only does SQL Server spare you from creating an additional copy of the data, but you can still perform operations on the table during the compression operation, no downtime required.

An excellent new feature called Sparse Columns allows you to store null values without taking up any physical space. But here Microsoft made a big mistake: Sparse Columns aren't compatible with Data Compression. If you define Sparse Columns on a table, you can't also apply Data Compression. As I mentioned in my preview, this one may be worthy of a Darwin Award. Sparse Columns and Data Compression would be a perfect match. Let's hope Microsoft fixes this soon.

Backup Compression too

SQL Server 2008 also introduces Backup Compression. In some preliminary tests, I found the compression ratios to be mostly on par with those of Quest LiteSpeed and other third-party backup tools I've used. Still, SQL Server's Backup Compression has a number of strikes against it. First, Backup Compression is only available for the enterprise edition. Even in the enterprise, that's just going to make things harder to manage. Chances are the lion's share of SQL Server instances in any sizable organization aren't enterprise edition, and for these instances to be the exception in the backup plan just isn't acceptable.

Page Break

Look at it like this: Database size isn't the reason you've chosen enterprise edition. Because most versions of SQL Server can handle unlimited data size, enterprise edition is generally chosen for the features. As a result, you're likely to have plenty of SQL Server boxes in your shop with very large data sets that aren't on enterprise, and they need compressed backups too, which means you probably already have a third-party backup solution in place. To switch backup routines for a small subset of your SQL Server boxes just doesn't make sense. You'll surely want to have the same solution for your entire environment.

Second, unlike SQL Server, the third-party backup solutions have object-level restore, which can come in very handy in a number of situations. If you go with SQL Server's Backup Compression for your enterprise servers, you're losing functionality.

Third, the third-party solutions have centralized repositories and provide centralized reporting and alerting. So if you use SQL Server's native compression, you've effectively eliminated centralized management of backups for those boxes.

There are other features that third-party backup solutions bring to the table, but incomplete coverage, object-level restore, and centralized backup management are the biggest reasons that SQL Server's Backup Compression isn't going to be viable for most shops.

Index improvements

Indexes have received an overhaul. Not only can indexes be compressed in SQL Server 2008, but you can also build filtered indexes. Filtered indexes have a "where" clause, allowing you to partially index a large table. This may not seem very useful at first, but there are situations where it's very beneficial.

For example, let's look at a Sparse Column situation. Say you have 400 million records in a table with a significant number of null values, and you've defined a Sparse Column, so you're not taking up space for all of those nulls. Well, if you don't want to take up storage space with null values, then you surely don't want to take up index space with them either. Here you would define a filtered index on that sparse column where the value is not null so that only rows with actual data in them are indexed.

Not only can you save a lot of index space with filtered indexes, but the queries that use those indexes will be faster because they're running against a subset of the entire data set. You also shrink the maintenance window and re-index space for the index.

Page Break

Another good use of filtered indexes might be if you have, say, ten years of data in a table but nobody ever queries further than two years back. You could create a filtered index for just those two years and get the same benefits as in the first scenario.

SQL Server Reporting Services

SSRS (SQL Server Reporting Services) has received a major overhaul. For starters, it no longer relies on IIS for its report server capabilities. This is a huge leap forward for SSRS shops because as I've found many times, it can be difficult to get approval for Windows changes, such as installing IIS. Often, these shops don't like to make major changes to a Windows install, and adding a component such as IIS, which is perceived to be difficult to secure, makes them doubly hesitant; not requiring IIS is a good thing.

Another major enhancement, called Tablix, is a new data element type that combines table, matrix, and list. Tablix gives you an easy way to work with grouping by allowing you to group on columns or rows as well as define adjacent or parent/child groups. All of this sounds fairly intuitive, but it does take a little training. I quickly found that building groups the way I did in the previous version left me out in the cold. I had to slightly change the way I authored reports to be successful.

Report Builder has come a long way as well. Microsoft has removed a number of limitations that could make it difficult to work with. Now, you can do almost anything in Report Builder that you can do in Visual Studio, and Report Builder comes with wizards that make formatting data and charts automatic. I have almost no reason to write reports in Visual Studio anymore. This new version of Report Builder (2.0) wasn't available when Katmai shipped, but it is available as a separate download now.

SQL Server Integration Services

Microsoft has taken some big steps to ensure that SSIS (SQL Server Integration Services) can compete directly with competitors such as Informatica by offering lookup caching and CDC. And while CDC isn't strictly an SSIS enhancement, SSIS does take advantage of it to increase the speed of ETL processes. In fact, both lookup caching and CDC can improve the speed of loads by orders of magnitude. Let's examine a couple situations where you'll see dramatic performance gains.

Perhaps you're loading customer data, and you have to load each customer one at a time because you have to look up other information and make load decisions based on that lookup data. In SQL Server 2005, you would put this load inside a "for each" loop container and perform the lookup on each customer as you iterate through the loop. This means that you would perform the lookup itself for every customer. Well, if you have a large lookup table -- say, something in the neighborhood of 500 million rows -- you would have to query that data for every row you import. And if the lookup table is not on the same system or if it's not indexed appropriately for your lookup, those lookups could take a lot of time.

Now, in SQL Server 2008, you can perform that lookup once for all the customers and cache the data in a local file. The upshot is that iterating through the loop will go much faster.

Another scenario where SSIS improvements come into play is in handling changed data. Let's say a customer address change needs to be reflected in the database you're loading. In SQL Server 2005, you have to perform a binary or text comparison, on a row-by-row basis, for each column you want to track. This means that the larger your customer table, the longer your lookup will be. But with CDC, you can easily see which rows and columns have changed and avoid the lookup altogether. Here again, you could reduce the time to load tremendously.

There are other enhancements in SSIS as well. Improved threading in the data pipeline will increase performance by better managing the resources and throttling the pipeline appropriately. There's also a new data-profiling task that makes it easier to identify any data purity issues. Finally, SSIS scripts can now be written in both VB and C#.