Category: "SQL Server Administration"

Calculating SQL Server Data Compression Savings

SQL Server 2008 Enterprise edition comes with an opportunity for storing table or index data in a compressed format which may save huge amount of storage space and - much more important - IO requests and buffer pool utilization. There's two different options for data compression, namely Row and Page level compression. This blog post is not concerned with how these two work internally and will also not explain the differences between the two. If you like to know more about this, you find much of useful information on the internet - including links to further articles (e.g. here, here, and here).

Whether compression is worth or not isn't an easy question to answer. One aspect that has to be taken into account is certainly the amount of storage that may be saved by storing a distinct table or index in any of the two compressed formats. SSMS offers a Data Compression Wizard than can provide storage-saving estimates for row or page level compression. From the context menu for a table or index just open Storage/Manage Compression. In the Combo box at the top select the compression type (Row or Page) and press the Calculate Button at the bottom. Here's a sample of a calculated saving for an index:

image

Unfortunately, SSMS does not offer an option for calculating estimated savings for more than one table or index at once. If you, let's say, would like to know the estimated storage savings of page level compression for you largest 10 tables, there's no GUI support in SSMS that will assist you in finding an answer. This is, where the stored procedure sp_estimate_data_compression_savings comes in handy. This procedure - as you may have guessed from its name - provides estimated savings for row or page level compression for any table or index. You have to provide the table or index as a parameter to the procedure. In other words: The procedure will only calculate the estimations for one table or index at a time. If you want to retrieve the calculations of more than one table or index as a result set, there's some more work to do, since the procedure has to be invoked multiple times. Here's a script that calculates the estimated savings of page level compression for the database in context.

-- Determine the estimated impact of compression
-- NOTE: This script is only for SQL Server Enterprise and Developer edition.

set
nocount on

-- We create a temp table for the result
if (object_id('tempdb..#comp', 'U') is not
null)
drop table #comp
go
create table #comp
(
object_name
sysname
,schema_name
sysname
,index_id
int
,partition_number
int
,[size_with_current_compression_setting (KB)]
bigint
,[size_with_requested_compression_setting (KB)] bigint

,[sample_size_with_current_compression_setting (KB)] bigint

,[sample_size_with_requested_compression_setting (KB)]
bigint
)
go

-- Calculate estimated impact of page level compression for all
-- user-tables and indexes in all schemas.
-- NOTE:
-- 1) To get the estimated impact of row level compression change the last parameter
-- of sp_estimate_data_compression_savings to 'row' instead.
-- 2) We don't care about partitioning here. If this is important for you,
-- you have to modify forth parameter of sp_estimate_data_compression_savings.
-- Please refer to BOL.

declare @cmd nvarchar(max
)
set @cmd =
''
select @cmd =
@cmd
+
';insert #comp exec sp_estimate_data_compression_savings '''
+ schema_name(schema_id)+''','''
+ name + ''',null, null, ''page'''
from sys.
tables
where objectproperty(object_id, 'IsUserTable') = 1
exec (@cmd)

;
-- Do some further calculations for a more meaningful result
with
compressionSavings as
(
select quotename(schema_name) + '.' + quotename(object_name) as
table_name
,
index_id
,
[size_with_current_compression_setting (KB)]
,
[size_with_requested_compression_setting (KB)]
,cast(
case
when [size_with_current_compression_setting (KB)] = 0 then
0
else 100.0*(1.0-
1.0
*
[size_with_requested_compression_setting (KB)]
/[size_with_current_compression_setting (KB)]
)
end as decimal(6,2)) as
[Estimated Savings (%)]
from #comp
)
select cs.
table_name
,isnull(i.name, i.type_desc) as
index_name
,cs.
[size_with_current_compression_setting (KB)]
,cs.
[size_with_requested_compression_setting (KB)]
,cs.[Estimated Savings (%)]

from compressionSavings as
cs
left outer join sys.indexes as
i
on i.index_id = cs.
index_id
and i.object_id = object_id(cs.table_name, 'U'
)
order by cs.[Estimated Savings (%)]
desc

-- Get rid of the temp table
drop
table #comp
go

The script calculates the impact of Page level compression but may easily be adapted to consider Row level compression instead. Please read the comments inside the script. Also, please notice that the script will only run on SQL Server Enterprise and Developer edition. All other editions don't provide the opportunity for data compression.

Here's a partial result retrieved from running the script against the AdventureWorksDW2008R2 database.

image

If you execute the script, please be aware that it may produce some extensive I/O. Running the script against your production database at business hours wouldn't be a very good idea therefore.

SQL Server Start Time

Have you tried finding out the time, your SQL Server instance has been started? There are some sophisticated solutions, like the one from Tracy Hamlin (twitter), which takes advantage of the fact that tempdb is re-created every time, SQL Server starts. Her solution goes like this:

select create_date
from sys.databases where database_id=2

Another answer to the question, I've seen sometimes on the internet queries the login time for any of the system processes:

select login_time
from sys.dm_exec_sessions where session_id=1

This was my preferred way - until yesterday, when I discovered the following simple method:

select sqlserver_start_time
from sys.
dm_os_sys_info

Easy, isn't it? Interestingly though, every of the above three queries yields a different result. Here's a query with a sample output:

select (select sqlserver_start_time
from sys.dm_os_sys_info) as
sql_server_start_time
,(select
create_date
from sys.databases where database_id=2) as
tempdb_creation_time
,(select
login_time
from sys.dm_exec_sessions where session_id=1) as sysprocess_login_time

Result:

image

It seems the SQL Server service must be started first. Only after the service is running, tempdb is created followed by a subsequent start of all sysprocesses. I can't imagine that the diverse three times make any difference in practice, e.g. if you try finding out for how many hours your SQL Server instance is running. But out there may be existing applications that have to be aware of the difference.

How useful is your backup?

A backup is worth nothing, if you can't utilize it for restore.

You probably agree with this well known word of wisdom, don't you?

This week I had to learn another aspect - the hard way: There are many cases that may require a restore. Only one of those cases is recovering a database from a state of failure. Another situation may require access to legacy data of a meanwhile deactivated database, a database that doesn't exist anymore. In this case only a successful restore from an older backup, although necessary, may not be sufficient. Here's the story, why.

I was called by a customer who requested me to pull out two documents from a legacy (SQL Server 2000) database. We found an 8 year old backup we could rely on and restore worked well. Great! But then we discovered the documents had been stored in an IMAGE column and nobody had an idea how it had been encoded at the time it was stored. We soon recognized that we need the original application to get access to the documents, only to recognize that nobody had an idea where to find the installation package. Eventually I could find an 8 year old backup which included the old VB6 code. I could have used this code to rebuild the application and also an installation package only with considerable difficulties, because:

  • The application uses some ActiveX components that I may not be able to find anymore.
  • The original application run on Windows NT, so I may have to install this OS first, including the required service pack. Even if I find the installation CDs; I doubt, I will be able to find the appropriate device drivers but maybe, I can set up a virtual machine.
  • I had to install Visual Studio 6, including the latest service pack and I had no idea where to find the installation CDs.
  • The application may rely on deprecated SQL Server 2000 features. So I have to have a SQL Server 2000 installation on which the existing backup has to be restored. If it gets worse, I might even have to install the appropriate service pack to make the application working. I have no idea which service pack this would be, so there's a chance I have to experiment.

I recognized that all those steps require a big effort and will take some days to accomplish. If only we had built a virtual machine of at least one legacy client system before replacing all client PCs by newer machines.

So, I'd modify the introductory statement like this:

A backup is worth nothing, if you can't utilize the data that is contained inside this backup.

Multiple statistics sharing the same leading column

SQL Server will not prevent you from creating identical indexes or statistics. You may think that everything is under your control, but have you ever added a missing index that has been reported by the query optimizer, or the DTA? If so, you will certainly have created duplicate statistics. Have a look at the following example:

-- create test table with 3 columns
-- filler is only nedded to widen the row

create
table T1
(

c1
int not null identity(1,1) primary
key
,c2 int not
null
,filler nchar(200) not null default
'#'
)
go

-- insert 500000 rows with random
-- values between 0 and 49999 for c2

insert T1(c2
)
select top(500000) abs(checksum(newid())) %
50000
from sys.fulltext_system_stopwords as
a
,sys.fulltext_system_stopwords as b

Before invoking the following SELECT command, allow the actual execution plan being displayed (Strg-M or selecting Query/Include Actual Execution Plan from the menu).

-- get some data
select * from T1 where c2 = 100

If you look at the execution plan, you see that the optimizer complains about a missing index on column c2. The prognosticated improvement is about 99%, so adding the index it's certainly a good idea. Let's do it:

-- add missing index
create nonclustered index ix_1 on t1(c2)

Perfect! The query runs much faster now and needs a lot fewer resources. But have a look at the table's statistics:

image

You see three statistics, one for the primary key, a second one for our created index ix_1, and a third one that was automatically created during execution plan generation for the first SELECT statement. This is the statistics, named _WA_Sys.. If the AUTO CREATE STATISTICS option is set to ON, the optimizer will add missing statistics automatically. In our little experiment, the optimizer had to generate this column statistics on column c2 in order to make some assumptions about the number of rows that had to be processed.

And here's the problem: When creating the index on column c2, a statistics on this column is also created, since every index has a corresponding linked statistics. That's just the way it works. At the time the index was added, the column statistics on c2 (that _WA_Sys. statistics) already existed. If you don't remove it manually, this statistics will remain there forever, although it is needless now. All it's good for is to increase maintenance efforts during statistics updates. You can safely remove this statistics by executing:

drop statistics t1._WA_Sys_...

If you didn't think about this before, there's a chance that you'll find some of those superfluous statistics duplicates inside your database(s). Here's a query that finds index-related and column-statistics that match on the first column. Looking for matches on the first column is sufficient here, since the optimizer only automatically adds missing single-column statistics.

with all_stats as
(
select s.object_id, s.name as stats_name, s.stats_id, s.
has_filter
,s.auto_created, sc.
stats_column_id
,sc.column_id, c.name as
column_name
from sys.stats as
s
inner join sys.stats_columns as
sc
on s.stats_id = sc.
stats_id
and s.object_id = sc.
object_id
and sc.stats_column_id =
1
inner join sys.columns as
c
on c.object_id = s.
object_id
and c.object_id = sc.
object_id
and c.column_id = sc.
column_id
where objectproperty(s.object_id, 'IsUserTable') = 1
)
select row_number() over(partition by s1.object_id, s1.column_name order by s1.column_name) as
stats_no
,object_name(s1.object_id) as
table_name
,s1.stats_name as
stats_name
,s2.stats_name as
identical_stats_name
,s2.
column_name
from all_stats as
s1
inner join all_stats as
s2
on s1.object_id = s2.
object_id
and s1.stats_id != s2.
stats_id
and s1.stats_column_id = s2.
stats_column_id
and s1.column_id = s2.
column_id
and s1.has_filter = s2.
has_filter
and s1.auto_created != s2.auto_created

With that query at hand, you may easily find redundant statistics. Here's a sample output:

image

If you'd like to find out more about SQL Server statistics, you may want to check out my series of two articles, published recently on the Simple-Talk platform.

Part 1: Queries, Damned Queries and Statistics
Part 2: SQL Server Statistics: Problems and Solutions

It's free, so you might want to give it a try. Feel free to vote, if you like it!

How to treat your MDF and LDF files

Have you ever set a database to read only? If so, you probably did this by using SQL Server Management Studio or by by executing the regarding ALTER DATABASE command.

Here's another method that I had to investigate recently: I colleague of mine with some limited knowledge of SQL Server didn't discover the ALTER DATABASE statement so far. But he knew how to detach and attach a database, because this is what he does all the time in order to copy databases from one computer to another. One day, when he wanted to prevent modifications to one of his databases, he decided to protect the MDF- and LDF- files of these database. Very straightforward he detached the database, set the MDF- and LDF-file to read only mode (by using Windows Explorer) and attached the database after completing this. Voila: SQL Server does not complain at all (I was very surprised about this) and as he expected, the database was displayed as read only in the Object Explorer of SSMS.

As I said: I was very surprised, since I didn't expect this method would work. Smart SQL Server! But then the trouble began.

Very shortly, after a system reboot, SQL Server started showing the database in question as "suspect". What happened? I don't know, but I was able to reproduce the behavior with SQL Server 2008 on Windows Server 2008 R2 every time I repeated the following steps:

  • Create a database
  • Detach the database
  • Set the MDF- and LDF-file to read only
  • Attach the database again
  • Restart the computer

So, I think you should be careful with any modifications to MDF- and LDF-files outside SQL Server. This not only seems to be true for data itself but also for these files' attributes. You should always treat MDF- and LDF files as kind of SQL Server's exclusive property and never touch them!

Just another point to add here: If you try setting a database with read only MDF- or LDF files to read write again by executing ALTER DATABASE, you'll get an error like this:

Msg 5120, Level 16, State 101, Line 1
Unable to open the physical file "C:\SqlData\User\db2.mdf". Operating system error 5: "5(Access is denied.)".
Msg 5120, Level 16, State 101, Line 1
Unable to open the physical file "C:\SqlData\User\db2_log.LDF". Operating system error 5: "5(Access is denied.)".
Msg 945, Level 14, State 2, Line 1
Database 'db2' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.

But that's as expected, I'd say.