The Columnstore Indexes & Batch Mode Processing (R)Evolution in SQL Server 2014

Alias "Apollo ist gelandet"

(Apollo war der Codename für das Columnstore Index-Projekt vor SQL Server 2012)

(DE)

Während In-Memory Processing mit XTP die Grundlage für die Veränderung in der Verarbeitung von OLTP im SQL Server 2014 formt (Update 09-2013: XTP werde ich auf der TechNet Conference erstmalig in Deutschland präsentieren), wurde für OLAP/Datawarehouse-Systeme die Columnstore Indexe fundamental verbessert. Tatsächlich sind die Neuerungen so gewaltig, das es sich vorhersehen lässt, das Columnstore (spaltenorientiertes Speichern) mit dem SQL Server 2014 der Standard Speichertyp für DataWarehouses wird (zumindest für Systeme, welche die Enterprise Edition verwenden).

Beginnen wir mit einem Blick auf das, was wir mit SQL Server 2012 erhalten und zugleich vermisst haben, als ColumnStore Indexe eingeführt wurden:

aka "Apollo has landed"

(Apollo was the Codename for the Columnstore Index project before SQL Server 2012)

(EN-US)

Whereas In-Memory Transactional Processing with XTP builds the foundation for the shift in processing OLTP in SQL Server 2014, for OLAP/datawarehouse-systems Columnstore Indexes have been improved fundamentally. In fact the improvements are so huge, that it can be foreseen that Columnstore will become the standard type of storage for DataWarehouses starting in SQL Server 2014 (at least for systems running Enterprise Edition).

Let’s start with what we had and missed at the same time so far, in SQL Server 2012 when Columnstore Indexes were introduced:

Ein Non-clustered columnstore index (ab hier jetzt abgekürzt als NCCI) pro Tabelle – der geclusterte war weiterhin row-basiert
Keine DML Unterstützung: keine Updates (Datenaktualisierung) – größte Einschränkung für nahe-Echtzeit DataWarehouses und ständige Aktualisierungen
Mittelmäßiges memory management – z.B. wurde der Resource Governor nicht beachtet
Kein Batch hash join spilling
Eingeschränkte Datentypen Unterstützung – ok, Ich persönlich habe wenig Probleme damit, da es Entwickler dazu animiert/zwingt, Spalten-Datentypen und Tabellendesign bewusst zu wählen (wir erinnern uns, das, Columnstore Indexe hauptsächlich für Faktentabellen in DataWarehouses gedacht waren), aber das ist nur meine persönliche Meinung dazu :-)
Eingeschränkte Batch Operationen unterstützt– ein nicht unterstützter Operator führte zu row-mode-Verarbeitung der gesamten Abfrage

One Non-clustered columnstore index (I’ll refer to that as NCCI from now on) per table – the clustered was still row-based
No DML support: no updates (data refresh) – biggest bummer for close to real-time DataWarehouses and continuous updating
Mediocre memory management – i.e. Resource Governor not honored
No batch hash join spilling
Limited data types support – ok, I personally have little problems with that as it forced developers to wisely chose column-types and table design (remember, that columnstore indexes were mainly meant for FactTables in DataWarehouses), but that’s just my personal opinion on that :-)
Limited batch operations supported – one unsupported plan operator led to row-mode processing of the whole query

Aus dieser Liste bereiten sicherlich die folgenden Dinge die größten Probleme unter SQL Server 2012: Updates auf columnstore indexierte Tabellen sind nur mittels drop NCCI/Daten laden/create NCCI oder partition switching möglich und zum zweiten: viele Abfragen erhalten nicht einmal den vollständigen möglichen performance-Schub durch den NCCI, weil eine nicht unterstützte Operation im Abfrageplan stattfindet.

Auf der TechEd Europe 2013 in Madrid wurde die CTP1 des SQL Server 2014 als öffentlich verfügbar verkündet (www.insidesql.org/blogs/andreaswolter/2013/06/ctp1-sql-server-2014-release-techedeurope ) und da detaillierte Informationen über die gewaltigen Verbesserungen endlich offiziell sind, kann ich die guten Neuigkeiten mit meinen Lesern teilen (Ein großes Dankeschön an Igor Stanko vom Parallel Datawarehouse-Team bei Microsoft, der alle Fragen im Detail beantwortet hat, während er mit mir und anderen MCMs und MVPs am Microsoft Stand "gefangen" war;-) )

Zu den Fakten:

Out of that list, probably the biggest issues in SQL Server 2012 are, that updates to columnstore indexed tables are only possible by drop NCCI/load data/create NCCI or partition switching and secondly: many queries actually do not get even get all of the possible performance gain from the NCCI because of an unsupported operation inside the query (plan).

At the TechEd Europe 2013 in Madrid CTP1 of SQL Server 2014 was announced publically available (www.insidesql.org/blogs/andreaswolter/2013/06/ctp1-sql-server-2014-release-techedeurope ) and as detailed information on the big improvements are finally official, I can share the good news with my readers (Big Thanks to Igor Stanko from the Parallel Datawarehouse-Team at Microsoft who answered all questions in detail, while "bound" with me and other MCMs and MVPs at the Microsoft booth ;-) )

To the facts:

Im SQL Server 2014,

Können wir einen Clustered Columnstore Index (CCI) anlegen
sind Clustered Columnstore Indexes mit Standard DML-Operationen aktualisierbar! – das funktioniert über eine delta store und einem tuple mover, der ein einmal volles Segment in columnstore konvertiert
ist Memory Management dynamisch und Resource Governor Einstellungen werden beachtet
unterstützt (Hash Join) Spilling den Batch Mode
funktioniert ein Mix von row und batch mode innerhalb einer Abfrageausführung
haben wir mehr unterstützte Operatoren im Batch Mode (Outer Join, partial Aggregates, Union All)
erhalten wir mit dem neuen Komprimierungstyp
“COLUMNSTORE_ARCHIVE” weitere 30% Komprimierung für die ausgewählte Tabelle/Partition
werden mehr Datentypen unterstützt (decimal mit hoher Präzision, (var)binary, uniqueidentifier)

In SQL Server 2014,

We can create a Clustered Columnstore Index (CCI)
Clustered Columnstore Indexes will be Updatable with standard DML-Operations! – this works via a delta store and a tuple mover, that converts a once full segment into column store
Memory management is dynamic and honors resource governor settings
(Hash Join) Spilling now supports Batch Mode
Mixed row and batch mode inside query execution now works
We have more supported operators in batch mode (Outer Join, partial Aggregates, Union All)
The new compression type: “COLUMNSTORE_ARCHIVE” adds another 30% compression for chosen table/partition
More data types are supported (high precision decimal, (var)binary, uniqueidentifier)

Clustered ColumnStore Index Insert:
Clustered ColumnStore Insert

– Diese Verbesserungen in Kombination können schwerlich überschätzt werden, da sie letztlich zu einer Schlussfolgerung führen:

“Columnstore (und clustered columnstore Index) wird die BEVORZUGTE Storage Engine für DW Szenarien”

(das ist ein original Microsoft-Zitat - und ich habe keinen Grund, das irgendwie anders zu auszudrücken)

Hinweis: Das Clustered Columnstore Index Feature wurde von dem Parallel Data Warehouse (PDW)-Team für die PDW Version 2012 entwickelt, welche Anfang 2013 released worden ist. (Danke an Henk van der Valk für die Erinnerung an diesen nicht unerheblichen Fakt.)

Und hier ist ein Vergleich der möglichen Speicherplatz-Ersparnisse bei der Verwendung von CCI gegenüber allen anderen Indexierungs-Methoden bisher (herausgegeben auf der TechEd 2013 - ich lasse das Diagramm für sich selbst sprechen):

– Those improvement combined can hardly be overestimated as they ultimately lead to one conclusion:

“Columnstore (and clustered columnstore index) will be PREFERRED storage engine for DW scenarios”

(this is an official quote from Microsoft - and I see no need to say it any differently)

Note: The Clustered Columnstore Index feature is was actually developed by the Parallel Data Warehouse (PDW)-Team for the PDW Version 2012 which was released in the beginning of 2013. (Thanks, Henk van der Valk for reminding me of that not so insignificant fact.)

And here is a comparison of the possible space savings when using CCI over any other indexing methods so far (released at TechEd 2013 - I'll let the graph speak for itself):

Nach all den guten Neuigkeiten: Ist da noch etwas, was uns fehlt, bei der Verwendung von ColumnStore? Ich möchte auch die wenigen Dinge, die noch schmerzen können, nicht vergessen:

keine Constraints, die Indexe verwenden, werden unterstützt, noch nicht einmal Fremdschlüssel ohne Index. Wenn solchermaßen benötigt wird, muss man zum hybriden Ansatz mit NCCIs wechseln und Partition switching oder eine andere Methode wie zuvor verwenden.
Auch DML-Trigger sind mit CCIs nicht möglich.
Nur der Vollständigkeit halber: keine anderen Indexe sind möglich, wenn ein CCI auf einer Tabelle vorliegt. Aber dafür wäre auch keine Verwendung mehr, wie der aufmerksame Leser bis hier sicherlich erkannt hat :-)

After all the good news: Is there anything that we might still miss when using ColumnStore? I do not want to forget to mention the few things that might still hurt:

No constraints using index are supported, not even Foreign Keys without index. So if such is needed, one has to revert to using a hybrid approach with NCCIs and using partition switching or any other of the methods as before.
Also DML-Triggers are not possible with CCIs.
Just for completeness: no other indexes are possible when a CCI is present on a table. But there is no use for it anyways, as the attentive reader might have noticed by now :-)

Als Datawarehouse-Architekt bin ich begeistert und freue mich auf die ersten Implementierungen von DataWarehouses unter SQL Server 2014. Das Index-Design wird viel einfacher, in einigen Umgebungen sogar trivial. - Man möge beachten, dass ich nicht "Database-Design" schrieb: Ein geeignetes Datenbank-Design (Hinweis: "Star-Schema") ist weiterhin notwendig um die maximal mögliche Performance herauszuholen - nicht vergessen, dass Batch-Mode-Verarbeitung und Segment-elimination die Schlüssel zum Maximum sind.

So, as a datawarehouse-architect I am really excited and looking forward to the first implementations of Datawarehouses under SQL Server 2014. The index-design will become much easier, in fact in many environments even trivial. - Note, that I am not using the word “database-design” itself: A proper database-design (hint: “star-schema”) is still necessary to get the maximum performance gain possible – don’t forget that batch mode processing and segment elimination are the keys to the max.

happy indexing

Andreas