Performance overhead of tracing with Extended Event targets vs SQL Trace under CPU Load

Leistungseinbußen beim Tracing, Extended Event Ziele gegen SQL Trace unter CPU Last

(DE)

Inspiriert durch den Blog Post Post “Measuring Observer Overhead of SQL Trace vs. Extended Events” von Jonathan Kehayias, war ich neugierig, ein paar weitere Benchmarks durchzuführen. Die Resultate habe ich bereits in meinen SQL Server Master-Class Workshops zu XEvents und auf der PreCon des SQLSaturday #230 präsentiert. Nun teile ich sie endlich hier.

Vor allem folgende Aspekte interessierten mich:

Was ist der Performance-overhead einer Extended Event Session selber – ohne die Verarbeitung und des Dispatchens zu den Targets
Wie verhält sich das im Vergleich zu SQL Trace – auch das ohne Overhead für ein Ziel (Keine Datei, kein Profiler: gar kein Konsument)

Da ich einmal das Setup bereit hatte, habe ich mich entschieden, alle anderen Ziele ebenfalls aufzunehmen, sowie einige grundlegende Options-Variationen, um einen kompletten Überblick zu erhalten, der in sich vergleichbar ist.

(EN)

Inspired by the Blog Post “Measuring Observer Overhead of SQL Trace vs. Extended Events” by Jonathan Kehayias, I was curious to do some more benchmarking.
I have already presented the results in my SQL Server Master-Class workshops on XEvents as well as at the PreCon of the SQLSaturday #230. Now I am finally sharing them with you here.

I was interested in the following aspects, in particular:

What is the Performance-overhead of an Extended Event Session itself – not including the processing/dispatching for the target
How does this compare to the old-school SQL Trace – again with no Target overhead (No File, no Profiler: no consumer at all)

Besides that, once I had the setup ready, I decided to include all the other targets and some basic option-variations, to have a complete overview that is comparable in itself.

Ich entschied mich für ein System, das bereits unter CPU Druck steht – die CPU Verwendung liegt fast vollständig bei 100%.
Unter diesen Umständen hoffte ich die maximalen Auswirkungen von Extended Events („Erweiterte Ereignisse“) als auch SQL Trace (Zur Erinnerung: SQL Profiler basiert auf SQL Trace) zu erhalten. Man könnte meinen, dass eine E/A(I/O)-lastige Arbeitslast noch mehr beeinflusst werden würde, aber das hängt an den Targets für Extended Events. – 4 davon gehen lediglich in den Arbeitsspeicher, gegenüber einem File-Target (welches man ohnehin auf ein Nicht-Daten-Volume legen sollte). Und da ich auch den reinen Overhead der Trace-Technologien (ohne Ziel) messen möchte, macht das noch mehr Sinn. Abgesehen davon kann I/O in der Regel meist recht einfach getuned werden, indem man auf ein ansonsten unverwendetes und schnelles Device schreibt, wohingegen es für CPU nicht so einfach ist, die Kosten unsichtbar zu halten. Das im Hinterkopf zu behalten hilft, die Resultate, die hier präsentiert werden zu evaluieren und sie in den rechten Kontext zu rücken.

I also decided for a test of a system that’s already under CPU pressure - so the Processor Usage is almost constantly hitting 100%.
Under these circumstances I was hoping to get the maximum impact of Extended Events as well as SQL Trace. (Remember: SQL Profiler is based on SQL Trace) One could think that an I/O driven workload would be even more influenced, but that depends on the targets of Extended Events. - 4 of which only go to memory, vs. one file target (which one should put onto a non-data volume anyways). And since I also want to measure the pure overhead of the tracing technologies (without any target), this makes even more sense.
Besides that, I/O usually can be tuned quite easily by writing to an otherwise idle and fast device, whereas for CPU it is not as simple to keep the costs invisible. Keeping this in mind helps to evaluate the results presented here and bringing them into the right context.

Test Umgebung

“Einfach aber physikalisch”

1 Windows 2008 R2 Server mit SQL Server 2012 SP1. Keine laufenden Dienste abseits des minimal Notwendigem, um Seiten-Effekte zu verhindern.
Ich habe auch die Default-Trace sowie die system_health_session gestoppt.
Der Distributed Replay-Client wurde für alle Test local ausgeführt, und Ja, das hat Effekt auf den Server, aber das war gleich für alle Tests. Alle Tests wurden 3 Mal wiederholt (einige sogar öfter), um sicherzustellen, dass der Durchschnitt/Median valide ist.
Der Server, wie gesagt, war physikalisch (das kommt praktisch, wenn man Benchmarks mit einer Gesamtzeit von ~50 Stunden durchführt.): 1 Dual Core Intel Xeon mit 8 GB RAM (was bei Weitem genug war), Daten Volumen und Volumen für Trace-Targets auf separate Controllers.

Die Arbeitslast war ein Mix von meist kurzen (Sub-Sekunden) aber auch einigen Abfragen mit mehreren Sekunden Laufzeit, (Ad Hoc und gespeicherte Prozeduren), um eine einigermaßen realistische Decision-Support-System Last zu simulieren – jedoch ohne Datenänderungen.
Die Baseline Workload dauerte 00:24:24 Minuten. Die Statistiken blieben für alle Workload-Zyklen identisch. Die verwendete Datenbank war AdventureWorks2012.

Test Environment

“Simple but physical”

1 Windows 2008 R2 Server with SQL Server 2012 SP1. No running services above the minimum required, to avoid side-effects.
I also shut down the default trace as well as the system_health_session.
The Distributed Replay-client was run locally for all tests, so yes, this has effect on the server, but it was equally for all tests. All tests have been repeated 3 times (some even more) to make sure the average/median is valid.
The server, as I already said, was physical (which comes in handy when you are running benchmarks for a total of ~50 hours): 1 Dual Core Intel Xeon with 8 GB of RAM (which was by far enough), Data Volumes and Volumes for Trace-Targets on separate Controllers.

The workload was a mixture of mostly short (sub-second) but also several queries of several second runtime, (Ad Hoc and stored procedures) to simulate a somehow realistic decision-support-systems’ load – no data changes though.
The baseline workload took 00:24:24 minutes. The statistics stayed the same for all workloads cycles. The database used was AdventureWorks2012.

Trace-Konfigurationen

Ich habe mich für eine einfache, aber nicht unübliche Trace entschieden, die typischerweise viele Daten sammelt.
Die Trace sammelt nur 2 Ereignisse:

Trace-Configurations

I decided for a simple but not uncommon trace that typically collects a lot of data. The trace collects only 2 events:

Lock acquired
Lock released

Ausgefiltert: System-Sitzungen und andere Datenbanken als die, die hier unter Last steht.

Gesammelte Daten:

Filtered out: System-sessions and other databases than the one under load here.

CollectedData:

Lock:Released BinaryData
Lock:Released DatabaseID
Lock:Released TransactionID
Lock:Released SPID
Lock:Released ObjectID
Lock:Released Mode
Lock:Released BigintData1
Lock:Released IntegerData2
Lock:Released ObjectID2
Lock:Released Type
Lock:Released OwnerID
Lock:Released IsSystem

Lock:Acquired BinaryData
Lock:Acquired DatabaseID
Lock:Acquired TransactionID
Lock:Acquired SPID
Lock:Acquired Duration
Lock:Acquired ObjectID
Lock:Acquired IsSystem
Lock:Acquired Mode
Lock:Acquired BigintData1
Lock:Acquired IntegerData2
Lock:Acquired ObjectID2
Lock:Acquired Type
Lock:Acquired OwnerID

Das gab mir den Vorteil, einen validen Test für alle Extended Event Targets, die bereitgestellt werden, zu haben (lediglich ETW-Tracing wurde ausgelassen) - speziell das Histogramm und die „Ereignispaarbildung“ (Pair-Matching)(Die wenigen Ereignisse von Lock-Escalation störten mich bewusst nicht).

Die folgenden Trace Technologien und -Ziele wurden verwendet:

This gave me the advantage to have a valid test for all Extended Event Targets that are provided (only ETW-Tracing was left out) - specifically histogram and Pair Matching (I did not care about the few occasions of lock escalation on purpose).

The following Trace technologies and -Targets were used:

XEvent Trace, Target: None
XEvent Trace, Target: Ring Buffer
XEvent Trace, Target: File
XEvent Trace, Target: Counter
XEvent Trace, Target: Histogram
- Here I chose to filter on lock acquired and to bucketize on the lock mode
XEvent Trace, Target: Pair Matching
- Guess, what I matched here ;-)
SQL Trace, Target: None (I had to trick a little bit to get this to work, but I made sure, the behavior of this unsupported configuration did not distort the results: It’s just a SQL Trace with no Provider processing, so all events are lost by nature.)
SQL Trace, Target: File
SQL Profiler, remotely

Für jedes Ziel der Extended Events habe ich 4 Varianten getestet, basierend auf den Sitzungsoptionen:

For each Extended Event Target I tested 4 variations, based on the session options:

Single event loss
No event loss
Multiple event loss
Single event loss with causality tracking on

Alle anderen Optionen verwendeten die Standardwerte für diese Tests.

All other options were using the defaults for these tests.

XEvent_PerfComparison_sessions

Picture: the 24 prepared traces

Die Ergebnisse

Ich habe die Gesamtlaufzeit für die Workload sowie Batches pro Sekunde und CPU Zeit % gemessen.

Und hier ist die vollständige Tabelle mit den Ergebnissen:

The Results

I measured the total time for the workload to take as well as batches per second and CPU time %.

And here is the complete Table of Results:

XEvent_PerfComparison_Runtime_Avg_EvSec

EL = Ereignisverlust. S = Verlust einzelner Ereignisse, N = Kein Ereignisverlust, M = Verlust mehrerer Ereignisse

CT = Kausalitätsverfolgung (Causality Tracking) Ein

EL = Event Loss. S = Single event loss, N = No event loss, M = Multiple event loss

CT = Causality Tracking On

Um Zeit und Platz zu sparen, konzentriere ich mich auf die Benchmarks mit den Optionen single event loss und no event loss ohne Causality Tracking. Tatsächlich waren die Kosten von Causality Tracking weniger als 1% für alle Ziele.

Hier ist daher das komprimierte Ergebnis:

For the sake of saving space and time, I will focus on the benchmarks with the options single event loss and no event loss without causality tracking. In fact, the cost of causality tracking was less than 1% for all targets.

So this is the more condensed result:

XEvent_PerfComparison_Runtime_S_N_Avg_EvSec

Was wir demnach sagen können, ist: (Erinnern wir uns daran: alles läuft unter Hoher CPU Last. Das ist die Grundlage aller Ergebnisse.)	What we can tell from that, is: (Remember: Everything is running under High CPU Load, so that’s the base of all results.)
Eine blanke XEvent Session alleine, ohne Targets, hat bereits einen spürbaren Einfluss von ~15 % Leistungsverlust. Der Aufschlag („Overhead“) der künstlich erzeugten (nicht unterstützten) Version einer blanken SQL Trace, ohne den Aufschlag eines Rowset oder File Providers („Target“), ist nicht allzu weit von dem einer Extended Events Session ohne dem Dispatchen zu einem Target entfernt (~19% gegen ~15%). Der Ringpuffer (Ring Buffer) ist das XE-Target mit dem größten Overhead (~40%). (!) Das Ereignisdateiziel (“File-Target”) hat, wie erwartet, wirklich den geringsten Overhead, aber dieser beläuft sich immer noch auf 25%. Die Unterschiede zwischen den Targets Ereigniszähler („Counter“), Histogramm und Ereignispaarbildung („Pair Matching“) sind irrelevant. Ja, sogar der simple Counter hat einen höheren Overhead als das File Target. Das File Target mit No Event Loss zu konfigurieren, fügt fast den vierfachen Overhead hinzu. Die Unterschiede für die anderen Targets zwischen **single, multiple und sogar no event loss ist in diesem Szenario fast negierbar. Verglichen mit Extended Events ist der Overhead durch eine SQL Trace fast schon “brutal”: Er vervierfacht die Gesamtlaufzeit. Und noch mehr: Tracing mit dem SQL Profiler** ist in solch einem Szenario schlicht ein Ding der Unmöglichkeit: Ich habe 21 Stunden eingetragen, aber das ist tatsächlich eine Schätzung: Ich beendete diese Test-Zyklen nach rund 3 Stunden (man möge mir verzeihen, aber es ist schlicht Zeitverschwendung...), zählte die beendeten Events und rechnete aus, wie lange es noch gedauert hätte, grob – aber tatsächlich eher optimistisch. Profiler ist, wie ich es immer sage, ein “No-Go” um für längere Zeiträume zu tracen, vor allem, wenn die Performance ohnehin bereits leidet.	A plain XEvent Session itself, with no Targets, already does have a noticeable impact of ~15 % performance decrease. The overhead of the artificial (not supported) version plain SQL Trace, without the overhead of a rowset or file provider (aka target), is not too far off the Extended Events Session without the dispatching to any target (~19% versus ~15%). The Ring Buffer is the XE-Target with the greatest overhead (~40%). (!) The File-Target really does have the lowest overhead as expected, but it’s still 25%. The differences between the Targets Counter, Histogram and Pair Matching are irrelevant. Yes, even the simple Counter has a higher overhead than the File Target. Configuring the file target with no event loss adds approximately 4 times as much performance overhead. The difference for the other targets between **single, multiple and even no event loss is mostly negligible in this scenario. Compared to Extended Events, the overhead through SQL Trace is almost “brutal”: It quadruplicated the total runtime. And there's even more: Tracing with SQL Profiler** is absolutely impossible for such a scenario: I filled in 21 hours, but that’s actually an estimate: I stopped those test cycles after around 3 hours (forgive me, but it’s just a waste of time...), counted the finished events and calculated how much else it would have taken, roughly – in an actually rather optimistic manner. Profiler, as I always say, is a “no go” for tracing longer periods of time, especially if performance is already at stake.
Ein schönes Bild der SQL Trace Architektur findet sich in den BOL: msdn.microsoft.com/en-us/library/hh245121.aspx Die Extended Events Architektur wird hier bildlich gezeigt: msdn.microsoft.com/en-us/library/bb630318%28v=sql.105%29.aspx	You have a nice picture of the SQL Trace Architecture in BOL: msdn.microsoft.com/en-us/library/hh245121.aspx The Extended Events Architecture is depicted here: msdn.microsoft.com/en-us/library/bb630318%28v=sql.105%29.aspx

Hinzufügen von Filtern

Wie ändert das Hinzufügen von Filtern den Beobachter-Overhead?

Es ist wichtig, zu wissen, was ich als Filter verwendet habe: Die database_id / source_database_id.
- Tatsächlich habe ich einen Filter verwendet, der niemals “true” zurückgeben würde, indem ich einen Filter auf database_id = 77 angelegt habe – da keine Sitzung/Abfrage/Sperre je auf dieser nicht-existenten Datenbank lag.
Warum habe ich das gemacht? – Das ist ein rein künstlicher Test/Benchmark. Was mich interessiert, ist der reine Overhead des Tracen, und nicht, was passiert, wenn ein bestimmter Prozentsatz der Aktivität nicht protokolliert wird. (Das könnte vielleicht etwas für einen anderen Benchmark sein. :-))

Adding Filters

How does adding filters change the observer overhead?

Now it is important to note what I actually used as a filter: The database_id / source_database_id.
- I essentially added a filter that would never return “true” by adding a filter on database_id = 77 – since no session/query/lock was on this non-existing database.
Why did I do that? – This is a totally artificial test/benchmark. I am interested what the pure overhead of tracing is, and not what happens if a certain percentage of activity is not logged. (That might be something for a different benchmark. :-))

Dieses Mal konzentriere ich mich nur auf „single“ und „no event loss“.

Die Resultate des gefilterten Tracing, ohne tatsächlich irgendwelche Events zu speichern/protokollieren, ist wie folgt:

This time I only focused on “single” and “no event loss”.

The results of Filtered Tracing, without actually storing/logging any events is as follows:

XEvent_PerfComparison_Runtime_Filtered_Avg

Wie man sehen kann, fällt der Overhead der reinen XEvent Session ohne Target von ~ 15% auf 9% bzw. ~ 16,5% gg. 12,5 % (No Event loss)
Die anderen Targets zeigen in diesem Setup tatsächlich die größten Unterschiede:
- Das Counter Target mit single event loss profitiert am Meisten von dem Filter und fällt von ~33,5 auf 8,4 % Overhead.
- Nur Ring Buffer und Pair Matching zeigen keinen Unterschied zwischen single und no event loss.
- SQL Trace mit und ohne rowset Provider (Datei) als Target haben fast den gleichen Overhead – keine Überraschung, wenn man die Architektur kennt. Das ist tatsächlich ein weiterer risen Unterschied, der für Extended Events spricht, wo Filters viel früher in der Trace berücksichtigt werden.
- Das der Overhead von SQL Trace ohne Target und zugleich aller Events herausgefiltert sogar höher ist, als der Overhead ohne Target aber auch ohne Filter (von dem vorhergehenden Test-Zyklus), muss an dem Verarbeiten des Ausfilterns selber liegen. Ich werde mich damit aber nicht weiter befassen, da das Setup weder dokumentiert noch unterstützt ist.
  Und in meinen Augen ist “SQL Trace“ sowieso „tot“ ;-) – Okay, ich übertreibe hier ein wenig, aber ich bin auch ganz offen: Für SQL Server 2012 gibt es fast keinen Grund mehr zu dessen Verwendung. – Benchmarks mithilfe von Distributed Replay durchzuführen, ist so ein Grund – ist das nicht ironisch?

Für diesen Test habe ich SQL Profiler nicht ausgeführt, Pardon. Sie wissen bis hierhin wahrscheinlich schon, warum. ;-)

As one can see, the overhead of the pure XEvent session without target drops from ~ 15% to 9% respectively ~ 16.5% vs. 12.5 % (no Event loss)
The other targets do in fact show more differences in this setup:

The Counter Target with single event loss benefits most of the filter and drops from ~33.5 to 8.4 % overhead.
Only Ring Buffer and Pair Matching show no difference between single and no event loss.
SQL Trace with and without rowset provider (file) as target have almost the same overhead – no surprise if you know the architecture. This is in fact another huge difference which counts for Extended Events, where filters get honored much sooner within the Trace.
That the overhead of SQL Trace without target and at the same time all events being filtered out is actually higher than the overhead without target but no filtering (from the former test cycle) must be caused by the processing of for filtering out. I will not dive more into this though, since this setup is not documented and or supported.
And in my eyes, “SQL trace is dead” anyways ;-) – okay, I am exaggerating a bit here, but I am also just being frank: For SQL Server 2012 there is almost no need to use it any more. – Running benchmarks using Distributed Replay is such a reason – isn’t that ironic?

I did not run SQL Profiler for this Test, forgive me. You probably know why by now. ;-)

Wait-Types für Extended Events

Ein weiterer Aspekt, der mich interessierte, waren die XEvent Wait-Typen, die auftreten würden, wenn man Extended Event Sessions ausführt. (Die Warte-Statistiken sind oft die Basis für Performance-Analysen)

Allgemein sieht man folgende:

Wait-Types for Extended Events

Another aspect I was interested in were the specific XEvent wait-types which would occur when running Extended Event sessions.
(Wait-Statistics are often the base for performance analysis)

In general, you will see the following:

Beim Starten einer Sitzung:

When starting a session:

PREEMPTIVE_XE_TARGETINIT
PREEMPTIVE_XE_CALLBACKEXECUTE
PREEMPTIVE_XE_SESSIONCOMMIT

Beim Beenden einer Sitzung:

When stopping a session:

PREEMPTIVE_XE_SESSIONCOMMIT
XE_BUFFERMGR_ALLPROCESSED_EVENT
PREEMPTIVE_XE_CALLBACKEXECUTE

Während Sessions aktiv sind, sieht man:

While running sessions you will see:

XE_DISPATCHER_WAIT - From BOL: Occurs when a background thread that is used for Extended Events sessions is waiting for event buffers to process. - You should be able to safely ignore this unless you believe a problem is occurring with processing of events for async targets. Since this works on a queue you can have bursts of high wait times especially when no XEvent sessions are active.

XE_TIMER_EVENT – From BOL: Used to indicate a background task is waiting for "expired" timers for internal Xevent engine work. - You should be able to safely ignore this one. Just used by the Xevent engine for internal processing of its work. If something was possibly wrong with Xevent processing you might see if this thread ever "wakes up"

Beim Starten eines File-Targets sieht man außerdem:

When starting the File target you will also see:

PREEMPTIVE_XE_TARGETINIT

Wenn man eine Sitzung mit der No Event Loss Option ausführt, sieht man gegebenenfalls:

If you run a session with the no event loss option you might see

XE_BUFFERMGR_FREEBUF_EVENT - which by BOL means: An Extended Events session is configured for no event loss, and all buffers in the session are currently full. This can indicate that the buffers for an Extended Events session are too small, or should be partitioned.

So. Ich hoffe das war interessant. Man kann noch weitere und andere Schlüsse aus den Ergebnissen ziehen. Immer im Hinterkopf zu behalten, ist, dass das ein sehr spezielles Szenario ist, wo keine CPU Reserven zur Verfügung stehen, so das der Trace/Beobachter-Overhead sich manifestieren muss – keine Chance, als eben (CPU) Ressourcen wegzunehmen.

So, I hope this was interesting for you. You may draw more and other conclusions out of these results. Remember though: this is a very specific scenario with no CPU reserves, so the Tracing/observer overhead has to show up – no choice but to take away (CPU) resources.

Happy Tracing

Andreas

PS: I just discovered that my MCM and SQLSkills-class buddy Steven Ormrod also has recently blogged about the performance overhead from a SQL Trace to remote file share on production here: stevenormrod.com/2013/06/remote-trace-slowing-down-server/