How to deal with information overload in flamegraphs?

Andrew Grimm raised a question on Stack Overflow about how to deal with information overload in flamegraphs.

In essence, just looking at the picture can be very overwhelming.

Here is how I go about interpreting results:

Remember that Rack introduces very deep call stacks

The architecture of Rack means that each middleware calls the next middleware in the chain, never removing itself from the call stack.

This means that any gem that provides middleware tends to show very high percentages in the gem analysis:

Technically, 98.18% of the time is spent inside rack-mini-profiler, simply because rack-mini-profiler introduces its middleware early in the chain. It’s not a sign of slowness.

On the other hand, 4.10% of samples in i18n

are a strong indicator that localization is taking around 4% of the time, because the i18n gem has no Rack middleware and its frames are close to the top of the trace.

Overall, you should distinguish between:

  1. “Middleware” gems like logster, message_bus, omniauth
  2. Gems that coordinate the work of other gems, like AM serializer (which triggers a lot of Active Record work)
  3. Gems that live at the top of the stack, like “arel”, “redis” and so on.
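The middleware point above can be sketched in plain Ruby (the middleware and app names here are invented for illustration): each middleware’s `call` invokes the next app and only returns once the whole request is done, so every middleware frame stays on the stack for the entire request.

```ruby
# Minimal Rack-style middleware: note that @app.call runs while this
# frame is still on the stack, which is why middleware gems dominate
# stack samples without actually being slow.
class TimingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env) # the rest of the app runs inside this frame
    headers["X-Runtime"] = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start).to_s
    [status, headers, body]
  end
end

# The innermost app reports how deep the stack is when it finally runs.
inner_app = ->(env) { [200, {}, ["depth: #{caller.length}"]] }

# Two layers of middleware: both of their call frames are live while
# inner_app executes, so a sampling profiler attributes them all the time.
app = TimingMiddleware.new(TimingMiddleware.new(inner_app))
status, headers, body = app.call({})
```

Every sample taken while `inner_app` runs will contain both `TimingMiddleware#call` frames, even though the middleware itself does almost no work.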

Zoom in LOTS

Zooming into the flame graph is critical; it reveals what is going on:

So here I can tell that we spend a fair amount of time in our method prioritize_pinned_topics; the vast majority of that time is not spent running the SQL, but in Arel :unicorn:s creating the SQL and Active Record :unicorn:s taking the results and turning them into objects.
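That materialization cost generalizes beyond Active Record. As a rough plain-Ruby sketch (no database involved; the `Row` struct is an invented stand-in for a model), turning raw rows into full objects usually costs noticeably more than handing back the raw data:

```ruby
require "benchmark"

# Hypothetical stand-in for a database result set: 50k raw rows.
rows = Array.new(50_000) { |i| [i, "title #{i}", i.even?] }

# Stand-in for an Active Record model: each row becomes a full object.
Row = Struct.new(:id, :title, :pinned)

# Raw access (pluck-style) vs. materializing an object per row.
raw_time = Benchmark.realtime { rows.map { |r| r[0] } }
obj_time = Benchmark.realtime { rows.map { |r| Row.new(*r).id } }

# Object materialization is typically several times slower than raw access.
puts format("raw: %.4fs  objects: %.4fs", raw_time, obj_time)
```

This is why flame graphs for Active Record-heavy pages often show the query execution itself as a thin sliver next to the SQL-building and instantiation frames.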

Highlight frames to see the relative cost

Here we can see that converting our Active Record relations to SQL takes 8.2% of our total page cost.
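What the highlighter computes can be sketched in a few lines (the stacks below are invented for illustration): a sampling profiler’s output is just a list of stacks, and a frame’s relative cost is the share of samples whose stack contains it.

```ruby
# Invented sample data: each entry is one sampled call stack.
samples = [
  %w[rack router controller relation to_sql],
  %w[rack router controller relation to_sql],
  %w[rack router controller relation exec_query],
  %w[rack router controller serializer],
  %w[rack router controller relation instantiate],
]

# Share of samples whose stack contains the given frame, as a percentage.
def relative_cost(samples, frame)
  (samples.count { |stack| stack.include?(frame) } * 100.0 / samples.size).round(1)
end

puts relative_cost(samples, "to_sql") # 2 of 5 samples → 40.0
puts relative_cost(samples, "rack")   # middleware frame, in every sample → 100.0
```

Note how the `rack` frame scores 100% here for exactly the reason discussed earlier: it is present in every stack, not because it is slow.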

Run on slow machines

The flamegraph illustrated here is for the Discourse front page; it was run on a VERY fast server, and the faster the server, the less detail you will get. The same front page on a MUCH smaller site generates significantly more traces on a typical Digital Ocean droplet.

Notice how the Digital Ocean trace offers a lot more fidelity than the “fast server” trace. We can see time spent in “instrumentation”, “sanitizing sql arrays” and more. These frames are invisible in the trace from the faster server.

So remember, the more frames you can get the better you can debug performance!

Run flame graphs multiple times

Perhaps your current result is somehow an outlier? Run it a few times to see if the “shape” is stable, so it is clear you are looking at typical performance and not an outlier.
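One way to compare the “shape” across runs is to reduce each run to per-frame sample shares and eyeball how far each run diverges (the frame names and counts below are invented for illustration):

```ruby
# Invented data: per-run sample counts for a few frames of interest.
runs = [
  { "to_sql" => 40, "exec_query" => 30, "instantiate" => 30 },
  { "to_sql" => 42, "exec_query" => 28, "instantiate" => 30 },
  { "to_sql" => 10, "exec_query" => 80, "instantiate" => 10 }, # likely outlier
]

# Normalize a run to percentages so runs of different lengths compare fairly.
def shares(run)
  total = run.values.sum.to_f
  run.transform_values { |count| (count / total * 100).round }
end

shapes = runs.map { |run| shares(run) }
# A run whose shares diverge sharply from the others (like the third one
# here) is probably an outlier and worth discarding or re-running.
```

If the shares are stable across runs, you can trust that the flamegraph reflects typical performance.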

Feel free to reply with more tips, this is a rough draft :slight_smile:

How much of a benefit would the middleware rearchitecting that tenderlove talked about last year (middleware events, etc.) bring to Rack? The flamegraphs are always huge because the Rack/Rails stack is always huge, and distinguishing between the different frames is still a bummer. I hope adoption is fast so that we can really get a huge benefit from flamegraphs.