Profiling the FreeBSD kernel with DTrace


I'm looking to improve interface destruction time with FreeBSD. Destroying thousands of interfaces takes several minutes on my test machine running -CURRENT, and while -- admittedly -- my use case may be an unusual one, I'd like to understand what's taking the system so long.

From my initial observations, I was able to establish that most of the time is spent waiting somewhere inside if_detach_internal(). So in an attempt to profile this function, I came up with the following DTrace script:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option dynvarsize=256m

fbt:kernel:if_detach_internal:entry
{
        self->traceme = 1;
        t[probefunc] = timestamp;
}

fbt:kernel:if_detach_internal:return
{
        dt = timestamp - t[probefunc];
        @ft[probefunc] = sum(dt);
        t[probefunc] = 0;
        self->traceme = 0;
}

fbt:kernel::entry
/self->traceme/
{
        t[probefunc] = timestamp;
}

fbt:kernel::return
/self->traceme/
{
        dt = timestamp - t[probefunc];
        @ft[probefunc] = sum(dt);
        t[probefunc] = 0;
}

By hooking into the entry and return fbt probes, I'm expecting to get a list of function names and cumulative execution times for every function called by if_detach_internal() (no matter the stack depth), and to filter out everything else.

What I'm getting, however, looks like this (destroying 250 interfaces):

  callout_when                                                   1676
  sched_load                                                     1779
  if_rele                                                        1801
[...]
  rt_unlinkrte                                            10296062843
  sched_switch                                            10408456866
  rt_checkdelroute                                        11562396547
  rn_walktree                                             12404143265
  rib_walk_del                                            12553013469
  if_detach_internal                                      24335505097
  uma_zfree_arg                                        25045046322788
  intr_event_schedule_thread                           58336370701120
  swi_sched                                            83355263713937
  spinlock_enter                                      116681093870088
[...]
  spinlock_exit                                      4492719328120735
  cpu_search_lowest                                 16750701670277714

Timing information for at least some of the functions seems to make sense, but I would expect if_detach_internal() to be the last entry in the list, with nothing taking longer than that, since this function is at the top of the call tree I'm trying to profile.

Clearly, that is not the case, as I'm also getting measurements for other functions (uma_zfree_arg(), swi_sched(), etc.) with seemingly crazy execution times. These results completely annihilate my trust in everything else DTrace tells me here.

What am I missing? Is this approach sound at all?


There are 2 answers

Answer from pmdj:

I'll prefix my comments with the fact that I've not used DTrace on FreeBSD, only on macOS/OS X. So there might be something platform-specific at play here that I'm not aware of. With that out of the way:

  • I'm a little uneasy about your use of the global associative array t. You might want to make that thread-local (self->t), because as it stands, your code can produce junk results if if_detach_internal is called from multiple threads simultaneously.
  • Your use of the global dt variable is similarly dangerous and thread-unsafe. This really should be this->dt everywhere (a clause-local variable).
  • Another thing to be aware of, though it shouldn't cause problems in your code as it stands right now, is that the wildcard action fbt:kernel::entry /self->traceme/ will also be invoked for if_detach_internal itself. That function of course matches the wildcard, and clauses are executed in the order in which they appear in the script, so by the time the predicate on the wildcard entry clause is checked, the non-wildcard clause will already have set self->traceme = 1. Setting the timestamp twice like this should cause no ill effects, but judging by the way the code is written, you may not have realized that this is what it does, which could trip you up if you make further changes down the line.

Unfortunately, DTrace scoping rules are rather unintuitive, in that everything is global and thread-unsafe by default. And yes, this still bites me every now and then, even after having written a fair amount of DTrace script code.
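
Putting the first two points into practice, a rough sketch of the script with thread-local and clause-local variables might look like the following. I haven't tested this on FreeBSD, so treat it as a starting point rather than a drop-in replacement; the extra self->t[probefunc] checks on the return clauses are an additional precaution of mine, so that a return probe with no matching recorded entry doesn't produce a bogus delta.

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option dynvarsize=256m

fbt:kernel:if_detach_internal:entry
{
        /* Thread-local flag and timestamp array: safe even if several
           threads run if_detach_internal() concurrently. */
        self->traceme = 1;
        self->t[probefunc] = timestamp;
}

fbt:kernel:if_detach_internal:return
/self->t[probefunc]/
{
        /* Clause-local delta: private to this probe firing. */
        this->dt = timestamp - self->t[probefunc];
        @ft[probefunc] = sum(this->dt);
        self->t[probefunc] = 0;
        self->traceme = 0;
}

fbt:kernel::entry
/self->traceme/
{
        self->t[probefunc] = timestamp;
}

fbt:kernel::return
/self->traceme && self->t[probefunc]/
{
        this->dt = timestamp - self->t[probefunc];
        @ft[probefunc] = sum(this->dt);
        self->t[probefunc] = 0;
}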

I don't know if following the above advice will fix your problem entirely; if not, please update your question accordingly and drop me a comment below and I'll take another look.

Answer from Andrew Henle:

This is another variation of a really simple but extremely useful DTrace script that I've often used to find out where any kernel is actually spending most of its time:

#!/usr/sbin/dtrace -s

/*
 * Sample at 1001 Hz on every CPU.  arg0 is the kernel PC at the time of
 * the sample, so the predicate skips samples taken in user mode.
 */
profile:::profile-1001hz
/arg0/
{
    @[ stack() ] = count();    /* count samples per kernel stack trace */
}

That samples the kernel's stack traces, and when the script exits (via CTRL-C or some other method) it will print something like this:

             .
             .
             .
          unix`z_compress_level+0x9a
          zfs`zfs_gzip_compress+0x4e
          zfs`zfs_compress_data+0x8c
          zfs`zio_compress+0x9f
          zfs`zio_write_bp_init+0x2b4
          zfs`zio_execute+0xc2
          genunix`taskq_thread+0x3ad
          unix`thread_start+0x8
          703

          unix`deflate_slow+0x8a
          unix`z_deflate+0x75a
          unix`z_compress_level+0x9a
          zfs`zfs_gzip_compress+0x4e
          zfs`zfs_compress_data+0x8c
          zfs`zio_compress+0x9f
          zfs`zio_write_bp_init+0x2b4
          zfs`zio_execute+0xc2
          genunix`taskq_thread+0x3ad
          unix`thread_start+0x8
         1708

          unix`i86_mwait+0xd
          unix`cpu_idle_mwait+0x1f3
          unix`idle+0x111
          unix`thread_start+0x8
        86200

That's an example set of stack traces and the number of times that stack trace was sampled. Note that it prints the most frequent stack traces last.

So you can immediately see the stack trace(s) most frequently sampled - which is going to be where the kernel is spending a lot of time.

Note also that the stack traces are printed in what you may think is reverse order - the outer, topmost call is printed last.
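
If you want to narrow the sampling down to the code path you care about, you can combine this with an fbt-based thread-local flag so that stacks are only collected while the current thread is inside if_detach_internal(). The following is an untested sketch along those lines; the tick-60s exit and the trunc() of the aggregation are just conveniences and can be dropped:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Mark the current thread while it is inside if_detach_internal(). */
fbt:kernel:if_detach_internal:entry
{
        self->in_detach = 1;
}

fbt:kernel:if_detach_internal:return
{
        self->in_detach = 0;
}

/* Sample kernel stacks, but only for threads flagged above. */
profile:::profile-1001hz
/arg0 && self->in_detach/
{
        @[ stack() ] = count();
}

/* Stop after 60 seconds and keep only the 20 hottest stacks. */
tick-60s
{
        exit(0);
}

END
{
        trunc(@, 20);
}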