Wednesday, September 26, 2012

Use Dtrace to locate the source of unstable performance (Solaris 11)

I was running a micro benchmark.  The run takes about 10 minutes, with warm up and stable stages(2 minutes).  The score varies 50-100% from run to run.  When the score is good, the gc activities is 2 times of when the score is low.  prstat -mL -p <pid> shows not much lock activities.

The hardware is AMD Opteron (Abu Dhabi) 2 socket 16-core/socket. 
# /usr/bin/lgrpinfo
lgroup 0 (root):
        Children: 1-4
        CPUs: 0-31
        Memory: installed 48G, allocated 9.6G, free 38G
        Lgroup resources: 1-4 (CPU); 1-4 (memory)
        Latency: 16
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-7
        Memory: installed 16G, allocated 2.3G, free 14G
        Lgroup resources: 1 (CPU); 1 (memory)
        Load: 0.122
        Latency: 10
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 8-15
        Memory: installed 8.0G, allocated 1.9G, free 6.1G
        Lgroup resources: 2 (CPU); 2 (memory)
        Load: 1.53e-05
        Latency: 10
lgroup 3 (leaf):
        Children: none, Parent: 0
        CPUs: 16-23
        Memory: installed 16G, allocated 2.1G, free 14G
        Lgroup resources: 3 (CPU); 3 (memory)
        Load: 1.53e-05
        Latency: 10
lgroup 4 (leaf):
        Children: none, Parent: 0
        CPUs: 24-31
        Memory: installed 8.0G, allocated 3.3G, free 4.7G
        Lgroup resources: 4 (CPU); 4 (memory)
        Load: 0.119
        Latency: 10

# /usr/bin/plgrp -a all 7044
     PID/LWPID    HOME       AFFINITY
    7044/1        2     0-4/none
    7044/2        3     0-4/none
    7044/3        4     0-4/none
    7044/4        3     0-4/none
    7044/5        1     0-4/none
    7044/6        2     0-4/none
    7044/7        4     0-4/none
    7044/8        3     0-4/none
    7044/9        1     0-4/none
    7044/10       2     0-4/none
    7044/11       4     0-4/none
    7044/12       3     0-4/none
    7044/13       1     0-4/none
    7044/14       2     0-4/none
    7044/15       4     0-4/none
    7044/16       3     0-4/none
    7044/17       1     0-4/none
    7044/18       2     0-4/none
    7044/19       4     0-4/none
    7044/20       3     0-4/none
    7044/21       1     0-4/none
    7044/22       2     0-4/none
    7044/23       4     0-4/none
    7044/24       3     0-4/none
    7044/25       1     0-4/none
    7044/26       2     0-4/none
    7044/27       4     0-4/none
    7044/28       3     0-4/none
    7044/29       1     0-4/none
    7044/30       2     0-4/none
    7044/31       4     0-4/none
    7044/32       3     0-4/none
    7044/33       1     0-4/none
    7044/34       2     0-4/none
    7044/35       4     0-4/none
    7044/36       3     0-4/none
    7044/37       1     0-4/none
The process is not bind to lgroup.  
The dtrace script (from http://hub.opensolaris.org/bin/view/Community+Group+performance/dtrace)
syscall::fork*:return
/
progenyof($1) && pid != $1
/
{
        child_trace[pid] = 1;
}

syscall::vfork:return
/
progenyof($1) && pid != $1
/
{
        child_trace[pid] = 1;
}

sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp)
/
{
        self->remotetime[curthread] = timestamp;
}

sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp) &&
self->remotetime[curthread] != 0
/
{
        @lgrplat[curthread] = sum(timestamp - self->remotetime[curthread]);
        self->remotetime[curthread] = 0;
}

sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp)
/
{
        self->localtime[curthread] = timestamp;
}

sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp) &&
self->localtime[curthread] != 0
/
{
        @lgrphome[curthread] = sum(timestamp - self->localtime[curthread]);
        self->localtime[curthread] = 0;
}

END
{
        printa("\nThread 0x%p ran locally for : %@u nsecs\n", @lgrphome);
        printa("\nThread 0x%p ran remotely for : %@u nsecs \n", @lgrplat);
}

Shows the location of the threads causes the variations.
To verify the theory, I created psrset and bind the process:
psrset -c 1 16-23
start the jvm, and get the pid, bind the process to processor set 1
psrset -b 1 <pid>
or if the user has the privilege, I can bind the process when starting jvm
psrset -e 1 java ...
With this binding, there is very little variation in the test scores.  With java parameter -XX:+UseNUMA, the variation is reduced to 10%.  Though not as little as when binding it manually.

As an ordinary user, I do not have psrset privilege.  
-bash-4.1$  ppriv $$
14460:  -bash
flags = <none>
        E: basic
        I: basic
        P: basic
        L: all
When I execute psrset, I got error "not owner".
http://www.c0t0d0s0.org/archives/4075-Less-known-Solaris-features-RBAC-and-Privileges-Part-3-Privileges.html gives good details of role/privilege management. But for my purpose, as root, I added
sys_res_config to the users default privilege.

# usermod -K defaultpriv=basic,sys_res_config <username>