Use Dtrace to locate the source of unstable performance (Solaris 11)
I was running a micro benchmark. The run takes about 10 minutes, with warm up and stable stages(2 minutes). The score varies 50-100% from run to run. When the score is good, the gc activities is 2 times of when the score is low. prstat -mL -p <pid> shows not much lock activities.
The hardware is AMD Opteron (Abu Dhabi) 2 socket 16-core/socket.
# /usr/bin/lgrpinfo
lgroup 0 (root):
Children: 1-4
CPUs: 0-31
Memory: installed 48G, allocated 9.6G, free 38G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 16
lgroup 1 (leaf):
Children: none, Parent: 0
CPUs: 0-7
Memory: installed 16G, allocated 2.3G, free 14G
Lgroup resources: 1 (CPU); 1 (memory)
Load: 0.122
Latency: 10
lgroup 2 (leaf):
Children: none, Parent: 0
CPUs: 8-15
Memory: installed 8.0G, allocated 1.9G, free 6.1G
Lgroup resources: 2 (CPU); 2 (memory)
Load: 1.53e-05
Latency: 10
lgroup 3 (leaf):
Children: none, Parent: 0
CPUs: 16-23
Memory: installed 16G, allocated 2.1G, free 14G
Lgroup resources: 3 (CPU); 3 (memory)
Load: 1.53e-05
Latency: 10
lgroup 4 (leaf):
Children: none, Parent: 0
CPUs: 24-31
Memory: installed 8.0G, allocated 3.3G, free 4.7G
Lgroup resources: 4 (CPU); 4 (memory)
Load: 0.119
Latency: 10
lgroup 0 (root):
Children: 1-4
CPUs: 0-31
Memory: installed 48G, allocated 9.6G, free 38G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 16
lgroup 1 (leaf):
Children: none, Parent: 0
CPUs: 0-7
Memory: installed 16G, allocated 2.3G, free 14G
Lgroup resources: 1 (CPU); 1 (memory)
Load: 0.122
Latency: 10
lgroup 2 (leaf):
Children: none, Parent: 0
CPUs: 8-15
Memory: installed 8.0G, allocated 1.9G, free 6.1G
Lgroup resources: 2 (CPU); 2 (memory)
Load: 1.53e-05
Latency: 10
lgroup 3 (leaf):
Children: none, Parent: 0
CPUs: 16-23
Memory: installed 16G, allocated 2.1G, free 14G
Lgroup resources: 3 (CPU); 3 (memory)
Load: 1.53e-05
Latency: 10
lgroup 4 (leaf):
Children: none, Parent: 0
CPUs: 24-31
Memory: installed 8.0G, allocated 3.3G, free 4.7G
Lgroup resources: 4 (CPU); 4 (memory)
Load: 0.119
Latency: 10
# /usr/bin/plgrp -a all 7044
PID/LWPID HOME AFFINITY
7044/1 2 0-4/none
7044/2 3 0-4/none
7044/3 4 0-4/none
7044/4 3 0-4/none
7044/5 1 0-4/none
7044/6 2 0-4/none
7044/7 4 0-4/none
7044/8 3 0-4/none
7044/9 1 0-4/none
7044/10 2 0-4/none
7044/11 4 0-4/none
7044/12 3 0-4/none
7044/13 1 0-4/none
7044/14 2 0-4/none
7044/15 4 0-4/none
7044/16 3 0-4/none
7044/17 1 0-4/none
7044/18 2 0-4/none
7044/19 4 0-4/none
7044/20 3 0-4/none
7044/21 1 0-4/none
7044/22 2 0-4/none
7044/23 4 0-4/none
7044/24 3 0-4/none
7044/25 1 0-4/none
7044/26 2 0-4/none
7044/27 4 0-4/none
7044/28 3 0-4/none
7044/29 1 0-4/none
7044/30 2 0-4/none
7044/31 4 0-4/none
7044/32 3 0-4/none
7044/33 1 0-4/none
7044/34 2 0-4/none
7044/35 4 0-4/none
7044/36 3 0-4/none
7044/37 1 0-4/none
PID/LWPID HOME AFFINITY
7044/1 2 0-4/none
7044/2 3 0-4/none
7044/3 4 0-4/none
7044/4 3 0-4/none
7044/5 1 0-4/none
7044/6 2 0-4/none
7044/7 4 0-4/none
7044/8 3 0-4/none
7044/9 1 0-4/none
7044/10 2 0-4/none
7044/11 4 0-4/none
7044/12 3 0-4/none
7044/13 1 0-4/none
7044/14 2 0-4/none
7044/15 4 0-4/none
7044/16 3 0-4/none
7044/17 1 0-4/none
7044/18 2 0-4/none
7044/19 4 0-4/none
7044/20 3 0-4/none
7044/21 1 0-4/none
7044/22 2 0-4/none
7044/23 4 0-4/none
7044/24 3 0-4/none
7044/25 1 0-4/none
7044/26 2 0-4/none
7044/27 4 0-4/none
7044/28 3 0-4/none
7044/29 1 0-4/none
7044/30 2 0-4/none
7044/31 4 0-4/none
7044/32 3 0-4/none
7044/33 1 0-4/none
7044/34 2 0-4/none
7044/35 4 0-4/none
7044/36 3 0-4/none
7044/37 1 0-4/none
The process is not bind to lgroup.
The dtrace script (from http://hub.opensolaris.org/bin/view/Community+Group+performance/dtrace)
syscall::fork*:return
/
progenyof($1) && pid != $1
/
{
child_trace[pid] = 1;
}
syscall::vfork:return
/
progenyof($1) && pid != $1
/
{
child_trace[pid] = 1;
}
sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp)
/
{
self->remotetime[curthread] = timestamp;
}
sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp) &&
self->remotetime[curthread] != 0
/
{
@lgrplat[curthread] = sum(timestamp - self->remotetime[curthread]);
self->remotetime[curthread] = 0;
}
sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp)
/
{
self->localtime[curthread] = timestamp;
}
sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp) &&
self->localtime[curthread] != 0
/
{
@lgrphome[curthread] = sum(timestamp - self->localtime[curthread]);
self->localtime[curthread] = 0;
}
END
{
printa("\nThread 0x%p ran locally for : %@u nsecs\n", @lgrphome);
printa("\nThread 0x%p ran remotely for : %@u nsecs \n", @lgrplat);
}
/
progenyof($1) && pid != $1
/
{
child_trace[pid] = 1;
}
syscall::vfork:return
/
progenyof($1) && pid != $1
/
{
child_trace[pid] = 1;
}
sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp)
/
{
self->remotetime[curthread] = timestamp;
}
sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp) &&
self->remotetime[curthread] != 0
/
{
@lgrplat[curthread] = sum(timestamp - self->remotetime[curthread]);
self->remotetime[curthread] = 0;
}
sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp)
/
{
self->localtime[curthread] = timestamp;
}
sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp) &&
self->localtime[curthread] != 0
/
{
@lgrphome[curthread] = sum(timestamp - self->localtime[curthread]);
self->localtime[curthread] = 0;
}
END
{
printa("\nThread 0x%p ran locally for : %@u nsecs\n", @lgrphome);
printa("\nThread 0x%p ran remotely for : %@u nsecs \n", @lgrplat);
}
Shows the location of the threads causes the variations.
To verify the theory, I created psrset and bind the process:
psrset -c 1 16-23
start the jvm, and get the pid, bind the process to processor set 1
psrset -b 1 <pid>
or if the user has the privilege, I can bind the process when starting jvm
psrset -e 1 java ...
With this binding, there is very little variation in the test scores. With java parameter -XX:+UseNUMA, the variation is reduced to 10%. Though not as little as when binding it manually.
As an ordinary user, I do not have psrset privilege.
-bash-4.1$ ppriv $$
14460: -bash
flags = <none>
E: basic
I: basic
P: basic
L: all
14460: -bash
flags = <none>
E: basic
I: basic
P: basic
L: all
When I execute psrset, I got error "not owner".
http://www.c0t0d0s0.org/archives/4075-Less-known-Solaris-features-RBAC-and-Privileges-Part-3-Privileges.html gives good details of role/privilege management. But for my purpose, as root, I added
sys_res_config to the users default privilege.
# usermod -K defaultpriv=basic,sys_res_config <username>
No comments:
Post a Comment