Wednesday, September 26, 2012

Use Dtrace to locate the source of unstable performance (Solaris 11)

I was running a micro benchmark.  The run takes about 10 minutes, with warm up and stable stages(2 minutes).  The score varies 50-100% from run to run.  When the score is good, the gc activities is 2 times of when the score is low.  prstat -mL -p <pid> shows not much lock activities.

The hardware is AMD Opteron (Abu Dhabi) 2 socket 16-core/socket. 
# /usr/bin/lgrpinfo
lgroup 0 (root):
        Children: 1-4
        CPUs: 0-31
        Memory: installed 48G, allocated 9.6G, free 38G
        Lgroup resources: 1-4 (CPU); 1-4 (memory)
        Latency: 16
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-7
        Memory: installed 16G, allocated 2.3G, free 14G
        Lgroup resources: 1 (CPU); 1 (memory)
        Load: 0.122
        Latency: 10
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 8-15
        Memory: installed 8.0G, allocated 1.9G, free 6.1G
        Lgroup resources: 2 (CPU); 2 (memory)
        Load: 1.53e-05
        Latency: 10
lgroup 3 (leaf):
        Children: none, Parent: 0
        CPUs: 16-23
        Memory: installed 16G, allocated 2.1G, free 14G
        Lgroup resources: 3 (CPU); 3 (memory)
        Load: 1.53e-05
        Latency: 10
lgroup 4 (leaf):
        Children: none, Parent: 0
        CPUs: 24-31
        Memory: installed 8.0G, allocated 3.3G, free 4.7G
        Lgroup resources: 4 (CPU); 4 (memory)
        Load: 0.119
        Latency: 10

# /usr/bin/plgrp -a all 7044
     PID/LWPID    HOME       AFFINITY
    7044/1        2     0-4/none
    7044/2        3     0-4/none
    7044/3        4     0-4/none
    7044/4        3     0-4/none
    7044/5        1     0-4/none
    7044/6        2     0-4/none
    7044/7        4     0-4/none
    7044/8        3     0-4/none
    7044/9        1     0-4/none
    7044/10       2     0-4/none
    7044/11       4     0-4/none
    7044/12       3     0-4/none
    7044/13       1     0-4/none
    7044/14       2     0-4/none
    7044/15       4     0-4/none
    7044/16       3     0-4/none
    7044/17       1     0-4/none
    7044/18       2     0-4/none
    7044/19       4     0-4/none
    7044/20       3     0-4/none
    7044/21       1     0-4/none
    7044/22       2     0-4/none
    7044/23       4     0-4/none
    7044/24       3     0-4/none
    7044/25       1     0-4/none
    7044/26       2     0-4/none
    7044/27       4     0-4/none
    7044/28       3     0-4/none
    7044/29       1     0-4/none
    7044/30       2     0-4/none
    7044/31       4     0-4/none
    7044/32       3     0-4/none
    7044/33       1     0-4/none
    7044/34       2     0-4/none
    7044/35       4     0-4/none
    7044/36       3     0-4/none
    7044/37       1     0-4/none
The process is not bind to lgroup.  
The dtrace script (from http://hub.opensolaris.org/bin/view/Community+Group+performance/dtrace)
syscall::fork*:return
/
progenyof($1) && pid != $1
/
{
        child_trace[pid] = 1;
}

syscall::vfork:return
/
progenyof($1) && pid != $1
/
{
        child_trace[pid] = 1;
}

sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp)
/
{
        self->remotetime[curthread] = timestamp;
}

sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp) &&
self->remotetime[curthread] != 0
/
{
        @lgrplat[curthread] = sum(timestamp - self->remotetime[curthread]);
        self->remotetime[curthread] = 0;
}

sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp)
/
{
        self->localtime[curthread] = timestamp;
}

sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp) &&
self->localtime[curthread] != 0
/
{
        @lgrphome[curthread] = sum(timestamp - self->localtime[curthread]);
        self->localtime[curthread] = 0;
}

END
{
        printa("\nThread 0x%p ran locally for : %@u nsecs\n", @lgrphome);
        printa("\nThread 0x%p ran remotely for : %@u nsecs \n", @lgrplat);
}

Shows the location of the threads causes the variations.
To verify the theory, I created psrset and bind the process:
psrset -c 1 16-23
start the jvm, and get the pid, bind the process to processor set 1
psrset -b 1 <pid>
or if the user has the privilege, I can bind the process when starting jvm
psrset -e 1 java ...
With this binding, there is very little variation in the test scores.  With java parameter -XX:+UseNUMA, the variation is reduced to 10%.  Though not as little as when binding it manually.

As an ordinary user, I do not have psrset privilege.  
-bash-4.1$  ppriv $$
14460:  -bash
flags = <none>
        E: basic
        I: basic
        P: basic
        L: all
When I execute psrset, I got error "not owner".
http://www.c0t0d0s0.org/archives/4075-Less-known-Solaris-features-RBAC-and-Privileges-Part-3-Privileges.html gives good details of role/privilege management. But for my purpose, as root, I added
sys_res_config to the users default privilege.

# usermod -K defaultpriv=basic,sys_res_config <username>

Monday, August 20, 2012

How to get hardware layout on Linux

Sometimes, you need to know the hardware layout for interrupt handling assignment, memory access, etc.

You can use dmidecode along cat /proc/XXX commands for find out the hardware layout.

http://www.thegeekstuff.com/2008/11/how-to-get-hardware-information-on-linux-using-dmidecode-command/ give a good descriptions about demidecode:

"dmidecode command reads the system DMI table to display hardware and BIOS information of the server. Apart from getting current configuration of the system, you can also get information about maximum supported configuration of the system using dmidecode. For example, dmidecode gives both the current RAM on the system and the maximum RAM supported by the system."

Here I will show an example how to decide the cpu threads and the layout:

[root@sc14ia06 nopermgen]# dmidecode -t 4
# dmidecode 2.10
SMBIOS 2.6 present.

Handle 0x0004, DMI type 4, 42 bytes
Processor Information
        Socket Designation: CPU 1
        Type: Central Processor
        Family: Xeon
        Manufacturer: Intel           
        ID: D2 06 02 00 FF FB EB BF
        Signature: Type 0, Family 6, Model 45, Stepping 2
        Flags:
                FPU (Floating-point unit on-chip)
                VME (Virtual mode extension)
                DE (Debugging extension)
                PSE (Page size extension)
                TSC (Time stamp counter)
                MSR (Model specific registers)
                PAE (Physical address extension)
                MCE (Machine check exception)
                CX8 (CMPXCHG8 instruction supported)
                APIC (On-chip APIC hardware supported)
                SEP (Fast system call)
                MTRR (Memory type range registers)
                PGE (Page global enable)
                MCA (Machine check architecture)
                CMOV (Conditional move instruction supported)
                PAT (Page attribute table)
                PSE-36 (36-bit page size extension)
                CLFSH (CLFLUSH instruction supported)
                DS (Debug store)
                ACPI (ACPI supported)
                MMX (MMX technology supported)
                FXSR (Fast floating-point save and restore)
                SSE (Streaming SIMD extensions)
                SSE2 (Streaming SIMD extensions 2)
                SS (Self-snoop)
                HTT (Hyper-threading technology)
                TM (Thermal monitor supported)
                PBE (Pending break enabled)
        Version: Genuine Intel(R) CPU  @ 2.30GHz               
        Voltage: 1.0 V
        External Clock: 100 MHz
        Max Speed: 4000 MHz
        Current Speed: 2300 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: 0x0005
        L2 Cache Handle: 0x0006
        L3 Cache Handle: 0x0007
        Serial Number: To Be Filled By O.E.M.
        Asset Tag: To Be Filled By O.E.M.
        Part Number: To Be Filled By O.E.M.
        Core Count: 8
        Core Enabled: 8
        Thread Count: 16
        Characteristics:
                64-bit capable

Handle 0x0008, DMI type 4, 42 bytes
Processor Information
        Socket Designation: CPU 2
        Type: Central Processor
        Family: Xeon
        Manufacturer: Intel           
        ID: D2 06 02 00 FF FB EB BF
        Signature: Type 0, Family 6, Model 45, Stepping 2
        Flags:
                FPU (Floating-point unit on-chip)
                VME (Virtual mode extension)
                DE (Debugging extension)
                PSE (Page size extension)
                TSC (Time stamp counter)
                MSR (Model specific registers)
                PAE (Physical address extension)
                MCE (Machine check exception)
                CX8 (CMPXCHG8 instruction supported)
                APIC (On-chip APIC hardware supported)
                SEP (Fast system call)
                MTRR (Memory type range registers)
                PGE (Page global enable)
                MCA (Machine check architecture)
                CMOV (Conditional move instruction supported)
                PAT (Page attribute table)
                PSE-36 (36-bit page size extension)
                CLFSH (CLFLUSH instruction supported)
                DS (Debug store)
                ACPI (ACPI supported)
                MMX (MMX technology supported)
                FXSR (Fast floating-point save and restore)
                SSE (Streaming SIMD extensions)
                SSE2 (Streaming SIMD extensions 2)
                SS (Self-snoop)
                HTT (Hyper-threading technology)
                TM (Thermal monitor supported)
                PBE (Pending break enabled)
        Version: Genuine Intel(R) CPU  @ 2.30GHz               
        Voltage: 1.0 V
        External Clock: 100 MHz
        Max Speed: 4000 MHz
        Current Speed: 2300 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: 0x0009
        L2 Cache Handle: 0x000A
        L3 Cache Handle: 0x000B
        Serial Number: To Be Filled By O.E.M.
        Asset Tag: To Be Filled By O.E.M.
        Part Number: To Be Filled By O.E.M.
        Core Count: 8
        Core Enabled: 8
        Thread Count: 16
        Characteristics:
                64-bit capable

It shows 2-socket cpu, 8-core per socket, and hyper threading enabled(core count:8, thread count: 16)

 cat /proc/cpuinfo |egrep 'processor|physical id|core id'
processor       : 0
physical id     : 0
core id         : 0
processor       : 1
physical id     : 0
core id         : 1
processor       : 2
physical id     : 0
core id         : 2
processor       : 3
physical id     : 0
core id         : 3
processor       : 4
physical id     : 0
core id         : 4
processor       : 5
physical id     : 0
core id         : 5
processor       : 6
physical id     : 0
core id         : 6
processor       : 7
physical id     : 0
core id         : 7
processor       : 8
physical id     : 1
core id         : 0
processor       : 9
physical id     : 1
core id         : 1
processor       : 10
physical id     : 1
core id         : 2
processor       : 11
physical id     : 1
core id         : 3
processor       : 12
physical id     : 1
core id         : 4
processor       : 13
physical id     : 1
core id         : 5
processor       : 14
physical id     : 1
core id         : 6
processor       : 15
physical id     : 1
core id         : 7
processor       : 16
physical id     : 0
core id         : 0
processor       : 17
physical id     : 0
core id         : 1
processor       : 18
physical id     : 0
core id         : 2
processor       : 19
physical id     : 0
core id         : 3
processor       : 20
physical id     : 0
core id         : 4
processor       : 21
physical id     : 0
core id         : 5
processor       : 22
physical id     : 0
core id         : 6
processor       : 23
physical id     : 0
core id         : 7
processor       : 24
physical id     : 1
core id         : 0
processor       : 25
physical id     : 1
core id         : 1
processor       : 26
physical id     : 1
core id         : 2
processor       : 27
physical id     : 1
core id         : 3
processor       : 28
physical id     : 1
core id         : 4
processor       : 29
physical id     : 1
core id         : 5
processor       : 30
physical id     : 1
core id         : 6
processor       : 31
physical id     : 1
core id         : 7

This shows there are 32 logic processors. 
core id: the processor id
physical id: the socket id
core id: the core id

To put in a picture, the processor mapping is:
Another way to get this information is:
cd /sys/devices/system/cpu
cat cpu0/topology/thread_siblings_list