Thursday, April 17, 2014

Understanding the G1 GC log

Basic flags: -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

G1 prints out various logs, depending on the flags you use.  The basic ones are:
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails
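
For example, a hypothetical launch that sends this output to its own file (MyApp and gc.log are placeholders, not from the original post) might look like:

java -XX:+UseG1GC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:gc.log MyApp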

Log example:

39.570: [GC pause (G1 Evacuation Pause) (young), 0.2028160 secs]
   [Parallel Time: 149.2 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 39571.0, Avg: 39571.1, Max: 39571.2, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 0.4, Avg: 0.5, Max: 0.6, Diff: 0.2, Sum: 4.0]
      [Update RS (ms): Min: 5.1, Avg: 6.6, Max: 11.7, Diff: 6.6, Sum: 52.7]
         [Processed Buffers: Min: 1, Avg: 1.5, Max: 4, Diff: 3, Sum: 12]
      [Scan RS (ms): Min: 11.0, Avg: 14.8, Max: 17.8, Diff: 6.9, Sum: 118.7]
      [Code Root Scanning (ms): Min: 2.3, Avg: 3.7, Max: 4.7, Diff: 2.4, Sum: 29.4]
      [Object Copy (ms): Min: 121.1, Avg: 122.0, Max: 122.9, Diff: 1.8, Sum: 975.6]
      [Termination (ms): Min: 0.0, Avg: 0.5, Max: 0.6, Diff: 0.6, Sum: 4.0]
      [GC Worker Other (ms): Min: 0.1, Avg: 0.5, Max: 0.9, Diff: 0.8, Sum: 4.0]
      [GC Worker Total (ms): Min: 148.1, Avg: 148.6, Max: 148.9, Diff: 0.8, Sum: 1188.4]
      [GC Worker End (ms): Min: 39719.3, Avg: 39719.7, Max: 39720.0, Diff: 0.8]
   [Code Root Fixup: 0.0 ms]
   [Code Root Migration: 3.7 ms]
   [Clear CT: 4.6 ms]
   [Other: 45.4 ms]
      [Choose CSet: 0.1 ms]
      [Ref Proc: 0.8 ms]
      [Ref Enq: 0.0 ms]
      [Free CSet: 43.2 ms]
   [Eden: 7744.0M(7744.0M)->0.0B(7743.0M) Survivors: 448.0M->449.0M Heap: 8199.9M(16.0G)->457.5M(16.0G)]
 [Times: user=1.26 sys=0.01, real=0.20 secs]

Parallel Phases:

39.570: [GC pause (G1 Evacuation Pause) (young), 0.2028160 secs]
At timestamp 39.570, a young GC occurred; it lasted 0.2028160 seconds (202.8 ms).
[Parallel Time: 149.2 ms, GC Workers: 8]
Of the 202.8 ms total pause, 149.2 ms was spent in the parallel phase, executed by 8 GC worker threads.
[GC Worker Start (ms): Min: 39571.0, Avg: 39571.1, Max: 39571.2, Diff: 0.2]
The timestamps at which the GC worker threads started.
[Ext Root Scanning (ms): Min: 0.4, Avg: 0.5, Max: 0.6, Diff: 0.2, Sum: 4.0]
External root scanning takes an average of 0.5 ms. This is the time the GC workers spend scanning the external roots, such as registers, thread stacks, and the system dictionary.
[Update RS (ms): Min: 5.1, Avg: 6.6, Max: 11.7, Diff: 6.6, Sum: 52.7]
         [Processed Buffers: Min: 1, Avg: 1.5, Max: 4, Diff: 3, Sum: 12]
Updating the remembered sets takes an average of 6.6 ms. On average, each thread processed 1.5 buffers. These are update buffers that were completed but not yet processed by the concurrent refinement threads.
[Scan RS (ms): Min: 11.0, Avg: 14.8, Max: 17.8, Diff: 6.9, Sum: 118.7]
Scan RS takes an average of 14.8 ms. This is the time spent scanning the remembered sets for references into the collection set (CSet).
[Code Root Scanning (ms): Min: 2.3, Avg: 3.7, Max: 4.7, Diff: 2.4, Sum: 29.4]
Time to scan the code roots (references from compiled code into the collection set).
[Object Copy (ms): Min: 121.1, Avg: 122.0, Max: 122.9, Diff: 1.8, Sum: 975.6]
Time spent copying live objects from the collection-set regions to their destination regions.
[Termination (ms): Min: 0.0, Avg: 0.5, Max: 0.6, Diff: 0.6, Sum: 4.0]
Time between when a worker thread finishes its own work and offers to terminate, and when it actually terminates.
[GC Worker Other (ms): Min: 0.1, Avg: 0.5, Max: 0.9, Diff: 0.8, Sum: 4.0]
Time that cannot be attributed to any of the categories above.
[GC Worker Total (ms): Min: 148.1, Avg: 148.6, Max: 148.9, Diff: 0.8, Sum: 1188.4]
Total parallel time for gc threads.
[GC Worker End (ms): Min: 39719.3, Avg: 39719.7, Max: 39720.0, Diff: 0.8]
Time stamp when gc worker threads end.
For all of the parallel GC activities, a large difference between Min and Max indicates that the work is not evenly distributed among the GC worker threads.
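
A rough way to eyeball this is to pull the per-phase lines out of the log and watch the Diff values (assuming the log was written to gc.log as in the earlier sketch):

grep -E 'Ext Root Scanning|Update RS|Scan RS|Object Copy' gc.log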

Sequential phases (we may introduce more parallelism later)

[Code Root Fixup: 0.0 ms]
Time to clean up the code roots.
[Code Root Migration: 3.7 ms]
Time to update (migrate) the retained code roots.
[Other: 45.4 ms]
      [Choose CSet: 0.1 ms]
      [Ref Proc: 0.8 ms]
      [Ref Enq: 0.0 ms]
      [Free CSet: 43.2 ms]
Choose CSet: time spent deciding which regions to add to the collection set.
Ref Proc: time spent processing the deferred references from the prior GC.
Ref Enq: time spent enqueueing the processed references onto their reference queues.
Free CSet: time spent freeing the regions that were just collected, including the time to free their remembered sets.
Note: sometimes we see unaccounted 'Other' time, especially when a 'to-space exhausted' event occurred. Evacuation failure handling includes recalculating the used heap, removing self-forwarding pointers, and restoring remembered sets.

Heap Summary

[Eden: 7744.0M(7744.0M)->0.0B(7743.0M) Survivors: 448.0M->449.0M Heap: 8199.9M(16.0G)->457.5M(16.0G)]
Eden: Eden usage before the GC was 7744.0M, and the Eden capacity for this GC was 7744.0M. After the GC, Eden usage is 0M, and the Eden capacity for the next GC will be 7743.0M.
Survivors: survivor space usage was 448.0M before the GC and 449.0M after.
Heap: heap usage before the GC was 8199.9M out of a 16.0G capacity; after the GC, usage is 457.5M and the capacity for the next GC remains 16.0G.
Time Summary
[Times: user=1.26 sys=0.01, real=0.20 secs]
High system CPU can indicate swapping, or some other kind of OS/kernel interaction or activity.
Low user CPU relative to real (wall clock) time can indicate a lack of parallelism; the root cause should also be identifiable in the "Parallel Time" stats area of the GC log. In this example, user/real = 1.26/0.20 ≈ 6.3, a reasonable degree of parallelism for 8 GC workers.

-XX:+PrintAdaptiveSizePolicy

If this flag is used, G1 prints out more information about how it makes its ergonomic decisions.
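
A sketch of a launch with this flag added (MyApp is a placeholder; the other flags are from the earlier example):

java -XX:+UseG1GC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy MyApp

The log then includes G1Ergonomics lines like the following: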
2.900: [GC pause (G1 Evacuation Pause) (mixed) 2.900: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 21789, predicted base time: 31.52 ms, remaining time: 168.48 ms, target pause time: 200.00 ms] 2.900: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 3 regions, survivors: 9 regions, predicted young region time: 22.64 ms]
 2.900: [G1Ergonomics (CSet Construction) finish adding old regions to CSet, reason: old CSet region num reached max, old: 26 regions, max: 26 regions]
 2.900: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 3 regions, survivors: 9 regions, old: 26 regions, predicted pause time: 96.49 ms, target pause time: 200.00 ms]
 2.927: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 12.52 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
At the beginning of a GC, these lines show how G1 decides which regions go into the CSet.
The minimum number of young regions in the CSet must include the survivor regions and be enough to keep up with the allocation rate. In addition, the predicted time to process the pending cards and evacuate the regions should stay under the target pause time.
2.927: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 178257920 bytes, allocation request: 0 bytes, threshold: 120795930 bytes (45.00 %), source: end of GC]
 2.927: [G1Ergonomics (Mixed GCs) continue mixed GCs, reason: candidate old regions available, candidate old regions: 49 regions, reclaimable: 28742816 bytes (10.71 %), threshold: 10.00 %]
At the end of the GC, G1 decides to continue with another mixed GC, because the reclaimable space in the old generation (28742816 bytes, 10.71%) is higher than G1HeapWastePercent (10%).
3.019: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 189792256 bytes, allocation request: 2097168 bytes, threshold: 120795930 bytes (45.00 %), source: concurrent humongous allocation]
 3.019: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: requested by GC cause, GC cause: G1 Humongous Allocation]
 3.019: [G1Ergonomics (Concurrent Cycles) initiate concurrent cycle, reason: concurrent cycle initiation requested]
At the end of the GC, G1 decides to start a concurrent cycle, since the heap occupancy (189792256 bytes) is higher than the InitiatingHeapOccupancyPercent threshold (45%, i.e. 120795930 bytes).

-XX:+UnlockExperimentalVMOptions -XX:G1LogLevel=finest

This flag combination prints out even more detailed information, such as per-GC-worker-thread timings.
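
A sketch of a launch with these flags (MyApp is a placeholder):

java -XX:+UseG1GC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UnlockExperimentalVMOptions -XX:G1LogLevel=finest MyApp

The log then looks like this: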
2.900: [GC pause (G1 Evacuation Pause) (mixed) 2.900: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 21789, predicted base time: 31.52 ms, remaining time: 168.48 ms, target pause time: 200.00 ms] 2.900: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 3 regions, survivors: 9 regions, predicted young region time: 22.64 ms]
 2.900: [G1Ergonomics (CSet Construction) finish adding old regions to CSet, reason: old CSet region num reached max, old: 26 regions, max: 26 regions]
 2.900: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 3 regions, survivors: 9 regions, old: 26 regions, predicted pause time: 96.49 ms, target pause time: 200.00 ms]
 2.927: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 12.52 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
 2.927: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 178257920 bytes, allocation request: 0 bytes, threshold: 120795930 bytes (45.00 %), source: end of GC]
 2.927: [G1Ergonomics (Mixed GCs) continue mixed GCs, reason: candidate old regions available, candidate old regions: 49 regions, reclaimable: 28742816 bytes (10.71 %), threshold: 10.00 %]
, 0.0271740 secs]
   [Parallel Time: 21.0 ms, GC Workers: 23]
      [GC Worker Start (ms):  2900.3  2900.3  2900.4  2900.4  2900.5  2900.5  2900.6  2900.6  2900.6  2900.6  2900.6  2900.6  2900.7  2900.7  2900.7  2900.8  2900.8  2900.8  2900.9  2900.9  2900.9  2900.9  2901.0
       Min: 2900.3, Avg: 2900.7, Max: 2901.0, Diff: 0.7]
      [Ext Root Scanning (ms):  0.5  0.4  0.4  0.4  0.3  0.3  0.2  0.2  0.2  0.2  0.2  0.2  0.1  0.1  0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
       Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.5, Sum: 3.8]
      [Update RS (ms):  3.8  3.9  3.6  3.8  3.7  3.9  4.0  3.8  4.3  4.2  4.1  4.2  3.7  3.6  3.9  4.0  3.9  3.9  3.6  3.9  4.1  3.7  3.5
       Min: 3.5, Avg: 3.9, Max: 4.3, Diff: 0.8, Sum: 89.0]
         [Processed Buffers:  4  3  5  3  5  3  2  4  4  3  4  6  5  3  4  3  3  3  3  4  4  6  3
          Min: 2, Avg: 3.8, Max: 6, Diff: 4, Sum: 87]
      [Scan RS (ms):  1.4  1.3  1.6  1.4  1.5  1.3  1.2  1.4  0.9  1.0  1.1  1.1  1.4  1.6  1.2  1.1  1.3  1.4  1.5  1.2  1.0  1.3  1.5
       Min: 0.9, Avg: 1.3, Max: 1.6, Diff: 0.7, Sum: 29.6]
      [Code Root Scanning (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
       Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms):  13.4  13.4  13.4  13.4  13.4  13.4  13.4  13.4  13.4  13.4  13.4  13.4  13.5  13.4  13.5  13.4  13.4  13.3  13.4  13.4  13.4  13.4  13.5
       Min: 13.3, Avg: 13.4, Max: 13.5, Diff: 0.1, Sum: 308.7]
Since the number of GC workers is 23, the log prints the timing statistics for each individual thread:
[GC Worker Start (ms):  2900.3  2900.3  2900.4  2900.4  2900.5  2900.5  2900.6  2900.6  2900.6  2900.6  2900.6  2900.6  2900.7  2900.7  2900.7  2900.8  2900.8  2900.8  2900.9  2900.9  2900.9  2900.9  2901.0
       Min: 2900.3, Avg: 2900.7, Max: 2901.0, Diff: 0.7]
Thread 1 starts at 2900.3.
Thread 2 starts at 2900.3.
...
The minimum thread start timestamp is 2900.3.
The average thread start timestamp is 2900.7.
The latest thread start timestamp is 2901.0.

Wednesday, January 30, 2013

HotSpot -XX:+PrintCompilation output, what does it mean?

HotSpot -XX:+PrintAssembly How to:

Reference:

https://wikis.oracle.com/display//HotSpotInternals/PrintAssembly
http://stackoverflow.com/questions/9341083/how-to-use-xxunlockdiagnosticvmoptions-xxcompilecommand-print-option-with-j

Build the plugin:

For PrintAssembly to work, you need the HotSpot disassembler plugin. You can download a prebuilt one, or build it yourself. The source is located in <jdk>/src/share/tools/hsdis, and there is a README. Here are the steps I followed:
(0. copy the source code from hotspot bundles: hotspot/src/share/tools/hsdis)
1. cd  <jdk>/src/share/tools/hsdis
2. Download binutils from http://directory.fsf.org/project/binutils and put it under build/binutils. The makefile looks for binutils under build by default. Alternatively, build binutils separately and export the environment variable BINUTILS=<path>, e.g. export BINUTILS=<.../tools/hsdis/build/binutils-2.22>
3. make
4. You might see this error:
hsdis.c:32:20: error: sysdep.h: No such file or directory
 
If so, change hsdis.c as follows:
-bash-4.1$ diff hsdis.c hsdis.c.orig                                                                                 
32c32
< #include <errno.h>
---
> #include <sysdep.h>

Try make again.
5. Copy hsdis-amd64.so (the name depends on the platform) to <jdk>/jre/lib/amd64

Generate the assembly code:

 -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
You can redirect the assembly output to a separate file with -XX:+LogVMOutput -XX:LogFile=foo.log.
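
A hypothetical full invocation (MyApp and assembly.log are placeholders) could look like:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:+LogVMOutput -XX:LogFile=assembly.log MyApp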

Things to look for:

lock addl — often emitted as a memory barrier (e.g. for volatile writes)
callq — call instructions, e.g. calls into methods that were not inlined


HotSpot -XX:+LogCompilation How to:

"The -XX:+LogCompilation flag produces a low-level XML file about compiler and runtime decisions, which may be interesting to some. The -XX:+UnlockDiagnosticVMOptions must come first. The dump is to hotspot.log in the current directory; use -XX:LogFile=foo.log to change this."

LogCompilation gives similar information to -XX:+PrintCompilation -XX:+PrintInlining, with more detail about why a decision was made.

Generate the log:

 -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=foo.log
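
A hypothetical full invocation (MyApp is a placeholder) might be:

java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=foo.log MyApp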

Build the log parser:

The log is low-level XML, which is hard to read. You can build a parser, logc.jar, to parse the log. Steps:
(0. copy the source code from hotspot bundles: hotspot/src/share/tools/LogCompilation)
1. cd  <jdk>/src/share/tools/LogCompilation
2. make
This creates a build directory and generates logc.jar.

How to use the log parser: (From README under share/tools/LogCompilation)

 -bash-4.1$ cat README                                                                                                
This is a very rough tool for parsing -XX:+LogCompilation output.
It's main purpose is to recreate output similar to
-XX:+PrintCompilation -XX:+PrintInlining output from a debug JVM.  It
requires a 1.5 JDK to build and simply typing make should build it.

It produces a jar file, logc.jar, that can be run on the
hotspot.log from LogCompilation output like this:

  java -jar logc.jar hotspot.log

This will produce something like the normal PrintCompilation output.
Adding the -i option with also report inlining like PrintInlining.

More information about the LogCompilation output can be found at

http://wikis.sun.com/display/HotSpotInternals/LogCompilation+overview
http://wikis.sun.com/display/HotSpotInternals/PrintCompilation
http://wikis.sun.com/display/HotSpotInternals/LogCompilation+tool

What does it mean in the log output?

The output produced by logc.jar is better formatted than that produced by -XX:+PrintCompilation -XX:+PrintInlining (where some lines get tangled together). It also gives more details on compilation, such as 'uncommon trap', 'not entrant', and 'zombie'.
'uncommon trap': an assumption the compiled code relied on changed; the method may need to be recompiled.
'not entrant': new calls are not allowed to enter this compiled version.
'zombie': the compiled code is on its way to being reclaimed.

70 make_not_entrant
2 uncommon trap class_check maybe_recompile
  @14 HighContentionSimulator run ()V
2 uncommon trap class_check maybe_recompile
  @14 HighContentionSimulator run ()V
2 uncommon trap class_check maybe_recompile
  @14 HighContentionSimulator run ()V
2 make_not_entrant
65 make_not_entrant
3%   HighContentionSimulator::run @ 2 (24 bytes)
    @ 14 java.lang.Runnable::run (0 bytes)
     type profile java/lang/Runnable -> HighContentionSimulator$Mode$1 (59%)

Lines starting with @ show call sites at a bytecode index, for example:

@ 14 java.lang.Runnable::run (0 bytes)

If no reason is printed after it, the call was inlined.

 @ 23 java.util.concurrent.locks.AbstractQueuedSynchronizer::selfInterrupt never execute

If a reason is printed (here "never execute"), the call was not inlined.

With HotSpot -XX:+PrintCompilation, you can get something like this:
 8922    3 %     4       HighContentionSimulator::run @ 2 (24 bytes)
<timestamp> <compilation-id> <flags> <compilation-level, if TieredCompilation is on> <method> @ <osr-bci> (code size)
flags: %: is_osr_method (on stack replacement)
         s: is_synchronized
         !: has_exception_handler
         b: is_blocking
         n: is_native
@ 2: this is an OSR method and the osr_bci is 2
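
A hypothetical invocation to get this kind of output directly (MyApp is a placeholder; PrintInlining is a diagnostic flag, hence UnlockDiagnosticVMOptions):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining MyApp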


Wednesday, September 26, 2012

Use DTrace to locate the source of unstable performance (Solaris 11)

I was running a micro benchmark. The run takes about 10 minutes, with warm-up and stable stages (2 minutes). The score varies 50-100% from run to run. When the score is good, the GC activity is about twice what it is when the score is low. prstat -mL -p <pid> shows not much lock activity.

The hardware is a 2-socket AMD Opteron (Abu Dhabi), 16 cores per socket.
# /usr/bin/lgrpinfo
lgroup 0 (root):
        Children: 1-4
        CPUs: 0-31
        Memory: installed 48G, allocated 9.6G, free 38G
        Lgroup resources: 1-4 (CPU); 1-4 (memory)
        Latency: 16
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-7
        Memory: installed 16G, allocated 2.3G, free 14G
        Lgroup resources: 1 (CPU); 1 (memory)
        Load: 0.122
        Latency: 10
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 8-15
        Memory: installed 8.0G, allocated 1.9G, free 6.1G
        Lgroup resources: 2 (CPU); 2 (memory)
        Load: 1.53e-05
        Latency: 10
lgroup 3 (leaf):
        Children: none, Parent: 0
        CPUs: 16-23
        Memory: installed 16G, allocated 2.1G, free 14G
        Lgroup resources: 3 (CPU); 3 (memory)
        Load: 1.53e-05
        Latency: 10
lgroup 4 (leaf):
        Children: none, Parent: 0
        CPUs: 24-31
        Memory: installed 8.0G, allocated 3.3G, free 4.7G
        Lgroup resources: 4 (CPU); 4 (memory)
        Load: 0.119
        Latency: 10

# /usr/bin/plgrp -a all 7044
     PID/LWPID    HOME       AFFINITY
    7044/1        2     0-4/none
    7044/2        3     0-4/none
    7044/3        4     0-4/none
    7044/4        3     0-4/none
    7044/5        1     0-4/none
    7044/6        2     0-4/none
    7044/7        4     0-4/none
    7044/8        3     0-4/none
    7044/9        1     0-4/none
    7044/10       2     0-4/none
    7044/11       4     0-4/none
    7044/12       3     0-4/none
    7044/13       1     0-4/none
    7044/14       2     0-4/none
    7044/15       4     0-4/none
    7044/16       3     0-4/none
    7044/17       1     0-4/none
    7044/18       2     0-4/none
    7044/19       4     0-4/none
    7044/20       3     0-4/none
    7044/21       1     0-4/none
    7044/22       2     0-4/none
    7044/23       4     0-4/none
    7044/24       3     0-4/none
    7044/25       1     0-4/none
    7044/26       2     0-4/none
    7044/27       4     0-4/none
    7044/28       3     0-4/none
    7044/29       1     0-4/none
    7044/30       2     0-4/none
    7044/31       4     0-4/none
    7044/32       3     0-4/none
    7044/33       1     0-4/none
    7044/34       2     0-4/none
    7044/35       4     0-4/none
    7044/36       3     0-4/none
    7044/37       1     0-4/none
The process is not bound to an lgroup.
The DTrace script (from http://hub.opensolaris.org/bin/view/Community+Group+performance/dtrace):
syscall::fork*:return
/
progenyof($1) && pid != $1
/
{
        child_trace[pid] = 1;
}

syscall::vfork:return
/
progenyof($1) && pid != $1
/
{
        child_trace[pid] = 1;
}

sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp)
/
{
        self->remotetime[curthread] = timestamp;
}

sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid != lgrp) &&
self->remotetime[curthread] != 0
/
{
        @lgrplat[curthread] = sum(timestamp - self->remotetime[curthread]);
        self->remotetime[curthread] = 0;
}

sched:::on-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp)
/
{
        self->localtime[curthread] = timestamp;
}

sched:::off-cpu
/
progenyof($1) &&
(pid == $1 || child_trace[pid]) &&
(curthread->t_lpl->lpl_lgrpid == lgrp) &&
self->localtime[curthread] != 0
/
{
        @lgrphome[curthread] = sum(timestamp - self->localtime[curthread]);
        self->localtime[curthread] = 0;
}

END
{
        printa("\nThread 0x%p ran locally for : %@u nsecs\n", @lgrphome);
        printa("\nThread 0x%p ran remotely for : %@u nsecs \n", @lgrplat);
}
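
The script takes the target pid as its first operand ($1). Assuming it is saved as lgrp.d, a typical invocation (run with DTrace privileges and stopped with Ctrl-C after the run) would be:

dtrace -s lgrp.d <pid>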

The output shows where each thread ran (locally vs. remotely relative to its home lgroup), which is what causes the variation.
To verify the theory, I created a processor set and bound the process to it:
psrset -c 1 16-23
Start the JVM, get the pid, and bind the process to processor set 1:
psrset -b 1 <pid>
Or, if the user has the privilege, the process can be bound when starting the JVM:
psrset -e 1 java ...
With this binding, there is very little variation in the test scores. With the java parameter -XX:+UseNUMA, the variation is reduced to 10%, though not as low as when binding manually.

As an ordinary user, I do not have psrset privilege.  
-bash-4.1$  ppriv $$
14460:  -bash
flags = <none>
        E: basic
        I: basic
        P: basic
        L: all
When I execute psrset, I got the error "not owner".
http://www.c0t0d0s0.org/archives/4075-Less-known-Solaris-features-RBAC-and-Privileges-Part-3-Privileges.html gives good details on role/privilege management. For my purpose, as root, I added sys_res_config to the user's default privileges:

# usermod -K defaultpriv=basic,sys_res_config <username>

Monday, August 20, 2012

How to get the hardware layout on Linux

Sometimes, you need to know the hardware layout for interrupt handling assignment, memory access, etc.

You can use dmidecode along with the cat /proc/XXX commands to find out the hardware layout.

http://www.thegeekstuff.com/2008/11/how-to-get-hardware-information-on-linux-using-dmidecode-command/ gives a good description of dmidecode:

"dmidecode command reads the system DMI table to display hardware and BIOS information of the server. Apart from getting current configuration of the system, you can also get information about maximum supported configuration of the system using dmidecode. For example, dmidecode gives both the current RAM on the system and the maximum RAM supported by the system."

Here I will show an example of how to determine the CPU threads and their layout:

[root@sc14ia06 nopermgen]# dmidecode -t 4
# dmidecode 2.10
SMBIOS 2.6 present.

Handle 0x0004, DMI type 4, 42 bytes
Processor Information
        Socket Designation: CPU 1
        Type: Central Processor
        Family: Xeon
        Manufacturer: Intel           
        ID: D2 06 02 00 FF FB EB BF
        Signature: Type 0, Family 6, Model 45, Stepping 2
        Flags:
                FPU (Floating-point unit on-chip)
                VME (Virtual mode extension)
                DE (Debugging extension)
                PSE (Page size extension)
                TSC (Time stamp counter)
                MSR (Model specific registers)
                PAE (Physical address extension)
                MCE (Machine check exception)
                CX8 (CMPXCHG8 instruction supported)
                APIC (On-chip APIC hardware supported)
                SEP (Fast system call)
                MTRR (Memory type range registers)
                PGE (Page global enable)
                MCA (Machine check architecture)
                CMOV (Conditional move instruction supported)
                PAT (Page attribute table)
                PSE-36 (36-bit page size extension)
                CLFSH (CLFLUSH instruction supported)
                DS (Debug store)
                ACPI (ACPI supported)
                MMX (MMX technology supported)
                FXSR (Fast floating-point save and restore)
                SSE (Streaming SIMD extensions)
                SSE2 (Streaming SIMD extensions 2)
                SS (Self-snoop)
                HTT (Hyper-threading technology)
                TM (Thermal monitor supported)
                PBE (Pending break enabled)
        Version: Genuine Intel(R) CPU  @ 2.30GHz               
        Voltage: 1.0 V
        External Clock: 100 MHz
        Max Speed: 4000 MHz
        Current Speed: 2300 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: 0x0005
        L2 Cache Handle: 0x0006
        L3 Cache Handle: 0x0007
        Serial Number: To Be Filled By O.E.M.
        Asset Tag: To Be Filled By O.E.M.
        Part Number: To Be Filled By O.E.M.
        Core Count: 8
        Core Enabled: 8
        Thread Count: 16
        Characteristics:
                64-bit capable

Handle 0x0008, DMI type 4, 42 bytes
Processor Information
        Socket Designation: CPU 2
        Type: Central Processor
        Family: Xeon
        Manufacturer: Intel           
        ID: D2 06 02 00 FF FB EB BF
        Signature: Type 0, Family 6, Model 45, Stepping 2
        Flags:
                FPU (Floating-point unit on-chip)
                VME (Virtual mode extension)
                DE (Debugging extension)
                PSE (Page size extension)
                TSC (Time stamp counter)
                MSR (Model specific registers)
                PAE (Physical address extension)
                MCE (Machine check exception)
                CX8 (CMPXCHG8 instruction supported)
                APIC (On-chip APIC hardware supported)
                SEP (Fast system call)
                MTRR (Memory type range registers)
                PGE (Page global enable)
                MCA (Machine check architecture)
                CMOV (Conditional move instruction supported)
                PAT (Page attribute table)
                PSE-36 (36-bit page size extension)
                CLFSH (CLFLUSH instruction supported)
                DS (Debug store)
                ACPI (ACPI supported)
                MMX (MMX technology supported)
                FXSR (Fast floating-point save and restore)
                SSE (Streaming SIMD extensions)
                SSE2 (Streaming SIMD extensions 2)
                SS (Self-snoop)
                HTT (Hyper-threading technology)
                TM (Thermal monitor supported)
                PBE (Pending break enabled)
        Version: Genuine Intel(R) CPU  @ 2.30GHz               
        Voltage: 1.0 V
        External Clock: 100 MHz
        Max Speed: 4000 MHz
        Current Speed: 2300 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: 0x0009
        L2 Cache Handle: 0x000A
        L3 Cache Handle: 0x000B
        Serial Number: To Be Filled By O.E.M.
        Asset Tag: To Be Filled By O.E.M.
        Part Number: To Be Filled By O.E.M.
        Core Count: 8
        Core Enabled: 8
        Thread Count: 16
        Characteristics:
                64-bit capable

It shows a 2-socket CPU, 8 cores per socket, with hyper-threading enabled (Core Count: 8, Thread Count: 16).

 cat /proc/cpuinfo |egrep 'processor|physical id|core id'
processor       : 0
physical id     : 0
core id         : 0
processor       : 1
physical id     : 0
core id         : 1
processor       : 2
physical id     : 0
core id         : 2
processor       : 3
physical id     : 0
core id         : 3
processor       : 4
physical id     : 0
core id         : 4
processor       : 5
physical id     : 0
core id         : 5
processor       : 6
physical id     : 0
core id         : 6
processor       : 7
physical id     : 0
core id         : 7
processor       : 8
physical id     : 1
core id         : 0
processor       : 9
physical id     : 1
core id         : 1
processor       : 10
physical id     : 1
core id         : 2
processor       : 11
physical id     : 1
core id         : 3
processor       : 12
physical id     : 1
core id         : 4
processor       : 13
physical id     : 1
core id         : 5
processor       : 14
physical id     : 1
core id         : 6
processor       : 15
physical id     : 1
core id         : 7
processor       : 16
physical id     : 0
core id         : 0
processor       : 17
physical id     : 0
core id         : 1
processor       : 18
physical id     : 0
core id         : 2
processor       : 19
physical id     : 0
core id         : 3
processor       : 20
physical id     : 0
core id         : 4
processor       : 21
physical id     : 0
core id         : 5
processor       : 22
physical id     : 0
core id         : 6
processor       : 23
physical id     : 0
core id         : 7
processor       : 24
physical id     : 1
core id         : 0
processor       : 25
physical id     : 1
core id         : 1
processor       : 26
physical id     : 1
core id         : 2
processor       : 27
physical id     : 1
core id         : 3
processor       : 28
physical id     : 1
core id         : 4
processor       : 29
physical id     : 1
core id         : 5
processor       : 30
physical id     : 1
core id         : 6
processor       : 31
physical id     : 1
core id         : 7

This shows there are 32 logical processors.
processor: the logical processor id
physical id: the socket id
core id: the core id within the socket

Putting this together, the mapping is: logical processors 0-7 and 16-23 are on socket 0, 8-15 and 24-31 are on socket 1, and processors N and N+16 are the two hardware threads of the same core.
Another way to get this information is:
cd /sys/devices/system/cpu
cat cpu0/topology/thread_siblings_list 
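On this box this should print 0,16, meaning logical CPUs 0 and 16 are hyper-thread siblings on the same core.

To dump the sibling list for every logical CPU in one go, a small loop like this (a sketch, not from the original post) can be used:

for c in /sys/devices/system/cpu/cpu[0-9]*; do echo -n "$c: "; cat $c/topology/thread_siblings_list; done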
 
 


Friday, December 23, 2011

Network interrupt handling and monitoring

Problem: We had a test case that showed network packets being dropped. The packets are dropped at the TCP layer, not the IP layer, and the behavior changes when we reboot the box. The machine is a 2-socket, 24-core Westmere (with hyper-threading enabled).


Conclusion: When the network interrupt handling happens to land on the cores that handle softirqs, those cores are overloaded; they cannot keep up with the work and drop packets. Since we have irqbalance on, the smp_affinity can change from boot to boot, so the network interrupt handling may or may not land on the same cores.

Solutions: 
1. Check rps_cpus to see which CPUs handle softirqs:
cat /sys/class/net/<NIC>/queues/rx-0/rps_cpus
00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00111111
The value is hexadecimal. Converted to binary it is 100010001000100010001, which sends the softirqs to CPUs 0, 4, 8, 12, 16, and 20.
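A quick way to do the conversion (assuming bc is installed):

echo 'obase=2; ibase=16; 111111' | bc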
2. Monitor which CPUs the interrupts go to. As we are using InfiniBand, we grep for the Mellanox driver:
cat /proc/interrupts
150:          3   69695238          0          0          0          2          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth-mlx4-19:00.0-1
 173:       3822        413          0          0          0    1576152          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      mlx4-19:00.0-(async)
173: mlx4_core, handles low-level functions and hardware interrupts.
150: handles the QP interrupts and runs the soft-interrupt threads.

We noticed that the smp_affinity for IRQ 150 changes after a reboot.
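You can check it directly (the IRQ number 150 is specific to this box):

cat /proc/irq/150/smp_affinity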

3. To make it fixed, we used /etc/sysconfig/irqbalance.conf:
IRQ_AFFINITY_MASK=FFFFD1
This mask defines which CPUs to skip (1 means skip). Converting the hexadecimal to binary gives 111111111111111111010001. It pins interrupt handling to socket 0, but skips CPUs 0 and 4.

Deep dive into interrupts, IRQs, softirqs, and rps_cpus
What happens when the NIC receives a PDU (Protocol Data Unit)?
Reference: http://perso.hexabyte.tn/ichihi/projects/hpc_optimisations.pdf
When the NIC receives a PDU, it copies the PDU into kernel buffers using DMA (Direct Memory Access) and notifies the kernel of its arrival by raising a hard interrupt. The device driver (part of the kernel) handles the hard interrupt. Hard interrupt handlers cannot be preempted, so they perform minimal work and schedule the rest to be handled asynchronously by a softirq. Softirqs are processed as regular kernel code by special kernel threads. The kernel will drop packets if it cannot pull them from the NIC quickly enough.

IRQBalance
Reference: http://irqbalance.org/documentation.html
IRQBalance distributes interrupts among the CPUs. It may be in performance mode or power-save mode, depending on the load.
"As a first step, the interrupts are assigned to packages. In power save mode, all interrupts are assigned to the first package (to allow the other packages to stay in their respective sleep states longer), while in performance mode, irqbalance tries to spread the interrupts over the available packages as much as possible such that the cumulative amount of work that each package will do is made equal as much as possible. In addition, in performance mode, irqbalance will always distribute interrupts of the same class to different packages as long as there are not more interrupts in the class than there are packages. This is done to prevent one networking (or storage) interrupt from interfering with another interrupt in the same class via the sharing of resources that goes with being assigned to the same package. To make things even more complex, irqbalance also takes into account which package the interrupt was already assigned to previously and will try to keep the interrupt assigned to the same package (since there might be valuable data in the cache there already for the device). It also takes into account which packages are directly connected to the hardware in a NUMA system (since using a core with such a direct connection is going to be faster than using a core which has an indirect connection)."

"For the Networking interrupt class, it is essential that the interrupt goes to one and one core only. The implementation of the Linux TCP/IP stack will then use this property to get some major efficiencies in its operation. In addition, if an interrupt source of another class is very high rate, irqbalance will also assign this to a specific core in order to maximize the efficiency of the level 1 cache of this core. This assigning to the cores uses the same algorithm as the assignment to cache-domains."

In my case, the hardware/firmware interrupts are handled by the CPU given by IRQ 173's smp_affinity, and they generate software interrupts. irqbalance checks the node (NUMA architecture, CPU, scheduler) and picks a CPU to receive the software interrupts (IRQ 150's smp_affinity). That CPU pulls the interrupts from the hardware and classifies them; the packets then need to be sent to the receiving processor. This is where RPS (Receive Packet Steering) comes in: it generates a hash value and spreads the software interrupts evenly across the CPUs in rps_cpus.


RPS (Receive Packet Steering)
Reference: http://lwn.net/Articles/362339/
RPS distributes the incoming packets across the CPUs.
"Some network interfaces can help with the distribution of incoming packets; they have multiple receive queues and multiple interrupt lines. Others, though, are equipped with a single queue, meaning that the driver for that hardware must deal with all incoming packets in a single, serialized stream. Parallelizing such a stream requires some intelligence on the part of the host operating system."

"Tom's patch provides that intelligence by hooking into the receive path - netif_rx() and netif_receive_skb() - right when the driver passes a packet into the networking subsystem. At that point, it creates a hash from the relevant protocol data (IP addresses and port numbers, in particular) and uses it to pick a CPU; the packet is then enqueued for the target CPU's attention. By default, any CPU on the system is fair game for network processing, but the list of target CPUs for any given interface can be configured explicitly by the administrator if need be."

/proc/interrupts still shows that only the CPU defined by /proc/irq/<irq>/smp_affinity is used for NIC interrupt handling, even with RPS enabled. If you want to find out whether RPS is working, you have to look at /proc/softirqs instead (e.g. with watch -n1 cat /proc/softirqs).
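To enable RPS for a receive queue, you write the desired CPU mask into rps_cpus; for example (eth0 and the mask are placeholders matching this box's setup):

echo 00111111 > /sys/class/net/eth0/queues/rx-0/rps_cpus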

Tools:
To see all the interrupts: sar -I SUM -P ALL
To see packets per interface: netstat -i
To see which thread is assigned to which CPU: ps -p <pid> -L -o pid,tid,psr,pcpu
To see which CPUs the process is allowed to run on (its affinity): taskset [-c] -p <pid>
To see CPU utilization per core: mpstat -P ALL 1