Friday, November 13, 2015

JVM Parameters and Young Generation GC Algorithms

Dear Friend,
After a lot of R&D, mostly for my own learning, I am writing down the JVM parameters we use in our day-to-day work,
especially in production during application tuning.

List of JVM Parameters:

1. -Xms256m  (Initial heap memory size)
2. -Xmx512m  (Maximum heap memory size) 
3. -Xmn64m  (Size of Young Generation)
4. -XX:PermSize=40m  (Initial Permanent Generation size. PermGen was removed in Java SE 8.) 
5. -XX:MaxPermSize=40m (Maximum Permanent Generation size. PermGen was removed in Java SE 8.) 
6. -XX:+UseParallelGC  (the + sign sets this boolean flag to true, i.e. enables it)
7. -XX:+UseParallelOldGC (the + sign sets this boolean flag to true, i.e. enables it)
8. -XX:+UseConcMarkSweepGC 
9. -XX:+PrintCompilation
10. -verbose:gc or -XX:+PrintGC (both are alias)
11. -XX:+PrintGCDetails 
12. -XX:+PrintGCTimeStamps 
13. -XX:+PrintGCDateStamps 
14. -Xloggc:<file>  OR -Xloggc:verbose_gc.log 
15. -XX:+HeapDumpOnOutOfMemoryError
16. -Xmaxjitcodesize100m 
17. -XX:CompileThreshold=1500
18. -XX:+PrintHeapAtGC
19. -XX:+PrintTenuringDistribution 
20. -XX:+PrintClassHistogram
21. -XX:+PrintConcurrentLocks
22. -XX:+PrintCommandLineFlags
23. -XX:+UseParNewGC 
24. -XX:ParallelGCThreads=2 
25. -Xnoclassgc 
26. -XX:+DisableExplicitGC  
27. -XX:+UseTLAB 
28. -XX:+UseBiasedLocking 
29. -XX:+UseStringCache
30. -XX:+StringCache: 
31. -XX:+OptimizeStringConcat
32. -XX:+UseCompressedStrings 
33. -XX:SurvivorRatio=128
34. -XX:NewSize 
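
Before tuning, it helps to confirm which of the flags above a running JVM actually received. Besides -XX:+PrintCommandLineFlags (item 22), the standard java.lang.management API can report this from inside the application. The class name ShowJvmFlags below is hypothetical; this is a minimal sketch, and the printed heap sizes are only approximately the -Xms/-Xmx values (maxMemory() may come out slightly below -Xmx, because some collectors do not count one survivor space).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.List;

// Hypothetical helper class; run it with the flags you want to check, e.g.
//   java -Xms256m -Xmx512m -Xmn64m -XX:+UseParallelGC ShowJvmFlags
final class ShowJvmFlags {
    private static final long MB = 1024L * 1024L;

    /** JVM options (such as -Xmx512m) this JVM was started with; program arguments are not included. */
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Converts a byte count to whole megabytes, rounding down. */
    static long toMegabytes(long bytes) {
        return bytes / MB;
    }

    public static void main(String[] args) {
        for (String arg : jvmArguments()) {
            System.out.println("JVM option: " + arg);
        }
        Runtime rt = Runtime.getRuntime();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // init roughly reflects -Xms; max may be slightly below -Xmx
        // because some collectors exclude one survivor space from it.
        System.out.println("heap init      = " + toMegabytes(heap.getInit()) + " MB");
        System.out.println("heap committed = " + toMegabytes(rt.totalMemory()) + " MB");
        System.out.println("heap max       = " + toMegabytes(rt.maxMemory()) + " MB");
    }
}
```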

JVM Architecture and Memory Configurations:

1. -Xms256m: -Xms is the initial size of the heap.

2. -Xmx512m: -Xmx is the maximum size of the heap. We generally set -Xms and -Xmx to the same value to avoid the pauses caused by heap expansion.

3. -Xmn64m: -Xmn is the size of the young generation.

4. -XX:PermSize=40m: Initial Permanent Generation size. (PermGen was removed in Java SE 8.)

5. -XX:MaxPermSize=40m: Sets the maximum "permanent generation" size. HotSpot is unusual in that several types of data are stored in the "permanent generation", a separate area of the heap that is only rarely (or never) garbage-collected. The exact list of data kept there is a little fuzzy and certainly varies across HotSpot versions, but it generally contains class metadata, bytecode, interned strings and so on. Because this generation is rarely or never collected, you may need to increase its size.

6. -XX:+UseParallelGC: Turns on the parallel young-generation garbage collector. This is a stop-the-world collector that uses several threads to reduce pause times. Use -XX:+UseParallelGC when you want parallel collection for the YOUNG generation ONLY, while the OLD generation is still collected with the serial mark-sweep collector. (On newer JVMs, from JDK 7u4 onward, -XX:+UseParallelGC also enables parallel old-generation collection by default.)

7. -XX:+UseParallelOldGC: Tells the JVM to use a parallel collector for the old generation as well, i.e. parallel garbage collection for full collections. It is generally only useful if large numbers of old objects are often being collected. Enabling this option automatically sets -XX:+UseParallelGC. -XX:+UseParNewGC, usually known as the "parallel young generation collector", is similar to the parallel collector enabled by -XX:+UseParallelGC, but it is a separate implementation; notably, ParNew is the young collector that can be combined with CMS, whereas -XX:+UseParallelGC cannot.

8. -XX:+UseConcMarkSweepGC: Turns on the concurrent mark-sweep (CMS) collector. It runs most of its old-generation work concurrently with your application's threads, which reduces pauses significantly. It still stops the world briefly for its initial-mark and remark phases (CMS does not compact the heap), but those pauses are usually much shorter than pausing for the whole set of GC operations.
This is useful if you need to reduce the impact GC has on an application and can accept somewhat lower overall throughput than with the stop-the-world collectors. You obviously need multiple processors to see the full effect. If -XX:+UseConcMarkSweepGC is given on the command line, the flag UseParNewGC is also set to true unless it is explicitly set otherwise. So you only need -XX:+UseConcMarkSweepGC to get the concurrent collector together with the parallel young generation collector.

9. -XX:+PrintCompilation: Prints the name of each Java method that HotSpot decides to JIT-compile. The list usually shows a bunch of core Java class methods at first and then moves on to methods in your application.

10. -verbose:gc or -XX:+PrintGC (the two are aliases): This flag activates the “simple” GC logging mode, which prints a line for every young generation GC and every full GC. Here is an example:

    [GC 246656K->243120K(376320K), 0,0929090 secs]
    [Full GC 243120K->241951K(629760K), 1,5589690 secs]

A line begins with the GC type, either “GC” or “Full GC”. It is followed by the occupied heap memory before and after the GC (separated by an arrow) and the current capacity of the heap (in parentheses). The line ends with the duration of the GC (real time in seconds; the comma in 0,0929090 is a decimal separator). Thus, in the first line, 246656K->243120K(376320K) means that the GC reduced the occupied heap memory from 246656K to 243120K. The heap capacity at the time of GC was 376320K, and the GC took 0.0929090 seconds.

The simple GC logging format is independent of the GC algorithm used and therefore does not provide any further details. In the above example, we cannot even tell from the log whether the GC moved any objects from the young to the old generation. For that reason, detailed GC logging is more useful than the simple one.

11. -XX:+PrintGCDetails: Includes the data from -verbose:gc but also adds information about the size of the new generation and more accurate timings. If we use -XX:+PrintGCDetails instead of -XX:+PrintGC, we activate the “detailed” GC logging mode, whose output differs depending on the GC algorithm used. Let's start with the output of a young generation GC using the Throughput Collector. For better readability, I split the output over several lines and indented some of them; in the actual log this is a single, less human-readable line.

    [GC
        [PSYoungGen: 142816K->10752K(142848K)]
        246648K->243136K(375296K),
        0,0935090 secs
    ]
    [Times: user=0,55 sys=0,10, real=0,09 secs]

We can recognize a couple of elements from the simple GC log: we have a young generation GC (the leading “GC”) which reduced the occupied heap memory from 246648K to 243136K and took 0.0935090 seconds. In addition, we get information about the young generation itself: the collector used (“PSYoungGen”) as well as its occupancy and capacity (142816K->10752K(142848K)). In our example, the PSYoungGen collector reduced the occupied young generation memory from 142816K to 10752K. Since we know the young generation capacity, we can tell that the GC was triggered because the young generation could not have accommodated another object allocation: 142816K of the available 142848K were already in use. We can also conclude that most of the objects removed from the young generation are still alive and must have been moved to the old generation: comparing the young generation figures with the total heap figures shows that even though the young generation was almost completely emptied, the total heap occupancy remained roughly the same. The “Times” section of the detailed log contains the CPU time used by the GC, split into user space (“user”) and kernel space (“sys”) of the operating system.
It also shows the real (wall-clock) time (“real”) that passed while the GC was running; 0.09 is just a rounded value of the 0.0935090 seconds shown earlier in the line. If, as in our example, the CPU time is considerably higher than the real time, we can conclude that the GC ran on multiple threads; the logged CPU time is then the sum of the CPU times of all GC threads. Here (0.55 + 0.10) / 0.09 ≈ 7, and the collector in this example actually used 8 threads. The ratio is only a rough lower bound, because the GC threads are not all busy for the whole pause. The number of GC threads does not have to be a power of two; it is controlled by -XX:ParallelGCThreads (item 24).

Now consider the output of a full GC:

    [Full GC
        [PSYoungGen: 10752K->9707K(142848K)]
        [ParOldGen: 232384K->232244K(485888K)]
        243136K->241951K(628736K)
        [PSPermGen: 3162K->3161K(21504K)],
        1,5265450 secs
    ]
    [Times: user=10,96 sys=0,06, real=1,53 secs]

In addition to details about the young generation, the log also gives details about the old and permanent generations. For all three generations we can see the collector used, the occupancy before and after GC, and the capacity at the time of GC. Note that each number shown for the total heap is the sum of the respective numbers for the young and old generations. In our example, 241951K of the total heap is occupied, 9707K of which is in the young generation and 232244K in the old generation. The full GC took 1.53 seconds, and the 10.96 seconds of user CPU time show that the GC used multiple threads (10.96 / 1.53 ≈ 7, again consistent with 8 threads).

The detailed output for the different generations lets us reason about the cause of a GC. If, for any generation, the log states that its occupancy before GC was almost equal to its current capacity, that generation most likely triggered the GC. In the above example, however, this does not hold for any of the three generations, so what caused this GC?
A full GC may also happen when it is explicitly requested, either by the application (System.gc()) or via one of the external JVM interfaces. Such a “system GC” is easy to identify in the GC log because the line then starts with “Full GC (System)” instead of “Full GC”.

NOTE: For the Serial Collector, the detailed GC log is very similar to that of the Throughput (Parallel) Collector. The only real difference is that the sections have different names because different GC algorithms are used (for example, the old generation section is called “Tenured” instead of “ParOldGen”). Because the exact collector names are logged, we can infer some of the JVM's garbage collection settings from the log alone.

NOTE: For the CMS Collector, the detailed log for young generation GCs is also very similar to that of the Throughput Collector, but the same cannot be said for old generation GCs. With CMS, old generation GCs run concurrently with the application in several phases.
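
The reasoning in item 11 (was the young generation full, how much was promoted, how many threads were busy) is plain arithmetic on the logged numbers. The following minimal sketch does that arithmetic for the PSYoungGen line of the Throughput Collector shown above. The class YoungGcLine and its method names are hypothetical; it assumes exactly that single-line format, accepts both "." and "," as decimal separator, and is not a general GC-log parser (DefNew, ParNew, CMS and full GC lines have different layouts and are rejected).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses one detailed young-GC line of the Throughput
// Collector ("PSYoungGen") as printed by -XX:+PrintGCDetails.
final class YoungGcLine {
    private static final Pattern GC = Pattern.compile(
            "\\[PSYoungGen: (\\d+)K->(\\d+)K\\((\\d+)K\\)\\] (\\d+)K->(\\d+)K\\((\\d+)K\\), ([\\d.,]+) secs");
    private static final Pattern TIMES = Pattern.compile(
            "\\[Times: user=([\\d.,]+) sys=([\\d.,]+), real=([\\d.,]+) secs\\]");

    final long youngBefore, youngAfter, youngCapacity;
    final long heapBefore, heapAfter, heapCapacity;
    final double pauseSecs, userSecs, sysSecs, realSecs;

    private YoungGcLine(Matcher gc, Matcher times) {
        youngBefore = Long.parseLong(gc.group(1));
        youngAfter = Long.parseLong(gc.group(2));
        youngCapacity = Long.parseLong(gc.group(3));
        heapBefore = Long.parseLong(gc.group(4));
        heapAfter = Long.parseLong(gc.group(5));
        heapCapacity = Long.parseLong(gc.group(6));
        pauseSecs = seconds(gc.group(7));
        userSecs = seconds(times.group(1));
        sysSecs = seconds(times.group(2));
        realSecs = seconds(times.group(3));
    }

    /** Accepts both "0.0935090" and the comma form "0,0935090" used in the samples. */
    private static double seconds(String s) {
        return Double.parseDouble(s.replace(',', '.'));
    }

    static YoungGcLine parse(String line) {
        Matcher gc = GC.matcher(line);
        Matcher times = TIMES.matcher(line);
        if (!gc.find() || !times.find()) {
            throw new IllegalArgumentException("Not a PSYoungGen detail line: " + line);
        }
        return new YoungGcLine(gc, times);
    }

    /** The young generation was (almost) full before the GC, so it most likely triggered it. */
    boolean youngGenTriggered() {
        return youngBefore >= youngCapacity * 0.99;
    }

    /** KB that left the young generation but stayed in the heap: an estimate of what was promoted. */
    long promotedK() {
        return (youngBefore - youngAfter) - (heapBefore - heapAfter);
    }

    /** (user + sys) / real: a rough lower bound on the number of busy GC threads. */
    double cpuToRealRatio() {
        return (userSecs + sysSecs) / realSecs;
    }

    public static void main(String[] args) {
        YoungGcLine g = parse("[GC [PSYoungGen: 142816K->10752K(142848K)] 246648K->243136K(375296K), "
                + "0,0935090 secs] [Times: user=0,55 sys=0,10, real=0,09 secs]");
        System.out.println("young gen triggered GC: " + g.youngGenTriggered()); // true for this sample
        System.out.println("promoted ~" + g.promotedK() + "K");                  // 128552K for this sample
        System.out.println("CPU/real ratio: " + g.cpuToRealRatio());
    }
}
```

The promotion estimate is simply the young-generation reduction minus the total-heap reduction. For the sample it gives 128552K, which agrees with the full GC sample in item 11: the old generation holds 232384K there (243136K - 10752K), versus 103832K (246648K - 142816K) before the young GC.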
More detailed verbose_gc.log sample:

    {Heap before GC invocations=0 (full 0):
     par new generation total 19136K, used 17024K [MEMORY ADDRESS)
      eden space 17024K, 100% used [MEMORY ADDRESS)
      from space 2112K, 0% used [MEMORY ADDRESS)
      to space 2112K, 0% used [MEMORY ADDRESS)
     concurrent mark-sweep generation total 2075904K, used 0K [MEMORY ADDRESS)
     concurrent-mark-sweep perm gen total 21248K, used 8351K [MEMORY ADDRESS)
    2014-03-06T16:05:25.813+0530: 0.459: [GC 0.460: [ParNew
    Desired survivor size 1081344 bytes, new threshold 4 (max 4)
    - age 1: 802848 bytes, 802848 total
    : 17024K->798K(19136K), 0.0036420 secs] 17024K->798K(2095040K), 0.0037060 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
    Heap after GC invocations=1 (full 0):
     par new generation total 19136K, used 798K [MEMORY ADDRESS)
      eden space 17024K, 0% used [MEMORY ADDRESS)
      from space 2112K, 37% used [MEMORY ADDRESS)
      to space 2112K, 0% used [MEMORY ADDRESS)
     concurrent mark-sweep generation total 2075904K, used 0K [MEMORY ADDRESS)
     concurrent-mark-sweep perm gen total 21248K, used 8351K [MEMORY ADDRESS)
    }
    ===========================================
    {Heap before GC invocations=123 (full 0):
     par new generation total 19136K, used 15516K [MEMORY ADDRESS)
      eden space 17024K, 78% used [MEMORY ADDRESS)
      from space 2112K, 100% used [MEMORY ADDRESS)
      to space 2112K, 0% used [MEMORY ADDRESS)
     concurrent mark-sweep generation total 2075904K, used 161513K [MEMORY ADDRESS)
     concurrent-mark-sweep perm gen total 21248K, used 21245K [MEMORY ADDRESS)
    2014-03-06T16:13:56.610+0530: 511.256: [Full GC 511.256: [CMS: 161513K->79481K(2075904K), 0.3508450 secs] 177030K->79481K(2095040K), [CMS Perm : 21245K->21213K(21248K)], 0.3510700 secs] [Times: user=0.34 sys=0.01, real=0.35 secs]
    Heap after GC invocations=124 (full 1):
     par new generation total 76672K, used 0K [MEMORY ADDRESS)
      eden space 68160K, 0% used [MEMORY ADDRESS)
      from space 8512K, 0% used [MEMORY ADDRESS)
      to space 8512K, 0% used [MEMORY ADDRESS)
     concurrent mark-sweep generation total 2075904K, used 79481K [MEMORY ADDRESS)
     concurrent-mark-sweep perm gen total 35356K, used 21213K [MEMORY ADDRESS)
    }
    ===========================================

12. -XX:+PrintGCTimeStamps and 13. -XX:+PrintGCDateStamps: It is possible to add time and date information to the (simple or detailed) GC log. With -XX:+PrintGCTimeStamps, a timestamp reflecting the real time passed in seconds since JVM start is added to every line. An example:

    0,185: [GC 66048K->53077K(251392K), 0,0977580 secs]
    0,323: [GC 119125K->114661K(317440K), 0,1448850 secs]
    0,603: [GC 246757K->243133K(375296K), 0,2860800 secs]

If we specify -XX:+PrintGCDateStamps, each line starts with the absolute date and time when it was written:

    2014-01-03T12:08:38.102-0100: [GC 66048K->53077K(251392K), 0,0959470 secs]
    2014-01-03T12:08:38.239-0100: [GC 119125K->114661K(317440K), 0,1421720 secs]
    2014-01-03T12:08:38.513-0100: [GC 246757K->243133K(375296K), 0,2761000 secs]

The two flags can be combined if both outputs are desired. I recommend always specifying both, because the information is very useful for correlating GC log data with data from other sources.

14. -Xloggc:<file> or -Xloggc:verbose_gc.log: By default the GC log is written to stdout. With -Xloggc:<file> we can specify an output file instead. Note that this flag implicitly sets -XX:+PrintGC and -XX:+PrintGCTimeStamps as well. Still, I recommend setting these flags explicitly if desired, to safeguard yourself against unexpected changes in new JVM versions. This option overrides -verbose:gc if both are given on the command line, so if you use -Xloggc you don't need -verbose:gc.

15. -XX:+HeapDumpOnOutOfMemoryError: Useful if you have a slow-leaking application you can't pin down. It dumps the heap to disk whenever an OutOfMemoryError occurs, allowing you to do offline analysis.

16. -Xmaxjitcodesize100m: Maximum compiled code size.
The syntax changed over time (it used to be -Xmaxjitcodesize=32m, now -Xmaxjitcodesize32m).

17. -XX:CompileThreshold=1500: Number of method invocations/branches before a method is (re-)compiled [default 10,000 with -server, 1,500 with -client].

18. -XX:+PrintHeapAtGC: Prints the heap layout before and after each GC. With the "framework" collectors (serial GC, ParNew GC, CMS), its output is intermixed with the output of -XX:+PrintGCDetails and -XX:+PrintGC (the latter is equivalent to -verbose:gc), which makes it hard to read and analyze.

19. -XX:+PrintTenuringDistribution: Prints tenuring age information, i.e. the ages of the objects in the survivor spaces appear in the GC log. (The plus sign switches the boolean flag ON; with a minus sign, -XX:-PrintTenuringDistribution, it is switched OFF and nothing is printed.) Support for this flag was added to the Parallel Scavenge collector (-XX:+UseParallelGC) in JVM 6.0 (Mustang) build 55. I am using "1.6.0_26" and observe the same behavior. To confirm this, I tried the serial garbage collector (-XX:+UseSerialGC), and the tenuring age information is displayed correctly:

    2014-02-14T16:04:03.285-0500: 686.879: [GC 686.880: [DefNew
    Desired survivor size 13402112 bytes, new threshold 15 (max 15)
    - age 1: 1288856 bytes, 1288856 total
    - age 2: 320312 bytes, 1609168 total
    - age 3: 9816 bytes, 1618984 total
    - age 4: 33352 bytes, 1652336 total
    - age 5: 6256 bytes, 1658592 total
    - age 6: 34464 bytes, 1693056 total
    - age 7: 9128 bytes, 1702184 total
    - age 8: 100192 bytes, 1802376 total
    - age 9: 9024 bytes, 1811400 total
    - age 10: 55632 bytes, 1867032 total
    - age 11: 14616 bytes, 1881648 total
    - age 12: 302304 bytes, 2183952 total
    - age 13: 682192 bytes, 2866144 total
    - age 14: 11928 bytes, 2878072 total
    - age 15: 13928 bytes, 2892000 total

20. -XX:+PrintClassHistogram: Prints a histogram of class instances on Ctrl-Break. The jmap -histo command provides equivalent functionality.

21. -XX:+PrintConcurrentLocks: Prints java.util.concurrent locks in the Ctrl-Break thread dump.
The jstack -l command provides equivalent functionality. By default, a HotSpot Ctrl-Break thread dump does not list which threads hold java.util.concurrent locks, and for these locks HotSpot has no information about the stack frame in which a lock was acquired. If you add the JVM option -XX:+PrintConcurrentLocks, a Ctrl-Break thread dump lists, after each thread's stack trace, the concurrent locks held by that thread. For example:

    "D-Java-5-Lock" prio=6 tid=0x00000000069a1800 nid=0x196c runnable [0x000000000770f000]
       java.lang.Thread.State: RUNNABLE
            at com.Tester.longDelay(
            at com.Tester$
       Locked ownable synchronizers:
            - <0x00000007d6030898> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

Without this option, it isn't possible to figure out in a post-mortem which thread is holding this lock.

22. -XX:+PrintCommandLineFlags: Prints the flags that appeared on the command line. This flag should always be set on JVM startup: it tells the JVM to print the names and values of exactly those -XX flags that have been set by the user or by JVM ergonomics at startup.

23. -XX:+UseParNewGC: Uses a parallel version of the young generation copying collector alongside the default collector. This minimizes pauses by using all available CPUs in parallel. The collector is compatible with both the default old generation collector and the Concurrent Mark Sweep (CMS) collector.

24. -XX:ParallelGCThreads=n (e.g. 2, 4, 8 or more): Sets the number of garbage collection threads used by the young and old generation parallel collectors. The default value varies with the platform on which the JVM is running. Set this value based on the number of processors, since GC threads are tied to the processors: if a server has 4 processors, -XX:ParallelGCThreads can be set to 4. If several JVMs share a machine, divide the processors among them: with 4 cores and 2 JVMs, configure 2 parallel GC threads in each.
In a thread dump, the GC threads then appear like this:

    "GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000048ce6800 nid=0x628f runnable
    "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000048ce8000 nid=0x6290 runnable
    "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000048cea000 nid=0x6291 runnable
    "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000048cec000 nid=0x6292 runnable

25. -Xnoclassgc: Switches off garbage collection of the storage associated with Java™ technology classes that are no longer being used by the JVM. The default behavior is as defined by -Xclassgc. Enabling this option is not recommended.

26. -XX:+DisableExplicitGC: By default, calls to System.gc() are enabled (-XX:-DisableExplicitGC). Use -XX:+DisableExplicitGC to disable them; the JVM still performs garbage collection when necessary. Explicit garbage collection is a really bad idea, something on the order of locking yourself in a phone booth with a rabid pit bull. Although the exact semantics of the call are implementation-dependent, if your JVM runs a generational garbage collector (as most do), System.gc() forces the VM to do a "full sweep" of the heap, even if one isn't necessary. Full sweeps are typically several orders of magnitude more expensive than a regular GC operation, which is just plain bad math. But don't take my word for it: Sun's engineers provided a JVM flag for exactly this human-error problem. The -XX:+DisableExplicitGC flag turns a System.gc() call into a no-op, giving you the opportunity to run your code and see for yourself whether System.gc() has helped or harmed the overall execution of the JVM. On IBM JVMs, the same behavior can be achieved with the -Xdisableexplicitgc option.

27. -XX:+UseTLAB: Use thread-local object allocation (introduced in 1.4.0; known as UseTLE before that). UseTLAB is on by default on Sun/Oracle JVMs.
TLAB stands for thread-local allocation buffer. It is used to allocate heap space quickly without synchronization: compiled code has a "fast path" of a few instructions that tries to bump a high-water mark in the current thread's TLAB, successfully allocating the object if the bumped mark is still below a TLAB-specific limit. So when you execute new Object() in Java, the HotSpot VM allocates the memory with a lock-free algorithm. This improves concurrency by reducing contention on the shared heap lock.

28. -XX:+UseBiasedLocking: Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread that first acquires its monitor via a monitorenter bytecode or a synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain locking patterns may see slowdowns, though attempts have been made to minimize the negative impact. Essentially, if your objects are locked by only one thread, the VM can "bias" each such object to that thread so that its subsequent lock operations on the object avoid the usual synchronization cost. I suppose this is mainly geared towards overly conservative code that locks objects without ever exposing them to another thread. The actual synchronization overhead only kicks in once another thread tries to obtain a lock on the object.

29. -XX:+UseStringCache or 30. -XX:+StringCache: Enables caching of commonly allocated strings.

31. -XX:+OptimizeStringConcat: Optimizes String concatenation operations where possible (introduced in Java 6 Update 20).

32. -XX:+UseCompressedStrings: Uses a byte[] for Strings that can be represented as pure ASCII (introduced in the Java 6 Update 21 Performance Release).

33. -XX:SurvivorRatio=128: Makes eden 128 times the size of each survivor space. Such a high survivor ratio goes along with a zero tenuring threshold (-XX:MaxTenuringThreshold=0), where surviving objects are promoted straight to the old generation, to ensure that little space is reserved for survivors that will never be kept there.

34. -XX:NewSize: Defines the minimum young generation size. BEA recommends testing your production applications starting with a young generation size of 1/3 of the total heap size. A larger young generation causes fewer minor collections but may compromise response-time goals by causing longer-running full collections.

============================================================

Young generation garbage collection algorithms:

The (original) copying collector (enabled by default): When this collector kicks in, all application threads are stopped, and the copying collection proceeds using one thread (which means only one CPU, even on a multi-CPU machine). This is known as a stop-the-world collection, because the JVM pauses everything else until the collection is completed.

The parallel copying collector (enabled using -XX:+UseParNewGC): Like the original copying collector, this is a stop-the-world collector. However, it parallelizes the copying collection over multiple threads, which is more efficient than the original single-threaded copying collector on multi-CPU machines (though not on single-CPU machines). Compared with the original single-threaded copying collector, this algorithm can potentially speed up young generation collection by a factor of up to the number of available CPUs.

The parallel scavenge collector (enabled using -XX:+UseParallelGC): This is like the previous parallel copying collector, but the algorithm is tuned for very large heaps (gigabytes, over 10GB) on multi-CPU machines. It is designed to maximize throughput while minimizing pauses, and it has an optional adaptive tuning policy that automatically resizes the heap spaces. If you use this collector, you can only use the original mark-sweep collector in the old generation (or, with -XX:+UseParallelOldGC, the parallel old collector); the newer concurrent old generation collector (CMS) cannot work with this young generation collector. In short, -XX:+UseParallelGC means parallel garbage collection for scavenges (young generation collections).

============================================================
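
To check which of these young (and old) generation algorithms a running JVM actually uses, the collector names exposed by java.lang.management.GarbageCollectorMXBean are a handy cross-check. The sketch below maps the HotSpot bean names I am aware of (Copy, ParNew, PS Scavenge, MarkSweepCompact, PS MarkSweep, ConcurrentMarkSweep) to the collectors described in this post. The class WhichCollectors is hypothetical, and the mapping is an assumption for HotSpot JVMs of the JDK 6 to 8 era; anything else (for example G1 on newer JVMs) falls into the default case.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints which garbage collectors this JVM is running.
// Try e.g. java -XX:+UseConcMarkSweepGC WhichCollectors on a JVM that still has CMS.
final class WhichCollectors {
    /** Maps a HotSpot collector MXBean name to the algorithm described in the post (assumed names). */
    static String describe(String beanName) {
        switch (beanName) {
            case "Copy":
                return "young: original single-threaded copying collector";
            case "ParNew":
                return "young: parallel copying collector (-XX:+UseParNewGC)";
            case "PS Scavenge":
                return "young: parallel scavenge collector (-XX:+UseParallelGC)";
            case "MarkSweepCompact":
                return "old: serial mark-sweep-compact collector";
            case "PS MarkSweep":
                return "old: collector paired with parallel scavenge (serial, or parallel with -XX:+UseParallelOldGC)";
            case "ConcurrentMarkSweep":
                return "old: concurrent mark-sweep collector (-XX:+UseConcMarkSweepGC)";
            default:
                return "not covered in this post (e.g. G1 on newer JVMs)";
        }
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Collection count and accumulated time are reported per collector since JVM start.
            System.out.println(gc.getName() + " -> " + describe(gc.getName())
                    + " [collections=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms]");
        }
    }
}
```

Started with -XX:+UseConcMarkSweepGC on a JVM that still ships CMS (it was removed in JDK 14), the output should list ParNew and ConcurrentMarkSweep, which matches the statement in item 8 that CMS also turns on the parallel young generation collector.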
