
Saturday, 26 November 2016

Memory alignment: Oops and viewing memory structure with Jol

Why are 32-bit machines limited to 4GB memory? 

On a 32-bit machine there are only 2^32 (~= 4 billion) distinct memory addresses.

Each memory address refers to 8 bits (= 1 byte).

That means that in total we can reference ~ 4 billion bytes which is approximately equal to 4 GB.
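The arithmetic checks out (class name is my own):

```java
public class AddressSpace {
    public static void main(String[] args) {
        long addresses = 1L << 32; // one address per byte
        System.out.println(addresses);                         // 4294967296
        System.out.println(addresses / (1024L * 1024 * 1024)); // 4 (GiB)
    }
}
```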

In practice, all of this memory won't be available to the JVM. Some of it will be used by the OS and any other processes running on the machine. Even less will be available to the heap because the JVM needs to store lots of other things including thread stacks, GC info, native memory, compiled code etc.

So 64-bit machines are the solution? 

64-bit machines can reference 2^64 memory addresses, which is more than 16 million terabytes of data. That should be plenty for a Java heap, right?

The problem with this addressing is that you end up with huge pointers to objects that you have to store in the heap! The addresses suddenly take twice the amount of memory, which isn't terribly efficient.

What if you don't actually want a multi-terabyte heap? Is there some kind of middle ground?

The middle ground: Compressed Oops

Imagine we had 35-bit addressing. That would mean we could address up to 2^35 bytes (~= 32GB) of heap.

Wouldn't that be great?! Obviously it isn't going to be easy because there aren't any 35-bit machines with 35-bit registers in which to hold memory addresses.

However, the JVM can do something very clever: it assumes it has a 35-bit address whose last 3 bits are all zero. When it dereferences the address it appends the three zeros to look up the full address, but while it is just holding the address it stores only 32 bits.

However this means that the JVM can only access every 8th memory address:

000 (0), 1_000 (8), 10_000 (16), 11_000 (24), 100_000 (32), 101_000 (40) etc

This means that the JVM needs to allocate memory in 8-byte chunks (because each memory address holds 1 byte and we are dealing with the memory addresses in chunks of 8).

However this is how the JVM works anyway, so it's all fine and nothing is lost. Convenient, eh?
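The shift trick can be sketched in a few lines. This is a toy model (the real JVM may also add a heap base offset), but it shows how a 32-bit value can round-trip an address beyond 4GB:

```java
public class CompressedOops {
    // Encode: the address is 8-byte aligned, so its last 3 bits are 0;
    // drop them and the value fits in 32 bits (for heaps up to 32GB).
    static int encode(long address) {
        return (int) (address >>> 3);
    }

    // Decode: shift back, restoring the three zero bits.
    static long decode(int compressed) {
        return (compressed & 0xFFFFFFFFL) << 3;
    }

    public static void main(String[] args) {
        long small = 32L;                    // 100_000 in binary
        long big = 30L * 1024 * 1024 * 1024; // an address ~30GB into the heap
        System.out.println(decode(encode(small)));      // 32
        System.out.println(decode(encode(big)) == big); // true
    }
}
```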

There is a little bit of fragmentation: if you allocate an object that is 7 bytes then you have 1 empty byte, but the impact isn't too big. This is probably why we don't try to use 2^36 addresses: that would mean 16-byte alignment, which would likely mean much worse fragmentation and larger wasted memory gaps.

When does the JVM use Compressed Oops? 

From Java 7 onwards, the flag -XX:+UseCompressedOops is switched on by default whenever the maximum heap size is below 32GB.
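To check what your own JVM decided, HotSpot exposes the live flag value via its diagnostic MXBean (a HotSpot-specific API; class name below is my own):

```java
import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

public class CheckOops {
    public static void main(String[] args) {
        // Ask the running (HotSpot) JVM for the current value of the flag
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println(bean.getVMOption("UseCompressedOops").getValue());
    }
}
```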

Jol

To see how objects are aligned in memory, Jol is a cute little tool.

Example 1:

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.vm.VM;

import static java.lang.System.out;

public class Jol {
    public static void main(String[] args) throws Exception {
        out.println(VM.current().details());
        out.println(ClassLayout.parseClass(A.class).toPrintable());
    }

    public static class A {
        boolean f;
    }
}

Gives the following output:

# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

com.ojha.Jol$A object internals:
 OFFSET  SIZE    TYPE DESCRIPTION                    VALUE
      0    12         (object header)                N/A
     12     1 boolean A.f                            N/A
     13     3         (loss due to the next object alignment)
Instance size: 16 bytes
Space losses: 0 bytes internal + 3 bytes external = 3 bytes total

We can see that the object header took 12 bytes, the boolean took 1 byte, and 3 bytes were wasted padding the object out to the next 8-byte boundary (12 + 1 + 3 = 16).

Example 2 (Same as above but add in a little integer)

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.vm.VM;

import static java.lang.System.out;

public class Jol {
    public static void main(String[] args) throws Exception {
        out.println(VM.current().details());
        out.println(ClassLayout.parseClass(A.class).toPrintable());
    }

    public static class A {
        boolean f;
        int j;
    }
}

Output:

# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

com.ojha.Jol$A object internals:
 OFFSET  SIZE    TYPE DESCRIPTION                    VALUE
      0    12         (object header)                N/A
     12     4     int A.j                            N/A
     16     1 boolean A.f                            N/A
     17     7         (loss due to the next object alignment)
Instance size: 24 bytes
Space losses: 0 bytes internal + 7 bytes external = 7 bytes total





Sunday, 26 June 2016

JIT Fun Part 4: Intrinsics and Inlining

Definition: Intrinsic


Intrinsic methods in Java are ones for which the JDK has a hand-optimised native implementation. They have a Java version, but most of the time the call will be inlined and the intrinsic implementation will be used instead.

Note that which methods are made intrinsic may depend on the platform.

Examples of Intrinsic methods include Math functions (sin, cos, min, max). The full list is here.

Example


Code:
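The embedded snippet hasn't survived this export; a minimal stand-in (class name and loop count are my own) that calls Math.max hot enough for the JIT to compile the caller:

```java
public class MaxIntrinsic {
    public static void main(String[] args) {
        long sum = 0;
        // Call Math.max hot enough that the JIT compiles this method;
        // on most platforms the call is then replaced by the intrinsic.
        for (int i = 0; i < 1_000_000; i++) {
            sum += Math.max(i, i % 7);
        }
        System.out.println(sum); // 499999500000
    }
}
```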

We run this method with the following jvm flags:

-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining

In the output we see the following:

 @ 97   java.lang.Math::max (11 bytes)   (intrinsic)

Woohoo! It got inlined with the intrinsic code. 


JIT Fun Part 3: -XX:+PrintCompilation

Ok enough guessing about these graphs. Let's look at what's really happening.

But first...

Compiler levels


Level  Compiler
  0    Interpreter
  1    C1, destined to stay in C1 forever
  2    C1, but only pays attention to loop/method counters
  3    C1, but gathers details for C2, e.g. counters and the percentage of time a conditional evaluates to true
  4    C2

Paths through these levels:

0 -> 3 -> 4

  • This is the most common case. Method initially sent to C1 level 3 after a fair few calls, then if it is called a lot or contains a loop with lots of iterations it gets promoted to the super fast level 4 (= C2).

0 -> 3 -> 1

  • This happens if the method is really tiny. It gets sent to level 3 where it is analysed and we realise that it will never go to C2 so we stick it into C1 level 1 forever.


0 -> 2 -> 3 -> 4

  • If C2 is busy at the time of promotion from level 0 then we know we won't get promoted for a while so we go hang out in level 2 rather than hopping off to 3 directly. Then we may get promoted to 4 if we deserve it. 

Example


I've got the following code:
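The embedded snippet is missing from this export; a sketch consistent with the log lines that follow — a Rawr class whose main times a hot inner loop and prints each elapsed time (loop bounds are guesses):

```java
public class Rawr {
    public static void main(String[] args) {
        for (int i = 0; i < 2000; i++) {
            long start = System.nanoTime();
            long sum = 0;
            // Hot inner loop: eligible for on-stack replacement (the '%' in the log)
            for (int j = 0; j < 1000; j++) {
                sum += j;
            }
            long end = System.nanoTime();
            System.out.println(i + " " + (end - start) + " " + sum);
        }
    }
}
```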

And the times (endTime - startTime) look like this:




I ran the code with -XX:+PrintCompilation, and I'm using Java 8, so tiered compilation is enabled by default.

Let's see what is happening to the rawr method:

90
    164  171 %     3       com.ojha.Rawr::main @ 50 (131 bytes)

What does this mean?

  • i (the iteration of my for loop, plotted on the x axis) = 90
  • 164: timestamp in ms since VM start
  • 171: this method was 171st in the queue to be compiled
  • %: indicates on-stack replacement (the method contains a loop)
  • 3: this is the most important bit! It means we are moving into C1, i.e. doing the first compilation.
  • 50: the bytecode index of the loop.

More output

159
    167  176       3       com.ojha.Rawr::main (131 bytes)

231
    169  172       3       java.io.PrintStream::ensureOpen (18 bytes)
    169  182 %     4       com.ojha.Rawr::main @ 50 (131 bytes)

At i = 231, the loop has been called enough times for the Rawr main method to be compiled at C2 (that's what the 4 means). Note that at the same time java.io.PrintStream.ensureOpen has been called enough times to be compiled at C1 level 3.

1818
    3       com.ojha.Rawr::main @ -2 (131 bytes)   made not entrant

This is basically saying that the level 3 version of the rawr method shouldn't be used any more (correct because we should now use the level 4 version). We can see this dip in the graph at x = 1818.

1999
    193  234   !   3       java.io.PrintStream::println (24 bytes)
    193  182 %     4       com.ojha.Rawr::main @ -2 (131 bytes)   made not entrant

At the very end, the Rawr main method is finishing, so the level 4 version of the compiled code is made not entrant (i.e. nothing should 'enter', or call, that compiled code).

Note that the ! indicates that the method contains a try/catch block.

JIT Fun Part 2: Throwing the JIT a curve ball

This post follows on from the previous post about visualising JIT.

Let's start with a simple, silly main method:
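The embedded method is missing here; a sketch matching the description that follows (names and iteration counts are my own) — curveBall stays null until iteration 600, then changes, invalidating the compiled code's assumption:

```java
public class CurveBall {
    public static void main(String[] args) {
        String curveBall = null;
        for (int i = 0; i < 1000; i++) {
            long start = System.nanoTime();
            int result = doWork(curveBall);
            long end = System.nanoTime();
            System.out.println(i + " " + (end - start) + " " + result);
            if (i == 600) {
                curveBall = "surprise"; // breaks the JIT's "always null" profile
            }
        }
    }

    private static int doWork(String s) {
        int result = 0;
        for (int j = 0; j < 1000; j++) {
            // The JIT profiles this branch: it is always false until i > 600,
            // so the compiled code may assume it, then has to deoptimise.
            if (s != null) {
                result += s.length();
            } else {
                result += j;
            }
        }
        return result;
    }
}
```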

Plotting the elapsed time in ns:




We can see that:

  • Up to about the 70th run it is running in the interpreter.
  • Then it drops into C1 until around run 200.
  • At 200 it optimises again into C2.
  • At 600 it hits our curve ball and the method is deoptimised.
  • The method eventually drops back into C2 at around 800.

Curve ball:

The compiler assumed that our String curveBall was always going to be null and baked that assumption into the compiled code. When we set it to something other than null, the compiler realised it had made a wrong assumption and had to deoptimise the compiled method.

JIT Fun Part 1: Quick JIT visualisation of tiered compilation

See the following code:
It's not doing anything exciting, just running the same inner loop 1000 times.

I have taken those times and put them in a file called 'output.txt'.

See the following python function which reads in the numbers from output.txt and plots them on a graph:

Here is the graph that it produces:



At first the code is running in the interpreter, then the C1 JIT, then it settles into the C2 JIT. A couple of spikes probably indicate GC.




Scala with Cats: Answers to revision questions

I'm studying the 'Scala with Cats' book. I want the information to stick so I am applying a technique from 'Ultralearning...