String literals using 2x the expected amount of permanent generation space


This is Sun JDK 1.6u21, x64.

I have a class for the purpose of experimenting with perm gen usage which contains only a single large string (512k characters):

public class Big0 {
     public String bigString =
         "A string with 2^19 characters, should be 1 MB in size";
}

I check the perm gen usage using getUsage().toString() on the MemoryPoolMXBean object for the permanent generation (called "PS Perm Gen" in u21, although it has slightly different names in different versions or with different garbage collectors).
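For anyone who wants to reproduce the measurement, here is a minimal sketch of the lookup described above. The class name PermGenProbe and its helper methods are made up for illustration. The fallback to a pool named "Metaspace" is an assumption for newer HotSpot JVMs, which no longer have a permanent generation; as far as I know, HotSpot moved interned strings to the regular heap in Java 7, so the numbers only mean the same thing on a Java 6 era JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class PermGenProbe {
    // Returns the first pool whose name mentions "Perm Gen" (e.g. "PS Perm Gen").
    // Assumption: on JVMs without a permanent generation we fall back to a pool
    // named "Metaspace"; returns null if neither exists.
    static MemoryPoolMXBean findClassMetadataPool() {
        MemoryPoolMXBean fallback = null;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Perm Gen")) {
                return pool;
            }
            if (name.equals("Metaspace")) {
                fallback = pool;
            }
        }
        return fallback;
    }

    // Bytes currently used in that pool, or -1 if no suitable pool was found.
    static long usedBytes() {
        MemoryPoolMXBean pool = findClassMetadataPool();
        if (pool == null) {
            return -1L;
        }
        MemoryUsage usage = pool.getUsage();
        return usage == null ? -1L : usage.getUsed();
    }

    public static void main(String[] args) {
        MemoryPoolMXBean pool = findClassMetadataPool();
        if (pool == null) {
            System.out.println("No perm gen or metaspace pool found");
            return;
        }
        // Same call as in the question: getUsage().toString()
        System.out.println(pool.getName() + ": " + pool.getUsage());
    }
}
```

Calling usedBytes() before and after the first reference to the class, and again after creating an instance, gives the deltas quoted in the question.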

When I first reference the class, say by reading Big0.class, perm gen jumps by ~500 KB - that's what I'd expect as the constant pool encoding of the string is UTF-8, and I'm using only ASCII characters.

When I actually create an instance of this class, however, perm gen jumps by ~2 MB. Since this is a 1 MB string in memory (2 bytes per UTF-16 character, certainly no surrogates), I'm confused about why the memory usage is double.
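To make the expected sizes explicit, here is the arithmetic as a small sketch. LiteralSizes and its methods are hypothetical names; the only facts used are that non-NUL ASCII characters take 1 byte each in the class file's modified UTF-8 and that a char takes 2 bytes. Object and array headers are ignored.

```java
import java.nio.charset.Charset;

public class LiteralSizes {
    // 2^19 characters, as in the question
    static final int CHARS = 1 << 19;

    // Non-NUL ASCII characters occupy one byte each in modified UTF-8.
    static long classFileBytes(int asciiChars) {
        return asciiChars;
    }

    // A char[] holds two bytes per char (array header not counted).
    static long charArrayBytes(int chars) {
        return 2L * chars;
    }

    public static void main(String[] args) {
        System.out.println("constant pool bytes: " + classFileBytes(CHARS));
        System.out.println("char[] bytes:        " + charArrayBytes(CHARS));

        // Cross-check the per-character widths on a short ASCII sample.
        String sample = "ASCII only";
        int utf8 = sample.getBytes(Charset.forName("UTF-8")).length;
        int utf16 = sample.getBytes(Charset.forName("UTF-16BE")).length;
        System.out.println(sample.length() + " chars: " + utf8
                + " bytes as UTF-8, " + utf16 + " bytes as UTF-16");
    }
}
```

So the ~500 KB on class load and a 1 MB string at runtime both match expectations; it is the second megabyte that needs explaining.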

The same effect occurs if I make the string static. If I used final, it fails to compile as I exceed the limit for constant pool items of 65535 bytes (not sure why leaving final off avoids that either - consider that a bonus question).

Any insight appreciated!

Edit: I should also point out that this occurs with non-static, final non-static, and static strings, but not for final static strings. Since that's already a best practice for string constants, maybe this is of mostly academic interest.


There are 4 answers

3
Ron On BEST ANSWER

I think it's an artefact of your test class. I created a similar class, then decompiled it with javap.

The [eclipse] java compiler breaks the String literal into chunks, each no longer than 64k. The bytecode for initializing the non-constant field consists of cobbling the source string together with a sequence of StringBuilder operations. Although it is this final gigantic string that is interned, the large atoms it is made of take up space in the constant pool.
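Roughly, the generated initializer behaves like the hand-written sketch below. This is an illustration, not actual compiler output: ChunkedLiteral, CHUNK_0 and CHUNK_1 are made-up names, the chunks are tiny stand-ins for the pieces of up to 64k, and the explicit intern() call only mirrors the statement above that the final string is interned.

```java
public class ChunkedLiteral {
    // Hypothetical stand-ins for the chunks the compiler emits;
    // each one is a separate string constant in the constant pool.
    static final String CHUNK_0 = "first piece, ";
    static final String CHUNK_1 = "second piece";

    // Equivalent of the field initializer: glue the chunks together at runtime.
    static String assemble() {
        return new StringBuilder()
                .append(CHUNK_0)
                .append(CHUNK_1)
                .toString()
                .intern(); // assumption: mirrors the interning described above
    }

    public static void main(String[] args) {
        String big = assemble();
        System.out.println(big + " (" + big.length() + " chars)");
    }
}
```

If that reading is right, the chunk literals together account for about 1 MB as resolved strings and the assembled result for another 1 MB, which would add up to the ~2 MB observed.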

1
jtahlborn On

A good memory profiler (I personally use and really like YourKit Java Profiler) should be able to show you where the memory is being used.

2
Dirk On

Java characters have a width of 2 bytes per character (regardless of whether it's ASCII or a code point above 255). I think that what you are seeing is the Java VM translating the internal class file storage (modified UTF-8) version of the string into its internal expanded form as soon as the class is initialized (which is done prior to instance creation).

1
Joachim Sauer On

While the class file format specifies modified UTF-8 as its storage format for String literals, the internal format of the runtime is UTF-16. A String stores its data in UTF-16 encoding in a char[] (usually; it is implementation-dependent, however). Most characters take up 2 bytes in this encoding (characters outside the BMP take up 4, as a surrogate pair).
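A quick way to see those widths, as a sketch (Utf16Width is a made-up name, and the byte count assumes the usual char[]-backed String with 2 bytes per char):

```java
public class Utf16Width {
    // Bytes of character data for a char[]-backed String: 2 per UTF-16 code unit.
    static int utf16Bytes(String s) {
        return 2 * s.length();
    }

    public static void main(String[] args) {
        String ascii = "A";                                          // U+0041
        String latin = "\u00e9";                                     // U+00E9, still in the BMP
        String outsideBmp = new String(Character.toChars(0x1F600)); // needs a surrogate pair

        System.out.println("ASCII:       " + utf16Bytes(ascii) + " bytes");
        System.out.println("Latin-1:     " + utf16Bytes(latin) + " bytes");
        System.out.println("outside BMP: " + utf16Bytes(outsideBmp) + " bytes, "
                + outsideBmp.codePointCount(0, outsideBmp.length()) + " code point");
    }
}
```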

I've seen references to a modified rt.jar that contains a java.lang.String implementation with a specialized code-path/storage for ASCII-only Strings, which cut down on the memory requirement significantly.

Edit: it seems this option has found its way into the normal Oracle JRE since Java 6 Update 21 according to this reference:

-XX:+UseCompressedStrings

Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

(Found through this answer).