How does string interning work (C# vs Java)?

I know that both C# and Java implement the string interning optimization.

When I try these two snippets in Java:

public static void main(String[] args) {
    char p[]={'h','e','e','l','l','l','l'}; // Make string from char[] to ensure it's not already interned
    String s1 = new String(p);
    String i1 = s1.intern(); 
    System.out.println(s1== i1); // true
}

Code number 2:

public static void main(String[] args) {
    char p[]={'h','e','e','l','l','l','l'}; // Make string from char[] to ensure it's not already interned
    String s1 = new String(p);
    String i1 = s1.intern(); 
    System.out.println(s1== i1); // true
    String o = "heellll";
    System.out.println(o== i1); // true
    System.out.println(s1== o); // true
}

However, when I try the same two snippets in C#:

unsafe static void Main(string[] args)
{
    char[] p = { 'h', 'e', 'e', 'l', 'l', 'l', 'l' }; // Make string from char[] to ensure it's not already interned
    String s1 = new String(p);
    String i1 = string.Intern(s1);
    Console.WriteLine(object.ReferenceEquals(s1, i1)); // True
}

Code number 2:

unsafe static void Main(string[] args)
{
    char[] p = { 'h', 'e', 'e', 'l', 'l', 'l', 'l' }; // Make string from char[] to ensure it's not already interned
    String s1 = new String(p);
    String i1 = string.Intern(s1);
    Console.WriteLine(object.ReferenceEquals(s1, i1)); // False
    String o = "heellll";
    Console.WriteLine(object.ReferenceEquals(o, i1)); // True
    Console.WriteLine(object.ReferenceEquals(s1, o)); // False
}

I can think of two possible explanations, and I hope one of them accounts for why the output of the C# code (code number 2) differs from the output of the Java code (code number 2):

My first guess: in C#, the CLR processes string literals before it runs any line of your code, so the "heellll" in String o = "heellll"; is added to the intern pool before String s1 = new String(p); runs. Then, when the line String i1 = string.Intern(s1); executes, s1 is not added; string.Intern(s1) returns the reference to the literal "heellll" instead. In Java, on the other hand, the JVM does not process string literals before running any line of your code.

My second guess: the CLR gives priority to string literal references. It stores the s1 reference in the intern pool at the line String i1 = string.Intern(s1);, but when it later finds a string literal with the same value as s1 (at the line String o = "heellll";), it replaces the s1 reference in the pool with the literal's reference. If it instead finds another string object with the same value, it does nothing. In Java, on the other hand, the JVM gives string literal references no such priority.

So, which explanation is right? If both are wrong, why does the output of the C# code (code number 2) differ from the output of the Java code (code number 2)?

There are 2 answers

2
OwnageIsMagic On

UPD2: See @user85421 comments below this post.

UPD: I was assuming that the JVM checks the string pool in the String constructor, but it doesn't. I don't fully understand what is happening here; it looks like some odd JVM optimization. Normative documents suggest that string literals should be loaded into the pool on class load (see https://stackoverflow.com/a/3451183/5647513), but small examples show that they aren't.

public class MyClass {
    String constantInCtor = "heellll"; // works as expected if this is `static`

    public static void main(String[] args) {
        String string = new String(new char[]{'h','e','e','l','l','l','l'});

        String interned = string.intern(); // <-+
                                           //   | or if I swap these lines
        String constant = "heellll";       // <-+

        System.out.println(string == interned);   // true, expected false // "heellll" was not interned before string.intern()
        System.out.println(constant == interned); // true
        System.out.println(string == constant);   // true, expected false
    }
}

It seems that the JVM interns literals on first use.
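To make the "swap these lines" case concrete, here is a minimal sketch in which the literal is touched before intern() is called. The class name InternOrderDemo, the method compare() and the literal value are made up for illustration. Because the literal is already in the pool by the time intern() runs, the freshly built string cannot become the canonical instance, which matches what the C# snippet prints.

```java
// Hypothetical demo class; names and the literal value are illustrative only.
class InternOrderDemo {

    // Returns { fresh == interned, constant == interned, fresh == constant }.
    static boolean[] compare() {
        String constant = "heellll-literal-first";          // literal is used first, so it is pooled here at the latest
        String fresh = new String(constant.toCharArray());  // the constructor always creates a new object
        String interned = fresh.intern();                   // equal string already pooled: returns the literal's instance
        return new boolean[] { fresh == interned, constant == interned, fresh == constant };
    }

    public static void main(String[] args) {
        boolean[] r = compare();
        System.out.println(r[0]); // false
        System.out.println(r[1]); // true
        System.out.println(r[2]); // false
    }
}
```

With this ordering the result does not depend on when the JVM resolves literals, so it should be the same on any conforming JVM.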

// CLR
var newstring = new string(new []{'a','b'});
var interned = string.Intern(newstring);
var constant = "ab";

Console.WriteLine(ReferenceEquals(newstring, interned)); // False // "ab" is interned before string.Intern
Console.WriteLine(ReferenceEquals(constant, interned)); // True
Console.WriteLine(ReferenceEquals(newstring, constant)); // False

Interning of string literals is performed before execution, at class/method load (in both the JVM (see comments) and the CLR). The reason object.ReferenceEquals(s1, i1) returns false is that the String constructor in the CLR doesn't check the string pool when called and always creates a new object.

You probably want to ask: why does it do that? The exact answer is that the CLR team decided so. That's not very informative, so I will speculate a bit on this topic.

The String constructor is used to create strings at runtime, often from an untrusted source (user input), so checking whether the string is already interned provides very little benefit (there is only a small chance that an arbitrary string is in the pool) but slows down every string creation (a pool lookup is not free: it's at least O(n) in the string's length, plus thread synchronization).

The string interning provided by the runtime rarely has a good application besides reducing binary size (all duplicate string literals in the source code are collapsed into one). If you are certain that you are going to compare a lot of strings (from a trusted source), a manually crafted string cache will usually perform better (at least it will not contain all the unrelated strings from your program and all of its dependencies). Caching strings from an untrusted source leads to unbounded heap growth. You can deal with that if you manage the pool yourself, but the CLR's intern pool is append-only.
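As a rough illustration of what a "manually crafted string cache" could look like, here is a minimal sketch. StringCache and its methods are hypothetical names, not part of any library, and the class is deliberately not thread-safe; the point is only that you own the map and can clear or bound it, unlike the runtime's pool.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical hand-rolled cache (not a library API); single-threaded use is assumed.
final class StringCache {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the canonical instance for s, registering s itself if no equal string is cached yet.
    String canonicalize(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }

    // Unlike the runtime's pool, this cache can be emptied whenever the caller decides.
    void clear() {
        pool.clear();
    }
}

class StringCacheDemo {
    public static void main(String[] args) {
        StringCache cache = new StringCache();
        String a = new String("user-42");
        String b = new String("user-42"); // equal content, different object
        System.out.println(cache.canonicalize(a) == cache.canonicalize(b)); // true
        cache.clear();
        System.out.println(cache.size()); // 0
    }
}
```

Once strings are canonicalized through the same cache, they can be compared by reference, and the cache holds only the strings you put into it.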


Do not use unsafe: it really is unsafe (probably even more so than using C/C++ directly) and provides no benefit in regular code.

3
Charlieface On

I don't know the JVM well, but as far as .NET is concerned, your first assumption is mostly correct, except that no code is executed.

Reworded:

In .NET, the CLR loads the string literals into memory when the function is compiled (JIT), before running any line in that function. So although String o = "heellll"; is not executed before the rest of the code, the "heellll" string is already in the intern pool.

So when you execute the line String i1 = string.Intern(s1);, s1 is not added because an equal string is already in the pool; the call returns the reference to "heellll" instead.

One thing I can state categorically: string.Intern does not replace references; they change only if you replace them yourself.
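The same point can be sketched with the Java analog, String.intern(). InternReturnDemo and run() are made-up names for illustration. Calling intern() only returns a reference; the variable keeps pointing at the original object until you assign the result back yourself.

```java
// Hypothetical demo class; names and the literal value are illustrative only.
class InternReturnDemo {

    // Returns { s == literal before reassignment, s == literal after reassignment }.
    static boolean[] run() {
        String literal = "intern-return-demo";
        String s = new String(literal.toCharArray()); // distinct object with equal content

        s.intern();                                   // return value discarded: s is untouched
        boolean before = (s == literal);

        s = s.intern();                               // replace the reference yourself
        boolean after = (s == literal);

        return new boolean[] { before, after };
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println(r[0]); // false
        System.out.println(r[1]); // true
    }
}
```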