Thursday, 30 March 2017

Is Java's String '+' operator in a loop slow?

I have always understood the following to be true about Java String concatenation:

Java expressions that use the String concatenation operator ('+') to build strings are compiled to a bytecodes that use java.lang.StringBuilder behind the scenes.  (It is easy to confirm this by using javap -c to view the bytecodes of a class that does String building this way.) For example,
    String s = a + b + c;
will typically be compiled to code that is virtually identical to this:
    StringBuilder sb = new StringBuilder(a);
    sb.append(b);
    sb.append(c);
    String s = sb.toString(); 
It is usually unnecessary to "hand optimize" code that use '+' into equivalent code that uses StringBuilder because the compiler will do it for you. Indeed, hand optimization is a bad idea, since it patently makes the code much harder to read.

The exception to this rule is that when you perform String concatenations in a loop.  For example:
    String s = "";
    for (int i = 0; i < 100; i++) {
        s = s + "X";
    }
In this case the Java compiler is not capable of (and arguably not permitted to) generate the optimal sequence:
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100; i++) {
       sb.append("X");
    }
    String s = sb.toString(); 
Instead you will get something like this:
    String s = "";
    for (int i = 0; i < 100; i++) {
        StringBuilder sb = new StringBuilder(s);
        sb.append("X");
        s = sb.toString();
    } 
You should be able to verify this.  Furthermore, if you examine in detail the source code of the StringBuilder methods used, it is clear that the two versions of the above sequence have different complexity.  The first one is O(N), and the second one is O(N^2).

However, there is a more convincing way to show this.  Compile and the following code:
public class Demo {
    public static void main(String args[]) {
        int dummy;
        int N = 10000;
        long M = 1000;
        long start = System.nanoTime();
        if (args.length > 0) {
            System.out.println("Using StringBuilder.append");
            dummy = concatWithStringBuilder(M, N);
        } else {
            System.out.println("Using String '+'");
            dummy = concatWithString(M, N);
        }
        long end = System.nanoTime();
        System.out.println("res = " + dummy + ", elapsed " + (end - start) +
                           " nanoseconds");
    }

    public static int concatWithString(long count, int length) {
        String s = "";
        for (long i = 0; i < count; i++) {
            s = "";
            for (long j = 0; j < length; j++) {
                s = s + "X";
            }
        }
        return s.hashCode();
    }

    public static int concatWithStringBuilder(long count, int length) {
        String s = "";
        for (long i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder(); // no hints
            for (long j = 0; j < length; j++) {
                sb.append("X");
            }
            s = sb.toString();
        }
        return s.hashCode();
    }
}
If you examine the code, you will see that it builds 100 Strings of 10,000 characters either using '+' or StringBuilder.append.  When I compile and run the above code on my laptop using Java 8, I get the following results:
$ java Demo
Using String '+'
res = 1151981568, elapsed 11469754082 nanoseconds
$ java Demo yes
Using StringBuilder.append
res = 1151981568, elapsed 112520918 nanoseconds
Now the above code is not a good example of good  benchmarking, and there are good reasons to doubt the timing results in absolute terms.  However, it is clear that there is a difference of two orders of magnitude between the versions.  Furthermore, if you change the value of the N constant, you will see clear evidence that the two versions have different computational complexity.