Sunday 29 September 2013

The cost of a null check

UPDATE: Thanks to Jin, Vladimir, and Gil for responding on the list. I've made updates below.

One thing the JVM's JIT can do is optimise away the cost of a null check in certain cases. This is a nice optimisation; a common pattern is if (a != null) { doSomethingTo(a); }. Here, the JIT might decide not to compile in the null check at all and go straight to executing the method. If a is in fact null, the hardware raises a signal that is trapped; the handler can then determine the signal is safe to handle, and execution proceeds.

The intent here is to find the relative cost of a null check, and the likely cost of this extra signal-handling work. The original discussion was about flags for enabling code paths; using a static final flag likely avoids the null checks here entirely. So the check was written to avoid the 10K compile threshold, and it simulates using null/present for control flow, comparing that with true/false.
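As an aside, the static final approach mentioned above can be sketched as follows (a minimal illustration, not part of the benchmark; the class and constant names are mine). Because the flag is a compile-time constant, the JIT can fold the branch and eliminate the disabled path entirely, so no check remains at runtime.

```java
public class FeatureFlags {
    // static final: the JIT treats this as a constant and can
    // compile out the disabled branch entirely
    private static final boolean FAST_PATH_ENABLED = true;

    public static int compute(int input) {
        if (FAST_PATH_ENABLED) {
            return input + 2;   // branch kept by the JIT
        } else {
            return input + 3;   // dead code, compiled away
        }
    }
}
```

With a plain (non-final) field, the branch has to be re-evaluated each time and the JIT can only rely on profiling, which is what the null/present and true/false tests below exercise.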

The checks:

package com.bluedevel.nulls;


import org.openjdk.jmh.annotations.GenerateMicroBenchmark;
import org.openjdk.jmh.annotations.OperationsPerInvocation;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Setup;

/**
 * {@code java -ea -jar target/microbenchmarks.jar ".*NeverNullAlwaysTrue.*"}
 */
@State(Scope.Thread)
public class NeverNullAlwaysTrue {

    private Operations ops = new Operations();
    public int output = 0;

    // hopefully this means output is not eliminated
    public int getOutput() {
        return output;
    }

    @GenerateMicroBenchmark
    public int testPresentPresent() {
        
        output = doTestPP(output, ops);
        output = doTestPP(output, ops);
        
        return output;
    }

    private int doTestPP(int input, Operations o) {
        output = input;
        if (o != null) {
            output = o.op1(output);
        } else {
            output = ops.op2(output);
        }
        return output;
    }

    @GenerateMicroBenchmark
    public int testTrueTrue() {
        
        output = doTestTT(output, true);
        output = doTestTT(output, true);
        
        return output;
    }

    private int doTestTT(int input, boolean b) {
        output = input;
        if (b) {
            output = ops.op1(output);
        } else {
            output = ops.op2(output);
        }
        return output;
    }

    @GenerateMicroBenchmark
    public int testZPresentPresent() {
        
        output = doTestPP(output, ops);
        output = doTestPP(output, ops);
        
        return output;
    }

    @GenerateMicroBenchmark
    public int testZTrueTrue() {
        
        output = doTestTT(output, true);
        output = doTestTT(output, true);
        
        return output;
    }

}
And with Operations as:

package com.bluedevel.nulls;

public class Operations {
    public int op1(int i) { return i + 2; }
    public int op2(int i) { return i + 3; }
}
The results below are interesting. If you are using the client VM, you need to pay attention to your test approach. If you are using the server VM, you can safely ignore the difference, as the code runs at approximately the same speed either way.

server:
Benchmark                                         Mode Thr    Cnt  Sec         Mean   Mean error    Units
c.b.n.NeverNullAlwaysTrue.testPresentPresent     thrpt   1     20    5   492452.297     1015.518 ops/msec
c.b.n.NeverNullAlwaysTrue.testTrueTrue           thrpt   1     20    5   482684.551     2934.298 ops/msec
c.b.n.NeverNullAlwaysTrue.testZPresentPresent    thrpt   1     20    5   485969.596     8488.587 ops/msec
c.b.n.NeverNullAlwaysTrue.testZTrueTrue          thrpt   1     20    5   484919.891     2415.732 ops/msec

client:
Benchmark                                         Mode Thr    Cnt  Sec         Mean   Mean error    Units
c.b.n.NeverNullAlwaysTrue.testPresentPresent     thrpt   1     20    5    67216.614     2456.394 ops/msec
c.b.n.NeverNullAlwaysTrue.testTrueTrue           thrpt   1     20    5    78801.555     2455.900 ops/msec
c.b.n.NeverNullAlwaysTrue.testZPresentPresent    thrpt   1     20    5    71080.747      790.894 ops/msec
c.b.n.NeverNullAlwaysTrue.testZTrueTrue          thrpt   1     20    5    83989.886      458.524 ops/msec

Saturday 31 August 2013

Caching Classes from the ClassLoader?

Code for the article is here. I also submitted an enhancement patch to Tomcat bugtracker.

I noticed previously that the Tomcat WebappClassLoader is heavily serialised. In fact, the entire loadClass entry point is marked synchronized, so for poorly designed libraries that call into it frequently, the impact on scalability is remarkable. Of course, the ideal is not to hit the ClassLoader hundreds of times per second, but sometimes that's out of your control.

I decided to play some more with JMH and run some trials to compare the impacts of various strategies to break the serialization.

I trialled four implementations:
1) GuavaCaching - a decorator on WebappCL which uses a Guava cache
2) ChmCaching - a decorator on WebappCL which uses a ConcurrentHashMap (no active eviction)
3) ChmWebappCL - a modified WebappCL using a ConcurrentHashMap, so that loadClass only synchronizes when it reaches up to the parent loader; classes loaded through the current loader are found in a local map
4) Out-of-the-box Tomcat 8.0.0-RC1 WebappClassLoader - fully synchronized around the loadClass method
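The decorator idea behind ChmCaching can be sketched like this (a hypothetical, simplified version, not the actual Tomcat patch; in the real test the delegate is the WebappClassLoader). The key point is that only a cache miss ever enters the synchronized delegate; hits are answered lock-free by the ConcurrentHashMap.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Caches loadClass results so the (synchronized) delegate loader
 * is only hit once per class name.
 */
public class CachingClassLoader extends ClassLoader {
    private final ConcurrentMap<String, Class<?>> cache = new ConcurrentHashMap<>();

    public CachingClassLoader(ClassLoader delegate) {
        super(delegate); // delegate becomes the parent loader
    }

    @Override
    public Class<?> loadClass(String name) throws ClassNotFoundException {
        Class<?> c = cache.get(name);      // fast path: no lock
        if (c == null) {
            c = super.loadClass(name);     // slow path: may block in the delegate
            cache.putIfAbsent(name, c);
        }
        return c;
    }
}
```

Note this caches whatever the delegate returns, including classes from the parent and system loaders, which is the behaviour questioned in the observations below.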

The results, in operations per microsecond, where an operation is a lookup of java.util.ArrayList and java.util.ArrayDeque. Each column is a thread count from 1 to 16 (the GuavaCaching row has results for only the first 11 thread counts):

Threads         1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16
GuavaCaching 3687    978   1150   1129   1385   1497   1607   1679   1777   1733   1834
ChmCaching  27241  51062  81678 107376 134798 162125 188192 213007 208034 210812 200231 211744 214431 215283 209782 212297
ChmWebappCL   185     48     81     81     83     81     84     85     85     85     80     84     82     83     83     84
WebappCL      181     69     91     92     91     92    100     98     95     94     94     95    102    102     95     98

And the explanation -
  • GuavaCaching seems remarkably slow compared to CHM; might be worth investigating further. I also noticed significant issues with the Guava implementation; some tests ran for an extremely long time, and there appears to be an issue in the dequeue (on a quick look, it appears stalled in a while loop).
  • ChmCaching is very effective, although it also caches classes loaded from the parent and system loaders. This seems OK per the API but is unusual; I will have to check the API in more detail. It scales linearly with cores (this is an 8-core machine).
  • ChmWebappCL had less of an effect. This is likely because I am testing against core java.util.* classes rather than classes from a JAR added to the class loader. I expect ChmWebappCL could approach ChmCaching's speed if I attached JARs directly to the class loader rather than passing through to the system loader (going to the system loader means entering the synchronized block).
  • WebappCL - very slow performance.

And pretty pictures. You can see that CHM caching is far and away the best of this bunch.



Same picture, at log scale -


Tuesday 16 July 2013

Gatling JMS API now ready for wider testing

Please take a look here - https://github.com/jasonk000/gatling-jms

In brief, the Gatling JMS extension allows testing of JMS provider APIs using the nice Gatling DSL that you are familiar with.




You can get a pretty diagram, quite quickly, by testing a JMS service with the simulation script below.

package com.bluedevel.gatling.jms

import net.timewalker.ffmq3.FFMQConstants
import io.gatling.core.Predef._
import com.bluedevel.gatling.jms.Predef._
import scala.concurrent.duration._
import bootstrap._
import javax.jms.{ Message, TextMessage }

class TestJmsDsl extends Simulation {

  val jmsConfig = JmsProtocolBuilder.default
    .connectionFactoryName(FFMQConstants.JNDI_CONNECTION_FACTORY_NAME)
    .url("tcp://localhost:10002")
    .credentials("user", "secret")
    .contextFactory(FFMQConstants.JNDI_CONTEXT_FACTORY)
    .listenerCount(1)
    .deliveryMode(javax.jms.DeliveryMode.PERSISTENT)

  val scn = scenario("JMS DSL test").repeat(1) {
    exec(jms("req reply testing").reqreply
      .queue("jmstestq")
// -- four message types are supported; only StreamMessage is not currently supported
      .textMessage("hello from gatling jms dsl")
//      .bytesMessage(new Array[Byte](1))
//      .mapMessage(new ListMap[String, Object])
//      .objectMessage("hello!")
      .addProperty("test_header", "test_value")
      .addCheck(checkBodyTextCorrect)
    )
  }

  setUp(scn.inject(
       rampRate(10 usersPerSec) to (1000 usersPerSec) during (2 minutes)
    )).protocols(jmsConfig)

  /**
   * Checks if a body text is correct.
   * <p>
   * Note the contract on the checks is Message => Boolean, so you can perform
   * any processing you like on the message (check headers, check type, check body,
   * complex checks, etc).
   */
  def checkBodyTextCorrect(m: Message) = {
    // this assumes that the service just does an "uppercase" transform on the text
    val BODY_SHOULD_BE = "HELLO FROM GATLING JMS DSL"
    m match {
      case tm: TextMessage => (tm.getText.toString == BODY_SHOULD_BE)
      case _ => false
    }
  }

}

Monday 17 June 2013

Gatling JMS (first pass)

Simple add-in to Gatling to support JMS testing.

First cut is available on github - https://github.com/jasonk000/gatling-jms

Lots of work to do, but I thought I'd push it here so you can see a basic example of a gatling test.

Thursday 30 May 2013

Tomcat WebappClassLoader class loader performance degradation

Interesting performance degradation after upgrading to (relatively) recent versions of the Tomcat application stack - present since 7.0.x, and the change has been backported to 6.x and 5.5.x as well. Sharing / documenting it here in case it helps others.

This stems from a change introduced to Tomcat as part of bug 48903/44041/48694 to fix a deadlock some users were experiencing.
http://svn.apache.org/viewvc?view=revision&revision=927565

For us, the issue showed up through our use of the Saxon XSLT transformer library, which exhibited high lock contention time.

The particular problem is that the Saxon 9.1.x library, for each transform:
- calls Configuration.makeEmitter() to get an emitter
- which calls DynamicLoader.getInstance()
- which, inside the container, defers to the WebappClassLoader
- and loadClass on WebappClassLoader is now declared synchronized.

Other platforms eg Struts2 have worked around this by caching classes loaded through the class loader - see https://issues.apache.org/jira/browse/WW-3902 .

Haven't worked out the fix yet - the ideal solution would be to remove the need for XSLT in the processing path; unlikely, but one can hope. (XSLT is required here to support a particular legacy expectation about the formatting of the returned XML.)

Monday 8 April 2013

Oracle JDBC row prefetch driver tuning

defaultRowPrefetch - almost an easy win

If you work in a large enterprise context, sooner or later you will connect to an Oracle database.

I wanted to share a tip on tuning the Oracle row prefetch settings. The default prefetch value is 10, which means the JDBC client will do one round-trip to the database to fetch 10 rows, rather than round-trip to the DB for every row fetch. To reduce the round-trips over the network, you can increase the row prefetch, which is especially effective if you are extracting large row volumes.

This is a common trick provided to increase JDBC performance especially where there will be a large ResultSet returned. The row prefetch can be set in two ways -
  • On the Statement object, call the setFetchSize before executing a query. This works if you are using JDBC directly.
  • Alternatively, you can set a default fetch size when the setFetchSize method cannot be called. On the connection properties object, use the defaultRowPrefetch value and set it to an int.
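Both approaches can be sketched as follows (the table name and credentials are placeholders; defaultRowPrefetch is the Oracle driver's connection property, while setFetchSize is standard JDBC):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class PrefetchSettings {

    // Connection-wide default via the Oracle-specific defaultRowPrefetch
    // property; used when setFetchSize cannot be called directly.
    public static Properties prefetchProps(String user, String password, int prefetch) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("defaultRowPrefetch", Integer.toString(prefetch));
        return props;
    }

    // Per-query override using standard JDBC - preferred when you
    // know roughly how many rows the query will return.
    public static void queryWithFetchSize(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(100); // hint: fetch up to 100 rows per round-trip
            try (ResultSet rs = stmt.executeQuery("SELECT id FROM some_table")) {
                while (rs.next()) {
                    // process rs.getLong(1)
                }
            }
        }
    }
}
```

The Properties would then be passed to DriverManager.getConnection along with the Oracle JDBC URL.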

So an easy way to automatically improve your application performance is to increase the fetch size. Great.

Except for the caveat


The Oracle JDBC client seems to pre-initialise some memory structures to hold the full prefetch size. So if you set a prefetch size of 500, you allocate 50x as much memory as with the default prefetch of 10. This is a huge extra demand on GC, especially if you are not actually reading those rows. You might be running GC 50x more often than needed if you normally fetch only a few rows, which will have a big impact on your application's responsiveness.

Recommendation


If possible, I recommend using the setFetchSize on a per-query basis. For example, if you know a particular query will only ever return a few rows, then set the fetch size to say 5. If you know a query will return 1000 rows, use a fetch size of 100.

As a heuristic, there are limited benefits from going over 50-100.

Saturday 6 April 2013

Disruptor example - UDP "echo" service, with capitalisation

A fully functioning Disruptor 3.x example

The Disruptor, helpfully open sourced by LMAX, has been a real eye opener for me in terms of what can be achieved in performance and efficiency from a multi threaded application. By using a ring buffer and careful use of the single writer principle, they create some extremely low latency and high throughput applications.

One of the things missing from the Disruptor package is a set of useful working examples. Unfortunately the examples posted on Disruptor Wiki are targeted for the 2.x series. The API has had a few changes in the 3.x series, although things have not fundamentally changed.

This is an example of a service which accepts packet requests over UDP, and sends a UDP packet back to the source with the text capitalised using the Java String.toUpperCase method.

Credits to Peter Lawrey as I've copied the technique for computing metrics percentiles:
https://code.google.com/p/core-java-performance-examples/source/browse/trunk/src/test/java/com/google/code/java/core/socket/PingTest.java

Design

An annotated picture shows the event flow.

The only thing specific to the problem of uppercase-ing the input is the BusinessLogicHandler. You could replace this with anything.

This example is enough to get started. Of course, in a Production system, you need to do a lot more around exception handling, pinning threads, service request logging, etc.



The only item not shown here is the DatagramEvent - this is the value event to store entries in the ring buffer. In other words, each slot in the ring buffer is one DatagramEvent.

Results

Tested on a 2x Xeon E5410 (8 cores). No process-core pinning etc. Query client runs on same machine, runs N threads, each using Java NIO in blocking mode to send a request and wait for a response. 

I've compared the results with a standard blocking IO server. The blocking IO server runs with one "acceptor" thread and three "handler" threads. I tried various combinations of Old IO & Blocking NIO and this was the best I could get. Measurements include the client time to send and receive.

Disruptor was configured to use a SleepingWaitStrategy, which uses 4-6% per core but still shows excellent latency.

A couple of observations:

  • There is hardly any business logic here [uppercase a string], so the difference in performance comes down to thread hand-off effectiveness. The difference is very noticeable.
  • The 99pc and 50pc are very consistent for Disruptor, which implies low variation in response times (ie: consistently low latency).


Also, given this is an 8-core server, running more than a few client threads is going to result in some contention. Above 4 threads, you can see the effect of queueing in Disruptor (with 99pc increasing significantly).

Other observations / future work


  • java.net.DatagramSocket is synchronised, so if you are using "old" Java IO, it is best to create a socket per thread.
  • Given threads are able to work totally independently, the threaded approach should theoretically achieve better latency (since it handles concurrency more easily). It would be interesting to include some access to a shared resource, such as incrementing a shared counter, to compare a more realistic workload.
  • I'd also like to test this across multiple nodes over the network to push the system a little harder.
  • Interestingly, in this case, SleepingWaitStrategy consistently showed better latency than BusySpin; I suspect this is due to thread migrations by the Linux scheduler but have not investigated.
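The socket-per-thread suggestion above can be sketched with a ThreadLocal (illustrative only; the class and method names are mine). Each thread lazily opens its own socket, so sends never contend on a single socket's internal lock.

```java
import java.net.DatagramSocket;
import java.net.SocketException;

public class PerThreadSockets {
    // One DatagramSocket per thread; withInitial runs the first time
    // each thread calls get(), binding a socket to an ephemeral port.
    private static final ThreadLocal<DatagramSocket> SOCKET =
        ThreadLocal.withInitial(() -> {
            try {
                return new DatagramSocket();
            } catch (SocketException e) {
                throw new IllegalStateException("could not open socket", e);
            }
        });

    public static DatagramSocket current() {
        return SOCKET.get();
    }
}
```

Repeated calls from the same thread return the same socket, while each new thread gets its own.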

UDP Echo Service code

Github link

Quick note on Scala - this is not functional styled Scala, I'm using Scala here as a scriptable Java. Feel free to provide feedback though if you think it can make the demos a bit more clear.

// DatagramEvent & DatagramFactory

/**
 * Datagram value event
 * This is the value event we will use to pass data between threads
 */
class DatagramEvent(var buffer:Array[Byte], var address:SocketAddress, 
  var length:Int)

/**
 * Creates instances of DatagramEvent for the ring buffer
 */
object DatagramFactory extends EventFactory[DatagramEvent] {
  def newInstance = new DatagramEvent(new Array[Byte](BYTE_ARRAY_SIZE), null, 0)
}


// Main channel receive loop (includes initialisation code)

/**
 * Server thread - event loop
 *
 * We took a shortcut in the EventTranslator, since it does a blocking read, then
 * all the server thread needs to do is request a new publish. The disruptor will
 * request the translator to fill next event, and translator can write direct to
 * the buffer. This would not work if we had multiple publishers, since the
 * translator is effectively blocking the sequence from progressing while it 
 * is blocked in translateTo().
 *
 * The "in" disruptor is responsible for reading from UDP, doing the upper
 * case transform (the business logic), and then copying the packet to
 * the "out" disruptor.
 */
val RING_SIZE:Int = 1*1024
val BYTE_ARRAY_SIZE:Int = 1*1024

{
  // must have at least as many server threads available for each
  // disruptor as there are handlers
  val inExecutor = Executors.newFixedThreadPool(1)
  val outExecutor = Executors.newFixedThreadPool(1)
  // each disruptor is basically the same
  val disruptorIn = new Disruptor[DatagramEvent](
    DatagramFactory, RING_SIZE, inExecutor, ProducerType.SINGLE,
    new SleepingWaitStrategy)
  val disruptorOut = new Disruptor[DatagramEvent](
    DatagramFactory, RING_SIZE, outExecutor, ProducerType.SINGLE,
    new SleepingWaitStrategy)
  // disruptor handle the events with the provided handlers
  disruptorOut.handleEventsWith(new DatagramSendHandler)
  disruptorOut.start
  disruptorIn.handleEventsWith(new BusinessLogicHandler(disruptorOut))
  disruptorIn.start

  // socket receive resources
  val buffer = ByteBuffer.allocateDirect(BYTE_ARRAY_SIZE)
  val channel = DatagramChannel.open
  channel.socket.bind(new InetSocketAddress(9999))
  channel.configureBlocking(true)

  // core event loop
  while(true) {
    val sender = channel.receive(buffer)
    buffer.flip

    // use an anonymous inner class to implement EventTranslator
    // otherwise we need to do a double copy; once out of the DBB to
    // a byte[], then from a byte[] to a byte[]
    disruptorIn.publishEvent(new EventTranslator[DatagramEvent] {
      def translateTo(event:DatagramEvent, sequence:Long) {
        event.length = buffer.remaining
        buffer.get(event.buffer, 0, buffer.remaining)
        event.address = sender
      }
    })

    // reset the buffer so the next receive starts from position 0;
    // without this, the flip/get above leaves no room for the next packet
    buffer.clear
  }
}

// Business Logic Handler [this is what you replace with your own logic]

/**
 * Business logic goes here
 */
class BusinessLogicHandler(val output:Disruptor[DatagramEvent])
  extends EventHandler[DatagramEvent] {
  def onEvent(event:DatagramEvent, sequence:Long, endOfBatch:Boolean) {
    if (event.address != null) {
      // do the upper case work
      val toSendBytes = new String(
        event.buffer, 0, event.length).toUpperCase.getBytes
      // then publish
      output.publishEvent(new ByteToDatagramEventTranslator(
        toSendBytes, event.address))
    }
  }
}

// ByteToDatagramEventTranslator

/**
 * Pushes an output byte[] and address onto the given DatagramEvent
 */
class ByteToDatagramEventTranslator(val bytes:Array[Byte],
  val address:SocketAddress) extends EventTranslator[DatagramEvent] {
  def translateTo(event:DatagramEvent, sequence:Long) {
    event.address = this.address
    event.length = bytes.length
    System.arraycopy(bytes, 0, event.buffer, 0, bytes.length)
  }
}


// DatagramSendHandler

/**
 * Sends the datagrams to the endpoints
 *
 * Don't bother with endOfBatch, because it's assumed each udp
 * packet goes to a different address
 */
class DatagramSendHandler() extends EventHandler[DatagramEvent] {
  // private channel for the outbound packets
  val channel = DatagramChannel.open
  // this buffer is reused to send each event
  val buffer = ByteBuffer.allocateDirect(BYTE_ARRAY_SIZE)
  def onEvent(event:DatagramEvent, sequence:Long, endOfBatch:Boolean) {
    if (event.address != null) {
      // copy from the byte[] into the buffer
      buffer.put(event.buffer, 0, event.length)
      buffer.flip
      channel.send(buffer, event.address)
      // assume the buffer isn't needed again, so clear it
      buffer.clear
    }
  }
}




Welcome

This blog will chronicle some of my learnings while experimenting with making Java a faster, yet still productive, platform.