Berry Web - Testing FFI Hot Loop Overhead

I wanted to compare the FFIs of a few popular programming languages to evaluate (1) the ease of use of the FFI functionality and (2) their minimal overhead in a hot-loop scenario. I chose Java, C#, PHP, and Go for my experiment.

This is in no way representative of FFI usage in the general case -- just indicative of overhead of calling a minimal function.

The code for this experiment is here: https://gitlab.com/vancan1ty/ffihotloopexamples . The JNI code was based on this GitHub repo by RJ Fang: https://github.com/thefangbear/JNI-By-Examples .

All tests are run on my Dell Precision M4800 laptop. I evaluate three scenarios across several languages.

Scenario 1 is the FFI hot loop: we loop up to LOOPMAX (which I set to 2000000000L) and call the native addOne function through the FFI on every iteration. A bit of modulus arithmetic and conditional logic keeps the loop from being unrolled by the compiler.

        Utilities util = new Utilities();
        long st1 = System.currentTimeMillis();
        long counter = 0;
        while(counter < LOOPMAX) {
            counter = util.addOne(counter);
            if(counter % 1000000 == 0) {
                counter = counter + 10;
            }
        }
        long et1 = System.currentTimeMillis();

Scenario 2 is the same as scenario 1 -- but we perform all the operations directly in the programming language being tested -- no FFI is involved. We can expect scenario 2 to run faster than scenario 1 -- and the difference in the performance between the scenarios is a measure of the overhead of calling out via FFI to the native library.

        long st2 = System.currentTimeMillis();
        long counter2 = 0;
        while(counter2 < LOOPMAX) {
            counter2 = counter2 + 1;
            if(counter2 % 1000000 == 0) {
                counter2 = counter2 + 10;
            }
        }
        long et2 = System.currentTimeMillis();

Scenario 3 delegates the entire hot loop to the natively implemented code. You might expect this to be the fastest of all.

        long st3 = System.currentTimeMillis();
        long counter3 = util.loopToMax(0,LOOPMAX);
        long et3 = System.currentTimeMillis();
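
To make the shape of this loop concrete, here is a small sketch that replays the scenario 2 loop and also counts how many times the body runs. The class name LoopShape and the method countIterations are hypothetical names used only for illustration; they are not part of the benchmark repo. The counter lands exactly on every multiple of 1000000 and jumps ahead by 10, so it finishes 10 past the bound, and the body runs slightly fewer times than the bound.

```java
// Hypothetical helper, not part of the benchmark repo: replays the
// scenario 2 loop and reports the final counter and the iteration count.
final class LoopShape {

    // Returns {finalCounter, iterations} for the loop used in every scenario.
    static long[] countIterations(long start, long max) {
        long counter = start;
        long iterations = 0;
        while (counter < max) {
            counter = counter + 1;
            iterations++;
            if (counter % 1000000 == 0) {
                counter = counter + 10; // skip ahead at every multiple of 1,000,000
            }
        }
        return new long[] { counter, iterations };
    }

    public static void main(String[] args) {
        // A smaller bound than the article's 2000000000L, so this finishes instantly.
        long[] r = countIterations(0, 5_000_000L);
        System.out.println("final counter: " + r[0] + ", iterations: " + r[1]);
    }
}
```

With a bound of 5000000 this reports a final counter of 5000010 after 4999960 iterations. By the same pattern, the full 2000000000 run should end at 2000000010 after 1999980010 iterations, so dividing LOOPMAX by the elapsed time (as the "per sec" figures do) overstates throughput only negligibly.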

Java -- JNI

First comes the Java -- JNI implementation of the scenarios.

$ java -version
openjdk version "17.0.1" 2021-10-19 LTS
OpenJDK Runtime Environment (build 17.0.1+12-LTS)
OpenJDK 64-Bit Server VM (build 17.0.1+12-LTS, mixed mode, sharing)

Using RJ Fang's script to run the code that I added to the JNI-By-Examples project yields the performance results below:

sh jnihelper.sh --execute-java

Java Native Interface Helper
Created by RJ Fang
Options: ./jnihelper.sh
--refresh-header     Refreshes the header
--build              Tries CMake build
--execute-java       Tests library

t1: 22.415 to find 2000000010 (89225964.756 per sec)
t2: 3.29 to find 2000000010 (607902735.562 per sec)
t3: 5.889 to find 2000000010 (339616233.656 per sec)

So in this case, the pure Java implementation (t2) is able to outperform even the C++ implementation (t3) by a decent margin -- even though the C++ implementation is compiled with -O3 as part of the CMake process. I'm not sure why this is.

Note that the Java JNI test case was kind of a pain to set up, and I needed the help of the JNI Examples repo and associated scripts to get it working. The code for the Java cases is in java/in/derros/jni/Utilities.java .

The code for the C++ side of the java implementation is written in Utilities.cpp (building on the JNI Examples project structure)

/*
 * ==============IMPLEMENTATION=================
 * Class:     in_derros_jni_Utilities
 * Method:    addOne
 * Signature: (J)J
 */
JNIEXPORT jlong JNICALL Java_in_derros_jni_Utilities_addOne
  (JNIEnv * env, jobject obj, jlong valIn) {
    return 1+valIn;
}

/*
 * ==============IMPLEMENTATION=================
 * Class:     in_derros_jni_Utilities
 * Method:    loopToMax
 * Signature: (JJ)J
 */
JNIEXPORT jlong JNICALL Java_in_derros_jni_Utilities_loopToMax
  (JNIEnv * env, jobject obj, jlong start, jlong max) {
    long counter2 = start;
    while(counter2 < max) {
        counter2 = counter2 + 1;
        if(counter2 % 1000000 == 0) {
            counter2 = counter2 + 10;
        }
    }
    return counter2;
}

The JNI tooling generates the funky-looking method signatures, and then you just have to copy the signatures and fill in the implementations in your own cpp file. It's a bit weird but it does work.
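
The funky names follow a fixed mangling rule from the JNI specification: the prefix Java_, then the fully qualified class name, then the method name, with a handful of characters escaped. The sketch below reproduces that rule for the simple, non-overloaded case. The class JniNames and its methods are hypothetical names for illustration only, and overloaded native methods (which get an extra signature suffix) are deliberately left out.

```java
// Hypothetical helper for illustration: derives the C symbol name that JNI
// looks up for a non-overloaded native method.
final class JniNames {

    // Escapes one name: '.' or '/' -> "_", '_' -> "_1", ';' -> "_2",
    // '[' -> "_3", and any other non-alphanumeric character -> "_0xxxx"
    // (four lowercase hex digits of its UTF-16 code unit).
    static String mangle(String name) {
        StringBuilder out = new StringBuilder();
        for (char c : name.toCharArray()) {
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum) {
                out.append(c);
            } else if (c == '.' || c == '/') {
                out.append('_');
            } else if (c == '_') {
                out.append("_1");
            } else if (c == ';') {
                out.append("_2");
            } else if (c == '[') {
                out.append("_3");
            } else {
                out.append(String.format("_0%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Builds the full symbol name, e.g. for in.derros.jni.Utilities#addOne.
    static String symbolFor(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }

    public static void main(String[] args) {
        System.out.println(symbolFor("in.derros.jni.Utilities", "addOne"));
        System.out.println(symbolFor("in.derros.jni.Utilities", "loopToMax"));
    }
}
```

For the two methods in this article, this yields Java_in_derros_jni_Utilities_addOne and Java_in_derros_jni_Utilities_loopToMax, matching the generated signatures above.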

PHP

FFI in PHP is a breeze. I was able to write the following script to easily reproduce the Java test cases above:

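<?php
        // Assumed setup (not shown in the original listing): declare the two C
        // functions and load simple.so. The FFI::cdef arguments are an assumption.
        $ffi = FFI::cdef("
            long addOne (long valIn);
            long loopToMax(long start, long max);
        ", __DIR__ . "/simple.so");
        $LOOPMAX = 2000000000;

        $st1 = microtime(true);
        $counter = 0;
        while($counter < $LOOPMAX) {
            $counter = $ffi->addOne($counter);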
            if($counter % 1000000 == 0) {
                $counter = $counter + 10;
            }
        }
        $et1 = microtime(true);

        $st2 = microtime(true);
        $counter2 = 0;
        while($counter2 < $LOOPMAX) {
            $counter2 = $counter2 + 1;
            if($counter2 % 1000000 == 0) {
                $counter2 = $counter2 + 10;
            }
        }
        $et2 = microtime(true);

        $st3 = microtime(true);
        $counter3 = $ffi->loopToMax(0,$LOOPMAX);
        $et3 = microtime(true);

        $tt1 = (($et1-$st1));
        $tt2 = (($et2-$st2));
        $tt3 = (($et3-$st3));
        print "t1: $tt1 to find $counter  (" . sprintf("%.3f",($LOOPMAX/$tt1)) . " per sec)\n";
        print "t2: $tt2 to find $counter2 (" . sprintf("%.3f",($LOOPMAX/$tt2)) . " per sec)\n";
        print "t3: $tt3 to find $counter3 (" . sprintf("%.3f",($LOOPMAX/$tt3)) . " per sec)\n";

?>

To complete the C side of the picture, I wrote the following simple header file (utility.h)

long addOne (long valIn);
long loopToMax(long start, long max);

and the following simple C implementation of the core logic (utility.c)

#include "utility.h"

long addOne (long valIn) {
    return 1+valIn;
}

long loopToMax(long start, long max) {
    long counter2 = start;
    while(counter2 < max) {
        counter2 = counter2 + 1;
        if(counter2 % 1000000 == 0) {
            counter2 = counter2 + 10;
        }
    }
    return counter2;
}

Compile the small C program to a shared object file that can be loaded by PHP with the following command:

gcc -shared -O3 -o simple.so -fPIC utility.c

Note that I believe we could get somewhat better performance by dropping the -fPIC flag.

Then we are able to benchmark

$ php -version
PHP 8.1.2-1ubuntu2.11 (cli) (built: Feb 22 2023 22:56:18) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.1.2, Copyright (c) Zend Technologies
    with Zend OPcache v8.1.2-1ubuntu2.11, Copyright (c), by Zend Technologies


$ php bench.php 
t1: 215.97937607765 to find 2000000010  (9260143.428 per sec)
t2: 37.147371053696 to find 2000000010 (53839610.806 per sec)
t3: 5.6606829166412 to find 2000000010 (353314260.744 per sec)

PHP is much slower than Java in scenarios 1 and 2, by about a factor of 10. Scenario 3 comes out to about the same runtime, as expected, since all the heavy lifting is done by the native code here.

C#

With C#, I found that it is essential to set Optimize to true in the .csproj file to get competitive performance.

dotnet --version 
7.0.202

Similarly to PHP, C# makes FFI a breeze. Just load the .so file, define the signatures of the external functions, and then you can call the native functions directly. This is massively simpler than JNI.

using System;
using System.Runtime.InteropServices;

public class Bench
{
    // Import simple.so (the shared library built from utility.c) and declare
    // the methods corresponding to its native functions.
    [DllImport("../../../simple.so")]
    private static extern long addOne (long valIn);

    [DllImport("../../../simple.so")]
    private static extern long loopToMax(long start, long max);

    public static long CurrentMillis()
    {
       long milliseconds = DateTime.Now.Ticks / TimeSpan.TicksPerMillisecond;
       return milliseconds;
    }

    public static long LOOPMAX = 2000000000;
    

    public static void Main(string[] args)
    {

        long st1 = CurrentMillis();
         long counter = 0;
         while(counter < LOOPMAX) {
             counter = addOne(counter);
             if(counter % 1000000 == 0) {
                 counter = counter + 10;
             }
         }
         long et1 = CurrentMillis();

         long st2 = CurrentMillis();
         long counter2 = 0;
         while(counter2 < LOOPMAX) {
             counter2 = counter2 + 1;
             if(counter2 % 1000000 == 0) {
                 counter2 = counter2 + 10;
             }
         }
         long et2 = CurrentMillis();

         long st3 = CurrentMillis();
         long counter3 = loopToMax(0,LOOPMAX);
         long et3 = CurrentMillis();

         double tt1 = ((et1-st1)/1000.0);
         double tt2 = ((et2-st2)/1000.0);
         double tt3 = ((et3-st3)/1000.0);
         Console.WriteLine("t1: " + tt1 + " to find " + counter + " (" + String.Format("{0}",(LOOPMAX/tt1)) + " per sec)");
         Console.WriteLine("t2: " + tt2 + " to find " + counter2 + " (" + String.Format("{0}",(LOOPMAX/tt2)) + " per sec)") ;
         Console.WriteLine("t3: " + tt3 + " to find " + counter3 + " (" + String.Format("{0}",(LOOPMAX/tt3))  + " per sec)");
    }
}

The performance is great for dotnet.

$ dotnet run
t1: 7.466 to find 2000000010 (267881060.8090008 per sec)
t2: 3.963 to find 2000000010 (504668180.6712087 per sec)
t3: 5.799 to find 2000000010 (344887049.4912916 per sec)

.NET 7 is able to run the FFI hot path in about a third of the time that Java 17 requires. In the implementations written with no FFI, .NET 7 is about 20% slower than Java 17 (3.963 s vs. 3.29 s), but it's in the same ballpark. Scenario 3 is close between all the languages so far, as expected, since the heavy lifting here is in native code.

I think the excellent FFI support of C# -- both usability and performance -- and its better support for value-oriented memory layouts are two of the secret weapons that give C# a big edge for applications that involve integrating with native or high-performance code. Java has good performance for a managed language but bogs down compared to C# when asked to talk to native code. For example, Unity uses C# -- I doubt Java would have been as successful in this role. The "everything is an object" model of Java also likely hampers interop with native libraries doing heavy math on image, video, or language modelling data -- I don't have proof of that, but it's a hunch on my part.

These are two areas Java could stand to catch up with C# in.

Go

For Go, I only implemented scenario 1, as that is the scenario that uses FFI the most.

I used cgo's inline functionality to define the native function within the Go file -- we could likely call out to simple.so as well, but I haven't implemented that yet.

I timed this testcase on the command line like so

$ time go run bench.go 
2000000010

real    2m7.746s
user    2m7.812s
sys     0m0.240s

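Go's FFI hot-loop performance is far worse than that of Java and C#. I think this has something to do with the green-threads (goroutine) model that Go uses, since every cgo call has to leave the goroutine's own small stack and run on a system stack. Note also that timing "go run" includes compiling the program, although that should be small next to a two-minute run.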

Java - Project Panama

Lastly, let's return to Java to look at the new Project Panama FFI API. This promises to reduce the effort needed to interact with native code from Java (JNI is so clunky!). For this test I downloaded the latest generally released JDK build (at the time I ran my tests):

$ ~/software/jdk-20.0.1/bin/java -version
java version "20.0.1" 2023-04-18
Java(TM) SE Runtime Environment (build 20.0.1+9-29)
Java HotSpot(TM) 64-Bit Server VM (build 20.0.1+9-29, mixed mode, sharing)

Finally, I can write code in Java to call out to a C library with complexity comparable to the other languages in this comparison. Note that I only implemented scenario 1 (FFI in the loop) using the Panama approach.

package net.berryplace.testing;

import java.lang.foreign.Linker;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.nio.file.Paths;

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodType;

public class JavaBench2 {

    private static final SymbolLookup libLookup;
    static String libPath = "simple.so";

    static {
        // loads a particular C library
        System.load(Paths.get("").toAbsolutePath().toString() + "/" + libPath);
        libLookup = SymbolLookup.loaderLookup();
    }

    public static MethodHandle getaddOneMethod() {
        Linker linker          = Linker.nativeLinker();
        var addOneMethod = libLookup.find("addOne");

        if (addOneMethod.isPresent()) {
            var methodReference = linker
                    .downcallHandle(
                            addOneMethod.get(),
                            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG)
                    );

            return methodReference;

        }
        throw new RuntimeException("addOne function not found.");
    }

    static long LOOPMAX = 2000000000L;

    public static void main(String[] args) throws Throwable {
        MethodHandle addOne = getaddOneMethod();

        long st1 = System.currentTimeMillis();
        long counter = 0;

        while(counter < LOOPMAX) {
            counter = (long) addOne.invoke(counter);
            if(counter % 1000000 == 0) {
                counter = counter + 10;
            }
        }
        long et1 = System.currentTimeMillis();

        double tt1 = ((et1-st1)/1000.0);
        System.out.println("t1: " + tt1 + " to find " + counter + " (" + String.format("%.3f",(LOOPMAX/tt1)) + " per sec)");
    }
}

This isn't quite as slick as the C# approach, but what it lacks in slickness it makes up for in being very composable and structured. The Panama team has also built something called jextract that further simplifies calling into foreign libraries -- I haven't tried that yet -- I'm pretty pleased with the default Panama API on its own.

Note that the above calls in to the simple.so object we built earlier for the php FFI test.

From the directory containing the class file, we can compile and run like so using java 20

~/software/jdk-20.0.1/bin/javac --release 20 --enable-preview JavaBench2.java 
(cd ../../../ && ~/software/jdk-20.0.1/bin/java --enable-preview net/berryplace/testing/JavaBench2 )

Resulting in the below output on my machine

t1: 27.383 to find 2000000010 (73038016.287 per sec)

While the Panama API is much simpler to use, the bad news is that the performance is even worse than the standard JNI API for this use case (27.383 s vs. 22.415 s). C# is much faster.
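
One caveat may be worth checking before blaming Panama itself. The benchmark above keeps the MethodHandle in a local variable and calls it with invoke. A commonly recommended pattern is to hold the handle in a static final field and call it with invokeExact, which skips the type adaptation that invoke may perform and lets the JIT treat the handle as a constant. Whether that closes the gap in this benchmark is untested here. The sketch below shows only the pattern: it uses an ordinary Java method as a stand-in target so that it runs without simple.so or preview flags, and the class and method names are hypothetical. A handle returned by linker.downcallHandle(...) would go in the same static final field.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical sketch of the "static final handle + invokeExact" pattern.
final class ExactInvokeSketch {

    // Stand-in for the native addOne; an assumption made so the sketch is runnable.
    static long addOne(long valIn) {
        return 1 + valIn;
    }

    // Held in a static final field so the JIT can treat the handle as a constant.
    private static final MethodHandle ADD_ONE;

    static {
        try {
            ADD_ONE = MethodHandles.lookup().findStatic(
                    ExactInvokeSketch.class, "addOne",
                    MethodType.methodType(long.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static long loopViaHandle(long start, long max) {
        long counter = start;
        try {
            while (counter < max) {
                // invokeExact requires the call site to match (long)long exactly,
                // hence the long argument and the (long) cast on the result.
                counter = (long) ADD_ONE.invokeExact(counter);
                if (counter % 1000000 == 0) {
                    counter = counter + 10;
                }
            }
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(loopViaHandle(0, 3_000_000L));
    }
}
```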

Conclusions

Here is a table containing all the benchmark results (elapsed time in seconds, lower is better; n/a means the scenario was not implemented):

                            Java – JNI   PHP       C#      Go     Java – Panama
Scenario 1 (FFI Hotloop)    22.415       215.979   7.466   127    27.383
Scenario 2 (Pure)           3.29         37.147    3.963   n/a    n/a
Scenario 3 (Offloaded)      5.889        5.661     5.799   n/a    n/a

C# and PHP tie for ease of use. C# is far and away the winner in terms of FFI performance. Java is last for ease of use with JNI, but competitive with the Project Panama FFI API. However, the Project Panama FFI API regresses in terms of performance compared to JNI. Java FFI performance is middling. Go and PHP bring up the rear on FFI performance -- for PHP I think it's not so much that the FFI is slow but that PHP itself is a slower language, so more time is spent on the PHP side when running the hot loop.
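
As a rough way to read the table, the difference between scenario 1 and scenario 2 can be spread across the LOOPMAX calls to estimate a per-call FFI cost. The sketch below does that arithmetic for the three languages that have both measurements. The class and method names are hypothetical, and the estimate assumes the loop cost and the call cost simply add up, which a JIT compiler does not guarantee.

```java
// Hypothetical helper: estimates per-call FFI overhead from the table above.
final class FfiOverhead {

    static final long LOOPMAX = 2000000000L;

    // (scenario 1 seconds - scenario 2 seconds) spread over LOOPMAX calls, in nanoseconds.
    static double perCallNanos(double ffiSeconds, double pureSeconds) {
        return (ffiSeconds - pureSeconds) / LOOPMAX * 1e9;
    }

    public static void main(String[] args) {
        System.out.printf("Java JNI: %.2f ns per call%n", perCallNanos(22.415, 3.29));
        System.out.printf("PHP:      %.2f ns per call%n", perCallNanos(215.979, 37.147));
        System.out.printf("C#:       %.2f ns per call%n", perCallNanos(7.466, 3.963));
    }
}
```

On these numbers the estimate comes out to roughly 9.6 ns per call for Java with JNI, 89.4 ns for PHP, and 1.8 ns for C#. Go and Panama are left out because scenario 2 was not measured for them.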