Py4J has bigger overhead than Jython and JPype

8.1k views

After searching for a way to run Java code from a Django application (Python), I concluded that Py4J is the best option for me. I also tried Jython, JPype and Python's subprocess module, and each of them has certain limitations:

  • Jython: my app runs on Python, not on Jython.
  • JPype: it is buggy. You can start the JVM only once; after that it fails to start again.
  • Python subprocess: Java objects cannot be passed between Python and Java, because it is just a regular console call.

The Py4J website says:

In terms of performance, Py4J has a bigger overhead than both of the previous solutions (Jython and JPype) because it relies on sockets, but if performance is critical to your application, accessing Java objects from Python programs might not be the best idea.

Performance is critical in my application, because I'm working with the machine learning framework Mahout. My question is: will Mahout itself also run slower because of the Py4J gateway server, or does this overhead just mean that invoking Java methods from Python is slower? (In the latter case Mahout's performance will not be a problem and I can use Py4J.)


There are 5 answers

0
bastian

I don't know Mahout, but consider this: with JPype and Py4J at least, you pay a performance cost whenever types are converted from Java to Python and vice versa. Try to minimize the number of calls between the languages. One alternative may be to write a thin wrapper in Java that condenses many Java calls into a single Python-to-Java call.
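To illustrate why fewer, larger calls help, here is a minimal sketch using only the Python standard library. A small local TCP server stands in for the Java side (it is not Py4J and does not reproduce its protocol), and the names SumHandler, Client and demo are made up for this example. It computes the same sum twice: once with one round trip per value and once with a single batched request.

```python
import json
import socket
import socketserver
import threading


class SumHandler(socketserver.StreamRequestHandler):
    """Stand-in for the Java side: each request line is a JSON list of numbers."""

    def handle(self):
        for line in self.rfile:
            numbers = json.loads(line)
            self.wfile.write((json.dumps(sum(numbers)) + "\n").encode())


class Client:
    """Hypothetical client that counts how often it crosses the socket."""

    def __init__(self, address):
        self.sock = socket.create_connection(address)
        self.reader = self.sock.makefile("r")
        self.round_trips = 0

    def call(self, numbers):
        # One request line out, one reply line back: a single round trip.
        self.sock.sendall((json.dumps(numbers) + "\n").encode())
        self.round_trips += 1
        return json.loads(self.reader.readline())

    def close(self):
        self.reader.close()
        self.sock.close()


def demo(values):
    # Port 0 lets the OS pick a free port on the loopback interface.
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), SumHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Chatty variant: one remote call per value.
        chatty = Client(server.server_address)
        total = 0
        for value in values:
            total += chatty.call([value])
        chatty_trips = chatty.round_trips
        chatty.close()

        # Batched variant: the whole list travels in one remote call.
        batched = Client(server.server_address)
        batched_total = batched.call(values)
        batched_trips = batched.round_trips
        batched.close()
    finally:
        server.shutdown()
        server.server_close()
    return total, chatty_trips, batched_total, batched_trips
```

Both variants return the same total, but the batched one crosses the socket once instead of once per value, which is the effect a thin Java-side wrapper aims for.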

0
Ian Lee

My solution:

java thread/process <-> Pipes <-> py subprocess

Use Java's ProcessBuilder to launch Python with the "-u" argument (unbuffered output) and transfer data via pipes.

Here is a good example of this approach:

https://github.com/JULIELab/java-stdio-ipc
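As a rough idea of what the Python end of such a pipe might look like, here is a minimal sketch. It assumes a line-based JSON protocol that I made up for illustration (one request per line with "id" and "values" fields); it is not the protocol used by the linked project, and the function name serve is hypothetical.

```python
import io
import json
import sys


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Read one JSON request per line and write one JSON reply per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between requests
        request = json.loads(line)
        # Placeholder "work": sum the numbers sent by the Java side.
        reply = {"id": request.get("id"), "result": sum(request["values"])}
        stdout.write(json.dumps(reply) + "\n")
        # Flush after every reply so the parent process is never left
        # waiting on data that is still sitting in a buffer.
        stdout.flush()
```

A worker script would end with a call to serve(), and the Java side would write request lines to the process's stdin and read reply lines from its stdout. Passing file-like objects such as io.StringIO makes the loop easy to try without a JVM.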

Here are the results of my rough research on "java <-> py" options:

  • [Jython] A Java implementation of Python.

  • [Jpype] JPype is designed to allow the user to exercise Java as fluidly as possible from within Python. Unlike Jython, JPype does not achieve this by re-implementing Python, but instead by interfacing both virtual machines at the native level. This shared-memory-based approach achieves good computing performance while providing access to the entirety of the CPython and Java libraries.

  • [Runtime] The Runtime class in Java (the older method).

  • [Process] Java's ProcessBuilder class gives more structure to the arguments.

  • [Pipes] Named pipes could be the answer for you. Use subprocess.Popen to start the Java process and establish pipes to communicate with it. Try the os.mkfifo() implementation in Python.
    https://jj09.net/interprocess-communication-python-java/

    -> java <-> Pipes <-> py https://github.com/JULIELab/java-stdio-ipc
    
  • [Protobuf] This is the open-source solution Google uses to do IPC between Java and Python. For serializing and deserializing data efficiently in a language-neutral, platform-neutral, extensible way, take a look at Protocol Buffers.

  • [Socket] Client-server architecture through sockets: Server (Python) - Client (Java) communication using sockets, https://jj09.net/interprocess-communication-python-java/ See also "Send File From Python Server to Java Client".

  • [procbridge] A super-lightweight IPC (Inter-Process Communication) protocol over TCP socket. https://github.com/gongzhang/procbridge https://github.com/gongzhang/procbridge-python https://github.com/gongzhang/procbridge-java

  • [Hessian binary web service protocol] Using a Python client and a Java server.

  • [Jython] Jython is a reimplementation of Python in Java. As a result it has a much lower cost for sharing data structures between Java and Python, and potentially a much higher level of integration. Noted downsides of Jython are that it has lagged well behind the state of the art in Python; it has a limited selection of modules that can be used; and Python object thrashing is not a particularly good fit for the Java virtual machine, leading to some known performance issues.

  • [Py4J] Py4J uses a remote tunnel to operate the JVM. This has the advantage that the remote JVM does not share the same memory space and multiple JVMs can be controlled. It provides a fairly general API, but the overall integration with Python is what one would expect from a remote channel: it operates more like an RPC front-end. It seems well documented and capable. Although I haven’t done any benchmarking, a remotely accessed JVM will incur a transfer penalty when moving data.

  • [Jep] Jep stands for Java embedded Python. It is a mirror image of JPype. Rather than focusing on accessing Java from within Python, this project is geared towards allowing Java to access Python as a sub-interpreter. The syntax for accessing Java resources from within the embedded Python is quite similar, with support for imports. Notable downsides are that although Python supports multiple interpreters, many Python modules do not, so some of the advantages of using Python may be hard to realize. In addition, the documentation is a bit underwhelming, so it is difficult to see how capable it is from the limited examples.

  • [PyJnius] PyJnius is another Python-to-Java-only bridge. Its syntax is somewhat similar to JPype's in that classes can be loaded and then used with mostly native Java syntax. Like JPype, it provides an ability to customize Java classes so that they appear more like native classes. PyJnius seems to be focused on Android. It is written using Cython .pxi files for speed. It does not include a way to represent primitive arrays, so a Python list must be converted whenever an array needs to be passed as an argument or returned. This seems pretty prohibitive for scientific code. PyJnius appears to be still in active development.

  • [Javabridge] Javabridge is direct low-level JNI control from Python. The integration level is quite low, but it does serve the purpose of providing the JNI API to Python rather than attempting to wrap Java in a Python skin. The downside, of course, is that you would really have to know a lot of JNI to make effective use of it.

  • [jpy] This is the most similar package to JPype in terms of project goals. They have achieved more in the reverse direction (using Python from Java) than JPype, which does not support any reverse capabilities. It is currently unclear whether this project is still active, as the most recent release is dated 2014. The integration level with Python is currently fairly low, though what they do provide is a similar API to JPype.

  • [JCC] JCC is a C++ code generator that produces a C++ object interface wrapping a Java library via Java’s Native Interface (JNI). JCC also generates C++ wrappers that conform to Python’s C type system making the instances of Java classes directly available to a Python interpreter. This may be handy if your goal is not to make use of all of Java but rather have a specific library exposed to Python.

  • [VOC] https://beeware.org/project/projects/bridges/voc/ A transpiler that converts Python bytecode into Java bytecode, part of the BeeWare project. This may be useful for getting a smallish piece of Python code hooked into Java. It currently lists itself as being in early development. This is more in the reverse direction, as its goal is making Python code available in Java rather than providing interaction between the two.

  • [p2j] This lists itself as “A (restricted) python to java source translator”. It appears to try to convert Python code into Java. It has not been actively maintained since 2013. Like VOC, this is primarily for code translation rather than bridging.

  • [GraalVM] Source: https://github.com/oracle/graal
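For the [Pipes] option in the list above, the Python side could look roughly like the following sketch. To keep it runnable without a JVM, the child process here is a small Python one-liner that upper-cases each line; in a real setup the command would launch Java instead (the command shown in the comment is hypothetical). The names CHILD_CMD, ask and run_demo are made up for this example.

```python
import subprocess
import sys

# Stand-in child process: echoes each input line in upper case.
# In practice this would start the JVM, for example
# ["java", "-cp", "app.jar", "Main"] (hypothetical command).
CHILD_CMD = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)",
]


def ask(proc, message):
    """Send one line to the child and block until it answers with one line."""
    proc.stdin.write(message + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip("\n")


def run_demo():
    proc = subprocess.Popen(
        CHILD_CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    try:
        return [ask(proc, word) for word in ("hello", "mahout")]
    finally:
        proc.stdin.close()     # EOF tells the child to exit its read loop
        proc.wait(timeout=10)
        proc.stdout.close()
```

Only text crosses the pipe here, so anything richer than strings has to be serialized by both sides (for example as JSON or Protocol Buffers).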

0
Tagar

PySpark uses Py4J quite successfully. If all the heavy lifting is done in Spark (or Mahout in your case) itself, and you just want to return the result to the "driver"/Python code, then Py4J may work very well for you too.

Py4J has a slightly bigger overhead for huge results (which is not necessarily an issue for Spark workloads, as you typically only return summaries/aggregates of the dataframes). There is also an improvement discussion for Py4J about switching to binary serialization to remove that overhead for higher-bandwidth requirements: https://github.com/bartdag/py4j/issues/159
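A back-of-the-envelope model may make this concrete. The sketch below assumes that total time is simply the work done inside the JVM plus a fixed cost per call across the bridge; the function name and all numbers are hypothetical and chosen only for illustration, not measured.

```python
def estimated_runtime_s(jvm_work_s, bridge_calls, overhead_per_call_us):
    """Simplified model: time inside the JVM plus per-call bridge overhead."""
    return jvm_work_s + bridge_calls * overhead_per_call_us / 1_000_000


# Hypothetical figures: 60 s of training inside the JVM and
# 1000 microseconds of overhead per Python-to-Java call.
coarse = estimated_runtime_s(60.0, bridge_calls=10, overhead_per_call_us=1000)
chatty = estimated_runtime_s(60.0, bridge_calls=1_000_000, overhead_per_call_us=1000)

print(f"coarse-grained: {coarse:.2f} s, chatty: {chatty:.2f} s")
```

Under these assumptions a handful of coarse-grained calls adds roughly a hundredth of a second to the 60 s of JVM work, while a million fine-grained calls add about 1000 s. In this model the bridge cost scales with the number of calls (and the data moved), not with how long the Java code itself runs.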

0
subes

Because performance also depends on your usage scenario (how often you call the script and how much data is moved), and because the different solutions each have their own benefits and drawbacks, I have created an API that lets you switch between different implementations without having to change your Python script: https://github.com/subes/invesdwin-context-python

Thus testing what works best or just being flexible about what to deploy to is really easy.

0
mirekphd

The JPype issue that @HIP_HOP mentioned, where the JVM is detached from new threads, can be overcome with the following hack (add it before the first call to Java objects in a new thread that is not yet attached to the JVM):

import jpype

# ensure that the current thread is attached to the JVM
# (essential to prevent JVM / entire container crashes
# due to "JPJavaEnv::FindClass" errors)
if not jpype.isThreadAttachedToJVM():
    jpype.attachThreadToJVM()