Nashorn Java to JavaScript interoperability issues
This page documents, and discusses possible solutions for, shortcomings of the Java/JavaScript interoperability in Java 8.
Extending a Spark lambda function: Nashorn does not support extending a Java class that implements java.io.Serializable
Spark processing is accomplished by providing lambda functions to Spark classes, for example RDD:
JavaRDD complete_ratings_data = complete_ratings_raw_data.filter(new Function<String, Boolean>() {
    public Boolean call(String line) {
        if (line.equals(complete_ratings_raw_data_header)) {
            return false;
        } else {
            return true;
        }
    }
});
In the example above we are implementing the org.apache.spark.api.java.function.Function interface.
The Nashorn documentation states that we can implement/extend a Java class in either of two ways:
// This syntax is primarily used to support anonymous class-like syntax for
// Java interface implementation as shown below.
var r = new java.lang.Runnable() {
    run: function() { print("run"); }
};
or
var ArrayList = Java.type("java.util.ArrayList");
var ArrayListExtender = Java.extend(ArrayList);
var printSizeInvokedArrayList = new ArrayListExtender() {
    size: function() { print("size invoked!"); }
};
So for org.apache.spark.api.java.function.Function we would write code like:
var jsFunc = new org.apache.spark.api.java.function.Function() {
    call: function(line) {
        return line != "userId,movieId,rating,timestamp"; // complete_ratings_raw_data_header
    }
};
var xx = complete_ratings_raw_data_JavaObj.filter(jsFunc);
or
var sparkFunction = Java.type("org.apache.spark.api.java.function.Function");
var sparkFunctionExtender = Java.extend(sparkFunction);
var boolFunctionExtender = new sparkFunctionExtender() {
    call: function(line) {
        return line != "userId,movieId,rating,timestamp"; // complete_ratings_raw_data_header
    }
};
var xx = complete_ratings_raw_data_JavaObj.filter(boolFunctionExtender);
Either version will throw the exception Exception in thread "main" java.lang.RuntimeException: org.apache.spark.SparkException: Task not serializable. Spark must serialize the function object to ship it to worker nodes, and the adapter class that Nashorn generates for the implementation is not serializable.
Nashorn parseInt returns a java.lang.Double
var x = parseInt("3");
print(x.getClass()); // prints java.lang.Double
This causes class cast exceptions when running many MLlib classes. One way to fix this would be to add an RDD.map() at every place we run into the class cast exceptions and ensure we have the correct types by using java.lang.Integer.parseInt() to produce integers. But a better solution is to simply "monkey patch" the Nashorn implementation of parseInt:
/**
 * We need to replace Nashorn's implementation of parseInt because it returns
 * a java.lang.Double. Why, you ask? That is a good question!
 * In any case this really messes up Spark, as we need parseInt to return a java.lang.Integer,
 * so we replace it globally with an implementation that works for Spark.
 * @param string
 * @param radix
 * @returns {Number}
 * @private
 */
parseInt = function(string, radix) {
    var val = NaN;
    try {
        if (radix) {
            val = java.lang.Integer.parseInt(string, radix);
        } else {
            val = java.lang.Integer.parseInt(string);
        }
    } catch (e) {
        // bad parseInt value, fall through and return NaN
    }
    return val;
};
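After the patch, a quick sanity check (a minimal sketch; the exact getClass() output can vary with how Nashorn boxes the value):

var x = parseInt("3");
print(x.getClass());             // now prints class java.lang.Integer
print(parseInt("ff", 16));       // 255
print(parseInt("not a number")); // NaN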
Converting arrays from JavaScript to Java is not automatic. If we have a JavaScript array that needs to become an int[] we must do
ret = Java.to(l, "int[]");
for a double[]:
ret = Java.to(l, "double[]");
and for an Object[]:
ret = Java.to(l);
Going the other way, to convert a Java array to a JavaScript array we need to call var keys = Java.from(javaObj.keySet().toArray());
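A minimal round-trip sketch (the variable names are illustrative):

var jsArray = [1, 2, 3];
var javaInts = Java.to(jsArray, "int[]");       // int[] for Java APIs that expect primitives
var javaDoubles = Java.to(jsArray, "double[]"); // double[]
var javaObjects = Java.to(jsArray);             // defaults to Object[]
print(javaInts.class);                          // class [I

var backToJs = Java.from(javaInts);             // a true JavaScript array again
print(Array.isArray(backToJs));                 // true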
We have noticed that looping through an array is faster than using lambda functions to process the array:
a.forEach(function(x) {
    args.push(x);
});
takes longer than
for (var i = 1; i < arguments.length; i++) {
    args.push(Serialize.javaToJs(arguments[i]));
}
The above code snippet is from Utils_invoke, so this code runs every time we "set up" a call to a user's lambda function. I suspect the issue is the cost of setting up the stack for the anonymous function; that might not be significant if it happened only once, but with large datasets like MovieLens (100M ratings) the accumulated time is significant.
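A minimal micro-benchmark sketch to reproduce the comparison (the array size is illustrative; timings will vary with JVM warm-up):

var a = [];
for (var i = 0; i < 1000000; i++) { a.push(i); }

var t0 = new Date().getTime();
var args1 = [];
a.forEach(function(x) { args1.push(x); });
print("forEach: " + (new Date().getTime() - t0) + " ms");

var t1 = new Date().getTime();
var args2 = [];
for (var j = 0; j < a.length; j++) { args2.push(a[j]); }
print("for loop: " + (new Date().getTime() - t1) + " ms");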
Running the lambda functions in Nashorn has a cost. Consider code that just loads a large dataset and filters a header string out of the dataset:
var obj = complete_ratings_raw_data.getJavaObject();
var start = new Date().getTime();
var complete_ratings_data = obj.filter(new org.eclairjs.nashorn.JSFunctionTest());
print("There are recommendations in the complete dataset: " + complete_ratings_data.count());
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time + " milliseconds");
Running the filter in Java with
package org.eclairjs.nashorn;

import org.apache.spark.api.java.function.Function;

public class JSFunctionTest implements Function {

    public JSFunctionTest() {
    }

    @Override
    public Object call(Object l) {
        String line = (String) l;
        if (line.equals("userId,movieId,rating,timestamp")) {
            return false;
        } else {
            return true;
        }
    }
}
gives us a time of:
There are recommendations in the complete dataset: 22884377
Execution time: 2379 milliseconds
Changing the lambda to run in Nashorn using
package org.eclairjs.nashorn;

import org.apache.spark.api.java.function.Function;

import javax.script.Invocable;
import javax.script.ScriptEngine;

public class JSFunctionTest2 implements Function {

    private Object fn = null;

    public JSFunctionTest2() {
    }

    @Override
    public Object call(Object l) throws Exception {
        ScriptEngine e = NashornEngineSingleton.getEngine();
        if (this.fn == null) {
            // Define the JavaScript function in the engine only once.
            String func = "function myTestFunc(line) { return line != \"userId,movieId,rating,timestamp\";}";
            this.fn = e.eval(func);
        }
        Invocable invocable = (Invocable) e;
        Object ret = invocable.invokeFunction("myTestFunc", l);
        return ret;
    }
}
gives us a time of:
There are recommendations in the complete dataset: 22884378
Execution time: 8372 milliseconds
Just using Nashorn to run the equivalent JavaScript code, without our serialization layer, cost us 60 milliseconds.
When loading the large dataset with a lambda function that contains a String concatenation ("something" + obj), the time is 90 seconds. When the String concatenation is removed, the time is only 45 seconds.
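An illustrative pair of lambdas showing the difference (the predicate is borrowed from the filter above; only the concatenation differs):

// Lambda containing a String concatenation ("something" + obj): ~90 seconds
var withConcat = function(line) {
    var msg = "line=" + line; // the per-record concatenation roughly doubles the run time
    return line != "userId,movieId,rating,timestamp";
};

// The same lambda without the concatenation: ~45 seconds
var withoutConcat = function(line) {
    return line != "userId,movieId,rating,timestamp";
};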
We need to place the Java object inside a JavaScript object so that the JavaScript programmer can use common JavaScript programming practices in their Spark code. For example, a lambda function like:
var small_movies_titles = small_movies_data.mapToPair(
    function(tuple2) { // Tuple2
        print("tuple2 JSON= " + JSON.stringify(tuple2));
        return new Tuple2(tuple2._1(), tuple2._2());
    });
If tuple2 is a Scala Tuple2, the output of JSON.stringify(tuple2) is undefined. The reason is that JSON.stringify doesn't stringify POJOs, and Nashorn's JSON.stringify can't be extended to handle them.
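A minimal reproduction (assuming scala.Tuple2 is on the classpath):

var ScalaTuple2 = Java.type('scala.Tuple2');
var t = new ScalaTuple2("value1", "value2");
print(JSON.stringify(t)); // prints: undefined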
So in order to support common JavaScript practices like JSON.stringify, we have a JavaScript object Tuple2 that takes a Java Tuple2 object:
(function () {

    var JavaWrapper = require(EclairJS_Globals.NAMESPACE + '/JavaWrapper');
    var Logger = require(EclairJS_Globals.NAMESPACE + '/Logger');
    var Utils = require(EclairJS_Globals.NAMESPACE + '/Utils');
    var javaTuple2 = Java.type('scala.Tuple2');

    /**
     * @classdesc A JavaScript wrapper around a scala.Tuple2.
     * @param {object} obj - the first tuple value, or a Java Tuple2 object to wrap
     * @param {object} [obj2] - the second tuple value
     * @constructor
     * @memberof module:eclairjs
     */
    var Tuple2 = function () {
        this.logger = Logger.getLogger("Tuple2_js");
        var jvmObject;
        if (arguments.length == 2) {
            // Build a new Java Tuple2 from two JavaScript values.
            jvmObject = new javaTuple2(Serialize.jsToJava(arguments[0]), Serialize.jsToJava(arguments[1]));
        } else {
            // Wrap an existing Java Tuple2.
            jvmObject = Utils.unwrapObject(arguments[0]);
        }
        JavaWrapper.call(this, jvmObject);
    };

    Tuple2.prototype = Object.create(JavaWrapper.prototype);
    Tuple2.prototype.constructor = Tuple2;

    /**
     * @returns {object} the first element of the tuple
     */
    Tuple2.prototype._1 = function () {
        return Utils.javaToJs(this.getJavaObject()._1());
    };

    /**
     * @returns {object} the second element of the tuple
     */
    Tuple2.prototype._2 = function () {
        return Utils.javaToJs(this.getJavaObject()._2());
    };

    // Invoked by JSON.stringify; returns a plain JavaScript object it can serialize.
    Tuple2.prototype.toJSON = function () {
        var jsonObj = {};
        jsonObj[0] = this._1();
        jsonObj[1] = this._2();
        jsonObj.length = 2;
        return jsonObj;
    };

    module.exports = Tuple2;

})();
Before calling the lambda function we create a new JavaScript Tuple2, var tuple2 = new Tuple2(javaTuple2);, and pass the JavaScript tuple2 as an argument to the lambda. Now when the JavaScript code calls JSON.stringify(tuple2), the JavaScript Tuple2.toJSON method is invoked, and the string {"0":"value1", "1":"value2"} is returned from JSON.stringify(tuple2) as the JavaScript programmer would expect.
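Putting it together, a minimal sketch of the wrapper in use (assuming the Tuple2 module above has been loaded; the values are illustrative):

var ScalaTuple2 = Java.type('scala.Tuple2');
var javaTuple2 = new ScalaTuple2("value1", "value2");

var tuple2 = new Tuple2(javaTuple2); // wrap the Java object
print(JSON.stringify(tuple2));       // goes through Tuple2.prototype.toJSON, e.g. {"0":"value1","1":"value2"}
print(tuple2._1());                  // value1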