Nashorn Java to JavaScript interoperability issues
This page documents, and discusses possible solutions for, shortcomings of the Java/JavaScript interoperability in Java 8.
Extending a Spark lambda function: Nashorn does not support extending a Java class that implements java.io.Serializable
Spark processing is accomplished by providing lambda functions to Spark classes, for example RDD:
JavaRDD complete_ratings_data = complete_ratings_raw_data.filter(new Function<String, Boolean>() {
    public Boolean call(String line) {
        if (line.equals(complete_ratings_raw_data_header)) {
            return false;
        } else {
            return true;
        }
    }
});
In the example above we are implementing the org.apache.spark.api.java.function.Function interface.
The Nashorn documentation states that we can implement/extend a Java class in either of two ways:
// This syntax is primarily used to support anonymous class-like syntax for
// Java interface implementation as shown below.
var r = new java.lang.Runnable() {
    run: function() { print("run"); }
};
or
var ArrayList = Java.type("java.util.ArrayList");
var ArrayListExtender = Java.extend(ArrayList);
var printSizeInvokedArrayList = new ArrayListExtender() {
    size: function() { print("size invoked!"); }
};
So for org.apache.spark.api.java.function.Function we would write code like:
var jsFunc = new org.apache.spark.api.java.function.Function() {
    call: function(line) {
        return line != "userId,movieId,rating,timestamp"; // complete_ratings_raw_data_header
    }
};
var xx = complete_ratings_raw_data_JavaObj.filter(jsFunc);
or
var sparkFunction = Java.type("org.apache.spark.api.java.function.Function");
var sparkFunctionExtender = Java.extend(sparkFunction);
var boolFunctionExtender = new sparkFunctionExtender() {
    call: function(line) {
        return line != "userId,movieId,rating,timestamp"; // complete_ratings_raw_data_header
    }
};
var xx = complete_ratings_raw_data_JavaObj.filter(boolFunctionExtender);
Either version will throw the exception Exception in thread "main" java.lang.RuntimeException: org.apache.spark.SparkException: Task not serializable. Spark must serialize the function object to ship it to worker nodes, and the adapter class that Nashorn generates for the implementation is not serializable.
Nashorn parseInt returns a java.lang.Double
var x = parseInt("3");
print(x.getClass()); // prints java.lang.Double
This causes class cast exceptions when running many MLlib classes. One way to fix this would be to add an RDD.map() at every place we run into the class cast exceptions and ensure we have the correct types by using java.lang.Integer.parseInt() to produce integers. But a better solution is to simply "monkey patch" the Nashorn implementation of parseInt:
/**
 * We need to replace Nashorn's implementation of parseInt because it returns
 * a java.lang.Double. Why, you ask? That is a good question!
 * In any case this really messes up Spark, as we need parseInt to return a java.lang.Integer,
 * so we replace it globally with an implementation that works for Spark.
 * @param string
 * @param radix
 * @returns {Number}
 * @private
 */
parseInt = function(string, radix) {
    var val = NaN;
    try {
        if (radix) {
            val = java.lang.Integer.parseInt(string, radix);
        } else {
            val = java.lang.Integer.parseInt(string);
        }
    } catch (e) {
        // bad parseInt value, fall through and return NaN
    }
    return val;
};
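After the patch, a quick sanity check (a minimal sketch; the exact getClass() output can vary with how Nashorn boxes the value):

var x = parseInt("3");
print(x.getClass());             // now prints class java.lang.Integer
print(parseInt("ff", 16));       // 255
print(parseInt("not a number")); // NaN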
Converting arrays from JavaScript to Java is not automatic. If we have a JavaScript array that needs to become an int[] we must do
ret = Java.to(l, "int[]");
for a double[]:
ret = Java.to(l, "double[]");
and for an Object[]:
ret = Java.to(l);
Going the other way, to convert a Java array to a JavaScript array we need to call var keys = Java.from(javaObj.keySet().toArray());
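A minimal round-trip sketch (the variable names are illustrative):

var jsArray = [1, 2, 3];
var javaInts = Java.to(jsArray, "int[]");       // int[] for Java APIs that expect primitives
var javaDoubles = Java.to(jsArray, "double[]"); // double[]
var javaObjects = Java.to(jsArray);             // defaults to Object[]
print(javaInts.class);                          // class [I

var backToJs = Java.from(javaInts);             // a true JavaScript array again
print(Array.isArray(backToJs));                 // true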
We have noticed that looping through an array is faster than using lambda functions to process the array:
a.forEach(function(x) {
    args.push(x);
});
takes longer than
for (var i = 1; i < arguments.length; i++) {
    args.push(Serialize.javaToJs(arguments[i]));
}
The above code snippet is from Utils_invoke, so this code runs every time we "set up" a call to a user's lambda function. I suspect the issue is the cost of setting up the stack for the anonymous function; that might not be significant if it happened only once, but with large datasets like MovieLens (100M ratings) the accumulated time is significant.
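A minimal micro-benchmark sketch to reproduce the comparison (the array size is illustrative; timings will vary with JVM warm-up):

var a = [];
for (var i = 0; i < 1000000; i++) { a.push(i); }

var t0 = new Date().getTime();
var args1 = [];
a.forEach(function(x) { args1.push(x); });
print("forEach: " + (new Date().getTime() - t0) + " ms");

var t1 = new Date().getTime();
var args2 = [];
for (var j = 0; j < a.length; j++) { args2.push(a[j]); }
print("for loop: " + (new Date().getTime() - t1) + " ms");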
Running the lambda functions in Nashorn has a cost. Consider code that just loads a large dataset and filters a header string out of the dataset:
var obj = complete_ratings_raw_data.getJavaObject();
var start = new Date().getTime();
var complete_ratings_data = obj.filter(new org.eclairjs.nashorn.JSFunctionTest());
print("There are recommendations in the complete dataset: " + complete_ratings_data.count());
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time + " milliseconds");
Running the filter in Java with
package org.eclairjs.nashorn;

import org.apache.spark.api.java.function.Function;

public class JSFunctionTest implements Function {

    public JSFunctionTest() {
    }

    @Override
    public Object call(Object l) {
        String line = (String) l;
        if (line.equals("userId,movieId,rating,timestamp")) {
            return false;
        } else {
            return true;
        }
    }
}
gives us a time of:
There are recommendations in the complete dataset: 22884377
Execution time: 2379 milliseconds
Changing the lambda to run in Nashorn using
package org.eclairjs.nashorn;

import org.apache.spark.api.java.function.Function;

import javax.script.Invocable;
import javax.script.ScriptEngine;

public class JSFunctionTest2 implements Function {

    private Object fn = null;

    public JSFunctionTest2() {
    }

    @Override
    public Object call(Object l) throws Exception {
        ScriptEngine e = NashornEngineSingleton.getEngine();
        if (this.fn == null) {
            // Define the JavaScript function in the engine only once.
            String func = "function myTestFunc(line) { return line != \"userId,movieId,rating,timestamp\";}";
            this.fn = e.eval(func);
        }
        Invocable invocable = (Invocable) e;
        Object ret = invocable.invokeFunction("myTestFunc", l);
        return ret;
    }
}
gives us a time of:
There are recommendations in the complete dataset: 22884378
Execution time: 8372 milliseconds
Just using Nashorn to run the equivalent JavaScript code, without our serialization layer, cost us 60 milliseconds.
When loading the large dataset with a lambda function that contains a String concatenation ("something" + obj), the time is 90 seconds. When the String concatenation is removed, the time is only 45 seconds.
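An illustrative pair of lambdas showing the difference (the predicate is borrowed from the filter above; only the concatenation differs):

// Lambda containing a String concatenation ("something" + obj): ~90 seconds
var withConcat = function(line) {
    var msg = "line=" + line; // the per-record concatenation roughly doubles the run time
    return line != "userId,movieId,rating,timestamp";
};

// The same lambda without the concatenation: ~45 seconds
var withoutConcat = function(line) {
    return line != "userId,movieId,rating,timestamp";
};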
We need to place the Java object inside a JavaScript object so that the JavaScript programmer can use common JavaScript programming practices in their Spark code. For example, a lambda function like:
var small_movies_titles = small_movies_data.mapToPair(
    function(tuple2) { // Tuple2
        print("tuple2 JSON= " + JSON.stringify(tuple2));
        return new Tuple2(tuple2._1(), tuple2._2());
    });
If tuple2 is a Scala Tuple2, the output of JSON.stringify(tuple2) is undefined. The reason is that JSON.stringify doesn't stringify POJOs, and Nashorn's JSON.stringify can't be extended to handle them.
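A minimal reproduction (assuming scala.Tuple2 is on the classpath):

var ScalaTuple2 = Java.type('scala.Tuple2');
var t = new ScalaTuple2("value1", "value2");
print(JSON.stringify(t)); // prints: undefined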
So in order to support common JavaScript practices like JSON.stringify, we have a JavaScript object Tuple2 that takes a Java Tuple2 object:
(function () {

    var JavaWrapper = require(EclairJS_Globals.NAMESPACE + '/JavaWrapper');
    var Logger = require(EclairJS_Globals.NAMESPACE + '/Logger');
    var Utils = require(EclairJS_Globals.NAMESPACE + '/Utils');
    var javaTuple2 = Java.type('scala.Tuple2');

    /**
     * @classdesc A JavaScript wrapper around a scala.Tuple2.
     * @param {object} obj - the first tuple value, or a Java Tuple2 object to wrap
     * @param {object} [obj2] - the second tuple value
     * @constructor
     * @memberof module:eclairjs
     */
    var Tuple2 = function () {
        this.logger = Logger.getLogger("Tuple2_js");
        var jvmObject;
        if (arguments.length == 2) {
            // Build a new Java Tuple2 from two JavaScript values.
            jvmObject = new javaTuple2(Serialize.jsToJava(arguments[0]), Serialize.jsToJava(arguments[1]));
        } else {
            // Wrap an existing Java Tuple2.
            jvmObject = Utils.unwrapObject(arguments[0]);
        }
        JavaWrapper.call(this, jvmObject);
    };

    Tuple2.prototype = Object.create(JavaWrapper.prototype);
    Tuple2.prototype.constructor = Tuple2;

    /**
     * @returns {object} the first element of the tuple
     */
    Tuple2.prototype._1 = function () {
        return Utils.javaToJs(this.getJavaObject()._1());
    };

    /**
     * @returns {object} the second element of the tuple
     */
    Tuple2.prototype._2 = function () {
        return Utils.javaToJs(this.getJavaObject()._2());
    };

    // Invoked by JSON.stringify; returns a plain JavaScript object it can serialize.
    Tuple2.prototype.toJSON = function () {
        var jsonObj = {};
        jsonObj[0] = this._1();
        jsonObj[1] = this._2();
        jsonObj.length = 2;
        return jsonObj;
    };

    module.exports = Tuple2;

})();
Before calling the lambda function we create a new JavaScript Tuple2, var tuple2 = new Tuple2(javaTuple2);, and pass the JavaScript tuple2 as an argument to the lambda. Now when the JavaScript code calls JSON.stringify(tuple2), the JavaScript Tuple2.toJSON method is invoked, and the string {"0":"value1", "1":"value2"} is returned from JSON.stringify(tuple2) as the JavaScript programmer would expect.
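Putting it together, a minimal sketch of the wrapper in use (assuming the Tuple2 module above has been loaded; the values are illustrative):

var ScalaTuple2 = Java.type('scala.Tuple2');
var javaTuple2 = new ScalaTuple2("value1", "value2");

var tuple2 = new Tuple2(javaTuple2); // wrap the Java object
print(JSON.stringify(tuple2));       // goes through Tuple2.prototype.toJSON, e.g. {"0":"value1","1":"value2"}
print(tuple2._1());                  // value1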