xslt
The XSLTProcessor class defined in spark-xml-utils provides methods that enable the transformation of a record by applying a stylesheet. The record is assumed to be a string of XML.
The following import is required for the XSLTProcessor.
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor
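All that is required is the stylesheet that will be used for the transformation. Typically I store the stylesheet (as a string) in an S3 bucket. The stylesheet can then be retrieved using sc.textFile. Note that sc.textFile returns one element per line, so collect.head returns only the first line of the file; this works when the stylesheet is stored on a single line. For a multi-line stylesheet, join the lines instead (for example, collect.mkString("\n")). Alternatively, the stylesheet could be defined in the code as a string.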
val stylesheet = sc.textFile("s3n://spark-xml-utils/stylesheets/srctitle.xsl").collect.head
val proc = XSLTProcessor.getInstance(stylesheet)
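As a rough sketch of the inline alternative, the stylesheet can be written as a triple-quoted Scala string. The stylesheet body, the sample record, and the //srctitle path below are hypothetical (the real srctitle.xsl is not shown on this page). To keep the snippet runnable without the spark-xml-utils jar, it applies the stylesheet with the JDK's built-in javax.xml.transform processor as a stand-in; with spark-xml-utils, the same string would be passed to XSLTProcessor.getInstance.

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.{StreamResult, StreamSource}

// Hypothetical stylesheet (not the real srctitle.xsl): emits the text of the srctitle element.
val inlineStylesheet =
  """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    |  <xsl:output method="text"/>
    |  <xsl:template match="/">
    |    <xsl:value-of select="//srctitle"/>
    |  </xsl:template>
    |</xsl:stylesheet>""".stripMargin

// Hypothetical record; real records may be structured differently.
val sampleXml = "<doc><srctitle>Journal of Examples</srctitle></doc>"

// Stand-in for XSLTProcessor: compile the stylesheet and apply it with the JDK processor.
val transformer = TransformerFactory.newInstance().newTransformer(
  new StreamSource(new StringReader(inlineStylesheet)))
val out = new StringWriter()
transformer.transform(new StreamSource(new StringReader(sampleXml)), new StreamResult(out))
val srctitle = out.toString
```

Defining the stylesheet inline avoids the extra read from storage, at the cost of having to redeploy code whenever the stylesheet changes.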
The result of a transform operation is the output of applying the stylesheet to the content (a string of XML). The transformation can occur locally on the driver (if you have returned records to the driver) or on the workers. In practice, the transformation will typically occur on the workers, but I will show examples of both. The transform() method accepts either a String or an InputStream.
When transforming locally on the driver, the code would be something like the following. In the example below, local is an Array of (String,String) where the first item is the key and the second item is the string of XML.
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor
val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val local = xmlKeyPair.take(10)
val stylesheet = sc.textFile("s3n://spark-xml-utils/stylesheets/srctitle.xsl").collect.head
val proc = XSLTProcessor.getInstance(stylesheet)
val localSrctitles = local.map(rec => proc.transform(rec._2))
When transforming on the workers, the code would be something like the following. In the example below xmlKeyPair is an RDD of (String,String) where the first item is the key and the second item is the string of XML. We use mapPartitions to initialize the processor for XSLT once per partition for optimal performance. We then use an iterator to process each record in the partition.
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor
val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
val stylesheet = sc.textFile("s3n://spark-xml-utils/stylesheets/srctitle.xsl").collect.head
val srctitles = xmlKeyPair.mapPartitions(recsIter => {
  val proc = XSLTProcessor.getInstance(stylesheet)
  recsIter.map(rec => proc.transform(rec._2))
})
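The once-per-partition pattern can be tried without a cluster. The sketch below models partitions as plain Scala iterators and counts how often the stylesheet is compiled. Everything in it is an assumption made for illustration: newProcessor is a hypothetical stand-in for XSLTProcessor.getInstance built on the JDK's javax.xml.transform, and the stylesheet and records are made up.

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.{StreamResult, StreamSource}

// Hypothetical stylesheet: emits the text of the srctitle element.
val xslt =
  """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    |  <xsl:output method="text"/>
    |  <xsl:template match="/"><xsl:value-of select="//srctitle"/></xsl:template>
    |</xsl:stylesheet>""".stripMargin

// Counts how many times the stylesheet is compiled.
var compilations = 0

// Hypothetical stand-in for XSLTProcessor.getInstance: compile once, reuse for many records.
def newProcessor(stylesheet: String): String => String = {
  compilations += 1
  val transformer = TransformerFactory.newInstance().newTransformer(
    new StreamSource(new StringReader(stylesheet)))
  (xml: String) => {
    val out = new StringWriter()
    transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out))
    out.toString
  }
}

// Two "partitions" of (key, xml) records, modelled as plain iterators.
val partitions = Seq(
  Iterator(("k1", "<doc><srctitle>A</srctitle></doc>"), ("k2", "<doc><srctitle>B</srctitle></doc>")),
  Iterator(("k3", "<doc><srctitle>C</srctitle></doc>"))
)

// Same shape as the mapPartitions body: one processor per partition, reused for every record.
val titles = partitions.flatMap { recsIter =>
  val proc = newProcessor(xslt)
  recsIter.map(rec => proc(rec._2)).toList
}
```

Three records are transformed, but the stylesheet is compiled only twice, once per partition. Creating the processor inside a plain map instead would compile it once per record.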
If there is an error encountered during the operation, the error will be logged and an exception will be raised.
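Because an exception is raised, a single malformed record can fail the whole job. One possible way to isolate such records is to wrap each call in scala.util.Try and separate successes from failures afterwards. This is a minimal sketch, not the library's own error handling: transformOne is a hypothetical stand-in for proc.transform built on the JDK's javax.xml.transform, and the stylesheet and records are invented. The exception type thrown by XSLTProcessor may differ, but Try catches any non-fatal exception.

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.{StreamResult, StreamSource}
import scala.util.{Failure, Success, Try}

// Hypothetical stylesheet: emits the text of the srctitle element.
val xslt =
  """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    |  <xsl:output method="text"/>
    |  <xsl:template match="/"><xsl:value-of select="//srctitle"/></xsl:template>
    |</xsl:stylesheet>""".stripMargin

val transformer = TransformerFactory.newInstance().newTransformer(
  new StreamSource(new StringReader(xslt)))

// Hypothetical stand-in for proc.transform: throws if the XML cannot be parsed.
def transformOne(xml: String): String = {
  val out = new StringWriter()
  transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out))
  out.toString
}

// One well-formed record and one malformed record (unclosed elements).
val records = Seq(
  ("k1", "<doc><srctitle>A</srctitle></doc>"),
  ("k2", "<doc><srctitle>unclosed")
)

// Wrap each transformation so a bad record becomes a Failure instead of aborting the run.
val attempts = records.map { case (key, xml) => (key, Try(transformOne(xml))) }
val transformed = attempts.collect { case (key, Success(title)) => (key, title) }
val failedKeys = attempts.collect { case (key, Failure(_)) => key }
```

Keeping the keys of the failed records makes it possible to inspect or reprocess them later.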
There is also support for stylesheet parameters. The following sets the stylesheet parameter named publisher to the value <p>Elsevier</p>. This parameter can then be accessed in the stylesheet.
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor
import scala.collection.JavaConverters._
import java.util.HashMap
val stylesheet = sc.textFile("/mnt/spark-xml-utils/stylesheets/params.xsl").collect.head
xmlKeyPair.mapPartitions(recsIter => {
  val proc = XSLTProcessor.getInstance(stylesheet)
  recsIter.map(rec => {
    val stylesheetParams = new HashMap[String,String](Map("publisher" -> "<p>Elsevier</p>").asJava)
    proc.transform(rec._2, stylesheetParams)
  })
}).collect.foreach(println(_))
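On the stylesheet side, a parameter is declared with a top-level xsl:param and referenced as $publisher. The contents of params.xsl are not shown on this page, so the stylesheet below is a hypothetical example, applied with the JDK's javax.xml.transform processor as a stand-in for XSLTProcessor. It passes a plain string value for simplicity; how spark-xml-utils treats a value containing markup such as <p>Elsevier</p> is not covered by this sketch.

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.{StreamResult, StreamSource}

// Hypothetical stylesheet: declares the publisher parameter and prefixes the title with it.
val paramStylesheet =
  """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    |  <xsl:output method="text"/>
    |  <xsl:param name="publisher"/>
    |  <xsl:template match="/">
    |    <xsl:value-of select="concat($publisher, ': ', //srctitle)"/>
    |  </xsl:template>
    |</xsl:stylesheet>""".stripMargin

// Hypothetical record.
val sampleXml = "<doc><srctitle>Journal of Examples</srctitle></doc>"

val transformer = TransformerFactory.newInstance().newTransformer(
  new StreamSource(new StringReader(paramStylesheet)))

// JDK equivalent of passing a parameter map: set the value before transforming.
transformer.setParameter("publisher", "Elsevier")

val out = new StringWriter()
transformer.transform(new StreamSource(new StringReader(sampleXml)), new StreamResult(out))
val labelled = out.toString
```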
I have successfully used XSLTProcessor from the spark-shell and notebook environments (such as Databricks and Zeppelin). Depending on the environment, you just need to get the spark-xml-utils.jar installed and available to the driver and workers. For the spark-shell, something like the following would be done.
cd {spark-install-dir}
./bin/spark-shell --jars lib/uber-spark-xml-utils-1.4.0.jar
Alternatively, you can use the --packages option.
cd {spark-install-dir}
./bin/spark-shell --packages elsevierlabs-os:spark-xml-utils:1.4.0