Test Framework
Schedoscope's test framework integrates tightly with view development. It facilitates frequent test runs, speeding up the development-testing roundtrip.
From a conceptual point of view, the framework provides "unit testing" for Schedoscope views and transformations. The focus is neither on integration testing with a target / staging environment nor on load testing against large datasets. Rather, the framework supports correctness tests for views and transformations with simple specification of test input data, fast test execution, and painless checking of resulting view data against expected results.
The main design principles of Schedoscope's test framework are:
- Integration with Scalatest: Test specification and result checking is based on Scalatest, with its expressive library of assertions.
- Local transformation execution: In order to decouple testing from the availability of a Hadoop cluster environment and to achieve fast test execution times, the framework embeds various cluster components required for executing the different transformation types and configures most of them to run in local mode.
- Typesafe test data specification: Because the specification of input and output data of tests "lives" within Scala, we can completely eliminate errors stemming from wrong types, column names, etc. by compile time checks. Another beneficial side-effect of this approach is that one can use auto-completion in IDEs like Eclipse while writing tests.
- Default data generation: The framework encourages you to write separate tests for different aspects of a view transformation by generating reasonable default values for non-specified columns of input data. It is thus possible to focus on specific columns, reducing the effort of specifying test data to a minimum. This is in stark contrast to other approaches where the test data definition overhead encourages you to create one huge input data set covering all test cases.
Briefly summarized, writing a test of a given view V with the Schedoscope test framework consists of the following steps:
- defining the input data for all views that V depends on
- adapting the configuration of V to accommodate local test execution (if necessary)
- executing the transformation of V (in local mode)
- checking the outcomes against expected results
To express these steps you can choose between different testing styles, depending on your concrete test case.
The following section shows complete test examples based on the [tutorial](Open Street Map Tutorial); the subsequent sections detail the individual testing aspects.
The following example is taken from the tutorial. It tests the view Restaurants, which is populated by a Hive transformation from a single other view, Nodes. Comments can be found within the source code; details on specifying test data input can be found in the subsequent section Test Data Definition; the section Test Running & Result Checking then explains test execution and result inspection. Schedoscope ships with its own test spec for ScalaTest: SchedoscopeSpec.
class RestaurantsTest extends SchedoscopeSpec {
//
// specify input data (OSM nodes). By extending the actual view by
// "with rows", we can specify input data. Each call of "set"
// adds a line of input data; "v" sets the value for a particular
// column.
//
val nodes = new Nodes(p("2014"), p("09")) with rows {
set(v(id, "267622930"),
v(geohash, "t1y06x1xfcq0"),
v(tags, Map("name" -> "Cuore Mio",
"cuisine" -> "italian",
"amenity" -> "restaurant")))
set(v(id, "288858596"),
v(geohash, "t1y1716cfcq0"),
v(tags, Map("name" -> "Jam Jam",
"cuisine" -> "japanese",
"amenity" -> "restaurant")))
set(v(id, "302281521"),
v(geohash, "t1y17m91fcq0"),
v(tags, Map("name" -> "Walddörfer Croque Café",
"cuisine" -> "burger",
"amenity" -> "restaurant")))
set(v(id, "30228"),
v(geohash, "t1y77d8jfcq0"),
v(tags, Map("name" -> "Giovanni",
"cuisine" -> "italian")))
}
//
// run tests & check results. By extending the actual view by "with test",
// we enable test running and result checking. With "basedOn", the input
// views are specified; "then" actually runs the transformation,
// "numRows" yields the number of result rows, and "row" fetches a single
// line of test output. Within a row, individual column values can
// be checked by "v", using Scalatest assertions like "shouldBe".
//
// Please note that for sake of brevity, we only inspect the details
// of the first output row.
//
"datahub.Restaurants" should "load correctly from processed.nodes" in {
new Restaurants() with test {
basedOn(nodes)
then()
numRows shouldBe 3
row(v(id) shouldBe "267622930",
v(restaurant_name) shouldBe "Cuore Mio",
v(restaurant_type) shouldBe "italian",
v(area) shouldBe "t1y06x1")
}
}
}
As stated above, testing using the Schedoscope test framework is based on small, individually defined datasets. For this reason, the framework allows for the specification of datasets in a row-by-row manner. Because this process tends to be tedious, the framework helps users by
- supporting type checking and auto-completion of input data fields during test data specification, and
- generating default values for non-specified input fields.
To stick with the above example, the view Restaurants depends on the view Nodes. Its (simplified) definition looks like
case class Nodes(
year: Parameter[String],
month: Parameter[String]) extends View
[...] {
val version = fieldOf[Int]
val user_id = fieldOf[Int]
val longitude = fieldOf[Double]
val latitude = fieldOf[Double]
val geohash = fieldOf[String]
val tags = fieldOf[Map[String, String]]
[...]
}
Nodes maps to a Hive table with columns version, user_id, etc., and the partitions year and month. In order to generate input rows for this view, one creates an anonymous subclass of Nodes with the trait rows. Then, one can call the set method to add rows and v to set column values.
For example:
val nodes = new Nodes(p("2014"), p("09")) with rows {
set(v(version, 1),
v(user_id, 1234),
v(longitude, 11111.11),
v(latitude, 22222.22),
v(geohash, "t1y1716cfcq0"),
v(tags, Map("tag1" -> "val1",
"tag2" -> "val2")))
set(v(version, 2),
v(user_id, 5678),
v(longitude, 33333.33),
v(latitude, 44444.44),
v(geohash, "t1y1716cfcqx"),
v(tags, Map("tag3" -> "val3",
"tag3" -> "val3")))
}
This results in the following two rows in the input table:
version | user_id | longitude | latitude | geohash | tags |
---|---|---|---|---|---|
1 | 1234 | 11111.11 | 22222.22 | t1y1716cfcq0 | {'tag1':'val1', 'tag2':'val2'} |
2 | 5678 | 33333.33 | 44444.44 | t1y1716cfcqx | {'tag3':'val3', 'tag4':'val4'} |
For specifying lists and maps, the standard Scala types List and Map can be used.
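As an illustration, here is a minimal sketch using a hypothetical view and hypothetical field names (not part of the tutorial); list- and map-typed columns are simply populated with plain Scala collections:
// Hypothetical view for illustration only: one list-typed and one map-typed field.
case class PointsOfInterest() extends View {
  val names = fieldOf[List[String]]
  val tags = fieldOf[Map[String, String]]
}
val pois = new PointsOfInterest() with rows {
  // plain Scala collections are passed to v() for list and map columns
  set(v(names, List("Cuore Mio", "Jam Jam")),
      v(tags, Map("amenity" -> "restaurant")))
}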
In order to specify input for views which contain structures (i.e., Hive STRUCT fields),
refer to the section 'Defining Structs as Input'. Please note that:
- when defining data, auto-completion can be used - i.e. all fields of the current view are in scope;
- when defining values, these are type-checked against the column type at compile-time.
Often, when testing a specific aspect of a view, not all fields are relevant. For this reason, the testing framework allows one to skip fields when defining input and fills those irrelevant fields with default values. Using ROW_ID as a placeholder for the current row number, the scheme for generating default values for a field named FIELDNAME is (by type):
- String: FIELDNAME-ROW_ID (ROW_ID is left-padded with zeroes to length 4)
- Numeric types (Int, Double, ...): ROW_ID
- Map: empty Map()
- Array: empty Array()
- Struct: default value generation for structs is currently not supported
To stick with the Nodes example, let's say only longitude and latitude are of interest for our current test. Then we could write:
val nodes = new Nodes(p("2014"), p("09")) with rows {
set(v(longitude, 11111.11),
v(latitude, 22222.22))
set(v(longitude, 33333.33),
v(latitude, 44444.44))
}
The resulting table looks like this:
version | user_id | longitude | latitude | geohash | tags |
---|---|---|---|---|---|
1 | 1 | 11111.11 | 22222.22 | geohash-0001 | {} |
2 | 2 | 33333.33 | 44444.44 | geohash-0002 | {} |
When the test input data contains structured fields (i.e. fields of the Hive STRUCT type), each struct input must be defined manually. Let's say we have a struct that looks like this:
case class ProductInList() extends Structure {
val productName = fieldOf[String]
val productNumber = fieldOf[String]
val productPosition = fieldOf[String]
}
Then, values for it can be assigned by subclassing with the trait values, together with the set and v functions you already know for setting values:
val product0815 = new ProductInList with values {
set(
v(productName, "The Gadget"),
v(productNumber, "0815"),
v(productPosition, "rec-2"))
}
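Such a struct value can then be passed to set and v like any other value when defining rows. Here is a minimal sketch, assuming a hypothetical view OrderItems with a struct-typed field product (view and field names are made up for illustration):
// Hypothetical view with a struct-typed field, for illustration only.
case class OrderItems() extends View {
  val orderId = fieldOf[String]
  val product = fieldOf[ProductInList]
}
val orderItems = new OrderItems() with rows {
  // the struct value defined above is used just like any scalar value
  set(v(orderId, "order-4711"),
      v(product, product0815))
}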
The execution of tests is based on Scalatest and can look like this:
case class RestaurantsTest() extends SchedoscopeSpec {
val nodesInput = new Nodes(p("2014"), p("09")) with rows {
// input data definition from above
}
// test
"datahub.Restaurants" should "load correctly from processed.nodes" in {
new Restaurants() with test {
basedOn(nodesInput)
then()
numRows() shouldBe 3
row(v(id) shouldBe "267622930",
v(restaurant_name) shouldBe "Cuore Mio",
v(restaurant_type) shouldBe "italian",
v(area) shouldBe "t1y06x1")
}
}
}
As you can see, the view to be tested is actually instantiated by new Restaurants() (this is an example of a non-parameterized view). By extending it using with test, a set of testing functions is added to the view itself. Following the above example, these are:
- basedOn(View*): Using this function, we can define the actual input data for the view to be tested. Usually, one defines one dataset for each view dependency.
- then(sortedBy?, disableDependencyCheck?, disableTransformationValidation?): When calling this function, the materialization process of the view is triggered; in other words, the transformation which populates the view is executed in local mode. Plain local mode is currently available for Hive, Pig, and MapReduce transformations; Oozie transformations can be tested against a local minicluster (see the advanced section below). One can optionally pass a view field to then as a sorting criterion to make sure result rows are returned and checkable in a predictable order. As a sanity check on dependency declarations, the test framework verifies by default that each input view passed via basedOn() is among the tested view's dependencies and that at least one input view is passed for each type of dependency; this check can be disabled by setting disableDependencyCheck to true. Finally, the test framework performs some core validations on transformations, which can be disabled by setting disableTransformationValidation to true. Currently, this only checks whether each JOIN clause in your Hive transformations is accompanied by an ON predicate.
- numRows(): After the transformation has been executed, this function yields the number of rows in the resulting view.
- row(): By invoking this function, test results can be inspected row by row. Each call increments an internal row pointer (starting from the first row with the first call of this function); then, within the current row, each field can be individually inspected.
- v(Field): As a counterpart to the v function used for specifying input data, this function returns the value of the given field in the current result row. The value is returned in a typesafe manner and can be compared using the full Scalatest matcher functionality (e.g. shouldBe, shouldHaveLength, ...).
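For instance, building on the Restaurants example above, a sorting field can be passed to then so that row-by-row checks see a stable order (a minimal sketch; positional use of the optional sortedBy parameter described above):
new Restaurants() with test {
  basedOn(nodes)
  // sort result rows by restaurant_name so row() returns them in a predictable order
  then(restaurant_name)
  numRows() shouldBe 3
  // with this sorting, "Cuore Mio" is the first result row
  row(v(restaurant_name) shouldBe "Cuore Mio")
}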
Schedoscope offers some slight variations on the standard test style to enhance readability and reduce runtime for certain test scenarios.
The SchedoscopeSpec trait offers the possibility to declare a view under test to the test suite. This has two advantages:
- The view is only transformed once during a test.
- You can define several independent tests on the result of this one transformation.
To do this, you simply tell the test suite which views you want to transform before testing. This is done by passing a View with the test trait into putViewUnderTest():
class RestaurantsTest extends SchedoscopeSpec {
// specify input data. This step is the same as before.
val nodes = new Nodes(p("2014"), p("09")) with rows {
set(v(id, "267622930"),
v(geohash, "t1y06x1xfcq0"),
v(tags, Map("name" -> "Cuore Mio",
"cuisine" -> "italian",
"amenity" -> "restaurant")))
set(v(id, "288858596"),
v(geohash, "t1y1716cfcq0"),
v(tags, Map("name" -> "Jam Jam",
"cuisine" -> "japanese",
"amenity" -> "restaurant")))
}
//
//The view under test is registered at the test suite.
//The suite will transform this view at the beginning of the test run
//before any tests are evaluated.
//You can add multiple views under test.
//
val restaurant = putViewUnderTest{
new Restaurants() with test {
basedOn(nodes)
sortRowsBy(restaurant_name)
}
}
//
//Import the view under test in order to have access to its fields.
//If you have multiple views under test,
//you have to import the specific view you're testing into the test case.
//
import restaurant._
//Now you are able to define multiple tests on the same transformed view.
"datahub.Restaurants" should "have the right amount of rows" in {
numRows shouldBe 2
}
//A second test on the first row
it should "have a restaurant" in {
row(v(id) shouldBe "267622930",
v(restaurant_name) shouldBe "Cuore Mio",
v(restaurant_type) shouldBe "italian",
v(area) shouldBe "t1y06x1")
}
//
//If you are testing subsequent rows you have to
//forward the row index before defining assertions.
//This is done by calling startWithRow(index).
//
it should "have a second restaurant" in {
startWithRow(1)
row(v(id) shouldBe "288858596",
v(restaurant_name) shouldBe "Jam Jam",
v(restaurant_type) shouldBe "japanese")
}
}
As the above example shows, the then() method is no longer invoked by the developer; the transformation is triggered by the test suite in the background. You can still change the settings for the test transformation by using dedicated methods:
val restaurant = putViewUnderTest{
new Restaurants() with test {
basedOn(nodes)
sortRowsBy(restaurant_name)
disableDependencyCheck()
disableTransformationValidation()
}
}
The test styles shown so far require that all input views and their data are defined outside the test cases. You might, however, want to test the same transformation with different input sets. In the previous examples, this would mean defining the same view schema multiple times with different data, or once with all data for all test cases. ReusableHiveSchema provides a way to separate view definition from input data. This solves the problems discussed and provides the means for better-structured tests. The trait allows you to reuse predefined schemas for views in multiple tests. The test suite fills these schemas during the tests; after each test case, the schemas are emptied again. The following example shows a refactoring of the previous full test case:
class RestaurantsTest extends SchedoscopeSpec with ReusableHiveSchema {
// Specify an input schema:
val nodes = new Nodes(p("2014"), p("09")) with InputSchema
//
// Specify an output schema:
// The basedOn() method defines which `InputSchema` the `OutputSchema` will
// use as input for the tests.
//
val restaurant = new Restaurants() with OutputSchema {
basedOn(nodes)
}
//
//Import the output schema in order to have access to its fields.
//If you have multiple output schemas to test
//you have to import the specific schema
//you're testing into the test case.
//
import restaurant._
//
//Each test case now follows the pattern of
//filling the input schema with data,
//triggering a transformation and verifying the results.
//
it should "load an italian restaurant" in {
//Fill the input schema
{
//define which schema you're accessing.
//Make sure to enclose it in its own scope.
import nodes._
set(v(id, "267622930"),
v(geohash, "t1y06x1xfcq0"),
v(tags, Map("name" -> "Cuore Mio",
"cuisine" -> "italian",
"amenity" -> "restaurant")))
}
//trigger the test transformation
then(restaurant)
//validate input
numRows() shouldBe 1
row(v(id) shouldBe "267622930",
v(restaurant_name) shouldBe "Cuore Mio",
v(restaurant_type) shouldBe "italian",
v(area) shouldBe "t1y06x1")
}
//
//Use the schemas for another test.
//
it should "load an japanese restaurant" in {
{
import nodes._
set(v(id, "288858596"),
v(geohash, "t1y1716cfcq0"),
v(tags, Map("name" -> "Jam Jam",
"cuisine" -> "japanese",
"amenity" -> "restaurant")))
}
then(restaurant)
numRows() shouldBe 1
row(v(id) shouldBe "288858596",
v(restaurant_name) shouldBe "Jam Jam",
v(restaurant_type) shouldBe "japanese")
}
}
The ReusableHiveSchema trait also tries to reuse some resources between test cases, so it should provide minor runtime improvements compared to the default test style.
Apart from the standard testing functionality explained above, the Schedoscope testing framework also offers more advanced features.
Within Schedoscope, all transformations support a generic configuration mechanism using key-value pairs. Typically, the required configuration is specified directly within the transformation definition of the view like this (simplified Pig transformation, taken from the tutorial view Trainstations):
transformVia(() =>
PigTransformation(
scriptFromResource("pig_scripts/datahub/insert_trainstations.pig"))
.configureWith(
Map("storage_format" -> "parquet.pig.ParquetStorer()")))
As you can see, view configuration is done by the configureWith method.
In some cases, it can be necessary to override these specifications for testing.
In the current example, the Pig script produces Parquet output;
however, for local testing, a plain text output (i.e. PigStorage)
is desired. This can be accomplished by the following mechanism
in the test case:
"datahub.Trainstations" should "load correctly from processed.nodes" in {
new Trainstations() with test {
basedOn(nodesInput)
withConfiguration(
("storage_format" -> "PigStorage()"))
...
}
}
This allows you to override any predefined configuration property in order to adapt views to the test environment.
Another common use case is that transformations (e.g. Hive queries or MapReduce jobs) require resource files during their execution. The location of these resource files is usually configured directly in the view definition. Let's say we have a little example view called FilteredUsers, which is populated by a Hive transformation that includes a UDF called filter_users, taking a username as an argument together with a path to a user whitelist file:
case class FilteredUsers() extends View {
[...]
transformVia(() =>
HiveTransformation(
"SELECT filter_users(user_name, ${user_whitelist}) FROM users",
withFunctions(
this,
Map("filter_users" -> classOf[GenericUDFFilterUsers])))
.configureWith(
Map("user_whitelist" -> "/hdfs/path/to/user.whitelist")))
[...]
}
Now, this transformation will run perfectly fine in the target Hadoop cluster, because the user whitelist file has (hopefully ;) ) been deployed there; however, for local testing, this file is not available. For this purpose, the Schedoscope testing framework allows you to specify a local resource for the given configuration property as follows:
case class FilteredUsersTest() ... {
[...]
"filtered users" should "load correctly from users" in {
new FilteredUsers() with test {
basedOn(...)
withResource(("user_whitelist", "src/test/resources/user.whitelist"))
then()
[...]
}
}
}
Using this approach, test resource files can be managed comfortably within the standard Maven project layout and are made available during local-mode testing.
In order to speed up tests, transformations are run completely in local mode where possible; this is currently supported for Hive, MapReduce, and Pig transformations. However, running e.g. an Oozie transformation requires more infrastructure. For this purpose, the framework uses a predefined Hadoop minicluster, which is basically a small virtual Hadoop cluster running directly in the JVM. It ships with everything necessary to run Oozie transformations.
Setting up the minicluster is easy: just replace the with test clause in the test case specification by with oozieTest. Behind the scenes, the minicluster will then be launched prior to test execution. Properties of the minicluster (e.g. the namenode URI) can be accessed via the cluster() method:
"my cool view" should s"load correctly " in {
new CoolView() with oozieTest {
basedOn(dependency1)
withConfiguration(
("jobTracker" -> cluster().getJobTrackerUri),
("nameNode" -> cluster().getNameNodeUri),
...
)
}
}
A requirement for this to work is that the minicluster can use some specific ports; these have to be defined in a file called minicluster.properties on your project classpath. We recommend placing it at src/test/resources/minicluster.properties with the following content:
hadoop.namenode.port=9000
hive.server.port=10000
hive.metastore.port=30000