The goal of this library is to let Collibra users run the Catalog data profiling jobs on their own Apache Spark clusters.
By default, profiling jobs are executed in JobServer, which runs Spark in local mode (on a single machine). Thanks to this library, Collibra customers can leverage their own infrastructure and scale up profiling jobs to get more out of their Catalog.
Because the profiling library users control the data that is profiled, they can also ingest and profile data sources that are not supported out-of-the-box by Collibra Catalog. They can define their own Spark DataSet, run the profiling library and then transfer the result to Collibra Catalog.
The library is designed to be used directly inside Spark driver code. It plays a role similar to libraries such as MLlib or Spark SQL. The profiler jar should be added to the dependencies of your Spark application.
The entry point to the profiling library is the com.collibra.catalog.profilers.api.Profilers class. This class can be imported and used directly in your Spark code. In its simplest form, you provide a DataSet and choose the level of profiling you want:
- basic statistics
- basic statistics and quantiles
- full profiling
Each level builds on the previous one, so higher levels take longer to process.
Example:
ColumnProfilesUpdate profileUpdate = Profilers.profileTable(dataset, ProfileFeatures.BASIC_STATISTICS);
After profiling is completed, the result should be transferred to Collibra Catalog using the Collibra Catalog profiling REST API. This API enables client applications to send and store profiling data in Catalog assets. Typically, Collibra Connect is used as a middleware between Collibra Data Governance Center and the Spark cluster.
The result of the profiling job is a ColumnProfilesUpdate object. This object is provided in a format that is close to the one used by the Collibra Catalog profiling REST API. Only the asset information is missing. There are two ways to add the missing information to the profiling results:
- After the profiling result is received, loop over the ColumnProfile objects it contains and add the AssetIdentifier object in each one of them.
- Provide a method in the Profilers.profileTable call and let the profiling library loop over the ColumnProfile objects for you.
Depending on your architecture, it's possible that the information required to add the AssetIdentifiers is not available in your Spark job. In that case, only the first option can be used and adding the missing information must be done in another node (e.g. in a Collibra Connect script).
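The two options above can be sketched in plain Java. This is only an illustration of the control flow: SketchColumnProfile, SketchAssetIdentifier and their fields are hypothetical stand-ins written for this example, not the library's ColumnProfile and AssetIdentifier classes, and the real Profilers.profileTable signature may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class AssetIdentifierSketch {

    // Option 1: loop over the profiles after profiling has finished.
    static void attachAfterwards(List<SketchColumnProfile> profiles, String domainName) {
        for (SketchColumnProfile profile : profiles) {
            profile.assetIdentifier = new SketchAssetIdentifier(profile.columnName, domainName);
        }
    }

    // Option 2: hand a resolver function to the profiling step, which
    // applies it to every column profile it produces.
    static List<SketchColumnProfile> profileWithResolver(
            List<String> columnNames,
            Function<String, SketchAssetIdentifier> resolver) {
        List<SketchColumnProfile> profiles = new ArrayList<>();
        for (String columnName : columnNames) {
            SketchColumnProfile profile = new SketchColumnProfile(columnName);
            profile.assetIdentifier = resolver.apply(columnName);
            profiles.add(profile);
        }
        return profiles;
    }

    public static void main(String[] args) {
        List<SketchColumnProfile> profiles = profileWithResolver(
                List.of("sepal_length", "species"),
                name -> new SketchAssetIdentifier(name, "Iris Domain"));
        for (SketchColumnProfile profile : profiles) {
            System.out.println(profile.columnName + " -> " + profile.assetIdentifier.domainName);
        }
    }
}

// Hypothetical stand-in for the library's asset identifier (fields are assumptions).
final class SketchAssetIdentifier {
    final String columnName;
    final String domainName;

    SketchAssetIdentifier(String columnName, String domainName) {
        this.columnName = columnName;
        this.domainName = domainName;
    }
}

// Hypothetical stand-in for a single column profile.
final class SketchColumnProfile {
    final String columnName;
    SketchAssetIdentifier assetIdentifier; // null until attached

    SketchColumnProfile(String columnName) {
        this.columnName = columnName;
    }
}
```

The first option keeps the Spark job free of any Catalog knowledge, which is why it is the only choice when the asset information lives outside the cluster.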
In addition to the ProfileFeatures enum, which lets you select the level of profiling, some additional parameters can be tuned to better control the profiler behavior. These parameters are passed to the profiling jobs by providing a ProfilingConfiguration object when calling Profilers.profileTable.
These are the available parameters:
- CacheLevel: Tells the profiling jobs if and how to cache data when they identify points where caching can improve performance. The levels are the same as those defined in org.apache.spark.api.java.StorageLevels. Set the level to NONE to prevent caching.
- MaximumValueLength: Defines how many characters are used by profiling jobs for handling long text values.
- DefaultDatePattern: Defines the default date pattern used for date detection. The pattern format in use is described in java.time.format.DateTimeFormatter.
- DefaultTimePattern: Defines the default time pattern used for time detection. The pattern format in use is described in java.time.format.DateTimeFormatter.
- DefaultDateTimePattern: Defines the default date-time pattern used for date-time detection. The pattern format in use is described in java.time.format.DateTimeFormatter.
- MissingValuesDefinition: A list of values that should be considered as missing or empty when counting the number of missing values in a column.
All those parameters are initialized with sensible defaults and are therefore optional.
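To make the pattern and missing-value parameters more concrete, here is a minimal sketch that uses only the JDK. The pattern yyyy-MM-dd and the missing values "" and "N/A" are illustrative choices, not the library's defaults, and the two helper methods are assumptions about what such checks could look like rather than the profiler's actual implementation.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

final class ProfilingParameterSketch {

    // Returns true when the value parses as a date under the given
    // java.time.format.DateTimeFormatter pattern.
    static boolean matchesDatePattern(String value, String pattern) {
        try {
            LocalDate.parse(value, DateTimeFormatter.ofPattern(pattern));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Counts nulls plus every value listed in the missing-values definition.
    static long countMissing(List<String> column, Set<String> missingValues) {
        long missing = 0;
        for (String value : column) {
            if (value == null || missingValues.contains(value)) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(matchesDatePattern("2019-03-14", "yyyy-MM-dd")); // true
        System.out.println(matchesDatePattern("14/03/2019", "yyyy-MM-dd")); // false

        List<String> column = Arrays.asList("5.1", "", "N/A", null, "4.9");
        System.out.println(countMissing(column, Set.of("", "N/A"))); // 3
    }
}
```

Trying a candidate pattern against a few sample values in this way is a quick check before passing it to the profiler configuration.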
This project showcases the use of the profiling jobs used in Collibra Catalog. The profiling result is then uploaded to an instance of Collibra Catalog through the Collibra Catalog profiling REST API.
One example covers profiling a CSV file (included in the project). A second example covers profiling a database table via JDBC. For the second example, developers are expected to adapt the code to connect to their own data sources.
In order to be able to use the Collibra Catalog profiling REST API, a simple Java REST client is included in the project. This client is by no means a suggested implementation for such functionality. It is added purely for illustrative purposes. A more common strategy is to establish communication between the Spark cluster or Hadoop environment and Collibra Data Governance Center using Collibra Connect.
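For readers who want a feel for what such a client involves, the sketch below builds (but does not send) an HTTP request with the JDK's java.net.http API (Java 11 or later). It is not the ProfilingApiClient from this project. The endpoint URL, the use of POST, the basic authentication scheme and the empty JSON body are all placeholders chosen for illustration; consult the Collibra Catalog profiling REST API documentation for the real values.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class ProfilingRequestSketch {

    // Builds a JSON request with basic authentication.
    // The HTTP method and auth scheme are assumptions, not confirmed API details.
    static HttpRequest buildRequest(String endpointUrl, String user, String password, String jsonBody) {
        String credentials = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(endpointUrl))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + credentials)
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL: replace with the endpoint documented for your DGC instance.
        HttpRequest request = buildRequest(
                "https://dgc.example.com/hypothetical/profiling/endpoint",
                "user", "secret", "{}");
        System.out.println(request.method() + " " + request.uri());
        // Sending would look like:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```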
A key aspect of writing the data profiles to Catalog is matching columns with Column assets. The two examples show a way to add asset identification information.
Please note that the Collibra Catalog profiling REST API expects the assets to already exist and only tries to add the profiling information. Hence, a common strategy is to first create the relevant assets using a simple Catalog ingestion or a Collibra Connect script, and then use Collibra Connect again to send the profiling information to Collibra Catalog. This Collibra Connect script would also be in charge of linking each column profile to a Column asset using the AssetIdentifier data structure.
Since the profiling library is only distributed through the Collibra Marketplace, this project does not contain the library directly. Therefore, the first steps to run the example are the following:
- Download your own copy of the Collibra Catalog Profiling Library from Collibra Marketplace.
- Update the project classpath by either
  - storing the profiler jar file in the libs directory of this project, or
  - adapting the build.gradle dependencies to point to a valid location of that library.
Then, depending on which example you are running, you may also need to change a couple of other properties:
- the connection information for the JDBC DataSet in Jdbc.java.
- the information to connect to the Data Governance Center using the REST API (both examples). See the ProfilingApiClient usage in the examples.
You can build the project using the Gradle wrapper. The following command compiles the main classes:
./gradlew classes
Calling the run Gradle command executes the CSV example:
./gradlew run
To execute the JDBC example, pass the jdbc parameter to Gradle:
./gradlew -Pjdbc run
Library version | Collibra DGC version | Apache Spark version |
---|---|---|
collibra-profiler-1.0.0.jar | 5.6.1 | 2.2.3 |
Contributions are welcome in the form of pull requests. Those are subject to approval by Collibra. We expect contributors to follow the code of conduct defined here.
The examples in this project are released under the following license: LICENSE.
The Collibra Catalog profiler library is available at Collibra Marketplace to Collibra Catalog license owners under the same license terms as Collibra Catalog.
The iris dataset originates from https://archive.ics.uci.edu/ml/datasets/iris