
User manual

pvanliefland edited this page Jan 3, 2024 · 42 revisions

This guide will walk you through the main features of the OpenHEXA data integration platform.

It mostly covers the basic concepts as well as the user interface.

If you are a data scientist / analyst / engineer planning to deploy code on the OpenHEXA platform, you might be interested in the following guides:

About workspaces

OpenHEXA workspaces are simply a way to group code, data and users. You can think of workspaces as projects or teams depending on your use case.

In OpenHEXA, everything happens within a workspace.

Switching between workspaces

If you have received access to several workspaces, you can change the active workspace by using the contextual menu on the top-left of the screen.

Switching.between.workspaces.mov

Editing the workspace homepage

The workspace homepage can be used to explain the purpose of the workspace, to document its data and processes, or to add useful links to external resources.

If you have the editor role within the workspace, you can change the homepage content simply by clicking on the "Edit" button. You will have to use Markdown for the formatting of the content itself.

Editing.the.workspace.homepage.mov

File management in workspaces

Every workspace comes with a shared filesystem. This filesystem allows you to store and retrieve any kind of file.

Depending on your privileges within the workspace, you will be able to:

  • Browse and download files
  • Upload files
  • Create directories
  • Delete files and directories
Using.the.file.browser.mov

The workspace filesystem can be used in OpenHEXA notebooks and in OpenHEXA data pipelines.

Using the workspace database

Every workspace is associated with a dedicated PostgreSQL relational database. This database can be used to store structured (relational) data for your workspace. It can be easily accessed from external visualization solutions such as Tableau or PowerBI.

Depending on your privileges, you will be able to:

  • Browse the database tables
  • View the database connection parameters
  • Preview the data in each table and view basic column information
The.workspace.database.mov

The workspace database can be used in OpenHEXA notebooks and in OpenHEXA data pipelines.

You can also use your OpenHEXA workspace database as a data source in data visualization / BI tools, such as Tableau or PowerBI: simply copy-paste the connection parameters displayed on the database page into your tool of choice.
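Some tools and libraries expect a single connection URL rather than separate parameters. The following is a minimal sketch of how the individual values could be combined into a PostgreSQL URL. The helper `build_postgres_url` and all the values below are hypothetical placeholders, not part of OpenHEXA; use the parameters shown on your own database page.

```python
from urllib.parse import quote


def build_postgres_url(host: str, port: int, database: str, user: str, password: str) -> str:
    """Assemble a PostgreSQL connection URL from individual parameters."""
    # Percent-encode the credentials so that special characters
    # (such as "@" or "/") do not break the URL structure.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{database}"


# Placeholder values, for illustration only.
url = build_postgres_url(
    host="db.example.org",
    port=5432,
    database="my_workspace_db",
    user="my_workspace_user",
    password="p@ss/word",
)
print(url)
```

Encoding the credentials matters because generated passwords may contain characters that are reserved in URLs.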

Adding and managing connections

Connections are the external data sources that you want to use within a workspace. They serve two purposes in OpenHEXA:

  1. They allow you to safely store credentials for external data systems, so that they can be reused in notebooks or in data pipelines
  2. They can be used to document which data sources are used within a workspace, and how they are used
Connections.mov

Connections can be used in OpenHEXA notebooks and in OpenHEXA data pipelines.

OpenHEXA supports the following connection types:

  • PostgreSQL
  • DHIS2
  • S3 (Amazon Web Services)
  • GCS (Google Cloud Platform)
  • Custom

Using built-in connection types

Using the built-in connection types (PostgreSQL, DHIS2, S3 or GCS) is straightforward: you just need to fill in the connection creation form.

These built-in connection types have two fields in common:

  • Connection name: this is what OpenHEXA will display in the connections screen; you can choose any name you want
  • Description: optional, you may use it to document the purpose of the data source within the workspace

A unique identifier will be added to the connection based on the connection's name.
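The exact rule OpenHEXA applies to derive this identifier is not described here. Purely as an illustration of the general idea, the sketch below turns a name into a lowercase, hyphen-separated identifier and appends a numeric suffix when the result is already taken. Both `slugify` and `unique_identifier` are hypothetical helpers, and the real identifiers may look different.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Turn a human-readable name into a lowercase, hyphen-separated slug."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alphanumeric characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower())
    return slug.strip("-")


def unique_identifier(name: str, existing: set[str]) -> str:
    """Return a slug for `name` that does not collide with `existing`."""
    base = slugify(name)
    candidate = base
    suffix = 2
    # Append an increasing suffix until the identifier is free.
    while candidate in existing:
        candidate = f"{base}-{suffix}"
        suffix += 1
    return candidate


print(slugify("Santé DHIS2 (prod)"))
print(unique_identifier("My DHIS2", {"my-dhis2"}))
```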

Using custom connections

Custom connections can be useful to store credentials for systems for which OpenHEXA has no built-in support.

Like the built-in connection types, custom connections have a name and an optional description. Apart from these standard fields, you may add any number of fields to your custom connection, and flag them as secret if they are sensitive (passwords, API tokens, ...).
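To make the structure of a custom connection more concrete, here is a minimal sketch that models it as a name, an optional description and a list of fields, some of which are flagged as secret. The classes `ConnectionField` and `CustomConnection` are hypothetical and only illustrate the concept; they are not the OpenHEXA data model or SDK.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionField:
    name: str
    value: str
    secret: bool = False  # flag sensitive values such as passwords or API tokens

    def display_value(self) -> str:
        # Never show the real value of a secret field.
        return "********" if self.secret else self.value


@dataclass
class CustomConnection:
    name: str
    description: str = ""
    fields: list[ConnectionField] = field(default_factory=list)

    def as_display_dict(self) -> dict[str, str]:
        """Return field values that are safe to show on screen or in logs."""
        return {f.name: f.display_value() for f in self.fields}


# Placeholder values, for illustration only.
connection = CustomConnection(
    name="My API",
    description="Hypothetical external API",
    fields=[
        ConnectionField("url", "https://api.example.org"),
        ConnectionField("token", "abc123", secret=True),
    ],
)
print(connection.as_display_dict())
```

Flagging a field as secret is what allows its value to be hidden wherever the connection is displayed.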

Using notebooks

The OpenHEXA notebooks component is built on top of the Jupyter stack.

Please refer to the wiki entry dedicated to OpenHEXA notebooks for more information.

OpenHexa.notebooks.mov

Using pipelines

OpenHEXA pipelines can be considered as small data applications. They can be used for a wide variety of scenarios, such as:

  • Data processing / ETL (Extract, Transform, Load) operations
  • Automatic generation of PDF or Microsoft Word reports based on data hosted in OpenHEXA
  • Connecting different data systems (fetch data from a system, transform it and push it to another system)

OpenHEXA pipelines are written as Python programs. This wiki has a dedicated guide to pipeline development.

Launching and monitoring pipelines

Once a data engineer or data scientist has deployed a pipeline in a workspace, all workspace members will be able to launch it.

To run a pipeline, simply choose a pipeline within the workspace and click on the "Run" call-to-action. You can then enter the pipeline parameters, and optionally choose to receive a notification when the pipeline has finished running (meaning that it has either completed successfully or failed).

While the pipeline is running, you will be able to monitor its status and read the messages sent by the pipeline to OpenHEXA. Those messages are meant to provide feedback on what is happening in the pipeline; it is the responsibility of the pipeline developer to implement them.

If the pipeline produces outputs (such as files or entries in the workspace database), you will be able to access them once the pipeline has run successfully.

Users can also inspect the results of previous pipeline runs.

Running.and.monitoring.pipelines.mov

Scheduling pipelines

In addition to being run manually, pipelines can also be scheduled to run automatically.

You can specify a schedule for a pipeline by accessing its detail page and editing the "Scheduling" section.

You need to enter the schedule as a cron expression. If you are unsure about the syntax to use, https://crontab.guru/ might help - otherwise you may want to contact a data engineer or a software developer in your team for assistance.
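A standard cron expression has five space-separated fields: minute, hour, day of month, month and day of week. For example, `0 6 * * 1` means 06:00 every Monday and `*/15 * * * *` means every 15 minutes. The sketch below performs a basic sanity check of such an expression. It assumes the standard five-field numeric syntax (as documented on crontab.guru) and does not handle names such as `MON` or `JAN`; `is_valid_cron` is a hypothetical helper, not part of OpenHEXA.

```python
# (field name, lowest allowed value, highest allowed value)
FIELD_RANGES = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day of month", 1, 31),
    ("month", 1, 12),
    ("day of week", 0, 7),  # both 0 and 7 commonly stand for Sunday
]


def _valid_part(part: str, low: int, high: int) -> bool:
    # Split off an optional step, e.g. "*/15" or "0-30/5".
    base, slash, step = part.partition("/")
    if slash and not (step.isdecimal() and int(step) > 0):
        return False
    if base == "*":
        return True
    # What remains is either a single number or a range "start-end".
    start, dash, end = base.partition("-")
    if not start.isdecimal() or (dash and not end.isdecimal()):
        return False
    first = int(start)
    last = int(end) if dash else first
    return low <= first <= last <= high


def is_valid_cron(expression: str) -> bool:
    """Check a five-field, numeric-only cron expression."""
    fields = expression.split()
    if len(fields) != len(FIELD_RANGES):
        return False
    return all(
        _valid_part(part, low, high)
        for value, (_, low, high) in zip(fields, FIELD_RANGES)
        for part in value.split(",")  # a field may be a comma-separated list
    )


for candidate in ["0 6 * * 1", "*/15 * * * *", "0 0 1 * *", "60 * * * *", "0 6 * *"]:
    print(candidate, "->", is_valid_cron(candidate))
```

A check like this only catches malformed expressions; it does not tell you whether the schedule matches your intent, so reading the expression back in plain language remains useful.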

You can also choose which workspace members (including yourself) should be notified after each scheduled run.

Scheduling.pipelines.mov

As you cannot provide parameters when scheduling a pipeline, you will only be able to schedule a pipeline if it has no parameter or if all its parameters are optional (see the relevant documentation in the pipelines writing guide).
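The scheduling rule above can be sketched as a simple check. The `Parameter` class and the `can_be_scheduled` function are hypothetical and only illustrate the logic; they do not reflect how OpenHEXA represents pipeline parameters internally.

```python
from dataclasses import dataclass


@dataclass
class Parameter:
    code: str
    required: bool = True


def can_be_scheduled(parameters: list[Parameter]) -> bool:
    """A pipeline can be scheduled if it has no parameters or only optional ones."""
    # all() over an empty list is True, which covers the "no parameter" case.
    return all(not parameter.required for parameter in parameters)


print(can_be_scheduled([]))
print(can_be_scheduled([Parameter("year", required=False)]))
print(can_be_scheduled([Parameter("year", required=False), Parameter("country")]))
```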

Pipeline timeouts

All pipelines will time out after a specific duration, which means that long-running workflows might be interrupted. The exact duration depends on the configuration of your OpenHEXA instance. The standard configuration sets the timeout at 4 hours, and allows pipeline authors to change this value up to a maximum of 12 hours.
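As an illustration of the standard configuration described above, the sketch below resolves the timeout that would apply to a pipeline: 4 hours when the author sets nothing, and any requested value up to 12 hours. The function `effective_timeout` is hypothetical, and rejecting out-of-range values with an error is an assumption made for this example; the platform may handle them differently, and your instance may use other limits.

```python
from datetime import timedelta
from typing import Optional

# Values from the standard configuration; your instance may differ.
DEFAULT_TIMEOUT = timedelta(hours=4)
MAX_TIMEOUT = timedelta(hours=12)


def effective_timeout(requested: Optional[timedelta] = None) -> timedelta:
    """Return the timeout that applies, given an optional author-requested value."""
    if requested is None:
        return DEFAULT_TIMEOUT
    if requested <= timedelta(0):
        raise ValueError("The timeout must be a positive duration")
    if requested > MAX_TIMEOUT:
        # Assumption: values above the maximum are rejected rather than capped.
        raise ValueError(f"The timeout cannot exceed {MAX_TIMEOUT}")
    return requested


print(effective_timeout())
print(effective_timeout(timedelta(hours=8)))
```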

Datasets

🔬 Datasets are an experimental feature in OpenHEXA: the implementation has not been stabilized yet and might change significantly.

Datasets can be used to share or highlight data in your workspace. They can be useful to:

  • Upload input data that can already be considered as consolidated / validated
  • Highlight valuable data built in your workspace
  • Share data from one workspace to another workspace

Datasets are versioned and thus allow you to track changes in your data over time.

You can create a dataset using the web interface (see below) or using the OpenHEXA SDK.

Creating.datasets.mov

Datasets can easily be shared with other workspaces:

Sharing.datasets.mov

Workspace settings

As of now, the workspace settings section allows users to:

  • Invite and manage users
  • Regenerate the workspace database password

This section is limited to users with admin privileges within the workspace.

Inviting and managing users

Inviting users is a straightforward process: click on "Invite Member", enter the invitee's email address and choose their privilege level within the workspace (see below for more details about workspace privileges).

As long as you are a workspace admin, you can also easily change the privilege level of another member, or remove a member from the workspace.

Inviting.and.managing.users.mov

Roles and permissions

The actions that a user can take within a workspace are determined by their role.

The following table summarizes the permissions for each role:

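| Features                     | Viewers | Editors | Admins |
|------------------------------|---------|---------|--------|
| Read & download files        | x       | x       | x      |
| Write files                  | -       | x       | x      |
| View database content        | x       | x       | x      |
| View database credentials    | -       | x       | x      |
| Write to database            | -       | x       | x      |
| Regenerate database password | -       | -       | x      |
| Read & download datasets     | x       | x       | x      |
| Write datasets               | -       | x       | x      |
| Use connections              | -       | x       | x      |
| Manage connections           | -       | -       | x      |
| Launch pipelines             | x       | x       | x      |
| Create pipelines             | -       | x       | x      |
| Use notebooks                | -       | x       | x      |
| Update workspace description | -       | x       | x      |
| Manage & invite users        | -       | -       | x      |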
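
The table above can also be read as a lookup from feature to allowed roles. The sketch below encodes it that way; the role names, the `PERMISSIONS` mapping and the `is_allowed` function are hypothetical and simply restate the table, they are not the OpenHEXA implementation.

```python
# Role groups, named after the columns of the table.
ALL_ROLES = {"viewer", "editor", "admin"}
EDITORS_AND_ADMINS = {"editor", "admin"}
ADMINS_ONLY = {"admin"}

PERMISSIONS = {
    "Read & download files": ALL_ROLES,
    "Write files": EDITORS_AND_ADMINS,
    "View database content": ALL_ROLES,
    "View database credentials": EDITORS_AND_ADMINS,
    "Write to database": EDITORS_AND_ADMINS,
    "Regenerate database password": ADMINS_ONLY,
    "Read & download datasets": ALL_ROLES,
    "Write datasets": EDITORS_AND_ADMINS,
    "Use connections": EDITORS_AND_ADMINS,
    "Manage connections": ADMINS_ONLY,
    "Launch pipelines": ALL_ROLES,
    "Create pipelines": EDITORS_AND_ADMINS,
    "Use notebooks": EDITORS_AND_ADMINS,
    "Update workspace description": EDITORS_AND_ADMINS,
    "Manage & invite users": ADMINS_ONLY,
}


def is_allowed(role: str, feature: str) -> bool:
    """Return True if `role` may use `feature` according to the table."""
    return role in PERMISSIONS[feature]


print(is_allowed("viewer", "Launch pipelines"))
print(is_allowed("editor", "Manage connections"))
```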

Regenerating the workspace database password

When a workspace is created, a database is automatically provisioned, along with a username and a password.

If the password is compromised somehow, you can invalidate it and create a new one in the settings section.

Regenerating.the.workspace.database.password.mov