Job Scheduler

The Job Scheduler allows you to import data from a SQL database into HDFS connected to TAP. Data can be imported in batch mode or by scheduling periodic, automatic updates.

Import data from a SQL database

From the TAP console main menu, navigate to Job Scheduler and then Import data.

TAP displays a form for you to fill out, starting with a job name. Pick a unique name for your job. The form's fields are described below.

  • JDBC URI - You can enter the URI directly, or fill in the fields above the URI field to build a URI of the required form:
jdbc:driver://host:port/database_name

You can add optional JDBC URI parameters, for example to turn on SSL for a PostgreSQL connection:

jdbc:postgresql://host:port/database_name?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory

(Note that URI parameters are driver-specific; check your database driver's documentation for details.)
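
Before submitting a job, it can help to verify that the URI and credentials work outside TAP. The following is a minimal sketch using plain JDBC, assuming a PostgreSQL source; the URI, username, and password are placeholders, and the matching driver JAR must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class JdbcUriCheck {
    public static void main(String[] args) {
        // Placeholder URI in the jdbc:driver://host:port/database_name form;
        // substitute your own driver, host, port, database, and credentials.
        String uri = "jdbc:postgresql://host:5432/database_name";
        try (Connection conn = DriverManager.getConnection(uri, "username", "password")) {
            System.out.println("Connected to " + conn.getMetaData().getDatabaseProductName());
        } catch (SQLException e) {
            System.err.println("Connection failed: " + e.getMessage());
        }
    }
}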

  • Username and Password - The credentials used to connect to the data source.

  • Table - This is the name of the database table to be imported into HDFS.

  • Destination dir - The directory in the target HDFS where the imported data will be stored. Note: Make sure you have write access to this directory.

  • Choose import mode - There are three import modes available: Append, Overwrite, and Incremental.

    • Append - Each import will fetch the whole table into a separate file. Results of previous imports will not be overwritten. Files on HDFS will have names in the pattern: part-m-00000, part-m-00001, and so on.
    • Overwrite - Each import will fetch the entire source table and overwrite results of the previous import, using part-m-00000 for the filename.
    • Incremental - The first import fetches records whose value in the column identified by the Column name parameter is not lower than the Value parameter. Each subsequent import fetches only records whose value in that column is higher than any previously fetched value. For this purpose we recommend an auto-incremented numeric column (a conceptual sketch of this filtering appears after this list).
      • Column name - The database column (unique, numeric) against which Value is checked; it uniquely identifies the data to be imported.
      • Value - A reference value used to filter records from the source database; only records whose value in the Column name column is not smaller than this reference value are imported.

  • Start time - The start time of your job.
    • Note: When you enter a Start time prior to the current time, Oozie will try to “catch up” by executing jobs from the past (the second sketch after this list illustrates this behavior).
  • End time - The end time of your job.
    • End time should always be later than Start time.
  • Frequency - The frequency with which your job will be submitted.
  • Timezone - The ID of the time zone in which the entered start and end times are interpreted.
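
Conceptually, the Incremental mode filters records like the JDBC loop below. This is only an illustrative sketch of the semantics described above, not TAP's actual implementation; the events table and id column are placeholder names, and the write to HDFS is elided:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class IncrementalImportSketch {
    // Fetches rows whose id is not lower than lastValue (the first run uses
    // the Value parameter) and returns the bound for the next run, so that
    // only strictly higher values are fetched next time.
    static long importBatch(Connection conn, long lastValue) throws SQLException {
        long maxSeen = lastValue - 1;
        String sql = "SELECT * FROM events WHERE id >= ? ORDER BY id";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lastValue);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    // ... write the row to a part file on HDFS here ...
                    maxSeen = Math.max(maxSeen, id);
                }
            }
        }
        return maxSeen + 1; // next run starts just above the highest id seen
    }
}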

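The catch-up behavior noted under Start time can be pictured by materializing one run per Frequency interval between Start time and End time, as in the sketch below. It models the coordinator's behavior only at a very high level; the schedule values and UTC time zone are placeholders:

import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class CatchUpSketch {
    public static void main(String[] args) {
        // Placeholder schedule: a Start time five hours in the past, hourly
        // frequency, times interpreted in the UTC time zone.
        ZoneId tz = ZoneId.of("UTC");
        ZonedDateTime now = ZonedDateTime.now(tz);
        ZonedDateTime start = now.minusHours(5);
        ZonedDateTime end = now.plusHours(2);
        Duration frequency = Duration.ofHours(1);

        // One run is materialized per interval between start and end. Runs
        // whose time has already passed are submitted immediately - the
        // "catch up" behavior noted under Start time.
        for (ZonedDateTime t = start; !t.isAfter(end); t = t.plus(frequency)) {
            String when = t.isBefore(now) ? "runs immediately (catch-up)" : "runs at its scheduled time";
            System.out.println(t + " -> " + when);
        }
    }
}
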
Job browser

Selecting Job Scheduler and then Job browser from the TAP main menu allows you to view scheduled jobs. The Job browser page has two tabs: Workflow jobs and Coordinator jobs.

  • Workflow jobs - On this tab, you can see a list of workflow jobs. Workflow jobs represent imports from databases to HDFS. Click See details to the right of a job name for additional information.

    • Details - This section provides additional information about the specified workflow job.
    • See logs - This section provides logs related to the specified workflow job. You can kill the job by clicking on the Kill button.
  • Coordinator jobs - This tab contains configuration information and manages workflow jobs. Click See details to the right of a job for additional information.

    • Details - Additional information about the coordinator job.
    • Started workflow jobs - A list of workflow jobs spawned by the coordinator job. Each workflow job on the list has a See details link, which takes you to that workflow job's details.

See also

Job Scheduler FAQ

Installing custom database drivers
