Designing An Oozie Workflow – Gayathri R


Data generation is at its peak today, which makes tracking and storing it essential. The storage location is typically a data warehouse or database, and we need a tool to Extract, Load and Transform data from one database to another, or from various source systems into a single database. The database should be updated frequently, especially with the latest incremental data for a given day or week. Sqoop makes it easy to load incremental data into Hadoop, and Oozie takes care of job scheduling, so the Sqoop job can run periodically to load data into Hive.

Introduction

Oozie is a workflow scheduling engine for the Hadoop framework. At a high level, the process works as follows:

  • Oozie runs an Apache Sqoop job to perform an import action on data residing in any of the source databases.
  • As a result, the data is transferred directly into Hive (HQL) and the Hadoop Distributed File System (HDFS).

Apache Sqoop

Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hive / HDFS. It allows users to specify a target location inside Hadoop and instructs Sqoop to move data from MySQL, Oracle, Teradata or other relational databases to that target, and vice versa.

The three types of operations are as follows:

  • Sqoop Import: imports data into Hive / HDFS
  • Sqoop Export: exports data from Hive to another relational database (a hedged sketch follows this list)
  • Sqoop Merge: merges two datasets (e.g. txt or csv files) that reside in HDFS
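
Later in this post, the import operation is wired into an Oozie workflow. For comparison, here is a minimal, hedged sketch of how an export could look as a standalone Oozie sqoop action; the Oozie action syntax itself is explained further down, and the target table and export directory below are hypothetical placeholders rather than values from this setup.

<action name="sqoop_export">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <arg>export</arg>
        <arg>--connect</arg>
        <arg>${dbURL}</arg>
        <arg>--username</arg>
        <arg>${user}</arg>
        <!-- Hypothetical target table in the relational database -->
        <arg>--table</arg>
        <arg>TARGET_TABLE</arg>
        <!-- Hypothetical HDFS directory backing the Hive table being exported -->
        <arg>--export-dir</arg>
        <arg>/user/hive/warehouse/source_table</arg>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
</action>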

Apache Hive

Hive is a Hadoop-based data warehousing framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted into MapReduce jobs. This lets SQL programmers with no MapReduce experience use the warehouse and makes it easier to integrate with business intelligence tools.

Apache Oozie

Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. It coordinates interdependent, recurring jobs through the Oozie coordinator, which can be triggered either by a pre-scheduled time or by data availability. A set of coordinator applications can be submitted and maintained using the Oozie bundle system. Oozie also lets users express dependencies between jobs, for example specifying that a particular query should start only after the previous jobs it relies on for data have completed.
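
To illustrate the time-based trigger, here is a minimal coordinator sketch that would run the workflow once a day. The coordinator name, start/end dates and frequency are assumptions for illustration only; in practice the coordinator is referenced from job.properties via oozie.coord.application.path, and the app-path points at the HDFS directory holding workflow.xml.

<coordinator-app name="sqoop-incr-coordinator" frequency="${coord:days(1)}"
                 start="2018-04-16T01:00Z" end="2019-04-16T01:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory containing workflow.xml (assumed to match the path used later in this post) -->
            <app-path>${nameNode}/user/cloudera/examples/shellscript</app-path>
        </workflow>
    </action>
</coordinator-app>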

The configuration below is needed to set up the Oozie job; it contains information such as the Name Node, Job Tracker, Oozie library path and output directory. The file should be created and loaded into HDFS at the appropriate location, and that location must be specified when running Oozie.

job.properties:

  • nameNode=namenode:30070
  • jobTracker=headnode:8088
  • queueName=default
  • oozie.libpath=${nameNode}/user/oozie/share/lib
  • oozie.use.system.libpath=true
  • oozie.wf.rerun.failnodes=true
  • oozie.wf.application.path=${nameNode}/user/cloudera/examples/shellscript/workflow.xml
  • outputDir=SqoopShellOutput

Designing the Workflow

In order to run a recurring job, you’ll need to have an Oozie workflow to execute. Oozie workflows are written as XML files representing a directed acyclic graph (DAG). Let’s look at a simple workflow example and then walk through what’s actually going on. The following action is saved as sqoop_workflow-action.xml and forms part of the workflow definition referenced as workflow.xml.

The XML can be used to configure jobs such as Sqoop, Hive, Pig and MapReduce; here we have configured a Sqoop import job in the following manner, and jobs for the other tools can be configured similarly.

<action name="sqoop_workflow"> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <coordinator.app=”2016-09-08 01:00:00{01}” <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <prepare> <delete path="${s3BucketLoc}/${tableName}/incr"/> <mkdir path="${s3BucketLoc}/${tableName}"/> </prepare> <configuration> <property> <name>mapred.job.queue.name</name> <value>${queueName}</value> </property> </configuration> <arg>import</arg> <arg>--connect</arg> <arg>${dbURL}</arg> <arg>--</arg> <arg>${mySqlDriver}</arg> <arg>--username</arg> <arg>${user}</arg> <arg>--table</arg> <arg>${wf:actionData('timeCheck')['tableName']}</arg> <arg>--target-dir</arg> <arg>${s3BucketLoc}/${tableName}/incr</arg> <arg>--check-column</arg> <arg>LAST_UPD</arg> <arg>--incremental</arg> <arg>lastmodified</arg> <arg>--last-value</arg> <arg>${wf:actionData('timeCheck')['sqoopLstUpd']}</arg> <arg>--m</arg> <arg>1</arg> </sqoop> <ok to="sqoop_Update"/> <error to="fail"/> </action>

Oozie Hive Workflow

Before starting a Hive job, we need to configure the Hive action (and, if required, create or delete HDFS files and directories in a prepare step). Property values specified in the configuration element override values specified in the job-xml file.

Oozie executes the Hive script specified in the script element, and inputs for the Hive script can be passed as parameters using the argument element. The code below is used to give Oozie access to the Hive environment – here, a shell action that invokes a Hive switch script.

<action name="hive_workflow"> <shell xmlns="uri:oozie:shell-action:0.1"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <configuration> <property> <name>mapred.job.queue.name</name> <value>${queueName}</value> </property> </configuration> <exec>${hiveSwitchScript}</exec> <argument>${tableName}</argument> <file>${hiveSwitchScriptPath}#${hiveSwitchScript}</file> <capture-output/> </shell> <ok to="master-decision"/> <error to="fail"/> </action> </action>

Now that we’re all set up, the Oozie job can be executed using the command below. “oozie job -oozie” is the command used to submit an Oozie job; it takes the head node URL of the Oozie server and the path of the ‘job.properties’ file.

oozie job -oozie http://headnode:11000/oozie -config /home/user/oozie_test/sqoop/job.properties -run

Here’s a chart to understand the process better.

And that’s how incremental data is loaded using Sqoop. I am working on more improvements in Oozie scheduling and will keep you updated on the progress.

Keywords: Incremental loading, incremental data, data loading, oozie workflow, job scheduling, sqoop, Hadoop, hive, big data, workflow process, ETL

