Guide to Apache Oozie

July 10, 2014

Description

Apache Oozie is a workflow scheduler and manager for an Apache Hadoop cluster.

 

Example Use Cases

  • You want to extract log data to get usage statistics on how people are using a Web App.
  • You want to call a service and load that data into a Data Store.
  • You want to transform data in one table into another form and load that data into a new table.

 

Structure of an Oozie Job

An Oozie job is built from three structures (at least as of the writing of this blog entry): a Bundle (which contains one to many Coordinators), a Coordinator (which contains one to many Workflows), and a Workflow (which contains the actual code that performs some action).

The base level is the Workflow. As mentioned above, this element contains the actual commands and scripts for performing some operation. This can be anything from running HDFS commands to running Hive queries to running Pig scripts. You can also run other Workflows from within a Workflow to accomplish whatever you need to do. Think of a Workflow as a one-time job: you can’t schedule it to run every hour or so. That kind of scheduling is done at the Coordinator level.
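To make the Workflow level concrete, here is a minimal sketch that writes a one-action workflow.xml. The directory name, the workflow name, the action name and the HDFS path are all made up for illustration, and `${nameNode}` is assumed to be supplied later through the job's properties file. The schema version (0.2) is also an assumption; check which versions your Oozie release accepts.

```shell
# Hypothetical local app directory; on a real cluster its contents go to HDFS.
APP_DIR="${APP_DIR:-./oozie-example}"
mkdir -p "$APP_DIR"

# The quoted 'EOF' keeps ${...} literal so Oozie, not the shell, resolves it.
cat > "$APP_DIR/workflow.xml" <<'EOF'
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="make-dir"/>
    <!-- A single action: create an HDFS directory (placeholder path). -->
    <action name="make-dir">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-example/output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```

Every action names where to go on success (`ok`) and on failure (`error`), which is how larger Workflows chain Hive, Pig or HDFS steps together.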

Going up a level, we get to the Coordinator. As mentioned above, this element contains one to many Workflows. A Coordinator is useful for scheduling Workflows to run at certain times. For example, you can configure a Coordinator to run a Workflow every hour if you have data that needs to be refreshed or recalculated hourly.
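The hourly example could look roughly like the sketch below, which writes a coordinator.xml that triggers a workflow application once an hour. The coordinator name and the `startTime`, `endTime` and `workflowAppPath` properties are hypothetical and would have to be defined in the job's properties file. The schema version (0.2) is an assumption.

```shell
# Hypothetical local app directory, as in a typical Oozie app layout.
APP_DIR="${APP_DIR:-./oozie-example}"
mkdir -p "$APP_DIR"

# The quoted 'EOF' keeps the ${...} expressions literal for Oozie to evaluate.
cat > "$APP_DIR/coordinator.xml" <<'EOF'
<coordinator-app name="hourly-coord" frequency="${coord:hours(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <!-- HDFS directory holding the workflow.xml to run each hour. -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
EOF
```

The start and end times bound the period during which runs are created, so the Workflow itself stays a plain one-time job.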

At the final level there is the Bundle. As mentioned above, a Bundle contains one to many Coordinators. It is useful for grouping Coordinators that are related in some way. For example, you may have one Coordinator that pulls data into Hadoop from somewhere, another Coordinator that transforms it in some way, and a third Coordinator that pushes that data to some outside data source.
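As a sketch of the pull, transform and push grouping described above, the script below writes a bundle.xml that lists three Coordinators. The coordinator names and the three `...CoordPath` properties are invented for the example and would need to be set in the job's properties file.

```shell
# Hypothetical local app directory.
APP_DIR="${APP_DIR:-./oozie-example}"
mkdir -p "$APP_DIR"

# The quoted 'EOF' keeps ${...} literal so Oozie resolves it at submit time.
cat > "$APP_DIR/bundle.xml" <<'EOF'
<bundle-app name="example-bundle" xmlns="uri:oozie:bundle:0.1">
    <!-- Each entry points at an HDFS directory containing a coordinator.xml. -->
    <coordinator name="ingest">
        <app-path>${ingestCoordPath}</app-path>
    </coordinator>
    <coordinator name="transform">
        <app-path>${transformCoordPath}</app-path>
    </coordinator>
    <coordinator name="export">
        <app-path>${exportCoordPath}</app-path>
    </coordinator>
</bundle-app>
EOF
```

Note that listing Coordinators in a Bundle groups them for management (start, suspend, kill together); it does not by itself make one wait for another.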

 

Oozie Documentation

https://oozie.apache.org/docs/3.1.3-incubating/DG_CommandLineTool.html

 

Useful Commands (from command line)

Help

$ oozie help

Validate Workflow (Ensure that there aren’t any errors in a Workflow file)

$ oozie validate /path/to/workflow.xml

Start Job (will return a job_id)

$ oozie job -config /path/to/job.properties -run
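The properties file passed to `-config` is not shown in this post, so here is a minimal sketch of one, together with a small helper that captures the returned job ID. The host names, ports and HDFS path are placeholders, and `start_job` is a hypothetical helper that assumes the CLI prints a line of the form `job: <job_id>` on success. Since the commands here omit the `-oozie` option, they presumably rely on the `OOZIE_URL` environment variable pointing at the Oozie server.

```shell
# Hypothetical local app directory.
APP_DIR="${APP_DIR:-./oozie-example}"
mkdir -p "$APP_DIR"

# The quoted 'EOF' keeps ${...} literal; Oozie resolves these properties.
cat > "$APP_DIR/job.properties" <<'EOF'
# Placeholder cluster addresses; replace with your own.
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
# HDFS directory that contains workflow.xml (hypothetical path).
oozie.wf.application.path=${nameNode}/user/${user.name}/oozie-example
EOF

# Start a job and print only its ID.
# Assumes the CLI output contains a line like "job: <job_id>".
start_job() {
    oozie job -config "$1" -run | sed -n 's/^job: *//p'
}

# Usage (needs a reachable Oozie server):
#   JOB_ID=$(start_job "$APP_DIR/job.properties")
```

For a Coordinator or Bundle, the application path property would be `oozie.coord.application.path` or `oozie.bundle.application.path` instead.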

Job Status

$ oozie job -info {job_id}

Logs

$ oozie job -log {job_id}

Suspend Job

$ oozie job -suspend {job_id}

Restart Suspended Job

$ oozie job -resume {job_id} 

Kill Job

$ oozie job -kill {job_id}
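The commands above can be combined into a small polling helper, sketched below. Both function names are hypothetical, and the parsing assumes that `oozie job -info` prints a line such as `Status        : RUNNING`; adjust the pattern if your version formats its output differently.

```shell
# Print the status of a job.
# Assumes the -info output has a "Status : <STATE>" line.
job_status() {
    oozie job -info "$1" |
        awk -F' *: *' '$1 == "Status" { sub(/ +$/, "", $2); print $2; exit }'
}

# Poll until the job is no longer PREP or RUNNING, then print its status.
wait_for_job() {
    job_id=$1
    interval=${2:-30}   # seconds between checks
    while :; do
        status=$(job_status "$job_id")
        case "$status" in
            PREP|RUNNING) sleep "$interval" ;;
            *) echo "$status"; return 0 ;;
        esac
    done
}

# Usage (needs a reachable Oozie server):
#   wait_for_job "$JOB_ID" 60
```

A suspended job is reported back as SUSPENDED rather than waited on, so the caller can decide whether to resume or kill it.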

 

 Oozie Error Codes Explained

While running Oozie jobs, I’ve noticed that certain error codes come up in the logs without much explanation of what the problem is. Here is what I’ve found some of these error codes to mean after debugging.

Error     Possible Cause
Error 9   A SQL error, such as trying to create a table that already exists.
Error 10  Undefined parameter variables in your workflow, or a SQL runtime exception such as a field that doesn’t exist in a temporary table.
Error 11  Comments (/**/) in the SQL file being executed, or a parameter used in the SQL that was not passed in through the workflow.

 

Useful Links

https://oozie.apache.org/docs/3.1.3-incubating/DG_CommandLineTool.html
