Wednesday, November 16, 2011

Scheduling and Workspace in SUGAR

With the recent addition of perspectives to the Pentaho User Console (PUC) we opened up a whole new way to integrate with the BI platform.  This will go a long way for customers and OEMs who want to add (or remove) functionality from PUC.  Having said this, we are currently developing new perspectives for the SUGAR release.  Sean Flatley has been developing a PDI admin perspective (based on CDF).  There are also plans to create an admin perspective (or add to the PDI admin) to replace the admin console (PAC and PEC).

Recently, I have been developing a total replacement for the PUC workspace, which was in dire need of TLC. When PDI added scheduling capabilities against our DI server, it did so against a brand new scheduling system.  Until now, we hadn't taken advantage of this in the BI server.  All of this changes in SUGAR: the old scheduler is completely removed and the new scheduler has taken over!  Rather than get our existing (pre-SUGAR) workspace to work against the new scheduler, we spent some time rewriting it.  The new workspace makes all scheduler interactions using REST.  This means that it will be easy for other developers to interact with the scheduler in their own interfaces.

Scheduling with REST
I mainly wanted to highlight the new workspace in this post, but I figured there might be a fair amount of outside interest in learning about scheduling + REST.  We have held up our end of REST purity in that the GET, POST and DELETE HTTP methods are used where appropriate.  Simple results are returned as text/plain, while complex state (such as a list of jobs) can be returned as either XML or JSON.  Whatever your client-side technology of choice is, you can set the "accept" HTTP header to tell the server which type to return.  For example, myrequest.setHeader("accept", "application/json") will cause the scheduler REST service to return results (if supported) as JSON.
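
As a small illustration of the header negotiation described above, here is a sketch in Python using only the standard library.  It builds the requests without sending them.  The build_request helper and the BASE_URL constant are names made up for this example, "application/xml" is assumed to be the media type the server accepts for XML, and authentication is left out entirely.

```python
import urllib.request

# Assumed base URL: a BI server on localhost:8080, as in the examples in this post.
BASE_URL = "http://localhost:8080/pentaho/api/scheduler"


def build_request(path, accept="application/json"):
    """Build (but do not send) a GET request asking for the given response type."""
    request = urllib.request.Request(BASE_URL + "/" + path, method="GET")
    request.add_header("Accept", accept)
    return request


# Same endpoint, two different requested representations.
json_request = build_request("jobs")
xml_request = build_request("jobs", accept="application/xml")

print(json_request.full_url, json_request.get_header("Accept"))
print(xml_request.full_url, xml_request.get_header("Accept"))
```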

The URLs listed in the examples below assume that your BI server is running on "localhost" port 8080.

Scheduler State
To get the state of the scheduler make a GET request to:
http://localhost:8080/pentaho/api/scheduler/state

The return type for this is text/plain and the result will be one of:
RUNNING, PAUSED or STOPPED

To control the state of the scheduler you must make a POST request.  In order to start or resume the scheduler as a whole:
http://localhost:8080/pentaho/api/scheduler/start

To pause the scheduler:
http://localhost:8080/pentaho/api/scheduler/pause

To shut down the scheduler (a reboot is required after a shutdown):
http://localhost:8080/pentaho/api/scheduler/shutdown

Remember, these are POST requests.  You cannot just paste one of these URLs into a browser and expect it to work, because the browser would issue a GET request instead.
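
To make the GET versus POST distinction concrete, the following Python sketch maps each scheduler-wide endpoint to its HTTP verb.  The scheduler_request and send helpers are hypothetical names for this example, and send assumes a server that needs no authentication, which a real BI server almost certainly will.

```python
import urllib.request

# Assumed base URL: a BI server on localhost:8080, as in the examples in this post.
BASE_URL = "http://localhost:8080/pentaho/api/scheduler"

# HTTP verb for each scheduler-wide endpoint named in this post.
SCHEDULER_ACTIONS = {
    "state": "GET",
    "start": "POST",
    "pause": "POST",
    "shutdown": "POST",
}


def scheduler_request(action):
    """Build the request for one scheduler-wide action without sending it."""
    if action not in SCHEDULER_ACTIONS:
        raise ValueError("unknown scheduler action: " + action)
    return urllib.request.Request(
        BASE_URL + "/" + action, method=SCHEDULER_ACTIONS[action]
    )


def send(request):
    """Send a request and return the text/plain body, e.g. RUNNING or PAUSED."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8").strip()


# Print the verb and URL of every request; nothing is sent here.
for action in SCHEDULER_ACTIONS:
    request = scheduler_request(action)
    print(request.get_method(), request.full_url)
```

Against a running server, send(scheduler_request("state")) would be the call that returns the current scheduler state.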

Listing Jobs
Since listing jobs does not change any state on the server, the request for getting the list of jobs is a GET.

The following URL can be used as a GET request and will return XML or JSON.
http://localhost:8080/pentaho/api/scheduler/jobs

Job State
Like the scheduler itself, we can interact with the running state of an individual job.  Getting the state of a specific job requires that you submit the jobId (which is given in the list jobs REST call).  You can only get the state of a job that you created (own), unless you have administration privileges.

To get the state of a job, the REST url is:
http://localhost:8080/pentaho/api/scheduler/jobState

You must submit JSON or XML wrapping the jobId.  For example, the JSON request payload:
{"jobId":"joe:1685214344:1321154720424"}

The "Content-Type" request header must also be set to match the payload, for example: myrequest.setHeader("Content-Type", "application/json");

The return type for this is text/plain and the result will be one of:
NORMAL, PAUSED, COMPLETE, ERROR, BLOCKED or UNKNOWN
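
A minimal Python sketch of the jobState call follows.  It wraps the jobId in JSON and sets the Content-Type header, but does not send anything.  The job_state_request helper is a made-up name, and the HTTP verb is an assumption: this post only says that the state-changing calls are POSTs, so the verb is a parameter here rather than a fixed choice.

```python
import json
import urllib.request

# Assumed base URL: a BI server on localhost:8080, as in the examples in this post.
BASE_URL = "http://localhost:8080/pentaho/api/scheduler"


def job_state_request(job_id, method="GET"):
    """Build the jobState request carrying the jobId as a JSON body.

    The default verb is an assumption; pass method="POST" if the server
    expects that instead.
    """
    body = json.dumps({"jobId": job_id}).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL + "/jobState", data=body, method=method
    )
    request.add_header("Content-Type", "application/json")
    return request


request = job_state_request("joe:1685214344:1321154720424")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```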

Altering the state of a job is not much different than getting the state except that a POST request must be made.  The REST urls are:
http://localhost:8080/pentaho/api/scheduler/resumeJob
http://localhost:8080/pentaho/api/scheduler/pauseJob

Triggering a Job Immediately
To trigger the immediate execution of a job, you can invoke the triggerNow REST endpoint (POST) with the jobId wrapped with JSON or XML.  You must be authorized to execute the job in order to trigger it.


Deleting a Job

To remove a job from the scheduler, you can invoke the removeJob REST endpoint (DELETE) with the jobId wrapped with JSON or XML.  You must be authorized (job owner or admin) to delete the job from the scheduler.
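
The per-job endpoints above all take the same wrapped jobId and differ only in name and HTTP verb, which the Python sketch below captures in a small table.  The job_action_request helper is hypothetical, and the full URLs for triggerNow and removeJob are assumed to sit under the same /pentaho/api/scheduler/ path as the other endpoints.

```python
import json
import urllib.request

# Assumed base URL: a BI server on localhost:8080, as in the examples in this post.
BASE_URL = "http://localhost:8080/pentaho/api/scheduler"

# HTTP verb for each per-job endpoint, as described in this post.
JOB_ACTIONS = {
    "pauseJob": "POST",
    "resumeJob": "POST",
    "triggerNow": "POST",
    "removeJob": "DELETE",
}


def job_action_request(action, job_id):
    """Build (but do not send) a per-job request with the jobId wrapped in JSON."""
    if action not in JOB_ACTIONS:
        raise ValueError("unknown job action: " + action)
    body = json.dumps({"jobId": job_id}).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL + "/" + action, data=body, method=JOB_ACTIONS[action]
    )
    request.add_header("Content-Type", "application/json")
    return request


# Show the verb and URL for each action on the example jobId from this post.
job_id = "joe:1685214344:1321154720424"
for action in JOB_ACTIONS:
    request = job_action_request(action, job_id)
    print(request.get_method(), request.full_url)
```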

Creating a New Job

This is the most complex part of interacting with the scheduler.  In order to represent a new schedule, there are 3 "trigger" types: simple, complex and cron.

I'm just going to give some examples rather than document every possible combination.  First, let's use a simple schedule: run the Inventory.prpt every 4 hours until December 31, 2012.  The JSON payload would be:

{"inputFile":"/public/pentaho-solutions/steel-wheels/reports/Inventory.prpt", "outputFile":null, "simpleJobTrigger":{"repeatInterval":14400, "repeatCount":-1, "startTime":"2011-11-16T00:00:00.000-05:00", "endTime":"2012-12-31T23:59:59.000-05:00"}}

The inputFile is the full path to the Inventory.prpt resource.  We're using a simple trigger, meaning that we don't worry about special recurrence patterns, we just want to run every 4 hours until the "endTime" has been reached.  The repeatInterval is 14400 seconds which equals 4 hours.  If you want to repeat a specific number of times until the trigger is no longer fired (in lieu of endTime) you can give a repeatCount.  A repeatCount of -1 means forever.
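
The arithmetic and date formatting for a simple trigger can be sketched in Python as follows.  The simple_job and iso helpers and the EASTERN constant are names invented for this example; only the payload field names come from the JSON above.

```python
import json
from datetime import datetime, timedelta, timezone

# The -05:00 offset used in the example payload above.
EASTERN = timezone(timedelta(hours=-5))


def iso(moment):
    """Format a timezone-aware datetime like 2011-11-16T00:00:00.000-05:00."""
    return moment.isoformat(timespec="milliseconds")


def simple_job(input_file, every_hours, start, end=None, repeat_count=-1):
    """Build a payload with a simpleJobTrigger; repeatInterval is in seconds."""
    return {
        "inputFile": input_file,
        "outputFile": None,
        "simpleJobTrigger": {
            "repeatInterval": every_hours * 60 * 60,
            "repeatCount": repeat_count,  # -1 means repeat forever
            "startTime": iso(start),
            "endTime": iso(end) if end is not None else None,
        },
    }


payload = simple_job(
    "/public/pentaho-solutions/steel-wheels/reports/Inventory.prpt",
    every_hours=4,
    start=datetime(2011, 11, 16, 0, 0, 0, tzinfo=EASTERN),
    end=datetime(2012, 12, 31, 23, 59, 59, tzinfo=EASTERN),
)
print(json.dumps(payload))
```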

Next, let's imagine we want to schedule the Product Line Sales.prpt every Sunday at 2am with no end date.

The REST endpoint is http://localhost:8080/pentaho/api/scheduler/createJob.  The JSON payload would be something like this:

{"inputFile":"/public/pentaho-solutions/steel-wheels/reports/Product Line Sales.prpt", "outputFile":null, "complexJobTrigger":{"daysOfWeek":["0"], "startTime":"2011-11-16T02:00:00.000-05:00", "endTime":null}}

Dissecting this, we can see the inputFile is set to the full path to the scheduled resource.  We are creating a "complex" job trigger with a recurrence pattern of "daysOfWeek" including just "0", meaning Sunday (the days range from 0-6).  If the trigger were for multiple days of the week, this would be given as "daysOfWeek":["0","1"] (for Sunday/Monday).  All times are in ISO 8601 date format (this is true for dates coming out of the scheduler REST services as well).  The startTime specifies the "from" date and endTime refers to the date at which the schedule will no longer be run.  A null value for the endTime means it has no end.

Another example, "The last Friday of every month at 4am" would have a JSON payload of:

{"inputFile":"/public/pentaho-solutions/steel-wheels/reports/Income Statement.prpt", "outputFile":null, "complexJobTrigger":{"weeksOfMonth":["4"], "daysOfWeek":["5"], "startTime":"2011-11-16T04:00:00.000-05:00", "endTime":null}}

Finally, a yearly schedule, "Every January 1st at midnight":

{"inputFile":"/public/pentaho-solutions/steel-wheels/reports/Invoice Statements.prpt", "outputFile":null, "complexJobTrigger":{"monthsOfYear":["0"], "daysOfMonth":["1"], "startTime":"2011-11-16T00:00:00.000-05:00", "endTime":null}}
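
The three complex trigger examples can be generated from one small Python helper, sketched below.  The complex_job and create_job_request names are invented for this example.  Two things are assumptions rather than facts from this post: that every month is numbered from zero (only January as "0" is shown above), and that createJob is called with POST (the verb is not stated).

```python
import json
import urllib.request

CREATE_JOB_URL = "http://localhost:8080/pentaho/api/scheduler/createJob"

# The list index is the value sent in the payload: Sunday is "0", January is "0".
DAYS = ["SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"]
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def complex_job(input_file, start_time, end_time=None, days_of_week=(),
                months_of_year=(), days_of_month=(), weeks_of_month=()):
    """Build a payload with a complexJobTrigger from readable day/month names."""
    trigger = {}
    if months_of_year:
        trigger["monthsOfYear"] = [str(MONTHS.index(m)) for m in months_of_year]
    if weeks_of_month:
        trigger["weeksOfMonth"] = [str(w) for w in weeks_of_month]
    if days_of_month:
        trigger["daysOfMonth"] = [str(d) for d in days_of_month]
    if days_of_week:
        trigger["daysOfWeek"] = [str(DAYS.index(d)) for d in days_of_week]
    trigger["startTime"] = start_time
    trigger["endTime"] = end_time  # None serializes to null, meaning no end
    return {"inputFile": input_file, "outputFile": None,
            "complexJobTrigger": trigger}


def create_job_request(payload):
    """Build (but do not send) the createJob request; POST is an assumption."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(CREATE_JOB_URL, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    return request


REPORTS = "/public/pentaho-solutions/steel-wheels/reports/"
every_sunday = complex_job(REPORTS + "Product Line Sales.prpt",
                           "2011-11-16T02:00:00.000-05:00",
                           days_of_week=["SUN"])
last_friday = complex_job(REPORTS + "Income Statement.prpt",
                          "2011-11-16T04:00:00.000-05:00",
                          weeks_of_month=[4], days_of_week=["FRI"])
new_year = complex_job(REPORTS + "Invoice Statements.prpt",
                       "2011-11-16T00:00:00.000-05:00",
                       months_of_year=["JAN"], days_of_month=[1])

for job in (every_sunday, last_friday, new_year):
    print(json.dumps(job))
```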


The Workspace
With all the REST details behind me now, I can finally cover some new UI work that I've been working on the past few weeks. As I said before, the new workspace interacts with the server exclusively through REST web services, meaning that it is possible for someone with better UI skills to replace it (by removing the default one from the default-plugin/plugin.xml).


The old workspace listed all content for each schedule, which was unbelievably unmanageable.  It was also rather clunky when it came to starting/stopping/removing schedules and their output content, and it lacked the ability to manage the scheduler as a whole (start/stop).


The new workspace lists schedules (aka jobs), not content (output) from those jobs. You can start/stop the entire scheduler or pause/resume individual jobs. A human readable description of each schedule is provided. Each column in the table view can be sorted. If there are many schedules, the table will enter a "paging" mode. If there are still too many schedules to find what you are looking for you can easily add a filter. You can multi-select (with the help of CTRL or SHIFT keys) and manage many schedules at once. Selected jobs can be triggered to run immediately, paused, resumed or removed permanently. When you click on a cell in the file (resource) column you can view and manage (TBD) content from previous executions of that schedule.


Workspace View showing multi-select "pause" (notice state of selected items)

You can filter the list of jobs by file, state, user, schedule type and execution times

Selecting a file link will show past execution history and allow content to be viewed.

We're not done with the scheduling yet, but we've been making incredible progress. We still need to finish (WIP) parameter support and define (TBD) what content management can be done from the history (generated content dialog).


