TaskForest
A simple, expressive, open-source, text-file-based Job Scheduler with console, HTTP, and RESTful API interfaces.

Jobs & Families

A job is defined as any executable program that resides on the file system. It is represented as a file in the file system whose name is the same as the job name. Jobs can depend on each other. Jobs can also have start times before which a job may not be run.

When a job is run by the run wrapper (bin/run_with_log), two status semaphore files are created in the log directory. The first is created when a job starts and has a name of $FamilyName.$JobName.pid. This file contains some attributes of the job. When the job completes, more attributes are written to this file.

When the job completes, another semaphore file is written to the log directory. The name of this file will be $FamilyName.$JobName.0 if the job ran successfully, and $FamilyName.$JobName.1 if the job failed. In either case, the file will contain the exit code of the job (0 in the case of success and non-zero otherwise).
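The semaphore-file convention described above can be inspected from a script. The following is a minimal Python sketch, not part of TaskForest: the function name job_outcome and the labels it returns are hypothetical, and it only checks which of the three files exist in the log directory.

```python
from pathlib import Path


def job_outcome(log_dir, family, job):
    """Classify a job from the semaphore files found in log_dir.

    The returned labels are illustrative only; they are not
    TaskForest's own status values.
    """
    log_dir = Path(log_dir)
    prefix = f"{family}.{job}"
    if (log_dir / f"{prefix}.0").exists():
        return "success"        # $FamilyName.$JobName.0 was written
    if (log_dir / f"{prefix}.1").exists():
        return "failure"        # $FamilyName.$JobName.1 was written
    if (log_dir / f"{prefix}.pid").exists():
        return "started"        # only the pid file exists so far
    return "not started"
```

For example, job_outcome("logs/20090503", "RETRY", "J_Retry") would look for RETRY.J_Retry.0, RETRY.J_Retry.1 and RETRY.J_Retry.pid in that directory.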

When a job is run by the run_with_log run wrapper, any output the job sends to stdout or stderr will be captured and stored in a file called $FamilyName.$JobName.$pid.$start_time.stdout in the log directory.

Within TaskForest, every job has a status, which is one of the following values:

Jobs & Families

Jobs can be grouped together into 'Families.' A family has a start time associated with it before which none of its jobs may run. A family also has either (a) a list of days-of-the-week or (b) a calendar associated with it. Jobs within a family may only run on the days specified by the days-of-the-week or the calendar.

Jobs and families are given simple names. A family is described in a family file whose name is the family name. Each family file is a text file that contains one or more job names. The layout of the job names within a family file determines the dependencies between the jobs (if any). There are several reasons why text files are a good choice for Family files.

Family names and job names should contain only the characters shown below:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_
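A quick way to check a name against this character set is a regular expression. This is a small Python sketch; is_valid_name is a hypothetical helper, not a TaskForest function.

```python
import re

# Letters, digits and underscore only, as listed above.
NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name):
    """Return True if name uses only the permitted characters."""
    return NAME_RE.fullmatch(name) is not None
```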

Let's see a few examples. In these examples the dashes (-), pipes (|) and line numbers are not parts of the files. They're only there for illustration purposes. The main script expects environment variables, command line options or configuration file settings that specify the locations of the directory that contains family files, the directory that contains job files, and the directory where the logs will be written. The directory that contains family files should contain only family files.

EXAMPLE 1 - Family file named F_ADMIN

   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', days => 'Mon,Wed,Fri'
02 |
03 | J_ROTATE_LOGS()
04 |
   +-------------------------------------------------------
  

The first line in any family file always contains 3 bits of information about the family: the start time, the time zone, and either the days on which the jobs in this family are run or the calendar that specifies on which dates the jobs in this family are run.

In this case, this family starts at 2:00 a.m. GMT. For a time zone that observes daylight saving time (such as 'America/New_York'), the start time is adjusted accordingly. This family 'runs' on Monday, Wednesday and Friday only. Pay attention to the format: it's important.

Valid days are 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'. Days must be separated by commas.

All start times (for families and jobs) are in 24-hour format. '00:00' is midnight, '12:00' is noon, '13:00' is 1:00 p.m. and '23:59' is one minute before midnight.
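To make the format rules concrete, here is a rough Python sketch that parses and checks a family file's first line. It is an illustration only: parse_family_header is a hypothetical name, the regular expressions are this sketch's own reading of the format, and TaskForest's real parser may accept or reject inputs differently.

```python
import re

VALID_DAYS = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}
PAIR_RE = re.compile(r"(\w+)\s*=>\s*'([^']*)'")      # key => 'value'
TIME_RE = re.compile(r"([01]\d|2[0-3]):[0-5]\d")     # 24-hour HH:MM


def parse_family_header(line):
    """Parse "start => '02:00', tz => 'GMT', days => 'Mon,Wed,Fri'"."""
    opts = dict(PAIR_RE.findall(line))
    if not TIME_RE.fullmatch(opts.get("start", "")):
        raise ValueError("start must be HH:MM in 24-hour format")
    if "days" in opts:
        days = opts["days"].split(",")
        bad = [d for d in days if d not in VALID_DAYS]
        if bad:
            raise ValueError(f"invalid days: {bad}")
        opts["days"] = days
    return opts
```

Called on the first line of F_ADMIN, this returns a dictionary with the start time, the time zone and the list of days.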

There is only one job in this family - J_ROTATE_LOGS. This family will start at 2:00 a.m., at which time J_ROTATE_LOGS will immediately be run. Note the empty parentheses [()]. These are required.

What does it mean to say that J_ROTATE_LOGS will be run? It means that the system will look for a file called J_ROTATE_LOGS in the directory that contains job files. That file should be executable. The system will execute that file (run that job) and keep track of whether it succeeded or failed. The J_ROTATE_LOGS script can be any executable file: a shell script, a perl script, a C program etc.

To run the program, the system actually runs a wrapper script that invokes the job script. The location of the wrapper script is specified on the command line or in an environment variable.

Now, let's look at a slightly more complicated example:

EXAMPLE 2 - Job Dependencies

This family file is named WEB_ADMIN.


   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', calendar => 'Weekdays'
02 |
03 |               J_ROTATE_LOGS()
04 |
05 | J_RESOLVE_DNS()       Delete_Old_Logs()
06 |
07 |               J_WEB_REPORTS()      
08 |
09 |    J_EMAIL_WEB_RPT_DONE()  # send me a notification
10 |
   +-------------------------------------------------------
  

A few things to point out here:

It is possible to have a dependency on a job that's in another family. If, for example, J_ROTATE_LOGS was in the Family named LOGS, then the family above would look like this:


   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', calendar => 'Weekdays'
02 |
03 |            LOGS::J_ROTATE_LOGS()
04 |
05 | J_RESOLVE_DNS()       Delete_Old_Logs()
06 |
07 |               J_WEB_REPORTS()      
08 |
09 |    J_EMAIL_WEB_RPT_DONE()  # send me a notification
10 |
   +-------------------------------------------------------
  

An external job dependency is different from 'normal' job dependencies, because unlike 'normal' dependencies, it specifies only the dependency, and not when the external job should run. This means that looking at the above family file, we cannot say when J_ROTATE_LOGS will run. More accurately, we cannot say when LOGS::J_ROTATE_LOGS will run. All we know is that after it runs, J_RESOLVE_DNS and Delete_Old_Logs can run (after 2:00 GMT).

This also means that external job dependencies may only be specified on the first line of a Family, or the first line of a group of jobs (see example 4). Therefore the following is not allowed:


   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', calendar => 'Weekdays'
02 |
03 |            LOGS::J_ROTATE_LOGS()
04 |
05 | J_RESOLVE_DNS()       Delete_Old_Logs()
06 |
07 |            REPORTS::J_WEB_REPORTS()  # BAD!
08 |
09 |    J_EMAIL_WEB_RPT_DONE()  # send me a notification
10 |
   +-------------------------------------------------------
  

To see how this should be written, we need to know about Job Forests. Since that's described in example 4, we'll defer the solution until then.

One last thing about external job dependencies: just because we're waiting on a job in another family, that doesn't mean that the same job cannot be run in this family. For example, the following is permitted:


   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', calendar => 'Weekdays'
02 |
03 |            LOGS::J_ROTATE_LOGS()
04 |
05 | J_RESOLVE_DNS()       Delete_Old_Logs()
06 |
07 |               J_WEB_REPORTS()
08 |
09 |    J_EMAIL_WEB_RPT_DONE()  # send me a notification
10 |
11 |                J_ROTATE_LOGS()   # This is a different
12 |                                  # job!
13 |
   +-------------------------------------------------------
  

The family will not start until J_ROTATE_LOGS has run from the LOGS family. The last job run by this family will be J_ROTATE_LOGS. It has nothing to do with the instance of the job that ran in the LOGS family. Line 11 will actually run the job, while line 3 only checks whether the job has run (by another family). That's what I mean when I say that external dependencies only specify the dependencies, while normal dependencies also specify when the job should run.

EXAMPLE 3 - Time Dependencies

Let's say that we don't want J_RESOLVE_DNS to start before 9:00 a.m. because it's very IO-intensive and we want to wait until the relatively quiet time of 9:00 a.m. In that case, we can put a time dependency on the job. This adds a restriction to the job, saying that it may not run before the time specified. We would do this as follows:


   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', calendar => 'Weekdays'
02 |
03 |               J_ROTATE_LOGS()
04 |
05 | J_RESOLVE_DNS(start => '09:00')  Delete_Old_Logs()
06 |
07 |               J_WEB_REPORTS()      
08 |
09 |    J_EMAIL_WEB_RPT_DONE()  # send me a notification
10 |
   +-------------------------------------------------------
  

J_ROTATE_LOGS will still start at 2:00, as always. As soon as it succeeds, Delete_Old_Logs is started. If J_ROTATE_LOGS succeeds before 09:00, the system will wait until 09:00 before starting J_RESOLVE_DNS. It is possible that Delete_Old_Logs would have started and completed by then. J_WEB_REPORTS would not have started in that case, because it is dependent on two jobs, and both of them have to run successfully before it can run.
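The rule described here (all parent jobs must have succeeded, and the job's own start time must have been reached) can be written as a small predicate. This is a Python sketch under simplifying assumptions: may_start is a hypothetical name, and time zones are ignored.

```python
from datetime import time


def may_start(now, parents_succeeded, not_before=None):
    """Return True if a job is allowed to start.

    now               -- current time of day (datetime.time)
    parents_succeeded -- one boolean per job this job depends on
    not_before        -- optional time dependency (datetime.time)
    """
    if not all(parents_succeeded):
        return False
    return not_before is None or now >= not_before
```

With this predicate, J_RESOLVE_DNS is held back at 02:05 even though J_ROTATE_LOGS has succeeded, while Delete_Old_Logs may start right away.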

For completeness, you may also specify a timezone for a job's time dependency as follows:

05 | J_RESOLVE_DNS(start=>'10:00', tz=>'America/New_York') ...

EXAMPLE 4 - Job Forests

You can see in the example above that line 03 is the start of a group of dependent jobs. No job on any other line can start unless the job on line 03 succeeds. What if you wanted two or more groups of jobs in the same family that start at the same time (barring any time dependencies) and proceed independently of each other?

To do this you would separate the groups with a line containing one or more dashes (only). Consider the following family:


   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', calendar => 'Weekdays'
02 |
03 |               J_ROTATE_LOGS()
04 |
05 | J_RESOLVE_DNS(start => '09:00')    Delete_Old_Logs()
06 |
07 |               J_WEB_REPORTS()      
08 |
09 |    J_EMAIL_WEB_RPT_DONE()  # send me a notification
10 |
11 |-------------------------------------------------------
12 |
13 | J_UPDATE_ACCOUNTS_RECEIVABLE()
14 |
15 | J_ATTEMPT_CREDIT_CARD_PAYMENTS()
16 |
17 |-------------------------------------------------------
18 |
19 | J_SEND_EXPIRING_CARDS_EMAIL()
20 |
   +-------------------------------------------------------

Because of the lines of dashes on lines 11 and 17, the jobs on lines 03, 13 and 19 will all start at 02:00. These jobs are independent of each other. J_ATTEMPT_CREDIT_CARD_PAYMENTS will not run if J_UPDATE_ACCOUNTS_RECEIVABLE fails. That failure, however, will not prevent J_SEND_EXPIRING_CARDS_EMAIL from running.

Finally, you can specify a job to run repeatedly every 'n' minutes, as follows:


   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', calendar => 'Weekdays'
02 |
03 | J_CHECK_DISK_USAGE(every=>'30', until=>'23:00')
04 |
   +-------------------------------------------------------

This means that J_CHECK_DISK_USAGE will be called every 30 minutes and will not run on or after 23:00. By default, the 'until' time is 23:59. If the job starts at 02:00 and takes 25 minutes to run to completion, the next occurrence will still start at 02:30, and not at 02:55. By default, every repeat occurrence will only have one dependency - the time - and will not depend on earlier occurrences running successfully or even running at all. If line 03 were:

J_CHECK_DISK_USAGE(every=>'30', until=>'23:00', chained=>1)

...then each repeat job will be dependent on the previous occurrence.
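The schedule produced by 'every' and 'until' can be worked out by hand or with a few lines of code. This Python sketch (occurrences is a hypothetical helper) lists the start times of a repeating job, assuming the first occurrence starts at the family start time and that no occurrence starts on or after the 'until' time.

```python
from datetime import datetime, timedelta


def occurrences(start, every, until="23:59"):
    """List the HH:MM start times of a job repeated every `every` minutes."""
    fmt = "%H:%M"
    t = datetime.strptime(start, fmt)
    end = datetime.strptime(until, fmt)
    step = timedelta(minutes=int(every))
    times = []
    while t < end:                 # never on or after the 'until' time
        times.append(t.strftime(fmt))
        t += step
    return times
```

For every=>'30' and until=>'23:00' with a 02:00 start, this gives 02:00, 02:30 and so on up to 22:30.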

Now, let's get back to our discussion of external dependencies from example 2. I said that an external dependency may only be specified on the first line of the file or the first line of a group of jobs. This way of specifying a family is not allowed by TaskForest:


   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', calendar => 'Weekdays'
02 |
03 |            LOGS::J_ROTATE_LOGS()
04 |
05 | J_RESOLVE_DNS()       Delete_Old_Logs()
06 |
07 |            REPORTS::J_WEB_REPORTS()  # BAD!
08 |
09 |    J_EMAIL_WEB_RPT_DONE()  # send me a notification
10 |
   +-------------------------------------------------------
  

With a few minor modifications, the family can be specified correctly:


   +-------------------------------------------------------
01 |start => '02:00', tz => 'GMT', calendar => 'Weekdays'
02 |
03 |            LOGS::J_ROTATE_LOGS()
04 |
05 | J_RESOLVE_DNS()       Delete_Old_Logs()
06 |
07 |--------------------------------------------
08 |
09 | J_RESOLVE_DNS() Delete_Old_Logs() REPORTS::J_WEB_REPORTS()
10 |  
11 |   J_EMAIL_WEB_RPT_DONE()  # send me a notification
12 |
   +-------------------------------------------------------
  

We've moved the external dependency to the first line of its own section. Now J_EMAIL_WEB_RPT_DONE relies on all 3 jobs: 2 that run in this Family, and one from the REPORTS family.
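The layout rules covered so far (groups separated by lines of dashes, '#' comments, and external dependencies only on the first line of a group) can be checked mechanically. The Python sketch below is an illustration, not TaskForest's parser: parse_family_body and its regular expression are assumptions of this sketch, and it expects the lines that follow the family's first line, without the illustrative line numbers and pipes.

```python
import re

# Optional "FAMILY::" prefix, a job name, then a parenthesised option list.
JOB_RE = re.compile(r"(?:(\w+)::)?(\w+)\s*\([^)]*\)")


def parse_family_body(lines):
    """Return a list of groups; each group is a list of rows, and each
    row is a list of (external_family_or_None, job_name) tuples."""
    groups, current = [], []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()      # drop comments
        if not line:
            continue
        if set(line) == {"-"}:                   # a line of dashes only
            if current:
                groups.append(current)
            current = []
            continue
        row = [(fam or None, job) for fam, job in JOB_RE.findall(line)]
        if current and any(fam for fam, _ in row):
            raise ValueError(
                "external dependency not on first line of a group: "
                + raw.strip())
        if row:
            current.append(row)
    if current:
        groups.append(current)
    return groups
```

Reading the examples above, the jobs on each row appear to wait for every job on the previous row of the same group, so the returned rows can be walked in order to recover the dependencies.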

EXAMPLE 5 - Tokens

A token is a dependency. It is something that a job must 'possess' before it can run, if that job needs that token. You can create different types of tokens, giving each type a common name. You can also specify how many instances of tokens of each type are to exist. For example, if the configuration file contained the following lines:


   +-------------------------------------------------------
01 | ...   
02 | <token T>
03 |   number = 1
04 | </token>
05 | <token U>
06 |   number = 2
07 | </token>
08 | ...
   +-------------------------------------------------------

...it means that there are two types of tokens: 'T' and 'U'. There is only one instance of token type 'T', and two of type 'U'.

Given the above configuration, if your Family file looked as follows:


   +-------------------------------------------------------
01 |start => '00:00', tz => 'GMT', days => 'Mon,Wed,Fri'
02 |
03 | J1( token => 'T')  J2 ( token => 'T' ) J3()
04 |
05 |-------------------------------------------------------
06 |
07 | J6(token => 'U') J5(token => 'U') J4(token => 'U')
08 | J8(token => 'T,U')
09 | 
   +-------------------------------------------------------

...then that means that jobs J1 and J2 both need a token of type 'T' to run. But there's only one instance of token T, so J1 and J2 cannot both run at the same time (even though they would, if they didn't rely on tokens). The system will sort jobs alphabetically by name and choose the first in the list. In other words, in this case, J1 will run first and J2 will only run after J1 completes (if no other job has taken the token first). To be more accurate, J1 and J3 will run simultaneously, since J3 does not need any tokens.

To be even more accurate, J1, J3, J4 and J5 will run simultaneously. This is because J4, J5 and J6 all rely on token U, but there are only 2 instances of token U. Even though J6 appears on the line before J5 and J4, the system will choose J4 and J5 first, because they appear first in alphabetical order, and J6 will run after one of the other two has completed.

Because the system always chooses the job with the smallest name (alphabetically), it is possible to experience 'resource starvation' - where a job with a 'larger' name could never get an opportunity to run, because there are too many other jobs with smaller names that get to run first by virtue of their names. Future versions of TaskForest will implement heuristics to prevent resource starvation.

Note that J8 relies on two tokens: T and U. It will only run when it can acquire one token of each type. If it can acquire one, but not the other, it will release the first and try to acquire both at a later time.
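The selection behaviour described above (alphabetical order, all-or-nothing token acquisition) can be modelled in a few lines. This Python sketch is a simplified model, not TaskForest's code: pick_runnable is a hypothetical function that takes the jobs that are otherwise ready to run and the number of free instances of each token.

```python
def pick_runnable(ready_jobs, token_counts):
    """Choose which ready jobs may run now.

    ready_jobs   -- {job_name: [names of tokens the job needs]}
    token_counts -- {token_name: number of free instances}
    """
    free = dict(token_counts)
    chosen = []
    for name in sorted(ready_jobs):              # smallest name first
        needed = ready_jobs[name]
        # Take all required tokens or none at all.
        if all(free.get(tok, 0) >= needed.count(tok) for tok in needed):
            for tok in needed:
                free[tok] -= 1
            chosen.append(name)
    return chosen
```

Fed the six jobs from lines 03 and 07 of the family above with one 'T' and two 'U' tokens, the model picks J1, J3, J4 and J5, matching the description.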

Finally, tokens can also be used to control the load on the machine on which taskforest is running. If you've got several independent jobs that don't depend on each other, but which use a fair amount of resources, you can have all the jobs use the same token. Then you can tweak the maximum number of instances of that token to a value that maximizes the number of simultaneous jobs without putting too much strain on the server.

EXAMPLE 6 - Calendars

A calendar is a set of rules that defines on what days a job may run. The rules that make up a calendar are specified in the configuration file and the calendars themselves are associated with a Family in the Family file.

Calendar names should contain only the characters shown below:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-

Let's start with the configuration file. You can have zero or more calendars in the configuration file. Each calendar may be associated with zero or more Families. A calendar consists of one or more rules. All rules belonging to a calendar are consulted in order to determine whether or not the Family should run today. The rules are consulted in the order in which they are specified in the configuration file. Rules may contradict each other. A rule may not conclusively determine whether or not the Family should run today. The last rule that determines whether or not a Family should run today will override any earlier rules. Rules are case insensitive.

Let's see some examples. First, let's see how a Family file specifies a calendar:

Example 1 - One day only
   +-------------------------------------------------------
01 |start => '00:00', tz => 'GMT', calendar => 'NY2010'
02 |
   | ...

You can see here that instead of the days => '...', we have calendar => 'NY2010'. This tells the system that this family will rely on the calendar named 'NY2010.' This calendar is defined in its own file. The file name should be the same as the calendar name. The directory in which the file exists is specified by the calendar_dir option:

   +-------------------------------------------------------
   | ...
   | calendar_dir = "/foo/bar/calendars"
   | ...

The file /foo/bar/calendars/NY2010 looks like this (the # symbols and anything after them are comments, just like in the configuration file):

   +-------------------------------------------------------
01 | # NY2010
02 | + 2010/01/01  # Only valid on New Years Day, 2010
   +-------------------------------------------------------

This calendar only has one rule. It is "+ 2010/01/01". The '+' in the rule says that if the date specified in this rule matches, then the Family must run on that day. The '+' is optional. This calendar will allow the Family that uses it to run on Jan 1, 2010, and on no other day. Dates in rules should be in the YYYY/MM/DD format.

Example 2 - One month only

What if we want a job to run on every day in November 2010? You can use a rule like this:

# Nov2010
        
 2010/11/*

The '*' in the DD part of the date is like a wildcard. It means that the DD part of the rule will match any number. In other words, if today's date is November DD, 2010, then this rule will match, for all values of DD. Note that the '+' is missing here. That's ok. It's optional, and if missing, the system will assume that you meant to put in a '+'.

Example 3 - Rejecting days

If, on the other hand, you wanted the Families that use this calendar to run on all days in 2010 except all of November, you would use the '-' sign:

# All_But_Nov2010
        
 + 2010/*/*
 - 2010/11/*

The first line matches all days in 2010. The MM and DD parts are both wildcards. The optional plus tells the system that if the date matches this pattern, it should count as a valid run date. The next line, on the other hand, adds an exception to this rule: If the date falls within November 2010, the date should not be a valid run date - note the '-' sign that tells the system to exclude this date.
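The 'last matching rule wins' behaviour can be illustrated with a short evaluator. This Python sketch handles only plain YYYY/MM/DD rules with '*' wildcards and an optional '+' or '-'; should_run and rule_matches are hypothetical names, and a date that no rule matches is treated as a day on which the Family does not run.

```python
from datetime import date


def rule_matches(pattern, day):
    """Match a YYYY/MM/DD pattern, where any part may be '*'."""
    parts = pattern.split("/")
    actual = (day.year, day.month, day.day)
    return len(parts) == 3 and all(
        p == "*" or int(p) == a for p, a in zip(parts, actual))


def should_run(rules, day):
    """Apply the rules in order; the last rule that matches decides."""
    verdict = False
    for rule in rules:
        text = rule.split("#", 1)[0].strip()     # drop comments
        if not text:
            continue
        sign = "+"                               # '+' is the default
        if text[0] in "+-":
            sign, text = text[0], text[1:].strip()
        if rule_matches(text, day):
            verdict = (sign == "+")
    return verdict
```

With the All_But_Nov2010 rules, a day in October 2010 is accepted by the first rule, while a day in November 2010 is accepted by the first rule and then rejected by the second.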

Example 4 - Daily Calendar

To specify a daily calendar, use this:

# Daily

*/*/*

Of course, you don't have to name the calendar 'Daily.' You can name it whatever you want. Using this calendar is equivalent to having days=>'Mon,Tue,Wed,Thu,Fri,Sat,Sun' in the Family file.

Example 5 - Specifying days of the week

You can also specify rules that specify days of the week with a qualifier. For example, to run a job on the first Monday of every month in 2009, you should use a rule like this:

# FirstMon09

+ first Mon 2009/*

A couple of points to mention here: First, the '+' is optional here as well. Second, the date part of the rule only has the year and the month (in YYYY/MM format). When you use a qualifier like 'first,' it makes no sense to say things like 'The first Monday of the 1st of every month.'

The word 'first' in line 2 above is called a qualifier. Valid qualifiers are:

The qualifiers 'first last', 'last last' and 'every last' also work in version 1.25, but they may stop working in a future version, so don't get into the habit of using those qualifiers.

Unlike the 'days' specifier in the family file, the days of the week in calendar rules may be spelled out. Only the first 3 characters are significant.
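To show what a qualifier means for a given date, here is a Python sketch that tests the 'qualifier plus day of the week' part of a rule. It is an illustration only: weekday_rule_matches is a hypothetical name, it covers just the qualifiers that appear in this document (first, second, fourth, last, every), and matching the YYYY/MM part of the rule is left to the caller.

```python
import calendar
from datetime import date

ORDINALS = {"first": 1, "second": 2, "fourth": 4}
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def weekday_rule_matches(qualifier, day_name, day):
    """Is `day` the <qualifier> <day_name> of its month?"""
    # Only the first 3 characters of the day name are significant.
    if DAYS[day.weekday()] != day_name[:3].lower():
        return False
    qualifier = qualifier.lower()
    if qualifier == "every":
        return True
    if qualifier == "last":
        days_in_month = calendar.monthrange(day.year, day.month)[1]
        return day.day + 7 > days_in_month       # no later one this month
    return (day.day - 1) // 7 + 1 == ORDINALS[qualifier]
```

For the FirstMon09 calendar, the rule matches 2009/01/05 but not 2009/01/12.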

Calendar Recipes

The following 'recipes' show you some useful calendar rules:

# ############################################################
# Run every day
#
+ */*/*
# ############################################################
# Run on weekdays only
#
*/*/*
- every Saturday */*
- every Sunday */*

# You could also replace the 3 lines above
# with 5 '+' lines, one for each weekday.
# ############################################################
# 'Thanksgiving Day' observed in the U.S.
#
fourth Thursday */11
# ############################################################
# 'Thanksgiving Day' observed in Canada
#
second Monday */10
# ############################################################
# 'Memorial Day' observed in the U.S.
#
last Monday */5
# ############################################################
# The day Daylight Saving Time starts in the U.S.
#
second Sun */03  # this rule is valid for dates
                 # after 2007, but not earlier
# ############################################################
# The day Daylight Saving Time ends in the U.S.
#
first Sun */11   # this rule is valid for dates
                 # after 2007, but not earlier
# ############################################################
# The day Daylight Saving Time starts in Europe
#
last Sun */03    # tested for 2009
# ############################################################
# The day Daylight Saving Time ends in Europe
#
last Sun */10    # tested for 2009

EXAMPLE 7 - Automatic Retries

TaskForest can be configured so that when a job fails, it will automatically retry running the job. There is a system-wide option called num_retries that specifies how many times the job will be retried. The retry_sleep option specifies how many seconds the system will wait before trying to rerun the job.

The num_retries and retry_sleep options may optionally be specified for each job as well. For example, if the configuration file contains this...

# This is the number of times to automatically
# retry running a job that fails 
num_retries              = 1

# Wait these many seconds before automatically
# retrying running a job that fails 
retry_sleep              = 300

...then if any job fails, the run wrapper will sleep for 300 seconds and then retry the job once. If the retry fails as well, then the job will be considered to have failed. If the retry succeeds, then the job will be considered to have run successfully. During the 300-second sleep period, and during the retries, the official status of the job will still be 'Running.'
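The retry behaviour just described amounts to a simple loop. The following Python sketch is not the run_with_log implementation; run_with_retries is a hypothetical function that takes the job as a callable returning an exit code, and the sleep function is a parameter so the loop can be exercised without waiting.

```python
import time


def run_with_retries(job, num_retries=1, retry_sleep=300, sleep=time.sleep):
    """Run job(); on a non-zero exit code, sleep and retry.

    Returns the exit code of the last attempt.
    """
    attempt = 0
    while True:
        rc = job()
        if rc == 0 or attempt >= num_retries:
            return rc              # success, or retries exhausted
        attempt += 1
        sleep(retry_sleep)         # wait before the next attempt
```

With num_retries = 1, a job that fails once and then succeeds returns 0 after a single sleep, while a job that fails twice in a row is reported as failed.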

The responsibility of implementing the auto-retries falls on the run wrapper. Even though TaskForest ships with two run wrappers, you really should use run_with_log and not run. When a job is being retried, run_with_log will note it in the log file like this:

*****************************************************************
Start Time:   Mon Mar 22 17:57:21 2010
Family:       RETRY
Job:          J_Retry
Job File:     J_Retry
Log Dir:      logs/20090503
Script Dir:   jobs
Pid File:     logs/20090503/RETRY.J_Retry.pid
Success File: logs/20090503/RETRY.J_Retry.0
Failure File: logs/20090503/RETRY.J_Retry.1
Out/Err File: logs/20090503/RETRY.J_Retry.19258.1269298641.stdout

*****************************************************************

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Current Time: Mon Mar 22 17:57:21 2010
!! Exit Code:    256
!! 
!! Job failed.  Sleeping 2 seconds and then retrying (retry 1 of 1).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

*****************************************************************
Start Time: Mon Mar 22 17:57:21 2010
End Time:   Mon Mar 22 17:57:23 2010
Duration:   2 seconds
Exit Code:  0
*****************************************************************

You can see that in this case, TaskForest had been instructed to sleep for 2 seconds before retrying, and that there was to be only one retry.

You can also override these configuration settings for individual jobs. Let's say you have your configuration file set up as shown above, with one retry after 300 seconds. Now suppose you want job J_Retry to be retried 10 times, with a two-minute sleep between retries. You could specify the overrides directly in the family file as follows:


   +-------------------------------------------------------
01 | ...
02 |
03 | J_Retry (num_retries => 10, retry_sleep => 120)
04 |
05 | ...
   +-------------------------------------------------------
  

This local specification of ten retries spaced two minutes apart will override what you have specified in the configuration file. You can override these configuration options on as many jobs as you wish.

EXAMPLE 8 - Emails

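You can configure TaskForest to send you an email in any of these three situations: when a job fails, when a job fails but is about to be retried automatically, and when a job succeeds after being retried one or more times.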

You can configure this feature system-wide via the configuration file, and you can also override it for individual jobs in the Family files (the same way retries are configured). You can also completely configure the emails going out in each of the three cases, from the SMTP envelope information to MIME headers to body text. Again, this can be configured across the entire system and overridden for individual jobs and families. It really is a very powerful system that, with the proper email filters, can make managing your job workflow very easy. Let's see how it works.

In order to be able to send emails, you need to provide TaskForest some information in the configuration file:

# This is the SMTP server that will be used to send 
# emails out when a job fails, for example
smtp_server              = "localhost"

# The default SMTP port is 25.
smtp_port                = 25

# In a production environment this should be 60 or 120
smtp_timeout             = 60

# This is the SMTP envelope sender 
# (the text after "MAIL FROM:")
smtp_sender                = "user1@example.com"

# This is the email address that appears in the From: 
# mail header
mail_from                = "user1@example.com"

# If a user replies to a received email, the reply
# will go to this address instead of the From: address. 
# This address is set in the Reply-To mail header.
mail_reply_to            = "user2@example.com"

# This is the address to which bounces will be sent if 
# they occur at the SMTP server (as opposed to the 
# receiving Mail Transfer Agent).
mail_return_path         = "user3@example.com"

# This is the directory that stores the contents of
# the emails that are sent by the system. 
instructions_dir         = "instructions"

Then there are the following three settings that control who receives the emails. They can be set system-wide in the configuration file:

# When a job fails, emails are sent to this address
email                    = "test@example.com"

# When a job fails, but is being automatically retried,
# emails are sent to this address, as opposed to the 
# one stored in the 'email' setting.  If no_retry_email 
# is set, then no email will be sent in this case
retry_email              = "test2@example.com"

# When a job fails, is automatically retried one or more 
# times and then succeeds, emails are sent to this 
# address, as opposed to any of the others.  If 
# no_retry_success_email is set, then no email will be sent
# in this case.
retry_success_email      = "test3@example.com"

Given the setup shown above, the email generated when a job is being retried is shown below. What you see below is a transcript of an SMTP session between TaskForest and a fake SMTP server used for testing. Lines that start with 'S: >' denote text sent from the SMTP server to the client (TaskForest). Lines that start with 'C: <' denote text sent from the SMTP client (TaskForest) to the SMTP server.

S: > 200 OK TaskForest Fake SMTP Server
C: < EHLO user1@example.com
S: > 200 OK
C: < MAIL FROM:<user1@example.com>
S: > 200 OK
C: < RCPT TO:<test2@example.com>
S: > 200 OK
C: < DATA
S: > 200 OK
C: < From: user1@example.com
C: < Return-Path: user3@example.com
C: < Reply-To: user2@example.com
C: < To: test2@example.com
C: < Subject: RETRY RETRY::J_Retry
C: < 
C: < This is the TaskForest system at your_machine_name
C: < ------------------------------------------------------
C: < 
C: < 
C: < The following job failed and will be rerun automatically.
C: < 
C: < Family:         RETRY
C: < Job:            J_Retry
C: < Exit Code:      256
C: < Retry After:    2 seconds
C: < No. of Retries: 1 of 1 
C: < 
C: < Instructions that apply to all jobs in the Family named RETRY.
C: < 
C: < Instructions that apply to the jobs named J_Retry.
C: < 
C: < 
C: < 
C: < ------------------------------------------------------------
C: < For help, please see http://www.taskforest.com/
C: < .
S: > 200 OK
C: < QUIT
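The transcript shows the difference between the envelope sender (smtp_sender, used in MAIL FROM) and the addresses in the mail headers. As an illustration of that distinction, and not of how TaskForest itself sends mail, here is a Python sketch using the standard smtplib and email modules. The cfg dictionary and the function names are assumptions of this sketch; its keys mirror the configuration options shown earlier.

```python
import smtplib
from email.message import EmailMessage


def build_message(cfg, to_addr, subject, body):
    """Build the headers and body; the envelope is handled in send()."""
    msg = EmailMessage()
    msg["From"] = cfg["mail_from"]
    msg["Return-Path"] = cfg["mail_return_path"]
    msg["Reply-To"] = cfg["mail_reply_to"]
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send(cfg, msg):
    """Deliver msg, using smtp_sender as the SMTP envelope sender."""
    with smtplib.SMTP(cfg["smtp_server"], cfg["smtp_port"],
                      timeout=cfg["smtp_timeout"]) as smtp:
        smtp.send_message(msg, from_addr=cfg["smtp_sender"],
                          to_addrs=[str(msg["To"])])
```

Only send() talks to the network; build_message() can be inspected on its own to see which setting ends up in which header.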

What's left to be explained is how TaskForest chose the email's subject and the body of the email. The subject of the email will contain the Family name and Job name preceded by 'RETRY', 'FAIL' or 'RETRY_SUCCESS', depending on whether the job failed and is about to be retried (after the sleep time), or failed for the final time, or succeeded after failing and being retried one or more times.

As for the body, TaskForest concatenates the contents of several files found in instructions_dir. If a file does not exist, TaskForest ignores it and moves on to the next one. As it retrieves the contents of each file, TaskForest does some simple substitutions, replacing each placeholder found in the file's contents with the value of that variable in the run wrapper's environment. TaskForest passes all of the environment variables available to the taskforest program on to the run wrapper (run_with_log) and also adds the following variables to the environment:

TASKFOREST_FAMILY_NAME
This is the name of the family in which this job was run.
TASKFOREST_JOB_NAME
This is the name of the job that was run.
TASKFOREST_LOG_DIR
This is the full path of the TaskForest log directory.
TASKFOREST_JOB_DIR
This is the full path of the TaskForest job directory.
TASKFOREST_PID_FILE
This is the name of the pid file that's used internally by TaskForest.
TASKFOREST_SUCCESS_FILE
This is the name of the file that's created if the job succeeded.
TASKFOREST_FAILURE_FILE
This is the name of the file that's created if the job failed.
TASKFOREST_UNIQUE_ID
This is an internal identifier used by TaskForest to refer to the job.
TASKFOREST_NUM_RETRIES
This is the value of the num_retries configuration variable.
TASKFOREST_RETRY_SLEEP
This is the value of the retry_sleep configuration variable.
TASKFOREST_EMAIL
This is the value of the email configuration variable.
TASKFOREST_RETRY_EMAIL
This is the value of the retry_email configuration variable.
TASKFOREST_NO_RETRY_EMAIL
This is the value of the no_retry_email configuration variable.
TASKFOREST_INSTRUCTIONS_DIR
This is the value of the instructions_dir configuration variable.
TASKFOREST_SMTP_SERVER
This is the value of the smtp_server configuration variable.
TASKFOREST_SMTP_PORT
This is the value of the smtp_port configuration variable.
TASKFOREST_SMTP_SENDER
This is the value of the smtp_sender configuration variable.
TASKFOREST_MAIL_FROM
This is the value of the mail_from configuration variable.
TASKFOREST_MAIL_REPLY_TO
This is the value of the mail_reply_to configuration variable.
TASKFOREST_MAIL_RETURN_PATH
This is the value of the mail_return_path configuration variable.
TASKFOREST_SMTP_TIMEOUT
This is the value of the smtp_timeout configuration variable.
TASKFOREST_RETRY_SUCCESS_EMAIL
This is the value of the retry_success_email configuration variable.
TASKFOREST_NO_RETRY_SUCCESS_EMAIL
This is the value of the no_retry_success_email configuration variable.
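These variables can be read like any other environment variable. The Python sketch below assumes that a process started by the run wrapper inherits the wrapper's environment, which the text above does not state explicitly; describe_context is a hypothetical helper.

```python
import os


def describe_context(env=os.environ):
    """Summarise the TaskForest variables found in env."""
    family = env.get("TASKFOREST_FAMILY_NAME", "?")
    job = env.get("TASKFOREST_JOB_NAME", "?")
    retries = env.get("TASKFOREST_NUM_RETRIES", "?")
    return f"{family}::{job} (num_retries={retries})"
```

Passing a plain dictionary instead of os.environ makes the helper easy to try outside of TaskForest.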

Let $instructions_dir refer to the value of the instructions_dir option. Let's call the reason for the email $reason. Its value is one of 'RETRY', 'FAIL', or 'RETRY_SUCCESS' (as described above). Let's also refer to the job name as $job_name, and the family name as $family_name. TaskForest will look for the following files in the following order, inserting their contents into the email body.

$instructions_dir/HEADER
This is a header that will show up in every email. In the example that generated the above email, the contents of this file were:

    This is the TaskForest system at $HOSTNAME
    ------------------------------------------------------

$instructions_dir/HEADER.$reason
This is a header that will be displayed just for that reason. In other words, it will be one of three files: HEADER.RETRY, HEADER.FAIL, and HEADER.RETRY_SUCCESS. In the example that generated the above email, the name of this file was HEADER.RETRY and its contents were:

    The following job failed and will be rerun automatically.

    Family:         $TASKFOREST_FAMILY_NAME
    Job:            $TASKFOREST_JOB_NAME
    Exit Code:      $TASKFOREST_RC
    Retry After:    $TASKFOREST_RETRY_SLEEP seconds
    No. of Retries: $TASKFOREST_RETRY of $TASKFOREST_NUM_RETRIES

$instructions_dir/FAMILY.$family_name.$reason
This is a file that can contain specific instructions for that family and reason. In the example that generated the above email, the name of this file was FAMILY.RETRY.retry and its contents were:

    Instructions that apply to all jobs in the Family named RETRY.

$instructions_dir/JOB.$job_name.$reason
This is a file that can contain specific instructions for that job and reason. In the example that generated the above email, the name of this file was JOB.J_Retry.retry and its contents were:

    Instructions that apply to the jobs named J_Retry.

$instructions_dir/FOOTER.$reason
This is a footer that will be displayed just for that reason. In other words, it will be one of three files: FOOTER.RETRY, FOOTER.FAIL, and FOOTER.RETRY_SUCCESS. In the example that generated the above email, the name of this file would have been FOOTER.RETRY, but that file did not exist, so TaskForest skipped over it.

$instructions_dir/FOOTER
This is a footer that will show up in every email. In the example that generated the above email, the contents of this file were:

    ------------------------------------------------------------
    For help, please see http://www.taskforest.com/
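The way the subject and body are put together can be sketched as follows. This is a Python approximation of the behaviour described in this section, not TaskForest's code: build_subject and build_body are hypothetical names, placeholders are assumed to have the form $NAME, and a placeholder with no matching variable is left untouched, which is a choice of this sketch.

```python
import os
import re


def build_subject(reason, family, job):
    """For example: 'RETRY RETRY::J_Retry'."""
    return f"{reason} {family}::{job}"


def build_body(instructions_dir, reason, family, job, env):
    """Concatenate the instruction files in order, skipping missing ones."""
    names = [
        "HEADER",
        f"HEADER.{reason}",
        f"FAMILY.{family}.{reason}",
        f"JOB.{job}.{reason}",
        f"FOOTER.{reason}",
        "FOOTER",
    ]
    parts = []
    for name in names:
        path = os.path.join(instructions_dir, name)
        if not os.path.isfile(path):
            continue                             # file absent: move on
        with open(path) as fh:
            text = fh.read()
        # Replace $NAME with env[NAME] when that variable is defined.
        parts.append(re.sub(
            r"\$(\w+)",
            lambda m: env.get(m.group(1), m.group(0)),
            text))
    return "".join(parts)
```

Here env stands for the run wrapper's environment, for example a copy of os.environ with the TASKFOREST_ variables added.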