PyCBC’s workflow module configuration file(s) and command line interface

Introduction

The workflow module is designed to be flexible, so that users can build the pipeline they want to run. One way it does this is through a configuration file, sometimes a large one, that serves two purposes:

Tell the workflow planner how to run the various stages specified in the top-level workflow script.
Specify, as completely as possible, all command line options that will be sent to every executable that is run in the pipeline. Tags are used to identify options sent to a subset of jobs, as described more fully later.

The idea is that the only input that a user needs is the configuration file. However, it may often be useful for certain options, such as user-specific locations and analysis start/end times, to be supplied on the command line. To allow this, configuration file options can be supplied, or overridden, on the command line.

Ihope used similar .ini files in every analysis. However, these files grew huge, and it became difficult for a novice to understand which options could be safely changed and which to leave well alone. It was also difficult to see which options were going to which job: inspiral.c, for example, looks for options in more than 10 places, and it isn’t clear where those places are.

To attempt to solve this, the workflow module has a number of features:

Multiple configuration files: You can now supply multiple configuration files, for example to keep injection generation parameters, which a user may want to change often, in a file of their own. It is even possible to have sections split across files, so one could have a configuration file of key options, ones that might be changed, and another file of “leave alone” options.
Direct command line options: In the workflow module, command line options are not drawn from obscure sections; they correspond one-to-one with the executables. Options in the [inspiral] section will be sent to the inspiral executable and only to the inspiral executable.
Combined sections: To avoid specifying common options repeatedly, sections can be combined. If you have two executables with a large set of shared options, you can specify an [exe1&exe2] section to provide the shared options and [exe1] and [exe2] sections to supply the individual options. One can also use [sharedoptions-NAME] sections to achieve the same thing.
Interpolation: As in configparser 3.0+, an option can be specified in one place and an interpolation string used to provide it in other places. This is described below.
Tags/subsections: In some cases options may only need to be sent to certain jobs, or you may want to call individual modules multiple times and do different things. To accommodate this, the workflow module includes a tagging (or subsections) system to provide options to only a subset of jobs, or to a specific call to a module. For example, options in [inspiral] are sent to all inspiral jobs, while options in [inspiral-h1] are sent only to inspiral jobs running on h1 data.
Executable expanding: The workflow module includes macros to enable the user to more easily specify executable paths. For example, ${which:exe1} will be expanded automatically to the location of exe1 in the user’s path.

Most of these features will be applied directly after reading in the configuration file. The workflow module will then dump the parser configuration back to disk so the user/reviewer can more easily see what the analysis is actually doing.
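As a rough illustration of sections split across files, the sketch below uses only the Python 3 standard-library configparser, not the workflow module itself. The two texts stand in for two hypothetical .ini files, and the option names are invented for the example.

```python
import configparser

# Hypothetical contents of two configuration files (names and values invented).
KEY_OPTIONS = """
[inspiral]
low-frequency-cutoff = 40
"""

LEAVE_ALONE_OPTIONS = """
[inspiral]
segment-length = 256

[tmpltbank]
minimal-match = 0.97
"""


def merge_sources(*texts):
    """Read several configuration texts into one parser, in order."""
    parser = configparser.ConfigParser(interpolation=None)
    for text in texts:
        # Sections repeated across *different* sources are merged.
        parser.read_string(text)
    return parser


merged = merge_sources(KEY_OPTIONS, LEAVE_ALONE_OPTIONS)
inspiral_options = dict(merged["inspiral"])
print(inspiral_options)
```

Both files contribute to the single [inspiral] section, so the "key options" file can stay short.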

In this page we describe the layout of the workflow module .ini configuration file and what the various sections mean, how they are used, and how an ini file should be set out.

NOTE: A number of the features that have been put in here are available in the Python 3.X version of ConfigParser. That version also has a duplicate option check. In Python 2.X, if I do:

[inspiral]
detect-gravitational-waves = True
LOTS OF GARBAGE
detect-gravitational-waves = False

it will set the value to False and proceed happily. THERE IS NO WAY TO CATCH THIS! There is a Python 2.X backport of this new version; it is available on PyPI, but not in MacPorts. It would be good to pick up this new version and have some of these features available natively.
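For comparison, here is a minimal sketch of the duplicate option check in the Python 3 standard-library configparser. It says nothing about how the workflow module itself handles duplicates; the helper name is made up for the example.

```python
import configparser

INI_TEXT = """
[inspiral]
detect-gravitational-waves = True
detect-gravitational-waves = False
"""


def has_duplicate_option(text):
    """Return True if the text repeats an option within one section."""
    # strict=True (the Python 3 default) rejects duplicates in a single source.
    parser = configparser.ConfigParser(strict=True)
    try:
        parser.read_string(text)
    except configparser.DuplicateOptionError:
        return True
    return False


duplicate_found = has_duplicate_option(INI_TEXT)
print(duplicate_found)
```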

Supplying the config file on the command line and overriding options

The workflow module only uses two command line options, one to specify the configuration files and one to specify any overriding options. First the config files:

--config-files FILE1 [FILE2 FILE3 ...]

where FILEX corresponds to the configuration files. Second the overriding options:

--config-overrides section1:option1:value1 [section2:option2:value2 ...]

These specify options that should be added to the config files or, if already present, overwritten. The section, option and value refer to the section, option and value to be added. If the section doesn’t already exist in the configuration file it will be added. In some cases you will want to supply an option without a value. This can be done with either

section:option:

or

section:option

Example

Here is an example of running a workflow from the command line:

python weekly_ahope.py --config-files weekly_ahope.ini pipedown.ini inj.ini --config-overrides workflow:start-time:${GPS_START_TIME} workflow:end-time:${GPS_END_TIME}

Here the analysis start and end times are being overridden with values from the user’s environment.
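The override syntax can be sketched in a few lines of Python. This is an illustration of the section:option:value format described above, not the workflow module's own parser. The helper names parse_override and apply_overrides, the GPS time and the option names are all hypothetical. The sketch assumes that only the first two colons are separators, so that values such as channel names may themselves contain colons.

```python
import configparser


def parse_override(spec):
    """Split 'section:option[:value]' into a (section, option, value) tuple."""
    # Split at most twice so that values containing ':' survive intact.
    parts = spec.split(":", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"Expected section:option[:value], got {spec!r}")
    # 'section:option' and 'section:option:' both give an empty value.
    value = parts[2] if len(parts) == 3 else ""
    return parts[0], parts[1], value


def apply_overrides(parser, specs):
    """Add or overwrite options, creating sections that do not yet exist."""
    for spec in specs:
        section, option, value = parse_override(spec)
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, option, value)


parser = configparser.ConfigParser(interpolation=None)
parser.read_string("[workflow]\nstart-time = 0\n")
apply_overrides(
    parser,
    [
        "workflow:start-time:966384015",
        "workflow:h1-channel-name:H1:LDAS-STRAIN",
        "inspiral:verbose",
    ],
)
print(dict(parser["workflow"]), dict(parser["inspiral"]))
```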

Global options - the [workflow] section

The [workflow] section and [workflow-XXX] subsections should appear at the top of a configuration file.

The [workflow] section and [workflow-XXX] subsections of the configuration file are used to store options that the workflow module uses to make decisions on what paths to take when deciding how to construct the workflow. Options in here are not going to end up supplied to any executable on the command line.

The [workflow] section must contain two entries

start-time=START
end-time=END

which are used to tell the workflow that it is only to consider times in [START,END) for analysis. These will often be supplied as override options directly on the command line.

Another optional entry in the [workflow] section, which we recommend using, is the

file-retention-level = all_files

entry. This can take one of 4 values: “all_files”, “all_triggers”, “merged_triggers” or “results”. These specify how many of the files produced during the workflow should be stored after the workflow finishes. With “all_files”, which is the default value, everything produced in the workflow will be stored. With “results”, only the critical result files are stored. “all_triggers” and “merged_triggers” store some subset of the full set of files. Defining whether a file should be stored under each of these levels is the job of the Executable class, which carries a current_retention_level attribute (one of executable.INTERMEDIATE_PRODUCT, executable.ALL_TRIGGERS, executable.MERGED_TRIGGERS or executable.FINAL_RESULT). When building workflows, one can set this attribute when creating executable instances to control the conditions under which a file is stored.

It is okay to store other important and widely used values in here. You might often see cases where channel names are given here as these are sent to a number of codes on the command line, and it is easier to refer to them here, at the very top of the .ini file, so that the user can more easily see and change such values.

[workflow-XXX] subsections

Each module that you use when setting up your workflow will need a [workflow-XXX] subsection. The name of the subsection and the particular options needed can be found in each module’s documentation page.

If you want to call any module more than once you will need to use the workflow module’s tagging system. As an example, let’s say I want to call the template bank module twice, once to set up a pycbc template bank and once to set up an SVD template bank. I could then create [workflow-tmpltbank-pycbc] and [workflow-tmpltbank-svd] sections to provide options that are unique to each tag. I could also use [exename-pycbc] and [exename-svd] sections if the two methods use the same executable but need different options. In both cases options in [workflow-tmpltbank] and [exename] would be used for both tags. (If the two codes were using different executables then [exename1] and [exename2] sections would suffice.)

An example of where this section might be used is in the template bank stage where one can either run with a pre-generated bank or generate banks within the workflow. This information would be provided in this section.

Requirements

The [workflow] section in every .ini file should contain a link to this page to see what options are needed.

The [workflow-XXX] sections in every .ini file should start with a link to that module’s documentation to see what options/values are relevant for that section.

Example

Here is an example of the [workflow] section of a .ini file:

[workflow]
; https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/workflow/initialization.html
; provides details of how to set up a pycbc workflow configuration .ini file
h1-channel-name = H1:LDAS-STRAIN
l1-channel-name = L1:LDAS-STRAIN
;h2-channel-name = H2:LDAS-STRAIN
workflow-html-basedir = /home/spxiwh/public_html/workflow/development/weekly_ahope/test

[workflow-ifos]
; This is the list of ifos to analyse
h1 =
l1 =

[workflow-datafind]
; See https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/workflow/datafind.html
datafind-method = AT_RUNTIME_SINGLE_FRAMES
datafind-h1-frame-type = H1_LDAS_C02_L2
datafind-l1-frame-type = L1_LDAS_C02_L2
;datafind-h2-frame-type = H2_LDAS_C02_L2
datafind-check-segment-gaps = update_times
datafind-check-frames-exist = raise_error
datafind-check-segment-summary = no_test
; Set this to specify the datafind server. If this is not set the code will
; use the value in ${LIGO_DATAFIND_SERVER}
;datafind-ligo-datafind-server = ""

[workflow-segments]
; See https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/workflow/segments.html
; PIPEDOWN demands we use AT_RUNTIME
segments-method = AT_RUNTIME
segments-H1-science-name = H1:DMT-SCIENCE:4
segments-L1-science-name = L1:DMT-SCIENCE:4
;segments-V1-science-name = V1:ITF_SCIENCEMODE:6
segments-database-url = https://segdb.ligo.caltech.edu
segments-veto-definer-url = https://www.lsc-group.phys.uwm.edu/ligovirgo/cbc/public/segments/S6/H1L1V1-S6_CBC_LOWMASS_B_OFFLINE-937473702-0.xml
segments-veto-categories = 2,3,4
segments-minimum-segment-length = 2000
segments-generate-coincident-segments =

[workflow-tmpltbank]
; See https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/workflow/template_bank.html
tmpltbank-method=WORKFLOW_INDEPENDENT_IFOS
; Remove the option below to disable linking with matchedfilter_utils
tmpltbank-link-to-matchedfltr=

[workflow-injections]
; See https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/workflow/injections.html
injections-method=IN_WORKFLOW

[workflow-timeslides]
; See https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/workflow/time_slides.html
timeslides-method=AT_RUNTIME

Executable locations - the [executables] section

This section should contain the names of each of the executables that will be used in the workflow and their locations. The section might look something like:

[executables]
tmpltbank = /full/path/to/lalapps_tmpltbank
inspiral = /full/path/to/lalapps_inspiral

Note that one can give remote URLs here and the workflow generator will download the code to the workflow directory when it is run.

One can also give a URL indicating singularity as the scheme. This will indicate that the executable will be run within a singularity container, and therefore the executables would not be directly accessible from the head node:

[executables]
tmpltbank = https://github.com/full/url/to/lalapps_tmpltbank
inspiral = singularity:///full/path/to/lalapps_inspiral

executable macros

The following macros can be used only within this section to automatically fill in full path names:

which(executable)

In the following example tmpltbank’s value will be replaced with the output of which(lalapps_tmpltbank):

[executables]
tmpltbank = ${which:lalapps_tmpltbank}
inspiral = /full/path/to/lalapps_inspiral
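The behaviour of the which macro can be approximated with the standard library's shutil.which. The sketch below is an illustration only: the function name expand_which and the finder argument (there to make the lookup replaceable in tests) are hypothetical, and raising an error for an executable that cannot be found is an assumption rather than documented behaviour.

```python
import re
import shutil

# Matches values that consist solely of a ${which:NAME} macro.
WHICH_MACRO = re.compile(r"^\$\{which:(?P<name>[^}]+)\}$")


def expand_which(value, finder=shutil.which):
    """Replace '${which:NAME}' with the full path of NAME on the user's PATH.

    Any other value is returned unchanged.
    """
    match = WHICH_MACRO.match(value.strip())
    if match is None:
        return value
    path = finder(match.group("name"))
    if path is None:
        # Assumption: a missing executable is treated as an error.
        raise ValueError(f"Cannot find {match.group('name')!r} on PATH")
    return path


# A plain path passes through untouched.
print(expand_which("/full/path/to/lalapps_inspiral"))
```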

Requirements

All executables used in the workflow should be supplied in this section, and only in this section.

Example

Here is an example of the [executables] section of a pycbc workflow .ini file:

[executables]
tmpltbank         = /home/cbc/opt/s6b/ab577e4e5dad14e46fce511cffdb04917836ba36/bin/lalapps_tmpltbank
inspiral          = /home/cbc/opt/s6b/ab577e4e5dad14e46fce511cffdb04917836ba36/bin/lalapps_inspiral
inspinj           = /home/cbc/opt/s6b/ab577e4e5dad14e46fce511cffdb04917836ba36/bin/lalapps_inspinj
thinca            = ${which:ligolw_thinca}

Executable options

For each of the executables in the [executables] section, options for that executable should be listed under the section corresponding to that executable. Options in the [tmpltbank] section are sent to lalapps_tmpltbank, options in the [inspiral] section are sent to lalapps_inspiral etc.

It is possible to have more than one [tmpltbank] section; ConfigParser will simply combine them when reading in. Therefore important options and options that a novice user might want to change could be supplied in a first [tmpltbank] section near the top of the .ini file. This section could be commented accordingly. The module’s documentation page should also include instructions for each of the supported executables (usually the code’s own help message). Options that are not so important, and ones that a novice user would not want to change, could be placed in a second [tmpltbank] section at the bottom of the .ini file. This section would be labelled accordingly and also contain a link to documentation for that executable.
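With the Python 3 standard-library configparser, repeated sections within one file are only merged when strict checking is turned off; with the default strict=True the same text raises DuplicateSectionError. The sketch below shows the merging behaviour under that assumption. The option names are invented, and the workflow module's own parser may be configured differently.

```python
import configparser

INI_TEXT = """
[tmpltbank]
; options a novice user might want to change
minimal-match = 0.97

[inspiral]
snr-threshold = 5.5

[tmpltbank]
; options to leave alone
grid-spacing = Hexagonal
"""

# strict=False lets a section appear more than once in a single source.
parser = configparser.ConfigParser(strict=False, interpolation=None)
parser.read_string(INI_TEXT)
tmpltbank_options = dict(parser["tmpltbank"])
print(tmpltbank_options)
```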

Some options are only sent to a subset of the jobs using a given executable, for example those running on H1 data. Options like these are provided in sections labelled [executable_name-subset_tag], so for the H1 example the section would be called [tmpltbank-H1]. As well as obeying the rules above, these sections must clearly state which jobs will be sent those options. Tagged sections can also be used when calling a module multiple times with different tags. Nested tags are not supported (i.e. [tmpltbank-H1-pycbc]).
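One way to picture the tagging rule is a lookup that starts from the base section and adds the options of each tagged subsection that applies to a job. The sketch below is an illustration with a hypothetical helper, options_for_job. It assumes tags are matched exactly as written in the section name, and that a clash between the base and tagged sections is an error.

```python
import configparser

INI_TEXT = """
[tmpltbank]
minimal-match = 0.97

[tmpltbank-H1]
channel-name = H1:LDAS-STRAIN

[tmpltbank-L1]
channel-name = L1:LDAS-STRAIN
"""


def options_for_job(parser, exe_name, tags=()):
    """Collect options from [exe_name] plus each existing [exe_name-TAG]."""
    options = dict(parser[exe_name]) if parser.has_section(exe_name) else {}
    for tag in tags:
        section = f"{exe_name}-{tag}"
        if not parser.has_section(section):
            continue
        for key, value in parser[section].items():
            if key in options:
                # Assumption: the same option may not reach one job twice.
                raise ValueError(f"Option {key!r} given twice for {exe_name}")
            options[key] = value
    return options


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(INI_TEXT)
h1_options = options_for_job(parser, "tmpltbank", tags=["H1"])
print(h1_options)
```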

Some options need to be sent to more than one executable; for example, the channel names are used by any code that reads the data. Such options should be given in a section named by the combination of executable names separated by the & token. So options sent to tmpltbank and inspiral would go in a section called [tmpltbank&inspiral]. The code parsing the .ini file will automatically separate and duplicate these options in memory. All of the above rules apply. If I want to send an option to all tmpltbank and inspiral jobs running on H1 data, I might do something like [tmpltbank-H1&inspiral-H1].

If an option is given in more than one section that applies to the same job (e.g. if I specify --time-window 0.5 in [inspiral] and --time-window 1.0 in another [inspiral], [inspiral&tmpltbank] or [inspiral-H1] section), the code will throw an error. Specifying --time-window 1.0 in [inspiral-H1] and --time-window 0.5 in [inspiral-L1] is valid as long as the subset of H1 jobs and the subset of L1 jobs do not overlap.
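The splitting of combined sections, together with the duplicate check, can be sketched with the standard-library configparser. This is not the workflow module's implementation: split_combined_sections is a hypothetical helper, the option names are invented, and the check here only covers clashes between a combined section and its targets.

```python
import configparser


def split_combined_sections(parser):
    """Copy options from every [a&b] section into [a] and [b], then drop it."""
    for combined in [name for name in parser.sections() if "&" in name]:
        for target in combined.split("&"):
            if not parser.has_section(target):
                parser.add_section(target)
            for option, value in parser.items(combined):
                if parser.has_option(target, option):
                    raise ValueError(
                        f"Option {option!r} given in both [{combined}] and [{target}]"
                    )
                parser.set(target, option, value)
        parser.remove_section(combined)


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(
    "[tmpltbank&inspiral]\nsample-rate = 4096\n"
    "[tmpltbank]\nminimal-match = 0.97\n"
    "[inspiral]\nsnr-threshold = 5.5\n"
)
split_combined_sections(parser)
print(dict(parser["tmpltbank"]), dict(parser["inspiral"]))
```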

If a particular code (let’s say inspiral) wants to use an option supplied in the [workflow] section (e.g. the channel names) it can do this by using:

[inspiral-h1]
channel-name = ${workflow|h1-channel}

[inspiral-l1]
channel-name = ${workflow|l1-channel}

[inspiral-v1]
channel-name = ${workflow|v1-channel}

Similar macros can be added as needed, but these should be limited to avoid namespace confusion.
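The ${section|option} substitution can be pictured as a regular-expression replacement over every value in the parser. The sketch below is an illustration using the standard-library configparser with its built-in interpolation switched off. The helper name interpolate and the pattern are assumptions, not the workflow module's code.

```python
import configparser
import re

# Matches ${section|option}; macros without a '|' are left alone.
INTERPOLATION = re.compile(r"\$\{([^|{}]+)\|([^{}]+)\}")


def interpolate(parser):
    """Replace every ${section|option} with the value it refers to."""
    for section in parser.sections():
        for option, value in parser.items(section):
            new_value = INTERPOLATION.sub(
                lambda match: parser.get(match.group(1), match.group(2)), value
            )
            if new_value != value:
                parser.set(section, option, new_value)


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(
    "[workflow]\n"
    "h1-channel = H1:LDAS-STRAIN\n"
    "l1-channel = L1:LDAS-STRAIN\n"
    "[inspiral-h1]\n"
    "channel-name = ${workflow|h1-channel}\n"
    "[inspiral-l1]\n"
    "channel-name = ${workflow|l1-channel}\n"
)
interpolate(parser)
print(parser.get("inspiral-h1", "channel-name"))
```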

Example complete workflow .ini file

Please see individual workflow documentation pages for some examples of complete .ini files and example workflows.

Other special sections

[environment] section

We have access to environment variables present when generating the workflow (with the exception of any variable containing a $ or a % as these are special characters). These are automatically accessed and stored in the [environment] section of the config file when creating a PyCBC ConfigParser object.

Values in this section can be accessed in the configuration file like this:

[inspiral-h1]
channel-name = ${environment|H1_CHANNEL_NAME}

which would take the value from ${H1_CHANNEL_NAME} in the environment.

These values will also be written out for later reference in the config file produced when generating a workflow.
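Filling an [environment] section can be sketched as follows. The helper add_environment_section is hypothetical. The sketch assumes that a variable is skipped if either its name or its value contains $ or %, and it keeps option names case-sensitive so that names such as H1_CHANNEL_NAME survive unchanged.

```python
import configparser
import os


def add_environment_section(parser, environ=None):
    """Copy environment variables into an [environment] section."""
    environ = os.environ if environ is None else environ
    if not parser.has_section("environment"):
        parser.add_section("environment")
    for name, value in environ.items():
        # Assumption: skip entries whose name or value holds '$' or '%'.
        if any(char in name or char in value for char in "$%"):
            continue
        parser.set("environment", name, value)


parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep option names case-sensitive
fake_environment = {
    "H1_CHANNEL_NAME": "H1:LDAS-STRAIN",
    "PROMPT": "100% $USER",
}
add_environment_section(parser, fake_environment)
print(dict(parser["environment"]))
```

Passing a dictionary instead of the real environment keeps the example reproducible; calling the helper without that argument reads os.environ.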

[sharedoptions] section

An alternative to the [exe1&exe2] section, especially when options divide cleanly into groups, is to use the [sharedoptions] section. An example of this follows:

[sharedoptions]
massranges = exe1,exe2,exe3-mass
metric = exe1,exe2-range,exe3-metric, exe5

[sharedoptions-massranges]
min-mass1 = 2.0
max-mass1 = 48.0
min-mass2 = 2.0
max-mass2 = 48.0
max-total-mass = 4.2
min-total-mass = 4.0
max-eta = 0.25
max-ns-spin-mag = 0.9899
max-bh-spin-mag = 0.9899

[sharedoptions-metric]
pn-order = threePointFivePN
f0 = 70.0
f-low = 30.0
f-upper = 1100.0
delta-f = 0.01

This will ensure that all options in [sharedoptions-massranges] are added to the [exe1], [exe2] and [exe3-mass] sections. All options in [sharedoptions-metric] are added to [exe1], [exe2-range], [exe3-metric] and [exe5].
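The expansion described above can be sketched with the standard-library configparser. The helper expand_shared_options is hypothetical and uses a shortened version of the example. Whether the [sharedoptions] sections are removed afterwards is not stated here, so the sketch leaves them in place.

```python
import configparser

INI_TEXT = """
[sharedoptions]
massranges = exe1,exe2,exe3-mass
metric = exe1,exe2-range,exe3-metric, exe5

[sharedoptions-massranges]
min-mass1 = 2.0
max-mass1 = 48.0

[sharedoptions-metric]
f-low = 30.0
"""


def expand_shared_options(parser):
    """Copy each [sharedoptions-GROUP] into the sections listed for GROUP."""
    if not parser.has_section("sharedoptions"):
        return
    for group, targets in parser.items("sharedoptions"):
        source = f"sharedoptions-{group}"
        # Targets are comma separated; surrounding spaces are ignored.
        for target in (name.strip() for name in targets.split(",")):
            if not parser.has_section(target):
                parser.add_section(target)
            for option, value in parser.items(source):
                parser.set(target, option, value)


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(INI_TEXT)
expand_shared_options(parser)
print(dict(parser["exe1"]))
```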

Code documentation

The parsing of .ini files and of the command line is done from within the pycbc.workflow.configuration module. The functions in this module are shown below.

`pycbc.workflow.configuration` Module

This module provides a wrapper to the ConfigParser utilities for pycbc workflow construction. This module is described in the page here: https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/ahope/initialization_inifile.html

class pycbc.workflow.configuration.WorkflowConfigParser(configFiles=None, overrideTuples=None, parsedFilePath=None, deleteTuples=None, copy_to_cwd=False)

Bases: InterpolatingConfigParser

This is a sub-class of InterpolatingConfigParser, which lets us add a few additional helper features that are useful in workflows.

get_cli_option(section, option_name, **kwds)

Return option using CLI action parsing

Parameters:

section (str) – Section to find option to parse
option_name (str) – Name of the option to parse from the config file
kwds (keywords) – Additional keywords are passed directly to the argument parser.

Returns:

The parsed value for this option

Return type:

value

interpolate_exe(testString)

Replace testString with a path to an executable based on the format.

If this looks like

${which:lalapps_tmpltbank}

it will return the equivalent of which(lalapps_tmpltbank)

Otherwise it will return an unchanged string.

Parameters: testString (string) – The input string
Returns: newString – The output string.
Return type: string

perform_exe_expansion()

This function will look through the executables section of the ConfigParser object and replace any values using macros with full paths.

Any values that look like

${which:lalapps_tmpltbank}

will be replaced with the equivalent of which(lalapps_tmpltbank)

Otherwise values will be unchanged.

resolve_file_url(test_string)

Replace test_string with a path to an executable based on the format.

If this looks like

${which:lalapps_tmpltbank}

it will return the equivalent of which(lalapps_tmpltbank)

Otherwise it will return an unchanged string.

Parameters: test_string (string) – The input string
Returns: new_string – The output string.
Return type: string

resolve_urls()

This function will look through all sections of the ConfigParser object and replace any URLs that are given the resolve magic flag with a path on the local drive.

Specifically for any values that look like

${resolve:https://git.ligo.org/detchar/SOME_GATING_FILE.txt}

the value will be replaced with the output of resolve_url(URL)

Otherwise values will be unchanged.

section_to_cli(section, skip_opts=None)

Converts a section into a command-line string.

For example:

[section_name]
foo =
bar = 10

yields: '--foo --bar 10'.

Parameters:

section (str) – The name of the section to convert.
skip_opts (list, optional) – List of options to skip. Default (None) results in all options in the section being converted.

Returns:

The options as a command-line string.

Return type:

str
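The conversion described for section_to_cli can be imitated with a small stand-alone function. The sketch below is not the method's source: section_to_cli_sketch is a hypothetical helper that takes a standard-library parser as its first argument and reproduces the documented example.

```python
import configparser


def section_to_cli_sketch(parser, section, skip_opts=None):
    """Turn the options of one section into a command-line string."""
    skip = set(skip_opts or [])
    words = []
    for option, value in parser.items(section):
        if option in skip:
            continue
        words.append(f"--{option}")
        # Options with an empty value become bare flags.
        if value:
            words.append(value)
    return " ".join(words)


parser = configparser.ConfigParser(interpolation=None)
parser.read_string("[section_name]\nfoo =\nbar = 10\n")
cli = section_to_cli_sketch(parser, "section_name")
print(cli)
```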

pycbc.workflow.configuration.add_workflow_command_line_group(parser)

The standard way of initializing a ConfigParser object in workflow will be to do it from the command line. This is done by giving a

--local-config-files filea.ini fileb.ini filec.ini

command. You can also set config file override commands on the command line. This will be most useful when setting (for example) start and end times, or active ifos. This is done by

--config-overrides section1:option1:value1 section2:option2:value2 ...

This can also be given as

--config-overrides section1:option1

where the value will be left as '' (an empty string).

To remove a configuration option, use the command line argument

--config-delete section1:option1

which will delete option1 from [section1] or

--config-delete section1

to delete all of the options in [section1]

Deletes are implemented before overrides.

This function returns an argparse OptionGroup to ensure these options are parsed correctly and can then be sent directly to initialize a WorkflowConfigParser.

Parameters: parser (argparse.ArgumentParser instance) – The initialized argparse instance to add the workflow option group to.

pycbc.workflow.configuration.hash_compare(filename_1, filename_2, chunk_size=None, max_chunks=None)

Calculate the sha1 hash of a file, or of part of a file

Parameters:

filename_1 (string or path) – the first file to be hashed / compared
filename_2 (string or path) – the second file to be hashed / compared
chunk_size (integer) – The size of the chunks to be read in and hashed. If not given, will read the whole file (may be slow for large files).
max_chunks (integer) – This many chunks to be compared. If all chunks so far have been the same, then just assume it’s the same file. Default 10

Returns:

hash – The hexdigest() after a sha1 hash of (part of) the file

Return type:

string
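The idea of hashing only the first few chunks of a file can be sketched with hashlib. The functions below are hypothetical stand-ins rather than the module's hash_compare. They show how two files that differ only near the end compare as equal once the number of chunks is limited.

```python
import hashlib
import os
import tempfile


def partial_sha1(path, chunk_size=None, max_chunks=None):
    """Return the sha1 hexdigest of a file, or of its first chunks only."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        if chunk_size is None:
            # No chunking requested: hash the whole file in one go.
            sha1.update(handle.read())
            return sha1.hexdigest()
        chunks_read = 0
        while max_chunks is None or chunks_read < max_chunks:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
            chunks_read += 1
    return sha1.hexdigest()


def files_probably_identical(path_1, path_2, chunk_size=None, max_chunks=None):
    """Compare two files by the hash of (part of) their contents."""
    return partial_sha1(path_1, chunk_size, max_chunks) == partial_sha1(
        path_2, chunk_size, max_chunks
    )


with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.bin")
    path_b = os.path.join(tmp, "b.bin")
    with open(path_a, "wb") as handle:
        handle.write(b"x" * 1000 + b"A")
    with open(path_b, "wb") as handle:
        handle.write(b"x" * 1000 + b"B")
    full_match = files_probably_identical(path_a, path_b)
    prefix_match = files_probably_identical(path_a, path_b, chunk_size=100, max_chunks=5)

print(full_match, prefix_match)
```

Limiting the comparison trades certainty for speed: only the first chunk_size * max_chunks bytes are examined.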

pycbc.workflow.configuration.resolve_url(url, directory=None, permissions=None, copy_to_cwd=True, hash_max_chunks=None, hash_chunk_size=None)

Resolves a URL to a local file, and returns the path to that file.

If a URL is given, the file will be copied to the current working directory. If a local file path is given, the file will only be copied to the current working directory if copy_to_cwd is True (the default).

pycbc.workflow.configuration.resolve_url_http(url, u, filename): Helper function used by resolve_url() to handle HTTP and HTTPS URLs.
