The workflow table splitting module

Introduction

This module is used when you want to split a file into multiple parts, normally so that analysis can proceed in parallel. The most common example is splitting the list of templates output by a template bank generation code so that a set of matched-filter jobs can analyse that bank in parallel. If you want to do something similar, this module is the place to do it.

The table splitting module returns a pycbc FileList of the split files that it generates.

Usage

Using this module requires the following:

  • A configuration file (or files) containing the information needed to tell this module how to split the input files (described below).

  • An initialized instance of the pycbc Workflow class, containing the ConfigParser.

  • A FileList of the files that are to be split.

This module is then called according to:

pycbc.workflow.setup_splittable_workflow(workflow, input_tables, out_dir=None, tags=None)

This function aims to be the gateway for code that is responsible for taking an input file containing a table and splitting it into multiple files, each containing a different part of that table. For now the only supported operation is using a splitbank executable (pycbc_splitbank or lalapps_splitbank) to split a template bank xml file into multiple template bank xml files.

Parameters:
Returns:

split_table_outs – The list of split up files as output from this job.

Return type:

pycbc.workflow.core.FileList

Configuration file setup

This section describes the configuration file options that this module needs.

[workflow-splittable] section

The configuration file must have a [workflow-splittable] section, which tells the workflow how to construct the split output files. The first option to choose and provide is:

  • splittable-method = VALUE

The available choices are described below:

  • IN_WORKFLOW - The file splitting jobs will be added as jobs in the workflow, and the split files will be generated after submission of the workflow.

  • NOOP - Do nothing and return the input file list. It is better not to call the module at all if you do not want to split files, but this can be useful if you want to use an existing script and do not need the splittable functionality.

When using IN_WORKFLOW the following additional option is needed:

  • splittable-num-banks = VALUE - Specifies how many parts to split each input file into.
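The options above can be read and validated with the standard library alone. The following is a minimal sketch, not the module's own code: the helper name read_splittable_options and the example values are hypothetical, and only the section and option names are taken from the description above.

```python
import configparser

# Hypothetical example configuration. Only the section name and the two
# option names come from the documentation; the values are illustrative.
EXAMPLE_CONFIG = """
[workflow-splittable]
splittable-method = IN_WORKFLOW
splittable-num-banks = 5
"""

VALID_METHODS = ("IN_WORKFLOW", "NOOP")


def read_splittable_options(config_text):
    """Return (method, num_banks); num_banks is None when method is NOOP."""
    cp = configparser.ConfigParser()
    cp.read_string(config_text)
    section = "workflow-splittable"
    if not cp.has_section(section):
        raise ValueError("missing [%s] section" % section)
    method = cp.get(section, "splittable-method")
    if method not in VALID_METHODS:
        raise ValueError("unknown splittable-method: %s" % method)
    if method == "NOOP":
        # Nothing is split, so the number of banks is not needed.
        return method, None
    num_banks = cp.getint(section, "splittable-num-banks")
    if num_banks < 1:
        raise ValueError("splittable-num-banks must be at least 1")
    return method, num_banks


method, num_banks = read_splittable_options(EXAMPLE_CONFIG)
print(method, num_banks)
```

Checking the method before reading splittable-num-banks mirrors the rule that the extra option is only needed for IN_WORKFLOW.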

[executables]

If you are not using NOOP, you need to supply the executable that will be used to split the files. This is done in the [executables] section by adding something like:

splittable = /path/to/pycbc_splitbank

The option name, in this case ‘splittable’, is also used to specify the constant command line options that are sent to all pycbc_splitbank jobs. These need to be put in a section called [splittable]; the options themselves are discussed below. The tag ‘splittable’ cannot currently be changed.

FIXME: Tag support is not yet present in splittable. The following is currently untrue, but should be fixed: as with other modules, tagged [splittable-TAG] and [workflow-splittable-TAG] sub-sections are supported if this module needs to be run in different configurations.

Supported splittable executables and instructions for using them

The following splittable executables are currently supported:

  • pycbc_splitbank

  • lalapps_splitbank - NOTE: The output of this code can be unpredictable or broken. We strongly recommend using pycbc_splitbank. For this reason we do not give any further details about running this code.

Adding a new executable is not too hard. If you want to add a new code, please ask a developer for some pointers on how to do this.

pycbc_splitbank

pycbc_splitbank is a pycbc python code that can be used for splitting any table in an input xml file. Normally it splits the sngl_inspiral table that holds the template bank. The help message for pycbc_splitbank is as follows:

$ pycbc_splitbank --help
usage: pycbc_splitbank [-h] [--version] [-v]
                       (--templates-per-bank SAMPLES | -n N | -O [OUTPUT_FILENAME [OUTPUT_FILENAME ...]])
                       [-o OUTPUT_PREFIX] -t INPUT_FILE
                       [--sort-frequency-cutoff SORT_FREQUENCY_CUTOFF]
                       [--sort-mchirp] [--random-sort]
                       [--random-seed RANDOM_SEED]

Splits a table in an xml file into multiple pieces.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --templates-per-bank SAMPLES
                        number of templates in the output banks
  -n N, --number-of-banks N
                        Split template bank into N files
  -O [OUTPUT_FILENAME [OUTPUT_FILENAME ...]], --output-filenames [OUTPUT_FILENAME [OUTPUT_FILENAME ...]]
                        Directly specify the names of the output files. The
                        number of files specified here will dictate how to
                        split the bank. It will be split equally between all
                        specified files.
  -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Prefix to add to the template bank name (name becomes
                        output#.xml[.gz])
  -t INPUT_FILE, --bank-file INPUT_FILE
                        Template bank to split
  --sort-frequency-cutoff SORT_FREQUENCY_CUTOFF
                        Frequency cutoff to use for sorting the sub banks
  --sort-mchirp         Sort templates by chirp mass before splitting
  --random-sort         Sort templates randomly before splitting
  --random-seed RANDOM_SEED
                        Random seed to use when sorting randomly

PyCBC common options:
  Common options for PyCBC executables.

  -v, --verbose         Add verbosity to logging. Adding the option multiple
                        times makes logging progressively more verbose, e.g.
                        --verbose or -v provides logging at the info level,
                        but -vv or --verbose --verbose provides debug logging.
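To make the phrase "split equally between all specified files" concrete, here is a minimal sketch of one way to divide a list into N nearly equal pieces, with an optional seeded random shuffle first. This is an illustration only and is not claimed to be the algorithm pycbc_splitbank uses; the function split_evenly and its parameters are hypothetical.

```python
import random


def split_evenly(templates, num_banks, random_sort=False, seed=None):
    """Split templates into num_banks contiguous pieces of near-equal size."""
    if num_banks < 1:
        raise ValueError("num_banks must be at least 1")
    items = list(templates)
    if random_sort:
        # A dedicated Random instance keeps the shuffle reproducible
        # for a given seed without touching the global random state.
        random.Random(seed).shuffle(items)
    base, extra = divmod(len(items), num_banks)
    pieces = []
    start = 0
    for i in range(num_banks):
        # The first `extra` pieces each take one additional item.
        size = base + (1 if i < extra else 0)
        pieces.append(items[start:start + size])
        start += size
    return pieces


pieces = split_evenly(range(11), 3)
print([len(p) for p in pieces])  # prints [4, 4, 3]
```

Shuffling before splitting spreads templates of different cost across the pieces, which is one reason a random sort can be useful when the pieces are analysed in parallel.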

An example of a pycbc_splitbank call is given below:

/home/spxiwh/lscsoft_git/executables_master/bin/pycbc_splitbank --random-sort \
    --bank-file /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK-961585551-2048.xml.gz \
    --output-filenames \
        /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK0-961585551-2048.xml.gz \
        /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK1-961585551-2048.xml.gz \
        /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK2-961585551-2048.xml.gz \
        /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK3-961585551-2048.xml.gz \
        /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK4-961585551-2048.xml.gz
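A call of this shape can be assembled programmatically. The sketch below is an assumption-laden illustration rather than the workflow module's own code: the helpers split_output_names and build_command are hypothetical, the short /path/to/... paths are placeholders, and the output naming rule (inserting _SPLITTABLE_BANK<i> after the second dash-separated field) is inferred from the example call only.

```python
import posixpath
import shlex


def split_output_names(bank_file, num_banks):
    """Derive one output name per piece from an IFO-DESC-START-DURATION name."""
    directory, name = posixpath.split(bank_file)
    # Only the file name is split on dashes; directories may contain dashes too.
    ifo, description, rest = name.split("-", 2)
    return [
        posixpath.join(
            directory,
            "%s-%s_SPLITTABLE_BANK%d-%s" % (ifo, description, i, rest),
        )
        for i in range(num_banks)
    ]


def build_command(executable, bank_file, num_banks,
                  constant_opts=("--random-sort",)):
    """Return the argument list for one pycbc_splitbank call."""
    cmd = [executable]
    cmd.extend(constant_opts)
    cmd.extend(["--bank-file", bank_file])
    cmd.append("--output-filenames")
    cmd.extend(split_output_names(bank_file, num_banks))
    return cmd


cmd = build_command(
    "/path/to/pycbc_splitbank",
    "/path/to/H1-TMPLTBANK-961585551-2048.xml.gz",
    5,
)
# Quote each argument so the printed command is safe to paste into a shell.
print(" \\\n    ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list and quoting only when printing avoids problems with spaces or special characters in paths.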

The following options are added by the workflow module and must not be provided in the configuration file:

  • --bank-file

  • --output-filenames
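The rule above can be checked before a workflow is built. The following is a minimal sketch under stated assumptions: the helper constant_options is hypothetical, the example values are illustrative, and writing a flag as an option with an empty value (random-sort =) is assumed here as the way to express an argument-less option in the configuration file.

```python
import configparser

# Hypothetical configuration. The section names and the executable path
# placeholder follow the text above; the option values are illustrative.
EXAMPLE_CONFIG = """
[executables]
splittable = /path/to/pycbc_splitbank

[splittable]
random-sort =
random-seed = 42
"""

# Options that the workflow module adds itself.
RESERVED = {"bank-file", "output-filenames"}


def constant_options(config_text, exe_tag="splittable"):
    """Return (executable path, constant command line arguments)."""
    cp = configparser.ConfigParser()
    cp.read_string(config_text)
    exe_path = cp.get("executables", exe_tag)
    clashes = sorted(RESERVED.intersection(cp.options(exe_tag)))
    if clashes:
        raise ValueError(
            "options set by the workflow must not be in [%s]: %s"
            % (exe_tag, ", ".join(clashes))
        )
    args = []
    for name, value in cp.items(exe_tag):
        args.append("--" + name)
        if value:
            # Options with an empty value are treated as bare flags.
            args.append(value)
    return exe_path, args


print(constant_options(EXAMPLE_CONFIG))
```

Failing early with a clear message is a design choice of this sketch; it surfaces a clashing option at configuration time rather than when the split job runs.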

pycbc.workflow.splittable Module

This is the complete documentation of this module’s code.

This module is responsible for setting up the splitting output files stage of workflows. For details about this module and its capabilities see here: https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/NOTYETCREATED.html

pycbc.workflow.splittable.select_splitfilejob_instance(curr_exe)

This function returns an instance of the class that is appropriate for splitting an output file up within workflow (e.g. splitbank).

Parameters:
  • curr_exe (string) – The name of the Executable that is being used.

  • curr_section (string) – The name of the section storing options for this executable.

Returns:

exe class – The class that holds the utility functions appropriate for the given Executable. This class must contain

  • exe_class.create_job()

and the job returned by this must contain

  • job.create_node()

Return type:

sub-class of pycbc.workflow.core.Executable
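The selection described here can be pictured as a lookup from executable name to class. The sketch below is purely illustrative and does not reproduce the pycbc implementation: the classes ExampleSplitExecutable and ExampleSplitJob, the function select_split_class, and the constructor arguments are all hypothetical stand-ins. Only the two executable names and the create_job() / create_node() requirement come from the text above.

```python
class ExampleSplitJob:
    """Hypothetical job; the text only requires that create_node() exists."""

    def __init__(self, exe_name):
        self.exe_name = exe_name

    def create_node(self, input_file):
        # A real node would describe a workflow job; a dict stands in here.
        return {"executable": self.exe_name, "input": input_file}


class ExampleSplitExecutable:
    """Hypothetical executable class; only create_job() is required."""

    def __init__(self, exe_name):
        self.exe_name = exe_name

    def create_job(self):
        return ExampleSplitJob(self.exe_name)


# Registry of supported executable names (names taken from the list of
# supported splittable executables; the class mapping is an assumption).
SUPPORTED = {
    "pycbc_splitbank": ExampleSplitExecutable,
    "lalapps_splitbank": ExampleSplitExecutable,
}


def select_split_class(curr_exe):
    """Return the class registered for curr_exe, or fail loudly."""
    try:
        return SUPPORTED[curr_exe]
    except KeyError:
        raise NotImplementedError(
            "no split-file class is registered for %r" % curr_exe
        )


exe_class = select_split_class("pycbc_splitbank")
job = exe_class("pycbc_splitbank").create_job()
print(job.create_node("bank.xml"))
```

With a registry like this, supporting a new executable means adding one entry and one class that provides the two required methods.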

pycbc.workflow.splittable.setup_splittable_dax_generated(workflow, input_tables, out_dir, tags)

Function for setting up the splitting jobs as part of the workflow.

Parameters:
Returns:

split_table_outs – The list of split up files as output from this job.

Return type:

pycbc.workflow.core.FileList

pycbc.workflow.splittable.setup_splittable_workflow(workflow, input_tables, out_dir=None, tags=None)

This function aims to be the gateway for code that is responsible for taking an input file containing a table and splitting it into multiple files, each containing a different part of that table. For now the only supported operation is using a splitbank executable (pycbc_splitbank or lalapps_splitbank) to split a template bank xml file into multiple template bank xml files.

Parameters:
Returns:

split_table_outs – The list of split up files as output from this job.

Return type:

pycbc.workflow.core.FileList