The workflow table splitting module
Introduction
This module is used when you want to split a file into multiple parts, normally to enable analysis to proceed in parallel. The most common example of this is to split the list of templates output by a template bank generation code to enable a set of matched-filter jobs to analyse that bank in parallel. If you want to do something similar this module is the place to do it.
The table splitting module returns a pycbc FileList of the split files that it generates.
Usage
Using this module requires the following:
A configuration file (or files) containing the information needed to tell this module how to split the input files (described below).
An initialized instance of the pycbc Workflow class, containing the ConfigParser.
A FileList of the files that are to be split.
This module is then called according to:
- pycbc.workflow.setup_splittable_workflow(workflow, input_tables, out_dir=None, tags=None)
This function is the gateway for code that takes an input file containing a table and splits it into multiple files, each containing a different part of that table. For now the only supported operation is splitting a template bank xml file into multiple template bank xml files using a splitbank executable such as pycbc_splitbank.
- Parameters:
workflow (pycbc.workflow.core.Workflow) – The Workflow instance that the jobs will be added to.
input_tables (pycbc.workflow.core.FileList) – The input files to be split up.
out_dir (path) – The directory in which output will be written.
- Returns:
split_table_outs – The list of split up files as output from this job.
- Return type:
pycbc.workflow.core.FileList
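As a minimal sketch of the call described above, assuming that a Workflow instance and a FileList of template banks have already been created elsewhere in the workflow script, the module could be invoked through a small wrapper. The wrapper name split_template_banks and the default directory name "splitbank" are illustrative and are not part of PyCBC; only setup_splittable_workflow and its arguments are taken from the signature above.

```python
def split_template_banks(workflow, bank_files, out_dir="splitbank"):
    """Hypothetical helper: run the table splitting module on a FileList."""
    # Imported lazily so that this sketch can be loaded without PyCBC installed.
    from pycbc.workflow import setup_splittable_workflow

    # Returns a FileList of the split files. With splittable-method = NOOP
    # the input file list is returned unchanged.
    return setup_splittable_workflow(workflow, bank_files,
                                     out_dir=out_dir, tags=None)
```

The returned FileList can then be passed to whichever stage consumes the split banks, for example the matched-filter setup.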
Configuration file setup
Here we describe the configuration file options that this module needs.
[workflow-splittable] section
The configuration file must have a [workflow-splittable] section, which is used to tell the workflow how to construct the split output files. The first option to choose and provide is
splittable-method = VALUE
The available choices are described below:
IN_WORKFLOW - The file splitting jobs are added as jobs in the workflow, so the split files are generated after the workflow has been submitted.
NOOP - Do nothing and return the input file list. It is better not to call the module at all if you do not want to split files, but this can be useful if you want to use an existing script and do not need the splittable functionality.
When using IN_WORKFLOW the following additional option is needed:
splittable-num-banks = VALUE - Specifies how many parts to split each input file into.
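The options above can be illustrated with a short sketch that reads a [workflow-splittable] section using Python's standard configparser and checks it against the rules just described. The function name read_splittable_options and the value 5 are assumptions made for illustration; the workflow itself reads these options through its own ConfigParser.

```python
import configparser

# Example section; the number of banks is an arbitrary illustrative value.
EXAMPLE_CONFIG = """
[workflow-splittable]
splittable-method = IN_WORKFLOW
splittable-num-banks = 5
"""

VALID_METHODS = ("IN_WORKFLOW", "NOOP")


def read_splittable_options(config_text):
    """Return (method, num_banks); num_banks is None when method is NOOP."""
    cp = configparser.ConfigParser()
    cp.read_string(config_text)
    method = cp.get("workflow-splittable", "splittable-method")
    if method not in VALID_METHODS:
        raise ValueError("Unknown splittable-method: %s" % method)
    if method == "NOOP":
        # NOOP returns the input file list, so no bank count is needed.
        return method, None
    # IN_WORKFLOW additionally requires splittable-num-banks; configparser
    # raises NoOptionError if it is missing.
    return method, cp.getint("workflow-splittable", "splittable-num-banks")


print(read_splittable_options(EXAMPLE_CONFIG))
```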
[executables]
If you are not using NOOP, you need to supply the executable that will be used to split the input files. This is done in the [executables] section by adding something like:
splittable = /path/to/pycbc_splitbank
The option name, in this case ‘splittable’, also names the section that holds the constant command line options sent to all pycbc_splitbank jobs. These options must be placed in a section called [splittable]; the options themselves are discussed below. The tag ‘splittable’ cannot currently be changed.
FIXME: Tag support is not yet present in splittable, so the following is currently untrue but should be fixed. As with other modules, tagged sub-sections [splittable-TAG] and [workflow-splittable-TAG] are supported if this module needs to be run in different configurations.
Supported splittable executables and instructions for using them
The following splittable executables are currently supported:
pycbc_splitbank
lalapps_splitbank - NOTE: The output of this code can be unpredictable or broken. We strongly recommend using pycbc_splitbank. For this reason we do not give any further details about running this code.
Adding a new executable is not too hard. Please ask a developer for some pointers if you want to add a new code.
pycbc_splitbank
pycbc_splitbank is a PyCBC Python code that can be used for splitting any table in an input xml file. Normally this splits the sngl_inspiral table that holds the template bank. The help message for pycbc_splitbank is as follows:
$ pycbc_splitbank --help
usage: pycbc_splitbank [-h] [-v] [--version [VERSION]]
(--templates-per-bank SAMPLES | -n N | -O [OUTPUT_FILENAME ...])
[-o OUTPUT_PREFIX] -t INPUT_FILE
[--sort-frequency-cutoff SORT_FREQUENCY_CUTOFF]
[--sort-mchirp] [--random-sort]
[--random-seed RANDOM_SEED]
Splits a table in an xml file into multiple pieces.
options:
-h, --help show this help message and exit
--templates-per-bank SAMPLES
number of templates in the output banks
-n N, --number-of-banks N
Split template bank into N files
-O [OUTPUT_FILENAME ...], --output-filenames [OUTPUT_FILENAME ...]
Directly specify the names of the output files. The
number of files specified here will dictate how to
split the bank. It will be split equally between all
specified files.
-o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
Prefix to add to the template bank name (name becomes
output#.xml[.gz])
-t INPUT_FILE, --bank-file INPUT_FILE
Template bank to split
--sort-frequency-cutoff SORT_FREQUENCY_CUTOFF
Frequency cutoff to use for sorting the sub banks
--sort-mchirp Sort templates by chirp mass before splitting
--random-sort Sort templates randomly before splitting
--random-seed RANDOM_SEED
Random seed to use when sorting randomly
PyCBC common options:
Common options for PyCBC executables.
-v, --verbose Add verbosity to logging. Adding the option multiple
times makes logging progressively more verbose, e.g.
--verbose or -v provides logging at the info level,
but -vv or --verbose --verbose provides debug logging.
--version [VERSION] Display PyCBC version information and exit. Can
optionally supply a modifier integer to control the
verbosity of the version information. 0 and 1 are the
same as --version; 2 provides more detailed PyCBC
library information; 3 provides information about
PyCBC, LAL and LALSimulation packages (if installed)
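The help text says that the bank is "split equally" between the output files and that templates can be sorted randomly first. The sketch below shows one way such a split can work on a plain Python list. It is not the pycbc_splitbank implementation: the function name split_equally, the use of contiguous chunks and the use of Python's random module are all assumptions made for illustration.

```python
import random


def split_equally(templates, num_banks, random_sort=False, seed=None):
    """Split a sequence into num_banks chunks whose sizes differ by at most one."""
    if num_banks < 1:
        raise ValueError("num_banks must be at least 1")
    items = list(templates)
    if random_sort:
        # A fixed seed makes the shuffled order reproducible.
        random.Random(seed).shuffle(items)
    base, extra = divmod(len(items), num_banks)
    banks, start = [], 0
    for i in range(num_banks):
        # The first `extra` banks each take one additional template.
        size = base + (1 if i < extra else 0)
        banks.append(items[start:start + size])
        start += size
    return banks


print([len(bank) for bank in split_equally(range(10), 3)])  # prints [4, 3, 3]
```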
An example of a pycbc_splitbank call is given below
/home/spxiwh/lscsoft_git/executables_master/bin/pycbc_splitbank --random-sort --bank-file /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK-961585551-2048.xml.gz --output-filenames /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK0-961585551-2048.xml.gz /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK1-961585551-2048.xml.gz /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK2-961585551-2048.xml.gz /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK3-961585551-2048.xml.gz /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK4-961585551-2048.xml.gz
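To make the structure of the call above easier to see, the following sketch rebuilds an equivalent argument list in Python. In a real workflow the module constructs this command itself; the helper name build_splitbank_command is hypothetical, the shortened paths are placeholders, and the output naming pattern (inserting _SPLITTABLE_BANK followed by an index after the description field) is inferred from the example above rather than taken from the code.

```python
import os
import shlex


def build_splitbank_command(executable, bank_file, num_banks,
                            extra_args=("--random-sort",)):
    """Return the argument list for a pycbc_splitbank call like the one above."""
    directory, name = os.path.split(bank_file)
    # File names in the example follow IFO-DESCRIPTION-START-DURATION.ext
    ifo, description, start, rest = name.split("-", 3)
    outputs = [
        os.path.join(directory, "%s-%s_SPLITTABLE_BANK%d-%s-%s"
                     % (ifo, description, i, start, rest))
        for i in range(num_banks)
    ]
    return [executable, *extra_args,
            "--bank-file", bank_file,
            "--output-filenames", *outputs]


cmd = build_splitbank_command(
    "pycbc_splitbank", "datafind/H1-TMPLTBANK-961585551-2048.xml.gz", 5)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces.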
The following options are added by the workflow module and must not be provided in the configuration file:
--bank-file
--output-filenames
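The restriction above can be checked before a workflow is generated. The sketch below, which uses only the standard configparser, reads a [splittable] section and rejects it if it contains either of the options that the workflow module supplies itself. The function name check_splittable_section and the example section contents are assumptions for illustration; writing a flag with an empty value, as in "random-sort =", is likewise assumed here.

```python
import configparser

# Options that the workflow module adds itself (from the list above).
WORKFLOW_SUPPLIED = {"bank-file", "output-filenames"}

GOOD_SECTION = """
[splittable]
random-sort =
"""


def check_splittable_section(config_text):
    """Return the [splittable] options, or raise if a reserved option is set."""
    cp = configparser.ConfigParser()
    cp.read_string(config_text)
    if not cp.has_section("splittable"):
        return {}
    options = dict(cp.items("splittable"))
    clashes = sorted(WORKFLOW_SUPPLIED.intersection(options))
    if clashes:
        raise ValueError("Options set by the workflow must not be given: "
                         + ", ".join(clashes))
    return options


print(check_splittable_section(GOOD_SECTION))
```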
pycbc.workflow.splittable
Module
This is the complete documentation of this module’s code.
This module is responsible for setting up the splitting output files stage of workflows. For details about this module and its capabilities see here: https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/NOTYETCREATED.html
- pycbc.workflow.splittable.select_splitfilejob_instance(curr_exe)
This function returns an instance of the class that is appropriate for splitting an output file up within the workflow (e.g. splitbank).
- Parameters:
curr_exe (string) – The name of the Executable that is being used.
curr_section (string) – The name of the section storing options for this executable.
- Returns:
exe class – The class that holds the utility functions appropriate for the given Executable. This class must provide exe_class.create_job(), and the job returned by that call must provide job.create_node().
- Return type:
sub-class of pycbc.workflow.core.Executable
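The contract described above (a class with create_job(), whose job has create_node()) can be expressed as a simple duck-typing check. Everything in this sketch is hypothetical: the two example classes are stand-ins and not PyCBC classes, and the argument lists of create_job and create_node are assumptions, since they are not given here.

```python
def satisfies_split_contract(exe_class, job):
    """Return True if both objects expose the methods named above."""
    has_create_job = callable(getattr(exe_class, "create_job", None))
    has_create_node = callable(getattr(job, "create_node", None))
    return has_create_job and has_create_node


class ExampleSplitJob:
    """Hypothetical job object; the argument list is an assumption."""

    def create_node(self, bank):
        return ("split-node", bank)


class ExampleSplitExecutable:
    """Hypothetical executable class, not a PyCBC class."""

    def create_job(self):
        return ExampleSplitJob()


job = ExampleSplitExecutable().create_job()
print(satisfies_split_contract(ExampleSplitExecutable, job))  # prints True
```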
- pycbc.workflow.splittable.setup_splittable_dax_generated(workflow, input_tables, out_dir, tags)
Function for setting up the splitting jobs as part of the workflow.
- Parameters:
workflow (pycbc.workflow.core.Workflow) – The Workflow instance that the jobs will be added to.
input_tables (pycbc.workflow.core.FileList) – The input files to be split up.
out_dir (path) – The directory in which output will be written.
- Returns:
split_table_outs – The list of split up files as output from this job.
- Return type:
- pycbc.workflow.splittable.setup_splittable_workflow(workflow, input_tables, out_dir=None, tags=None)
This function is the gateway for code that takes an input file containing a table and splits it into multiple files, each containing a different part of that table. For now the only supported operation is splitting a template bank xml file into multiple template bank xml files using a splitbank executable such as pycbc_splitbank.
- Parameters:
workflow (pycbc.workflow.core.Workflow) – The Workflow instance that the jobs will be added to.
input_tables (pycbc.workflow.core.FileList) – The input files to be split up.
out_dir (path) – The directory in which output will be written.
- Returns:
split_table_outs – The list of split up files as output from this job.
- Return type:
pycbc.workflow.core.FileList