# The workflow datafind and validation module¶

## Introduction¶

This page introduces the capabilities of the workflow datafind and validation module and explains how to use it as part of a pycbc workflow.

This module is designed to support multiple ways of obtaining data (different codes or interfaces). Currently we only support datafind through the gwdatafind module (which is equivalent to using gw_data_find).

This module will run the necessary queries to the datafind server to obtain locations for frame files at the specified times for each interferometer.

Optionally, it can also run a set of tests to verify this output and act accordingly. These include:

• A check that all times in the input segment lists are covered by frames, and methods for dealing with cases where this is not true.
• A check that all returned frame files actually exist and are accessible on the cluster.
• A check that segment_summary flags are defined for all frames that have been returned.
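The first of these checks amounts to simple interval arithmetic: subtract the union of the frame intervals from the science segments and see what is left. A minimal, self-contained sketch of the idea, with plain (start, end) tuples standing in for the ligo.segments objects the real module uses:

```python
def subtract_coverage(science, frames):
    """Return the parts of `science` not covered by any frame interval.

    `science` and `frames` are lists of (start, end) GPS tuples.
    This is a simplified stand-in for ligo.segments subtraction.
    """
    # Merge the frame intervals into a sorted, non-overlapping union.
    merged = []
    for start, end in sorted(frames):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # Walk each science segment and record any uncovered stretches.
    gaps = []
    for s_start, s_end in science:
        cursor = s_start
        for f_start, f_end in merged:
            if f_end <= cursor or f_start >= s_end:
                continue
            if f_start > cursor:
                gaps.append((cursor, f_start))
            cursor = max(cursor, f_end)
        if cursor < s_end:
            gaps.append((cursor, s_end))
    return gaps
```

A non-empty result is what the datafind-check-segment-gaps options described below act upon (warn, trim the science times, or raise an error).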

## Usage¶

Using this module requires a number of things:

• A configuration file (or files) containing the information needed to tell this module how to generate the segments (described below).
• An initialized instance of the pycbc workflow class, containing the ConfigParser.
• An ifo-keyed dictionary of ligo.segments.segmentlist instances containing the times that should be analysed for each ifo. See The workflow segment generation module for documentation of the segments module, which in most cases should be used to obtain this input.

The module is then called according to

pycbc.workflow.setup_datafind_workflow(workflow, scienceSegs, outputDir, seg_file=None, tags=None)[source]

This function is the gateway for setting up the datafind steps in a workflow. It is designed to be able to support multiple ways of obtaining the frame-file locations and to run the sanity checks described above as necessary. Currently the datafind queries can only be run at workflow generation time.

Parameters: workflow (pycbc.workflow.core.Workflow) – The workflow instance that the datafind jobs will be added to. This instance also contains the ifos for which to attempt to obtain data and the start and end times to search over. scienceSegs (dictionary of ifo-keyed ligo.segments.segmentlist instances) – This contains the times that the workflow is expected to analyse. outputDir (path) – The directory in which output will be stored. seg_file (SegFile, optional (default=None)) – The segment file describing the science segments, used if the sanity-check options need to update the segment information. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

### Configuration file setup¶

Here we describe the options given in the configuration file used in the workflow that will be needed in this section.

#### [workflow-datafind] section¶

The configuration file must have a [workflow-datafind] section, which is used to tell the workflow how to generate the datafind calls.
The first option to choose and provide is datafind-method = VALUE. The choices here are described below:

• AT_RUNTIME_SINGLE_FRAMES - Find frame files at runtime using setup_datafind_runtime_frames_single_call_perifo, using one datafind query for every interferometer. This may be the quickest approach, but some of the frames returned will not be suitable for analysis. The output list will contain a single entry for every frame file returned by the datafind queries, which will enable pegasus to more easily track input files for jobs that need to read from frame files.
• AT_RUNTIME_MULTIPLE_FRAMES - Find frame files at runtime using setup_datafind_runtime_frames_multi_calls_perifo, using one datafind query for every science segment. This may be a little slower than the above, but only frames that overlap analysable data stretches will be returned. The output list will contain a single entry for every frame file returned by the datafind queries, which will enable pegasus to more easily track input files for jobs that need to read from frame files.
• AT_RUNTIME_SINGLE_CACHES - Find frame files at runtime using setup_datafind_runtime_cache_single_call_perifo, using one datafind query for every interferometer. This may be the quickest approach, but some of the frames returned will not be suitable for analysis. The output list will contain a single entry for every call made to the datafind server, which will correspond to a .lcf “frame cache” file in the output directory.
• AT_RUNTIME_MULTIPLE_CACHES - Find frame files at runtime using setup_datafind_runtime_cache_multi_calls_perifo, using one datafind query for every science segment. This may be a little slower than the above, but only frames that overlap analysable data stretches will be returned. The output list will contain a single entry for every call made to the datafind server, which will correspond to a .lcf “frame cache” file in the output directory.
• FROM_PREGENERATED_LCF_FILES - Supply a set of pregenerated .lcf files containing a list of frame files to use for analysis. This option is intended for cases where a datafind server is not available. Be warned that data does move around on clusters, so on standard LDG clusters the AT_RUNTIME options are recommended.

Each of these options determines which subfunction is used. These are described here:

pycbc.workflow.setup_datafind_runtime_cache_multi_calls_perifo(cp, scienceSegs, outputDir, tags=None)[source]

This function uses the gwdatafind library to obtain the location of all the frame files that will be needed to cover the analysis of the data given in scienceSegs. This function will not check if the returned frames cover the whole time requested; such sanity checks are done in the pycbc.workflow.setup_datafind_workflow entry function. As opposed to setup_datafind_runtime_single_call_perifo, this function will make one call to the datafind server for every science segment. It will return a list of output files that correspond to the .lcf cache files that are produced, which list the locations of all frame files. This will cause problems with pegasus, which expects to know about all input files (i.e. the frame files themselves).

Parameters: cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files. scienceSegs (Dictionary of ifo-keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse. outputDir (path) – All output files written by datafind processes will be written to this directory. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]).
This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

Returns: datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files. datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline.
pycbc.workflow.setup_datafind_runtime_cache_single_call_perifo(cp, scienceSegs, outputDir, tags=None)[source]

This function uses the gwdatafind library to obtain the location of all the frame files that will be needed to cover the analysis of the data given in scienceSegs. This function will not check if the returned frames cover the whole time requested; such sanity checks are done in the pycbc.workflow.setup_datafind_workflow entry function. As opposed to setup_datafind_runtime_generated, this function will only make one call to datafind per ifo, spanning the whole time. It will return a list of output files that correspond to the .lcf cache files that are produced, which list the locations of all frame files. This will cause problems with pegasus, which expects to know about all input files (i.e. the frame files themselves).

Parameters: cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files. scienceSegs (Dictionary of ifo-keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse. outputDir (path) – All output files written by datafind processes will be written to this directory. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

Returns: datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files. datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline.

pycbc.workflow.setup_datafind_runtime_frames_multi_calls_perifo(cp, scienceSegs, outputDir, tags=None)[source]

This function uses the gwdatafind library to obtain the location of all the frame files that will be needed to cover the analysis of the data given in scienceSegs. This function will not check if the returned frames cover the whole time requested; such sanity checks are done in the pycbc.workflow.setup_datafind_workflow entry function. As opposed to setup_datafind_runtime_single_call_perifo, this function will make one call to the datafind server for every science segment. It will return a list of files corresponding to the individual frames returned by the datafind query. This will allow pegasus to more easily identify all the files used as input, but may cause problems for codes that need to take frame cache files as input.
Parameters: cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files. scienceSegs (Dictionary of ifo-keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse. outputDir (path) – All output files written by datafind processes will be written to this directory. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

Returns: datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files. datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline.
pycbc.workflow.setup_datafind_runtime_frames_single_call_perifo(cp, scienceSegs, outputDir, tags=None)[source]

This function uses the gwdatafind library to obtain the location of all the frame files that will be needed to cover the analysis of the data given in scienceSegs. This function will not check if the returned frames cover the whole time requested; such sanity checks are done in the pycbc.workflow.setup_datafind_workflow entry function. As opposed to setup_datafind_runtime_generated, this function will only make one call to datafind per ifo, spanning the whole time. It will return a list of files corresponding to the individual frames returned by the datafind query. This will allow pegasus to more easily identify all the files used as input, but may cause problems for codes that need to take frame cache files as input.

Parameters: cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files. scienceSegs (Dictionary of ifo-keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse. outputDir (path) – All output files written by datafind processes will be written to this directory. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

Returns: datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files. datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline.

pycbc.workflow.setup_datafind_from_pregenerated_lcf_files(cp, ifos, outputDir, tags=None)[source]

This function is used if you want to run with pregenerated .lcf frame cache files.

Parameters: cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files. ifos (list of ifo strings) – List of ifos to get pregenerated files for. outputDir (path) – All output files written by datafind processes will be written to this directory. Currently this sub-module writes no output. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]).
This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames.

Returns: datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files. datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline.

When using any of the AT_RUNTIME sub-modules the following additional configuration options apply in the [workflow-datafind] section:

• datafind-X1-frame-type = NAME - REQUIRED. Where X1 is replaced by the ifo name for each ifo. The NAME should be the full frame type, which is used when querying the database.
• datafind-ligo-datafind-server = URL - OPTIONAL. If provided use this server when querying for frames. If not provided, which is recommended for most applications, then the LIGO_DATAFIND_SERVER environment variable will be used to determine this.
• datafind-backup-datafind-server = URL - OPTIONAL. This option is only available when using AT_RUNTIME_SINGLE_FRAMES or AT_RUNTIME_MULTIPLE_FRAMES. If given, the module will also query a second datafind server (i.e. a remote server) using gsiftp urltypes. This allows frames to be associated with both a file:// and a gsiftp:// url; if your local site is missing a frame file, or the file is not accessible, pegasus will copy the file from the gsiftp:// url instead. NOTE: this will not catch the case where a frame file is available at the start of a workflow but goes missing later. Pegasus can copy all frame files around at the start of the workflow, but you may not want this (remove the symlink option from basic_pegasus.conf if you do).
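Putting these options together, a [workflow-datafind] section for a runtime query might look like the following. The frame types shown are placeholders; substitute the types appropriate to your analysis:

```ini
[workflow-datafind]
datafind-method = AT_RUNTIME_SINGLE_FRAMES
datafind-h1-frame-type = H1_HOFT_C00
datafind-l1-frame-type = L1_HOFT_C00
; optional: override the LIGO_DATAFIND_SERVER environment variable
; datafind-ligo-datafind-server = datafind.example.org:443
```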

When using the PREGENERATED sub-module the following configuration options apply in the [workflow-datafind] section:

• datafind-pregenerated-cache-file-x1 = Path/to/file.lcf. This should be specified separately for each ifo and point to that ifo’s pregenerated cache file.

The following configuration options apply in the [workflow-datafind] section for all sub-modules and can be used as sanity checks:

• datafind-check-segment-gaps = STRING - OPTIONAL (default = “no_test”). If this option takes any value other than ‘no_test’ the workflow module will check that the local datafind server has returned frames covering all of the listed science times. Its behaviour is then as follows:
• ‘no_test’: Do not perform this test. Any discrepancies will cause later failures.
• ‘warn’: Perform the test, print warnings covering any discrepancies but do nothing about them. Discrepancies will cause failures later in the workflow.
• ‘update_times’: Perform the test, print warnings covering any discrepancies and update the input science times to remove times that are not present on the host cluster.
• ‘raise_error’: Perform the test. If any discrepancies occur, raise a ValueError.
• datafind-check-frames-exist = STRING - OPTIONAL (default = “no_test”). If this option takes any value other than ‘no_test’ the workflow module will check that the frames returned by the local datafind server are accessible from the machine that is running the workflow generation. Its behaviour is then as follows:
• ‘no_test’: Do not perform this test. Any discrepancies will cause later failures.
• ‘warn’: Perform the test, print warnings covering any discrepancies but do nothing about them. Discrepancies will cause failures later in the workflow.
• ‘update_times’: Perform the test, print warnings covering any discrepancies and update the input science times to remove times that are not present on the host cluster.
• ‘raise_error’: Perform the test. If any discrepancies occur, raise a ValueError.
• datafind-check-segment-summary = STRING - OPTIONAL (default = “no_test”). If this option takes any value other than ‘no_test’ the workflow module will check that all frames returned by datafind are covered by the segment_summary table (for the science flag). Its behaviour is then as follows:
• ‘no_test’: Do not perform this test.
• ‘warn’: Perform the test, print warnings covering any discrepancies but do nothing about them.
• ‘raise_error’: Perform the test. If any discrepancies occur, raise a ValueError.
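As an example, a cautious configuration that trims the science times to match what is actually available, but fails hard if frames are missing from disk, might read (illustrative values; the frame type is a placeholder):

```ini
[workflow-datafind]
datafind-method = AT_RUNTIME_SINGLE_FRAMES
datafind-h1-frame-type = H1_HOFT_C00
datafind-check-segment-gaps = update_times
datafind-check-frames-exist = raise_error
datafind-check-segment-summary = warn
```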

#### [executables]¶

Currently no executables are needed in the datafind section; the workflow uses the gwdatafind module to run the calls to the datafind server.

#### Other sections¶

##### [datafind]¶

The other section that can be used by the datafind module is the [datafind] section. This section contains option,value pairs that will be sent as keyword arguments when calling the datafind server. Valid options here are:

• match=STRING - If given return only those frames matching the given regular expression.
• urltype=TYPE - If given restrict the returned frames to the given scheme (e.g. “file”).

The on_gaps keyword argument is not supported, as sanity checking is handled by the workflow module; it is always set to ‘ignore’ (this can be overridden, but we do not recommend it).
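For example, to restrict returned URLs to local files, a common choice on LDG clusters:

```ini
[datafind]
urltype = file
; match = H1_HOFT   ; optionally filter frames by regular expression
```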

## pycbc.workflow.datafind Module¶

This is the complete documentation of this module’s code.

This module is responsible for querying a datafind server to determine the availability of the data that the code is attempting to run on. It also performs a number of tests and can act on these as described below. Full documentation for this function can be found here: https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/ahope/datafind.html

class pycbc.workflow.datafind.ContentHandler(document, start_handlers={})[source]

Bases: glue.ligolw.ligolw.LIGOLWContentHandler

startColumn(parent, attrs)
startStream(parent, attrs, __orig_startStream=<function LIGOLWContentHandler.startStream>)
startTable(parent, attrs, __orig_startTable=<function use_in.<locals>.startTable>)
pycbc.workflow.datafind.convert_cachelist_to_filelist(datafindcache_list)[source]

Take as input a list of glue.lal.Cache objects and return a pycbc FileList containing all frames within those caches.

Parameters: datafindcache_list (list of glue.lal.Cache objects) – The list of cache files to convert.

Returns: datafind_filelist (FileList of frame File objects) – The list of frame files.
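The LAL cache (.lcf) format that these objects represent is a simple five-column text format. A hedged sketch of turning such lines into plain records, independent of the glue library; the column layout is assumed to be observatory, frame type, GPS start, duration, URL:

```python
def parse_lcf_lines(lines):
    """Parse LAL-cache-style lines into simple records.

    Assumes the conventional five columns:
    observatory  frame-type  gps-start  duration  url
    Returns a list of (observatory, frametype, start, end, url) tuples.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        obs, ftype, start, dur, url = line.split()
        start, dur = int(start), int(dur)
        entries.append((obs, ftype, start, start + dur, url))
    return entries
```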
pycbc.workflow.datafind.datafind_keep_unique_backups(backup_outs, orig_outs)[source]

This function will take a list of backup datafind files, presumably obtained by querying a remote datafind server, e.g. CIT, and compares these against a list of original datafind files, presumably obtained by querying the local datafind server. Only the datafind files in the backup list that do not appear in the original list are returned. This allows us to use only files that are missing from the local cluster.

Parameters: backup_outs (FileList) – List of datafind files from the remote datafind server. orig_outs (FileList) – List of datafind files from the local datafind server.

Returns: FileList – List of datafind files in backup_outs and not in orig_outs.
pycbc.workflow.datafind.get_missing_segs_from_frame_file_cache(datafindcaches)[source]

This function will use os.path.isfile to determine if all the frame files returned by the local datafind server actually exist on the disk. This can then be used to update the science times if needed.

Parameters: datafindcaches (OutGroupList) – List of all the datafind output files.

Returns: missingFrameSegs (Dict. of ifo-keyed glue.segment.segmentlist instances) – The times corresponding to missing frames found in datafindOuts. missingFrames (Dict. of ifo-keyed lal.Cache instances) – The list of missing frames.
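The frames-exist check reduces to an os.path.isfile call per frame. A self-contained sketch of the idea, with plain paths and GPS tuples standing in for the lal.Cache machinery:

```python
import os

def find_missing_frames(frames):
    """Return the (start, end) intervals of frames absent from disk.

    `frames` is a list of (path, gps_start, gps_end) tuples; the real
    module works on glue.lal.Cache entries instead.
    """
    missing = []
    for path, start, end in frames:
        if not os.path.isfile(path):
            missing.append((start, end))
    return missing
```

The resulting intervals are what the datafind-check-frames-exist option then warns about, removes from the science times, or raises an error over.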
pycbc.workflow.datafind.get_science_segs_from_datafind_outs(datafindcaches)[source]

This function will calculate the science segments that are covered in the OutGroupList containing the frame files returned by various calls to the datafind server. This can then be used to check whether this list covers what it is expected to cover.

Parameters: datafindcaches (OutGroupList) – List of all the datafind output files.

Returns: newScienceSegs (Dictionary of ifo-keyed glue.segment.segmentlist instances) – The times covered by the frames found in datafindOuts.
pycbc.workflow.datafind.get_segment_summary_times(scienceFile, segmentName)[source]

This function will find the times for which the segment_summary is set for the flag given by segmentName.

Parameters: scienceFile (SegFile) – The segment file that we want to use to determine this. segmentName (string) – The DQ flag to search for times in the segment_summary table.

Returns: summSegList (ligo.segments.segmentlist) – The times that are covered in the segment summary table.
pycbc.workflow.datafind.log_datafind_command(observatory, frameType, startTime, endTime, outputDir, **dfKwargs)[source]

This function will write an equivalent gw_data_find command to disk, which can be used to debug why the internal datafind module is not working.
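A hedged sketch of what such logging can look like: writing an equivalent gw_data_find command line to a file in the output directory. The filename pattern and the mapping of keyword arguments to command-line flags below are illustrative choices, not necessarily the module's actual ones:

```python
import os

def log_datafind_command(observatory, frame_type, start, end, output_dir,
                         **kwargs):
    """Write an equivalent gw_data_find invocation to disk for debugging."""
    parts = ['gw_data_find',
             '--observatory', observatory,
             '--type', frame_type,
             '--gps-start-time', str(start),
             '--gps-end-time', str(end)]
    # Pass any extra datafind kwargs through as long options (illustrative).
    for key, value in sorted(kwargs.items()):
        parts.append('--%s' % key)
        parts.append(str(value))
    fname = os.path.join(output_dir,
                         '%s-%s-%d-%d.sh' % (observatory, frame_type,
                                             start, end - start))
    with open(fname, 'w') as out:
        out.write(' '.join(parts) + '\n')
    return fname
```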

pycbc.workflow.datafind.run_datafind_instance(cp, outputDir, connection, observatory, frameType, startTime, endTime, ifo, tags=None)[source]

This function will query the datafind server once to find frames between the specified times for the specified frame type and observatory.

Parameters: cp (ConfigParser instance) – Source for any kwargs that should be sent to the datafind module. outputDir (path) – Output cache files will be written here. We also write the commands for reproducing what is done in this function to this directory. connection (datafind connection object) – Initialized through the gwdatafind module, this is the open connection to the datafind server. observatory (string) – The observatory to query frames for. Ex. ‘H’, ‘L’ or ‘V’. NB: not ‘H1’, ‘L1’, ‘V1’, which denote interferometers. frameType (string) – The frame type to query for. startTime (int) – Integer start time to query the datafind server for frames. endTime (int) – Integer end time to query the datafind server for frames. ifo (string) – The interferometer to use for naming output. Ex. ‘H1’, ‘L1’, ‘V1’. Maybe this could be merged with the observatory string, but this could cause issues if running on old ‘H2’ and ‘H1’ data. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

Returns: dfCache (glue.lal.Cache instance) – The glue.lal.Cache representation of the call to the datafind server and the returned frame files. cacheFile (pycbc.workflow.core.File) – Cache file listing all of the datafind output files for use later in the pipeline.

pycbc.workflow.datafind.setup_datafind_from_pregenerated_lcf_files(cp, ifos, outputDir, tags=None)[source]

This function is used if you want to run with pregenerated .lcf frame cache files.
Parameters: cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files. ifos (list of ifo strings) – List of ifos to get pregenerated files for. outputDir (path) – All output files written by datafind processes will be written to this directory. Currently this sub-module writes no output. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames.

Returns: datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files. datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline.
pycbc.workflow.datafind.setup_datafind_runtime_cache_multi_calls_perifo(cp, scienceSegs, outputDir, tags=None)[source]

This function uses the gwdatafind library to obtain the location of all the frame files that will be needed to cover the analysis of the data given in scienceSegs. This function will not check if the returned frames cover the whole time requested; such sanity checks are done in the pycbc.workflow.setup_datafind_workflow entry function. As opposed to setup_datafind_runtime_single_call_perifo, this function will make one call to the datafind server for every science segment. It will return a list of output files that correspond to the .lcf cache files that are produced, which list the locations of all frame files. This will cause problems with pegasus, which expects to know about all input files (i.e. the frame files themselves).

Parameters: cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files. scienceSegs (Dictionary of ifo-keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse. outputDir (path) – All output files written by datafind processes will be written to this directory. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

Returns: datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files. datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline.

pycbc.workflow.datafind.setup_datafind_runtime_cache_single_call_perifo(cp, scienceSegs, outputDir, tags=None)[source]

This function uses the gwdatafind library to obtain the location of all the frame files that will be needed to cover the analysis of the data given in scienceSegs. This function will not check if the returned frames cover the whole time requested; such sanity checks are done in the pycbc.workflow.setup_datafind_workflow entry function. As opposed to setup_datafind_runtime_generated, this function will only make one call to datafind per ifo, spanning the whole time. It will return a list of output files that correspond to the .lcf cache files that are produced, which list the locations of all frame files. This will cause problems with pegasus, which expects to know about all input files (i.e. the frame files themselves).

Parameters: cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files. scienceSegs (Dictionary of ifo-keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse. outputDir (path) – All output files written by datafind processes will be written to this directory. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

Returns: datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files. datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline.
pycbc.workflow.datafind.setup_datafind_runtime_frames_multi_calls_perifo(cp, scienceSegs, outputDir, tags=None)[source]

This function uses the gwdatafind library to obtain the location of all the frame files that will be needed to cover the analysis of the data given in scienceSegs. This function will not check if the returned frames cover the whole time requested; such sanity checks are done in the pycbc.workflow.setup_datafind_workflow entry function. As opposed to setup_datafind_runtime_single_call_perifo, this function will make one call to the datafind server for every science segment. It will return a list of files corresponding to the individual frames returned by the datafind query. This will allow pegasus to more easily identify all the files used as input, but may cause problems for codes that need to take frame cache files as input.

Parameters: cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files scienceSegs (Dictionary of ifo keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse. outputDir (path) – All output files written by datafind processes will be written to this directory. tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class to uniqueify the Files and uniqueify the actual filename. FIXME: Filenames may not be unique with current codes! datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files. datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline. pycbc.workflow.datafind.setup_datafind_runtime_frames_single_call_perifo(cp, scienceSegs, outputDir, tags=None)[source] This function uses the gwdatafind library to obtain the location of all the frame files that will be needed to cover the analysis of the data given in scienceSegs. This function will not check if the returned frames cover the whole time requested, such sanity checks are done in the pycbc.workflow.setup_datafind_workflow entry function. As opposed to setup_datafind_runtime_generated this call will only run one call to datafind per ifo, spanning the whole time. This function will return a list of files corresponding to the individual frames returned by the datafind query. This will allow pegasus to more easily identify all the files used as input, but may cause problems for codes that need to take frame cache files as input. 
Parameters:

• cp (ConfigParser.ConfigParser instance) – This contains a representation of the information stored within the workflow configuration files.
• scienceSegs (dictionary of ifo-keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse.
• outputDir (path) – All output files written by datafind processes will be written to this directory.
• tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class, to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

Returns:

• datafindcaches (list of glue.lal.Cache instances) – The glue.lal.Cache representations of the various calls to the datafind server and the returned frame files.
• datafindOuts (pycbc.workflow.core.FileList) – List of all the datafind output files for use later in the pipeline.
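The tag mechanism described above, where options set in [workflow-datafind-${TAG}] take precedence over [workflow-datafind], can be sketched with the standard-library configparser. The helper name and the option values below are illustrative, not the actual pycbc WorkflowConfigParser code.

```python
# Sketch of tag-specific option lookup: an option in
# [workflow-datafind-${TAG}] overrides the same option in
# [workflow-datafind]. Illustrative only; not pycbc code.
import configparser

def get_datafind_opt(cp, option, tags=None):
    """Return the option from the first tagged section that defines it,
    falling back to the base [workflow-datafind] section."""
    for tag in (tags or []):
        section = "workflow-datafind-%s" % tag
        if cp.has_option(section, option):
            return cp.get(section, option)
    return cp.get("workflow-datafind", option)

cp = configparser.ConfigParser()
cp.read_string("""
[workflow-datafind]
datafind-method = AT_RUNTIME_SINGLE_FRAMES

[workflow-datafind-inj]
datafind-method = AT_RUNTIME_SINGLE_CACHES
""")
print(get_datafind_opt(cp, "datafind-method"))                # base value
print(get_datafind_opt(cp, "datafind-method", tags=["inj"]))  # tag override
```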
pycbc.workflow.datafind.setup_datafind_server_connection(cp, tags=None)[source]

This function is responsible for setting up the connection with the datafind server.

Parameters:

• cp (pycbc.workflow.configuration.WorkflowConfigParser) – The in-memory representation of the ConfigParser.

Returns:

• connection – The open connection to the datafind server.
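For context, a minimal [workflow-datafind] configuration section might look like the following. The option names follow pycbc's example configurations but should be treated as illustrative and checked against the release you are using; the datafind server address is typically taken from the environment rather than the configuration file.

```ini
; Illustrative [workflow-datafind] section; option names follow the
; pycbc example configurations and may differ between releases.
[workflow-datafind]
datafind-method = AT_RUNTIME_SINGLE_FRAMES
datafind-h1-frame-type = H1_HOFT_C00
datafind-l1-frame-type = L1_HOFT_C00
; How to react to each validation check
; (no_test | warn | update_times | raise_error)
datafind-check-segment-gaps = update_times
datafind-check-frames-exist = raise_error
datafind-check-segment-summary = no_test
```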
pycbc.workflow.datafind.setup_datafind_workflow(workflow, scienceSegs, outputDir, seg_file=None, tags=None)[source]

Setup the datafind section of the workflow. This section is responsible for generating, or setting up the workflow to generate, a list of files that record the location of the frame files needed to perform the analysis. There could be multiple options here: the datafind jobs could be done at run time or could be put into a dag. The subsequent jobs will know what was done here from the OutFileList containing the datafind jobs (and the Dagman nodes if appropriate). For now the only implemented option is to generate the datafind files at runtime. This module can also check whether the frame files actually exist, check whether the obtained segments line up with the original ones, and update the science segments to reflect missing data files.

Parameters:

• workflow (pycbc.workflow.core.Workflow) – The workflow class that stores the jobs that will be run.
• scienceSegs (dictionary of ifo-keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse.
• outputDir (path) – All output files written by datafind processes will be written to this directory.
• seg_file (SegFile, optional (default=None)) – The file returned by get_science_segments containing the science segments and the associated segment_summary. This will be used for the segment_summary test and is required if, and only if, performing that test.
• tags (list of strings, optional (default=None)) – Use this to specify tags. This can be used if this module is being called more than once to give call-specific configuration (by setting options in [workflow-datafind-${TAG}] rather than [workflow-datafind]). This is also used to tag the Files returned by the class, to uniqueify the Files and the actual filenames. FIXME: Filenames may not be unique with current codes!

Returns:

• datafindOuts (OutGroupList) – List of all the datafind output files for use later in the pipeline.
• sci_avlble_file (SegFile) – SegFile containing the analysable time after the checks in the datafind module are applied to the input segment list. For production runs this is expected to be equal to the input segment list.
• scienceSegs (dictionary of ifo-keyed glue.segment.segmentlist instances) – This contains the times that the workflow is expected to analyse. If the updateSegmentTimes kwarg is given this will be updated to reflect any instances of missing data.
• sci_avlble_name (string) – The name with which the analysable time is stored in sci_avlble_file.
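The segment-gap check described above, comparing the time covered by the returned frames against the requested science segments, amounts to a segment subtraction. A minimal pure-Python sketch follows, using plain (start, end) GPS tuples in place of ligo.segments objects; it is not the pycbc implementation.

```python
# Minimal sketch of the coverage check: find the sub-intervals of the
# requested science segments that are not covered by any returned frame.
# Plain (start, end) GPS tuples stand in for ligo.segments objects.

def uncovered(science_segs, frame_segs):
    """Return the parts of science_segs not covered by frame_segs."""
    gaps = []
    for start, end in science_segs:
        cursor = start
        for fstart, fend in sorted(frame_segs):
            if fend <= cursor or fstart >= end:
                continue                        # no overlap with remainder
            if fstart > cursor:
                gaps.append((cursor, fstart))   # hole before this frame
            cursor = max(cursor, fend)
        if cursor < end:
            gaps.append((cursor, end))          # hole at the tail
    return gaps

science = [(100, 200)]
frames = [(100, 150), (160, 190)]
print(uncovered(science, frames))  # [(150, 160), (190, 200)]
```

Depending on the configured behaviour, such gaps could raise an error, emit a warning, or be subtracted from the science segments (the "update the science segments" behaviour described above).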