Catalog Scan

The command requake scan_catalog is the core of Requake’s catalog-based repeater search. It compares every event in an earthquake catalog against its neighbours and identifies pairs with highly similar waveforms — the building blocks of repeating-earthquake families.

The scan is single-station: each event pair is compared using one seismic trace at a time (catalog_trace_id). If multiple trace IDs are configured, the station closest to the event pair is selected.

The scan is parallelized across multiple CPU cores to handle large catalogs efficiently.

This page describes how the scan works, what configuration parameters control it, and how to get the best performance out of it.

Overview

Given a catalog of \(N\) events, the scan proceeds in three stages:

  1. Spatial grouping — events are compared only if their epicentres lie within a configurable search radius (catalog_search_range). This reduces the naive \(\mathcal{O}(N^2)\) pair count to a much smaller set of candidates.

  2. Waveform retrieval — for each candidate pair, the required waveform windows are fetched (from FDSN web services, a local archive, or the on-disk cache) and cut around the P arrival.

  3. Cross-correlation — the two waveforms are band-pass filtered between cc_freq_min and cc_freq_max Hz to isolate the frequency band of interest, then cross-correlated in the time domain. The maximum correlation coefficient \(CC_\mathrm{max}\) is computed. The result, together with the optimal lag, trace ID, and inter-event distance, is stored for every candidate pair in the output database.

All the parameters mentioned above are described in the configuration file.

Spatial grouping

For each event in the catalog, Requake builds a list of neighbouring events whose epicentral distance is within catalog_search_range kilometres. Only these candidate pairs are passed to the waveform stage.

The spatial search uses a k-d tree (implemented via scipy.spatial.cKDTree) over the 3‑D Cartesian coordinates of the events on the unit sphere, which makes the neighbour lookup very fast even for large catalogs.

If catalog_search_range is set to zero (or negative), every possible event pair is considered — useful for small catalogs, but impractical for large ones.
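The neighbour lookup described above can be sketched as follows. This is a minimal illustration, not Requake's actual code: the function name candidate_pairs, the Earth radius constant, and the conversion from a great-circle distance to a chord length are assumptions of this sketch.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def candidate_pairs(lats_deg, lons_deg, search_range_km):
    """Return the set of index pairs (i, j), i < j, within search_range_km.

    Hypothetical helper: it mirrors the behaviour described in the text,
    but it is not part of Requake's API.
    """
    lat = np.radians(np.asarray(lats_deg, dtype=float))
    lon = np.radians(np.asarray(lons_deg, dtype=float))
    n = len(lat)
    if search_range_km <= 0:
        # Zero or negative range: every possible pair is a candidate
        return {(i, j) for i in range(n) for j in range(i + 1, n)}
    # Epicentres as 3-D Cartesian points on the unit sphere
    xyz = np.column_stack(
        (np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat))
    )
    # Convert the great-circle range to a straight-line chord length,
    # because the k-d tree measures Euclidean distance
    angle = min(search_range_km / EARTH_RADIUS_KM, np.pi)
    chord = 2.0 * np.sin(angle / 2.0)
    return cKDTree(xyz).query_pairs(r=chord)
```

Called with three epicentres, two of them about 5.6 km apart and a third far away, and a 10 km range, the function returns only the close pair. With a range of zero it returns all three pairs.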

Waveform retrieval

For each candidate pair, Requake retrieves a short waveform window around the P‑wave arrival for one or more seismic traces (catalog_trace_id). The window starts cc_pre_P seconds before the theoretical P arrival and lasts cc_trace_length seconds.
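As a small worked example of the window definition, the sketch below computes the start and end times from a P arrival time. The function name waveform_window and the numeric values in the usage note are hypothetical; only the two parameter names come from the configuration.

```python
from datetime import datetime, timedelta


def waveform_window(p_arrival, cc_pre_P, cc_trace_length):
    """Return (start, end) of the window cut around the P arrival.

    Hypothetical helper: the window starts cc_pre_P seconds before the
    arrival and lasts cc_trace_length seconds in total.
    """
    start = p_arrival - timedelta(seconds=cc_pre_P)
    end = start + timedelta(seconds=cc_trace_length)
    return start, end
```

For instance, with a P arrival at 00:00:10, cc_pre_P set to 5 and cc_trace_length set to 25, the window runs from 00:00:05 to 00:00:30. Note that cc_trace_length is the total window length, so the signal after the P arrival lasts cc_trace_length minus cc_pre_P seconds.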

Waveforms can come from three sources, in order of priority:

  1. On-disk SQLite cache (OUTDIR/waveform_cache.sqlite) — if catalog_waveform_disk_cache_enabled is true, previously fetched waveforms are reused. Use requake wfcache prefetch to populate this cache before the scan.

  2. In-memory cache (sized by catalog_waveform_cache_size) — waveforms for recently processed events are kept in RAM to avoid repeated fetches.

  3. FDSN web services or local archives — if a waveform is not found in either cache, it is downloaded from the configured FDSN dataselect service or read from a local SDS directory or per-event data folder.
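The lookup order listed above can be illustrated with a toy three-tier store. Everything here is an assumption made for illustration: the class name WaveformStore, the table schema, the string keys, and the least-recently-used eviction policy are not taken from Requake, whose cache layout may differ.

```python
import sqlite3
from collections import OrderedDict


class WaveformStore:
    """Toy lookup: disk cache first, then memory cache, then the source."""

    def __init__(self, db_path, memory_size, fetch_func):
        self.db = sqlite3.connect(db_path)
        # Hypothetical schema, not Requake's actual cache layout
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS waveforms "
            "(key TEXT PRIMARY KEY, data BLOB)"
        )
        self.memory = OrderedDict()
        self.memory_size = memory_size
        self.fetch_func = fetch_func  # stands in for FDSN or archive access

    def store_on_disk(self, key, data):
        """Write one waveform to the disk cache (what a prefetch would do)."""
        self.db.execute(
            "INSERT OR REPLACE INTO waveforms VALUES (?, ?)", (key, data)
        )
        self.db.commit()

    def get(self, key):
        """Return (data, tier), where tier names the source that answered."""
        row = self.db.execute(
            "SELECT data FROM waveforms WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0], "disk"
        if key in self.memory:
            self.memory.move_to_end(key)  # mark as recently used
            return self.memory[key], "memory"
        data = self.fetch_func(key)
        self.memory[key] = data
        while len(self.memory) > self.memory_size:
            self.memory.popitem(last=False)  # evict least recently used
        return data, "source"
```

In this sketch a waveform fetched from the source is kept only in memory. A smaller memory_size therefore means more repeated fetches unless the disk cache has been populated beforehand.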

If multiple trace IDs are configured (comma-separated), Requake selects the station closest to the event pair for each comparison.
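One possible way to pick the closest station is sketched below. The text does not say how the distance to an event pair is measured, so this sketch assumes the mean of the two epicentral distances. The function names, the station dictionary, and the trace IDs in the usage note are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def closest_trace_id(stations, event1, event2):
    """Return the trace ID whose station is nearest to the event pair.

    stations maps trace ID -> (lat, lon); event1 and event2 are (lat, lon).
    Assumption: "nearest" means the smallest mean epicentral distance.
    """
    def mean_distance(trace_id):
        lat, lon = stations[trace_id]
        return 0.5 * (haversine_km(lat, lon, *event1)
                      + haversine_km(lat, lon, *event2))

    return min(stations, key=mean_distance)
```

With stations {"XX.AAA..HHZ": (0.0, 0.0), "XX.BBB..HHZ": (5.0, 5.0)} and two events near latitude 5 and longitude 5, the function returns "XX.BBB..HHZ".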

Cross-correlation and pair selection

The cross-correlation step is controlled by the Processing parameters section of the configuration file. The relevant parameters are:

  • cc_pre_P: seconds of signal before the P arrival

  • cc_trace_length: total waveform window length in seconds

  • cc_freq_min, cc_freq_max: band-pass filter corners in Hz

  • cc_max_shift: maximum allowed lag in seconds

  • cc_allow_negative: when true, the largest absolute value of the correlation function is returned, whether positive (correlation) or negative (anti-correlation)

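Each waveform pair is band-pass filtered between cc_freq_min and cc_freq_max, then cross-correlated in the time domain, with the lag limited to cc_max_shift seconds.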

Every candidate pair is written to the output database with its \(CC_\mathrm{max}\), optimal lag, trace ID, and inter-event distance.

When cc_allow_negative is true, the stored value is the extremum of the cross-correlation function with the largest absolute value, with its sign preserved, so both correlated and anti-correlated pairs are recorded.
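The role of cc_max_shift and cc_allow_negative can be illustrated with a short NumPy sketch. It assumes two already filtered traces with the same sampling rate and uses a normalized time-domain cross-correlation. The function name cc_max, the demeaning step, and the normalization are assumptions of this sketch and may differ from what Requake does internally.

```python
import numpy as np


def cc_max(x, y, sampling_rate, cc_max_shift, cc_allow_negative=False):
    """Return (cc, lag_seconds) for two traces.

    Hypothetical helper. A positive lag means x is delayed relative to y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean()
    y = y - y.mean()
    norm = np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    if norm == 0:
        return 0.0, 0.0  # flat traces carry no similarity information
    # Full time-domain cross-correlation, scaled to the range [-1, 1]
    cc = np.correlate(x, y, mode="full") / norm
    lags = np.arange(-(len(y) - 1), len(x))
    # Keep only lags within the allowed shift
    max_lag = int(round(cc_max_shift * sampling_rate))
    keep = np.abs(lags) <= max_lag
    cc, lags = cc[keep], lags[keep]
    # Largest positive value, or largest absolute value with its sign kept
    idx = np.argmax(np.abs(cc)) if cc_allow_negative else np.argmax(cc)
    return float(cc[idx]), float(lags[idx] / sampling_rate)
```

When one trace is a copy of the other delayed by three samples at 10 Hz, the function returns a coefficient close to 1 and a lag of 0.3 s. If the delayed copy is also flipped in polarity, the coefficient is close to -1 only when cc_allow_negative is true.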

Parallel execution

The scan uses multiple worker processes to process pairs in parallel. The number of workers is controlled by catalog_scan_nprocs: 0 (the default) selects it automatically, using one fewer than the number of available CPU cores; 1 runs the scan serially; any other positive value sets the number of workers explicitly.

Each worker maintains its own in-memory waveform cache (sized by catalog_waveform_cache_size_parallel, or derived automatically) to minimise inter-process communication.

Resuming an interrupted scan

If the scan is interrupted (e.g., by a network outage or a user pressing Ctrl‑C), running requake scan_catalog again will detect the existing pairs in the database and offer to resume from where it stopped. Use --force-continue to skip the prompt in scripts.

Slurm clusters

Requake automatically detects when it is running inside a Slurm job (via the SLURM_JOB_ID environment variable) and adapts its behaviour:

  • Worker count. When catalog_scan_nprocs is set to 0 (automatic) and Slurm is detected, the number of workers is taken from the Slurm allocation (SLURM_CPUS_PER_TASK, SLURM_CPUS_ON_NODE, or SLURM_JOB_CPUS_PER_NODE, checked in that order). Unlike local runs, the full allocated count is used without subtracting one.

  • Progress logging. Progress messages include the Slurm job ID, process ID, and node list, making it easier to monitor jobs in cluster logs.

  • Non-interactive mode. When an interrupted scan is rerun on a cluster, Requake does not prompt for input (since stdin is typically not available). Use --force to overwrite an existing scan or --force-continue to resume it.
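The worker-count rules for local and Slurm runs can be summarized in a short sketch. The function name resolve_nprocs and its arguments are hypothetical, and the parsing of values such as "16(x2)" (a form that SLURM_JOB_CPUS_PER_NODE can take) is an assumption about how such values might be handled, not a description of Requake's code.

```python
import os
import re

SLURM_CPU_VARS = (
    "SLURM_CPUS_PER_TASK",
    "SLURM_CPUS_ON_NODE",
    "SLURM_JOB_CPUS_PER_NODE",
)


def resolve_nprocs(catalog_scan_nprocs, environ=None, cpu_count=None):
    """Return the number of worker processes to start (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    if catalog_scan_nprocs > 0:
        return catalog_scan_nprocs  # explicit setting, 1 means serial
    if "SLURM_JOB_ID" in environ:
        # Inside a Slurm job: use the full allocation, checked in order
        for var in SLURM_CPU_VARS:
            match = re.match(r"\d+", environ.get(var, ""))
            if match and int(match.group()) > 0:
                return int(match.group())
    # Local run: leave one core free, but always keep at least one worker
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count - 1)
```

On an 8-core workstation with catalog_scan_nprocs set to 0, this gives 7 workers. Inside a Slurm job with SLURM_CPUS_ON_NODE set to 16, it gives 16.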

Note

Slurm integration has been developed and tested on the IPGP S-CAPAD platform, which we gratefully acknowledge.

Performance tips

  • Prefetch waveforms. For large catalogs relying on FDSN sources, run requake wfcache prefetch before the scan. This downloads all required waveform windows once and stores them in the local SQLite cache, letting the scan read from disk instead of the network.

  • Tune the search radius. A larger catalog_search_range finds more candidate pairs but increases runtime. Choose a value that reflects the maximum expected distance between repeating events in your study area.

  • Limit memory. If you hit memory limits, reduce catalog_waveform_cache_size and let the on-disk cache handle persistence instead.