.. _scan_catalog: ############ Catalog Scan ############ The command ``requake scan_catalog`` is the core of Requake's catalog-based repeater search. It compares every event in an earthquake :doc:`catalog ` against its neighbours and identifies pairs with highly similar waveforms — the building blocks of repeating-earthquake families. The scan is *single-station*: each event pair is compared using one seismic trace at a time (``catalog_trace_id``). If multiple trace IDs are configured, the station closest to the event pair is selected. The scan is parallelized across multiple CPU cores to handle large catalogs efficiently. This page describes how the scan works, what configuration parameters control it, and how to get the best performance out of it. Overview -------- Given a :doc:`catalog ` of :math:`N` events, the scan proceeds in three stages: 1. **Spatial grouping** — events are compared only if their epicentres lie within a configurable search radius (``catalog_search_range``). This reduces the naive :math:`\mathcal{O}(N^2)` pair count to a much smaller set of candidates. 2. **Waveform retrieval** — for each candidate pair, the required waveform windows are fetched (from FDSN web services, a local archive, or the on-disk cache) and cut around the P arrival. 3. **Cross-correlation** — the two waveforms are band-pass filtered between ``cc_freq_min`` and ``cc_freq_max`` Hz to isolate the frequency band of interest, then cross-correlated in the time domain. The maximum correlation coefficient :math:`CC_\mathrm{max}` is computed. The result, together with the optimal lag, trace ID, and inter-event distance, is stored for every candidate pair in the :doc:`output database `. All the parameters mentioned above are described in the :doc:`configuration file `. Spatial grouping ---------------- For each event in the catalog, Requake builds a list of neighbouring events whose epicentral distance is within ``catalog_search_range`` kilometres. Only these candidate pairs are passed to the waveform stage. The spatial search uses a `k-d tree `_ (implemented via `scipy.spatial.cKDTree `_) over the 3‑D Cartesian coordinates of the events on the unit sphere, which makes the neighbour lookup very fast even for large catalogs. If ``catalog_search_range`` is set to zero (or negative), *every* possible event pair is considered — useful for small catalogs, but impractical for large ones. Waveform retrieval ------------------ For each candidate pair, Requake retrieves a short waveform window around the P‑wave arrival for one or more seismic traces (``catalog_trace_id``). The window starts ``cc_pre_P`` seconds before the theoretical P arrival and lasts ``cc_trace_length`` seconds. Waveforms can come from three sources, in order of priority: 1. **On-disk SQLite cache** (``OUTDIR/waveform_cache.sqlite``) — if ``catalog_waveform_disk_cache_enabled`` is ``true``, previously fetched waveforms are reused. Use :doc:`requake wfcache prefetch ` to populate this cache before the scan. 2. **In-memory cache** (sized by ``catalog_waveform_cache_size``) — waveforms for recently processed events are kept in RAM to avoid repeated fetches. 3. **FDSN web services** or **local archives** — if a waveform is not found in either cache, it is downloaded from the configured FDSN dataselect service or read from a local SDS directory or per-event data folder. If multiple trace IDs are configured (comma-separated), Requake selects the station closest to the event pair for each comparison. Cross-correlation and pair selection ------------------------------------- The cross-correlation step is controlled by the ``Processing parameters`` section of the :doc:`configuration file `. The relevant parameters are: ``cc_pre_P`` seconds of signal before the P arrival ``cc_trace_length`` total waveform window length in seconds ``cc_freq_min``, ``cc_freq_max`` band-pass filter corners in Hz ``cc_max_shift`` maximum allowed lag in seconds ``cc_allow_negative`` when ``true``, the largest absolute value of the correlation function is returned — be it positive (correlation) or negative (anti-correlation) Each waveform pair undergoes the following processing: - Both traces are band-pass filtered between ``cc_freq_min`` and ``cc_freq_max`` Hz. - They are cross-correlated in the time domain using `obspy.signal.cross_correlation.correlate `_, allowing a maximum lag of ``cc_max_shift`` seconds to account for travel-time differences. - The maximum normalised cross-correlation coefficient :math:`CC_\mathrm{max}` is extracted with `obspy.signal.cross_correlation.xcorr_max `_. Every candidate pair is written to the :doc:`output database ` with its :math:`CC_\mathrm{max}`, optimal lag, trace ID, and inter-event distance. When ``cc_allow_negative`` is ``true``, the largest value — positive or negative — of the cross-correlation function is stored, so both correlated and anti-correlated pairs are recorded. Parallel execution ------------------ The scan uses multiple worker processes to process pairs in parallel. The number of workers is controlled by ``catalog_scan_nprocs``: set it to ``0`` (the default) for automatic selection — one fewer than the number of available CPU cores, ``1`` to run serially, or to a specific number. Each worker maintains its own in-memory waveform cache (sized by ``catalog_waveform_cache_size_parallel``, or derived automatically) to minimise inter-process communication. Resuming an interrupted scan ---------------------------- If the scan is interrupted (e.g., by a network outage or a user pressing Ctrl‑C), running ``requake scan_catalog`` again will detect the existing pairs in the database and offer to resume from where it stopped. Use ``--force-continue`` to skip the prompt in scripts. Slurm clusters -------------- Requake automatically detects when it is running inside a `Slurm `_ job (via the ``SLURM_JOB_ID`` environment variable) and adapts its behaviour: * **Worker count.** When ``catalog_scan_nprocs`` is set to ``0`` (automatic) and Slurm is detected, the number of workers is taken from the Slurm allocation (``SLURM_CPUS_PER_TASK``, ``SLURM_CPUS_ON_NODE``, or ``SLURM_JOB_CPUS_PER_NODE``, checked in that order). Unlike local runs, the full allocated count is used without subtracting one. * **Progress logging.** Progress messages include the Slurm job ID, process ID, and node list, making it easier to monitor jobs in cluster logs. * **Non-interactive mode.** If a scan is interrupted on a cluster, Requake will not prompt for input (since ``stdin`` is typically not available). Use ``--force`` to overwrite an existing scan or ``--force-continue`` to resume it. .. note:: Slurm integration has been developed and tested on the `IPGP S-CAPAD `_ platform, which we gratefully acknowledge. Performance tips ---------------- * **Prefetch waveforms.** For large catalogs relying on FDSN sources, run :doc:`requake wfcache prefetch ` before the scan. This downloads all required waveform windows once and stores them in the local SQLite cache, letting the scan read from disk instead of the network. * **Tune the search radius.** A larger ``catalog_search_range`` finds more candidate pairs but increases runtime. Choose a value that reflects the maximum expected distance between repeating events in your study area. * **Limit memory.** If you hit memory limits, reduce ``catalog_waveform_cache_size`` and let the on-disk cache handle persistence instead.