Article

GDAL Batch Operations with multiprocessing.Pool

Safely parallelising GDAL with multiprocessing.Pool requires three things: forcing the spawn process start method so workers receive no inherited C-level state, running a per-worker initializer that calls gdal.UseExceptions() and caps internal threads to 1, and explicitly setting dataset references to None when a task finishes. Without these steps, fork-based pools produce silent raster corruption and segmentation faults. This page is part of the Multiprocessing Geospatial Tasks guide inside the broader Spatial Batch Processing & Async Workflows reference.

Prerequisites

  • Python 3.10 or later
  • gdal from python3-gdal or a conda/mamba GDAL package (GDAL 3.4+)
  • No additional pip install — multiprocessing is in the standard library

For broader context on why geospatial workloads need explicit parallelism strategies, read the Spatial Batch Processing & Async Workflows overview first.

Why Fork Breaks GDAL

GDAL maintains global C-level state for driver registration, configuration options, error handlers, and virtual filesystem (VSI) connection pools. When Python’s multiprocessing defaults to fork on Linux, each child process inherits a snapshot of the parent’s memory — including open file descriptors and fully initialized GDAL drivers. Multiple workers then race to mutate this shared state, which produces:

  • Segmentation faults from concurrent driver registry writes
  • Silent raster corruption when two workers share a write file descriptor
  • Memory growth without bound when VSI caches are duplicated across N workers

The spawn and forkserver start methods solve this by launching a fresh Python interpreter that imports nothing from the parent. GDAL initializes cleanly inside each worker with no inherited state.

Fork vs Spawn: GDAL state inheritance Two side-by-side diagrams. On the left, a parent process with initialized GDAL forks into two workers that share state, with a collision marker. On the right, a parent spawns two workers that each initialize GDAL independently. fork (default on Linux) spawn (required for GDAL) Parent process GDAL initialized Worker 1 inherited state Worker 2 inherited state race condition / segfault Parent process dispatches tasks Worker 1 init_gdal_worker() Worker 2 init_gdal_worker() isolated, deterministic

Complete Working Implementation

The script below batches GeoTIFF reprojection across a directory. Copy it, adjust --crs and paths, and run directly. All GDAL configuration happens inside init_gdal_worker so the main process remains clean:

#!/usr/bin/env python3
"""
Batch raster reprojection with isolated multiprocessing.Pool workers.
Usage: python reproject_batch.py ./input ./output --crs EPSG:32633 --workers 4
"""
import os
import sys
import argparse
import logging
from pathlib import Path
from multiprocessing import Pool, cpu_count, set_start_method
from osgeo import gdal

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(processName)s: %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)


def init_gdal_worker() -> None:
    """Run once per worker process before any task is dispatched.

    Calling gdal.UseExceptions() turns silent C-level NULL returns into
    Python RuntimeError, making failures visible to the pool error handler.
    Setting GDAL_NUM_THREADS=1 prevents GDAL from spawning its own thread
    pool inside each worker, which would multiply CPU usage by O(workers * cores).
    """
    gdal.UseExceptions()
    os.environ["GDAL_NUM_THREADS"] = "1"       # cap internal GDAL threads
    os.environ["CPL_NUM_THREADS"] = "1"        # cap CPL/CURL threads
    os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "YES"  # skip sibling-dir scan
    os.environ["VSI_CACHE"] = "FALSE"          # avoid duplicated VSI cache per worker
    if "GDAL_DATA" not in os.environ:
        os.environ["GDAL_DATA"] = "/usr/share/gdal"


def reproject_raster(task: tuple) -> dict:
    """Warp a single raster to the target CRS.

    Returns a dict so the main process can aggregate structured results
    for logging or downstream error-handling pipelines.
    """
    src_path, dst_path, target_crs = task
    try:
        src_ds = gdal.Open(str(src_path))
        if src_ds is None:
            raise RuntimeError(f"gdal.Open returned None for {src_path}")

        dst_ds = gdal.Warp(
            str(dst_path),
            src_ds,
            dstSRS=target_crs,          # e.g. "EPSG:32633"
            format="GTiff",
            creationOptions=[
                "TILED=YES",            # tiled layout for partial reads
                "COMPRESS=LZW",         # lossless, good ratio for elevation data
                "BIGTIFF=YES",          # avoids 4 GB limit on large mosaics
            ],
            numThreads=1,               # must match GDAL_NUM_THREADS
            resampleAlg="bilinear",
            errorThreshold=0.125,       # max warp error in output pixels
        )
        if dst_ds is None:
            raise RuntimeError(f"gdal.Warp produced no output for {src_path}")

        # Assign None to trigger GDALClose() immediately — do not rely on GC.
        # Skipping this leaks file descriptors in pools that process 1000+ files.
        dst_ds = None
        src_ds = None

        logging.info("OK %s", dst_path.name)
        return {"src": str(src_path), "ok": True, "error": None}
    except Exception as exc:
        logging.error("FAIL %s: %s", src_path, exc)
        return {"src": str(src_path), "ok": False, "error": str(exc)}


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Batch raster reprojection with isolated pool workers"
    )
    parser.add_argument("input_dir", type=Path, help="Directory containing source GeoTIFFs")
    parser.add_argument("output_dir", type=Path, help="Directory for reprojected outputs")
    parser.add_argument("--crs", default="EPSG:4326", help="Target CRS (default: EPSG:4326)")
    parser.add_argument(
        "--workers", type=int, default=cpu_count(),
        help="Worker process count (default: cpu_count)"
    )
    args = parser.parse_args()
    args.output_dir.mkdir(parents=True, exist_ok=True)

    tasks = [
        (src, args.output_dir / f"{src.stem}_reprojected.tif", args.crs)
        for src in sorted(args.input_dir.glob("*.tif"))
    ]
    if not tasks:
        logging.warning("No .tif files found in %s", args.input_dir)
        sys.exit(0)

    logging.info("Submitting %d tasks to %d workers", len(tasks), args.workers)

    # initializer= runs init_gdal_worker() once per worker before any task.
    # This is the correct hook — do NOT call GDAL init inside reproject_raster.
    with Pool(processes=args.workers, initializer=init_gdal_worker) as pool:
        results = pool.map(reproject_raster, tasks)

    ok = sum(1 for r in results if r["ok"])
    logging.info("Completed %d/%d successfully.", ok, len(tasks))
    sys.exit(0 if ok == len(tasks) else 1)


if __name__ == "__main__":
    # set_start_method must be called before Pool is created.
    # Placing it under __main__ prevents accidental re-invocation when
    # a spawned worker imports this module during startup.
    set_start_method("spawn", force=True)
    main()

Step Annotations

  1. init_gdal_worker() as a Pool initializerPool(initializer=init_gdal_worker) guarantees GDAL is configured before any task runs in that process. Calling the same setup code inside reproject_raster would work but re-runs on every call, wasting cycles and making state management error-prone.

  2. GDAL_NUM_THREADS=1 and CPL_NUM_THREADS=1 — GDAL defaults both to ALL_CPUS. With 8 Python workers on an 8-core machine, the actual thread count becomes 64. This oversubscription stalls all workers in the OS scheduler. Setting both to 1 keeps the CPU budget predictable and matches the one-task-per-process model.

  3. GDAL_DISABLE_READDIR_ON_OPEN=YES — By default, gdal.Open() scans the source directory for auxiliary files (.aux.xml, .ovr, .prj). On an NFS mount or S3-backed FUSE filesystem this can take several seconds per file. Disabling it cuts gdal.Open() latency by 60–90% for cold-cache workloads.

  4. dst_ds = None / src_ds = None — Python’s garbage collector does not call GDALClose() on a predictable schedule inside a long-running worker. Explicit None assignment triggers __del__ immediately, flushing write buffers and releasing the OS file handle. This matters especially with COMPRESS=LZW because GDAL must write the compressed block table on close.

  5. errorThreshold=0.125 — Controls the maximum allowed deviation in output pixels during the reprojection warp. Lower values increase accuracy at the cost of CPU; 0.125 is the GDAL default and appropriate for most geospatial workflows. Reduce to 0.05 for high-precision DEMs.

  6. Return dict instead of bool — Returning a structured dict from each task allows the calling code to aggregate failures, log CRS mismatches, or feed results into the error-handling pipelines pattern without re-parsing log output.

Named Gotcha: Calling set_start_method Too Late

The single most common breakage pattern is placing set_start_method("spawn") inside main() after argument parsing, or calling it at module level outside the if __name__ == "__main__" guard. On the first run this appears to work. On the second Pool creation within the same interpreter it raises RuntimeError: context already set. Worse, a spawned worker imports the module during startup and hits the bare set_start_method call, triggering an import-time error that surfaces as a cryptic Process ... exited with exitcode 1 in the pool.

The fix is shown in the implementation above: call set_start_method("spawn", force=True) as the very first statement under if __name__ == "__main__":, before calling main(). The force=True flag allows it to override an already-set context when running under test harnesses.

Verification

After the run completes, confirm the output files are valid and carry the correct CRS:

# Check all outputs exist and are non-zero bytes
find ./output -name "*_reprojected.tif" -size 0 -print

# Confirm CRS on the first output file
python3 - <<'EOF'
from osgeo import gdal, osr
ds = gdal.Open("./output/scene_001_reprojected.tif")
srs = osr.SpatialReference()
srs.ImportFromWkt(ds.GetProjection())
print(srs.GetAuthorityCode(None))   # should print "32633" (or your target EPSG)
EOF

# Spot-check pixel count matches source dimensions after reprojection
gdalinfo ./output/scene_001_reprojected.tif | grep "Size is"

Exit code 0 from the script and a matching EPSG authority code confirm the run completed without silent failures. For structured JSON logging of per-file outcomes, pipe the returned results list to json.dumps before the process exits.

Troubleshooting

Symptom Root Cause Fix
Segmentation fault (core dumped) Forked worker inherits initialized GDAL drivers set_start_method("spawn", force=True)
CPU at 100% but low throughput GDAL internal threads + Python workers = oversubscription GDAL_NUM_THREADS=1, CPL_NUM_THREADS=1
Memory grows until OOM kill Unclosed datasets or duplicated VSI cache ds = None after each task; VSI_CACHE=FALSE
Silent failures / empty output files gdal.Warp returns None without raising gdal.UseExceptions() in initializer
Slow open on networked or cloud storage Directory scanning on every gdal.Open() GDAL_DISABLE_READDIR_ON_OPEN=YES
RuntimeError: context already set set_start_method called after first Pool Move it to if __name__ == "__main__": entry point

FAQ

Can I use forkserver instead of spawn?

Yes. forkserver also avoids inheriting GDAL state by launching a dedicated server process that forks clean children on demand. It is slightly faster than spawn on Linux because it avoids re-importing the full module tree per worker. Use set_start_method("forkserver") in the same __main__ guard. Note that forkserver is not available on Windows.

How many workers should I use for cloud-storage inputs (S3, GCS)?

For object storage accessed via GDAL’s VSI layer (/vsis3/, /vsigs/), network round-trips dominate latency rather than CPU. Start with workers=8 regardless of CPU count, then benchmark with iostat and htop. Increase until either CPU or network throughput plateaus. Also set VSI_CACHE=TRUE and GDAL_CACHEMAX=256 in init_gdal_worker to cache HTTP responses across tasks within the same worker process.

Why does my pool hang instead of producing errors?

Deadlocks in multiprocessing.Pool most often come from a worker process being killed by the OS (OOM, segfault) while holding a lock on the result queue. Run with maxtasksperchild=50 to recycle workers periodically: Pool(processes=N, initializer=init_gdal_worker, maxtasksperchild=50). This limits memory accumulation per worker and prevents a single leaked dataset from growing indefinitely.

Does this pattern apply to rasterio batch jobs too?

Yes. Rasterio wraps GDAL and inherits the same C-level state risks under fork. Use the same spawn start method and a rasterio.Env context manager inside the worker initializer instead of setting raw environment variables. See the async I/O for raster processing pattern for how asyncio and rasterio compare to multiprocessing for I/O-bound workloads.