Use PyYAML with Pydantic v2 to load, validate, and apply a YAML config in a Python GIS CLI: parse the file with yaml.safe_load, merge CLI flag overrides into the raw dict so that a flag always wins over the file, construct a typed BatchConfig model from the result, and apply GDAL environment variables before any I/O. This self-contained pattern, part of the Configuration File Management guide, catches CRS and parameter mistakes at startup instead of hours into raster processing.
Prerequisites
pip install "pyyaml>=6.0" "pydantic>=2.0" "click>=8.1"
You need Python 3.11+, a working GDAL installation visible to the shell, and a basic grasp of CLI Architecture & Design Patterns. If you are still choosing between Click and Typer for your project, review Click vs Typer for Geospatial Workflows before committing to the CLI layer shown here.
Config Precedence: How the Override Chain Works
Before writing any code it helps to see how the four layers compose. CLI flags override environment variables, which override the YAML file, which overrides schema defaults — the same four-layer model described in Configuration File Management.
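The implementation below encodes the flag, file, and default layers directly: CLI flag values are merged into the raw dict before Pydantic ever sees it, so the model has no special-case logic for “did the user pass --threads?”. It does not read environment-variable overrides, and apply_gdal_env() overwrites any GDAL variables already set in the shell, so add an environment layer to the merge if you need that step of the chain.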
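To make the precedence chain concrete, here is a minimal sketch of the four layers as plain dicts, merged from lowest to highest precedence. The helper names (deep_merge, resolve_layers) and the layer contents are hypothetical, and the env_overrides layer is an assumption: the implementation in this article only merges CLI flags into the YAML dict.

```python
from copy import deepcopy
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict:
    """Return a new dict in which `override` wins, recursing into nested mappings."""
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_layers(*layers: Mapping[str, Any]) -> dict:
    """Merge layers in order, so later layers take precedence over earlier ones."""
    resolved: dict = {}
    for layer in layers:
        resolved = deep_merge(resolved, layer)
    return resolved


# Hypothetical layer contents, lowest precedence first.
schema_defaults = {"gdal": {"GDAL_NUM_THREADS": 4, "GDAL_CACHEMAX": 256}}
yaml_file = {"gdal": {"GDAL_NUM_THREADS": 8, "GDAL_CACHEMAX": 512}}
env_overrides = {"gdal": {"GDAL_NUM_THREADS": 16}}  # assumed layer, not in the article's code
cli_flags = {"gdal": {"GDAL_NUM_THREADS": 2}}

resolved = resolve_layers(schema_defaults, yaml_file, env_overrides, cli_flags)
print(resolved)
```

The flag value wins for GDAL_NUM_THREADS, while GDAL_CACHEMAX keeps the value from the YAML file because no higher layer sets it. In the article's pattern the schema defaults live in the Pydantic model rather than in a dict, so only the upper layers would need merging.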
Complete Working Implementation
The YAML file your team authors looks like this:
# pipeline.yaml
workspace: /data/rasters
input_glob: "**/*.tif"
output_dir: /data/output
gdal:
  GDAL_CACHEMAX: 512
  GDAL_NUM_THREADS: 8
  OGR_ENABLE_PARTIAL_REPROJECTION: true
spatial:
  src_crs: "EPSG:4326"
  dst_crs: "EPSG:3857"
  resampling: bilinear
  tile_size: 512
The Python module that loads and validates it:
#!/usr/bin/env python3
"""geo_pipeline.py — validated YAML config loader for geospatial batch CLIs."""
import glob
import os
from pathlib import Path
from typing import List, Optional
import click
import yaml
from pydantic import BaseModel, Field, ValidationError, field_validator
# ── Schema models ─────────────────────────────────────────────────────────────
class GdalEnv(BaseModel):
    """GDAL/OGR runtime settings injected via os.environ before any I/O."""

    GDAL_CACHEMAX: int = Field(default=256, ge=64, le=4096,
                               description="Raster block cache in MB")
    GDAL_NUM_THREADS: int = Field(default=4, ge=1, le=64)
    OGR_ENABLE_PARTIAL_REPROJECTION: bool = Field(default=True)

class SpatialParams(BaseModel):
    src_crs: str = Field(default="EPSG:4326")
    dst_crs: str = Field(default="EPSG:3857")
    resampling: str = Field(default="bilinear")
    tile_size: int = Field(default=512, ge=128, le=4096)

    @field_validator("src_crs", "dst_crs")
    @classmethod
    def validate_crs(cls, v: str) -> str:
        # Prefix check only: it does not confirm that the code itself exists.
        if not (v.upper().startswith("EPSG:") or v.upper().startswith("PROJ:")):
            raise ValueError(f"CRS must begin with EPSG: or PROJ:, got {v!r}")
        return v

    @field_validator("resampling")
    @classmethod
    def validate_resampling(cls, v: str) -> str:
        valid = {"nearest", "bilinear", "cubic", "cubicspline", "lanczos",
                 "average", "mode"}
        if v.lower() not in valid:
            raise ValueError(f"resampling must be one of {sorted(valid)}")
        return v.lower()

class BatchConfig(BaseModel):
    workspace: Path
    input_glob: str
    output_dir: Path
    gdal: GdalEnv = Field(default_factory=GdalEnv)
    spatial: SpatialParams = Field(default_factory=SpatialParams)

    @field_validator("workspace", "output_dir", mode="before")
    @classmethod
    def resolve_paths(cls, v: object) -> Path:
        return Path(str(v)).expanduser().resolve()

    def apply_gdal_env(self) -> None:
        """Inject GDAL settings into os.environ before any rasterio or GDAL call."""
        for key, value in self.gdal.model_dump().items():
            os.environ[key] = str(value)

    def resolve_inputs(self) -> List[Path]:
        """Expand input_glob relative to workspace into a sorted, deduplicated list."""
        pattern = str(self.workspace / self.input_glob)
        return sorted({Path(p) for p in glob.glob(pattern, recursive=True)})

# ── CLI entry-point ────────────────────────────────────────────────────────────
@click.command()
@click.option(
    "--config", "cfg_path",
    type=click.Path(exists=True, dir_okay=False, path_type=Path),
    required=True,
    help="Path to the YAML pipeline config file.",
)
@click.option(
    "--threads", type=int, default=None,
    help="Override GDAL_NUM_THREADS from the config file.",
)
@click.option(
    "--dst-crs", "dst_crs", default=None,
    help="Override spatial.dst_crs (e.g. EPSG:32633).",
)
def run_pipeline(cfg_path: Path, threads: Optional[int], dst_crs: Optional[str]) -> None:
    """Load, validate, and execute a geospatial batch pipeline from a YAML config."""
    with cfg_path.open("r", encoding="utf-8") as fh:
        raw = yaml.safe_load(fh) or {}  # an empty file parses to None
    if not isinstance(raw, dict):
        click.echo("Config root must be a YAML mapping.", err=True)
        raise click.exceptions.Exit(code=2)

    # ① Merge CLI overrides into raw dict BEFORE constructing the model.
    #   This is the key pattern: the model never needs to know about Click.
    if threads is not None:
        raw.setdefault("gdal", {})["GDAL_NUM_THREADS"] = threads
    if dst_crs is not None:
        raw.setdefault("spatial", {})["dst_crs"] = dst_crs

    # ② Validate everything in one shot; fail fast with a clear message.
    try:
        config = BatchConfig(**raw)
    except ValidationError as exc:
        click.echo(f"Config validation failed:\n{exc}", err=True)
        raise click.exceptions.Exit(code=2)

    # ③ Inject GDAL env before any I/O (rasterio, pyogrio, osgeo.gdal).
    config.apply_gdal_env()

    # ④ Audit log: print the resolved config so CI logs are self-documenting.
    click.echo(config.model_dump_json(indent=2))

    # ⑤ Resolve input files and abort early if the glob matches nothing.
    inputs = config.resolve_inputs()
    if not inputs:
        click.echo(
            f"No files matched '{config.input_glob}' under {config.workspace}",
            err=True,
        )
        raise click.exceptions.Exit(code=2)

    click.echo(
        f"Processing {len(inputs)} files: "
        f"{config.spatial.src_crs} → {config.spatial.dst_crs} "
        f"@ {config.spatial.resampling} resampling"
    )
    # ⑥ Pipeline execution continues here with fully validated config.


if __name__ == "__main__":
    run_pipeline()
Step Annotations
① Merge CLI overrides before model construction. raw.setdefault("gdal", {})["GDAL_NUM_THREADS"] = threads mutates the raw dict so the Pydantic model sees a single consistent input. This avoids the anti-pattern of building a model and then patching it: Pydantic v2 does not re-run validators on attribute assignment unless the model sets validate_assignment=True, so a patched value can skip validation entirely.
② Single validation gateway. Constructing BatchConfig(**raw) is the only place validation runs. Catching ValidationError here and printing it before calling raise click.exceptions.Exit(code=2) gives the operator an actionable error on stderr and a non-zero exit code. The value 2 is a convention rather than a POSIX requirement; it is the code Click itself uses for usage errors, so wrapper scripts can treat a bad flag and a bad config the same way.
③ apply_gdal_env() before any I/O. GDAL reads its configuration variables at the time a dataset handle is opened, not at import time. Calling this method before rasterio.open() or any pyogrio read guarantees that GDAL_CACHEMAX and GDAL_NUM_THREADS take effect on every I/O call in the process.
④ Audit log with model_dump_json(). Printing the full resolved config to stdout at startup means every CI log contains a complete record of the exact parameters used — CRS strings, thread counts, cache sizes. This is essential for debugging spatial discrepancies in distributed pipelines.
⑤ Glob resolution in resolve_inputs(). glob.glob(pattern, recursive=True) with the ** wildcard finds GeoTIFFs in nested subdirectories. Wrapping the result in a set before sorting deduplicates paths that could appear twice when patterns overlap.
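Schema-level check: field_validator for CRS strings. Checking that src_crs and dst_crs begin with EPSG: or PROJ: at parse time rejects malformed authority strings before they reach GDAL, instead of relying on a downstream tool to report them. A CRS that is mishandled downstream can produce geometrically wrong output with no obvious error. The check is a prefix test only; it does not confirm that the code after the prefix exists.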
Named Gotcha: GDAL Environment Variables Set After Dataset Open Have No Effect
The most common failure when adopting this pattern is placing apply_gdal_env() after the first rasterio.open() call, or inside a lazy-loading code path that triggers after the module initialises its GDAL context.
# WRONG: GDAL_CACHEMAX is ignored for this handle
with rasterio.open(src) as ds:
    config.apply_gdal_env()  # too late; cache is already allocated
    data = ds.read()

# CORRECT: inject before any open() call
config.apply_gdal_env()
with rasterio.open(src) as ds:
    data = ds.read()
The fix is always to call apply_gdal_env() immediately after model construction and before any code that touches rasterio, GDAL, or pyogrio. A unit test that mocks os.environ and checks the injected values before calling rasterio.open() will catch regressions during CI.
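A regression test along those lines might look like the following sketch. It uses only the standard library: apply_env is a hypothetical stand-in that mirrors what BatchConfig.apply_gdal_env() does, and the settings dict is illustrative. A real test would import BatchConfig from geo_pipeline and call rasterio.open() only after the assertions.

```python
import os
import unittest
from unittest import mock


def apply_env(settings: dict) -> None:
    """Stand-in for BatchConfig.apply_gdal_env(): write each setting to os.environ as a string."""
    for key, value in settings.items():
        os.environ[key] = str(value)


class ApplyEnvTest(unittest.TestCase):
    def test_values_are_injected_as_strings(self) -> None:
        settings = {"GDAL_CACHEMAX": 512, "GDAL_NUM_THREADS": 8}
        # patch.dict restores os.environ when the block exits, so the test
        # leaves no GDAL variables behind for other tests.
        with mock.patch.dict(os.environ, {}):
            apply_env(settings)
            self.assertEqual(os.environ["GDAL_CACHEMAX"], "512")
            self.assertEqual(os.environ["GDAL_NUM_THREADS"], "8")
            # In a real test, the first rasterio.open() call would go here.


# Run the case programmatically so the snippet also works when pasted into a REPL.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyEnvTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Under pytest the same check can be written as a plain function; the point is that the environment is asserted before any dataset is opened.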
Verification Snippet
After running the pipeline, verify that GDAL received the correct settings and that the output matches the expected CRS:
# 1. Confirm GDAL env variables were injected (visible in process env)
python - <<'EOF'
import os
import yaml
from geo_pipeline import BatchConfig

# Quick smoke test: construct the config and check the environment after apply
with open("pipeline.yaml", encoding="utf-8") as fh:
    raw = yaml.safe_load(fh)
cfg = BatchConfig(**raw)
cfg.apply_gdal_env()
assert os.environ["GDAL_CACHEMAX"] == str(cfg.gdal.GDAL_CACHEMAX)
print("GDAL env OK:", os.environ["GDAL_CACHEMAX"], "MB")
EOF
# 2. Check that an output GeoTIFF is in the expected CRS (requires GDAL CLI tools)
gdalinfo output/tile_0001.tif | grep -E "EPSG|Coordinate System"
For a deeper test that validates CRS round-trips and resampling fidelity, load the output with rasterio and assert ds.crs.to_epsg() == 3857.
FAQ
Why use PyYAML instead of tomllib for geospatial CLIs?
YAML’s block scalars and inline comments keep a config readable when it documents spatial parameters such as bounding boxes, PROJ strings, and long glob patterns. tomllib (Python 3.11+) is a good choice for flat key-value configurations and needs no third-party dependency. YAML is also a common configuration format in open-source GIS tooling (MapProxy and pygeoapi both use it), so operators are often already familiar with its syntax.
How do I override a nested YAML key from a Click flag without breaking validation?
Mutate the raw dict before constructing the model: raw.setdefault("spatial", {})["dst_crs"] = cli_value. Pydantic then sees one consistent input dict and validates the overridden value the same way it validates a file-sourced value. Avoid building the model and then mutating its fields: unless the model sets validate_assignment=True, assignment does not run validators.
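When several flags map to nested keys, the setdefault one-liner can be wrapped in a small helper. The sketch below is one possible shape, not part of the article's module: set_nested and its dotted-key convention are hypothetical. It also covers a case the one-liner does not, namely a section that is present but empty in the YAML file and therefore parses to None.

```python
from typing import Any, Optional


def set_nested(raw: dict, dotted_key: str, value: Optional[Any]) -> dict:
    """Set raw[a][b] = value for dotted_key "a.b"; do nothing when value is None (flag not passed)."""
    if value is None:
        return raw
    *parents, leaf = dotted_key.split(".")
    node = raw
    for part in parents:
        child = node.get(part)
        if not isinstance(child, dict):
            # Missing section, empty YAML section (None), or any other non-mapping value.
            child = {}
            node[part] = child
        node = child
    node[leaf] = value
    return raw


raw = {"spatial": {"src_crs": "EPSG:4326", "dst_crs": "EPSG:3857"}, "gdal": None}
set_nested(raw, "spatial.dst_crs", "EPSG:32633")  # flag passed: overrides the file value
set_nested(raw, "gdal.GDAL_NUM_THREADS", 2)       # section was empty in the file
set_nested(raw, "spatial.resampling", None)       # flag omitted: nothing changes
print(raw)
```

The resulting dict is then passed to BatchConfig(**raw) as usual, so overridden values go through the same validators as file values.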
When exactly should apply_gdal_env() be called?
Immediately after BatchConfig(**raw) succeeds and before any call that touches rasterio, pyogrio, or osgeo.gdal. GDAL reads environment variables when it first allocates a resource (cache block, dataset handle, driver registry). Injecting after that point has no effect on already-opened handles.
Can a Pydantic validator check that the input glob actually matches files?
Yes — use a @model_validator(mode="after") (not field_validator) because the check needs both workspace and input_glob to be resolved. Raise ValueError listing the zero-match pattern so the operator sees it before the batch job allocates any workers.
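The body of such a validator could be sketched as a plain function, shown here with the standard library only so it runs on its own. require_matches is a hypothetical name; inside BatchConfig it would be called from a @model_validator(mode="after") method with self.workspace and self.input_glob, and that method would then return self. The temporary directory and file names are illustrative.

```python
import glob
import tempfile
from pathlib import Path
from typing import List


def require_matches(workspace: Path, input_glob: str) -> List[Path]:
    """Expand input_glob under workspace; raise ValueError when nothing matches."""
    pattern = str(workspace / input_glob)
    matches = sorted({Path(p) for p in glob.glob(pattern, recursive=True)})
    if not matches:
        raise ValueError(f"input_glob {input_glob!r} matched no files under {workspace}")
    return matches


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "scenes" / "2024").mkdir(parents=True)
    (workspace / "scenes" / "2024" / "tile_a.tif").touch()
    (workspace / "top.tif").touch()

    # "**" with recursive=True matches the workspace itself and nested directories.
    found = [p.relative_to(workspace).as_posix()
             for p in require_matches(workspace, "**/*.tif")]
    print(found)

    try:
        require_matches(workspace, "**/*.jp2")
    except ValueError as exc:
        error_message = str(exc)
        print(error_message)
```

One trade-off to weigh: a validator that touches the filesystem makes model construction depend on the machine it runs on, which can complicate unit tests that build a BatchConfig from fixture data.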
How do I handle YAML configs written for an older schema version?
Add schema_version: int = Field(default=1) to your model. In a @model_validator(mode="before") inspect the raw dict and remap legacy keys — for example, renaming a top-level crs string to spatial.src_crs — before Pydantic validates anything. This keeps migration logic isolated and testable without forking the validation code.
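The remapping step can be sketched as a pure function over the raw dict, which keeps it testable without Pydantic. Everything specific here is an assumption for illustration: the name migrate_raw_config, the idea that version 2 is the current schema, and the rule that an explicit spatial.src_crs wins over a legacy top-level crs. In the model, a @model_validator(mode="before") classmethod would call this function and return its result.

```python
from copy import deepcopy
from typing import Any, Dict

CURRENT_SCHEMA_VERSION = 2  # hypothetical: version 1 used a top-level "crs" key


def migrate_raw_config(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of raw with legacy keys remapped to the current layout."""
    data = deepcopy(raw)
    version = data.get("schema_version", 1)  # files without the key count as version 1
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"schema_version {version} is newer than this tool supports")
    if version < 2 and "crs" in data:
        legacy_crs = data.pop("crs")
        spatial = data.get("spatial") or {}
        # An explicit spatial.src_crs takes priority over the legacy key.
        spatial.setdefault("src_crs", legacy_crs)
        data["spatial"] = spatial
    data["schema_version"] = CURRENT_SCHEMA_VERSION
    return data


legacy = {"workspace": "/data/rasters", "input_glob": "**/*.tif", "crs": "EPSG:4326"}
migrated = migrate_raw_config(legacy)
print(migrated)
```

Because the function copies its input, the original dict stays available for logging what the operator actually wrote alongside what the tool resolved.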
Related
- Configuration File Management: the parent guide covering TOML vs YAML, schema evolution, and pydantic-settings environment-variable precedence
- Managing YAML configs for geospatial CLI workflows: you are here
- Click vs Typer for Geospatial Workflows: choosing the right CLI framework before wiring in a config loader
- Handling Missing Dependencies Gracefully in Click Apps: how to guard optional GDAL/rasterio imports that your config may activate