Geospatial
Core helpers
Caching & XDG Configuration
HEALPix grids are expensive to compute. This module supports caching boundaries in parquet files using XDG Base Directory standards and a persistent configuration.
Precedence for directory resolution (highest to lowest):
1. CLI argument (e.g., --cache-dir /tmp)
2. Environment variable (e.g., HEALPYXEL_CACHE=/fast/disk)
3. XDG spec: $XDG_CACHE_HOME or $XDG_CONFIG_HOME
4. Fallback: ~/.cache/healpyxel/healpix_grids or ~/.config/healpyxel

Configuration file: $XDG_CONFIG_HOME/healpyxel/settings.ini (or ~/.config/healpyxel/settings.ini)
- Controls precomputed nsides, antimeridian handling, cache location override
- Auto-created on first use; can be edited manually
init_user_config
init_user_config (config_dir:Optional[pathlib.Path]=None)
*Create default ~/.config/healpyxel/settings.ini if it doesn’t exist.
Args: config_dir: optional override for config directory
Returns: Path to config file (whether newly created or already existed)
Example:
>>> config_file = init_user_config()
>>> config_file.exists()
True*
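As a rough illustration of the "create if missing, then read" behaviour described above, the following sketch writes a default settings.ini and parses it back. The function names `init_config_sketch` and `read_settings` are hypothetical stand-ins, not the library's implementation; the keys and default values are taken from the settings.ini listing later on this page.

```python
import configparser
import tempfile
from pathlib import Path

# Defaults mirror the settings.ini keys documented on this page.
DEFAULTS = {
    "cache": {"cache_dir": "auto", "precomputed_nsides": "32,64,128,256"},
    "general": {"fix_antimeridian": "true", "antimeridian_tolerance": "1.0"},
}


def init_config_sketch(config_dir: Path) -> Path:
    """Hypothetical stand-in for init_user_config: create settings.ini only once."""
    config_file = config_dir / "settings.ini"
    if not config_file.exists():
        config_dir.mkdir(parents=True, exist_ok=True)
        parser = configparser.ConfigParser()
        parser.read_dict(DEFAULTS)
        with config_file.open("w") as fh:
            parser.write(fh)
    return config_file


def read_settings(config_file: Path) -> dict:
    """Parse the ini file into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read(config_file)
    nsides = parser.get("cache", "precomputed_nsides")
    return {
        "cache_dir": parser.get("cache", "cache_dir"),
        "precomputed_nsides": [int(n) for n in nsides.split(",")],
        "fix_antimeridian": parser.getboolean("general", "fix_antimeridian"),
        "antimeridian_tolerance": parser.getfloat("general", "antimeridian_tolerance"),
    }


# Use a temporary directory so the sketch never touches the real ~/.config.
with tempfile.TemporaryDirectory() as tmp:
    path = init_config_sketch(Path(tmp) / "healpyxel")
    settings = read_settings(path)
```

Calling `init_config_sketch` a second time on the same directory returns the existing file unchanged, which matches the documented "whether newly created or already existed" return behaviour.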
Cache Management Core Logic
Central dispatch for all cache operations: generate, list, clean, view configuration. Called by thin CLI wrapper in 05_cli.ipynb with no Click dependencies.
manage_healpix_cache
manage_healpix_cache (action:str='list', nsides:Optional[List[int]]=None, cache_dir:Optional[pathlib.Path]=None, config_dir:Optional[pathlib.Path]=None, force:bool=False)
*Core cache management logic with precedence awareness.
No Click dependencies; called by CLI wrapper in 05_cli.ipynb. Uses _get_cache_dir() and _get_config_dir() for proper precedence.
Args:
- action: ‘list’, ‘generate’, ‘verify’, ‘clean’, ‘info’, or ‘config’
- nsides: list of nside values for ‘generate’ or ‘verify’ actions
- cache_dir: explicit CLI override (highest precedence)
- config_dir: explicit CLI override (highest precedence)
- force: whether to overwrite existing cache files during ‘generate’

Returns: dict with keys:
- ‘action’: str, action performed
- ‘cache_dir’: str, resolved cache directory
- ‘config_dir’: str, resolved config directory
- ‘status’: ‘ok’ or ‘error’
- ‘count’/‘files’/‘deleted’/‘generated’/etc: action-specific data
Raises: ValueError for invalid action or missing required args*
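The argument validation implied by the docstring can be sketched as follows. This is a minimal illustration, not the actual function: `validate_cache_request` is a hypothetical name, and the rule that ‘generate’ and ‘verify’ require a non-empty `nsides` list is an assumption drawn from the "missing required args" wording above.

```python
from typing import List, Optional

# Actions listed in the docstring above.
VALID_ACTIONS = {"list", "generate", "verify", "clean", "info", "config"}
# Assumption: these two actions are the ones that need nside values.
NEEDS_NSIDES = {"generate", "verify"}


def validate_cache_request(action: str = "list",
                           nsides: Optional[List[int]] = None) -> dict:
    """Hypothetical pre-dispatch check returning a result-style dict."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"Invalid action: {action!r}")
    if action in NEEDS_NSIDES and not nsides:
        raise ValueError(f"Action {action!r} requires at least one nside")
    return {"action": action, "status": "ok", "nsides": list(nsides or [])}


result = validate_cache_request("generate", [256, 512])
```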
Caching Tests
Verify XDG precedence logic and cache I/O roundtrip.
test_cache_verification_corrupt_nans
test_cache_verification_corrupt_nans ()
Test cache verification with NaN values in coordinates.
test_cache_verification_incomplete
test_cache_verification_incomplete ()
Test cache verification with incomplete cache (missing pixels).
test_cache_verification_missing
test_cache_verification_missing ()
Test cache verification with missing cache file.
test_cache_verification_complete
test_cache_verification_complete ()
Test cache verification with a complete valid cache.
test_cache_mode_require_missing_cache
test_cache_mode_require_missing_cache ()
Verify that cache_mode=‘require’ raises ValueError when cache is missing.
test_spherical_conversion
test_spherical_conversion ()
Verify spherical to lon/lat conversion.
test_cache_key_generation
test_cache_key_generation ()
Verify cache key generation.
test_xdg_precedence
test_xdg_precedence ()
Verify XDG directory resolution with full precedence.
HEALPix Grid Caching System
This module provides a robust caching system for HEALPix cell geometries to accelerate repeated conversions.
Key Features
- XDG Base Directory Compliant: Follows freedesktop.org standards for cross-platform cache storage
- Explicit Precedence: Clear resolution order: CLI arg > env var > config file > XDG defaults
- Spherical Coordinate Storage: Caches (theta, phi) radians in parquet for format-agnostic reuse
- Smart Subsetting: For sparse aggregates, loads only the required pixels from cache
- Strict Cache Modes: Prevents accidental full-grid computation with explicit cache policies
Cache Modes
The cache_mode parameter provides strict control over caching behavior:
| Mode | Behavior | Use Case | Safety Guarantee |
|---|---|---|---|
| use | Opportunistic: load cache if available, compute missing pixels on demand | Development, interactive analysis | ⚠️ May trigger expensive computation if cache incomplete |
| require | Strict: fail immediately if cache missing or incomplete | CI/CD pipelines, production ETL | ✅ Never computes full grid silently |
| off | Ignore cache entirely, always compute from scratch | Testing, benchmarking, one-off analysis | ⚠️ Always pays full computation cost |
Production Recommendation: Use cache_mode='require' in automated pipelines to prevent accidental multi-hour computations when cache is stale or missing.
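The three policies in the table can be read as a small decision function. The sketch below is illustrative only: `plan_cache_action` is a hypothetical helper, and the error message is modelled on the one shown in the Troubleshooting section rather than copied from the implementation.

```python
def plan_cache_action(cache_mode: str, cache_complete: bool) -> str:
    """Return 'load' or 'compute' for a given policy, or raise for 'require'.

    Hypothetical helper: it only models the decision table above.
    """
    if cache_mode not in {"use", "require", "off"}:
        raise ValueError(f"Unknown cache_mode: {cache_mode!r}")
    if cache_mode == "off":
        return "compute"  # cache is ignored entirely
    if cache_complete:
        return "load"  # both 'use' and 'require' prefer the cache
    if cache_mode == "require":
        # Strict mode: never fall back to computation silently.
        raise ValueError("Cache required but not found")
    return "compute"  # 'use' mode: compute what is missing on demand
```

The point of the strict branch is that a missing cache surfaces as an error at the start of a pipeline run instead of as an unexpectedly long job.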
Usage Examples:
# Development: opportunistic cache use (default)
healpyxel_to_geoparquet -a data.parquet
# Production: fail-fast if cache missing (recommended for CI/CD)
healpyxel_to_geoparquet -a data.parquet --cache-mode require
# Ignore cache entirely
healpyxel_to_geoparquet -a data.parquet --cache-mode off
Production Pipeline Workflow
For reliable CI/CD and production ETL, follow this pattern:
# 1. Generate cache for required nsides (run once or in setup stage)
healpyxel-cache --generate 256 --generate 512 --generate 1024
# 2. Verify cache integrity (recommended in CI)
healpyxel-cache --verify 256 --verify 512 --verify 1024
# 3. Process data with strict cache policy (fail if cache missing)
healpyxel_to_geoparquet -a batch_001.parquet --cache-mode require -n 256
healpyxel_to_geoparquet -a batch_002.parquet --cache-mode require -n 512
healpyxel_to_geoparquet -a batch_003.parquet --cache-mode require -n 1024
# 4. Cache becomes stale? Regenerate with --force
healpyxel-cache --generate 256 --force
Why This Matters:
- cache_mode='use' (default) will silently compute 50M+ boundaries if cache is missing at nside=2048
- A sparse aggregate with 10 pixels at nside=2048 + missing cache = catastrophic performance regression
- cache_mode='require' makes this an explicit error instead of a silent 3-hour job
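The "50M+" figure follows from the standard HEALPix pixel count, npix = 12 · nside². A quick check of the arithmetic, using a helper name (`npix_for`) chosen only for this example:

```python
def npix_for(nside: int) -> int:
    """Total number of HEALPix pixels at a given nside (12 * nside^2)."""
    return 12 * nside * nside


full_grid = npix_for(2048)         # size of the full grid at nside=2048
sparse_request = 10                # pixels actually needed by the sparse aggregate
fraction_needed = sparse_request / full_grid

print(f"full grid: {full_grid:,} pixels")
print(f"fraction needed: {fraction_needed:.2e}")
```

At nside=2048 the full grid has 50,331,648 pixels, so a 10-pixel aggregate needs only a tiny fraction of the boundaries that a full-grid computation would produce.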
Directory Resolution Precedence
Both cache and config use XDG Base Directory Specification with explicit precedence:
| Rank | Method | Example | Scope |
|---|---|---|---|
| 1 | CLI argument | --cache-dir /tmp | This command only |
| 2 | Environment variable | HEALPYXEL_CACHE=/mnt/ssd | This shell session |
| 3 | Config file | ~/.config/healpyxel/settings.ini | Persistent (all sessions) |
| 4 | XDG env var | $XDG_CACHE_HOME or $XDG_CONFIG_HOME | System-wide (multi-user systems) |
| 5 | XDG defaults | ~/.cache or ~/.config | Fallback (POSIX standard) |
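The five ranks in the table can be sketched as a single resolution function for the cache directory. This is an illustration under stated assumptions, not the module's `_get_cache_dir()`: the name `resolve_cache_dir`, its parameters, and the treatment of the config value 'auto' as "defer to XDG" are modelled on this page's description.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(cli_arg: Optional[str],
                      env: Mapping[str, str],
                      config_value: str = "auto",
                      home: Optional[Path] = None) -> Path:
    """Hypothetical resolver following ranks 1 to 5 of the table above."""
    if cli_arg:                                   # 1. CLI argument
        return Path(cli_arg)
    if env.get("HEALPYXEL_CACHE"):                # 2. environment variable
        return Path(env["HEALPYXEL_CACHE"])
    if config_value and config_value != "auto":   # 3. config file override
        return Path(config_value)
    xdg_root = env.get("XDG_CACHE_HOME")          # 4. XDG env var
    if xdg_root:
        root = Path(xdg_root)
    else:                                         # 5. XDG default
        root = (home or Path.home()) / ".cache"
    return root / "healpyxel" / "healpix_grids"


# Passing env and home explicitly keeps the function easy to test.
default_dir = resolve_cache_dir(None, {}, home=Path("/home/user"))
```

Taking the environment as a parameter rather than reading `os.environ` directly is a design choice for testability; a real caller would pass `os.environ`.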
Effective paths:
# Default (nothing configured)
Cache: $HOME/.cache/healpyxel/healpix_grids
Config: $HOME/.config/healpyxel
# With XDG_CACHE_HOME set
Cache: $XDG_CACHE_HOME/healpyxel/healpix_grids
# With HEALPYXEL_CACHE env var (overrides XDG)
Cache: $HEALPYXEL_CACHE
# CLI arg (overrides everything)
healpyxel-cache --cache-dir /custom/path --list
Configuration File
Location: $XDG_CONFIG_HOME/healpyxel/settings.ini (or ~/.config/healpyxel/settings.ini)
Auto-created on first use. Edit manually to customize:
# ~/.config/healpyxel/settings.ini
[cache]
# Cache directory for HEALPix grids (parquet files with spherical coordinates)
# Special value 'auto' means use XDG resolution
cache_dir = auto
# Precomputed nsides (comma-separated) to generate/cache automatically
precomputed_nsides = 32,64,128,256
[general]
# Whether to fix antimeridian-crossing polygons during boundary computation
fix_antimeridian = true
# Tolerance in degrees for antimeridian detection (advanced)
antimeridian_tolerance = 1.0
Environment Variables
| Variable | Purpose | Example |
|---|---|---|
| HEALPYXEL_CACHE | Cache directory (session override) | export HEALPYXEL_CACHE=/fast/disk |
| HEALPYXEL_CONFIG | Config directory (session override) | export HEALPYXEL_CONFIG=~/.healpyxel_alt |
| XDG_CACHE_HOME | XDG cache root (system-wide) | Standard: leave unset (defaults to ~/.cache) |
| XDG_CONFIG_HOME | XDG config root (system-wide) | Standard: leave unset (defaults to ~/.config) |
Cache Management Commands
List cached grids:
healpyxel-cache --list
# Output:
# Cached grids (3):
# nside_032_nest_spherical.parquet 12288 cells (3.2 MB)
# nside_256_nest_spherical.parquet 786432 cells (25.6 MB)
# nside_512_nest_spherical.parquet 3145728 cells (102.4 MB)
Generate cache for specific nsides:
healpyxel-cache --generate 32 --generate 256 --generate 512
# Computes and caches all three at once
Verify cache integrity (recommended for CI):
healpyxel-cache --verify 256 --verify 512
# Checks:
# ✓ All expected pixels present (no missing cells)
# ✓ No NaN values in coordinate columns
# ✓ Correct schema (theta_0...3, phi_0...3, healpix_id)
# ✓ healpix_id values in valid range [0, npix)
# Returns non-zero exit code if any check fails
Show configuration and precedence:
healpyxel-cache --config
# Output:
# Config file: /home/user/.config/healpyxel/settings.ini
# Exists: true
#
# Current Settings:
# cache_dir: auto (XDG)
# precomputed_nsides: [32, 64, 128, 256]
# fix_antimeridian: true
# antimeridian_tolerance: 1.0
#
# Precedence Resolution:
# cache_dir_resolved: /home/user/.cache/healpyxel/healpix_grids
Clean cache (remove all files):
healpyxel-cache --clean
# WARNING: Deletes all cached grids. Use with caution!
Troubleshooting
Cache not found:
ValueError: Cache required but not found: nside_256_nest_spherical.parquet
→ Solution: Generate cache first: healpyxel-cache --generate 256
Performance regression with sparse aggregates:
# Sparse aggregate with 10 pixels at nside=2048
# Takes 3 hours instead of 10 seconds
→ Root cause: Missing cache forces full 50M pixel computation
→ Solution: Use cache_mode='require' to fail fast, then generate cache
Cache verification failed:
healpyxel-cache --verify 256
# ERROR: Expected 786432 pixels, found 786000 (432 missing)
→ Solution: Regenerate: healpyxel-cache --generate 256 --force
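To make the verification checks concrete, here is a small pure-Python sketch that applies them to rows represented as dictionaries. It is not the `--verify` implementation: `verify_rows` and the row representation are hypothetical, and only the column names (theta_0...3, phi_0...3, healpix_id) and the four checks come from this page.

```python
import math
from typing import Dict, List

# Column names follow the schema listed under "Verify cache integrity".
COORD_COLUMNS = [f"{name}_{i}" for name in ("theta", "phi") for i in range(4)]
EXPECTED_COLUMNS = {"healpix_id", *COORD_COLUMNS}


def verify_rows(nside: int, rows: List[Dict[str, float]]) -> List[str]:
    """Return a list of problems found; an empty list means the cache looks valid."""
    npix = 12 * nside * nside
    if any(set(row) != EXPECTED_COLUMNS for row in rows):
        return ["schema mismatch"]  # remaining checks need the full schema
    problems = []
    ids = {row["healpix_id"] for row in rows}
    in_range = {i for i in ids if 0 <= i < npix}
    if len(in_range) != len(ids):
        problems.append("healpix_id outside [0, npix)")
    missing = npix - len(in_range)
    if missing:
        problems.append(f"expected {npix} pixels, {missing} missing")
    if any(math.isnan(row[c]) for row in rows for c in COORD_COLUMNS):
        problems.append("NaN in coordinate columns")
    return problems


# Tiny synthetic cache at nside=1 (12 pixels) with placeholder coordinates.
complete = [{"healpix_id": i, **{c: 0.5 for c in COORD_COLUMNS}} for i in range(12)]
incomplete = complete[:-1]
```

A real check would read the parquet file and operate on columns rather than Python dictionaries; the row-based form is used here only to keep the example dependency-free.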
Wrong cache directory:
healpyxel-cache --list
# Shows 0 files but you know cache exists
→ Diagnosis: Check precedence: healpyxel-cache --config
→ Solution: Set HEALPYXEL_CACHE env var or use --cache-dir explicitly
Polygon creation and antimeridian handling
Main API: build GeoDataFrame and save geoparquet
healpix_to_geodataframe
healpix_to_geodataframe (nside:int, order:str='nested', lon_convention:str='0_360', pixels:Optional[Iterable[int]]=None, fix_antimeridian:bool=True, chunk_size:int=65536, cache_mode:str='use', cache_dir:Optional[pathlib.Path]=None)
*Create a GeoDataFrame of HEALPix cell polygons.
Args:
- nside: HEALPix nside
- order: ‘nested’ or ‘ring’
- lon_convention: ‘0_360’ or ‘-180_180’ (affects polygon coordinates)
- pixels: optional iterable of pixel indices; default = all pixels
- fix_antimeridian: whether to call antimeridian.fix_polygon on polygons crossing the antimeridian
- chunk_size: number of pixels to process per chunk for memory control
- cache_mode: one of {‘use’,‘require’,‘off’}
  - ‘use’: load cache if available, otherwise compute requested pixels only
  - ‘require’: require cache; if missing, raise error (no computation)
  - ‘off’: ignore cache entirely
- cache_dir: optional cache directory override
Returns: GeoDataFrame with columns: ‘healpix_id’ and ‘geometry’ (EPSG:4326)*
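Because the cache stores (theta, phi) in radians while the output polygons are in degrees under a chosen `lon_convention`, a conversion step sits between the two. The sketch below assumes the usual HEALPix convention (theta is colatitude measured from the north pole, phi is longitude); `spherical_to_lonlat` is a name chosen for this example and may differ from the module's helper.

```python
import math
from typing import Tuple


def spherical_to_lonlat(theta: float, phi: float,
                        lon_convention: str = "0_360") -> Tuple[float, float]:
    """Convert colatitude/longitude in radians to (lon, lat) in degrees."""
    if lon_convention not in {"0_360", "-180_180"}:
        raise ValueError(f"Unknown lon_convention: {lon_convention!r}")
    lat = 90.0 - math.degrees(theta)    # colatitude -> latitude
    lon = math.degrees(phi) % 360.0     # normalise to [0, 360)
    if lon_convention == "-180_180" and lon > 180.0:
        lon -= 360.0                    # shift the western hemisphere to negative values
    return lon, lat


lon_a, lat_a = spherical_to_lonlat(math.pi / 2, 3 * math.pi / 2)
lon_b, lat_b = spherical_to_lonlat(math.pi / 2, 3 * math.pi / 2, "-180_180")
```

The same cached point on the equator comes out at longitude 270 under ‘0_360’ and at -90 under ‘-180_180’, which is why the convention must match the one used during aggregation.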
save_healpix_to_geoparquet
save_healpix_to_geoparquet (nside:int, output_path:Union[str,pathlib.Path], order:str='nested', lon_convention:str='0_360', fix_antimeridian:bool=True, chunk_size:int=65536, parquet_kwargs:Optional[dict]=None)
*Build HEALPix vector layer and save as GeoParquet. This will create a GeoParquet file containing one polygon per HEALPix cell. For large nsides consider increasing memory or using chunked processing.
Args:
- nside: HEALPix nside
- output_path: path to output geoparquet file
- order: ‘nested’ or ‘ring’
- lon_convention: ‘0_360’ or ‘-180_180’
- fix_antimeridian: whether to fix antimeridian-wrapping
- chunk_size: pixels per chunk when building geometries
- parquet_kwargs: forwarded to GeoDataFrame.to_parquet

Returns: Path to written file*
export_healpix_to_geotiff
export_healpix_to_geotiff (df:pandas.core.frame.DataFrame, column:str, output_path:Union[str,pathlib.Path], nside:int, order:str='nested', crs:str='IAU:19900', width:int=1440, height:int=720)
*Export a HEALPix column to GeoTIFF (requires rasterio + healpy).
Args:
- df: DataFrame with healpix_id index or healpix_id column
- column: data column to export
- output_path: GeoTIFF output path
- nside: HEALPix nside
- order: ‘nested’ or ‘ring’
- crs: CRS string for GeoTIFF
- width: output raster width (pixels)
- height: output raster height (pixels)
Returns: Path to written GeoTIFF*
| | Type | Default | Details |
|---|---|---|---|
| df | DataFrame | | |
| column | str | | |
| output_path | Union | | |
| nside | int | | |
| order | str | nested | |
| crs | str | IAU:19900 | Mercury IAU CRS |
| width | int | 1440 | |
| height | int | 720 | |
| Returns | Path | | |
Quick test
CLI with Metadata Auto-Detection
The CLI now supports intelligent parameter inference from metadata sidecars:
Metadata Sidecar Pattern:
- For aggregate sample_50k_nside256_aggregate.parquet, place metadata at sample_50k_nside256_aggregate.meta.json
- The CLI automatically loads and extracts: nside, order, lon_convention

Parameter Resolution Precedence:
1. CLI args (highest priority): explicit user override
2. Metadata: from .meta.json sidecar (if present)
3. Defaults: fallback values or inference from aggregate

lon_convention Behavior:
- --lon-convention auto (default) → searches metadata, falls back to 0_360
- --lon-convention 0_360 or -180_180 → explicit override
- Prevents user confusion about which convention was used in aggregation
Usage Examples:
# Zero-config: metadata has all parameters
healpyxel_to_geoparquet -a sample_50k_nside256_aggregate.parquet
# Override metadata
healpyxel_to_geoparquet -a sample_50k_nside256_aggregate.parquet -l -180_180 -O ring
# Batch mode with metadata
healpyxel_to_geoparquet -a data.parquet -y # Auto-confirm overwrites
main
main ()
*CLI entry point for healpyxel_to_geoparquet.
Converts aggregate parquet output with HEALPix geometry to GeoParquet. Automatically infers nside from aggregate row count (dense mode) or filename (sparse mode). Output filename is constructed as: {input_stem}{suffix}.parquet Default suffix is ‘.geo’ so ‘sample_50k_nside256_aggregate.parquet’ → ‘sample_50k_nside256_aggregate.geo.parquet’*
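The output naming rule and the dense-mode nside inference can both be written in a few lines. This is a sketch under assumptions: `output_path_for` and `infer_nside_dense` are hypothetical names, and the dense-mode rule assumes a dense aggregate has exactly one row per pixel, so that row count = 12 · nside².

```python
import math
from pathlib import Path
from typing import Optional


def output_path_for(input_path: str, suffix: str = ".geo") -> Path:
    """Build {input_stem}{suffix}.parquet next to the input file."""
    p = Path(input_path)
    return p.with_name(f"{p.stem}{suffix}.parquet")


def infer_nside_dense(row_count: int) -> Optional[int]:
    """Return nside if row_count equals 12 * nside^2, otherwise None."""
    nside = math.isqrt(row_count // 12)
    if nside > 0 and 12 * nside * nside == row_count:
        return nside
    return None  # not a full grid: sparse aggregates need -n or metadata


out = output_path_for("sample_50k_nside256_aggregate.parquet")
nside = infer_nside_dense(786432)
```

A sparse aggregate with, say, 10 rows yields `None` here, which is the situation where the CLI falls back to the filename, the `-n` option, or the metadata sidecar.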
Comparison: Old vs. New UX
| Scenario | Old | New |
|---|---|---|
| With metadata sidecar | healpyxel_to_geoparquet -a data.parquet -l 0_360 -O nested | healpyxel_to_geoparquet -a data.parquet ✓ Zero-config |
| Sparse aggregate | Must pass -n 256 explicitly | Can pass -n 256 OR use metadata |
| Different lon convention | Defaults to 0_360, must override | Auto-detects from metadata |
| Error on parameter mismatch | No validation (risk of wrong geometry) | Metadata enforces consistency |
Key Benefits:
- ✅ Reduced UX friction: One argument instead of 3–4
- ✅ Consistency: Geometry respects aggregation parameters from metadata
- ✅ Backward compatible: All explicit args still work and override metadata
- ✅ Safe defaults: -180_180 lon convention now automatically used if that’s what data was processed with
Implementation: Why This Approach Wins
Architecture Decision: Metadata Sidecar Pattern
Three approaches were considered; here’s why option 2 (metadata sidecar) was chosen:
| Approach | Trade-offs | Winner? |
|---|---|---|
| Option 1: Auto mode for lon_convention | Only solves one param; nside/order still require explicit args | ❌ Partial solution |
| Option 2: Pass metadata directly | Higher UX friction if the user must know the metadata path; auto-discovery of the sidecar next to the aggregate removes it | ✅ Best (with auto-discovery) |
| Option 3: Flexible input (parquet OR metadata) | Complex parsing logic; confusing precedence | ❌ Overengineered |
Why We Chose Option 2 (Enhanced):
- Metadata .meta.json files are already generated alongside aggregates by the pipeline → zero user effort to provide it
- Single metadata file contains all context: nside, order, lon_convention, timestamps, processing params
- Sidecar metadata files are a widely used pattern (e.g., STAC item JSON files stored next to their assets, .meta files in scientific tools)
- Auto-discovery: User only needs to pass aggregate path; CLI looks for {aggregate_stem}.meta.json
- Backward compatible: Explicit CLI args still override when needed (e.g., testing with different parameters)

Why This Beats Manual Overrides:
- Old way: healpyxel_to_geoparquet -a data.parquet -n 256 -O nested -l 0_360 (remember 4 params)
- New way: healpyxel_to_geoparquet -a data.parquet (metadata does the work)
- Problem solved: User can’t accidentally build geometries with wrong lon_convention → no more coordinate mismatches
Summary: Metadata Auto-Detection Workflow
Problem: How should the CLI handle --lon-convention, given that the value is already stored in metadata?
Solution: Metadata sidecar auto-detection with explicit parameter precedence.
What Changed
New Behavior:
1. CLI automatically discovers {aggregate_stem}.meta.json in the same directory
2. Extracts nside, order, lon_convention from metadata keys:
   - ["sidecar_metadata"]["healpix"]["nside"]
   - ["sidecar_metadata"]["healpix"]["order"]
   - ["sidecar_metadata"]["coordinates"]["lon_convention"]
3. Default for --lon-convention changed from '0_360' to 'auto':
   - 'auto' → search metadata, fallback to '0_360' if not found
   - '0_360' or '-180_180' → explicit override (ignores metadata)
Parameter Precedence (highest to lowest):
CLI args > metadata > defaults
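For `lon_convention`, the precedence chain reduces to a short function. The sketch below is a hedged illustration: `resolve_lon_convention` is a hypothetical name, and returning the source alongside the value is only meant to mirror the "which source was used" logging described on this page.

```python
from typing import Optional, Tuple


def resolve_lon_convention(cli_value: str = "auto",
                           metadata_value: Optional[str] = None,
                           default: str = "0_360") -> Tuple[str, str]:
    """Return (value, source) following CLI args > metadata > defaults."""
    if cli_value != "auto":
        return cli_value, "cli"            # explicit override ignores metadata
    if metadata_value is not None:
        return metadata_value, "metadata"  # 'auto' found a sidecar value
    return default, "default"              # 'auto' with no metadata


value, source = resolve_lon_convention("auto", "-180_180")
print(f"Using lon_convention={value} from {source}")
```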
Code Changes
Two new helper functions:
- _load_metadata_for_aggregate(agg_path) → loads .meta.json sidecar (quiet fail if missing)
- _extract_healpix_params_from_metadata(metadata) → extracts nside, order, lon_convention

Updated main() CLI:
- Option --lon-convention now accepts ['0_360', '-180_180', 'auto']
- Error message improved for sparse aggregates (mentions metadata option)
- Logs which source was used: “Using lon_convention=0_360 from metadata” or “Using default…”
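A possible shape for the two helpers is sketched below. These are not the module's private functions: `load_sidecar` and `extract_params` are illustrative names, the "quiet fail" is modelled as returning `None`, and the metadata keys are the ones listed under "What Changed".

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_sidecar(agg_path: Path) -> Optional[dict]:
    """Load {aggregate_stem}.meta.json from the aggregate's directory, or None."""
    meta_path = agg_path.with_name(f"{agg_path.stem}.meta.json")
    try:
        return json.loads(meta_path.read_text())
    except (OSError, json.JSONDecodeError):
        return None  # quiet fail: missing or unreadable sidecar


def extract_params(metadata: Optional[dict]) -> dict:
    """Pull nside, order and lon_convention out of the sidecar structure."""
    sidecar = (metadata or {}).get("sidecar_metadata", {})
    healpix = sidecar.get("healpix", {})
    coords = sidecar.get("coordinates", {})
    return {
        "nside": healpix.get("nside"),
        "order": healpix.get("order"),
        "lon_convention": coords.get("lon_convention"),
    }


# Demonstrate with a throwaway sidecar written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    agg = Path(tmp) / "sample_50k_nside256_aggregate.parquet"
    sidecar_file = agg.with_name("sample_50k_nside256_aggregate.meta.json")
    sidecar_file.write_text(json.dumps({
        "sidecar_metadata": {
            "healpix": {"nside": 256, "order": "nested"},
            "coordinates": {"lon_convention": "0_360"},
        }
    }))
    found = extract_params(load_sidecar(agg))
    not_found = extract_params(load_sidecar(Path(tmp) / "other.parquet"))
```

Returning `None` values instead of raising keeps the precedence logic simple: any parameter that comes back as `None` falls through to the default or to inference from the aggregate.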
Usage
Zero-config (best case):
healpyxel_to_geoparquet -a sample_50k_nside256_aggregate.parquet
# Auto-detects: nside, order, lon_convention from metadata
Override metadata (for testing/validation):
healpyxel_to_geoparquet -a data.parquet -l -180_180 -n 256
# -l and -n override metadata; order still comes from metadata
Batch mode with metadata:
healpyxel_to_geoparquet -a data.parquet -y
# -y auto-confirms overwrites, metadata provides all params
Testing ✓
- Metadata extraction logic verified
- Precedence (CLI > metadata > defaults) tested
- Helper functions properly exported for nbdev