HEALPix Sidecar

Generate HEALPix cell assignments for spatial data

HEALPix Sidecar: PSF Weighting Extensions

This section introduces support for data point spread functions (PSFs) and cell spread functions (CSFs) in the sidecar generation process.

  • Data PSF: Models the spatial response of each data geometry (e.g., a 2D Gaussian).
  • Cell PSF: Models the spatial response of each HEALPix cell (e.g., a 2D Gaussian centered on the cell).
  • Combination: The final weight for each (source, cell) assignment is computed by combining the two (default: multiplication).
  • Normalization: Weights are normalized per cell so that their sum is 1, preserving compatibility with unweighted aggregation.

The implementation is modular and ready for future extension to custom/user-provided PSFs.
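The combination and normalization steps above can be sketched with pandas. This is a toy illustration only: the `w_data`/`w_cell` columns and their values are made up for the example, while `source_id`, `healpix_id`, and `weight` follow the sidecar column names documented below.

```python
import pandas as pd

# Toy assignments: two sources, each overlapping the same two cells.
# The PSF responses below are made-up values for illustration only.
df = pd.DataFrame({
    "source_id":  [0, 0, 1, 1],
    "healpix_id": [10, 11, 10, 11],
    "w_data": [1.0, 0.4, 0.7, 0.9],  # data-PSF response per assignment
    "w_cell": [0.8, 0.6, 0.5, 1.0],  # cell-PSF response per assignment
})

# Combination (default): multiply the two responses
df["weight"] = df["w_data"] * df["w_cell"]

# Normalization: weights within each cell sum to 1, so an unweighted
# aggregation over a cell keeps the same total
df["weight"] /= df.groupby("healpix_id")["weight"].transform("sum")
print(df.groupby("healpix_id")["weight"].sum().to_dict())
```

Because each cell's weights sum to 1 after normalization, downstream per-cell aggregations behave the same as in the unweighted case.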


compute_healpix_ids_from_lonlat

 compute_healpix_ids_from_lonlat (nside:int, lons:numpy.ndarray,
                                  lats:numpy.ndarray)

*Compute HEALPix indices for arrays of lon/lat in degrees.

Tries to use cdshealpix if available, otherwise falls back to healpy. Returns a 1D integer numpy array of the same length as the inputs.*
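For reference, healpy's `ang2pix` expects co-latitude θ and longitude φ in radians rather than lon/lat in degrees. The sketch below shows that conversion with plain numpy (the exact internals of this function are not shown here; the nested ordering is an assumption, consistent with `nest=True` used elsewhere in this module):

```python
import numpy as np

def lonlat_to_thetaphi(lons, lats):
    # healpy's ang2pix expects co-latitude theta in [0, pi] and
    # longitude phi in [0, 2*pi), both in radians
    theta = np.radians(90.0 - np.asarray(lats, dtype=np.float64))
    phi = np.radians(np.asarray(lons, dtype=np.float64) % 360.0)
    return theta, phi

theta, phi = lonlat_to_thetaphi([-180.0, 0.0, 90.0], [-90.0, 0.0, 45.0])
print(theta, phi)
# With healpy installed, the indices would then be:
#   import healpy as hp
#   ids = hp.ang2pix(nside, theta, phi, nest=True)
```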


format_geo_statistics

 format_geo_statistics (stats:dict)

*Format geo-statistics for display using rich tables.

Args:
  stats: Statistics dictionary from compute_geo_statistics

Returns: Formatted string representation*


compute_geo_statistics

 compute_geo_statistics (input_path:pathlib.Path, lon_col:str|None=None,
                         lat_col:str|None=None, sample_size:int=10000,
                         lon_convention:str|None=None)

*Compute geographical statistics for a GeoParquet file using DuckDB for efficiency.

This function can analyze raw data or apply filtering based on longitude convention.

Args:
  input_path: Path to the input GeoParquet file
  lon_col: Name of the longitude column (if None, auto-detect or extract from geometry)
  lat_col: Name of the latitude column (if None, auto-detect or extract from geometry)
  sample_size: Number of rows to sample for geometry-based extraction (if needed)
  lon_convention: Optional longitude convention for filtering:
    '0_360' for [0, 360) × [-90, 90]
    'minus_plus180' for [-180, 180) × [-90, 90]
    None (default) for no filtering (raw data)

Returns: Dictionary with statistics:
  {
    'lon': {'min', 'max', 'mean', 'std', 'count'},
    'lat': {'min', 'max', 'mean', 'std', 'count'},
    'source': 'columns' or 'geometry',
    'lon_col': column name or None,
    'lat_col': column name or None,
    'filtered': bool (True if convention filtering was applied),
    'total_count': int (total records before filtering, if filtered),
    'filtered_count': int (records after filtering, if filtered)
  }*
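The per-axis entries of the returned dictionary can be reproduced in miniature with plain numpy on synthetic data. This is a sketch of the statistics only, not of the DuckDB-backed implementation, and the `axis_stats` helper is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
lons = rng.uniform(-180, 180, 1000)
lats = rng.uniform(-90, 90, 1000)

def axis_stats(values):
    # Same fields as the 'lon'/'lat' entries documented above
    return {
        "min": float(values.min()),
        "max": float(values.max()),
        "mean": float(values.mean()),
        "std": float(values.std()),
        "count": int(values.size),
    }

stats = {
    "lon": axis_stats(lons),
    "lat": axis_stats(lats),
    "source": "columns",   # or "geometry" when extracted from geometries
    "lon_col": "lon",
    "lat_col": "lat",
    "filtered": False,
}
print(stats["lon"]["count"])  # 1000
```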


detect_lonlat_columns

 detect_lonlat_columns (gdf_sample)

*Auto-detect longitude and latitude columns from a GeoDataFrame sample.

Returns: Tuple of (lon_column, lat_column) or (None, None) if not found*
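Detection by common naming patterns can be sketched as follows. The name sets and the `detect_lonlat` helper are illustrative assumptions; the actual patterns used by `detect_lonlat_columns` may differ:

```python
# Illustrative name sets; the real detector's patterns may differ
LON_NAMES = {"lon", "long", "longitude", "x", "spot_lon"}
LAT_NAMES = {"lat", "latitude", "y", "spot_lat"}

def detect_lonlat(columns):
    # Return (lon_column, lat_column), or (None, None) if either is missing
    lon = next((c for c in columns if c.lower() in LON_NAMES), None)
    lat = next((c for c in columns if c.lower() in LAT_NAMES), None)
    return (lon, lat) if lon and lat else (None, None)

print(detect_lonlat(["spot_lon", "spot_lat", "reflectance"]))
# -> ('spot_lon', 'spot_lat')
```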


process_partition

 process_partition (gdf, nside:int, mode:str, base_index:int|None=None,
                    lon_convention:str='minus_plus180',
                    lon_col:str|None=None, lat_col:str|None=None,
                    data_psf=None, cell_psf=None,
                    combine_method='multiply')

*Process a single dask partition and return a DataFrame of assignments.

Supports two workflows:
  1. Scalar lon/lat columns (efficient for strict mode): pass lon_col, lat_col
  2. Geometry-based (for fuzzy mode): pass geometries via gdf.geometry

Args:
  gdf: GeoDataFrame or DataFrame partition
  nside: HEALPix nside parameter
  mode: 'strict' or 'fuzzy' assignment mode
  base_index: Base index for global source_id generation
  lon_convention: 'minus_plus180' or '0_360'
  lon_col: Longitude column name (if None, use geometry)
  lat_col: Latitude column name (if None, use geometry)
  data_psf, cell_psf: Optional PSF functions
  combine_method: How to combine PSF weights

Returns: DataFrame with columns ['source_id', 'healpix_id'] and optional 'weight'*


add_psf_weights_to_sidecar

 add_psf_weights_to_sidecar (sidecar_df, src_geoms, cell_geoms,
                             data_psf=None, cell_psf=None,
                             combine_method='multiply', normalize=True)

Add a ‘weight’ column to the sidecar DataFrame using PSF functions.

  • src_geoms: sequence of source geometries (indexed by source_id)
  • cell_geoms: dict or sequence mapping healpix_id to cell geometry


normalize_weights_per_cell

 normalize_weights_per_cell (df, cell_col='healpix_id',
                             weight_col='weight')

Normalize weights so that sum of weights per cell is 1.0.


compute_assignment_weight

 compute_assignment_weight (src_geom, cell_geom, data_psf=None,
                            cell_psf=None, combine_method='multiply',
                            data_psf_sigma=None, cell_psf_sigma=None)

Compute the assignment weight for a (source geometry, cell geometry) pair.

  • data_psf: callable or None
  • cell_psf: callable or None
  • combine_method: ‘multiply’, ‘sum’, ‘min’, or ‘max’
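The four combine methods reduce to simple elementwise operations. A minimal sketch, assuming both PSFs have already been evaluated to scalar responses (the `combine` helper is hypothetical):

```python
import numpy as np

def combine(w_data, w_cell, method="multiply"):
    # Elementwise combination of the data-PSF and cell-PSF responses
    ops = {
        "multiply": lambda a, b: a * b,
        "sum":      lambda a, b: a + b,
        "min":      np.minimum,
        "max":      np.maximum,
    }
    return ops[method](w_data, w_cell)

print(combine(0.8, 0.5))                 # 0.4 (default: multiply)
print(combine(0.8, 0.5, method="max"))   # 0.8
```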


write_sidecar_metadata

 write_sidecar_metadata (output_path:pathlib.Path,
                         input_path:pathlib.Path, nside:int, mode:str,
                         lon_convention:str, ncores:int, args)

*Write sidecar processing metadata to JSON file.

Args:
  output_path: Path to the sidecar parquet file
  input_path: Path to the input file
  nside: HEALPix nside parameter
  mode: Assignment mode ('strict' or 'fuzzy')
  lon_convention: Longitude convention used
  ncores: Number of cores used
  args: Parsed command-line arguments

Returns: Path to the written metadata file*


build_output_path

 build_output_path (input_path:pathlib.Path, mode:str, nside:int)

Build output path for sidecar file based on input and parameters.


parse_arguments

 parse_arguments (argv=None)

Parse command line arguments.


validate_nside

 validate_nside (nside:int)

Validate that nside is a positive power of two.
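A positive power of two has exactly one bit set, so the standard bit trick checks this in O(1). A sketch of the check (the real `validate_nside` may instead raise an error with a specific message):

```python
def is_valid_nside(nside: int) -> bool:
    # Power of two: positive and exactly one bit set
    return nside > 0 and (nside & (nside - 1)) == 0

print([n for n in (1, 2, 3, 64, 100, 1024) if is_valid_nside(n)])
# -> [1, 2, 64, 1024]
```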


get_psf

 get_psf (psf_type, sigma=None)

GaussianPSF

 GaussianPSF (sigma=None)

2D Gaussian PSF centered at (0,0).
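Conceptually, a 2D Gaussian PSF centered at the origin evaluates exp(-(dx² + dy²) / (2σ²)). A minimal sketch of that shape; the class name, call signature, and lack of normalization here are assumptions, not the actual `GaussianPSF` API:

```python
import numpy as np

class ToyGaussianPSF:
    """Toy stand-in for a 2D Gaussian PSF centered at (0, 0)."""

    def __init__(self, sigma=1.0):
        self.sigma = sigma

    def __call__(self, dx, dy):
        # Unnormalized Gaussian response at offset (dx, dy) from the center
        return np.exp(-(np.asarray(dx) ** 2 + np.asarray(dy) ** 2)
                      / (2.0 * self.sigma ** 2))

psf = ToyGaussianPSF(sigma=1.0)
print(psf(0.0, 0.0))   # 1.0 at the center
```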


PSF

 PSF ()

Base class for Point Spread Functions (PSF).


write_partitioned_output

 write_partitioned_output (tasks, out_file:pathlib.Path, nparts:int)

*Write output as partitioned parquet files (one per partition).

Returns: Total number of rows written*


write_coalesced_output

 write_coalesced_output (tasks, out_file:pathlib.Path, nside:int,
                         mode:str, ncores:int, nparts:int)

*Write output as a single coalesced parquet file with incremental batching.

Returns: Total number of rows written*


main

 main (argv=None)

Main entry point for HEALPix sidecar generation.


get_healpix_cell_geometry

 get_healpix_cell_geometry (healpix_id, nside, nest=True)

Return a shapely Polygon for the given HEALPix cell. Uses healpy boundaries (in degrees, lon/lat).

Usage Example

See the main() function for CLI usage, or import functions directly for programmatic use.

Geographical Statistics Feature

The sidecar tool includes a standalone geo-statistics feature to inspect your data before processing.

Key Design

  1. Separate operation: --geo-stats analyzes the raw data, displays statistics, and exits without performing sidecar calculations. This allows you to:
     • Inspect data ranges and quality
     • Get convention recommendations based on actual data
     • Validate coordinates before heavy processing
  2. Flexible analysis modes:
     • Raw data (default): no filtering; shows actual data ranges
     • Filtered data (with --lon-convention): applies the same filtering as HEALPix processing, so you can compare raw vs. filtered statistics
  3. Automatic column detection: detects lon/lat columns using common naming patterns
  4. Geometry extraction: falls back to extracting coordinates from the geometry column (centroids for polygons)
  5. Efficient computation: uses DuckDB SQL with WHERE clauses for fast filtered statistics
  6. Filtering impact: shows total records, filtered records, and the drop percentage
  7. Convention suggestion: recommends an appropriate --lon-convention based on data ranges
  8. Validation warnings: checks whether coordinates are within valid ranges
  9. Formatted output: uses the Rich library for tables (falls back to plain text if unavailable)
  10. JSON export: saves statistics to a .geo_stats.json file for reference

Usage

# Step 1: Inspect your data (raw, no filtering - auto-detects lon/lat columns)
healpyxel-sidecar -i data.parquet --geo-stats

# Step 1b: Inspect with filtering to see what will be processed
healpyxel-sidecar -i data.parquet --geo-stats --lon-convention 0_360

# Step 2: Run actual processing with appropriate convention
healpyxel-sidecar -i data.parquet --nside 64 --lon-convention 0_360 --mode fuzzy

# Optional: Specify explicit lon/lat columns
healpyxel-sidecar -i data.parquet --geo-stats \
  --lon-col longitude --lat-col latitude

# Optional: Control sample size for geometry-based extraction
healpyxel-sidecar -i data.parquet --geo-stats \
  --stats-sample-size 50000

Statistics Computed

  • Count: number of valid coordinates
  • Min/Max: range of longitude and latitude values (as-is from the file)
  • Mean: average position
  • Std Dev: standard deviation (spread)

Why This Matters

  • Workflow efficiency: quick data inspection before heavy processing
  • Convention detection: the tool suggests the right --lon-convention for your data
  • Data validation: catch coordinate-system issues early
  • Performance: statistics are computed efficiently without loading the entire dataset into memory
  • Quality control: identify outliers or invalid coordinates

::: {#cell-24 .cell}
``` {.python .cell-code}
# Test: _read_input_lazy three-tier fallback
import tempfile
from pathlib import Path
from unittest.mock import patch

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a minimal parquet file (no geometry — forces Tier 2)
n = 500
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "spot_lon": rng.uniform(-180, 180, n).astype(np.float32),
    "spot_lat": rng.uniform(-90, 90, n).astype(np.float32),
    "reflectance": rng.normal(0.05, 0.01, n).astype(np.float32),
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "test_no_geom.parquet"
    df.to_parquet(path, index=False)

    # Should succeed via Tier 2 (no geometry → dask_geopandas may fail)
    ddf = _read_input_lazy(path, ncores=4)
    result = ddf.compute()
    assert result.shape[0] == n, f"Expected {n} rows, got {result.shape[0]}"
    assert "spot_lon" in result.columns
    assert "spot_lat" in result.columns
    assert np.allclose(result["spot_lon"].values, df["spot_lon"].values, atol=1e-6)

    # Verify healpy works on the output
    import healpy as hp
    phi = np.radians(result["spot_lon"].values.astype(np.float64) % 360)
    theta = np.radians(90.0 - result["spot_lat"].values.astype(np.float64))
    cells = hp.ang2pix(64, theta, phi, nest=True)
    assert cells.shape == (n,)
    assert np.all(cells >= 0)
    assert np.all(cells < 12 * 64**2)

# Test: simulate broken spatial partitions (force Tier 1 failure → Tier 2 success)
with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "test_broken_spatial.parquet"
    df.to_parquet(path, index=False)

    with patch("dask_geopandas.read_parquet", side_effect=ValueError("Expected spatial partitions of length 6, got 115 instead.")):
        ddf = _read_input_lazy(path, ncores=4)
        result = ddf.compute()
        assert result.shape[0] == n, f"Tier 2 fallback failed: expected {n} rows, got {result.shape[0]}"

print(f"✓ _read_input_lazy verified: {n} rows, Tier 1→2 fallback works, healpy compatible")
```

    ✓ _read_input_lazy verified: 500 rows, Tier 1→2 fallback works, healpy compatible

:::
