HEALPix Sidecar

Generate HEALPix cell assignments for spatial data

HEALPix Sidecar: PSF Weighting Extensions

This section introduces support for data point spread functions (PSFs) and cell spread functions (CSFs) in the sidecar generation process.

  • Data PSF: Models the spatial response of each data geometry (e.g., a 2D Gaussian).
  • Cell PSF: Models the spatial response of each HEALPix cell (e.g., a 2D Gaussian centered on the cell).
  • Combination: The final weight for each (source, cell) assignment is computed by combining the two (default: multiplication).
  • Normalization: Weights are normalized per cell so that their sum is 1, preserving compatibility with unweighted aggregation.

The implementation is modular and ready for future extension to custom/user-provided PSFs.
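The combination and normalization steps above can be sketched with pandas. This is a toy illustration only: the `w_data`/`w_cell` columns and their values are made up for the example, while `source_id`, `healpix_id`, and `weight` follow the sidecar column names documented below.

```python
import pandas as pd

# Toy assignments: two sources, each overlapping the same two cells.
# The PSF responses below are made-up values for illustration only.
df = pd.DataFrame({
    "source_id":  [0, 0, 1, 1],
    "healpix_id": [10, 11, 10, 11],
    "w_data": [1.0, 0.4, 0.7, 0.9],  # data-PSF response per assignment
    "w_cell": [0.8, 0.6, 0.5, 1.0],  # cell-PSF response per assignment
})

# Combination (default): multiply the two responses
df["weight"] = df["w_data"] * df["w_cell"]

# Normalization: weights within each cell sum to 1, so an unweighted
# aggregation over a cell keeps the same total
df["weight"] /= df.groupby("healpix_id")["weight"].transform("sum")
print(df.groupby("healpix_id")["weight"].sum().to_dict())
```

Because each cell's weights sum to 1 after normalization, downstream per-cell aggregations behave the same as in the unweighted case.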


compute_healpix_ids_from_lonlat

 compute_healpix_ids_from_lonlat (nside:int, lons:numpy.ndarray,
                                  lats:numpy.ndarray)

*Compute HEALPix indices for arrays of lon/lat in degrees.

Tries to use cdshealpix if available, otherwise falls back to healpy. Returns a 1D integer numpy array of the same length as the inputs.*
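For reference, healpy's `ang2pix` expects co-latitude θ and longitude φ in radians rather than lon/lat in degrees. The sketch below shows that conversion with plain numpy (the exact internals of this function are not shown here; the nested ordering is an assumption, consistent with `nest=True` used elsewhere in this module):

```python
import numpy as np

def lonlat_to_thetaphi(lons, lats):
    # healpy's ang2pix expects co-latitude theta in [0, pi] and
    # longitude phi in [0, 2*pi), both in radians
    theta = np.radians(90.0 - np.asarray(lats, dtype=np.float64))
    phi = np.radians(np.asarray(lons, dtype=np.float64) % 360.0)
    return theta, phi

theta, phi = lonlat_to_thetaphi([-180.0, 0.0, 90.0], [-90.0, 0.0, 45.0])
print(theta, phi)
# With healpy installed, the indices would then be:
#   import healpy as hp
#   ids = hp.ang2pix(nside, theta, phi, nest=True)
```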


format_geo_statistics

 format_geo_statistics (stats:dict)

*Format geo-statistics for display using rich tables.

Args:
  stats: Statistics dictionary from compute_geo_statistics

Returns: Formatted string representation*


compute_geo_statistics

 compute_geo_statistics (input_path:pathlib.Path, lon_col:str|None=None,
                         lat_col:str|None=None, sample_size:int=10000,
                         lon_convention:str|None=None)

*Compute geographical statistics for a GeoParquet file using DuckDB for efficiency.

This function can analyze raw data or apply filtering based on longitude convention.

Args:
  input_path: Path to the input GeoParquet file
  lon_col: Name of the longitude column (if None, auto-detect or extract from geometry)
  lat_col: Name of the latitude column (if None, auto-detect or extract from geometry)
  sample_size: Number of rows to sample for geometry-based extraction (if needed)
  lon_convention: Optional longitude convention for filtering:
    '0_360' for [0, 360) × [-90, 90]
    'minus_plus180' for [-180, 180) × [-90, 90]
    None (default) for no filtering (raw data)

Returns: Dictionary with statistics:
  {
    'lon': {'min', 'max', 'mean', 'std', 'count'},
    'lat': {'min', 'max', 'mean', 'std', 'count'},
    'source': 'columns' or 'geometry',
    'lon_col': column name or None,
    'lat_col': column name or None,
    'filtered': bool (True if convention filtering was applied),
    'total_count': int (total records before filtering, if filtered),
    'filtered_count': int (records after filtering, if filtered)
  }*
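The per-axis entries of the returned dictionary can be reproduced in miniature with plain numpy on synthetic data. This is a sketch of the statistics only, not of the DuckDB-backed implementation, and the `axis_stats` helper is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
lons = rng.uniform(-180, 180, 1000)
lats = rng.uniform(-90, 90, 1000)

def axis_stats(values):
    # Same fields as the 'lon'/'lat' entries documented above
    return {
        "min": float(values.min()),
        "max": float(values.max()),
        "mean": float(values.mean()),
        "std": float(values.std()),
        "count": int(values.size),
    }

stats = {
    "lon": axis_stats(lons),
    "lat": axis_stats(lats),
    "source": "columns",   # or "geometry" when extracted from geometries
    "lon_col": "lon",
    "lat_col": "lat",
    "filtered": False,
}
print(stats["lon"]["count"])  # 1000
```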


detect_lonlat_columns

 detect_lonlat_columns (gdf_sample)

*Auto-detect longitude and latitude columns from a GeoDataFrame sample.

Returns: Tuple of (lon_column, lat_column) or (None, None) if not found*
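Detection by common naming patterns can be sketched as follows. The name sets and the `detect_lonlat` helper are illustrative assumptions; the actual patterns used by `detect_lonlat_columns` may differ:

```python
# Illustrative name sets; the real detector's patterns may differ
LON_NAMES = {"lon", "long", "longitude", "x", "spot_lon"}
LAT_NAMES = {"lat", "latitude", "y", "spot_lat"}

def detect_lonlat(columns):
    # Return (lon_column, lat_column), or (None, None) if either is missing
    lon = next((c for c in columns if c.lower() in LON_NAMES), None)
    lat = next((c for c in columns if c.lower() in LAT_NAMES), None)
    return (lon, lat) if lon and lat else (None, None)

print(detect_lonlat(["spot_lon", "spot_lat", "reflectance"]))
# -> ('spot_lon', 'spot_lat')
```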


process_partition

 process_partition (gdf, nside:int, mode:str, base_index:int|None=None,
                    lon_convention:str='minus_plus180',
                    lon_col:str|None=None, lat_col:str|None=None,
                    data_psf=None, cell_psf=None,
                    combine_method='multiply')

*Process a single dask partition and return a DataFrame of assignments.

Supports two workflows:
  1. Scalar lon/lat columns (efficient for strict mode): pass lon_col, lat_col
  2. Geometry-based (for fuzzy mode): pass geometries via gdf.geometry

Args:
  gdf: GeoDataFrame or DataFrame partition
  nside: HEALPix nside parameter
  mode: 'strict' or 'fuzzy' assignment mode
  base_index: Base index for global source_id generation
  lon_convention: 'minus_plus180' or '0_360'
  lon_col: Longitude column name (if None, use geometry)
  lat_col: Latitude column name (if None, use geometry)
  data_psf, cell_psf: Optional PSF functions
  combine_method: How to combine PSF weights

Returns: DataFrame with columns ['source_id', 'healpix_id'] and optional 'weight'*


add_psf_weights_to_sidecar

 add_psf_weights_to_sidecar (sidecar_df, src_geoms, cell_geoms,
                             data_psf=None, cell_psf=None,
                             combine_method='multiply', normalize=True)

Add a ‘weight’ column to the sidecar DataFrame using PSF functions.

  • src_geoms: sequence of source geometries (indexed by source_id)
  • cell_geoms: dict or sequence mapping healpix_id to cell geometry


normalize_weights_per_cell

 normalize_weights_per_cell (df, cell_col='healpix_id',
                             weight_col='weight')

Normalize weights so that sum of weights per cell is 1.0.


compute_assignment_weight

 compute_assignment_weight (src_geom, cell_geom, data_psf=None,
                            cell_psf=None, combine_method='multiply',
                            data_psf_sigma=None, cell_psf_sigma=None)

Compute the assignment weight for a (source geometry, cell geometry) pair.

  • data_psf: callable or None
  • cell_psf: callable or None
  • combine_method: ‘multiply’, ‘sum’, ‘min’, or ‘max’
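The four combine methods reduce to simple elementwise operations. A minimal sketch, assuming both PSFs have already been evaluated to scalar responses (the `combine` helper is hypothetical):

```python
import numpy as np

def combine(w_data, w_cell, method="multiply"):
    # Elementwise combination of the data-PSF and cell-PSF responses
    ops = {
        "multiply": lambda a, b: a * b,
        "sum":      lambda a, b: a + b,
        "min":      np.minimum,
        "max":      np.maximum,
    }
    return ops[method](w_data, w_cell)

print(combine(0.8, 0.5))                 # 0.4 (default: multiply)
print(combine(0.8, 0.5, method="max"))   # 0.8
```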


write_sidecar_metadata

 write_sidecar_metadata (output_path:pathlib.Path,
                         input_path:pathlib.Path, nside:int, mode:str,
                         lon_convention:str, ncores:int, args)

*Write sidecar processing metadata to JSON file.

Args:
  output_path: Path to the sidecar parquet file
  input_path: Path to the input file
  nside: HEALPix nside parameter
  mode: Assignment mode ('strict' or 'fuzzy')
  lon_convention: Longitude convention used
  ncores: Number of cores used
  args: Parsed command-line arguments

Returns: Path to the written metadata file*


build_output_path

 build_output_path (input_path:pathlib.Path, mode:str, nside:int)

Build output path for sidecar file based on input and parameters.


parse_arguments

 parse_arguments (argv=None)

Parse command line arguments.


validate_nside

 validate_nside (nside:int)

Validate that nside is a positive power of two.
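A positive power of two has exactly one bit set, so the standard bit trick checks this in O(1). A sketch of the check (the real `validate_nside` may instead raise an error with a specific message):

```python
def is_valid_nside(nside: int) -> bool:
    # Power of two: positive and exactly one bit set
    return nside > 0 and (nside & (nside - 1)) == 0

print([n for n in (1, 2, 3, 64, 100, 1024) if is_valid_nside(n)])
# -> [1, 2, 64, 1024]
```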


get_psf

 get_psf (psf_type, sigma=None)

GaussianPSF

 GaussianPSF (sigma=None)

2D Gaussian PSF centered at (0,0).
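Conceptually, a 2D Gaussian PSF centered at the origin evaluates exp(-(dx² + dy²) / (2σ²)). A minimal sketch of that shape; the class name, call signature, and lack of normalization here are assumptions, not the actual `GaussianPSF` API:

```python
import numpy as np

class ToyGaussianPSF:
    """Toy stand-in for a 2D Gaussian PSF centered at (0, 0)."""

    def __init__(self, sigma=1.0):
        self.sigma = sigma

    def __call__(self, dx, dy):
        # Unnormalized Gaussian response at offset (dx, dy) from the center
        return np.exp(-(np.asarray(dx) ** 2 + np.asarray(dy) ** 2)
                      / (2.0 * self.sigma ** 2))

psf = ToyGaussianPSF(sigma=1.0)
print(psf(0.0, 0.0))   # 1.0 at the center
```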


PSF

 PSF ()

Base class for Point Spread Functions (PSF).


write_partitioned_output

 write_partitioned_output (tasks, out_file:pathlib.Path, nparts:int)

*Write output as partitioned parquet files (one per partition).

Returns: Total number of rows written*


write_coalesced_output

 write_coalesced_output (tasks, out_file:pathlib.Path, nside:int,
                         mode:str, ncores:int, nparts:int)

*Write output as a single coalesced parquet file with incremental batching.

Returns: Total number of rows written*


main

 main (argv=None)

Main entry point for HEALPix sidecar generation.


get_healpix_cell_geometry

 get_healpix_cell_geometry (healpix_id, nside, nest=True)

Return a shapely Polygon for the given HEALPix cell. Uses healpy boundaries (in degrees, lon/lat).

Usage Example

See the main() function for CLI usage, or import functions directly for programmatic use.

Geographical Statistics Feature

The sidecar tool includes a standalone geo-statistics feature to inspect your data before processing.

Key Design

  1. Separate operation: --geo-stats analyzes the raw data, displays statistics, and exits without performing sidecar calculations. This allows you to:
     • Inspect data ranges and quality
     • Get convention recommendations based on actual data
     • Validate coordinates before heavy processing
  2. Flexible analysis modes:
     • Raw data (default): no filtering; shows actual data ranges
     • Filtered data (with --lon-convention): applies the same filtering as HEALPix processing, so you can compare raw vs. filtered statistics
  3. Automatic column detection: detects lon/lat columns using common naming patterns
  4. Geometry extraction: falls back to extracting coordinates from the geometry column (centroids for polygons)
  5. Efficient computation: uses DuckDB SQL with WHERE clauses for fast filtered statistics
  6. Filtering impact: shows total records, filtered records, and the drop percentage
  7. Convention suggestion: recommends an appropriate --lon-convention based on data ranges
  8. Validation warnings: checks whether coordinates are within valid ranges
  9. Formatted output: uses the Rich library for tables (falls back to plain text if unavailable)
  10. JSON export: saves statistics to a .geo_stats.json file for reference

Usage

# Step 1: Inspect your data (raw, no filtering - auto-detects lon/lat columns)
healpyxel-sidecar -i data.parquet --geo-stats

# Step 1b: Inspect with filtering to see what will be processed
healpyxel-sidecar -i data.parquet --geo-stats --lon-convention 0_360

# Step 2: Run actual processing with appropriate convention
healpyxel-sidecar -i data.parquet --nside 64 --lon-convention 0_360 --mode fuzzy

# Optional: Specify explicit lon/lat columns
healpyxel-sidecar -i data.parquet --geo-stats \
  --lon-col longitude --lat-col latitude

# Optional: Control sample size for geometry-based extraction
healpyxel-sidecar -i data.parquet --geo-stats \
  --stats-sample-size 50000

Statistics Computed

  • Count: number of valid coordinates
  • Min/Max: range of longitude and latitude values (as-is from the file)
  • Mean: average position
  • Std Dev: standard deviation (spread)

Why This Matters

  • Workflow efficiency: quick data inspection before heavy processing
  • Convention detection: the tool suggests the right --lon-convention for your data
  • Data validation: catch coordinate-system issues early
  • Performance: statistics are computed efficiently without loading the entire dataset into memory
  • Quality control: identify outliers or invalid coordinates

::: {#cell-24 .cell}
``` {.python .cell-code}
# Test: _read_input_lazy three-tier fallback
import tempfile
from pathlib import Path
from unittest.mock import patch

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a minimal parquet file (no geometry — forces Tier 2)
n = 500
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "spot_lon": rng.uniform(-180, 180, n).astype(np.float32),
    "spot_lat": rng.uniform(-90, 90, n).astype(np.float32),
    "reflectance": rng.normal(0.05, 0.01, n).astype(np.float32),
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "test_no_geom.parquet"
    df.to_parquet(path, index=False)

    # Should succeed via Tier 2 (no geometry → dask_geopandas may fail)
    ddf = _read_input_lazy(path, ncores=4)
    result = ddf.compute()
    assert result.shape[0] == n, f"Expected {n} rows, got {result.shape[0]}"
    assert "spot_lon" in result.columns
    assert "spot_lat" in result.columns
    assert np.allclose(result["spot_lon"].values, df["spot_lon"].values, atol=1e-6)

    # Verify healpy works on the output
    import healpy as hp
    phi = np.radians(result["spot_lon"].values.astype(np.float64) % 360)
    theta = np.radians(90.0 - result["spot_lat"].values.astype(np.float64))
    cells = hp.ang2pix(64, theta, phi, nest=True)
    assert cells.shape == (n,)
    assert np.all(cells >= 0)
    assert np.all(cells < 12 * 64**2)

# Test: simulate broken spatial partitions (force Tier 1 failure → Tier 2 success)
with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "test_broken_spatial.parquet"
    df.to_parquet(path, index=False)

    with patch("dask_geopandas.read_parquet", side_effect=ValueError("Expected spatial partitions of length 6, got 115 instead.")):
        ddf = _read_input_lazy(path, ncores=4)
        result = ddf.compute()
        assert result.shape[0] == n, f"Tier 2 fallback failed: expected {n} rows, got {result.shape[0]}"

print(f"✓ _read_input_lazy verified: {n} rows, Tier 1→2 fallback works, healpy compatible")
```

    ✓ _read_input_lazy verified: 500 rows, Tier 1→2 fallback works, healpy compatible

:::
