HEALPix Sidecar
HEALPix Sidecar: PSF Weighting Extensions
This section introduces support for data point spread functions (PSFs) and cell point spread functions in the sidecar generation process.
- Data PSF: Models the spatial response of each data geometry (e.g., a 2D Gaussian).
- Cell PSF: Models the spatial response of each HEALPix cell (e.g., a 2D Gaussian centered on the cell).
- Combination: The final weight for each (source, cell) assignment is computed by combining the two (default: multiplication).
- Normalization: Weights are normalized per cell so that their sum is 1, preserving compatibility with unweighted aggregation.
The implementation is modular and ready for future extension to custom/user-provided PSFs.
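The combine-then-normalize pipeline above can be sketched in a few lines. The numbers and names below are purely illustrative (made-up PSF responses, not output of the real PSF classes or the package API):

```python
# Illustrative sketch of the weighting pipeline described above; the PSF
# responses are made-up numbers, not produced by the real PSF classes.

# Three sources assigned to the same HEALPix cell. For each assignment we
# have a data-PSF response and a cell-PSF response.
responses = {
    # source_id: (data_psf_value, cell_psf_value)
    0: (0.9, 0.8),
    1: (0.5, 0.6),
    2: (0.2, 0.4),
}

# Combination (default: multiplication).
raw = {sid: d * c for sid, (d, c) in responses.items()}

# Per-cell normalization: the weights for this cell sum to 1, so weighted
# aggregation stays compatible with unweighted counting.
total = sum(raw.values())
weights = {sid: w / total for sid, w in raw.items()}
```

The per-cell normalization is what preserves compatibility with unweighted aggregation: summing weights over a cell gives 1, just as counting each source once would.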
compute_healpix_ids_from_lonlat
compute_healpix_ids_from_lonlat (nside:int, lons:numpy.ndarray, lats:numpy.ndarray)
*Compute HEALPix indices for arrays of lon,lat in degrees.
Tries to use cdshealpix if available, otherwise falls back to healpy. Returns a 1D integer numpy array of same length as inputs.*
format_geo_statistics
format_geo_statistics (stats:dict)
*Format geo-statistics for display using rich tables.
Args: stats: Statistics dictionary from compute_geo_statistics
Returns: Formatted string representation*
compute_geo_statistics
compute_geo_statistics (input_path:pathlib.Path, lon_col:str|None=None, lat_col:str|None=None, sample_size:int=10000, lon_convention:str|None=None)
*Compute geographical statistics for a GeoParquet file using DuckDB for efficiency.
This function can analyze raw data or apply filtering based on longitude convention.
Args: input_path: Path to input GeoParquet file lon_col: Name of longitude column (if None, will auto-detect or extract from geometry) lat_col: Name of latitude column (if None, will auto-detect or extract from geometry) sample_size: Number of rows to sample for geometry-based extraction (if needed) lon_convention: Optional longitude convention for filtering: ‘0_360’ for [0,360) × [-90,90] ‘minus_plus180’ for [-180,180) × [-90,90] None (default) for no filtering (raw data)
Returns: Dictionary with statistics: { ‘lon’: {‘min’, ‘max’, ‘mean’, ‘std’, ‘count’}, ‘lat’: {‘min’, ‘max’, ‘mean’, ‘std’, ‘count’}, ‘source’: ‘columns’ or ‘geometry’, ‘lon_col’: column name or None, ‘lat_col’: column name or None, ‘filtered’: bool (True if convention filtering was applied), ‘total_count’: int (total records before filtering, if filtered), ‘filtered_count’: int (records after filtering, if filtered) }*
detect_lonlat_columns
detect_lonlat_columns (gdf_sample)
*Auto-detect longitude and latitude columns from a GeoDataFrame sample.
Returns: Tuple of (lon_column, lat_column) or (None, None) if not found*
process_partition
process_partition (gdf, nside:int, mode:str, base_index:int|None=None, lon_convention:str='None', data_psf=None, cell_psf=None, combine_method='multiply')
*Process a single dask partition (GeoDataFrame) and return DataFrame of assignments.
The returned DataFrame has columns ['source_id', 'healpix_id'] and one row per assignment (strict mode: at most one row per source_id; fuzzy mode: one row per touched HEALPix cell).
Args:
- gdf: GeoDataFrame partition
- nside: HEALPix nside parameter
- mode: 'strict' or 'fuzzy' assignment mode
- base_index: Base index for source_id generation
- lon_convention: Longitude convention: '0_360' for [0,360) or 'minus_plus180' for [-180,180)*
add_psf_weights_to_sidecar
add_psf_weights_to_sidecar (sidecar_df, src_geoms, cell_geoms, data_psf=None, cell_psf=None, combine_method='multiply', normalize=True)
Add a 'weight' column to the sidecar DataFrame using PSF functions.
- src_geoms: sequence of source geometries (indexed by source_id)
- cell_geoms: dict or sequence mapping healpix_id to cell geometry
normalize_weights_per_cell
normalize_weights_per_cell (df, cell_col='healpix_id', weight_col='weight')
Normalize weights so that sum of weights per cell is 1.0.
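A minimal sketch of per-cell normalization using a pandas groupby/transform, assuming a DataFrame shaped like the sidecar output (the column names mirror the function's defaults; the real implementation may differ):

```python
import pandas as pd

# Sketch of per-cell weight normalization: divide each weight by the sum of
# weights within its HEALPix cell, so each cell's weights sum to 1.
df = pd.DataFrame({
    "source_id":  [0, 1, 2, 3],
    "healpix_id": [10, 10, 10, 11],
    "weight":     [2.0, 1.0, 1.0, 5.0],
})
df["weight"] = df["weight"] / df.groupby("healpix_id")["weight"].transform("sum")
# Cell 10 ends up with weights 0.5, 0.25, 0.25; cell 11 with 1.0.
```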
compute_assignment_weight
compute_assignment_weight (src_geom, cell_geom, data_psf=None, cell_psf=None, combine_method='multiply', data_psf_sigma=None, cell_psf_sigma=None)
Compute the assignment weight for a (source geometry, cell geometry) pair.
- data_psf: callable or None
- cell_psf: callable or None
- combine_method: 'multiply', 'sum', 'min', 'max'
write_sidecar_metadata
write_sidecar_metadata (output_path:pathlib.Path, input_path:pathlib.Path, nside:int, mode:str, lon_convention:str, ncores:int, args)
*Write sidecar processing metadata to JSON file.
Args:
- output_path: Path to the sidecar parquet file
- input_path: Path to the input file
- nside: HEALPix nside parameter
- mode: Assignment mode ('strict' or 'fuzzy')
- lon_convention: Longitude convention used
- ncores: Number of cores used
- args: Parsed command-line arguments
Returns: Path to the written metadata file*
build_output_path
build_output_path (input_path:pathlib.Path, mode:str, nside:int)
Build output path for sidecar file based on input and parameters.
parse_arguments
parse_arguments (argv=None)
Parse command line arguments.
validate_nside
validate_nside (nside:int)
Validate that nside is a positive power of two.
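The "positive power of two" check has a standard bit-trick form; a plausible sketch of what validate_nside verifies (the actual implementation may raise instead of returning a bool):

```python
def is_valid_nside(nside: int) -> bool:
    """True if nside is a positive power of two (1, 2, 4, 8, ...).

    A power of two has exactly one bit set, so n & (n - 1) clears it to 0.
    """
    return nside > 0 and (nside & (nside - 1)) == 0
```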
get_psf
get_psf (psf_type, sigma=None)
GaussianPSF
GaussianPSF (sigma=None)
2D Gaussian PSF centered at (0,0).
PSF
PSF ()
Base class for Point Spread Functions (PSF).
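A hypothetical sketch of this class hierarchy: a PSF base class defining the call interface and a GaussianPSF evaluating an unnormalized 2D Gaussian at an offset from the center. The attribute and method names are assumptions, not the package's actual API:

```python
import math

class PSF:
    """Base class: a PSF maps an (dx, dy) offset to a response value."""
    def __call__(self, dx: float, dy: float) -> float:
        raise NotImplementedError

class GaussianPSF(PSF):
    """Unnormalized 2D Gaussian centered at (0, 0); response is 1 at the center."""
    def __init__(self, sigma: float = 1.0):
        self.sigma = sigma

    def __call__(self, dx: float, dy: float) -> float:
        return math.exp(-(dx * dx + dy * dy) / (2.0 * self.sigma ** 2))
```

Keeping PSFs as callables is what makes the data_psf/cell_psf parameters above composable: any user-provided callable with the same signature can slot in.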
write_partitioned_output
write_partitioned_output (tasks, out_file:pathlib.Path, nparts:int)
*Write output as partitioned parquet files (one per partition).
Returns: Total number of rows written*
write_coalesced_output
write_coalesced_output (tasks, out_file:pathlib.Path, nside:int, mode:str, ncores:int, nparts:int)
*Write output as a single coalesced parquet file with incremental batching.
Returns: Total number of rows written*
main
main (argv=None)
Main entry point for HEALPix sidecar generation.
get_healpix_cell_geometry
get_healpix_cell_geometry (healpix_id, nside, nest=True)
Return a shapely Polygon for the given HEALPix cell. Uses healpy boundaries (in degrees, lon/lat).
Usage Example
See the main() function for CLI usage, or import functions directly for programmatic use.
Geographical Statistics Feature
The sidecar tool includes a standalone geo-statistics feature to inspect your data before processing.
Key Design
--geo-stats is a separate operation: When specified, the tool analyzes the raw data, displays statistics, and exits without performing sidecar calculations. This allows you to:
- Inspect data ranges and quality
- Get convention recommendations based on actual data
- Validate coordinates before heavy processing
Usage
```bash
# Step 1: Inspect your data (raw, no filtering - auto-detects lon/lat columns)
healpyxel-sidecar -i data.parquet --geo-stats

# Step 1b: Inspect with filtering to see what will be processed
healpyxel-sidecar -i data.parquet --geo-stats --lon-convention 0_360

# Step 2: Run actual processing with appropriate convention
healpyxel-sidecar -i data.parquet --nside 64 --lon-convention 0_360 --mode fuzzy

# Optional: Specify explicit lon/lat columns
healpyxel-sidecar -i data.parquet --geo-stats \
    --lon-col longitude --lat-col latitude

# Optional: Control sample size for geometry-based extraction
healpyxel-sidecar -i data.parquet --geo-stats \
    --stats-sample-size 50000
```
- Flexible analysis modes: compare raw vs. filtered statistics
  - Raw data (default): No filtering, shows actual data ranges
  - Filtered data (with --lon-convention): Applies the same filtering as HEALPix processing
- Automatic column detection: Intelligently detects lon/lat columns using common naming patterns
- Geometry extraction: Falls back to extracting coordinates from the geometry column (centroids for polygons)
- Efficient computation: Uses DuckDB SQL with WHERE clauses for fast filtered statistics
- Filtering impact: Shows total records, filtered records, and drop percentage
- Convention suggestion: Recommends an appropriate --lon-convention based on data ranges
- Validation warnings: Checks that coordinates are within valid ranges
- Beautiful output: Uses the Rich library for formatted tables (falls back to plain text if unavailable)
- JSON export: Saves statistics to a .geo_stats.json file for reference
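The "filtering impact" numbers (total records, filtered records, drop percentage) can be sketched with pandas on a tiny made-up sample; the real tool pushes the equivalent WHERE clause into DuckDB instead of loading the data into memory:

```python
import pandas as pd

# Made-up coordinates: one row violates the '0_360' longitude convention.
df = pd.DataFrame({
    "lon": [10.0, 350.0, -20.0, 180.0],
    "lat": [0.0, 45.0, -30.0, 10.0],
})

# '0_360' convention keeps lon in [0, 360) and lat in [-90, 90].
mask = (df["lon"] >= 0) & (df["lon"] < 360) & (df["lat"] >= -90) & (df["lat"] <= 90)

total_count = len(df)
filtered_count = int(mask.sum())
drop_pct = 100.0 * (total_count - filtered_count) / total_count
```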
Statistics Computed
- Count: Number of valid coordinates
- Min/Max: Range of longitude and latitude values (as-is from file)
- Mean: Average position
- Std Dev: Standard deviation (spread)
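The per-column statistics above can be reproduced with the standard library on a small made-up longitude sample (the tool itself computes them via DuckDB SQL):

```python
import statistics

# Made-up longitude sample; the same computation applies to latitudes.
lons = [10.0, 20.0, 30.0, 40.0]
stats = {
    "count": len(lons),
    "min": min(lons),
    "max": max(lons),
    "mean": statistics.fmean(lons),
    "std": statistics.stdev(lons),  # sample standard deviation
}
```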
Why This Matters
- Workflow efficiency: Quick data inspection before heavy processing
- Convention detection: The tool suggests the right --lon-convention for your data
- Data validation: Catch coordinate system issues early
- Performance: Statistics are computed efficiently without loading the entire dataset into memory
- Quality control: Identify outliers or invalid coordinates