Data Sources and Schema
External sources
The project collects data from the following APIs.
National Park Service
The NPS API requires a free API key, and provides park metadata and boundary geometries.
- The collector queries the
/parksendpoint for park names, codes, designations, coordinates, descriptions, and contact information, filtering for National Park designations. - A second pass fetches GeoJSON boundary polygons from the
/mapdata/parkboundaries/{park_code}endpoint. The collector standardizes boundaries to MultiPolygon format with bounding boxes calculated for downstream spatial queries.
OpenStreetMap
The pipeline collects trail geometries from OpenStreetMap via the Overpass API using the osmnx library. Note that the collector:
- Queries for paths and footways (
highway=path|footway) within each park's boundary polygon. - Retains named trails only.
- Aggregates segments sharing the same name into single MultiLineString records.
- Calculates trail lengths in miles using a projected coordinate system
EPSG:5070. - Clips trails to the park boundary before storage.
The National Map
The National Map provides official USGS trail data through an ArcGIS REST endpoint. The collector queries using each park's bounding box and returns trail geometries along with detailed attributes: trail type, trail number, length, and use flags (hiker, bicycle, pack/saddle, cross-country ski, etc.). Like OSM trails, the collector aggregates segments by name, clips to the park boundary, and recalculates lengths in the projected CRS.
USGS Elevation Point Query Service
The USGS EPQS returns elevation in meters for individual latitude/longitude coordinates. The collector samples points along matched trail geometries at regular intervals (default 50 meters) and queries the service for each point. The collector caches results locally to avoid redundant API calls. A three-stage validation pipeline checks API responses, individual point values, and the complete elevation profile. It treates the USGS sentinel value of -1,000,000 as missing data.
Coordinate systems
All geographic data uses EPSG:4326 (WGS84). Length calculations use EPSG:5070 (NAD83/Conus Albers) for accurate distance measurements in meters.
Internal sources
For the following two sources, you can replace the author's files with your own.
Park visit log
A CSV file (raw_data/park_visit_log.csv) recording which national parks you've visited, with park name, month, and year. The NPS collector uses this to tag parks as visited or unvisited, enabling filtered queries through the API. See Park visit log for formatting details.
Google My Maps KML files
KML files exported from Google My Maps containing named hiking location points, organized into layers by 4-letter park code. The pipeline imports these points, matches them to the nearest trail geometries from TNM or OSM, and then collects elevation profiles for the matched trails. See Google My Maps hiking data for export instructions.
Database schema
Core tables
The pipeline orchestrator creates the following tables:
| Table | Description |
|---|---|
| parks | Park metadata (codes, names, coordinates, descriptions, visit dates) |
| park_boundaries | Spatial boundaries as MultiPolygon geometries in WGS84 |
| osm_hikes | Aggregated trail geometries from OpenStreetMap (segments with same name combined into MultiLineString) |
| tnm_hikes | Trail data from The National Map with detailed trail characteristics |
| gmaps_hiking_locations | Google Maps hiking location points with coordinates |
| gmaps_hiking_locations_matched | Matched locations with trail correlation results |
| usgs_trail_elevations | Elevation profile data for matched trails |
Key features
- Spatial indexing with PostGIS GIST indexes for performance
- Foreign key relationships for data integrity across tables
- Composite primary keys for trail uniqueness (
park_code+osm_id) - Coordinate validation with proper range constraints