Getting Started
This guide walks through locally setting up this NPS hiking project from scratch. By the end, you'll have a PostGIS database with all 63 US national parks and hundreds of hiking trails, queryable through an interactive API.
Tip: Instead of a local setup, you can explore a live demo of the Swagger UI docs at seanangio-nps-hikes.onrender.com/docs. To query the visualization endpoints, however, continue with the local setup instructions here.
Step 0: Prerequisites
Before you begin, make sure you have the following:
- Docker Desktop: Install Docker Desktop for your operating system. Docker runs the database and API in containers so you don't need to install PostgreSQL or PostGIS locally.
- Python 3.12+: The data collection pipeline runs on your local machine. Check your version with python3 --version. If you need to install or upgrade, see python.org.
- Git: Install Git for your operating system to clone the repository.
- An NPS API key: Free to sign up at the NPS Developer Portal. You should receive a key by email within minutes.
Step 1: Clone the repository
Start by cloning the repository.
git clone https://github.com/seanangio/nps-hikes.git
cd nps-hikes
Step 2: Set up a Python environment
Create a virtual environment, and install the project dependencies. You need them for the data collection pipeline, which runs outside of Docker.
python3.12 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
Step 3: Configure environment variables
Copy the example environment file, and fill in your credentials:
cp .env.example .env
Open .env in your editor. You need to set two values:
# Required: your NPS API key from Step 0
NPS_API_KEY=your_actual_api_key
# Required: choose any password for the Docker database
POSTGRES_PASSWORD=choose_a_password
The remaining defaults work as-is for the Docker setup:
# No changes required for these defaults
POSTGRES_USER=postgres
NPS_USER_EMAIL=your_email@example.com
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=nps_hikes_db
How the project uses the .env file: Docker Compose reads it automatically to configure the database container. The Python scripts also read it (via python-dotenv) for API credentials and database connections.
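For reference, here is roughly how a script picks those values up (a minimal sketch using python-dotenv; the variable names match the .env file above, but the project's actual loading code may differ):

```python
# Sketch: reading .env values in Python (actual project code may differ).
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the current directory into os.environ

nps_api_key = os.environ["NPS_API_KEY"]        # required
db_password = os.environ["POSTGRES_PASSWORD"]  # required
db_host = os.getenv("POSTGRES_HOST", "localhost")
db_port = os.getenv("POSTGRES_PORT", "5432")
```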
Step 4: Personalize the raw data
The repository includes two types of sample raw data that you can replace with your own.
Park visit log
The file raw_data/park_visit_log.csv records which parks you've visited:
park_name,month,year
Yosemite,July,2023
Grand Canyon,March,2024
Acadia,Oct,2024
Edit this file with your own visits. The park_name column requires the common short name (for example, "Yosemite" not "Yosemite National Park"). The collector appends "National Park" automatically, so "Yosemite" becomes "Yosemite National Park" and matches directly.
For parks with other designations like "National Park & Preserve," the collector falls back to substring matching. For example, "Denali" doesn't match "Denali National Park & Preserve" exactly, but the pipeline finds a match because the string "Denali" is a substring of the official name.
Make sure your entry is an exact substring of the official name. For example, use "Redwood" (not "Redwoods") for Redwood National and State Parks.
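In other words, the matching logic looks something like the following sketch (illustrative only, not the project's actual code):

```python
def match_park(entry: str, official_names: list[str]) -> str | None:
    """Match a visit-log entry like 'Yosemite' to an official park name."""
    # First try the direct form: append "National Park" and look for an exact match.
    candidate = f"{entry} National Park"
    if candidate in official_names:
        return candidate
    # Fall back to substring matching for other designations,
    # e.g. "Denali" -> "Denali National Park & Preserve".
    for name in official_names:
        if entry in name:
            return name
    return None

# Example: "Denali" has no exact "Denali National Park" entry,
# but substring matching finds the preserve.
parks = ["Yosemite National Park", "Denali National Park & Preserve"]
assert match_park("Yosemite", parks) == "Yosemite National Park"
assert match_park("Denali", parks) == "Denali National Park & Preserve"
```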
Tip: When you test the pipeline below in Step 6, it processes the first NPS park alphabetically (Acadia). If you include Acadia in your visit log, you'll have a visited park with trail data to explore.
Google My Maps hiking data
The raw_data/gmaps/ directory contains KML files with hiking locations exported from Google My Maps. The pipeline uses these named points to match hiking locations to trail geometries, and then collects elevation data for the matched trails. This enables personalized trail matching, hiked/unhiked filtering, and 3D trail visualizations with elevation profiles.
Tip: The repository includes sample KML files from the author's hikes. You can substitute your own files following the instructions below. Otherwise, leave the samples as-is, and skip ahead to Step 5.
How it works
The pipeline processes every .kml file in the raw_data/gmaps/ directory. Inside each KML file, it looks for folders (layers) named for 4-letter park codes, and reads the placemarks within them. A single Google My Maps KML file can contain up to ten layers (one per park).
Finding park codes: You can find the 4-letter abbreviation for each park in the park's URL on the NPS website. Once the API is running, they're also available at http://localhost:8000/parks.
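If you want to double-check which park-code layers a KML file contains, here is a quick sketch using only the Python standard library (the file path is illustrative, following the example layout shown below):

```python
# Sketch: list the folder (layer) names in a Google My Maps KML export.
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"

tree = ET.parse("raw_data/gmaps/nps_points_west.kml")  # example path
for folder in tree.iter(f"{KML_NS}Folder"):
    name = folder.find(f"{KML_NS}name")
    if name is not None:
        print(name.text)  # should print 4-letter park codes like "zion"
```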
Create your hiking maps
In Google My Maps, create one or more maps for your hikes:
- Add a layer for each park, named according to the 4-letter park code (for example, zion).
- Add placemarks to each layer for the trails or locations you've hiked.
Export and add KML files
Export each map as a KML file, and save the files to raw_data/gmaps/:
raw_data/gmaps/
├── nps_points_west.kml # could contain layers: zion, yose, grca, ...
└── nps_points_east.kml # could contain layers: acad, shen, grsm, ...
Step 5: Start the Docker services
Make sure Docker Desktop is running. Then launch the database and API containers:
docker compose up --build -d
This starts two services:
| Service | Port | Description |
|---|---|---|
| db | 5433 | PostGIS database (mapped to 5433 to avoid conflicts with any local PostgreSQL) |
| api | 8000 | FastAPI REST API |
Tip: The first run may take a few minutes while Docker downloads the base images. Subsequent runs are much faster.
On first startup, the database container automatically creates the required PostGIS and pg_trgm extensions and runs all schema migrations. You can verify the services are running:
docker compose ps
You should see both db and api with a status of "Up" (the database should show "healthy").
Note: The database uses port 5433 on your machine, not the standard 5432. This is intentional to avoid conflicts if you have PostgreSQL installed locally.
Step 6: Run the data collection pipeline
Next, populate the database with park and trail data. The pipeline runs on your local machine and writes to the Docker database.
Since the Docker database is on port 5433, override the port when running the pipeline:
POSTGRES_HOST=localhost POSTGRES_PORT=5433 python scripts/orchestrator.py --write-db --test-limit 1
The --test-limit 1 flag processes only one park, so you can verify it works before committing to the full run. Due to the elevation data collection step, this test run may take approximately 10 minutes.
The pipeline runs six steps in order:
| Step | What it does | Data source |
|---|---|---|
| 1. NPS Data Collection | Park metadata, coordinates, and boundary polygons | NPS API |
| 2. OSM Trails Collection | Hiking trails within park boundaries | OpenStreetMap |
| 3. TNM Trails Collection | Official trail data within park boundaries | The National Map |
| 4. GMaps Import | Hiking locations from Google My Maps KML files | KML files in raw_data/gmaps/ |
| 5. Trail Matching | Matches GMaps locations to TNM or OSM trail geometries | Internal |
| 6. Elevation Collection | Elevation profiles for matched trails | USGS EPQS |
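The orchestrator runs these steps in sequence and stops at the first failure (see the fail-fast tip below). A rough sketch of that control flow, with placeholder functions standing in for the real collectors:

```python
# Sketch: fail-fast sequencing of the six pipeline steps (collectors are placeholders).
def run_pipeline(steps):
    for name, step in steps:
        print(f"Running: {name}")
        step()  # any exception stops the pipeline at this step

steps = [
    ("NPS Data Collection", lambda: None),
    ("OSM Trails Collection", lambda: None),
    ("TNM Trails Collection", lambda: None),
    ("GMaps Import", lambda: None),
    ("Trail Matching", lambda: None),
    ("Elevation Collection", lambda: None),
]
run_pipeline(steps)
```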
Verify the test run
First, confirm that the pipeline created and populated the tables by querying the database directly:
docker compose exec db bash -c 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT park_code, park_name FROM parks;"'
Tip: You should see the park code and name of each park collected (one if you used --test-limit 1). This runs psql inside the already-running database container, so you don't need PostgreSQL installed locally.
You can also verify through the API:
curl http://localhost:8000/parks | python3 -m json.tool
You should see a JSON response with park_count showing the number of parks collected and a parks array with details for each one.
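The same check from Python, if you prefer (assumes the requests package is installed; the park_name key inside each entry is an assumption about the response shape):

```python
# Sketch: verify the /parks endpoint from Python.
import requests

resp = requests.get("http://localhost:8000/parks", timeout=10)
resp.raise_for_status()
data = resp.json()
print(data["park_count"])  # number of parks collected
for park in data["parks"]:
    print(park.get("park_name"))  # assumed field name
```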
Run the full pipeline
Once you've confirmed the test run works, collect data for all 63 national parks[^1]:
POSTGRES_HOST=localhost POSTGRES_PORT=5433 python scripts/orchestrator.py --write-db
This takes considerably longer: with the author's sample KML files, expect more than 2 hours for the full run. The main bottleneck is the elevation collection step, which queries the USGS EPQS API for sampled points along each matched trail (one request per point, with a rate-limit delay between calls). The more trails matched from your KML files, the longer this step takes.
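For a sense of scale, each elevation lookup is a single HTTP request like the sketch below (the endpoint and parameters follow the public EPQS JSON API; the project's actual client, sampling, and delay settings may differ):

```python
# Sketch: one elevation lookup against the USGS EPQS service.
import time
import requests

EPQS_URL = "https://epqs.nationalmap.gov/v1/json"

def elevation_m(lon: float, lat: float) -> float:
    resp = requests.get(
        EPQS_URL,
        params={"x": lon, "y": lat, "units": "Meters", "wkid": 4326},
        timeout=30,
    )
    resp.raise_for_status()
    return float(resp.json()["value"])

# One request per sampled point, with a delay between calls (delay is illustrative).
for lon, lat in [(-119.538, 37.865), (-119.536, 37.866)]:  # example points
    print(elevation_m(lon, lat))
    time.sleep(1.0)
```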
The pipeline is resumable: with --write-db, each collector skips parks or trails that already have data in the database, and the elevation collector also maintains a persistent cache of individual elevation lookups. If something interrupts a run, re-running the same command picks up roughly where it left off. To force a full re-collection, pass --force-refresh.
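The skip behavior amounts to something like this sketch (illustrative; the real collectors query the database rather than an in-memory set):

```python
# Sketch: resumable collection by skipping codes already in the database.
def codes_to_collect(park_codes, existing_codes, force_refresh=False):
    """Yield only the park codes that still need collecting."""
    for code in park_codes:
        if not force_refresh and code in existing_codes:
            print(f"Skipping {code}: already collected")
            continue
        yield code

# Example: 'acad' already has data, so only 'zion' is collected.
assert list(codes_to_collect(["acad", "zion"], {"acad"})) == ["zion"]
# With force_refresh, everything is re-collected.
assert list(codes_to_collect(["acad", "zion"], {"acad"}, force_refresh=True)) == ["acad", "zion"]
```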
Tip: The pipeline is fail-fast. If a step fails, check logs/orchestrator.log for details. You can also run individual collectors directly for debugging (see the README for individual component commands).
Step 7: Explore your data
With the pipeline complete, you now have a database full of national park and trail data. The API should be running at http://localhost:8000.
Interactive API documentation
Open http://localhost:8000/docs in your browser to access the Swagger UI. This interactive interface lets you try every endpoint, see request/response schemas, and experiment with query parameters.
Quick examples
| Description | URL |
|---|---|
| Browse all parks | http://localhost:8000/parks |
| Filter to parks you've visited | http://localhost:8000/parks?visited=true |
| See all trails for a specific park | http://localhost:8000/parks/yose/trails |
| Find long trails across all parks | http://localhost:8000/trails?min_length=10 |
| Filter by state | http://localhost:8000/trails?state=CA |
Note: The data endpoints (/parks, /trails) work immediately after the pipeline. The visualization endpoints (park maps, trail maps, elevation charts) require an additional generation step covered in the API Tutorial.
Stopping and restarting
Here are a few handy commands for stopping and restarting the Docker services.
Stop the services (preserving the data):
docker compose down
Restart later (no rebuild needed unless code changed):
docker compose up -d
Start fresh (removes all database data):
docker compose down -v
Troubleshooting
"Set POSTGRES_PASSWORD in .env"
Docker Compose requires setting a POSTGRES_PASSWORD. Make sure your .env file exists in the project root and contains a POSTGRES_PASSWORD value.
Pipeline can't connect to the database
When running the pipeline from your local machine against the Docker database, make sure you're using port 5433:
POSTGRES_HOST=localhost POSTGRES_PORT=5433 python scripts/orchestrator.py --write-db
"Visit log file not found"
The NPS collector expects a file at raw_data/park_visit_log.csv. If you don't have one, create it with just the header row:
echo "park_name,month,year" > raw_data/park_visit_log.csv
Docker containers won't start
Make sure Docker Desktop is running, then try rebuilding:
docker compose down
docker compose up --build
Check the logs for specific errors:
docker compose logs db
docker compose logs api
API returns empty results after pipeline
Verify that the pipeline wrote data to the database:
curl http://localhost:8000/health
The response should show "database": "connected". If connected but no data, re-run the pipeline and check logs/orchestrator.log for errors.
Next steps
- API Tutorial: A guided tour of the API's query capabilities and visualizations
- README: Full project documentation including architecture, testing, and data profiling

[^1]: The NPS manages Sequoia and Kings Canyon as one park (seki), and so it appears as one entry.