01 — Project Overview¶
Mission¶
Characterize ambient air quality in 13 South Texas counties from 2015 through 2025, identify weather-driven pollutant patterns, and build the data foundation for spatial interpolation models that extend monitored estimates into unmonitored areas.
Study area¶
13 counties, 47 monitoring sites in the registry (42 active + 3 reference-only CPS fence-line + 1 officially retired Calaveras Lake Park (TSP only, outside scope) + 1 disabled Williams Park), 15 weather stations.
| County | MSA / Region | Active AQ Sites | Weather Stations |
|---|---|---|---|
| Atascosa | San Antonio–New Braunfels, TX | 1 | — |
| Bexar | San Antonio–New Braunfels, TX | 19 | 5 |
| Cameron | Brownsville–Harlingen, TX | 2 | 1 |
| Comal | San Antonio–New Braunfels, TX | 3 | 1 |
| Guadalupe | San Antonio–New Braunfels, TX | 2 | 1 |
| Hidalgo | McAllen–Edinburg–Mission, TX | 2 | 1 |
| Karnes | (non-MSA) | 1 | — |
| Kleberg | (non-MSA) | 1 | 1 |
| Maverick | (non-MSA) | 1 | — |
| Nueces | Corpus Christi, TX | 7 | 3 |
| Victoria | Victoria, TX | 1 | 1 |
| Webb | Laredo, TX | 2 | 1 |
| Wilson | San Antonio–New Braunfels, TX | 1 | — |
Counts reflect the post-pipeline state (April 2026). Some counties rely on their nearest neighbor's weather station via Haversine pairing (see methodology).
Pollutants measured¶
| Group | Parameters | Unit (normalized) | Source networks |
|---|---|---|---|
| Ozone (O₃) | 44201 | ppm | EPA + TCEQ |
| NOx family | 42601 (NO), 42602 (NO₂), 42603 (NOx) | ppb | EPA + TCEQ |
| CO | 42101 | ppm | EPA |
| SO₂ | 42401 | ppb | EPA + TCEQ |
| PM₂.₅ | 88101, 88500, 88502 | µg/m³ | EPA + TCEQ |
| PM₁₀ | 81102, 85101 | µg/m³ | EPA |
| VOCs | 43xxx, 45xxx (individual species) | ppbC | TCEQ (Hillcrest only) |
All pollutant values in the pipeline output are normalized to the EPA unit convention — see methodology §Unit normalization.
Weather and irradiance variables¶
Hourly observations from 15 OpenWeather stations covering 2015–2025:
- Air temperature, feels-like, dew point (°C — converted if source was Kelvin)
- Relative humidity (%)
- Station pressure (hPa)
- Wind speed, gust, direction (m/s, degrees)
- Wind u/v components (m/s) for kriging
- Cloud cover (%), visibility (m), precipitation (mm)
- GHI, DNI, DHI — clear-sky and all-sky (W/m²)
- Heat index (°C, Rothfusz when applicable)
Deliverables¶
The pipeline produces four tiers of output (see architecture):
- Raw-preserved parquet store — 4.87M pollutant hourly + 1.47M weather hourly rows
- NAAQS design values — 764 rows × 9 metrics × 40 sites × 11 years
- Daily aggregates — 236k site-day-parameter rows with completeness flags
- AQ + weather combined — 236k rows joined to paired daily weather
All outputs exist as:
- Parquet — fast local analytics (primary)
- Flat CSV — for R/Colab users without Arrow
- Postgres (Neon) — for SQL + BI tool access (analysis-ready tables only)
- R .rds — optional, for R-native pipelines
Intended users¶
- Lab researchers — hourly-resolution analysis in R notebooks
- Students / collaborators — SQL queries against Neon, daily-resolution
- Manuscript authors — NAAQS design value tables, time series plots
- Spatial modelers — Kriging inputs with
distance_kmweighting
Not in scope (yet)¶
- Spatial interpolation / kriging surfaces (user-led, downstream of this pipeline)
- Predictive models for unmonitored areas (downstream)
- Refactored R notebooks loading from
data/parquet/(follow-up work) - Coordinates and data for 2 pending VOC sites (awaits TCEQ TAMIS download)