Skip to content

03 — Data Schemas

Complete schemas for every output the pipeline produces. This is the authoritative reference — if a column name or unit differs from what you see here, the pipeline has drifted and this doc is wrong.

Parquet Store

data/parquet/pollutants/ — Hive partitioned

Partition keys: pollutant_group=…/year=…/*.parquet Rows: ~4,870,334 after deduplication (from 5,843,628 raw) Produced by: step_01_build_pollutant_store.py

Column Type Unit Description
state_code int32 FIPS state code (always 48 = Texas)
county_code int32 FIPS county code (3-digit)
site_number int32 AQS site number
parameter_code int32 AQS parameter code (e.g. 44201 = O₃)
poc int32 Parameter Occurrence Code (sub-instrument identifier)
date_local string ISO date YYYY-MM-DD (local time)
time_local string HH:MM (local time)
sample_measurement float64 varies Normalized to EPA units (see §Unit conventions)
method_code int32 Measurement method code
county_name string Title case (normalized from ALL-CAPS)
pollutant_name string Specific pollutant name
aqsid string 9-digit AQS site identifier (state+county+site)
data_source string "EPA" or "TCEQ"
pollutant_group string Partition key. One of: Ozone, NOx_Family, CO, SO2, PM2.5, PM10, VOCs
site_name string Human Name_XXXX
datetime timestamp[ns] Derived from date_local + time_local
year int16 Partition key
month int8 1–12
hour int8 0–23
season string DJF / MAM / JJA / SON

Unit conventions (pollutant parquet)

Pollutant group Parameter codes Unit Note
Ozone 44201 ppm TCEQ rows converted from ppb (×0.001) in step 01
NOx_Family 42601, 42602, 42603 ppb EPA and TCEQ both native ppb
CO 42101 ppm EPA-only, native ppm
SO2 42401 ppb EPA and TCEQ both native ppb
PM2.5 88101, 88500, 88502 µg/m³ Local conditions
PM10 81102, 85101 µg/m³ Local conditions
VOCs 43xxx, 45xxx ppbC Carbon-normalized

data/parquet/weather/ — Hive partitioned

Partition keys: location=…/year=…/*.parquet Rows: 1,470,049 Produced by: step_02_build_weather_store.py

Inherits all 45 columns from Weather_Irradiance_Master_2015_2025.csv. The most-used subset:

Column Type Unit Description
dt int64 unix seconds UTC timestamp
datetime_local string Local datetime
datetime_utc string UTC datetime
year, month, hour int16/8/8 Derived from datetime_local
date_local string Local date YYYY-MM-DD
location string Partition key. Weather station name (renamed from site_name)
county_name string County the station is in
lat, lon float64 degrees Station coordinates
temp float64 °C Air temperature (already Celsius in master)
temp_c float64 °C Stable alias (identical to temp)
temp_f float64 °F Pre-computed Fahrenheit
feels_like float64 °C Apparent temperature
dew_point float64 °C Dew point
td_spread float64 °C Dew point spread (temp - dew_point)
humidity float64 % Relative humidity
pressure float64 hPa Station pressure
sea_level, grnd_level float64 hPa Reduced pressures
wind_speed float64 m/s Wind speed
wind_deg float64 degrees Wind direction (meteorological convention)
wind_gust float64 m/s Peak gust in hour
wind_u float64 m/s U-component (-speed · sin(deg))
wind_v float64 m/s V-component (-speed · cos(deg))
clouds_all float64 % Cloud cover fraction
cloud_fraction float64 0–1 Decimal cloud cover
visibility float64 m Horizontal visibility
rain_1h float64 mm Rainfall last hour
rain_3h float64 mm Rainfall last 3 hours
snow_1h float64 mm Snowfall last hour
is_raining bool Flag
weather_id, weather_main, weather_description OpenWeather condition
heat_index_c float64 °C Rothfusz (null when T<26°C or RH<40%)
ghi_cloudy_sky, ghi_clear_sky float64 W/m² Global horizontal irradiance
dni_cloudy_sky, dni_clear_sky float64 W/m² Direct normal irradiance
dhi_cloudy_sky, dhi_clear_sky float64 W/m² Diffuse horizontal irradiance

data/parquet/naaqs/design_values.parquet

Rows: 764 Produced by: step_03_compute_naaqs.py

Column Type Description
aqsid string Site
year int Calendar year
pollutant_group string Ozone, PM2.5, PM10, CO, SO2, NOx_Family
metric string See §NAAQS metric catalog
value float64 Computed design value
units string ppm, ppb, ug/m3, or count
naaqs_level float64 NAAQS threshold from config.yaml (null for exceedance counts)
exceeds bool value > naaqs_level
site_name, county_name string

NAAQS metric catalog

metric Pollutant Formula NAAQS level Applies when
ozone_8hr_4th_max O₃ 4th-highest daily max 8-hr rolling avg per year 0.070 ppm All sites with O₃ data
pm25_annual_mean PM₂.₅ Annual mean of daily means (≥18 hrs) 9.0 µg/m³ PM₂.₅ sites
pm25_24hr_p98 PM₂.₅ 98th percentile of daily means per year 35 µg/m³ PM₂.₅ sites
pm10_24hr_exceedances PM₁₀ Count of daily means > 150 µg/m³ per year PM₁₀ sites
co_8hr_max CO Annual max of 8-hr rolling means 9 ppm CO sites
co_1hr_max CO Annual max hourly 35 ppm CO sites
so2_1hr_p99 SO₂ 99th percentile of daily max 1-hr per year 75 ppb SO₂ sites
no2_1hr_p98 NO₂ 98th percentile of daily max 1-hr per year 100 ppb NO₂ only (param 42602)
no2_annual_mean NO₂ Annual mean 53 ppb NO₂ only (param 42602)

Completeness rules are documented in 05_methodology.md.

data/parquet/daily/pollutant_daily.parquet

Rows: 236,070 Produced by: step_04_compute_daily_aggregates.py

One row per (aqsid, date_local, parameter_code).

Column Type Description
aqsid, date_local, parameter_code, pollutant_name, pollutant_group Identifiers
county_name, site_name string Metadata
mean float64 Daily mean of sample_measurement
min float64 Daily min
max float64 Daily max
std float64 Daily std dev (ddof=1)
n_hours int64 Hours reported that day
completeness_pct float64 n_hours / 24
valid_day bool completeness_pct >= 0.75

data/parquet/daily/pollutant_monthly.parquet

Rows: 6,070 Produced by: step_04_compute_daily_aggregates.py

One row per (aqsid, year_month, parameter_code). Uses only valid days.

Column Type Description
aqsid, year_month, parameter_code, pollutant_name, pollutant_group Identifiers
county_name, site_name string
monthly_mean float64 Mean of daily means
monthly_min, monthly_max, monthly_std float64
n_valid_days int64 Number of days meeting 75% threshold

data/parquet/combined/aq_weather_daily.parquet

Rows: 236,070 Produced by: step_05_merge_aq_weather.py

Daily pollutant joined with daily-aggregated weather at the nearest weather station. Each row carries all of pollutant_daily's columns PLUS:

Column Type Description
weather_station string Paired station (same-county Haversine nearest)
distance_km float64 Great-circle distance from pollutant site to weather station
temp_c, temp_c_min, temp_c_max float64 Daily temperature stats (°C)
feels_like_c, dew_point_c float64
humidity, humidity_min, humidity_max float64 %
pressure float64 hPa
wind_speed, wind_speed_max float64 m/s
wind_u, wind_v float64 m/s (for kriging)
wind_gust_max float64 m/s
clouds_all float64 %
visibility float64 m
rain_1h_sum float64 Daily precipitation
ghi_cloudy_sky_sum, ghi_clear_sky_sum float64 Daily integrated GHI
heat_index_c_max float64 Daily peak heat index

Flat CSV exports (data/csv/)

One-to-one dumps of the parquet tables above. Same schemas. Regenerated on every pipeline run.

File Source Rows (approx)
daily_pollutant_means.csv pollutant_daily.parquet 236k
naaqs_design_values.csv design_values.parquet 764
combined_aq_weather_daily.csv aq_weather_daily.parquet 236k
site_registry.csv Built in step 05 from 4 sources 47

site_registry.csv (47 rows — full inventory)

Column Description
aqsid 9-digit AQS site identifier
state_code, county_code, site_number FIPS + site
site_name Human-readable name
county_name Title case
network EPA, TCEQ, BOTH, or empty (reference/alias rows)
pollutants ;-separated list of pollutant groups measured
n_pollutants Count of pollutant groups
first_date, last_date Data coverage period (null for non-active rows)
n_records Total raw observations across all pollutants
data_status See breakdown below
co_located_with Cross-reference AQSID for aliases (empty for most rows)
notes Free-text explanation of the row's status
lat, lon Decimal degrees (WGS84)

Status breakdown (47 total, as of v0.3.3):

Count Status Meaning
42 active Has measurement data in the pipeline
3 reference CPS Energy fence-line monitors (Gardner Rd, Gate 9A, Gate 58)
1 excluded Calaveras Lake Park (480291609) — TCEQ monitor, officially retired from analysis: measures only Total Suspended Particulate (TSP), which is outside the project's pollutant scope
1 disabled Williams Park (483551024) — confirmed disabled in inventory

Note on Calaveras: 480290059 (Calaveras Lake, EPA-operated, active) and 480291609 (Calaveras Lake Park, TCEQ-operated, excluded) are separate physical monitoring stations. Calaveras Lake Park measures only Total Suspended Particulate (TSP), which is outside the project's scope (PM₂.₅, PM₁₀, O₃, CO, NOx, SO₂, VOCs). Do not deduplicate.

Important: Always filter to data_status == 'active' for analytical queries. The other four statuses describe registry entries that do not have associated measurement rows.

Postgres tables (aq schema)

Analysis-ready tables mirror the parquet/CSV schemas exactly. See 10_usage_sql.md for connection and query details.

Table Row count Index
aq.site_registry 47 aqsid
aq.naaqs_design_values 764 aqsid, year, metric, pollutant_group
aq.pollutant_daily 236,070 aqsid, date_local, pollutant_group
aq.pollutant_monthly 6,070 aqsid, year_month, pollutant_group
aq.aq_weather_daily 236,070 aqsid, date_local