05 — Methodology¶

Technical and regulatory foundations for every transformation the pipeline applies. This document is written to be citable from the Methods section of a manuscript.

1. Unit normalization¶

Problem¶

The upstream reorganization scripts merged EPA AQS and TCEQ CAMS data into single per-pollutant CSVs (01_Data/Processed/By_Pollutant/*.csv) without reconciling native units. For ozone specifically:

Source	Native unit	Example raw value
EPA AQS	ppm	0.0459
TCEQ CAMS	ppb	45.9

Downstream NAAQS calculations against the 0.070 ppm standard produced nonsense values (~75 "ppm" at San Antonio sites) because TCEQ rows were treated as ppm when they were actually ppb.

Verification¶

Units were confirmed directly from the raw files under !Final Raw Data/:

EPA (CSV with units_of_measure column):

import pandas as pd
epa = pd.read_csv('!Final Raw Data/EPA AQS Downloads/by_pollutant/Ozone_2015_2025_AllCounties.csv')
epa['units_of_measure'].unique()  # → ['Parts per million']

TCEQ (AQS RD Transaction format with Unit Cd field):

# From TCEQ_O3_2016-2025_MissingGuadelupe.txt
# RD|I|48|091|0503|44201|01|1|008|...
#                             ^^^
#                             Unit Cd = 008 = Parts per billion (ppb)

The AQS standard unit code table (EPA AQS Data Mart documentation) maps: - 001 = ppm - 007 = ppmC - 008 = ppb ← TCEQ ozone uses this - 009 = ppbC - 105 = µg/m³ LC

Conversion table¶

Verified for every (parameter_code, data_source) combination in the project data:

Parameter	EPA unit	TCEQ unit	Target	Conversion
44201 (O₃)	ppm	ppb	ppm	TCEQ × 0.001
42101 (CO)	ppm	—	ppm	(TCEQ absent)
42401 (SO₂)	ppb	ppb	ppb	(none)
42601 (NO)	ppb	ppb	ppb	(none)
42602 (NO₂)	ppb	ppb	ppb	(none)
42603 (NOx)	ppb	ppb	ppb	(none)
88101 (PM₂.₅ FRM)	µg/m³	—	µg/m³	—
88502 (PM₂.₅ non-FRM)	—	µg/m³	µg/m³	—
81102 (PM₁₀)	µg/m³	—	µg/m³	—

Only ozone required conversion. All other parameters already use matching units across the two networks.

Implementation¶

pipeline/step_01_build_pollutant_store.py applies conversions via the UNIT_CONVERSIONS dictionary before writing parquet:

UNIT_CONVERSIONS: dict[tuple[int, str], tuple[float, str]] = {
    (44201, "TCEQ"): (0.001, "ozone ppb → ppm"),
}

def _normalize_units(df, log):
    for (param, src), (factor, desc) in UNIT_CONVERSIONS.items():
        mask = (df.parameter_code == param) & (df.data_source == src)
        if mask.sum():
            df.loc[mask, 'sample_measurement'] *= factor
            log.info(f"unit normalize: {desc}  ({mask.sum():,} rows × {factor})")
    return df

Applied in practice: 638,174 TCEQ ozone rows were converted in the current pipeline run. Post-conversion, the Bexar 8-hr ozone 4^th-max values land at 0.063–0.077 ppm, consistent with the San Antonio MSA's ongoing nonattainment status.

Reproducibility¶

Running python pipeline/run_pipeline.py --only 01 after updating UNIT_CONVERSIONS will rebuild the parquet store with the new conversion in place. The step is idempotent; there is no state to roll back.

2. Duplicate handling¶

Problem¶

The upstream reorg scripts wrote ~973k exact full-row duplicates when merging TCEQ data into EPA-dominated CSVs. Additionally, ~167k rows for ~84k groups have identical (aqsid, date, time, parameter, poc) keys but different sample_measurement values — these are not simple duplicates but likely reflect concurrent reports from redundant sub-instruments.

Resolution¶

Exact full-row duplicates are dropped in step 01 before writing parquet. These carry no information.
Same-key-different-value rows are preserved in the raw parquet store and handled downstream in the NAAQS computation (step 03), which averages across POCs at the same timestamp via:

s = df.set_index('datetime')['sample_measurement'].sort_index()
if s.index.duplicated().any():
    s = s.groupby(level=0).mean()

This matches EPA AQS's standard practice of averaging across POCs when computing design values.

Per-group counts (pipeline v0.3.3, April 2026)¶

Pollutant	Raw rows	Exact-dup drops	Out-of-scope filter drops	Rows in parquet	Notes
NOx_Family	1,989,602	551,305	251,328	1,186,969	Calaveras TCEQ feed drop
Ozone	1,823,627	311,508	85,025	1,427,094	Calaveras TCEQ feed drop
PM2.5	1,168,298	110,481	60,377	997,440	Calaveras TCEQ feed drop
SO2	524,039	0	82,116	441,923	Calaveras TCEQ feed drop
CO	191,448	0	0	191,448
PM10	99,910	0	0	99,910
VOCs	3,354,321	0	0	3,354,321	CC Palm + Hillcrest
Total	9,151,245	973,294	478,846	7,699,105

2b. Out-of-scope row filtering¶

After deduplication and before unit normalization, step 01 applies per-row filters defined in pipeline/step_01_build_pollutant_store.py::OUT_OF_SCOPE_FILTERS. Each filter is an AND over {column: value} matches; matching rows are dropped from the parquet store and logged.

Current filters (v0.3.3)¶

Filter	Rule	Rationale
Calaveras Lake TCEQ feed	`aqsid='480290059' AND data_source='TCEQ'`	TCEQ republishes the EPA feed for this site through TAMIS. Rows partially match EPA's `sample_measurement` (exact dups get dropped during dedup) and partially carry value conflicts from rounding / QC differences. Using only the EPA feed gives a single authoritative source per site. See 06_data_quality.md issue #8b.

How to add a new filter¶

Edit OUT_OF_SCOPE_FILTERS in step 01:

OUT_OF_SCOPE_FILTERS: list[tuple[str, dict]] = [
    ("description", {"col": "value", ...}),
    # Add new filters here
]

Rerun with python pipeline/run_pipeline.py --only 01,03,04,05,07 to propagate changes through every downstream layer. Document the new filter in 06_data_quality.md with a clear rationale.

3. NAAQS design value computation¶

All formulas follow 40 CFR Part 50 (U.S. National Ambient Air Quality Standards). Each function lives as a pure, unit-testable routine in pipeline/utils/naaqs.py.

3.1 Ozone 8-hour¶

Regulation: 40 CFR §50.19, primary and secondary standard 0.070 ppm Form: 4^th-highest daily maximum 8-hour rolling mean, averaged over 3 consecutive years. The pipeline emits the per-year 4^th-max; 3-year averaging is left to downstream analysis. Completeness: At least 6 of 8 hours for each rolling window

def ozone_8hr_4th_max(hourly_ppm, min_hours_8hr=6):
    rolling = hourly_ppm.rolling(8, min_periods=min_hours_8hr).mean()
    daily_max = rolling.resample('D').max()
    for year, grp in daily_max.groupby(daily_max.index.year):
        yield year, grp.dropna().sort_values(ascending=False).iloc[3]  # 4th-highest

3.2 PM₂.₅ annual¶

Regulation: 40 CFR §50.13, primary standard 9.0 µg/m³ (revised February 2024; previously 12.0 µg/m³). The pipeline uses the current 9.0 value. Form: Annual mean of daily means, averaged over 3 years. Completeness: At least 18 of 24 hours for a daily mean.

def pm_annual_mean(hourly, min_hours_daily=18):
    daily = hourly.resample('D').mean()
    daily[hourly.resample('D').count() < min_hours_daily] = np.nan
    return daily.groupby(daily.index.year).mean()

3.3 PM₂.₅ 24-hour¶

Regulation: 40 CFR §50.13, primary and secondary standard 35 µg/m³ Form: 98^th percentile of 24-hour mean concentrations per year, averaged over 3 years.

def pm25_24hr_p98(hourly, min_hours_daily=18):
    daily = hourly.resample('D').mean()  # with completeness filter
    return daily.groupby(daily.index.year).quantile(0.98)

3.4 PM₁₀ 24-hour¶

Regulation: 40 CFR §50.6, standard 150 µg/m³ Form: Count of days where the 24-hour mean exceeds 150 µg/m³; not to exceed an average of 1 per year over 3 years. Output: Exceedance count per year (value = integer, units = "count").

3.5 CO (1-hour and 8-hour)¶

Regulation: 40 CFR §50.8 - 1-hour: 35 ppm, not to exceed more than once per year - 8-hour: 9 ppm, not to exceed more than once per year

Pipeline emits the annual maximum for both metrics; downstream analysis can count exceedances.

3.6 SO₂ 1-hour¶

Regulation: 40 CFR §50.17, standard 75 ppb Form: 99^th percentile of daily maximum 1-hour concentrations per year, averaged over 3 years.

def so2_1hr_p99(hourly, min_hours_daily=18):
    daily_max = hourly.resample('D').max()  # with completeness filter
    return daily_max.groupby(daily_max.index.year).quantile(0.99)

3.7 NO₂ 1-hour and annual¶

Regulation: 40 CFR §50.11 - 1-hour: 100 ppb, 98^th percentile of daily max 1-hr averaged over 3 years - Annual: 53 ppb arithmetic mean

Only parameter code 42602 (NO₂) is used; NO (42601) and NOx (42603) are stored in the parquet but excluded from NAAQS metrics.

Completeness rules¶

Two completeness thresholds are enforced, consistent with EPA guidance:

Window	Minimum hours	Rationale
8-hour rolling average	6 of 8	40 CFR §50 App. I
24-hour daily mean	18 of 24	40 CFR §50 App. N (PM₂.₅), §50 App. S (SO₂/NO₂)

Both thresholds are configurable in config.yaml:

data_quality:
  hourly_completeness_threshold: 0.75
  daily_completeness_threshold:  0.75
  ozone_8hr_min_hours:           6
  pm_daily_min_hours:           18

Daily validity flag: pollutant_daily.valid_day is true when completeness_pct ≥ 0.75. Invalid days are retained in the output but excluded from monthly rollups.

4. Haversine nearest-station pairing¶

Problem¶

Pollutant sites and weather stations are not co-located. The project has 41 active AQ sites and 15 weather stations — a many-to-one relationship where each AQ site needs to be paired to its nearest weather station for joint analysis.

Input coordinates¶

Source	Sites	Coverage
`01_Data/Reference/enhanced_monitoring_sites.csv`	29	EPA + 2 TCEQ with AQS-verified lat/lon
`!Final Raw Data/Extra TCEQ Sites.xlsx`	18	TCEQ CAMS-registered coordinates

The two sources are merged by aqsid, with the CSV taking precedence on overlap. After merge, all 41 active pollutant sites have coordinates.

Weather station coordinates are derived from the weather parquet itself — the first (lat, lon) row per station. All 15 stations have coordinates.

Haversine formula¶

Great-circle distance between two points on a sphere of radius R:

\[ d = 2R \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) \]

where φ is latitude, λ is longitude (both in radians), R = 6371.0088 km.

Implementation in pipeline/step_05_merge_aq_weather.py:

def _haversine_km(lat1, lon1, lat2, lon2):
    R = 6371.0088
    p1, p2 = np.deg2rad(lat1), np.deg2rad(lat2)
    dp = p2 - p1
    dl = np.deg2rad(lon2) - np.deg2rad(lon1)
    a = np.sin(dp/2)**2 + np.cos(p1)*np.cos(p2)*np.sin(dl/2)**2
    return 2*R*np.arcsin(np.sqrt(a))

For each AQ site, the nearest of the 15 weather stations is chosen and the distance is stored in aq_weather_daily.distance_km. This allows downstream consumers to weight by distance in spatial interpolation or to threshold (e.g. drop pairings > 30 km when doing hyper-local analysis).

Alternative pairings (out of scope for v0.3)¶

This nearest-neighbor approach is adequate for daily-resolution analysis but has known limitations: - Does not account for elevation differences - Does not handle coastal/inland gradients (e.g. Corpus Christi coast vs. West) - Does not leverage MERRA-2 or NLDAS reanalysis grids

The project plans to use spatial interpolation (kriging, IDW) for weather fields downstream of this pipeline. The distance_km field exposed in aq_weather_daily is the handoff point.

5. Season assignment¶

Months are mapped to meteorological seasons using the standard 3-month grouping:

Season	Months	Abbreviation
Winter	Dec, Jan, Feb	DJF
Spring	Mar, Apr, May	MAM
Summer	Jun, Jul, Aug	JJA
Fall	Sep, Oct, Nov	SON

Implemented in step_01_build_pollutant_store.py::_SEASON_MAP.

6. Regulatory references and source documents¶

The following documents define the NAAQS calculations, unit conventions, and completeness rules used by the pipeline. These are the authoritative sources — not self-citations of this project.

40 CFR Part 50 — National Primary and Secondary Ambient Air Quality Standards. Code of Federal Regulations. https://www.ecfr.gov/current/title-40/chapter-I/subchapter-C/part-50
U.S. EPA, Air Quality System (AQS) — Data Mart. https://www.epa.gov/aqs
EPA NAAQS Design Value Methodology — U.S. EPA Office of Air Quality Planning and Standards, Technical Reports on NAAQS Compliance. 2024.
TCEQ TAMIS — Texas Commission on Environmental Quality, Texas Air Monitoring Information System. https://www17.tceq.texas.gov/tamis/
OpenWeather — Historical Hourly Weather API documentation. https://openweathermap.org/api/one-call-3
Rothfusz, L.P. (1990) — The Heat Index Equation. NWS Technical Attachment SR 90-23.
Haversine formula — derived from Sinnott, R.W. "Virtues of the Haversine." Sky and Telescope, 68(2):159, 1984.