16 — Analysis Project Timeline & Deliverables¶

Goal: Complete all data analysis for the South Texas Air Quality manuscript by August 1, 2026, ready for writing.

Team: Aidan Meyers (AM) · Manassa Kuchavaram (MK) · PI: Dr. Rajesh Melaram

Timeline: May 1 – August 1, 2026 (~13 weeks)

Tools: Google Colab (primary) → Neon Postgres (SQL queries) → pipeline parquet store (local heavy lifting)

Phase overview¶

%%{init: {'theme':'base','themeVariables':{
    'fontFamily':'Arial',
    'fontSize':'14px',
    'primaryColor':'#FFFFFF',
    'primaryTextColor':'#213c4e',
    'primaryBorderColor':'#213c4e',
    'lineColor':'#6b7a85',
    'sectionBkgColor':'#F5F7F9',
    'taskBkgColor':'#345372',
    'taskTextColor':'#FFFFFF',
    'taskBorderColor':'#213c4e',
    'gridColor':'#CCCCCC'
}}}%%
gantt
    title South Texas AQ — Analysis Timeline (May 1 → Aug 1, 2026)
    dateFormat YYYY-MM-DD
    axisFormat %b %d

    section Phase 1 · Refresh & describe
    Final EPA + TCEQ data refresh (2025 finalized) :crit, p1a, 2026-05-01, 1w
    Descriptives + outlier flagging                :p1b, 2026-05-01, 1w

    section Phase 2 · Imputation
    Method evaluation                              :p2a, 2026-05-08, 1w
    Apply imputation + sensitivity analysis        :p2b, 2026-05-15, 1w

    section Phase 3 · Stat tests + correlations
    Hypothesis tests (imputed + raw)               :p3a, 2026-05-22, 1w
    Correlation analysis (imputed + raw)           :p3b, 2026-05-29, 1w

    section Phase 4 · NAAQS + events
    NAAQS 3-yr design values + EPA cross-check     :p4a, 2026-06-05, 1w
    Functional event + disturbance annotation      :p4b, 2026-06-12, 1w

    section Phase 5 · Modeling + PCA
    PCA + dimensionality reduction                 :p5a, 2026-06-19, 1w
    ML modeling (RF, XGBoost, kriging)             :p5b, 2026-06-26, 2w

    section Phase 6 · Validation + figures
    Cross-validation + SHAP                        :p6a, 2026-07-10, 1w
    Publication figures + tables                   :p6b, 2026-07-17, 2w

    section Milestone
    Manuscript draft ready                         :milestone, 2026-08-01, 0d

Click the Gantt chart to enlarge

Mermaid diagrams in this docs site are SVG — zoom with browser Ctrl+scroll, or right-click → "Open image in new tab" for a full-screen view. The timeline is also mirrored in the week-by-week sections below where every cell is searchable.

Week-by-week deliverables¶

Each week is collapsible. Click to expand. Use the status checkboxes in the tracker at the bottom to mark progress — edit the markdown directly on GitHub (pipeline/docs/16_project_timeline.md), commit, and the docs site rebuilds in ~2 minutes.

Week 1 — May 1–9 · Final data refresh + Descriptive statistics

This week is the hard data freeze

Pull the final EPA AQS Data Mart and TCEQ TAMIS extracts for 2025-complete. After this week, no new raw data enters the pipeline before manuscript submission. Every analysis from Phase 2 onward locks against the v0.3.5 data tranche.

Task	Lead	Deliverable
Pull EPA AQS Data Mart 2025-complete (all 13 counties, all parameters) and stage under `!Final Raw Data/EPA AQS Downloads/`	AM	Refreshed master CSV + commit reference
Pull TCEQ TAMIS 2025-complete extracts for the 14 TCEQ sites	AM	Refreshed `*.txt` files in `!Final Raw Data/TCEQ Data - Missing Sites/`
Re-run upstream reorg scripts to refresh `By_Pollutant/*.csv`	AM	Updated merged CSVs
Re-run `pipeline/run_pipeline.py` end-to-end (~30 min)	AM	Updated parquet store + Postgres
Compute per-pollutant summary statistics (mean, median, P5, P25, P75, P95, max, σ) by county and year	MK	Summary statistics table (publication-ready)
Generate box/violin plots of pollutant distributions by county, season, and year	MK	6–8 publication-quality figures
Compute diurnal profiles (mean hourly concentration by hour-of-day) for O₃, NO₂, PM₂.₅ at 5 highest-loaded sites	AM	Diurnal profile plots
Outlier detection + flagging (negative values, physically impossible spikes, zero-variance runs >24 h)	MK	Outlier report CSV + `qc_flag` column proposal
Audit hourly completeness rates per site per year — completeness heatmap (site × year)	AM	Completeness heatmap figure
Initial week 1 report posted to dashboard repo	AM + MK	First entry in week-by-week reports site

Week 2 — May 12–16 · Imputation method evaluation

Imputation is required for many of the temporal models and analyses in later weeks. We evaluate methods this week, then commit to one in Week 3.

Task	Lead	Deliverable
Build a held-out evaluation set: artificially mask 10% of observed values across pollutants/sites/seasons	AM	Held-out mask + ground truth
Evaluate linear interpolation (baseline for short gaps ≤6 h)	MK	MAE, RMSE per pollutant
Evaluate seasonal LOCF (last observation carried forward, season-aware)	MK	MAE, RMSE per pollutant
Evaluate kNN-based imputation (using site neighbors + weather)	AM	MAE, RMSE per pollutant
Evaluate multiple imputation by chained equations (MICE, `miceforest` package)	AM	MAE, RMSE per pollutant
Pick the winner per gap-length bucket (≤6 h, 6–24 h, 24–48 h, >48 h) and document rationale	MK	Methods paragraph draft (imputation subsection)
Week 2 dashboard report	AM + MK	Posted to dashboard repo

Week 3 — May 19–23 · Apply imputation + sensitivity analysis

Task	Lead	Deliverable
Apply chosen imputation methods to gaps ≤48 h across all pollutants and weather	AM	Imputed parquet store with `imputed` flag column
Apply chosen weather imputation for missing meteorological values	MK	Imputed weather parquet
Re-compute daily aggregates and NAAQS values from the imputed dataset	AM	Imputed `pollutant_daily` + `naaqs_design_values` (separate Postgres tables, e.g. `aq.pollutant_daily_imputed`)
Sensitivity analysis: how much do annual means + NAAQS design values change after imputation vs. raw NA-dropped?	MK	Before/after comparison table + figure
Document imputation approach for Methods section, including gap-length thresholds	MK	Methods paragraph (final)
Week 3 dashboard report	AM + MK	Posted

Week 4 — May 26–30 · Statistical hypothesis testing

Run all tests twice: once on the raw NA-dropped data, once on the imputed data. Differences between the two are themselves informative (often noted in Results / Discussion).

Task	Lead	Deliverable
Mann-Kendall trend tests + Sen's slope estimator for annual O₃, PM₂.₅, NO₂ at each active site (raw + imputed)	AM	Slopes + p-values per site-pollutant pair (paired tables)
Kruskal-Wallis / Dunn's post-hoc for seasonal differences in each pollutant at each county (raw + imputed)	MK	Seasonal significance tables (paired)
Paired weekday vs. weekend comparison for NO₂ and CO (traffic signal); Wilcoxon test	AM	Weekday/weekend comparison figure + test results
PM₂.₅ annual means vs. revised 9.0 µg/m³ NAAQS — formal exceedance test with confidence intervals (raw + imputed)	MK	Exceedance table
Week 4 dashboard report	AM + MK	Posted

Week 5 — June 2–6 · Correlation analysis

Task	Lead	Deliverable
Inter-pollutant correlations within each county (Pearson + Spearman, raw + imputed)	AM	Correlation matrix heatmaps
Pollutant-weather correlations (daily means vs. temp, humidity, wind, GHI, pressure) — raw + imputed	MK	Correlation matrix figures per pollutant group
Inter-site correlations for O₃ and PM₂.₅ — how well do nearby sites agree?	AM	Inter-site correlation matrix
Lagged correlation analysis: does today's weather predict tomorrow's pollutant level?	MK	Lagged correlation tables + figure
Document statistical/correlational findings for Methods + Results	AM + MK	~500-word Methods paragraph + Results outline
Week 5 dashboard report	AM + MK	Posted

Week 6 — June 9–13 · NAAQS deep dive + 3-year design values

Task	Lead	Deliverable
Compute 3-year rolling design values (formal NAAQS compliance metric) for O₃ and PM₂.₅	AM	3-year design value tables + trend figures
Compare our computed design values against EPA's published values; quantify agreement per site	MK	Cross-validation table with % difference
Identify all sites in formal nonattainment (3-yr O₃ > 0.070 ppm, 3-yr PM₂.₅ > 9.0 µg/m³) for any year	AM	Nonattainment-site catalog
Time series plots of 3-year rolling NAAQS values (2017–2025) per site	MK	Multi-panel time series
Week 6 dashboard report	AM + MK	Posted

Week 7 — June 16–20 · Functional event & disturbance annotation

For every confirmed exceedance day or unusually high pollutant episode, annotate the probable cause: wildfire smoke, Saharan dust intrusion, industrial event, refinery flare, port activity, holiday-related (e.g. fireworks PM₂.₅), etc.

Task	Lead	Deliverable
Build a candidate-events list: every site-day where any pollutant exceeded its NAAQS or local 95^th percentile	AM	Candidate events table (CSV)
Cross-reference with NIFC wildfire database, NOAA Saharan dust forecasts, TCEQ industrial incident reports	MK	Annotated events catalog with probable cause column
Build a "typical exceedance day" weather profile (temp, wind, mixing height) by event type	AM	Profile table + figure
Document the events catalog as a supplementary table for the manuscript	MK	Final supplementary table draft
Week 7 dashboard report	AM + MK	Posted

Week 8 — June 23–27 · PCA + dimensionality reduction

Task	Lead	Deliverable
PCA across the daily pollutant + weather feature space — how many components explain 80%/95% of variance?	AM	Scree plot + variance-explained figure
Identify pollutant groupings via the loading matrix (which pollutants co-vary?)	MK	Loading-matrix heatmap
Site-level PCA biplot — cluster sites by pollutant signature	AM	Biplot figure with site labels
Optional: NMF or t-SNE for non-linear comparison	MK	Comparison figure if results warrant
Week 8 dashboard report	AM + MK	Posted

Week 9 — June 30 – July 4 · ML modeling part 1 (Random Forest + XGBoost)

Task	Lead	Deliverable
Feature engineering: build a daily modeling dataset (pollutant targets + weather + calendar features)	AM	Feature parquet
Train Random Forest regressors for daily O₃ and PM₂.₅ at each site	MK	RF performance table (R², RMSE, MAE per site)
Train XGBoost on the same targets; compare to RF	AM	Model comparison table
Feature importance analysis (RF + XGBoost) — which weather variables dominate?	MK	Importance bar charts per pollutant
Week 9 dashboard report	AM + MK	Posted

Week 10 — July 7–11 · ML modeling part 2 (kriging + spatial interpolation)

Task	Lead	Deliverable
Implement ordinary kriging for annual O₃ across the 13-county area (`pykrige`)	AM	Kriged surface map
Implement IDW as comparison baseline	MK	IDW surface map
Cross-validation: leave-one-out for both methods — which has lower error?	AM	LOO-CV error table
Repeat for PM₂.₅ — assess whether sparser PM₂.₅ network limits interpolation quality	MK	PM₂.₅ kriged surface + error analysis
Week 10 dashboard report	AM + MK	Posted

Week 11 — July 14–18 · Model validation

Task	Lead	Deliverable
Spatial cross-validation: train on N-1 sites, predict the held-out one — assess transferability	AM	Spatial CV error table
Temporal cross-validation: train on 2015–2022, predict 2023–2024 — assess forecast skill	MK	Temporal CV forecast plots
Hyperparameter tuning (Optuna) for the best-performing model family	AM	Tuned params + improvement over baseline
SHAP analysis for model interpretability	MK	SHAP summary plots per pollutant
Week 11 dashboard report	AM + MK	Posted

Week 12 — July 21–25 · Publication figures part 1

Task	Lead	Deliverable
Finalize all time series + trend figures (consistent fonts, colors, axes, captions)	AM	Figures 1–4
Finalize all spatial maps (basemaps, scale bars, north arrows, legends)	MK	Figures 5–7
NAAQS summary heatmap (site × year, color = % of standard)	AM	Figure 8
Weather-driven analysis composite figure	MK	Figure 9
Week 12 dashboard report	AM + MK	Posted

Week 13 — July 28 – August 1 · Publication figures part 2 + Methods finalization

Task	Lead	Deliverable
Model performance comparison figure (bar chart: R², RMSE across model types and pollutants)	AM	Figure 10
SHAP / feature importance composite	MK	Figure 11
Compile Tables 1–4 (study area, summary stats, NAAQS exceedance, model performance)	AM + MK	LaTeX/Word tables
Assemble complete Methods section from weekly drafts	AM	Methods final
Assemble Results section outline with figure/table refs	MK	Results outline
Final QC pass on all figures + tables	AM + MK	Verified set
Final week 13 dashboard report + handoff checklist	AM + MK	Posted

Milestone: August 1, 2026 — Analysis Complete

All figures, tables, and Methods/Results drafts ready for manuscript assembly. Writing phase begins.

Live status tracker¶

Edit this table directly on GitHub (pipeline/docs/16_project_timeline.md) to keep the dashboard current. Push any commit and the docs site rebuilds within 2 minutes.

Phase	Week	Dates	AM	MK	Notes
1	1	May 1–9	⬜	⬜	Hard data freeze week
2	2	May 12–16	⬜	⬜	Imputation eval
2	3	May 19–23	⬜	⬜	Imputation apply
3	4	May 26–30	⬜	⬜	Hypothesis tests (raw + imputed)
3	5	June 2–6	⬜	⬜	Correlations
4	6	June 9–13	⬜	⬜	3-yr NAAQS design values
4	7	June 16–20	⬜	⬜	Event annotation
5	8	June 23–27	⬜	⬜	PCA
5	9	June 30–July 4	⬜	⬜	RF + XGBoost
5	10	July 7–11	⬜	⬜	Kriging + IDW
6	11	July 14–18	⬜	⬜	CV + SHAP
6	12	July 21–25	⬜	⬜	Figures part 1
6	13	July 28–Aug 1	⬜	⬜	Figures part 2 + Methods

Legend: ⬜ Not started · 🟡 In progress · ✅ Complete · ❌ Blocked

Weekly report dashboard (separate repository)¶

Every week, Aidan + Manassa post a structured report to a dedicated results-and-reports repository:

Repo (planned): https://github.com/AidanJMeyers/south-texas-aq-results Site: https://aidanjmeyers.github.io/south-texas-aq-results/ (to be created when Week 1 reports are ready)

Format for each weekly report¶

Each week-NN-MM-DD.md file in that repo follows this structure:

# Week N — <date range> · <phase name>

## What we did
- Bullet list of completed tasks (cross-reference timeline doc 16)

## Key results
- Headline figures, tables, statistics

## Code blocks
- The actual SQL / Python that produced the results
- Links to the Colab notebook(s) that ran them

## Decisions made / assumptions
- Anything that locks future work (e.g. "We chose MICE for imputation
  because X, Y, Z; this means the modeling phase assumes...")

## Open questions / blockers
- For PI review

## Next week preview

Why a separate repo?¶

Keeps pipeline code clean and stable in this repo (publish once)
Reports change rapidly week-to-week — separate change history
Reports site can be public without exposing the pipeline's evolution; pipeline site can stay focused on protocol documentation
Both rebuild via the same MkDocs + GitHub Pages flow
Site members navigate between them via cross-links

Future: GIS dashboard¶

Once weekly reports are flowing (after ~Week 4), Aidan will spin up a third repo for an interactive GIS dashboard — kriged surface maps, site markers with hover-to-query pollutant time series, exceedance overlays, etc. Likely tech: Leaflet + a small Python/Flask backend pulling from the Neon authenticated Data API.

Delegation philosophy¶

The split above alternates tasks so that:

Both AM and MK touch every analytical domain — no single points of failure. If one is unavailable, the other has context.
AM leans toward infrastructure-heavy tasks (data refresh, feature engineering, kriging, figure finalization, ML training) given his role as lead pipeline developer.
MK leans toward statistical/interpretive tasks (outlier reports, significance testing, model comparison, narrative drafting, event annotation) to build manuscript-ready outputs directly.
Modeling work is explicitly shared so both contribute to the ML story.

If the balance needs adjusting (e.g. coursework obligations in June), swap tasks within a week — deliverables stay the same, only the lead changes.

Phase dependency diagram¶

%%{init: {'theme':'base','themeVariables':{'fontFamily':'Arial'}}}%%
flowchart LR
    classDef phase fill:#FFFFFF,stroke:#213c4e,stroke-width:2px,color:#213c4e,font-weight:600
    classDef hard  fill:#FDEBD3,stroke:#c2410c,stroke-width:2.5px,color:#7c2d0b,font-weight:700
    classDef done  fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1b5e20,font-weight:600

    P0["Pipeline engineering<br/>(complete)"]
    P1["Week 1<br/>Refresh + Descriptives"]
    P2["Weeks 2–3<br/>Imputation"]
    P3["Weeks 4–5<br/>Tests + Correlations"]
    P4["Weeks 6–7<br/>NAAQS + Events"]
    P5["Weeks 8–10<br/>PCA + ML + Kriging"]
    P6["Weeks 11–13<br/>Validation + Figures"]
    DONE["Aug 1<br/>Manuscript-ready"]

    P0 --> P1 --> P2 --> P3 --> P4 --> P5 --> P6 --> DONE

    class P0 done
    class P1 hard
    class P2,P3,P4,P5,P6 phase
    class DONE done

Each phase depends on the previous one's outputs. If a phase runs ahead, start pulling tasks from the next. If a phase falls behind, flag it in the status tracker and discuss rebalancing with the PI.