17 — Colab + Neon Database Guide¶
The complete guide to querying the South Texas AQ pipeline data from Google Colab. Covers both direct SQL (the recommended fast path) and the Neon Data API (for lightweight HTTP-based access). Includes worked examples against every table, performance tips, and copy-paste recipes for common analytical workflows.
Database: Neon Postgres 16 · project
south-texas-aq· schemaaqPlan: Launch ($19/mo, 10 GB storage, 300 CU-hours/month included) Region: AWS us-east-1 Last updated: 2026-04-22
What's in the database¶
After the pipeline runs, the aq schema contains:
| Table | Rows | Size | Use this for |
|---|---|---|---|
aq.site_registry |
47 | <100 kB | Site metadata, network membership, status, coordinates |
aq.naaqs_design_values |
764 | 200 kB | Per-site, per-year, per-metric NAAQS compliance values |
aq.pollutant_monthly |
~11k | 2 MB | Quick monthly trends, low-resolution dashboards |
aq.pollutant_daily |
~390k | 70 MB | Daily aggregates with completeness flags (most common) |
aq.aq_weather_daily |
~390k | 130 MB | Daily pollutants joined to nearest-station weather |
aq.pollutant_hourly |
~7.7M | ~2.5 GB | Hourly resolution; diurnal cycles, episode analysis |
aq.weather_hourly |
~1.47M | ~900 MB | Hourly weather organized by original station name |
For column-level schemas see 03_data_schemas.md.
Connection method 1: Direct SQL (recommended)¶
This is the fast path — connects to Postgres directly via the binary wire protocol. Use this for any analytical workload (anything pulling more than ~100 rows back).
One-time setup in Colab¶
Step 1 — Get your connection string¶
The connection string lives in your Neon console:
https://console.neon.tech/app/projects/aged-salad-62359207 →
"Connection Details" tab → copy the URL that starts with
postgresql://...
It looks like:
Step 2 — Add it as a Colab secret¶
- In Colab, click the 🔑 key icon in the left sidebar
- Add new secret:
- Name:
AQ_POSTGRES_URL - Value: paste the connection string
- Toggle "Notebook access" ON for the notebook you're using
This stores the credential outside your notebook code so you can share the notebook freely without leaking the password.
Step 3 — Install the client (one-time per Colab runtime)¶
Step 4 — Connect¶
from google.colab import userdata
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine(
userdata.get('AQ_POSTGRES_URL'),
pool_pre_ping=True, # handles Neon auto-pause
)
# Sanity check
print(pd.read_sql("SELECT version()", engine).iloc[0, 0])
That's it. From this point on, every query is pd.read_sql(SQL, engine).
Performance tips¶
-
Always filter at the SQL level, not in pandas:
-
Select only the columns you need.
pollutant_hourlyhas 21 columns; if you only need 4, name them explicitly. -
Use
LIMITwhile exploring to avoid pulling millions of rows by accident. -
Neon auto-pauses after 5 min idle. Your first query after a long pause takes ~500 ms to wake the compute.
pool_pre_ping=Truemakes this transparent. -
For huge result sets (>1M rows), consider
chunksize:
Connection method 2: Neon Data API (HTTP / REST)¶
Useful for lightweight integrations — dashboards, web apps, serverless functions — anywhere you don't want to ship a Postgres driver. Slower than direct SQL for large pulls because everything serializes to JSON. Two access modes: anonymous (public reads) and authenticated (login-gated, with audit trail and higher rate limits).
How the Data API roles work¶
Neon's Data API uses PostgREST under the hood, with these built-in Postgres roles:
| Role | Purpose | Activated by |
|---|---|---|
anonymous |
Public reads, rate-limited | No header / no JWT |
authenticated |
Logged-in reads, audited, higher limits | Valid JWT in Authorization: Bearer <token> |
authenticator |
Login role the API uses to switch into the above | Internal — never used directly |
The pipeline grants SELECT on all aq.* tables to both roles, so
either path works for read queries. The authenticated path additionally
gives you per-user audit logs, higher request quotas, and a clean upgrade
path if you later add Row-Level Security or write-enabled tables.
Live API endpoints for this project¶
| Purpose | URL |
|---|---|
| Data API (PostgREST) | https://ep-muddy-star-ant3mvxo.apirest.c-6.us-east-1.aws.neon.tech/neondb/rest/v1 |
| Auth (login / sign-up / JWT issuer) | https://ep-muddy-star-ant3mvxo.neonauth.c-6.us-east-1.aws.neon.tech/neondb/auth |
| Direct Postgres | (in Neon console → Connection Details — never put this in client-side code) |
Compute endpoint ID: ep-muddy-star-ant3mvxo · Region: AWS us-east-1 · Database: neondb · Schema: aq
Anonymous mode — quick public queries¶
No headers needed. Any HTTP client can hit it:
import requests, pandas as pd
API_URL = "https://ep-muddy-star-ant3mvxo.apirest.c-6.us-east-1.aws.neon.tech/neondb/rest/v1"
# Get all NAAQS exceedances for 2023 — PostgREST query syntax
resp = requests.get(
f"{API_URL}/aq/naaqs_design_values",
params={
"year": "eq.2023",
"exceeds": "is.true",
"select": "aqsid,site_name,county_name,metric,value",
"order": "value.desc",
},
headers={"Accept": "application/json"},
)
df = pd.DataFrame(resp.json())
print(df)
PostgREST query operators you'll use most:
eq.<val>, neq.<val>, gt.<val>, gte.<val>, lt.<val>, lte.<val>,
in.(<a>,<b>,<c>), is.null, is.true, like.*pattern*,
ilike.*pattern* (case-insensitive). Combine with & in the URL.
Authenticated mode — login-gated queries¶
This is the recommended path for any real downstream use (notebooks shared across the lab, dashboards, internal apps). You get audit trails, higher rate limits, and the ability to lock specific tables to authenticated users only later if needed.
Step 1 — Sign up / sign in to Neon Auth¶
This project's auth is configured with two providers:
- Google OAuth (one-click sign-in — recommended for
@tamucc.eduaccounts) - Email + password (no email verification required, instant signup)
The hosted login page is at:
https://ep-muddy-star-ant3mvxo.neonauth.c-6.us-east-1.aws.neon.tech/neondb/auth
Open it, sign in via Google or create an email/password account, and the page returns a session JWT (good for ~24 hours by default).
If you want to drive the auth flow programmatically — e.g. from a Colab notebook — use the same Auth URL with the Better Auth REST endpoints:
import requests
AUTH_URL = "https://ep-muddy-star-ant3mvxo.neonauth.c-6.us-east-1.aws.neon.tech/neondb/auth"
# Email + password sign-in (returns a session containing a JWT)
resp = requests.post(
f"{AUTH_URL}/sign-in/email",
json={"email": "you@tamucc.edu", "password": "your_password"},
)
session = resp.json()
jwt = session["token"] # use this in subsequent API requests
Step 2 — Use the JWT in Colab¶
from google.colab import userdata
import requests, pandas as pd
API_URL = "https://ep-muddy-star-ant3mvxo.apirest.c-6.us-east-1.aws.neon.tech/neondb/rest/v1"
# Store the JWT as a Colab secret named AQ_NEON_JWT, then:
JWT = userdata.get('AQ_NEON_JWT')
headers = {
"Accept": "application/json",
"Authorization": f"Bearer {JWT}",
}
resp = requests.get(
f"{API_URL}/aq/pollutant_daily",
params={
"aqsid": "eq.480290052",
"pollutant_group": "eq.Ozone",
"valid_day": "is.true",
"select": "date_local,mean,max,n_hours",
"order": "date_local.asc",
"limit": "1000",
},
headers=headers,
)
resp.raise_for_status()
df = pd.DataFrame(resp.json())
print(f"Got {len(df)} rows for Camp Bullis ozone")
df.head()
Step 3 — Re-using the JWT across notebooks¶
Once stored as a Colab secret, every notebook in the project that
enables that secret can read it via userdata.get('AQ_NEON_JWT'). No
re-login needed until the token expires (~24 h, configurable in Neon Auth
settings).
When the JWT expires, the API returns 401 Unauthorized — re-run the
sign-in flow above to mint a new one.
Step 4 — Programmatic refresh (advanced, optional)¶
For long-running notebooks or scheduled jobs, build a short helper that hits the Better Auth refresh endpoint and rotates the JWT before it expires. This is overkill for one-off Colab analysis but useful if you're building a daily-refresh dashboard. Patterns at https://www.better-auth.com/docs/concepts/session-management.
Helper: Python wrapper for the authenticated Data API¶
Drop this in any Colab notebook to get a clean query() function:
from google.colab import userdata
import requests
import pandas as pd
class NeonDataAPI:
"""Tiny PostgREST client for the South Texas AQ data API."""
def __init__(self, base_url: str, jwt: str | None = None):
self.base_url = base_url.rstrip("/")
self.headers = {"Accept": "application/json"}
if jwt:
self.headers["Authorization"] = f"Bearer {jwt}"
def query(self, table: str, **filters) -> pd.DataFrame:
"""Run a PostgREST GET against `aq.<table>` with given filters.
Special params: select, order, limit, offset.
Filter syntax: column='eq.value' (or any PostgREST operator).
"""
url = f"{self.base_url}/aq/{table}"
resp = requests.get(url, params=filters, headers=self.headers)
resp.raise_for_status()
return pd.DataFrame(resp.json())
# Usage:
api = NeonDataAPI(
base_url="https://ep-muddy-star-ant3mvxo.apirest.c-6.us-east-1.aws.neon.tech/neondb/rest/v1",
jwt=userdata.get('AQ_NEON_JWT'),
)
# Example: Bexar County ozone exceedances in 2024
exceed = api.query(
"naaqs_design_values",
county_name="eq.Bexar",
metric="eq.ozone_8hr_4th_max",
year="eq.2024",
exceeds="is.true",
select="aqsid,site_name,value",
order="value.desc",
)
print(exceed)
When to choose Data API vs. direct SQL¶
| Use Data API (HTTP) when… | Use direct SQL (psycopg) when… |
|---|---|
| Building a public web dashboard | Doing heavy analysis in Colab |
| Lightweight serverless function (no driver install) | Pulling >10 k rows |
| Need per-user audit trail (auth mode) | Need joins, CTEs, window functions |
| Sharing a query with non-Python collaborators | Reproducing NAAQS calculations |
| Access from a tool that only speaks HTTP | Generating manuscript figures |
For day-to-day analysis → direct SQL. For end-user-facing dashboards or apps → authenticated Data API.
Why authenticated > anonymous (even for read-only data)¶
Three concrete reasons:
- Audit trail. Every authenticated request is tied to a Neon Auth
user ID. You can query
neon_auth.sessionto see which lab member accessed what, when. Anonymous requests are anonymous — you only get IP addresses. - Higher rate limits. Neon caps anonymous requests aggressively (a few hundred per minute per IP). Authenticated users get thousands per minute.
- Forward-compatible. If you later add tables that need write
access (e.g. a
notestable for analyst annotations) or sensitive tables locked behind RLS, the auth flow is already in place — no user-facing migration.
Worked examples¶
Example 1 — Hourly diurnal ozone profile at Camp Bullis (2023)¶
sql = """
SELECT EXTRACT(HOUR FROM datetime)::int AS hour_of_day,
AVG(sample_measurement) AS mean_o3_ppm,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sample_measurement) AS median_o3,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY sample_measurement) AS p95_o3
FROM aq.pollutant_hourly
WHERE aqsid = '480290052'
AND pollutant_group = 'Ozone'
AND year = 2023
AND sample_measurement IS NOT NULL
GROUP BY hour_of_day
ORDER BY hour_of_day
"""
diurnal = pd.read_sql(sql, engine)
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(diurnal.hour_of_day, diurnal.mean_o3_ppm, marker='o', label='Mean')
ax.plot(diurnal.hour_of_day, diurnal.median_o3, marker='s', label='Median')
ax.plot(diurnal.hour_of_day, diurnal.p95_o3, marker='^', label='95th %ile')
ax.set_xlabel('Hour of day (local)')
ax.set_ylabel('Ozone (ppm)')
ax.set_title('Diurnal Ozone Profile — Camp Bullis 2023')
ax.legend()
ax.grid(alpha=0.3)
ax.set_xticks(range(0, 24, 3))
plt.tight_layout()
Expected pattern: Concentration rises after sunrise as photochemistry kicks in, peaks mid-afternoon (1–4 PM), declines through the evening as NOx scavenging dominates. Classic diurnal curve.
Example 2 — Weather time series for one station (March 2024)¶
sql = """
SELECT datetime_local::timestamp AS ts,
temp_c, humidity, wind_speed, wind_deg, pressure
FROM aq.weather_hourly
WHERE location = 'SA Northwest'
AND year = 2024
AND month = 3
ORDER BY ts
"""
wx = pd.read_sql(sql, engine, parse_dates=['ts'])
wx.set_index('ts', inplace=True)
import matplotlib.pyplot as plt
fig, axes = plt.subplots(4, 1, figsize=(12, 9), sharex=True)
wx['temp_c'].plot(ax=axes[0], color='#c2410c'); axes[0].set_ylabel('Temp (°C)')
wx['humidity'].plot(ax=axes[1], color='#0c6e8a'); axes[1].set_ylabel('RH (%)')
wx['wind_speed'].plot(ax=axes[2], color='#213c4e'); axes[2].set_ylabel('Wind (m/s)')
wx['pressure'].plot(ax=axes[3], color='#345372'); axes[3].set_ylabel('Pressure (hPa)')
fig.suptitle('SA Northwest weather — March 2024')
plt.tight_layout()
Example 3 — Episode analysis: highest PM₂.₅ days¶
sql = """
WITH daily_max AS (
SELECT date_local, county_name, site_name,
MAX(sample_measurement) AS peak_pm25
FROM aq.pollutant_hourly
WHERE pollutant_group = 'PM2.5'
AND year = 2024
AND sample_measurement IS NOT NULL
GROUP BY date_local, county_name, site_name
)
SELECT date_local, county_name, site_name, ROUND(peak_pm25::numeric, 1) AS peak_pm25
FROM daily_max
ORDER BY peak_pm25 DESC
LIMIT 20
"""
episodes = pd.read_sql(sql, engine)
print(episodes.to_string(index=False))
For each high-PM₂.₅ day, follow up by looking at hourly PM₂.₅ + weather together to see whether it was a wildfire smoke plume, Saharan dust, local stagnation, etc.
Example 4 — Pull hourly pollutant + weather for a single site-day (regression-ready)¶
sql = """
SELECT
p.datetime,
p.sample_measurement AS o3_ppm,
w.temp_c, w.humidity, w.wind_speed, w.wind_u, w.wind_v
FROM aq.pollutant_hourly p
JOIN aq.weather_hourly w
ON w.location = 'SA Northwest'
AND w.datetime_local::timestamp = p.datetime
WHERE p.aqsid = '480290052'
AND p.pollutant_group = 'Ozone'
AND p.date_local BETWEEN '2024-08-01' AND '2024-08-31'
ORDER BY p.datetime
"""
df = pd.read_sql(sql, engine, parse_dates=['datetime'])
# Quick correlation check
print(df.corr(numeric_only=True)['o3_ppm'].round(3))
Note: This joins on (weather_station, datetime) — the weather
station name is hardcoded here for the example. To do this for every
site automatically, join via aq.site_registry to look up each AQ
site's nearest weather station.
Example 5 — Reproduce the NAAQS 8-hr ozone calculation from raw hourly¶
sql = """
WITH ozone AS (
SELECT aqsid, datetime, sample_measurement
FROM aq.pollutant_hourly
WHERE aqsid = '480290052'
AND pollutant_group = 'Ozone'
AND year = 2023
AND sample_measurement IS NOT NULL
),
rolling_8hr AS (
SELECT aqsid, datetime,
AVG(sample_measurement) OVER (
ORDER BY datetime
ROWS BETWEEN 7 PRECEDING AND CURRENT ROW
) AS avg_8hr,
COUNT(sample_measurement) OVER (
ORDER BY datetime
ROWS BETWEEN 7 PRECEDING AND CURRENT ROW
) AS hrs_in_window
FROM ozone
),
daily_max AS (
SELECT DATE(datetime) AS day, MAX(avg_8hr) AS daily_8hr_max
FROM rolling_8hr
WHERE hrs_in_window >= 6 -- 75% completeness rule
GROUP BY DATE(datetime)
)
SELECT day, ROUND(daily_8hr_max::numeric, 4) AS daily_max_8hr_ppm
FROM daily_max
ORDER BY daily_8hr_max DESC
LIMIT 5
"""
top5 = pd.read_sql(sql, engine)
print(top5)
print(f"\n4th-highest = {top5.daily_max_8hr_ppm.iloc[3]} ppm")
Cross-check this against aq.naaqs_design_values —
the value should match what step_03_compute_naaqs.py computed.
Example 6 — Multi-site monthly heatmap¶
sql = """
SELECT site_name, year_month,
monthly_mean::float AS pm25_ugm3
FROM aq.pollutant_monthly
WHERE pollutant_group = 'PM2.5'
AND county_name = 'Bexar'
AND year_month BETWEEN '2022-01' AND '2024-12'
ORDER BY site_name, year_month
"""
df = pd.read_sql(sql, engine)
pivot = df.pivot(index='site_name', columns='year_month', values='pm25_ugm3')
import seaborn as sns, matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(14, 6))
sns.heatmap(pivot, cmap='RdYlBu_r', cbar_kws={'label': 'PM₂.₅ (µg/m³)'}, ax=ax)
ax.set_title('Bexar County PM₂.₅ Monthly Means 2022–2024')
plt.tight_layout()
Example 7 — Summary table for the manuscript¶
sql = """
SELECT
pollutant_group,
COUNT(DISTINCT aqsid) AS n_sites,
COUNT(DISTINCT EXTRACT(YEAR FROM datetime)::int) AS n_years,
COUNT(*) AS n_observations,
ROUND(AVG(sample_measurement)::numeric, 4) AS mean,
ROUND(MIN(sample_measurement)::numeric, 4) AS min,
ROUND(MAX(sample_measurement)::numeric, 4) AS max,
ROUND(STDDEV(sample_measurement)::numeric, 4) AS sd
FROM aq.pollutant_hourly
WHERE sample_measurement IS NOT NULL
GROUP BY pollutant_group
ORDER BY pollutant_group
"""
manuscript_table = pd.read_sql(sql, engine)
manuscript_table.to_csv('table_pollutant_summary.csv', index=False)
print(manuscript_table.to_string(index=False))
This is exactly the kind of table that goes straight into a Methods or Results section. Save once, paste anywhere.
Example 8 — Wind-conditional ozone (pollution rose data)¶
sql = """
SELECT
CASE
WHEN w.wind_deg < 22.5 OR w.wind_deg >= 337.5 THEN 'N'
WHEN w.wind_deg < 67.5 THEN 'NE'
WHEN w.wind_deg < 112.5 THEN 'E'
WHEN w.wind_deg < 157.5 THEN 'SE'
WHEN w.wind_deg < 202.5 THEN 'S'
WHEN w.wind_deg < 247.5 THEN 'SW'
WHEN w.wind_deg < 292.5 THEN 'W'
WHEN w.wind_deg < 337.5 THEN 'NW'
END AS wind_dir,
AVG(p.sample_measurement) AS mean_o3_ppm,
COUNT(*) AS n
FROM aq.pollutant_hourly p
JOIN aq.weather_hourly w
ON w.location = 'SA Northwest'
AND w.datetime_local::timestamp = p.datetime
WHERE p.aqsid = '480290052'
AND p.pollutant_group = 'Ozone'
AND w.wind_speed > 1.0 -- ignore calm-wind hours
GROUP BY wind_dir
ORDER BY mean_o3_ppm DESC
"""
rose = pd.read_sql(sql, engine)
print(rose)
Cost and quota — what you'll actually pay¶
Your project is on the Launch plan ($19/month base):
| Resource | Included | Overage |
|---|---|---|
| Storage | 10 GB | $1.50/GB-month |
| Compute | 300 CU-hours/month | $0.16/CU-hour |
| Data transfer | unlimited | — |
What is a CU?¶
A Compute Unit is roughly "one vCPU + 4 GB RAM running for one hour." Your project autoscales between 0.25 CU and 2 CU depending on demand, and suspends to 0 CU after 5 minutes idle.
Real-world cost examples¶
| Activity | CU-hours used |
|---|---|
| 100 short queries (each finishes in seconds) | <0.5 |
| One 10-minute Colab analysis session | ~0.3 |
| Loading the full hourly tables (one-time pipeline run) | ~2 |
| A team of 4 doing 2 hours of analysis each, daily for a month | ~120 |
Bottom line: Typical research use lands at \(0–5/month in compute** (well within the 300 CU-hour allotment), so you're effectively paying **\)19/month flat for everything.
Monitoring usage¶
Check the Neon console → Billing → Usage tab any time. The hourly breakdown shows exactly when compute is spent.
Troubleshooting¶
psycopg.OperationalError: FATAL: password authentication failed¶
The Colab secret value is wrong. Double-check you copied the full URL
including the password (the long npg_... string after the : in
username:password@host).
First query after a few minutes is slow¶
Expected — Neon auto-pauses after 5 min idle. The first query wakes
the compute (~500 ms), subsequent queries are fast. pool_pre_ping=True
handles this transparently in SQLAlchemy.
SSL connection has been closed unexpectedly¶
Pipeline is hammering the database (large INSERT batches). Wait a
few seconds and retry. If you're running queries from Colab while the
pipeline is loading, you'll see this occasionally.
relation "aq.pollutant_hourly" does not exist¶
The hourly tables haven't been loaded yet. Run the pipeline:
python pipeline/run_pipeline.py --only 07. Until then, only the
five aggregate tables (daily, monthly, NAAQS, combined, registry) exist.
Query is taking >30 seconds¶
Either you're missing a WHERE clause and pulling millions of rows, or
the query needs an index. The hourly tables are indexed on
(aqsid, date_local, pollutant_group, year) for pollutants and
(location, year, date_local) for weather — make sure your WHERE
clauses use one of those columns.
column "..." does not exist¶
Schema reference is in 03_data_schemas.md.
Watch out for: date_local is text, not date; cast with ::date for
date arithmetic. sample_measurement units depend on pollutant_group
(see schema doc).
See also¶
- 03_data_schemas.md — Complete column-by-column schemas
- 10_usage_sql.md — Original SQL guide (daily aggregates only)
- 09_usage_colab.md — Generic Colab quickstart
- 15_recipes.md — Worked recipes for common research tasks
- 16_project_timeline.md — Week-by-week analysis plan