Impact data pipeline spec template
This template documents how /impact/ dashboard data is generated and maintained.
Purpose
- Keep dashboard outputs reproducible and deploy-safe.
- Define upstream data dependencies and generated runtime artifacts.
- Define minimum validation checks before publish.
Runtime consumer
- Page shell: _pages/impact.md
- Include shell: _includes/impact-dashboard.html
- Client logic: assets/js/impact-dashboard.js
- Client styles: assets/css/impact-dashboard.css
- Runtime datasets loaded by the client:
  - data/impact/impact_dashboard.json
  - data/impact/impact_reconciliation.json
- Client fetch/cache expectations:
  - JSON fetch uses default browser caching behavior in assets/js/impact-dashboard.js.
  - Include URLs for dashboard JSON/CSS/JS should use a build-time version query (for example ?v=20260220204831) to invalidate stale caches after deploy.
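The version-query convention above can be sketched in a few lines. This is only an illustration: it assumes the version is a UTC timestamp formatted as YYYYMMDDHHMMSS (inferred from the example value), and the helper names build_version and versioned_url are hypothetical. The site itself may inject the query from a template instead.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_version(now=None):
    """Return a build-time version string such as 20260220204831 (assumed format)."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%d%H%M%S")


def versioned_url(url, version):
    """Append the v query parameter to a URL, replacing any existing v value."""
    parts = urlsplit(url)
    # Keep every existing parameter except a stale v, then add the new one last.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "v"]
    query.append(("v", version))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Calling versioned_url("/assets/js/impact-dashboard.js", build_version()) once per build gives every deploy a fresh URL, so browsers refetch the asset instead of serving a cached copy.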
Build-time generator
- Script: scripts/build-impact-dashboard-data.py
- Default invocation:
  python3 scripts/build-impact-dashboard-data.py --repo-root "$ROOT_DIR" --out-dir "$ROOT_DIR/data/impact"
- Reach refresh script (API-backed; run on a controlled schedule or manual trigger):
  - scripts/build-impact-reach-data.py
  - python3 scripts/build-impact-reach-data.py --repo-root "$ROOT_DIR" --out-dir "$ROOT_DIR/data/impact/reach"
Upstream inputs
- _publications/*.md (canonical publication registry + metadata)
- _data/scholar_metrics.json (citation summary + cites-per-year series)
- _data/map_data.json (WOS citation geography points)
- data/altmetric/raw/*.csv (Altmetric mention exports)
Generated outputs
- data/impact/impact_dashboard.json
- data/impact/impact_reconciliation.json
- data/impact/exports/*.json
- data/impact/exports/*.csv
- data/impact/reach/outlet_reach.json
- data/impact/reach/outlet_reach.csv
- data/impact/reach/reach_metadata.json
- data/impact/reach/time_adjusted_mentions_reach.json
- data/impact/reach/time_adjusted_mentions_reach.csv
- data/impact/reach/time_adjusted_outlet_reach.json
- data/impact/reach/time_adjusted_outlet_reach.csv
- data/impact/reach/tranco_snapshots_used.json
- data/impact/reach/tranco_snapshots_used.csv
Dataset links in impact_dashboard.json should remain site-absolute (/data/impact/exports/...) so downloads resolve in local preview and on deployed Pages.
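A small check for the site-absolute rule might look like the sketch below. It takes a plain list of link strings, because how links are stored inside the payload is not specified here; the function name non_absolute_links is hypothetical.

```python
EXPORT_PREFIX = "/data/impact/exports/"


def non_absolute_links(links):
    """Return the links that do not start with the site-absolute export prefix."""
    return [link for link in links if not link.startswith(EXPORT_PREFIX)]
```

An empty result means every link is site-absolute. Relative forms such as data/impact/exports/... or ../exports/... are reported, since they would resolve differently depending on the page that links to them.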
Data contract: dashboard payload
- File: data/impact/impact_dashboard.json
- Required top-level keys:
  - generated_at_utc
  - description
  - metrics
  - reconciliation_counts
  - citation_series
  - donut_series
  - citation_geography
  - altmetric_geography
  - canonical_publications
  - mentions
  - outlets
  - stories
  - derived_insights
  - dataset_catalog
Minimal JSON shape
{
"generated_at_utc": "2026-01-01T00:00:00Z",
"description": "Build metadata for this dataset.",
"metrics": {},
"reconciliation_counts": {},
"citation_series": {},
"donut_series": {},
"citation_geography": {},
"altmetric_geography": {},
"canonical_publications": [],
"mentions": [],
"outlets": [],
"stories": [],
"derived_insights": {},
"dataset_catalog": []
}
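The required-key list above lends itself to an automated pre-publish check. The following is a minimal sketch rather than the builder's actual validation; the function name missing_dashboard_keys is hypothetical.

```python
import json

REQUIRED_DASHBOARD_KEYS = [
    "generated_at_utc",
    "description",
    "metrics",
    "reconciliation_counts",
    "citation_series",
    "donut_series",
    "citation_geography",
    "altmetric_geography",
    "canonical_publications",
    "mentions",
    "outlets",
    "stories",
    "derived_insights",
    "dataset_catalog",
]


def missing_dashboard_keys(payload_text):
    """Parse payload text as JSON and return the required keys that are absent."""
    payload = json.loads(payload_text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(payload, dict):
        raise ValueError("dashboard payload must be a JSON object")
    return [key for key in REQUIRED_DASHBOARD_KEYS if key not in payload]
```

A caller would read data/impact/impact_dashboard.json as text, pass it in, and fail the build if the returned list is non-empty.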
Data contract: reconciliation payload
- File: data/impact/impact_reconciliation.json
- Required top-level keys:
  - generated_at_utc
  - summary
  - scholar_unmatched
  - scholar_ignored_non_research
  - altmetric_unmatched
  - altmetric_alias_matched
  - tracked_doi_count
  - tracked_dois
Minimal JSON shape
{
"generated_at_utc": "2026-01-01T00:00:00Z",
"summary": {},
"scholar_unmatched": [],
"scholar_ignored_non_research": [],
"altmetric_unmatched": [],
"altmetric_alias_matched": [],
"tracked_doi_count": 0,
"tracked_dois": []
}
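Beyond key presence, the minimal shape above also suggests a container type for each key. The sketch below checks those types. It assumes the types shown in the minimal shape are normative, which this spec does not state explicitly, and the function name reconciliation_problems is hypothetical.

```python
RECONCILIATION_SHAPE = {
    "generated_at_utc": str,
    "summary": dict,
    "scholar_unmatched": list,
    "scholar_ignored_non_research": list,
    "altmetric_unmatched": list,
    "altmetric_alias_matched": list,
    "tracked_doi_count": int,
    "tracked_dois": list,
}


def reconciliation_problems(payload):
    """Return human-readable problems; an empty list means the shape matches."""
    problems = []
    for key, expected in RECONCILIATION_SHAPE.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
            continue
        value = payload[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {key}: expected {expected.__name__}")
    return problems
```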
Update workflows
Deploy-time generation
- Workflow file: .github/workflows/deploy_site.yml
- Required behavior: run scripts/build-impact-dashboard-data.py before jekyll build.
- Required behavior: verify committed reach datasets exist before jekyll build.
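The reach-dataset existence check could be sketched as below. The file names come from the generated outputs list in this template; the function name missing_reach_files is hypothetical, and the real workflow may perform this check in shell instead.

```python
from pathlib import Path

REACH_FILES = [
    "outlet_reach.json",
    "outlet_reach.csv",
    "reach_metadata.json",
    "time_adjusted_mentions_reach.json",
    "time_adjusted_mentions_reach.csv",
    "time_adjusted_outlet_reach.json",
    "time_adjusted_outlet_reach.csv",
    "tranco_snapshots_used.json",
    "tranco_snapshots_used.csv",
]


def missing_reach_files(repo_root):
    """Return the names of reach datasets that are absent under data/impact/reach/."""
    reach_dir = Path(repo_root) / "data" / "impact" / "reach"
    return [name for name in REACH_FILES if not (reach_dir / name).is_file()]
```

A deploy step would call missing_reach_files with the repository root and exit non-zero when the list is non-empty, so jekyll build never runs against an incomplete reach directory.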
Local preview generation
- Script: scripts/local_preview.command
- Default behavior: skips impact data regeneration for faster UI/content-only iteration.
- Data-refresh behavior: run with --with-data to execute scripts/build-impact-dashboard-data.py before local jekyll build.
- Required behavior: use committed reach datasets (do not run API-backed reach refresh).
Reach refresh generation
- Workflow file: .github/workflows/refresh_impact_reach_data.yml
- Required behavior: run scripts/build-impact-reach-data.py on schedule and manual trigger.
- Required behavior: apply conservative lookup controls (--historical-window-days, --historical-max-date-lookups, --historical-date-api-delay-ms).
- Required behavior: commit only data/impact/reach/* outputs when they changed.
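The "commit only when changed" rule can be approximated by inspecting git status before committing. The sketch below assumes git's porcelain v1 output format and uses hypothetical function names; the actual workflow may implement the check differently.

```python
import subprocess

REACH_PREFIX = "data/impact/reach/"


def changed_reach_paths(porcelain_output):
    """Extract changed paths under data/impact/reach/ from `git status --porcelain` text."""
    paths = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]  # porcelain v1: two status characters, a space, then the path
        if " -> " in path:  # renames are reported as "old -> new"
            path = path.split(" -> ", 1)[1]
        path = path.strip('"')  # git quotes paths containing special characters
        if path.startswith(REACH_PREFIX):
            paths.append(path)
    return paths


def reach_status(repo_root):
    """Run git status limited to the reach directory and return the changed paths."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", REACH_PREFIX],
        cwd=repo_root, check=True, capture_output=True, text=True,
    )
    return changed_reach_paths(result.stdout)
```

If reach_status returns an empty list, the workflow skips the commit. Otherwise it stages only the returned paths, which keeps unrelated working-tree changes out of the refresh commit.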
Upstream source refreshes
- Scholar refresh workflow: .github/workflows/fetch_scholar_data.yml updates _data/scholar_metrics.json.
- Citation geography refresh: citation_map_parser.R parses _data/map.txt to _data/map_data.json.
- Altmetric refresh: add/update CSV exports under data/altmetric/raw/.
Operational checklist
- Builder exits successfully.
- data/impact/impact_dashboard.json and data/impact/impact_reconciliation.json parse as valid JSON.
- dataset_catalog paths correspond to files present under data/impact/exports/.
- data/impact/reach/reach_metadata.json reports non-zero ranked domains when network or cache is available.
- /impact/ loads charts/tables without console errors.
- Deploy and local preview paths avoid live API refresh for reach data.
- Reach refresh workflow updates committed reach outputs on controlled cadence.
Runbook notes
- If upstream source formats change, update parser/normalization logic in scripts/build-impact-dashboard-data.py before regenerating outputs.
- If scholar/map refresh jobs fail, preserve prior valid upstream input files and rerun generation once inputs are healthy.