Federation Scraper - Automated Match Data for KnowMore
Published on Mar 1, 2026
KnowMore imports match schedules from the FVRZ, the regional football federation for the Zurich area. In the early versions, this was a simple CSV fetch inside PocketBase. That worked for schedules, but we wanted more — live results, league standings, match lineups, player statistics, and tournament data. That meant building a dedicated scraping service.
This post covers the architecture of the Federation Scraper: how it ingests data from the FVRZ matchcenter, how it connects to KnowMore via webhooks and SSE, and the design decisions behind it.
Table of Contents
- Why a Separate Service
- The Stack
- Two Ingestion Paths
- Data Model
- The Sync Pipeline
- Real-Time: Postgres LISTEN/NOTIFY to SSE
- The API
- Match Detail: The Telegramm Parser
- Club Discovery
- Deployment
- What’s Next
Why a Separate Service
The matchcenter at matchcenter.fvrz.ch is a classic ASP.NET application. Some data is available as CSV exports (schedules), but most of the interesting content — standings tables, match details with lineups, goal events — is rendered server-side in HTML that requires JavaScript execution.
PocketBase (KnowMore’s backend) is a Go binary. Embedding a headless browser inside it would be fragile and resource-heavy. Instead, the scraper runs as a standalone FastAPI service alongside KnowMore, pushing updates via webhooks.
The scraper owns the federation data — clubs, teams, matches, standings, player stats. KnowMore receives match updates via a webhook endpoint and can also query the scraper’s REST API directly for standings, results, and match details.
The Stack
| Layer | Technology | Why |
|---|---|---|
| API | FastAPI (async) | Native async, Pydantic validation, auto-generated OpenAPI docs |
| Database | PostgreSQL 17 | Relational data with LISTEN/NOTIFY for real-time events |
| Browser | Playwright (Firefox) | JavaScript rendering for ASP.NET matchcenter pages |
| Scheduling | APScheduler | Cron-based sync with per-team refresh intervals |
| Logging | Structured JSON + Grafana Loki | Centralized observability across services |
Everything runs in Docker. Four containers: the scraper app, PostgreSQL, a pg-backup sidecar (daily pg_dump at 22:00 Swiss time with 7-day retention), and pgAdmin for database inspection. An earlier version used Valkey (Redis fork) for response caching, but PostgreSQL with proper indexes handles all queries in low single-digit milliseconds — the extra caching layer added complexity without meaningful benefit.
Two Ingestion Paths
Not all data requires a headless browser. The matchcenter exposes some data as CSV exports — fast and reliable. Other data is only available as rendered HTML. The scraper uses both paths depending on what it needs.
Fast Path: CSV
The matchcenter has an undocumented SOAP-like endpoint that returns semicolon-delimited CSV with match schedules:
GET /portaldata/1/nisrd/WebService/verein/calendar.asmx/Verein?v={club_id}&format=csv
This returns all matches for a club in a single HTTP request — teams, dates, times, venues, scores, match types. The CSV fetcher parses each row, classifies the match type (league, friendly, cup, tournament), and separates regular matches from tournament entries.
Tournament rows get special treatment. The description field contains structured text like "Organisator: FC Büsingen, Teams: FC A, FC B, FC C" which the parser extracts into organizer and participant lists.
The CSV path handles ~90% of schedule data and runs in under a second per club.
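The organizer/participant extraction can be sketched with two regexes. This is a minimal illustration of the idea, not the service's actual parser — the function name and the exact patterns are assumptions, and real description fields may vary in shape, so both extractions are optional.

```python
import re

def parse_tournament_description(text: str) -> dict:
    """Pull organizer and participant list out of a tournament description.

    Assumes the 'Organisator: ..., Teams: A, B, C' shape; hypothetical
    helper, real rows may deviate, so both fields degrade gracefully.
    """
    organizer = None
    teams: list[str] = []

    m = re.search(r"Organisator:\s*([^,]+)", text)
    if m:
        organizer = m.group(1).strip()

    m = re.search(r"Teams:\s*(.+)$", text)
    if m:
        teams = [t.strip() for t in m.group(1).split(",") if t.strip()]

    return {"organizer": organizer, "teams": teams}

info = parse_tournament_description(
    "Organisator: FC Büsingen, Teams: FC A, FC B, FC C"
)
```

A row without a `Teams:` segment simply yields an empty participant list, which keeps the CSV fetcher robust against partially filled descriptions.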
Slow Path: Playwright
For standings, match results with detailed scores, and match reports (the “Telegramm” page), the scraper launches Firefox via Playwright. The matchcenter renders these pages with JavaScript, so httpx can’t reach the data.
_browser = await _playwright.firefox.launch(headless=False, args=["--no-remote"])

Firefox runs with headless=False through an Xvfb virtual framebuffer. This is intentional — real Firefox rendering (not headless mode) is harder for anti-bot systems to detect. The browser also injects an anti-detection script that removes the navigator.webdriver property.
A semaphore limits concurrency to one page at a time. Each request gets a fresh browser context with Swiss locale and timezone, then a 2-second cooldown between requests. This is slow but respectful — the matchcenter is a community resource, not a target.
Data Model
The scraper maintains a normalized relational model:
Matches are the central entity, linked to clubs and teams on both sides via junction tables (match_clubs, match_teams). This normalized structure replaced earlier denormalized columns — team data is now resolved at query time via JOINs.
Key design decisions:
- match_number as natural key: The federation assigns a unique Spielnummer to every match. This is the upsert key, making re-syncs idempotent.
- Two-sided junctions: Both match_clubs and match_teams carry a side column ('home' or 'away'), using native PostgreSQL ENUM types for type safety.
- Team categories: Each team carries a category field parsed from its name during club discovery. Junior teams are classified into age groups (A through G, with pitch-size variants like D-7, D-9, E/FF-11), while adult teams are bucketed into Aktive, Senioren, or Frauen. This enables frontend filtering — show only E-category teams, or only senior squads.
- Computed player stats: player_season_stats are recomputed from match lineups and events after each match detail sync. Minutes played are calculated from entry/exit events — starters enter at minute 0, substitutes at their sub-in minute, everyone exits at 90 or at their sub-out/red-card minute.
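The minutes-played rule is simple enough to state directly in code. A minimal sketch, assuming a hypothetical per-player dict shape (`started`, `sub_in_minute`, `sub_out_minute`, `red_card_minute`) rather than the service's real ORM models:

```python
def minutes_played(player: dict, match_length: int = 90) -> int:
    """Minutes on the pitch, following the entry/exit rules above.

    Starters enter at 0, substitutes at their sub-in minute; everyone
    exits at full time or at their sub-out / red-card minute.
    """
    if player.get("started"):
        entry = 0
    elif player.get("sub_in_minute") is not None:
        entry = player["sub_in_minute"]
    else:
        return 0  # unused substitute never entered the pitch

    exits = [m for m in (player.get("sub_out_minute"),
                         player.get("red_card_minute")) if m is not None]
    exit_minute = min(exits) if exits else match_length
    return max(0, exit_minute - entry)
```

Summing this over a season's lineups gives the minutes column of player_season_stats; appearances, goals, and cards are straightforward counts over the same event data.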
The Sync Pipeline
The scheduler runs every 60 minutes and processes subscribed clubs in two phases.
Phase 1: Club Schedule
For each club with due teams, fetch the club-level CSV once. This gives us matches for all teams in a single request, with both-side club numbers for opponent resolution.
Phase 2: Per-Team Enrichment
For each due team within the club:
1. Standings — Playwright scrapes the league table. As a side effect, this builds an opponent_lookup dictionary mapping display names to club IDs — useful for resolving opponents in the next step.
2. Team schedule — CSV fetch with opponent resolution. The lookup from standings helps link opponent clubs that might have abbreviated or differently formatted names.
3. Results — Playwright scrapes the “Partien” page for match scores and external match IDs. These IDs enable the match detail endpoint to fetch lineups and events later.
Opponent resolution is surprisingly tricky. The CSV might say "FC Töss 1", the standings page says "FC Töss", and the database has "FC Töss Winterthur". The resolver tries multiple strategies: exact match on external ID, standings-derived lookup, regex name stripping (removing suffixes like " b", " 1", " U13 a"), and finally prefix matching.
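The fallback chain can be sketched as a few short strategies tried in order. This is an illustration of the approach, not the service's actual resolver: function names, the suffix regex, and the dict shapes are all assumptions.

```python
import re

def normalize_team_name(name: str) -> str:
    """Strip trailing team-number/letter and age-group suffixes
    (' b', ' 1', ' U13 a'), approximating the regex step above."""
    return re.sub(r"\s+(?:U\d+\s+)?[a-z0-9]$", "", name.strip(),
                  flags=re.IGNORECASE)

def resolve_opponent(name, standings_lookup, known_clubs,
                     by_external_id=None, external_id=None):
    """Try strategies in order: external ID, standings-derived lookup,
    suffix stripping, then prefix match against known club names."""
    if external_id and by_external_id and external_id in by_external_id:
        return by_external_id[external_id]
    if name in standings_lookup:
        return standings_lookup[name]
    stripped = normalize_team_name(name)
    if stripped in standings_lookup:
        return standings_lookup[stripped]
    for club_name, club_id in known_clubs.items():
        if club_name.startswith(stripped):
            return club_id
    return None

lookup = {"FC Töss": "club-42"}                 # from the standings scrape
clubs = {"FC Töss Winterthur": "club-42"}       # database club names
```

With these inputs, "FC Töss 1" resolves via the standings lookup after suffix stripping, and still resolves via prefix matching if the standings lookup is empty.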
Refresh Intervals
Not all teams need frequent updates. Youth teams (matching Junior(en|innen) [GFED]) sync every 24 hours — they play weekly and parents want current data. Adult teams sync every 168 hours (weekly) since their schedules change less often.
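The scheduling rule reduces to a single regex check, mirroring the pattern quoted above. A sketch (the function name is hypothetical; the real scheduler presumably stores the interval per team):

```python
import re

# Youth categories D-G, matching the pattern from the scheduling rule.
JUNIOR_RE = re.compile(r"Junior(en|innen)\s+[GFED]")

def refresh_interval_hours(team_name: str) -> int:
    """24h for youth teams in categories D-G, 168h (weekly) otherwise."""
    return 24 if JUNIOR_RE.search(team_name) else 168
```

A team named "Junioren Db" would sync daily, while "2. Mannschaft" or "Seniorinnen" fall back to the weekly interval.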
Real-Time: Postgres LISTEN/NOTIFY to SSE
When the scraper writes new match data, consumers need to know immediately. Polling is wasteful. Instead, PostgreSQL does the heavy lifting.
A trigger on the matches table fires on every INSERT or UPDATE:
CREATE FUNCTION notify_match_change() RETURNS trigger AS $$
BEGIN
PERFORM pg_notify('data_changed', jsonb_build_object(
'table', TG_TABLE_NAME,
'op', TG_OP,
'id', NEW.id
)::text);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
The payload is deliberately small — just the table name, operation, and row ID. PostgreSQL NOTIFY has an 8KB limit, and we don’t want to serialize entire match objects in the trigger. Instead, the scraper’s listener enriches the event server-side, loading the full match data with team info and venue details to produce a payload identical to the schedule API response.
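The enrichment step on the listener side is easy to show with the loader injected. This is a sketch under stated assumptions: `handle_notification` and `load_match` are hypothetical names, and in the real service this logic runs inside the LISTEN callback (for example one registered via asyncpg's add_listener), with the loader being the JOIN-heavy query that produces the schedule-API shape.

```python
import asyncio
import json

async def handle_notification(payload: str, load_match) -> dict:
    """Turn a thin NOTIFY payload into a full event.

    The trigger only sends table/op/id; the listener loads the rest
    so consumers receive a payload matching the schedule API response.
    """
    event = json.loads(payload)
    match = await load_match(event["id"])
    return {"table": event["table"], "op": event["op"], "match": match}

async def _demo() -> dict:
    async def fake_load(match_id: int) -> dict:
        # Stand-in for the real DB query with team and venue JOINs.
        return {"id": match_id, "home": "FC A", "away": "FC B"}

    payload = json.dumps({"table": "matches", "op": "INSERT", "id": 7})
    return await handle_notification(payload, fake_load)

enriched = asyncio.run(_demo())
```

Keeping the trigger payload thin and enriching in the listener also means a schema change to the match payload never requires touching the database trigger.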
Two delivery mechanisms:
- SSE endpoint (GET /api/events/stream) — browsers and KnowMore’s frontend connect here for real-time updates. No auth required, no polling.
- Webhook — the scraper POSTs the enriched event to KnowMore’s PocketBase instance, which can update its own match records and notify connected clients through PocketBase’s real-time subscriptions.
The API
The scraper exposes a REST API for on-demand queries. All public endpoints read from PostgreSQL — indexed queries return in low single-digit milliseconds. Data is populated by the background sync pipeline (CSV ingestion + Playwright enrichment), not by on-demand scraping. This keeps endpoint response times consistently fast and avoids exposing Playwright’s latency to API consumers.
Key endpoints:
| Endpoint | Source | What it returns |
|---|---|---|
| GET /api/schedule/team/{club}/{team} | CSV/DB | All matches (past + future) with team info, venue, scores |
| GET /api/results/team/{club}/{team} | Playwright/DB | Played matches with scores and match types |
| GET /api/results/match/{match_id} | Playwright/DB | Full match report: lineups, events, goals, halftime score |
| GET /api/standings/team/{club}/{team} | Playwright/DB | League table with W/D/L, goals, points |
| GET /api/stats/team/{club}/{team} | Computed | Player statistics: appearances, goals, cards, minutes |
| GET /api/team/summary/{club}/{team} | DB | Dashboard tile: last result, next match, next tournament, current rank |
| GET /api/tournaments/{club} | CSV/DB | Tournament list with participants and organizers |
Public endpoints also include a club catalog (GET /api/clubs?search=... and GET /api/clubs/{id}) for browsing the full federation directory — no auth required. The club detail response includes all teams with their age-group category, so the frontend can build filtered team pickers (e.g. only junior E teams). This lets KnowMore’s admin setup flow import clubs and teams directly from the scraper without needing the API key.
Admin endpoints (API key required) manage subscriptions, trigger manual syncs, and handle club discovery.
There’s also a GET /api/llm-docs endpoint that serves a markdown API reference optimized for LLM consumption — useful when KnowMore’s AI assistant needs to query match data.
Match Detail: The Telegramm Parser
The most complex scraping target is the match detail page (“Telegramm”). It contains:
- Competition name and match number
- Date, time, and venue
- Final score and halftime score
- Goal summary (minute, player, side)
- Full timeline of events (goals, substitutions, yellow/red cards, with running score)
- Complete lineups for both teams (starting XI, substitutes, coach, absent players)
- Player IDs, shirt numbers, positions, captain designation
The parser extracts all of this from the rendered HTML. Event types are identified by data-eid attributes on timeline elements (1=goal, 2=substitution, 3=yellow card, 4=red card, 5=yellow-red). Player names are matched against the lineup to link events to player records in the database.
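The data-eid mapping is the core of the event extraction. A sketch with a plain dict standing in for the parsed HTML attributes; the function name and attribute keys other than data-eid are assumptions, since the real parser reads them off Playwright's rendered DOM.

```python
# Event-type codes from the Telegramm timeline's data-eid attributes.
EVENT_TYPES = {
    1: "goal",
    2: "substitution",
    3: "yellow_card",
    4: "red_card",
    5: "yellow_red_card",
}

def parse_timeline_event(attrs: dict) -> dict:
    """Map one timeline element's attributes to a typed event record.

    `attrs` is a stand-in for attributes extracted from the rendered
    HTML; unknown codes fall through to 'unknown' rather than failing.
    """
    eid = int(attrs["data-eid"])
    return {
        "type": EVENT_TYPES.get(eid, "unknown"),
        "minute": int(attrs.get("minute", 0)),
        "player": attrs.get("player"),
    }

event = parse_timeline_event(
    {"data-eid": "2", "minute": "63", "player": "M. Meier"}
)
```

The player name in each event is then matched against the parsed lineup to link the event to a player record in the database.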
After parsing, the sync layer persists everything and recomputes aggregated player statistics — updating appearances, goals, cards, and minutes played across the season.
Club Discovery
Before the scraper can sync match data, it needs to know which clubs and teams exist. The discovery pipeline:
- Scrape the federation’s club listing page — extract all club IDs and names
- For each club, scrape the matchcenter — extract team IDs from navigation links and dropdowns
- Cache to JSON — save as data/federations/FVRZ.json for fast re-seeding
- Upsert to database — create or update Club and Team records
This is a one-time operation (10-30 minutes for ~200 clubs), heavily rate-limited to avoid hitting the matchcenter too hard. After discovery, the seed data is persisted as JSON so new deployments can bootstrap without re-scraping.
Club detail scraping (address, website, club colors, founding date) now runs automatically on the first sync after subscribing a club. The scheduler checks whether a club’s details have been populated — if not, it scrapes the matchcenter club page and fills in the fields. For clubs without a street address (only a postal code line on the page), the address falls back to the zip code and city. Subsequent syncs skip the detail scrape since the data is already present. A manual admin endpoint is available to force-refresh details when needed.
The bulk discovery scrape (for all ~200 clubs in a federation) is still rate-limited: 3-8 second delays between clubs, 30-60 second pauses every 10 clubs.
Deployment
The scraper runs as three Docker containers orchestrated by Docker Compose:
services:
scraper: # FastAPI + Playwright Firefox + Xvfb
ports: ["8095:8000"]
depends_on: [postgres]
postgres: # PostgreSQL 17
ports: ["5432:5432"]
pgadmin: # Database admin UI
ports: ["5050:80"]
The Dockerfile installs Xvfb for the virtual display, then Playwright’s Firefox with system dependencies. At runtime, an entrypoint script starts Xvfb on display :99, runs Alembic database migrations, then launches uvicorn with the FastAPI app.
For production deployment on Unraid, the image is built locally and transferred via docker save | scp | docker load — no Docker Hub needed. The production compose file maps different ports (8097 for the API, 5051 for pgAdmin) and uses NVMe cache paths for volumes.
KnowMore connects to the scraper via the webhook URL (NOTIFY_WEBHOOK_URL). Both services run on the same Docker host, communicating over a shared Docker network.
The PostgreSQL container includes automated daily backups. A custom Dockerfile extends postgres:17-alpine with a cron job that runs pg_dump at 22:00 Swiss time, producing compressed custom-format dumps with 7-day retention. The backup script and cron run inside the same container — an entrypoint wrapper starts crond in the background before delegating to the stock postgres entrypoint. Backup files are written to a mounted volume, so they survive container restarts and are accessible from the host for off-site copies.
Structured JSON logs in UTC are pushed to Grafana Loki for centralized monitoring. Each API request logs endpoint, source (db/scrape), duration, and relevant IDs — making it straightforward to trace a slow response back to its cause.
What’s Next
The scraper is deployed and running in production, with full frontend integration since v0.5.0. Current priorities:
- Multi-federation support — the data model supports it (Federation -> Club -> Team), but only FVRZ is configured. The schema is designed to scale to all ~13 Swiss federations (~2600 clubs, ~39K teams, ~585K matches/season)
- E-category tournament participant count — some youth tournaments report fewer participants than expected, likely a parser edge case
- Match lineup data — player-level statistics from match details for deeper analytics integration
This is part of the KnowMore series. For the overall architecture, see Building KnowMore - Architecture of a Privacy-First Sports Calendar. For the real-time data system on the PocketBase side, see KnowMore Real-Time Data Loading.