Federation Scraper - Automated Match Data for KnowMore

Published on Mar 1, 2026

KnowMore imports match schedules from the Zurich regional football federation (FVRZ). In the early versions, this was a simple CSV fetch inside PocketBase. That worked for schedules, but we wanted more — live results, league standings, match lineups, player statistics, and tournament data. That meant building a dedicated scraping service.

This post covers the architecture of the Federation Scraper: how it ingests data from the FVRZ matchcenter, how it connects to KnowMore via webhooks and SSE, and the design decisions behind it.

Why a Separate Service

The matchcenter at matchcenter.fvrz.ch is a classic ASP.NET application. Some data is available as CSV exports (schedules), but most of the interesting content — standings tables, match details with lineups, goal events — lives in pages that only render after client-side JavaScript runs.

PocketBase (KnowMore’s backend) is a Go binary. Embedding a headless browser inside it would be fragile and resource-heavy. Instead, the scraper runs as a standalone FastAPI service alongside KnowMore, pushing updates via webhooks.

The scraper owns the federation data — clubs, teams, matches, standings, player stats. KnowMore receives match updates via a webhook endpoint and can also query the scraper’s REST API directly for standings, results, and match details.

The Stack

| Layer | Technology | Why |
| --- | --- | --- |
| API | FastAPI (async) | Native async, Pydantic validation, auto-generated OpenAPI docs |
| Database | PostgreSQL 17 | Relational data with LISTEN/NOTIFY for real-time events |
| Browser | Playwright (Firefox) | JavaScript rendering for ASP.NET matchcenter pages |
| Scheduling | APScheduler | Cron-based sync with per-team refresh intervals |
| Logging | Structured JSON + Grafana Loki | Centralized observability across services |

Everything runs in Docker. Three containers: the scraper app, PostgreSQL (whose image bundles a daily pg_dump at 22:00 Swiss time with 7-day retention), and pgAdmin for database inspection. An earlier version used Valkey (a Redis fork) for response caching, but PostgreSQL with proper indexes handles all queries in low single-digit milliseconds — the extra caching layer added complexity without meaningful benefit.

Two Ingestion Paths

Not all data requires a headless browser. The matchcenter exposes some data as CSV exports — fast and reliable. Other data is only available as rendered HTML. The scraper uses both paths depending on what it needs.

Fast Path: CSV

The matchcenter has an undocumented SOAP-like endpoint that returns semicolon-delimited CSV with match schedules:

GET /portaldata/1/nisrd/WebService/verein/calendar.asmx/Verein?v={club_id}&format=csv

This returns all matches for a club in a single HTTP request — teams, dates, times, venues, scores, match types. The CSV fetcher parses each row, classifies the match type (league, friendly, cup, tournament), and separates regular matches from tournament entries.

Tournament rows get special treatment. The description field contains structured text like "Organisator: FC Büsingen, Teams: FC A, FC B, FC C" which the parser extracts into organizer and participant lists.

The CSV path handles ~90% of schedule data and runs in under a second per club.
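The parsing step can be sketched with the standard library. The column names, the German type labels, and the sample rows below are illustrative assumptions — the real export has more fields — but the flow (classify each row, then split tournament rows and mine their description text) follows the description above:

```python
import csv
import io
import re

# Hypothetical column subset and sample rows; the real CSV export has more fields.
SAMPLE = (
    "Spielnummer;Datum;Typ;Heimteam;Gastteam;Bemerkung\n"
    "123456;15.03.2026;Meisterschaft;FC A;FC B;\n"
    "123457;22.03.2026;Turnier;FC A;;Organisator: FC Büsingen, Teams: FC A, FC B, FC C\n"
)

# Assumed mapping from the federation's German labels to internal match types.
TYPE_MAP = {"Meisterschaft": "league", "Freundschaftsspiel": "friendly",
            "Cup": "cup", "Turnier": "tournament"}

def parse_rows(raw: str):
    """Split CSV rows into regular matches and tournament entries."""
    matches, tournaments = [], []
    for row in csv.DictReader(io.StringIO(raw), delimiter=";"):
        kind = TYPE_MAP.get(row["Typ"], "league")
        if kind == "tournament":
            # Pull organizer and participants out of the free-text description.
            m = re.search(r"Organisator:\s*([^,]+),\s*Teams:\s*(.+)", row["Bemerkung"])
            tournaments.append({
                "match_number": row["Spielnummer"],
                "organizer": m.group(1).strip() if m else None,
                "teams": [t.strip() for t in m.group(2).split(",")] if m else [],
            })
        else:
            matches.append({"match_number": row["Spielnummer"], "type": kind})
    return matches, tournaments
```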

Slow Path: Playwright

For standings, match results with detailed scores, and match reports (the “Telegramm” page), the scraper launches Firefox via Playwright. The matchcenter renders these pages with JavaScript, so httpx can’t reach the data.

_browser = await _playwright.firefox.launch(headless=False, args=["--no-remote"])

Firefox runs with headless=False through an Xvfb virtual framebuffer. This is intentional — real Firefox rendering (not headless mode) is harder for anti-bot systems to detect. The browser also injects an anti-detection script that removes the navigator.webdriver property.

A semaphore limits concurrency to one page at a time. Each request gets a fresh browser context with Swiss locale and timezone, then a 2-second cooldown between requests. This is slow but respectful — the matchcenter is a community resource, not a target.

Data Model

The scraper maintains a normalized relational model:

Matches are the central entity, linked to clubs and teams on both sides via junction tables (match_clubs, match_teams). This normalized structure replaced earlier denormalized columns — team data is now resolved at query time via JOINs.

Key design decisions:

  • match_number as natural key: The federation assigns a unique Spielnummer to every match. This is the upsert key, making re-syncs idempotent.
  • Two-sided junctions: Both match_clubs and match_teams carry a side column ('home' or 'away'), using native PostgreSQL ENUM types for type safety.
  • Team categories: Each team carries a category field parsed from its name during club discovery. Junior teams are classified into age groups (A through G, with pitch-size variants like D-7, D-9, E/FF-11), while adult teams are bucketed into Aktive, Senioren, or Frauen. This enables frontend filtering — show only E-category teams, or only senior squads.
  • Computed player stats: player_season_stats are recomputed from match lineups and events after each match detail sync. Minutes played are calculated from entry/exit events — starters enter at minute 0, substitutes at their sub-in minute, everyone exits at 90 or at their sub-out/red-card minute.
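The minutes calculation from the last bullet fits in a few lines. A sketch under assumed event shapes (the dict keys and type names are illustrative, not the scraper's actual schema):

```python
FULL_TIME = 90

def minutes_played(starter: bool, events: list[dict]) -> int:
    """Derive minutes from a player's entry/exit events.

    events: [{"type": "sub_in" | "sub_out" | "red_card", "minute": int}, ...]
    Starters enter at minute 0; substitutes at their sub-in minute.
    Exit is the sub-out or red-card minute, else full time.
    """
    entry = 0 if starter else next(
        (e["minute"] for e in events if e["type"] == "sub_in"), None)
    if entry is None:
        return 0  # named in the squad but never entered the pitch
    exit_minute = next(
        (e["minute"] for e in events if e["type"] in ("sub_out", "red_card")),
        FULL_TIME)
    return max(exit_minute - entry, 0)
```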

The Sync Pipeline

The scheduler runs every 60 minutes and processes subscribed clubs in two phases.

Phase 1: Club Schedule

For each club with due teams, fetch the club-level CSV once. This gives us matches for all teams in a single request, with both-side club numbers for opponent resolution.

Phase 2: Per-Team Enrichment

For each due team within the club:

  1. Standings — Playwright scrapes the league table. As a side effect, this builds an opponent_lookup dictionary mapping display names to club IDs — useful for resolving opponents in the next step.

  2. Team schedule — CSV fetch with opponent resolution. The lookup from standings helps link opponent clubs that might have abbreviated or differently formatted names.

  3. Results — Playwright scrapes the “Partien” page for match scores and external match IDs. These IDs enable the match detail endpoint to fetch lineups and events later.

Opponent resolution is surprisingly tricky. The CSV might say "FC Töss 1", the standings page says "FC Töss", and the database has "FC Töss Winterthur". The resolver tries multiple strategies: exact match on external ID, standings-derived lookup, regex name stripping (removing suffixes like " b", " 1", " U13 a"), and finally prefix matching.
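The name-based strategies (lookup, suffix stripping, prefix matching) can be sketched as a fallback chain — the function signature and the exact suffix regex here are assumptions, and the external-ID strategy is omitted:

```python
import re

# Strip trailing team designators like " b", " 1", " U13 a" (assumed pattern).
SUFFIX_RE = re.compile(r"\s+(?:[a-z]|\d+|U\d+(?:\s+[a-z])?)$")

def resolve_opponent(name: str, lookup: dict, known_clubs: dict):
    """Try progressively fuzzier strategies to map a display name to a club ID."""
    # Strategy 1: standings-derived lookup (display name -> club ID).
    if name in lookup:
        return lookup[name]
    # Strategy 2: strip team suffixes and retry the lookup.
    stripped = name
    while True:
        shorter = SUFFIX_RE.sub("", stripped)
        if shorter == stripped:
            break
        stripped = shorter
    if stripped in lookup:
        return lookup[stripped]
    # Strategy 3: prefix match against known club names, in either direction.
    for club_name, club_id in known_clubs.items():
        if club_name.startswith(stripped) or stripped.startswith(club_name):
            return club_id
    return None
```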

Refresh Intervals

Not all teams need frequent updates. Youth teams (matching Junior(en|innen) [GFED]) sync every 24 hours — they play weekly and parents want current data. Adult teams sync every 168 hours (weekly) since their schedules change less often.
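The classification uses the regex quoted above; wrapped in a helper (the function name is illustrative):

```python
import re

# Youth team names match the pattern from the scheduler config.
JUNIOR_RE = re.compile(r"Junior(en|innen)\s+[GFED]")

def refresh_interval_hours(team_name: str) -> int:
    """24h for youth teams in categories D-G, 168h (weekly) for everyone else."""
    return 24 if JUNIOR_RE.search(team_name) else 168
```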

Real-Time: Postgres LISTEN/NOTIFY to SSE

When the scraper writes new match data, consumers need to know immediately. Polling is wasteful. Instead, PostgreSQL does the heavy lifting.

A trigger on the matches table fires on every INSERT or UPDATE:

CREATE FUNCTION notify_match_change() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('data_changed', jsonb_build_object(
        'table', TG_TABLE_NAME,
        'op',    TG_OP,
        'id',    NEW.id
    )::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
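The function still has to be attached to the table; a plausible trigger statement (the trigger name is an assumption):

```sql
CREATE TRIGGER matches_notify
AFTER INSERT OR UPDATE ON matches
FOR EACH ROW EXECUTE FUNCTION notify_match_change();
```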

The payload is deliberately small — just the table name, operation, and row ID. PostgreSQL NOTIFY has an 8KB limit, and we don’t want to serialize entire match objects in the trigger. Instead, the scraper’s listener enriches the event server-side, loading the full match data with team info and venue details to produce a payload identical to the schedule API response.

Two delivery mechanisms:

  • SSE endpoint (GET /api/events/stream) — browsers and KnowMore’s frontend connect here for real-time updates. No auth required, no polling.
  • Webhook — the scraper POSTs the enriched event to KnowMore’s PocketBase instance, which can update its own match records and notify connected clients through PocketBase’s real-time subscriptions.

The API

The scraper exposes a REST API for on-demand queries. All public endpoints read from PostgreSQL — indexed queries return in low single-digit milliseconds. Data is populated by the background sync pipeline (CSV ingestion + Playwright enrichment), not by on-demand scraping. This keeps endpoint response times consistently fast and avoids exposing Playwright’s latency to API consumers.

Key endpoints:

| Endpoint | Source | What it returns |
| --- | --- | --- |
| GET /api/schedule/team/{club}/{team} | CSV/DB | All matches (past + future) with team info, venue, scores |
| GET /api/results/team/{club}/{team} | Playwright/DB | Played matches with scores and match types |
| GET /api/results/match/{match_id} | Playwright/DB | Full match report: lineups, events, goals, halftime score |
| GET /api/standings/team/{club}/{team} | Playwright/DB | League table with W/D/L, goals, points |
| GET /api/stats/team/{club}/{team} | Computed | Player statistics: appearances, goals, cards, minutes |
| GET /api/team/summary/{club}/{team} | DB | Dashboard tile: last result, next match, next tournament, current rank |
| GET /api/tournaments/{club} | CSV/DB | Tournament list with participants and organizers |

Public endpoints also include a club catalog (GET /api/clubs?search=... and GET /api/clubs/{id}) for browsing the full federation directory — no auth required. The club detail response includes all teams with their age-group category, so the frontend can build filtered team pickers (e.g. only junior E teams). This lets KnowMore’s admin setup flow import clubs and teams directly from the scraper without needing the API key.

Admin endpoints (API key required) manage subscriptions, trigger manual syncs, and handle club discovery.

There’s also a GET /api/llm-docs endpoint that serves a markdown API reference optimized for LLM consumption — useful when KnowMore’s AI assistant needs to query match data.

Match Detail: The Telegramm Parser

The most complex scraping target is the match detail page (“Telegramm”). It contains:

  • Competition name and match number
  • Date, time, and venue
  • Final score and halftime score
  • Goal summary (minute, player, side)
  • Full timeline of events (goals, substitutions, yellow/red cards, with running score)
  • Complete lineups for both teams (starting XI, substitutes, coach, absent players)
  • Player IDs, shirt numbers, positions, captain designation

The parser extracts all of this from the rendered HTML. Event types are identified by data-eid attributes on timeline elements (1=goal, 2=substitution, 3=yellow card, 4=red card, 5=yellow-red). Player names are matched against the lineup to link events to player records in the database.
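The data-eid dispatch can be illustrated with the standard library's HTMLParser. The markup below is heavily simplified (real timeline elements carry more attributes and nesting), but the mapping table matches the event codes listed above:

```python
from html.parser import HTMLParser

# Event codes observed on the Telegramm timeline.
EVENT_TYPES = {"1": "goal", "2": "substitution", "3": "yellow_card",
               "4": "red_card", "5": "yellow_red"}

class TimelineParser(HTMLParser):
    """Collect (event_type, text) pairs from elements carrying data-eid."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        eid = dict(attrs).get("data-eid")
        if eid in EVENT_TYPES:
            self._current = EVENT_TYPES[eid]

    def handle_data(self, data):
        if self._current and data.strip():
            self.events.append((self._current, data.strip()))
            self._current = None

# Simplified sample markup, not the real page structure.
SAMPLE_HTML = (
    '<div data-eid="1">23\' Max Muster</div>'
    '<div data-eid="3">57\' Hans Beispiel</div>'
)
parser = TimelineParser()
parser.feed(SAMPLE_HTML)
```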

After parsing, the sync layer persists everything and recomputes aggregated player statistics — updating appearances, goals, cards, and minutes played across the season.

Club Discovery

Before the scraper can sync match data, it needs to know which clubs and teams exist. The discovery pipeline:

  1. Scrape the federation’s club listing page — extract all club IDs and names
  2. For each club, scrape the matchcenter — extract team IDs from navigation links and dropdowns
  3. Cache to JSON — save as data/federations/FVRZ.json for fast re-seeding
  4. Upsert to database — create or update Club and Team records

This is a one-time operation (10-30 minutes for ~200 clubs), heavily rate-limited to avoid hitting the matchcenter too hard. After discovery, the seed data is persisted as JSON so new deployments can bootstrap without re-scraping.

Club detail scraping (address, website, club colors, founding date) now runs automatically on the first sync after subscribing a club. The scheduler checks whether a club’s details have been populated — if not, it scrapes the matchcenter club page and fills in the fields. For clubs without a street address (only a postal code line on the page), the address falls back to the zip code and city. Subsequent syncs skip the detail scrape since the data is already present. A manual admin endpoint is available to force-refresh details when needed.

The bulk discovery scrape (for all ~200 clubs in a federation) is still rate-limited: 3-8 second delays between clubs, 30-60 second pauses every 10 clubs.
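The delay schedule is simple to express as a generator — a sketch of the policy just described (the function and its signature are illustrative, not the scraper's actual code):

```python
import random

def discovery_delays(n_clubs: int, rng: random.Random):
    """Yield the pause (in seconds) to sleep before scraping each club:
    3-8s between clubs, plus a 30-60s breather after every 10th club."""
    for i in range(n_clubs):
        delay = rng.uniform(3, 8)
        if i > 0 and i % 10 == 0:
            delay += rng.uniform(30, 60)  # longer pause every 10 clubs
        yield delay
```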

Deployment

The scraper runs as three Docker containers orchestrated by Docker Compose:

services:
  scraper:    # FastAPI + Playwright Firefox + Xvfb
    ports: ["8095:8000"]
    depends_on: [postgres]
  postgres:   # PostgreSQL 17
    ports: ["5432:5432"]
  pgadmin:    # Database admin UI
    ports: ["5050:80"]

The Dockerfile installs Xvfb for the virtual display, then Playwright’s Firefox with system dependencies. At runtime, an entrypoint script starts Xvfb on display :99, runs Alembic database migrations, then launches uvicorn with the FastAPI app.

For production deployment on Unraid, the image is built locally and transferred via docker save | scp | docker load — no Docker Hub needed. The production compose file maps different ports (8097 for the API, 5051 for pgAdmin) and uses NVMe cache paths for volumes.

KnowMore connects to the scraper via the webhook URL (NOTIFY_WEBHOOK_URL). Both services run on the same Docker host, communicating over a shared Docker network.

The PostgreSQL container includes automated daily backups. A custom Dockerfile extends postgres:17-alpine with a cron job that runs pg_dump at 22:00 Swiss time, producing compressed custom-format dumps with 7-day retention. The backup script and cron run inside the same container — an entrypoint wrapper starts crond in the background before delegating to the stock postgres entrypoint. Backup files are written to a mounted volume, so they survive container restarts and are accessible from the host for off-site copies.
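A backup setup like the one described might look as follows — the crontab line, script path, and environment variable names are assumptions, but the pg_dump flags (custom format, compressed) and the 7-day find-based retention match the behavior above:

```shell
# crontab entry inside the postgres container (Europe/Zurich): daily dump at 22:00
# 0 22 * * * /usr/local/bin/backup.sh

#!/bin/sh
# backup.sh sketch: compressed custom-format dump, then prune dumps older than 7 days
STAMP=$(date +%Y%m%d_%H%M%S)
pg_dump -U "$POSTGRES_USER" -Fc -Z 9 "$POSTGRES_DB" \
  > /backups/scraper_"$STAMP".dump
find /backups -name '*.dump' -mtime +7 -delete
```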

Structured JSON logs in UTC are pushed to Grafana Loki for centralized monitoring. Each API request logs endpoint, source (db/scrape), duration, and relevant IDs — making it straightforward to trace a slow response back to its cause.

What’s Next

The scraper is deployed and running in production, with full frontend integration since v0.5.0. Current priorities:

  • Multi-federation support — the data model supports it (Federation -> Club -> Team), but only FVRZ is configured. The schema is designed to scale to all ~13 Swiss federations (~2600 clubs, ~39K teams, ~585K matches/season)
  • E-category tournament participant count — some youth tournaments report fewer participants than expected, likely a parser edge case
  • Match lineup data — player-level statistics from match details for deeper analytics integration

This is part of the KnowMore series. For the overall architecture, see Building KnowMore - Architecture of a Privacy-First Sports Calendar. For the real-time data system on the PocketBase side, see KnowMore Real-Time Data Loading.