- Project
  - PoliLoom: Structuring politicians' data for investigators and the accountability sector.
- Client
  - OpenSanctions
- Role
  - Data Developer (2025–present)
The problem
Assemble and verify structured politician data from Wikipedia, Wikidata, and the wider web, across languages, while preserving provenance and correctness at scale.
Solution highlights
- Two-stage extraction pipeline: LLM extracts free-text positions → vector search maps to exact Wikidata entities → LLM reconciles.
- Fast similarity search: Embeddings with SentenceTransformers; pgvector in Postgres.
- Source verification: FastAPI API and Next.js confirmation GUI for human verification.
- Parallel dump processing: near-linear speedup across 32+ cores; the 1.8 TB dump is processed in multiple passes.
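
To illustrate the mapping step in the pipeline above, here is a minimal sketch of ranking candidate entities by cosine similarity against an extracted free-text position. It is not the project's code: the character-trigram `embed` function stands in for a SentenceTransformers model, the in-memory list stands in for a pgvector table, and the candidate IDs (`Q_A`, `Q_B`, `Q_C`) and all function names are hypothetical placeholders.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts (stand-in for a real model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical candidate entities: (placeholder ID, label).
CANDIDATES = [
    ("Q_A", "Mayor of Amsterdam"),
    ("Q_B", "Member of the European Parliament"),
    ("Q_C", "Minister of Finance"),
]

def top_k(query: str, candidates, k: int = 2):
    """Return the k candidates most similar to the extracted text."""
    q = embed(query)
    scored = [(cosine(q, embed(label)), qid, label) for qid, label in candidates]
    return sorted(scored, reverse=True)[:k]

# Free text as an LLM might extract it; the shortlist would then go
# back to the LLM for reconciliation.
matches = top_k("mayor of amsterdam (2018-)", CANDIDATES)
```

The short list returned by `top_k` is what a reconciliation step would choose from, which keeps the final answer constrained to entities that actually exist.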
Impact
- Trust: Clear citations from archived pages, shown in the GUI for verification.
- Scale: Parallelized, test-backed pipeline; batched database operations.
- Clarity: From unstructured source documents to structured, linkable positions.
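
The parallel, batched processing mentioned under Scale could look roughly like the sketch below, assuming the input is newline-delimited JSON. The record fields (`id`, `type`, `label`), the function names, and the chunk sizes are all hypothetical, and appending to a list stands in for one bulk database insert per batch.

```python
import json
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def parse_chunk(lines):
    """Worker: parse one chunk of JSON lines into (id, label) rows."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "position":  # hypothetical field
            rows.append((record["id"], record["label"]))
    return rows

def process_dump(lines, chunk_size=2, workers=2,
                 executor_cls=ProcessPoolExecutor):
    """Fan chunks out to workers and collect one batch of rows per chunk.

    Note: Executor.map submits every chunk up front, so a real pipeline
    over a very large file would need to bound the work in flight.
    ThreadPoolExecutor can be passed in for quick tests.
    """
    batches = []
    with executor_cls(max_workers=workers) as pool:
        for rows in pool.map(parse_chunk, chunked(lines, chunk_size)):
            if rows:
                batches.append(rows)  # stand-in for one bulk INSERT
    return batches
```

With the default `ProcessPoolExecutor`, `process_dump` should be called from under an `if __name__ == "__main__":` guard so worker processes can import the module safely.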