Government Transparency Institute (2025) Public procurement data processing (Version 1.0).
Understanding government spending requires standardised, cross-country comparable data. This technical report accompanies the publication of GTI’s Global Public Procurement Dataset (GPPD) and explains how public procurement announcements are collected, parsed, cleaned, matched and mastered. The process starts with comprehensive source mapping and automated scraping of HTML portals, XML feeds, APIs and CSV dumps. Each publication is then parsed into a unified JSON template. The cleaning step converts text values into structured types, imputes missing fields, normalises NUTS regional codes and harmonises currencies. Matching algorithms then group related bodies (e.g. buyers and bidders) and link publications that belong to the same tender. Finally, the mastering step applies variable-specific rules to select the most representative values, handle framework agreements, remove duplicate records and compile the final tender record.
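To make the clean, match and master stages more concrete, the following is a minimal Python sketch of how they could fit together. It is not taken from the report. The field names (tender_id, buyer_name, nuts, price, currency, published), the exchange rates, the name-normalisation rule and the "latest non-missing value wins" mastering rule are all assumptions chosen for illustration; the rules GTI actually applies are described in the report itself.

```python
import re
from collections import defaultdict

# Placeholder conversion rates to EUR (made-up numbers, for illustration only).
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.5}


def parse_amount(text):
    """Convert a price string such as '1 000 000,50' to a float, else None.

    Assumes spaces as thousands separators and a comma as decimal separator.
    """
    if not text:
        return None
    try:
        return float(text.replace("\u00a0", "").replace(" ", "").replace(",", "."))
    except ValueError:
        return None


def clean(raw):
    """Cleaning: typed values, normalised NUTS code, price harmonised to EUR."""
    currency = (raw.get("currency") or "").strip().upper() or None
    amount = parse_amount(raw.get("price"))
    rate = RATES_TO_EUR.get(currency)
    known = amount is not None and rate is not None
    return {
        "tender_id": raw["tender_id"],
        "buyer_name": raw.get("buyer_name"),
        "nuts": (raw.get("nuts") or "").strip().upper() or None,
        "price_eur": amount * rate if known else None,
        "published": raw.get("published"),
    }


def body_key(name):
    """Matching: a crude key so spelling variants of one body group together."""
    return " ".join(re.sub(r"[^\w\s]", " ", name.lower()).split())


def master(publications):
    """Mastering: drop exact duplicates, then build one record per tender."""
    by_tender = defaultdict(list)
    seen = set()
    for pub in publications:
        fingerprint = tuple(sorted(pub.items(), key=lambda kv: kv[0]))
        if fingerprint in seen:
            continue  # exact duplicate publication
        seen.add(fingerprint)
        by_tender[pub["tender_id"]].append(pub)

    records = []
    for tender_id, pubs in by_tender.items():
        pubs.sort(key=lambda p: p["published"])  # ISO dates sort as strings
        record = {"tender_id": tender_id}
        for field in ("buyer_name", "nuts", "price_eur"):
            # Assumed rule: the latest non-missing value is kept.
            values = [p[field] for p in pubs if p[field] is not None]
            record[field] = values[-1] if values else None
        records.append(record)
    return records


# Hypothetical publications: a notice, its later update, and a duplicate.
RAW_EXAMPLE = [
    {"tender_id": "T1", "buyer_name": "City of Example", "nuts": " hu110 ",
     "price": "1 000 000,50", "currency": "usd", "published": "2024-01-10"},
    {"tender_id": "T1", "buyer_name": "City of Example.", "nuts": None,
     "price": None, "currency": None, "published": "2024-03-01"},
    {"tender_id": "T1", "buyer_name": "City of Example", "nuts": " hu110 ",
     "price": "1 000 000,50", "currency": "usd", "published": "2024-01-10"},
]

MASTERED = master([clean(raw) for raw in RAW_EXAMPLE])
```

In this sketch the three raw publications collapse into a single tender record: the duplicate is dropped, and the NUTS code and price survive from the earlier notice because the later update leaves them empty. A production pipeline would need per-variable rules rather than one generic fallback, which is what the mastering step described above refers to.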