How does HouseData calculate flood risk scores?

HouseData uses Environment Agency flood zone data, historical flood event records, and proximity to watercourses. A machine learning model (XGBoost) combines these features with property-level characteristics to produce a normalised flood risk score from 0 (minimal risk) to 100 (severe risk).

What is a Consent Gap Score in conveyancing?

A Consent Gap Score identifies properties where physical changes detected in EPC time-series data (for example new extensions, loft conversions, or changed heating systems) are not matched by corresponding planning permissions or building control sign-offs. This flags potential undisclosed alterations that could affect a property transaction.

What machine learning models does HouseData use?

HouseData primarily uses XGBoost and Random Forest ensemble models. These are trained on labelled datasets of known risk outcomes (e.g. properties that have flooded, properties with planning enforcement action) and validated against held-out test sets. Feature importance is published for transparency.

Methodology

How HouseData calculates property risk scores from official UK government data.

Overview

HouseData aggregates open data from official UK government sources and applies machine learning models to infer property-level risk scores. Every score is traceable to its underlying data source, and we publish our methodology so users can understand exactly how results are generated.

We do not estimate, guess, or use proprietary black-box valuations. Every underlying data point on HouseData is sourced from a named, verifiable government or public dataset.

Risk categories

Flood risk

Data sources: Environment Agency flood zones (Flood Map for Planning), historical flood event records, river and sea level monitoring stations, surface water flood risk maps.

Method: An XGBoost model combines property-level features (distance to watercourse, flood zone classification, elevation, historical flood frequency within 1 km) to produce a normalised risk score from 0 to 100. The model is validated against historical flood insurance claims data.
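As a rough illustration of the inputs and the final normalisation step, the sketch below builds a feature vector from the four features listed above and maps a model probability onto the 0 to 100 scale. The field names, the `FloodFeatures` class, and the `to_risk_score` helper are assumptions made for this example, as is the idea that the score is the predicted probability multiplied by 100. In practice the vector would be passed to a trained classifier (for instance via `XGBClassifier.predict_proba` in the XGBoost Python package) rather than scored by hand.

```python
from dataclasses import dataclass


@dataclass
class FloodFeatures:
    """Property-level inputs named in the method above (field names are illustrative)."""
    distance_to_watercourse_m: float
    flood_zone: int          # Flood Map for Planning zone: 1, 2 or 3
    elevation_m: float
    floods_within_1km: int   # historical flood events within 1 km

    def as_vector(self) -> list[float]:
        # Order matters: it must match the column order used at training time.
        return [
            self.distance_to_watercourse_m,
            float(self.flood_zone),
            self.elevation_m,
            float(self.floods_within_1km),
        ]


def to_risk_score(probability: float) -> int:
    """Map a model probability to an integer score from 0 to 100 (assumed scaling)."""
    clamped = min(max(probability, 0.0), 1.0)  # guard against out-of-range inputs
    return round(clamped * 100)
```

Clamping before scaling keeps the published score inside its documented range even if an upstream step returns a value slightly outside 0 to 1.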

Planning risk

Data sources: Local planning authority feeds (425+ councils), Planning Portal records, planning enforcement notices.

Method: We count and classify planning applications within configurable radii of each property. Applications are weighted by type (major/minor), recency, and outcome (approved/refused/pending). A Random Forest model produces a development pressure score indicating how actively the surrounding area is being developed.
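The counting and weighting step can be sketched as follows. This is a minimal example and not the production pipeline: the weight tables, the 500 m default radius, and the three-year recency half-life are hypothetical values chosen only to make the arithmetic concrete. It covers the aggregation that would feed the Random Forest model, not the model itself.

```python
import math
from dataclasses import dataclass


@dataclass
class Application:
    lat: float
    lon: float
    kind: str         # "major" or "minor"
    outcome: str      # "approved", "refused" or "pending"
    age_years: float  # time since the application was lodged


# Hypothetical weights, for illustration only.
KIND_WEIGHT = {"major": 3.0, "minor": 1.0}
OUTCOME_WEIGHT = {"approved": 1.0, "pending": 0.7, "refused": 0.3}


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    r = 6_371_000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def weighted_activity(lat, lon, apps, radius_m=500.0, half_life_years=3.0):
    """Sum type, outcome and recency weights for applications inside the radius."""
    total = 0.0
    for app in apps:
        if haversine_m(lat, lon, app.lat, app.lon) > radius_m:
            continue  # outside the configured radius
        recency = 0.5 ** (app.age_years / half_life_years)  # halves every half-life
        total += KIND_WEIGHT[app.kind] * OUTCOME_WEIGHT[app.outcome] * recency
    return total
```

Calling `weighted_activity` with several radii (say 250 m, 500 m and 1 km) would give one feature per radius for the downstream model.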

Contamination proximity

Data sources: Environment Agency pollution inventory, historical landfill sites register, contaminated land registers (where published by local authorities).

Method: Properties are scored based on proximity to known pollution sources, weighted by pollution type (category A, B, C) and operational status (active vs. closed/remediated). Scores are normalised to a 0–100 scale.
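One way such a proximity score could be computed is sketched below. The category weights, the discount for closed sites, the 2 km cut-off, the linear distance decay, and the choice to take the single strongest source are all assumptions for illustration; the treatment of category A as the most severe is likewise assumed.

```python
from dataclasses import dataclass


@dataclass
class PollutionSource:
    distance_m: float
    category: str  # "A", "B" or "C"
    active: bool   # False for closed or remediated sites


# Hypothetical parameters, for illustration only.
CATEGORY_WEIGHT = {"A": 1.0, "B": 0.6, "C": 0.3}
CLOSED_FACTOR = 0.5
MAX_DISTANCE_M = 2000.0


def contamination_score(sources):
    """Scale the strongest single contribution to an integer from 0 to 100."""
    best = 0.0
    for s in sources:
        if s.distance_m >= MAX_DISTANCE_M:
            continue  # too far away to contribute
        proximity = 1.0 - s.distance_m / MAX_DISTANCE_M  # 1 at the site, 0 at the cut-off
        status = 1.0 if s.active else CLOSED_FACTOR
        best = max(best, proximity * CATEGORY_WEIGHT[s.category] * status)
    return round(best * 100)
```

Taking the maximum rather than a sum keeps the result bounded without a separate normalisation pass; a summed variant would need capping or rescaling to stay within 0 to 100.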

EPC trajectory

Data sources: MHCLG (formerly DLUHC) Energy Performance of Buildings Register (domestic and non-domestic).

Method: We analyse time-series EPC data for each property to identify whether the rating has improved, deteriorated, or remained static. We also flag properties approaching certificate expiry and those where physical characteristics (floor area, heating system, insulation) have changed between assessments, which may indicate undisclosed alterations.
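The three checks described above (trend, changed characteristics, approaching expiry) can be sketched as below. The `Epc` record and its field names are hypothetical, as are the 5% floor-area tolerance and the 180-day warning window. The ten-year validity period is the standard lifetime of an EPC.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Epc:
    lodged: date
    rating: int           # numeric energy efficiency rating behind the A-G band
    floor_area_m2: float
    heating: str


def trajectory(history):
    """Compare the earliest and latest certificates for a property."""
    ordered = sorted(history, key=lambda e: e.lodged)
    delta = ordered[-1].rating - ordered[0].rating
    if delta > 0:
        return "improved"
    if delta < 0:
        return "deteriorated"
    return "static"


def changed_characteristics(history, area_tolerance=0.05):
    """List characteristics that differ between consecutive assessments."""
    ordered = sorted(history, key=lambda e: e.lodged)
    flags = []
    for prev, curr in zip(ordered, ordered[1:]):
        if abs(curr.floor_area_m2 - prev.floor_area_m2) > area_tolerance * prev.floor_area_m2:
            flags.append("floor_area")
        if curr.heating != prev.heating:
            flags.append("heating")
    return flags


def expires_within(latest, today, warn_days=180, validity_years=10):
    """True if the certificate is still valid but expires within warn_days."""
    try:
        expiry = latest.lodged.replace(year=latest.lodged.year + validity_years)
    except ValueError:  # lodged on 29 February and the target year is not a leap year
        expiry = latest.lodged.replace(year=latest.lodged.year + validity_years, day=28)
    return 0 <= (expiry - today).days <= warn_days
```

The floor-area tolerance is there because repeat assessments of an unaltered property rarely report identical measurements; without it, small surveying differences would be flagged as alterations.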

Consent Gap Score (PRISM)

Data sources: EPC time-series data, planning permission records, building control sign-off records, Listed Building registers.

Method: The Consent Gap Score cross-references physical changes detected in EPC data (e.g. a new extension appearing, changed heating type, increased floor area) against planning and building control records. A gap between detected changes and corresponding permissions flags a potential consent issue for conveyancers.
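The cross-referencing step can be illustrated with a simple matching rule: a detected change counts as covered if a consent of the same kind was decided within a window around the two EPC assessments that bracket the change. The record types, the `kind` labels, and the one-year slack are assumptions for this sketch; a real matcher would also need to map between EPC fields and planning or building control descriptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DetectedChange:
    kind: str       # e.g. "extension", "heating", "loft_conversion"
    earliest: date  # last EPC lodged before the change appeared
    latest: date    # first EPC in which the change appears


@dataclass
class Consent:
    kind: str       # same vocabulary as DetectedChange.kind (assumed)
    decided: date   # planning decision or building control sign-off date


def consent_gaps(changes, consents, slack_days=365):
    """Return detected changes with no same-kind consent inside the matching window."""
    slack = timedelta(days=slack_days)
    gaps = []
    for change in changes:
        window_start = change.earliest - slack
        window_end = change.latest + slack
        matched = any(
            c.kind == change.kind and window_start <= c.decided <= window_end
            for c in consents
        )
        if not matched:
            gaps.append(change)
    return gaps
```

The slack on both sides allows for permissions granted some time before work starts and for sign-offs issued after the next EPC is lodged.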

Model validation

All models are trained on labelled datasets of known outcomes. They are validated using stratified k-fold cross-validation and then checked against held-out test sets. We publish precision/recall metrics for each risk category and retrain models quarterly as new data becomes available.
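For readers unfamiliar with the terms, the sketch below shows what stratified fold assignment and precision/recall mean in plain Python. It is a simplified illustration (round-robin assignment with no shuffling) rather than the validation code itself; in practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold has a similar class mix."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class out round-robin so no fold is starved of a class.
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Stratification matters for risk data because positive outcomes (a property that has flooded, for example) are usually rare, and an unstratified split can leave a fold with few or no positives.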

Limitations

Risk scores are inferences, not guarantees. They should be used alongside professional surveys and legal advice.
Data availability varies by local authority — some councils publish planning data more comprehensively than others.
EPC data only exists for properties that have been assessed (typically at point of sale or let).
Historical contamination records may be incomplete for sites remediated before digital record-keeping.

Data interoperability

HouseData exports property data in PDTF (Property Data Trust Framework) v3 format, the UK industry standard for structured property data exchange. Each export includes verified claims with data provenance, recording which official source each data point came from and when it was retrieved. This supports integration with conveyancing platforms and with the systems used by estate agents, lenders, and surveyors.