Pipeline documentation

How GeoValida works.

Thirteen automated stages, from raw satellite ingestion to a plain-language acquisition brief. Every stage runs without manual intervention. Total runtime: minutes.

Data acquisition | Change detection | Spatial ML decomposition | Context and scoring | Legal and market | Output
Data acquisition
01

Satellite time-series ingestion

Monthly cloud-free composites are downloaded from Google Earth Engine for the full analysis period, covering the AOI at native satellite resolution. Four spectral bands are built per month: vegetation index, built-surface index, surface albedo, and land surface temperature. Cloud masking uses per-pixel quality flags with adaptive window widening when cloud cover is persistent; remaining gaps are filled via forward/backward temporal inpainting. Every filled observation is flagged for downstream confidence weighting.

Technical details

-Sentinel-2 (10 m native): NDVI, NDBI, Albedo
-Landsat 8/9 (30 m, bilinearly upsampled to 10 m): LST
-Source: Google Earth Engine, monthly composite per band
-Cloud masking: per-pixel QA flags + adaptive window widening
-Gap-fill: forward/backward inpainting; NaN fraction tracked per observation
-Inpainted observations receive 1.5x sigma inflation in anomaly scoring
Outputs
[T, H, W, 4] raster tensor; NaN confidence mask
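
The gap-fill and flagging step can be sketched for a single pixel's monthly series. This is a minimal illustration under stated assumptions, not GeoValida's implementation: it assumes a forward-fill pass followed by a backward-fill pass for leading gaps, and the names fill_gaps and inflate_sigma are hypothetical. Only the 1.5x factor comes from the description above.

```python
import math


def fill_gaps(series):
    """Forward-fill, then backward-fill NaNs. Returns (filled, was_filled)."""
    filled = list(series)
    was_filled = [math.isnan(v) for v in series]
    # Forward pass: carry the last valid observation into later gaps.
    last = math.nan
    for i, v in enumerate(filled):
        if math.isnan(v):
            filled[i] = last
        else:
            last = v
    # Backward pass: only leading gaps are still NaN at this point.
    nxt = math.nan
    for i in range(len(filled) - 1, -1, -1):
        if math.isnan(filled[i]):
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled, was_filled


def inflate_sigma(sigma, was_filled, factor=1.5):
    """Widen the uncertainty of an inpainted observation (1.5x per the text)."""
    return sigma * factor if was_filled else sigma
```

The was_filled flags play the role of the NaN confidence mask: downstream scoring can consult them instead of trusting an inpainted value as if it were observed.
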
02

Land/water mask

A land/water boundary is auto-derived from the satellite signal itself rather than imported from a static dataset. NDVI values combined with the NaN fraction per pixel identify persistent water. Morphological operations remove coastal noise artifacts. The resulting mask is applied to every downstream layer: sea pixels are excluded from all statistics, scoring, and outputs.

Technical details

-Derived from NDVI threshold + NaN fraction (no external dataset dependency)
-Morphological cleanup: dilation/erosion to remove coastal noise
-Applied to all downstream layers before any metric is computed
Outputs
Binary land mask [H, W]
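
A rough sketch of deriving the mask from the signal itself is shown below. The thresholds (ndvi_min, nan_max), the 3x3 structuring element, and the choice of an opening (erode, then dilate) are all assumptions made for illustration; the text above only states that an NDVI threshold, the NaN fraction, and dilation/erosion are involved.

```python
def _morph(mask, combine):
    """3x3 morphological filter: combine=all erodes, combine=any dilates."""
    h, w = len(mask), len(mask[0])
    return [
        [
            combine(
                mask[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            )
            for c in range(w)
        ]
        for r in range(h)
    ]


def land_mask(mean_ndvi, nan_fraction, ndvi_min=0.0, nan_max=0.5):
    """Threshold NDVI and NaN fraction, then open the mask to drop specks."""
    raw = [
        [ndvi > ndvi_min and frac < nan_max for ndvi, frac in zip(ndvi_row, frac_row)]
        for ndvi_row, frac_row in zip(mean_ndvi, nan_fraction)
    ]
    # Erosion removes isolated "land" pixels; dilation restores the coastline.
    return _morph(_morph(raw, all), any)
```

With this ordering, a single noisy pixel in open water disappears, while a solid land block keeps its original extent.
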
Change detection
03

Dual-baseline anomaly detection

The full time series is divided into a reference period (typically the first 36 months) and a monitoring period (the remainder). Change is not measured as before/after: each monitoring observation is compared to the same calendar month across all baseline years, removing the seasonal cycle entirely. Two z-score streams run in parallel: a conservative stream that anchors to the original land state, and an adaptive stream that tracks the monitoring period's own recent mean to catch accelerating local bursts. A Mann-Kendall trend test is then run per pixel to classify stable, transient, and directionally trending change.

Technical details

-Baseline window: typically 36 months (configurable)
-Monitoring window: remaining period up to present
-Month-matched comparison: seasonal cycles cancel out completely
-dC (conservative): deviation from same-month historical baseline
-dA (adaptive): deviation from the monitoring period's own recent mean
-Cloud-weighted: inpainted months receive 1.5x sigma inflation
-Mann-Kendall trend classification (vectorized per pixel): stable / transient / trending
-Template-weighted band aggregation: residential vs. industrial vs. land-bank weightings differ
Outputs
Anomaly score map [H, W]; Anomaly type map [H, W]; Per-band z-score stack [T, H, W, 4]; Trend classification [H, W]
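
The month-matched comparison and the trend test can be sketched for one pixel and one band. This is an illustrative sketch, not the pipeline's code: it covers only the conservative stream, assumes the series starts on a calendar-aligned month, omits the tie correction in the Mann-Kendall variance, and uses hypothetical function names. How the resulting Z statistic maps onto stable / transient / trending is not specified above and is left out.

```python
import math
from statistics import mean, stdev


def month_matched_z(series, baseline_months=36, filled=None, inflation=1.5):
    """Conservative z-scores: each monitoring month vs. the same calendar
    month across the baseline years."""
    filled = filled or [False] * len(series)
    baseline = series[:baseline_months]
    scores = []
    for t in range(baseline_months, len(series)):
        same_month = baseline[t % 12::12]  # e.g. every January in the baseline
        sigma = max(stdev(same_month), 1e-9)  # guard against zero spread
        if filled[t]:
            sigma *= inflation  # inpainted months carry less weight
        scores.append((series[t] - mean(same_month)) / sigma)
    return scores


def mann_kendall_z(x):
    """Mann-Kendall Z statistic (normal approximation, no tie correction)."""
    n = len(x)
    s = sum(
        (x[j] > x[i]) - (x[j] < x[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0
```

Because the mean and spread are taken per calendar month, a value that is normal for July is never scored against January conditions.
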
Spatial ML decomposition
04a

Feature compression

The anomaly and trend outputs are compressed into a 17-feature matrix per pixel. This representation captures the magnitude of change across all four bands, the absolute drift (how far the monitoring period has moved from the baseline), the trend direction per band, and a one-hot encoding of the break context - the temporal character of when and how the change began.

Technical details

-17 features per pixel: magnitude[4], drift_abs[4], trend[4], break_context[5 one-hot]
-drift_abs: absolute value of drift_signed, the monitoring-period conservative z-score minus the baseline z-score (the structural gap)
-break_context one-hot: no_break / single_early / single_late / multiple / monotone
Outputs
Feature matrix [H, W, 17]
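
As a sketch of how the 17 values for one pixel might be laid out: the feature order, the function name, and the reading of drift_abs as the absolute value of drift_signed are assumptions. The group sizes and the five break-context labels come from the list above.

```python
BREAK_CONTEXTS = ("no_break", "single_early", "single_late", "multiple", "monotone")


def pixel_features(magnitude, drift_signed, trend, break_context):
    """Assemble a 17-element feature vector for one pixel.

    magnitude, drift_signed, trend: one value per band (4 each).
    break_context: one of BREAK_CONTEXTS, one-hot encoded into 5 slots.
    """
    if break_context not in BREAK_CONTEXTS:
        raise ValueError(f"unknown break context: {break_context}")
    one_hot = [1.0 if break_context == name else 0.0 for name in BREAK_CONTEXTS]
    drift_abs = [abs(d) for d in drift_signed]
    return list(magnitude) + drift_abs + list(trend) + one_hot
```
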
04b

Spatial autocorrelation check

Global Moran's I is computed on the anomaly score surface before regime discovery begins. If spatial autocorrelation is too low, coherent zones cannot form and the pipeline warns rather than producing meaningless clusters.

Technical details

-Global Moran's I on anomaly score
-Low-autocorrelation warning triggers if I < threshold
-High autocorrelation confirms that spatially coherent regimes are discoverable
Outputs
Moran's I scalar; Autocorrelation flag
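
For reference, Global Moran's I on a small raster can be computed as below. This sketch assumes binary rook-contiguity weights (the four edge-sharing neighbors), which the text above does not specify; the function name is hypothetical.

```python
def morans_i(grid):
    """Global Moran's I with binary rook-contiguity weights."""
    h, w = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    m = sum(values) / len(values)
    denom = sum((v - m) ** 2 for v in values)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant surface")
    num, weight_sum = 0.0, 0
    for r in range(h):
        for c in range(w):
            for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= rr < h and 0 <= cc < w:
                    num += (grid[r][c] - m) * (grid[rr][cc] - m)
                    weight_sum += 1
    return (len(values) / weight_sum) * (num / denom)
```

A checkerboard surface yields I = -1 (perfect dispersion), while a surface split into two solid halves yields a clearly positive value; the warning flag would then be a comparison of this scalar against a configured threshold.
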
04c

Adaptive grid construction

A quadtree subdivision places more grid cells in spatially heterogeneous areas and fewer where the signal is uniform. This ensures that local XGBoost models are concentrated where they can learn meaningful distinctions, rather than wasting capacity on homogeneous zones.

Technical details

-Quadtree subdivision based on anomaly score variance
-Higher density of cells in heterogeneous areas
-Pure-sea cells excluded entirely
Outputs
Grid cell list with spatial extents
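
A variance-driven quadtree can be sketched in a few lines. The sketch assumes a square raster whose side is a power of two and uses population variance as the split criterion; var_threshold, min_size, and the function name are hypothetical parameters, and sea-cell exclusion is omitted.

```python
from statistics import pvariance


def quadtree(grid, r0, c0, size, var_threshold, min_size):
    """Return (row, col, size) cells, splitting while variance stays high."""
    values = [grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
    if size <= min_size or pvariance(values) <= var_threshold:
        return [(r0, c0, size)]  # homogeneous enough: keep as one cell
    half = size // 2
    cells = []
    for dr in (0, half):
        for dc in (0, half):
            cells += quadtree(grid, r0 + dr, c0 + dc, half, var_threshold, min_size)
    return cells
```

On a 4x4 raster where only the top-left quadrant is noisy, this produces four 1x1 cells there and a single 2x2 cell for each of the three uniform quadrants.
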
04d

Geographically weighted gradient boosting

One local XGBoost model is trained per grid cell against a global pool of pixels, weighted by spatial proximity (Gaussian kernel) and per-pixel uncertainty (inverse sigma). This local weighting prevents adjacent land-cover types from corrupting each other's attribution - a forest cell and an urban cell two kilometers apart will have almost no influence on each other's model, even if they share a grid quadrant.

Technical details

-One XGBoost model per grid cell
-Training pool: all land pixels, weighted by Gaussian proximity kernel
-Uncertainty weighting: pixel weight scales with 1/sigma from anomaly scoring
-Prevents cross-contamination between land-cover types
-Input: 17-feature matrix per pixel
-Output: locally calibrated anomaly score predictions
Outputs
Calibrated anomaly surface [H, W]; Per-cell model residuals
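
The per-pixel training weight described above combines a Gaussian distance kernel with inverse uncertainty. A minimal sketch follows; the bandwidth value, the coordinate convention (meters), and the function name are assumptions for illustration.

```python
import math


def sample_weight(px, py, cx, cy, sigma_pixel, bandwidth_m=500.0):
    """Weight of a pixel at (px, py) when training the model for the grid
    cell centered at (cx, cy). Coordinates are in meters."""
    dist_sq = (px - cx) ** 2 + (py - cy) ** 2
    proximity = math.exp(-dist_sq / (2.0 * bandwidth_m ** 2))  # Gaussian kernel
    return proximity / sigma_pixel  # noisier pixels count for less
```

With a 500 m bandwidth, a pixel 2 km from the cell center receives a weight of exp(-8), roughly 0.0003 of a pixel at the center, which is the "almost no influence" behavior described above. Weights like these would typically be passed to the model's fit call as per-sample weights.
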
04e

Spatial SHAP attribution

A TreeExplainer runs per grid cell to attribute each pixel's anomaly score to one of the four spectral bands. Attribution is local, not global: the same pixel can have a heat-driven signal in one zone and a vegetation-driven signal in an adjacent one. A drift ratio is also computed per pixel - separating the anomaly into its gradual structural drift component (months of slow accumulation) versus an acute event (rapid onset).

Technical details

-TreeSHAP (TreeExplainer) per grid cell
-Four attribution channels: LST, NDVI, NDBI, Albedo
-SHAP varies across space - attribution is local, not a global average
-Drift ratio: proportion explained by gradual structural drift vs. acute event
Outputs
Per-band SHAP attribution [H, W, 4]; Drift ratio [H, W]
04f

SKATER regime discovery

Zones are discovered using SKATER, a graph-based spatial contiguity clustering algorithm. Unlike grid-based or k-means clustering, SKATER enforces spatial contiguity: every zone is a single connected region on the ground. Zone boundaries fall at genuine land-use transitions rather than arbitrary statistical partitions. The optimal number of zones is selected via silhouette score, rewarding genuine cluster separation over superficial splits.

Technical details

-SKATER: graph-based spatial contiguity clustering
-Zones are spatially contiguous - no disjoint regime islands
-k selection via silhouette score (rewards genuine cluster separation)
-Augmented feature matrix: pixel features [17] + per-band SHAP [4] + vegetation/built contrast + thermal signal
-Post-scale weights amplify spectral discriminators, suppress change-timing features for boundary placement
Outputs
Zone membership map [H, W]; Zone boundary polygons
04g

Regime characterization

Each zone receives a complete characterization: the mean anomaly score, the dominant spectral driver, the direction of drift per band, the development phase, the onset timing, a plain-language narrative, and a decision signal. Onset timing is the commercially critical output - it tells the analyst not just that change has occurred, but when it started, and therefore whether the entry window is open or already closed.

Technical details

-Decision signal: INVEST / WATCH / AVOID
-Development phase: land_prep / active_construction / stabilizing / environmental_decline / heat_stress / recovering / mature_stable
-Onset timing: very_recent (< 3 months) / recent (3-12 months) / old (12+ months) / multiple
-Dominant driver and direction per band (degrading / recovering / stable)
-Drift ratio per zone: gradual vs. acute character
-Plain-language narrative generated per zone (template-aware)
-SHAP profile: mean and std per band
Outputs
Per-zone signal, phase, onset, narrative; Zone SHAP profiles
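
The onset-timing bins can be expressed as a small classifier. This is a sketch with assumptions: the input is a list of break ages in months (a hypothetical representation), exactly 12 months is treated as "old" because the listed ranges overlap at that boundary, and a zone with no detected break returns None.

```python
def onset_timing(break_ages_months):
    """Map the age of detected breaks (months before present) to an onset label."""
    if not break_ages_months:
        return None  # assumption: no break means no onset label
    if len(break_ages_months) > 1:
        return "multiple"
    age = break_ages_months[0]
    if age < 3:
        return "very_recent"
    if age < 12:
        return "recent"
    return "old"
```
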
04h

Point anomaly channel

Isolated high-intensity events that are too small to form a coherent regime are detected separately. Blobs of five pixels or fewer at or above the 97th percentile of land anomaly scores are flagged as point anomalies. These catch demolished individual blocks, small flood pockets, isolated construction events, and other localized activity that would otherwise dissolve into a surrounding stable zone.

Technical details

-Detection threshold: 97th percentile of land anomaly scores
-Maximum size: 5 pixels (below minimum coherent regime area)
-Separate from regime structure - never merged into zones
-Catches: isolated demolitions, flood pockets, single-plot clearing events
Outputs
Point anomaly GeoJSON features
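
The detection rule above (small blobs at or above the 97th percentile of land scores) can be sketched with a flood fill. The 4-connectivity and the nearest-rank percentile are assumptions, and the function name is hypothetical; the 97th percentile and the 5-pixel cap come from the text.

```python
import math


def point_anomalies(score, land, percentile=97.0, max_size=5):
    """Return small 4-connected blobs of land pixels at or above the percentile."""
    h, w = len(score), len(score[0])
    land_vals = sorted(score[r][c] for r in range(h) for c in range(w) if land[r][c])
    # Nearest-rank percentile computed over land pixels only.
    k = max(0, math.ceil(percentile * len(land_vals) / 100) - 1)
    threshold = land_vals[k]
    hot = {
        (r, c)
        for r in range(h)
        for c in range(w)
        if land[r][c] and score[r][c] >= threshold
    }
    blobs, seen = [], set()
    for start in sorted(hot):
        if start in seen:
            continue
        seen.add(start)
        stack, blob = [start], []
        while stack:
            r, c = stack.pop()
            blob.append((r, c))
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in hot and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if len(blob) <= max_size:  # larger blobs are left to regime discovery
            blobs.append(sorted(blob))
    return blobs
```

Each returned blob is a list of (row, col) pixels that could then be converted to a GeoJSON feature.
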
Context and scoring
05

Context enrichment

Three context data sources are queried for each AOI. OSM Overpass retrieves POI counts by category within a dynamic radius scaled to AOI size. WorldPop provides gridded population estimates within the AOI and a 5 km buffer. Valhalla isochrones map walk and drive reachability from the AOI centroid. A coverage confidence rating is assigned based on POI density relative to expected urban coverage.

Technical details

-OSM Overpass: POIs by category (retail, food, healthcare, education, transit, commercial, industrial)
-Dynamic query radius scaled to AOI size
-WorldPop: population within AOI and 5 km buffer (gridded estimates, not census interpolation)
-Valhalla: walk/drive isochrones from AOI centroid (5 / 10 / 15 minute bands)
-Coverage confidence: high / moderate / low based on POI count relative to expected density
Outputs
POI counts by category; Population estimates; Isochrone polygons
06

Site quality scoring

A six-component per-pixel livability composite is computed across all land pixels. Each component is normalized to [0, 1] over land pixels only, then blended using template-specific weights. The composite drives opportunity ranking - it is not a standalone output but an input to the ranking stage.

Technical details

-Vegetation gain: ΔNDVI > 0 only (rewards greening, never penalizes vegetation loss)
-Surface cooling: ΔLST < 0 only (rewards cooling, never penalizes warming)
-POI accessibility: Gaussian KDE of amenity locations, category-weighted by template
-Neighborhood momentum: GHSL built-up acceleration blurred to ~750 m neighborhood scale
-Road connectivity: linear distance decay from nearest road (0 at 5 km)
-Slope comfort: quadratic decay (0 at 20 degrees)
-All components normalized to [0,1] over land pixels only
-Template-weighted blend: residential, commercial, industrial, and land-bank weights differ
Outputs
Livability heatmap [H, W]
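
Several of the components above are simple per-pixel transforms, sketched here. The exact functional forms are assumptions: "quadratic decay" is read as 1 - (slope / 20)^2, the blend is a weight-normalized sum, and all function names and weight values are hypothetical. The 5 km and 20 degree cutoffs come from the list above.

```python
def gain_only(delta):
    """One-sided reward: positive change counts, negative change scores zero."""
    return max(0.0, delta)


def road_connectivity(dist_m, cutoff_m=5000.0):
    """Linear decay from 1 at the road to 0 at the cutoff distance."""
    return max(0.0, 1.0 - dist_m / cutoff_m)


def slope_comfort(slope_deg, cutoff_deg=20.0):
    """Quadratic decay from 1 on flat ground to 0 at the cutoff slope."""
    x = min(max(slope_deg, 0.0) / cutoff_deg, 1.0)
    return 1.0 - x * x


def blend(components, weights):
    """Template-weighted average of components already scaled to [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * components[name] for name in weights) / total
```

Surface cooling would reuse gain_only on the negated temperature change, for example gain_only(-delta_lst), so that warming contributes zero rather than a penalty.
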
07

Land opportunity detection and ranking

Empty buildable land is detected using an ensemble of globally trained ML products, with no per-region threshold tuning. Any source that flags a pixel as built triggers exclusion; at least one source must confirm empty land for a pixel to qualify. Watershed segmentation on the NDVI/Albedo gradient delineates individual plots. Each qualifying plot is then ranked by a composite of livability, neighborhood momentum, frontier proximity, area, zone signal, and signal stability.

Technical details

-Built exclusion: Dynamic World built prob >= 0.30, ESA WorldCover class 50, Google Open Buildings v3, OSM footprints (dilated 2px), GHSL built fraction >= 0.30
-Water exclusion: NDWI land mask, Dynamic World water >= 0.30, ESA WorldCover 80/90, JRC GSW occurrence >= 25%
-Protected/unbuildable: WDPA protected areas, slope >= 30 degrees
-Positive empty signal required: Dynamic World (grass + bare + shrub) >= 0.30, or ESA WorldCover non-built class
-Plot delineation: watershed segmentation seeded by distance-transform peaks on (NDVI - Albedo) gradient
-Ranking inputs: livability + GHSL momentum + frontier proximity + plot area + zone signal + signal stability
-All products are globally trained - no per-region calibration required
Outputs
Ranked opportunity polygons (GeoJSON); Per-component scores per plot; RGB validation chips per opportunity
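
The exclusion and confirmation logic above reduces to a per-pixel boolean rule. In the sketch below, the thresholds are taken from the list above, while the dictionary keys and the function name are hypothetical stand-ins for however the layers are actually stored.

```python
def is_empty_buildable(px):
    """Apply the any-source exclusions and the positive empty-land check."""
    built = (
        px["dw_built"] >= 0.30
        or px["worldcover"] == 50
        or px["open_buildings"]
        or px["osm_footprint"]
        or px["ghsl_built"] >= 0.30
    )
    water = (
        not px["land_mask"]
        or px["dw_water"] >= 0.30
        or px["worldcover"] in (80, 90)
        or px["jrc_occurrence_pct"] >= 25
    )
    restricted = px["wdpa"] or px["slope_deg"] >= 30
    empty_confirmed = (
        px["dw_grass"] + px["dw_bare"] + px["dw_shrub"] >= 0.30
        or px["worldcover"] != 50
    )
    # A single exclusion from any source is enough to disqualify the pixel.
    return bool(empty_confirmed and not (built or water or restricted))
```
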
Legal and market
08

Legal exclusion layers

Legal protection status is resolved at pixel level before any opportunity is ranked. Multiple sources are OR-combined into a single exclusion mask. Any pixel covered by a legal exclusion is permanently ineligible for opportunity ranking, regardless of its livability or change signal. For Brazilian AOIs, three additional federal registries are queried automatically.

Technical details

-WDPA protected areas: global coverage (IUCN categories I-VI and unassigned)
-Brazil INDE/ICMBio: federal, state, and municipal conservation units (UC federal, estadual, municipal)
-Brazil FUNAI: indigenous territories (Terras Indigenas), georeferenced from public shapefile
-Brazil CMR/FUNAI: quilombola territories from the national quilombola certification registry
-Brazil ONR: environmental restriction zones fetched via the same auth token as parcel boundaries
-All layers rasterized to 10 m pixel resolution - exclusions are spatially precise
-OR-combined into a single legal mask applied before any scoring occurs
Outputs
Legal exclusion mask [H, W]; Per-source exclusion layers (for audit)
09

Property parcel boundaries

Cadastral parcel boundaries are retrieved through a three-stage global strategy. For Brazilian AOIs, the pipeline first queries the public ONR ArcGIS catalog, then falls back to an authenticated web session with XHR interception to capture urban parcels and transaction layers that are only served to authenticated users. For non-Brazilian AOIs, ArcGIS Online is searched for county and state-level assessor data.

Technical details

-Stage 1 (Brazil): ONR public ArcGIS catalog - rural lots and INCRA georeferenced lots
-Stage 2 (Brazil): Selenium + XHR interceptor against ONR - captures auth-gated services including imoveis_georreferenciamento and layer_transacoes
-Stage 3 (global): ArcGIS Online search for county tax parcels, state-level cadastral datasets
-Coverage depends on whether a public or semi-public cadastral source exists for the AOI
Outputs
onr_parcels.geojson (polygon boundaries); onr_transactions.geojson (transaction points with metadata)
10

Property price intelligence

Price context is assembled from multiple tiered sources and blended into a spatially continuous price surface. Each ranked opportunity receives a price estimate from the nearest reliable source. Active market listings from classifieds platforms are overlaid separately, providing supply-side asking prices alongside the registry-based transacted values.

Technical details

-Brazil - Sao Paulo: GeoSampa WFS, IPTU valor venal per fiscal lot
-Brazil - Rio de Janeiro: data.rio (same schema)
-Brazil - national: ONR deed transaction prices extracted from registry records
-US and global: ArcGIS Online assessor enrichment (assessed value, last recorded sale)
-Market listings: OLX Brazil and Zillow US asking prices via BrightData scraping
-Index fallback: FipeZAP monthly $/m2 by capital city (Brazil); FHFA HPI quarterly by state (US)
-IDW interpolation: all price points blended into a continuous $/m2 heatmap
-Per-opportunity output: est_value_per_m2, est_total, yoy_pct, source label, value percentile
Outputs
Price heatmap [H, W]; Per-opportunity price context; Transaction point layer; Listing point layer
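
Inverse distance weighting, the interpolation named above, can be sketched as follows. The power parameter of 2 and the function name are assumptions; the source does not state which exponent or search radius is used.

```python
def idw(x, y, points, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    points: iterable of (px, py, value), e.g. price per m2 at known locations.
    """
    num = den = 0.0
    for px, py, value in points:
        dist_sq = (x - px) ** 2 + (y - py) ** 2
        if dist_sq == 0:
            return value  # exactly on a known point: return it unchanged
        weight = 1.0 / dist_sq ** (power / 2.0)
        num += weight * value
        den += weight
    return num / den
```

Evaluating this function at every pixel center yields a continuous surface in which nearby price points dominate and distant ones fade.
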
Output
11

Composite scoring

A 0-100 composite score is computed from five weighted components. Component weights are template-specific: a land-bank analysis weights trajectory health and compliance risk more heavily than amenity access; a single-family residential analysis does the reverse. The score drives the letter grade (A through D) shown in the acquisition brief.

Technical details

-Trajectory health: development phase stability, edge pressure, trend coverage
-Internal cohesion: number of distinct regimes (fewer zones = more uniform signal)
-Environmental quality: vegetation change, temperature shift, forest loss fraction
-Compliance risk: compliance flag count, trend-dominated area share
-Context match: amenity and accessibility alignment (weighted once the context layer matures)
-All weights are template-specific
Outputs
0-100 composite score; Letter grade A-D; Per-component scores
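
The scoring arithmetic can be sketched as a weighted average scaled to 0-100. Everything numeric here is a placeholder: the grade cutoffs are invented for illustration, the weights would come from the selected template, and each component is assumed to be pre-scaled to [0, 1] with higher meaning better (so compliance risk would be inverted beforehand).

```python
COMPONENTS = (
    "trajectory_health",
    "internal_cohesion",
    "environmental_quality",
    "compliance_risk",
    "context_match",
)


def composite_score(components, weights):
    """Weighted average of the five components, scaled to 0-100."""
    total = sum(weights[name] for name in COMPONENTS)
    return 100.0 * sum(weights[n] * components[n] for n in COMPONENTS) / total


def letter_grade(score, cutoffs=((75, "A"), (50, "B"), (25, "C"))):
    """Map a 0-100 score to a letter grade. Cutoffs are placeholders."""
    for floor, grade in cutoffs:
        if score >= floor:
            return grade
    return "D"
```
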
12

Template configuration

Eight project-type templates reconfigure the entire pipeline from band weights through to LLM narrative emphasis. A single parcel analyzed as a single-family lot and as an industrial site will produce different anomaly weightings, different livability composites, different POI importance rankings, and different verdicts. Templates are selected at job submission time.

Technical details

-Templates: residential_single, residential_mid, residential_large, commercial_retail, industrial, mixed_use, land_bank, balanced
-Reconfigures: band weights in anomaly scoring, livability component weights, POI category importance
-Reconfigures: scoring component weights, verdict override rules, LLM narrative emphasis
-Industrial template upgrades 'livability erosion' signals to PROCEED (construction is expected)
-Land-bank template emphasizes trajectory health and compliance risk over current amenity access
Outputs
Template-specific weights applied throughout pipeline
13

LLM acquisition brief

A Claude Sonnet call synthesizes all scored pipeline outputs into a 400-word plain-language acquisition brief. The model receives pre-scored signals as structured input - it is not asked to interpret raw satellite data. Hard constraints are enforced: no fabricated permits or timelines, every adjective must be backed by a number from the pipeline output, and the verdict must align with the decision signal computed in Stage 04g. Two output modes are available: an acquisition brief for developers and investors, or a neighborhood livability assessment for residential buyers. Both English and Brazilian Portuguese are supported.

Technical details

-Model: Claude Sonnet (Anthropic)
-Pre-scored signals fed as structured input: early-stage construction, gradual drift magnitude, edge pressure, onset timing
-Hard constraints: no fabricated permits, no invented timelines, no internal metric names exposed
-Every adjective must reference a pipeline number (e.g. 'significant' must cite a z-score or percentile)
-Verdict must align with pipeline decision signal - model cannot override the computed INVEST/PROCEED/WAIT/AVOID
-Output modes: acquisition_brief (developer/investor) / neighborhood_assessment (resident/mover)
-Language support: English, Brazilian Portuguese
Outputs
400-word acquisition brief (or neighborhood assessment); Zone-by-zone narrative reads; Concrete recommendation bullets

Run the full pipeline on any parcel.

All thirteen stages in minutes. Legal layers, price context, and the acquisition brief included.