Data Science: CompStak’s Property CompSet

September 23, 2025

Instant Comparable Set for Analytics

CompStak’s Property CompSet delivers a systematic, data-driven approach to generating instant comparable property sets. Rather than relying on time-consuming, attribute-by-attribute filters, CompSet uses a bottom-up methodology to automatically identify the most comparable properties for a given subject asset. By leveraging CompStak’s extensive property database and machine learning techniques, the system objectively segments asset classes and normalizes key market metrics. This lets it rank comparable properties free from the bias introduced by buyer or seller perspectives, giving analysts a ready-to-use competitive set that accelerates decision-making.

Methodology: A Two-Step Inference Pipeline

Data & Attributes

The process begins with a comprehensive dataset enriched with key property attributes. These attributes are normalized against market-specific distributions to ensure consistent comparisons across diverse regions and asset classes.

Differences in building class, age, low-rise versus high-rise characteristics, and other factors can cause significant variation in historical data that a simple or weighted average cannot adequately address, because such averages fail to capture the nuanced dynamics of the commercial real estate (CRE) market. To overcome these complexities, a comprehensive market rent index is necessary. The index assumes that, all other factors being equal, macro market momentum is the primary force influencing rent levels. By treating property- and lease-level characteristics as independent variables and isolating the effect of time, the market rent index captures rent changes driven by overall market trends, offering a robust tool for understanding and predicting market dynamics.
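
The source does not give the functional form of the market rent index. One common way to formalize "characteristics as independent variables, with the effect of time isolated" is a hedonic regression with time-period indicators. The sketch below is an assumption for illustration only; the symbols R (rent), x (property and lease characteristics), and delta (period effects) are not taken from the original.

```latex
\ln R_{i,t} = \alpha + \sum_{k} \beta_k \, x_{i,k}
            + \sum_{s \neq t_0} \delta_s \, \mathbf{1}[t = s] + \varepsilon_{i,t},
\qquad
I_t = 100 \cdot e^{\delta_t}, \quad \delta_{t_0} = 0
```

Under this assumed form, the coefficients beta absorb differences in building class, age, height, and similar attributes, so the index I_t reflects only the rent movement attributable to period t relative to the base period t_0.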

Step 1 – Nearest Neighbors Search

In the initial phase, a vector database search is used to reduce the search space dramatically through a multi-stage filtering process against all CompStak properties:

Vectorization Process

Key property attributes are transformed into two distinct vector representations: location vectors (derived from geographic coordinates) and property feature vectors (containing normalized property attributes such as property size).
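
A minimal sketch of the two representations, under stated assumptions: the source does not say how coordinates are embedded or how attributes are normalized, so the unit-sphere embedding, the z-score normalization, and all names and numbers below (location_vector, feature_vector, market_stats) are hypothetical.

```python
import math

def location_vector(lat_deg, lon_deg):
    """Map latitude/longitude to a 3-D unit vector on the sphere.

    Assumed embedding: nearby points get nearly parallel vectors,
    so cosine similarity decreases with great-circle distance.
    """
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def feature_vector(prop, stats):
    """Z-score each attribute against market-level (mean, std) statistics."""
    return tuple((prop[name] - mean) / std
                 for name, (mean, std) in stats.items())

# Hypothetical market statistics and subject property.
market_stats = {"size_sqft": (200_000.0, 100_000.0), "floors": (20.0, 10.0)}
subject = {"lat": 40.7549, "lon": -73.9840,
           "size_sqft": 300_000.0, "floors": 30.0}

loc_vec = location_vector(subject["lat"], subject["lon"])
feat_vec = feature_vector(subject, market_stats)
```

Keeping the two vectors separate lets each be indexed and searched on its own, which is what the hybrid search relies on.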

Hybrid Search Architecture

Location-Based Search: Utilizes spatial vectors to identify properties within geographic proximity, accounting for market boundaries and building density
Attribute-Based Search: Simultaneously searches using property feature vectors to find functionally similar properties regardless of location
Dual Vector Strategy: Both searches operate in parallel, allowing the system to capture both spatial and characteristic-based similarity. A hybrid algorithm uses calibrated weights to combine the similarity scores from the location-based and attribute-based searches.
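
The weighted combination described above can be sketched as follows. This is an illustration, not CompStak’s implementation: cosine similarity, the linear blend, and the weight w_loc=0.6 are all assumptions, and the candidate records are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_score(loc_sim, feat_sim, w_loc=0.6):
    """Linear blend of the two similarity scores (w_loc is a placeholder weight)."""
    return w_loc * loc_sim + (1.0 - w_loc) * feat_sim

def rank_candidates(subject, candidates, w_loc=0.6):
    """Return candidate ids ordered from most to least similar."""
    scored = [
        (hybrid_score(cosine(subject["loc"], c["loc"]),
                      cosine(subject["feat"], c["feat"]), w_loc), c["id"])
        for c in candidates
    ]
    return [cid for _, cid in sorted(scored, reverse=True)]

subject = {"loc": (1.0, 0.0, 0.0), "feat": (1.0, 1.0)}
candidates = [
    {"id": "A", "loc": (1.0, 0.0, 0.0), "feat": (1.0, 1.0)},   # close and similar
    {"id": "B", "loc": (0.0, 1.0, 0.0), "feat": (1.0, 1.0)},   # far but similar
    {"id": "C", "loc": (1.0, 0.0, 0.0), "feat": (1.0, -1.0)},  # close but dissimilar
]
ranked = rank_candidates(subject, candidates)
```

With location weighted more heavily, the nearby but dissimilar candidate C outranks the distant but similar candidate B; lowering w_loc below 0.5 would reverse that order.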

Intelligent Filtering & Grouping

Market-specific filters prevent cross-market contamination while maintaining sufficient candidate pool size
Property type filters (office, retail, industrial) ensure category-appropriate comparisons
Optional property subtype filtering provides additional granularity for specialized categories, such as Data Centers or Self Storage locations
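
As a rough sketch, the three filters above amount to a conjunction of equality checks, with the subtype check applied only when a subtype is requested. The field names (market, property_type, subtype) and sample records are hypothetical.

```python
def filter_candidates(properties, market, property_type, subtype=None):
    """Keep properties in the same market and type; subtype is optional."""
    return [
        p for p in properties
        if p["market"] == market
        and p["property_type"] == property_type
        and (subtype is None or p.get("subtype") == subtype)
    ]

# Made-up records for illustration.
properties = [
    {"id": 1, "market": "NYC", "property_type": "office"},
    {"id": 2, "market": "NYC", "property_type": "industrial", "subtype": "Data Center"},
    {"id": 3, "market": "NJ", "property_type": "industrial", "subtype": "Self Storage"},
    {"id": 4, "market": "NYC", "property_type": "industrial", "subtype": "Self Storage"},
]

nyc_industrial = filter_candidates(properties, "NYC", "industrial")
nyc_data_centers = filter_candidates(properties, "NYC", "industrial", "Data Center")
```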

Performance Optimization Strategy

Market-based grouping reduces computational overhead by limiting search scope to relevant geographic regions
Asynchronous processing enables concurrent searches across multiple market/property type combinations
The system efficiently processes millions of properties, typically returning 500 to 1,000 highly relevant candidates within milliseconds
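
The concurrent, market-grouped search pattern can be illustrated with Python’s asyncio. This is a sketch under assumptions: search_group stands in for a real vector-database query, and the in-memory index and group keys are made up.

```python
import asyncio

async def search_group(market, property_type, index):
    """Stand-in for one awaitable vector-database query scoped to a group."""
    await asyncio.sleep(0)  # yields control, as a real network call would
    return [pid for pid, m, t in index if m == market and t == property_type]

async def search_all(groups, index):
    """Run one scoped search per (market, property type) group concurrently."""
    results = await asyncio.gather(
        *(search_group(m, t, index) for m, t in groups)
    )
    return dict(zip(groups, results))

# Hypothetical index rows: (property id, market, property type).
index = [
    (1, "NYC", "office"),
    (2, "NJ", "industrial"),
    (3, "NYC", "office"),
    (4, "NJ", "office"),
]
groups = [("NYC", "office"), ("NJ", "industrial")]
results = asyncio.run(search_all(groups, index))
```

asyncio.gather returns results in the order the coroutines were passed, which is why zipping them back onto groups is safe.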

This multi-layered approach ensures that the initial candidate selection captures both the most geographically proximate and functionally similar properties, creating an optimal foundation for the subsequent machine learning refinement phase.

Step 2 – Machine Learning Refinement

The candidate set from Step 1 is further analyzed using a tree-based XGBoost model that employs market-contextualized feature engineering:

Market-Relative Feature Analysis

Rather than using raw property differences, the model employs percentile-based comparisons that account for market-specific norms. For example, a 10-floor difference between office properties in Midtown Manhattan represents a much smaller relative variance than the same difference between industrial properties in New Jersey, where building heights are typically more uniform.
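
To make the floor-count example concrete, the sketch below compares the same 10-floor raw difference in two markets by measuring the gap in percentile terms. The two distributions are invented for illustration and do not reflect actual CompStak data; the function names are hypothetical.

```python
from bisect import bisect_right

def percentile(sorted_values, x):
    """Share of market values that are <= x, in [0, 1]."""
    return bisect_right(sorted_values, x) / len(sorted_values)

def percentile_gap(sorted_values, a, b):
    """Absolute difference between two values, measured in percentile terms."""
    return abs(percentile(sorted_values, a) - percentile(sorted_values, b))

# Made-up floor-count distributions for two market/property type segments.
manhattan_office_floors = sorted([5, 10, 15, 20, 25, 30, 35, 40, 45, 50])
nj_industrial_floors = sorted([1, 1, 1, 1, 1, 1, 2, 2, 3, 11])

# The same raw difference of 10 floors in each segment.
office_gap = percentile_gap(manhattan_office_floors, 20, 30)
industrial_gap = percentile_gap(nj_industrial_floors, 1, 11)
```

In this toy data the 10-floor difference spans a smaller share of the office distribution than of the industrial one, which is the effect the percentile-based features are meant to capture.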

Property Type-Specific Feature Sets

Office Properties:

Property Attributes: Key attributes include property size, number of floors, and year built, evaluated relative to local office market norms
Lease Attributes: Office-to-retail ratios, flex space allocation, and mixed-use space composition based on tenant lease data
Market Position: Rent levels compared to local office market distributions
Geographic Location: Proximity and submarket positioning within the broader office ecosystem

Industrial Properties:

Property Attributes: Key attributes include ceiling height, property size, year built, and number of loading docks
Lease Attributes: Tenant industry makeup and distribution, and occupancy patterns
Property Subtype Similarity: Warehouse, manufacturing, flex industrial, and distribution center classifications

Market Normalization Process

Percentile Ranking: All quantitative features are converted to percentile rankings within their respective markets
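
A minimal sketch of per-market percentile ranking over a set of records follows. The mid-rank convention for ties, the field names, and the sample data are assumptions; the source does not specify how ties or small markets are handled.

```python
from collections import defaultdict

def to_market_percentiles(records, feature):
    """Add '<feature>_pct': the mid-rank percentile within each record's market."""
    by_market = defaultdict(list)
    for r in records:
        by_market[r["market"]].append(r[feature])

    out = []
    for r in records:
        values = by_market[r["market"]]
        below = sum(v < r[feature] for v in values)
        equal = sum(v == r[feature] for v in values)
        # Mid-rank: tied values share the same percentile.
        pct = (below + 0.5 * equal) / len(values)
        out.append({**r, feature + "_pct": pct})
    return out

# Made-up records: the same raw size lands at different percentiles per market.
records = [
    {"id": 1, "market": "NYC", "size": 100},
    {"id": 2, "market": "NYC", "size": 200},
    {"id": 3, "market": "NYC", "size": 300},
    {"id": 4, "market": "NYC", "size": 400},
    {"id": 5, "market": "NJ", "size": 100},
    {"id": 6, "market": "NJ", "size": 400},
]
ranked = {r["id"]: r["size_pct"] for r in to_market_percentiles(records, "size")}
```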

Model Training & Refinement

User-Contributed Data: The XGBoost model is trained on custom comparable sets contributed by CompStak users, which grounds its predictions in real-world analyst judgment
Focused Analysis: By concentrating on the pre-filtered candidate pool, the model can perform detailed feature analysis without computational constraints
Re-ranking Logic: Candidates are re-ordered based on the comprehensive feature analysis, often significantly improving upon the initial vector search results
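
The source does not describe how user-contributed comparable sets become training data. One plausible framing, shown here purely as an assumption, treats each (subject, candidate) pair as a binary example labeled 1 when the user included the candidate in their custom set. The function name and ids are hypothetical.

```python
def build_training_pairs(subject_id, candidate_ids, user_comp_set):
    """Label each candidate 1 if the user chose it as a comparable, else 0."""
    return [
        (subject_id, cid, 1 if cid in user_comp_set else 0)
        for cid in candidate_ids
    ]

# Hypothetical subject, Step 1 candidates, and a user's custom comparable set.
pairs = build_training_pairs("S", ["A", "B", "C"], {"A", "C"})
```

Labeled pairs of this kind, joined with the market-relative features, could then be fed to a gradient-boosted classifier whose predicted scores drive the re-ranking.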

Probability Scoring

Confidence Measurement: Each comparison receives a probability score indicating the confidence level of the comparability match
Interpretable Output: Raw model scores are transformed through logistic scaling to produce user-friendly probabilities while preserving the relative ranking accuracy
Market-Adjusted Thresholds: Probability interpretation accounts for market-specific data quality and coverage variations
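
The rank-preserving property of logistic scaling can be checked with a short sketch. The scale and shift parameters and the raw scores are hypothetical; any positive scale keeps the function strictly increasing, so the ordering of candidates is unchanged.

```python
import math

def logistic(score, scale=1.0, shift=0.0):
    """Map a raw model score to (0, 1); monotonic for scale > 0."""
    return 1.0 / (1.0 + math.exp(-(scale * score + shift)))

# Made-up raw scores for three candidates.
raw = {"A": 2.0, "B": -1.0, "C": 0.0}
probs = {cid: logistic(s) for cid, s in raw.items()}

ranked_by_raw = sorted(raw, key=raw.get, reverse=True)
ranked_by_prob = sorted(probs, key=probs.get, reverse=True)
```

How scale and shift would be calibrated per market is not stated in the source; a market-adjusted threshold could, for example, vary shift by market, but that is speculation.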

This approach ensures that property comparisons are meaningful within their specific market and property type context, rather than applying universal standards that may not reflect local real estate dynamics.

Pipeline Overview

Product Impact

CompStak Platform Integration

CompStak maintains the Property CompSet for all properties in our database. It can be found on the property page under the Competitive Set tab, which provides a model-suggested comparable set and allows users to customize the selection.

API Integration

In addition to the platform experience, Property CompSet can be accessed directly through the CompStak API. Users can provide either an address or geographic coordinates (latitude/longitude), along with key property attributes such as property type, property size, floors, year built, ceiling height, and number of loading docks. The API returns a ranked set of comparable properties with associated probability scores, enabling seamless integration into external workflows such as portfolio analysis, underwriting, and valuation pipelines.
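
The sketch below models only the inputs listed above and the "address or coordinates" rule. It is not CompStak’s actual API schema: the class name, field names, and payload shape are hypothetical, and no endpoint or authentication details are implied.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class CompSetQuery:
    """Hypothetical container for the inputs described in the text."""
    property_type: str
    address: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    property_size: Optional[float] = None
    floors: Optional[int] = None
    year_built: Optional[int] = None
    ceiling_height: Optional[float] = None
    loading_docks: Optional[int] = None

    def to_payload(self):
        """Validate the location rule and drop unset fields."""
        has_coords = self.latitude is not None and self.longitude is not None
        if not (self.address or has_coords):
            raise ValueError("provide an address or both latitude and longitude")
        return {k: v for k, v in asdict(self).items() if v is not None}

query = CompSetQuery(property_type="industrial", latitude=40.73, longitude=-74.17,
                     ceiling_height=32.0, loading_docks=12)
payload = query.to_payload()
```

Consult the CompStak API documentation for the real parameter names and response format before building on this shape.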

Disclaimer: This document is proprietary and confidential. Unauthorized copying, sharing or distribution of this document or the information contained herein is strictly prohibited.