CMAB: A Multi-Attribute Building Dataset of China

The technical validation of the CMAB dataset consists of three parts: (1) the performance of the OCRNet model and the XGBoost model on the test set (including rooftop, height, and function); (2) comparison of our data with related published datasets (including rooftop, height, and function); (3) validation by comparing predicted values with observed values from SVIs (including height, function, and age). For details, see the sections “Model evaluation and comparison for geometric attributes” and “Model evaluation and comparison for indicative attributes”.

For evaluation metrics of building rooftop, mIoU (mean Intersection over Union) represents the average segmentation accuracy across all classes. Accuracy denotes the overall pixel classification accuracy. The F1-score combines precision and recall, making it especially useful for imbalanced datasets. Precision and Recall indicate segmentation performance for each class, identifying where the model performs better or worse. We use mIoU, Recall, Precision, F1-score, and Accuracy to evaluate the rooftop segmentation model:

$$\mathrm{Precision}=\frac{TP}{TP+FP}$$

(3)

$$\mathrm{Recall}=\frac{TP}{TP+FN}$$

(4)

$$\text{F1-score}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$

(5)

$${Accuracy}=\,\frac{{TP}+{TN}}{{TP}+{TN}+{FP}+{FN}}$$

(6)

TP, FP, and FN are the True Positives, False Positives, and False Negatives for class i (building or background), and TN is the True Negatives. mIoU is the mean IoU of the building and background classes, with k = 2 being the number of classes.

$$\mathrm{mIoU}=\frac{1}{k}\sum_{i=1}^{k}\mathrm{IoU}_{i}=\frac{1}{k}\sum_{i=1}^{k}\frac{TP_{i}}{TP_{i}+FP_{i}+FN_{i}}$$

(7)
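As a minimal sketch of Eqs. (3)–(7), the snippet below computes the rooftop metrics from the four pixel counts of a binary confusion matrix, treating "building" as the positive class. The function name and the example counts are illustrative assumptions, not values from the dataset. For the background class the roles swap (its true positives are the building class's true negatives), which is how the second IoU term is obtained.

```python
def binary_segmentation_metrics(tp, fp, fn, tn):
    """Eqs. (3)-(7) from pixel counts, with 'building' as the positive class.

    Hypothetical helper for illustration; no guard against empty classes.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # IoU of the building class.
    iou_building = tp / (tp + fp + fn)
    # IoU of the background class: TP_bg = tn, FP_bg = fn, FN_bg = fp.
    iou_background = tn / (tn + fn + fp)
    miou = (iou_building + iou_background) / 2  # k = 2 classes
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "iou_building": iou_building,
        "iou_background": iou_background,
        "miou": miou,
    }


# Made-up pixel counts, for illustration only.
metrics = binary_segmentation_metrics(tp=80, fp=10, fn=10, tn=900)
```

Because background pixels usually dominate, Accuracy and the background IoU tend to be high even when the building IoU is modest, which is one reason to report mIoU alongside Accuracy.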

For building height, we evaluated model accuracy with RMSE, MAE, and R². RMSE emphasizes large errors and their impact. MAE reflects overall accuracy by averaging absolute errors. R² shows how well predictions fit the actual data. The formulas are as follows, where $n$ is the number of buildings, ${y}_{i}$ is the reference height of building $i$, ${\hat{y}}_{i}$ is the predicted height, and $\bar{y}$ is the mean reference height:

$${RMSE}=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}$$

(8)

$${MAE}=\,\frac{1}{n}{\sum }_{i=1}^{n}|{y}_{i}-{\hat{y}}_{i}|$$

(9)

$${R}^{2}=1-\frac{{\sum }_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}{{\sum }_{i=1}^{n}{({y}_{i}-\bar{y})}^{2}}$$

(10)
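The three height metrics in Eqs. (8)–(10) can be sketched in a few lines of plain Python. The function name and the sample heights are hypothetical and serve only to make the formulas concrete.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) following Eqs. (8)-(10)."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)          # sum of squared residuals
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n                     # mean of the reference heights
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2


# Illustrative heights in meters (not taken from the dataset).
rmse, mae, r2 = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```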

To assess the uncertainty in building height estimation, we randomly held out 10% of the data as a test set for error and uncertainty analysis. In each iteration, the remaining 90% of the data was randomly split into a validation set (20%) and a training set (80%). This process was repeated 100 times, with XGBoost hyperparameters optimized through grid search in each iteration. The mean of the 100 predictions per building served as the final height prediction. Model accuracy metrics (RMSE/MSE/MAE/R²) were evaluated on the test set. Uncertainty was quantified as the range of the relative error ${{\rm{RE}}}_{{\rm{i}}}$ of building ${\rm{i}}$ across the 100 models, each trained on a different data split with separately optimized hyperparameters. A wide range indicates high uncertainty, while a narrow range suggests consistent predictions.

Specifically, for each building sample ${\rm{i}}$, ${{\rm{RE}}}_{{\rm{i}}}$ is defined as the absolute difference between the predicted value ${{\rm{P}}}_{{\rm{i}},{\rm{j}}}$ and the true building height ${{\rm{T}}}_{{\rm{i}}}$, divided by ${{\rm{T}}}_{{\rm{i}}}$. Additionally, we provide the absolute error ${{\rm{AE}}}_{{\rm{i}}}$ and the range of absolute errors across the 100 model estimates. Here, ${\rm{j}}$ denotes the ${\rm{j}}$th of the 100 model estimates:

$${{RE}}_{i}=\left|\frac{{P}_{i,j}-{T}_{i}}{{T}_{i}}\right|,{{AE}}_{i}={P}_{i,j}-{T}_{i}$$

(11)
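A minimal sketch of how the per-building error ranges could be derived from Eq. (11) is shown below. It assumes that "range" means maximum minus minimum over the model estimates, which the text does not state explicitly. The function name and the three sample predictions (standing in for the 100 estimates) are hypothetical.

```python
def error_ranges(true_height, predictions):
    """Summarize the spread of errors for one building across model estimates.

    Assumption: 'range' is taken as max minus min over the estimates.
    """
    # Relative error per estimate j: |P_ij - T_i| / T_i
    relative_errors = [abs(p - true_height) / true_height for p in predictions]
    # Absolute error per estimate j, signed as written in Eq. (11): P_ij - T_i
    absolute_errors = [p - true_height for p in predictions]
    return {
        # Final prediction is the mean over all estimates.
        "final_height": sum(predictions) / len(predictions),
        "re_range": max(relative_errors) - min(relative_errors),
        "ae_range": max(absolute_errors) - min(absolute_errors),
    }


# Three made-up estimates for a building with a true height of 20 m.
summary = error_ranges(20.0, [18.0, 22.0, 25.0])
```

A building whose estimates cluster tightly yields a small `re_range` even if all of them are biased, so the range measures stability across data splits rather than accuracy.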

To verify the model inference and calculation results for each attribute, the study used SVIs to validate building height, function, and age along streets. Five administrative cities were initially selected, and buildings along streets were sampled for validation. These cities represent different urban hierarchies, provinces, and climate zones. The sampling aimed to cover a wide range of building heights, functions, and ages. Subsequently, the point on the building’s outline nearest to the closest SVI point was designated as the observation point. The direction of the street view sampling was defined as vector 1, and the direction from the street view point to the observation point was defined as vector 2. The angle difference ${\rm{\theta }}$ between these two vectors was calculated: if it fell between 45 and 135 degrees, the right-side image in the forward direction of the street view point was extracted; if it fell between −45 and −135 degrees, the left-side image was extracted. A manual auditing platform was then established, involving an auditor with an urban planning background (Figure S6). This process led to the manual annotation of 2,500 data points on building height, function, structure, style, quality, and age.
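The left/right image selection described above can be sketched as follows. The sketch assumes planar coordinates (x east, y north) and a clockwise-positive angle measured from vector 1 to vector 2, so that a building on the right of the travel direction gives a positive angle. This sign convention and the function name are assumptions, since the text specifies only the angle thresholds.

```python
import math


def select_side(heading, svi_point, observation_point):
    """Pick the street view image side facing the observation point.

    heading: vector 1, the sampling direction as (dx, dy).
    Returns 'right', 'left', or None when the building lies ahead or behind.
    Assumes planar x/y coordinates and a clockwise-positive angle.
    """
    # Vector 2: from the street view point to the observation point.
    v2 = (observation_point[0] - svi_point[0],
          observation_point[1] - svi_point[1])
    cross = heading[0] * v2[1] - heading[1] * v2[0]
    dot = heading[0] * v2[0] + heading[1] * v2[1]
    # atan2(cross, dot) is counterclockwise-positive; negate cross to flip it.
    theta = math.degrees(math.atan2(-cross, dot))
    if 45 <= theta <= 135:
        return "right"
    if -135 <= theta <= -45:
        return "left"
    return None


# Heading north with a building due east: it sits on the right-hand side.
side = select_side((0, 1), (0, 0), (1, 0))
```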


Model evaluation and comparison for geometric attributes

Building rooftop

To evaluate the accuracy of our product, we utilized higher-resolution remote sensing imagery and supplemented the annotated dataset with dense urban areas. We based our evaluation on a validation set comprising 23,415 manually labelled building roofs from seven cities located in different climatic zones. The results demonstrate that our building roof segmentation model outperforms existing studies in terms of mIoU, Recall, and Precision on related datasets. After supplementing the annotated data to include 114,783 building instances, our rooftop segmentation model achieved an Accuracy of 91.59%, a mIoU of 81.95%, an F1 score of 89.93%, and a Kappa coefficient of 79.86% (Table 3), proving the model’s capability to accurately identify buildings across various regions in China.

Table 3 Data comparison from the evaluation metrics of rooftop extraction results.

From the recognition results (Fig. 8), our data product demonstrates superior accuracy and completeness compared to existing building footprint datasets. Given the increased focus on spatial cities, we conducted a comparative analysis of building areas identified by different data products. Our findings indicate that our approach identifies a greater number of buildings in spatial cities. This outcome is expected, as our methodology involved the manual interpretation and comparison of remote sensing images across all spatial cities. The Douglas-Peucker algorithm with an empirical threshold used for vectorizing the contours distinguishes well between buildings and non-building objects such as cars, compared to GABLE and East_Asia buildings datasets. Additionally, our method aligns better with visual interpretation than the further black-box post-processed East_Asia dataset (using GAN), which tends to over-regularize, although some untreated roofs might appear less aesthetically pleasing due to retaining more segmented shapes. Regarding the recognition of buildings in remote areas, our climate zone annotations significantly enhanced the accuracy for some special buildings, such as large religious structures, compared to GABLE, East Asia, and 90_cities_BRA datasets. The total recognized building area in existing studies is similar, but the building count varies greatly for two main reasons: first, whether multiple roofs of a single building are identified separately, and second, whether small structures are mistakenly identified as buildings. It is worth mentioning that recent methods, such as the SKTrans framework⁷⁵, have made significant strides in automatic building footprint extraction using deep learning, particularly in handling diverse building styles and tonal differences across regions. While our method shows strong accuracy, integrating newer approaches like SKTrans could improve both efficiency and precision in future updates. 
As methods evolve, incorporating self-supervised learning models could greatly enhance building footprint extraction, especially for large-scale and multi-temporal datasets.

Building height

Firstly, we compare the height partition models (A, B, C, D, and E, representing models trained according to administrative levels) with the combination model, integrated through 100 training iterations using the Bootstrap Aggregated XGBoost method (see Table 4 for details). It can be observed that, except for the partition model trained on building data of level E, the accuracy metrics of the other partition models surpass those of the combination model. Notably, the R² values for the first two partition models exceed 0.8, with building height prediction errors less than 6 meters. This discrepancy can be explained by the substantial variation in construction investment intensity across different administrative levels. Buildings in higher-level cities are more influenced by socioeconomic factors, making their heights relatively harder to predict, as reflected in the higher R² values for building predictions in lower-level cities. Using SVIs, we audited 2,500 building heights by manually counting floors and estimating heights from the floor counts. Regression analysis against the dataset resulted in an R² of 0.72, indicating accurate identification of building heights in most cases.

Table 4 Comparison of height partition model and combination model.

Secondly, we compared different products through visualization and height segmentation, focusing on the Baidu, GABLE, and 3D-GloBFP datasets (Fig. 9). Baidu data is used as the ground truth for Chinese building heights in most studies, including ours, so the similarity between its height distribution and ours supports the accuracy of our data product. Furthermore, GABLE, the only product that identifies the height of all buildings from optical images across China, provides RMSE values for height intervals of 0–10, 10–30, 30–50, 50–100, and 100–500 meters. By categorizing our data according to these intervals, we found that our product exhibits lower RMSE values in the intervals below 50 meters. Based on statistics of the Baidu data we used, 98% of building heights are below 50 meters. Additionally, according to the 2020 Chinese Census, residential buildings with more than 10 stories (roughly equivalent to a building height above 30 m) account for only 1%. The 3D-GloBFP dataset, which also uses Baidu data as the ground truth for building heights and employs machine learning models to estimate them, does not report RMSE by height interval for all of China and only provides visualization maps. In these visualizations, the RMSE in China generally exceeds 10 meters, with specific RMSE values reported for Jiangsu Province as follows: 7 meters for 0–10 meters, 10 meters for 10–30 meters, 18 meters for 30–50 meters, and 23 meters above 50 meters. In contrast, as illustrated in Fig. 9, our results show higher accuracy across the height intervals, which is primarily attributed to our more detailed approach and the alignment of height feature creation with urban planning systems, such as incorporating municipal function centers of cities and along-street characteristics.

Therefore, these results indicate that our product achieves better height prediction results for the vast majority of buildings. Although our model, driven by optical and multi-source data, performs well, radar data is increasingly used in building height estimation. Studies using radar show improved accuracy for high-rise buildings²⁵. Incorporating such data in future work could further enhance the precision of our height predictions, especially for high-rise structures.
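The interval-wise comparison with GABLE discussed above amounts to computing RMSE separately within each height bin. The sketch below uses the five intervals quoted in the text. Assigning a building to a bin by its reference height with half-open intervals is an assumption, and the function name and sample values are hypothetical.

```python
import math

# Height intervals in meters, as quoted for GABLE in the text.
INTERVALS = [(0, 10), (10, 30), (30, 50), (50, 100), (100, 500)]


def rmse_by_interval(y_true, y_pred, intervals=INTERVALS):
    """RMSE per height bin; bins are [low, high) on the reference height.

    Returns None for a bin that contains no buildings.
    """
    result = {}
    for low, high in intervals:
        squared = [(t - p) ** 2
                   for t, p in zip(y_true, y_pred)
                   if low <= t < high]
        result[(low, high)] = (math.sqrt(sum(squared) / len(squared))
                               if squared else None)
    return result


# Made-up reference and predicted heights, for illustration only.
binned = rmse_by_interval([5.0, 8.0, 20.0, 40.0], [6.0, 6.0, 23.0, 36.0])
```

Since most buildings fall below 50 meters, the overall RMSE is dominated by the first three bins, and the per-bin view exposes the sparsely populated high-rise bins separately.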

The Relative Error range/Absolute Error range for the combination model, partition model C, partition model D, and partition model E is lower than partition models A and B (see Figure S5). This indicates that the overall accuracy decreases with higher city administrative levels, but the uncertainty in estimates due to different data partitioning methods and model parameters is lower, resulting in more stable estimates. The relative error curves of the four stable estimation models start to rise sharply around 20 meters (see Figure S5, similar to 3D-GloBFP), indicating that the deviation in building height estimates increases with the true height value, while the relative error for buildings below 20 meters is relatively low. The uncertainty is reduced in the partition models compared to the combination model (especially in the comparison between models C/D/E and the combination model).

Model evaluation and comparison for indicative attributes

Building function

The results clearly indicate that the partition models outperform the combined model, with Model A showing superior accuracy across multiple functional categories in Fig. 10. Notably, the precision for residential function identification is higher, with an F1-score approaching 0.90. Other functions, such as office buildings, have slightly lower identification precision, with F1-scores nearing 0.80. In contrast, the performance of the commercial and public service models is suboptimal, with F1-scores around 0.5. This discrepancy may be attributed to the varying sample sizes of different building types. Using SVI data, we verified the functions of 2500 buildings through manual auditing, observing details such as building names and architectural styles to determine their functions. Comparing these observations with the dataset, we found that 88% of buildings were accurately classified in terms of their functions. This indicates that the functional purposes of the majority of buildings were correctly identified through our methodology.

Furthermore, we compared our study with recent research that identified the primary functions of buildings by calculating the geometric features of building coverage, distances to adjacent objects, and the kernel density of points of interest (POI)²¹. Firstly, unlike the aforementioned study, which focused on three urban agglomerations in China (Beijing-Tianjin-Hebei, Yangtze River Delta, and Pearl River Delta), our research encompasses buildings on a national scale. Secondly, in terms of model accuracy, our functional prediction model’s accuracy is slightly lower than that of the recent study (average accuracy of 0.93), likely due to the latter’s more limited scope. Nonetheless, the performance of our models across different functions aligns with the findings of that study, exhibiting strong performance in residential functions (average accuracy of 0.97) while demonstrating weaker performance in commercial (average accuracy of 0.63) and public service functions (average accuracy of 0.67).
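For the function attribute, the per-category F1-scores and the share of correctly classified buildings in the SVI audit can both be derived from paired lists of observed and predicted labels. The sketch below is a generic illustration: the function names and the four sample labels are hypothetical, and F1 is written as 2TP / (2TP + FP + FN), which is algebraically equivalent to Eq. (5).

```python
def per_class_f1(y_true, y_pred):
    """One-vs-rest F1-score for every label present in either list."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # Define F1 as 0.0 when the label is never observed or predicted.
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def agreement_rate(y_true, y_pred):
    """Fraction of buildings whose predicted label matches the observed one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up audit labels, for illustration only.
observed = ["residential", "residential", "office", "commercial"]
predicted = ["residential", "residential", "commercial", "commercial"]
f1_scores = per_class_f1(observed, predicted)
rate = agreement_rate(observed, predicted)
```

Categories with few samples contribute little to the agreement rate but can have very low F1-scores, which matches the pattern of a high overall share of correct labels alongside weaker commercial and public service scores.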

Building quality and age

Judging from the model evaluation, the accuracy of building quality depends on the accuracy of the Yolo-v8 model. According to Li⁶⁸ and related research, the identification accuracy of various building quality categories is as follows (see Table 5 for details): “Buildings with damaged facades” (82.7%), “illegal/temporary buildings” (71.6%), “Graffiti/illegal advertisement” (81.5%), “Stores with poor facades” (89.2%), “Buildings with unkempt facades” (79.2%), and “Stores with poor signboards” (86.3%). The accuracy of building age depends on the accuracy of GAIA data. According to Gong, the mean overall accuracy over years of GAIA is higher than 90%⁶⁹.

Table 5 The overall accuracy of the Yolo-v8 model in recognizing building quality.

We acknowledge that using impervious surfaces to date buildings may lead to estimates that are slightly earlier than the actual construction year, as impervious surfaces are often developed prior to building construction. To validate our method, we utilized extensive housing trade data from Anjuke.com, one of China’s largest real estate platforms, including 3,771,892 house rent and 608,984 community records, covering 2,490 cities with construction years and coordinates (Figure S7, Figure S8). We found a significant positive correlation (P < 0.05) between the building ages from housing trade data and our estimates at the provincial level. However, as housing trade data is typically aggregated at the community level, full validation of building-level age remains a challenge. Additionally, our comparison with GAIA building age data demonstrated an average difference of 8.7 years, with GAIA dating buildings earlier than the housing trade records. In future data product updates, we aim to incorporate accurate building ages observable through street view images to enhance accuracy.

Therefore, we finally used SVIs to manually audit and validate the quality and age of 2,500 buildings, assigning quality-problem severity scores from 0 to 6, where higher scores indicate more severe building quality issues. The correlation analysis with existing results yielded an R² value of 0.78, indicating accurate identification of quality issues in the majority of buildings. Regarding building age, we divided the manually labelled building ages into five categories (time division points are 1985, 1990, 2000, 2010, and 2018), checked whether the identified building age falls in the same category as the observed value, and found that 82% of the buildings are consistent. This shows that most building ages are accurately classified. We could not compare completion years directly for two reasons: firstly, the year of completion cannot be determined by visual inspection alone; and secondly, the experiment assumes that the expansion of impervious surfaces is synchronous with building construction, whereas in reality some buildings are demolished and others appear before the impervious surfaces are established.

Building structure and style acquired by large multimodal models

Currently, large multimodal models (LMMs) like GPT-4o have significantly changed the paradigm of machine learning application modelling. We found that LMMs perform well in inferring building structure and style (Figure S9). We tested models like CLIP⁷⁶ and GPT-4o⁷⁷. While GPT-4o’s accuracy, enhanced by semantic understanding, is about 8% higher, the high cost of inferring 60 million street view images led us to use the open-source CLIP model. We fine-tuned CLIP using 2,500 annotated SVIs (height, function, age, quality, structure, style), improving its top-1 accuracy to 25%, surpassing GPT-4o.

However, we observed that CLIP’s top-1 accuracy for attributes like building height, quality, and age remains much lower than that of the machine learning models we built with well-defined building features. This is understandable since SVIs cannot capture full building height, multimodal models lack fine-grained recognition of quality, and SVIs cannot provide year-by-year temporal granularity for building age. For functions, while LMMs achieve similar accuracy to our metric-based modelling, SVIs do not cover internal buildings within blocks⁷⁷.
