November 6, 2024


Building a Chinese ancient architecture multimodal dataset combining image, annotation and style-model


Overview of the Dataset

The dataset contains over 500 photographs and 24 videos of Chinese ancient buildings located in the central and southern parts of Shanxi Province. To preserve detail, each photograph is stored at a resolution of 3000 × 2000 or 3000 × 1687 pixels, which is suitable for existing generative models such as SDXL [2]. The vast majority of these buildings date back to the Yuan Dynasty or earlier, including the Tang, Five Dynasties, Liao, Song, and Jin periods. Buildings from these dynasties have stood for more than 600 years; they represent the historical features of their periods and carry important humanistic value. Unlike the Forbidden City, the most famous Chinese ancient architectural complex, these buildings exhibit the more distinct characteristics of an earlier era. This matters because most current well-known models generate unsatisfactory results that resemble the Forbidden City buildings; our dataset gives AI models a broader data basis for learning Chinese ancient architecture.
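As a quick sanity check on the resolutions stated above, the following minimal sketch tests whether an image size matches one of the two sizes named in the text. Only the two sizes come from the dataset description; the function names and the idea of validating sizes this way are illustrative assumptions, not part of the dataset's tooling.

```python
# The two sizes are taken from the text; the helper names are hypothetical.
ALLOWED_SIZES = {(3000, 2000), (3000, 1687)}


def is_dataset_resolution(width: int, height: int) -> bool:
    """Return True if (width, height) is one of the two stated photo sizes."""
    return (width, height) in ALLOWED_SIZES


def aspect_ratio(width: int, height: int) -> float:
    """Return the width-to-height ratio of an image size."""
    return width / height
```

Note that 3000 × 2000 is exactly a 3:2 frame, while 3000 × 1687 is very close to 16:9 (3000 × 9 / 16 = 1687.5).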

Chinese architecture has followed a strict hierarchical system since ancient times, and the shape of the roof is the most important factor in determining a building’s grade. Because the roof is also the most striking visual difference between styles, we labelled four building styles in our dataset by the name of their roof. For each style, we provide photographs of ancient buildings from different regions of Shanxi Province, in different environments, from different construction periods, and at different levels of conservation and restoration. To enrich the dataset, we provide four types of pictures of the same building (front, side, overhead, and detail). In addition, the dataset includes a multi-level text annotation for each image, consisting of two main parts: an accurate name annotation of the ancient building in the photograph, and a detailed annotation of the building with background information. Such a dataset is well suited to training or fine-tuning multimodal models. Taking Stable Diffusion as an example, the image data and multi-level text information can be integrated into the training of both the diffusion model [1] and CLIP [23] through fine-tuning strategies such as LoRA [15]. The results in the Technical Validation part support this.

Data Acquisition

The image dataset contains images of most of the existing ancient buildings in Shanxi Province, and most of the buildings date back to the Yuan Dynasty or earlier (1368 AD and before). As noted in prior research [24], Shanxi is a major province for cultural relics in China, with a complete sequence of ancient buildings across many categories. It has 518 wooden structures that can be traced back to the Yuan Dynasty or earlier, ranking first in the country. The details are shown in Table 1. The selection of photography targets was guided primarily by the geographic distribution of China’s national ancient buildings. High-quality on-site photography was the primary source of image data. To use resources efficiently and minimize expenses, we chose Shanxi Province as the sampling site, a strategy that is particularly effective because a substantial share of China’s ancient architectural heritage is concentrated there.

Table 1 Regional distribution of traditional Chinese wooden-structure buildings from the Tang Dynasty to the Yuan Dynasty in Shanxi, with their proportion of the national total.

Another criterion in creating the dataset was the historical period of the ancient buildings. The majority of structures included were constructed during the Yuan Dynasty and earlier periods. This focus was driven by the notably skewed representation of Ming and Qing Dynasty architecture in existing general datasets, a trend highlighted by the frequent inclusion of iconic sites like the Forbidden City, which often overshadow less represented architectural styles and periods. The skew reflects the global recognition of Ming and Qing Dynasty buildings, while pre-Ming architecture has remained relatively understudied. To address this gap, this study compiled buildings absent from existing ancient building datasets, focusing specifically on those dating from the Yuan Dynasty and earlier. Shanxi Province has also preserved a large number of pre-Ming buildings, which further supports its choice as the sampling site.

To provide more comprehensive data on ancient buildings, we collected a large number of photographs on site, then organized and post-processed the images by location. Besides image data, the descriptive text for these buildings plays an important role in training. Earlier image datasets label each image only with an overall name, whereas today’s multimodal model training requires more detailed descriptive information. We therefore collected a large amount of professional text data for these buildings from professional websites and existing multimodal models. In addition, one of our authors, Xun Wang, who has long worked in the field of art, was invited to construct several sub-datasets. All images in these sub-datasets are chosen from her works and feature distinct architectural styles with significant characteristics. Additional fine-tuned models were trained on this high-quality art classification data and the corresponding style description text. With these LoRA models trained on professionally screened data, our dataset better supports multi-style image generation.

Image data

Basic details of data collection. Manual photography was used to collect the photographs, in order to obtain first-hand data and ensure professional quality. We first decided to visit the Yuncheng, Linfen, Changzhi, and Jincheng areas of Shanxi Province, targeting ancient buildings from the Yuan Dynasty and earlier. According to prior research [24], these four city areas house an estimated 80% of the total number of ancient buildings in the province. We deliberately chose late July for the photography campaign to obtain the best possible lighting conditions. The entire data acquisition cycle lasted a full week.

Shooting challenges. Because the buildings differ in conservation status and many have restricted public access, the obtained images vary in quality. In addition, many of these ancient structures are dispersed across the towns surrounding the four cities and lack adequate protection, so their surroundings make it hard to find suitable shooting angles, which is a challenge for manual photography. Nevertheless, given the demands that current AI models place on resolution, lighting, and shadow, we strove to capture high-resolution, high-quality photographs.

Equipment information. To capture both the overall appearance and the partial details of these ancient buildings, we employed a range of photographic tools: a mobile phone (iPhone 13 Pro Max), a camera (Sony a7R4), and a drone (DJI Air 2S). The camera served as the primary acquisition device because it captures the highest-quality images. The mobile phone was used as a backup in case of problems with the camera. The drone was used to capture images and video from a bird’s-eye view. To capture comprehensive details and ensure image clarity, we used several camera lenses with distinct focal lengths:

  • Sigma 16–35mm F2.8: A wide-angle lens that captures expansive views of the building’s exterior and surrounding environment.

  • Zeiss ZE 35mm f/1.4: A prime lens that provides a natural perspective and excellent sharpness for capturing details of the building’s architectural features.

  • Sony 55mm f/1.8: A prime lens that offers a versatile focal length suitable for capturing both close-ups of architectural details and medium-distance shots.

  • Canon EF 70-200mm f/2.8L IS II USM: A telephoto lens that allows us to capture close-ups of distant details and the overall structure of the building from a distance.

These tools enabled us to obtain a comprehensive collection of high-resolution images from various angles, providing a rich dataset (Fig. 1) for subsequent analysis and modeling.

Fig. 1

Some examples taken with various lenses and different photographic tools. These tools and equipment were employed to capture rich and varied information.

Image processing. Despite the camera’s strong performance, areas of the buildings with obstructed lighting show sub-optimal quality in the original images. Depending on its focal length, the lens can also introduce distortion and chromatic aberration. To address these problems, we used Adobe Photoshop (2022) to make careful, professional image adjustments. This process began in late July and was completed within three months. Fig. 2 compares the original photographs and their shooting information with the processed photographs. The processed photographs visibly show more detail and a more faithful rendering of the scene.

Fig. 2

Four images are shown for comparison, together with the shooting settings for each. The refined pictures show the architecture more clearly and in more detail.

Collecting and organizing images and videos. The complete image dataset, built after thorough shooting and sorting, comprises 59 ancient buildings in Shanxi Province; most are nationally protected, while the others are provincially or municipally protected. The majority of these structures date back to the Yuan Dynasty or earlier and serve as quintessential examples of traditional Chinese architecture. For each ancient building, we captured its exterior and surrounding environment from multiple vantage points, with particular focus on the architectural features that embody the building’s essence, such as roofs and brackets (dougong in Chinese). Telephoto photography was used for close-up shots, while aerial drone photography provided panoramic views of some of the structures in both images and video. In the end, we obtained a large amount of image and video data.

Text data

Text data annotation and construction. As noted in prior research [25], describing images in natural language is a critical problem in vision-language learning. Building the text dataset requires annotating each image in the compiled image dataset. Following annotation procedures from previous multimodal research, our annotation process is divided into two steps. In the first step, pre-trained models are used to generate a preliminary rough-draft text: we use the BLIP [26] model to label each input image automatically, generating a description based on the image’s content. In this way, the knowledge in existing pre-trained language-image models such as CLIP [23] and BLIP [26] is carried into the text annotation. However, these generated annotations only reflect what the models learned from their training data and cannot express unseen content. Interactive image description is the core mechanism of current image generative models, so the descriptions used in training must be richer, more accurate, and more detailed. To this end, we incorporate expert descriptions of ancient buildings sourced from the State Administration of Cultural Heritage, local cultural relics protection units, and online repositories such as the Shanxi Cultural Relics Bureau [27], Wikipedia, and Baidu Encyclopedia. Note that these sources overlap considerably; we mainly adopted the content from the Shanxi Cultural Relics Bureau [27].

This fusion forms a foundational text that contains key information such as the names of the ancient buildings depicted in the image and their visual characteristics. In the second step, the historical, geographical, developmental, and influential aspects of the relevant buildings are compiled from the aforementioned sources and used to refine the text manually. This enriched information is then combined with image-specific details, such as the shooting angle, to produce the refined text.
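The two-step annotation described above can be pictured with a minimal sketch. It assumes a simple record with four fields and joins them into one refined text; the class name, field names, and joining rule are hypothetical illustrations, not the authors' actual annotation format, and the draft caption is passed in as a plain string rather than produced by a captioning model here.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical record for one image's multi-level text annotation."""
    building_name: str   # accurate name of the building in the photograph
    draft_caption: str   # step 1: rough draft, e.g. from a captioning model
    expert_notes: str    # step 2: background compiled from expert sources
    shooting_angle: str  # image-specific detail, e.g. "front" or "side"

    def refined_text(self) -> str:
        """Join name, angle, draft caption, and expert notes into one text."""
        parts = [
            f"{self.building_name}, {self.shooting_angle} view",
            self.draft_caption.strip().rstrip("."),
            self.expert_notes.strip().rstrip("."),
        ]
        # Skip empty fields so a missing draft or note leaves no stray periods.
        return ". ".join(p for p in parts if p) + "."
```

For example, `Annotation("Example Hall", "a wooden hall with a tiled roof.", "Placeholder background text", "front").refined_text()` yields a single sentence sequence that starts with the name and shooting angle; the values here are placeholders, not entries from the dataset.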

Pinyin annotation. The components and structures of Chinese architecture have many unique names that differ from those of Western architecture. How to caption the content of an image exactly is therefore a matter of explicit expression for later AI multimodal learning tasks. Traditional translations turn a simple Chinese building term into a complex English phrase, such as “gable and hip roof”. To describe these parts more concisely and precisely, we introduce Pinyin, the basic tool for learning Chinese, into the text annotation to name the features of these buildings. With Pinyin, the gable and hip roof can be expressed concisely as “xieshan roof”.
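A minimal sketch of this substitution follows, assuming captions are plain English strings. Only the xieshan mapping comes from the text; the other three entries are commonly used Pinyin names for roof forms, added here as assumptions for illustration, and the function name is hypothetical.

```python
import re

# Only "gable and hip roof" -> "xieshan roof" is taken from the text.
# The remaining entries are assumed examples, not the dataset's vocabulary.
ROOF_PINYIN = {
    "gable and hip roof": "xieshan roof",
    "overhanging gable roof": "xuanshan roof",
    "flush gable roof": "yingshan roof",
    "hip roof": "wudian roof",
}


def to_pinyin_terms(caption: str) -> str:
    """Replace long English roof phrases in a caption with Pinyin terms."""
    # Longest phrases go first so "gable and hip roof" is matched as a
    # whole before the shorter "hip roof" entry can match inside it.
    for phrase in sorted(ROOF_PINYIN, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        caption = pattern.sub(ROOF_PINYIN[phrase], caption)
    return caption
```

Ordering the replacements by phrase length is the main design point: without it, a shorter phrase could rewrite part of a longer one and produce a mixed term.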

Style Data

Considering future applications of multimodal models trained on the proposed Chinese ancient architecture dataset, we collaborated with experts in artistic creation to build several fine-tuned models for AI-assisted design. Architectural style is a multifaceted concept, inextricably linked to aesthetic considerations. To ensure the accuracy and comprehensiveness of the style classification, we engaged an expert experienced in design to review the relevant architectural styles professionally. We provide seven categories covering different styles and scenes, each with a detailed description that includes background knowledge and style characteristics. This broad categorization serves as a foundation for the architectural style LoRA models. To protect the original authored content, we do not provide the base images used for training. Seven style LoRA models were trained on the SDXL model. In total, our dataset supplies seven trained style LoRA models and one architecture content LoRA model. Combining the style models with the architecture LoRA, we generated results, provided in the Technical Validation part, that show the consistency and variety of the generated content.

Data Alignment pipelines

It is important to handle and classify these data appropriately. The images were therefore first processed in detail; we then analyzed the image data and the historical information behind them to classify the images better. The details are as follows.

The acquired images went through a two-step procedure. First, drawing on prior research, a substantial number of photographs were handpicked to show architectural structures, intricate details, and the surrounding environment. Second, the selected images were adjusted in Adobe Photoshop with respect to shooting angle, format, lighting, and resolution. These modifications were made to ensure faithful reproduction of the images and to improve the models’ recognition capabilities. To balance true-to-life restoration with stylistic coherence, selective shading was applied to enhance intricate details. The primary Photoshop operations were fine-tuning color temperature, brightness, and contrast.

A multi-level division of the dataset was implemented to organize the large and varied data. The dataset contains images, text, video, and LoRA models; in addition, because these ancient buildings have undergone multiple repairs and reconstructions throughout history, some building complexes include structures from different dynasties. To make the dataset structure clearer, the data are arranged in hierarchical levels. The top level is divided by data type and relevance. Because annotation text files correspond to images one-to-one, images and text share a folder named Image&Text; there are also a video folder and a style folder. Within the Image&Text folder, the second level corresponds to geographical location. Additionally, statistical analyses were conducted to investigate the content characteristics of the dataset, including roof forms, construction dynasties, and functional aspects.
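The one-to-one correspondence between images and annotation files can be checked with a short script. The sketch below assumes that each location folder under Image&Text holds image files and `.txt` annotations sharing the same file stem; the file extensions, the stem-matching rule, and the function name are assumptions made for illustration, not documented properties of the dataset.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image formats


def find_unpaired(root: Path) -> dict:
    """Report images without annotations, and vice versa, per location folder.

    `root` is expected to be the Image&Text folder, whose sub-folders are
    geographical locations.
    """
    report = {}
    for location in sorted(p for p in root.iterdir() if p.is_dir()):
        files = list(location.iterdir())
        images = {f.stem for f in files if f.suffix.lower() in IMAGE_SUFFIXES}
        texts = {f.stem for f in files if f.suffix.lower() == ".txt"}
        missing_text = sorted(images - texts)
        missing_image = sorted(texts - images)
        # Only locations with at least one mismatch appear in the report.
        if missing_text or missing_image:
            report[location.name] = {
                "missing_text": missing_text,
                "missing_image": missing_image,
            }
    return report
```

An empty dictionary from `find_unpaired(Path("Image&Text"))` would indicate that, under these assumptions, every image has exactly one matching annotation file.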
