Airbnb’s AI-powered photo tour using Vision Transformer | by Pei Xiong | The Airbnb Tech Blog

Boosting computer vision accuracy and performance at Airbnb

By: Pei Xiong, Aaron Yin, Jian Zhang, Lifan Yang, Lu Zhang, Dean Chen

In recent years, the integration of artificial intelligence with travel platforms has transformed how people search for and book accommodations. As a leading global marketplace for unique travel experiences and accommodations, Airbnb constantly strives to enhance the guest experience by providing informative content about the variety of properties shared by our hosts. One of the ways we help guests better understand what a listing offers before they book is through our AI-powered photo tour feature.

The AI-powered photo tour in the Listings tab, which helps hosts better organize their listing photos, uses fine-tuned vision transformers to assess a diverse set of listing images and accurately identify and classify photos into specific rooms and spaces. In this blog post, we will dive into the inner workings of the photo tour, including model selection, pretraining, fine-tuning techniques, and the trade-offs between computational costs and scalability. We will also discuss how we enhanced model accuracy despite having limited training data.

Figure 1: Photo Tour product powered by ML

Room-type classification is the main aspect of the photo tour. The goal of room classification is to accurately categorize images into the 16 different room types designed in the Airbnb product, such as ‘Bedroom’, ‘Full bathroom’, ‘Half bathroom’, ‘Living room’, and ‘Kitchen’, providing users with a comprehensive understanding of the available spaces. The challenge lies in the diversity of room layouts and lighting conditions, and in the need for models that generalize well across diverse environments.

We conducted experiments using several state-of-the-art models, including Vision Transformer (ViT) variants: ViT-base and ViT-large at different input resolutions. Additionally, we explored the performance of ConvNext2, a recently proposed convolutional neural network with comparable performance to ViT, and MaxVit, a variant combining the strengths of both Vision Transformers and CNNs. At the beginning of this project, we tested these approaches on an image classification task with Airbnb’s host-provided data and found that ViT outperformed the other approaches. We therefore chose ViT for the studies that follow.

Another key component of photo tour is image clustering, which groups the images of the same room into a cluster. A prerequisite is the ability to measure the similarity between two images, which indicates the probability that the two images belong to the same room. This is a supervised classification problem, with the input being two images and the output being a binary label of 0 or 1. As shown in Figure 2, we employed a Siamese network that processes two images simultaneously by applying the same image embedding model to each image and then computing the cosine similarity of the resulting embeddings.
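The scoring step of such a Siamese setup can be sketched in a few lines. This is a minimal illustration, not the production model: `toy_embed` is a hypothetical stand-in for the shared ViT embedding model, and the images are represented as plain lists of floats.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate all-zero embeddings
    return dot / (norm_a * norm_b)


def same_room_score(embed: Callable[[Vector], Vector],
                    image_a: Vector, image_b: Vector) -> float:
    """Siamese scoring: the SAME embedding function is applied to both inputs."""
    return cosine_similarity(embed(image_a), embed(image_b))


def toy_embed(image: Vector) -> Vector:
    """Hypothetical placeholder for the shared image embedding model."""
    return [float(v) for v in image]


score = same_room_score(toy_embed, [1.0, 2.0, 3.0], [1.0, 2.0, 2.5])
```

Because both branches share one embedding function, the score is symmetric in its two inputs. During training, a score like this would be compared against the binary same-room label.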

Figure 2: An illustration of the Siamese network for image similarity

Our analysis found that the volume of training data is critical to higher prediction accuracy. Doubling the training data volume typically leads to a reduction in error rate of ≈5% on average, with the effect being more significant in the earlier stages.

Figure 3: Correlation between data volume and accuracy

Unfortunately, it is very expensive to acquire high-quality training data because it requires human labeling. Therefore, we needed to find other ways to improve model accuracy with a limited amount of training data. We followed these steps to improve model accuracy:

Step 1: Pre-training. We started from a model pre-trained on ImageNet and trained it further on a large amount of host-provided data, which has lower label accuracy and covers only some of our class labels. This provided a baseline model for transfer learning in the following steps.

Step 2: Multi-task training. We fine-tuned the model from the previous step using both higher-accuracy training data for the target task (e.g., room-type classification) and an additional type of training data that had been labeled for another related task (e.g., object detection). This provided additional training data and produced several different models for later steps.

Step 3: Ensemble learning. We created an ensemble from several of the models from Step 2. We obtained a diverse set of models by training with different auxiliary tasks and by using different versions of ViT (e.g., ViT-base vs. ViT-large, and/or those consuming images of size 224 vs. 384), and we selected the best performers to construct the final ensemble model.

Step 4: Distillation. Although the ensemble model has higher accuracy than any individual model, it requires more computational resources and thus increases the latency and cost of our product. We trained a distilled model to imitate the behavior of the ensemble model; it has similar accuracy at a computational cost several times lower.

Our pretraining process involved harnessing the vast repository of Airbnb listing photos, comprising millions of images, to train a Vision Transformer (ViT) model. While leveraging the Airbnb listing photos for pretraining provides a substantial advantage, the dataset also has limitations. There were inaccuracies or mislabels in the human-labeled dataset, and they materially impacted the model’s ability to discern patterns effectively. Another notable limitation is that the pre-training dataset covers only 4 of the 16 room classifications.

Therefore, expanding the coverage of fine-tuning to include additional classes is essential. We developed a detailed and updated guideline and generated a human-labeled dataset covering all 16 room classifications. Iterative fine-tuning gradually encompassed all 16 room types, contributing to a more comprehensive and versatile model.

Acquiring high-quality human-labeled training data is a challenge due to the costly and time-consuming labeling process. Despite this, we had already accumulated a large repository of labeled data across various other tasks, including room-type classification, image quality prediction, same-room classification, category classification, and object detection. By fully utilizing this extensive and diversely labeled dataset, we significantly improved the prediction accuracy in our tasks. To achieve this, we implemented multi-task training that incorporates additional label classes from existing tasks, as demonstrated in Figure 4. Each learner is a vision transformer, and in addition to predicting a single set of labels, we allowed different learners to learn other label types, such as amenities and ImageNet21k labels, which further boosts overall performance as shown in Table 1.
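One way to picture the multi-task objective is as a weighted sum of per-task losses over the outputs of a shared backbone. The sketch below assumes single-label cross-entropy for every task and hypothetical task names and weights. The post does not state how the task losses were actually combined, so treat this as an illustration only.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def multi_task_loss(task_logits: Dict[str, List[float]],
                    task_labels: Dict[str, int],
                    task_weights: Dict[str, float]) -> float:
    """Weighted sum of per-task losses for one example.

    Tasks without a label for this example are skipped, so data labeled
    for only one task can still contribute to training.
    """
    total = 0.0
    for task, logits in task_logits.items():
        if task not in task_labels:
            continue
        total += task_weights.get(task, 1.0) * cross_entropy(logits, task_labels[task])
    return total


# Hypothetical example: a room-type head plus an auxiliary amenity head.
loss = multi_task_loss(
    task_logits={"room_type": [2.0, 0.1, -1.0], "amenity": [0.3, 0.3]},
    task_labels={"room_type": 0, "amenity": 1},
    task_weights={"room_type": 1.0, "amenity": 0.5},
)
```

Skipping unlabeled tasks is the design choice that lets datasets labeled for different purposes be mixed in one training run.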

Figure 4: Multi-task learning illustration

Ensemble learning is a powerful technique in machine learning that leverages diverse models with similar accuracies to achieve better accuracy and generalization.

We applied ensemble learning to diverse models with different architectures, model sizes, and auxiliary tasks such as amenities and ImageNet21k class predictions. Upon aggregating the predictions of the individual models, we observed a notable increase in overall accuracy compared to any single model. We credit the improvement to the ensemble’s ability to compensate for the misclassifications and inaccuracies of individual models, leading to more accurate predictions despite the limited human-labeled training data.
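As a rough illustration of aggregation, the sketch below averages the class probabilities of several models and picks the most probable class. Simple unweighted averaging is an assumption made for this example; the post does not specify which aggregation rule was used.

```python
from typing import List, Tuple


def ensemble_predict(model_probs: List[List[float]]) -> Tuple[int, List[float]]:
    """Average per-class probabilities across models and return (class, averages)."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    avg = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best, avg


# Three hypothetical models scoring two classes; one of them disagrees.
predicted_class, averaged = ensemble_predict([
    [0.6, 0.4],
    [0.2, 0.8],
    [0.3, 0.7],
])
```

In this toy case, the first model alone would pick class 0, but the averaged distribution favors class 1, which is how an ensemble can outvote an individual model's mistake.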

While ensemble learning offers substantial gains in accuracy, it requires more computational resources because multiple large models are involved in each inference task. To prioritize model efficiency without compromising performance, we turned to knowledge distillation, a technique centered on transferring knowledge from a sophisticated ensemble of models to a more compact single model.

Our distillation process transfers the knowledge encoded in both the hard targets and the soft targets of a complex ensemble to a smaller and simpler model. Hard targets are ground-truth labels, while soft targets are the ensemble’s probabilistic predictions, which enable the smaller model to capture the nuanced decision boundaries learned by the ensemble. The overall training objective is a weighted combination of two losses: the first is the cross-entropy loss based on the hard targets, and the second is the Kullback-Leibler divergence between the soft targets from the ensemble and the predictions of the student model. A distillation coefficient determines the weight assigned to the distillation loss.
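The objective described above can be sketched as follows. The exact weighting is an assumption: this example uses the common form (1 - alpha) * cross-entropy + alpha * KL, where `alpha` plays the role of the distillation coefficient. It also omits temperature scaling, which the post does not mention.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      hard_label: int,
                      teacher_probs: List[float],
                      alpha: float) -> float:
    """Weighted sum of hard-target cross-entropy and soft-target KL divergence.

    alpha is the (assumed) distillation coefficient in [0, 1]:
    0 trains on ground-truth labels only, 1 on the ensemble's outputs only.
    """
    student_probs = softmax(student_logits)
    # Cross-entropy against the ground-truth (hard) label.
    ce = -math.log(student_probs[hard_label])
    # KL(teacher || student); terms with zero teacher probability contribute 0.
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_probs, student_probs) if t > 0.0)
    return (1.0 - alpha) * ce + alpha * kl


# Hypothetical example: the ensemble is fairly confident in class 0.
loss = distillation_loss([1.5, 0.2, -0.5], hard_label=0,
                         teacher_probs=[0.8, 0.15, 0.05], alpha=0.5)
```

The KL term is zero exactly when the student reproduces the ensemble's distribution, which is what pushes the student toward the ensemble's decision boundaries rather than only its top-1 answer.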

Remarkably, our distilled model achieved performance metrics on par with the ensemble models, despite its significantly reduced inference time and resource requirements. This result demonstrates the efficacy of knowledge distillation in preserving the ensemble’s collective intelligence within a more streamlined model.

As part of the preparations for the launch of our end-to-end Photo Tour, we employed a rigorous evaluation process called “Golden Evaluation”, which mimics the actual user experience by calculating the minimum number of changes required to make the Photo Tour generated by our model identical to the human-labeled ground truth (i.e., the Golden Evaluation). In contrast to training data that is evenly distributed across classes, the Golden Evaluation operates at the Airbnb listing level, aiming to replicate the user’s perspective. We sampled listings, each containing an average of 25–30 photos, and measured accuracy by the minimum number of corrections required to make assignments consistent with human labels. These corrections refer to changes in room assignment, where a photo’s initial room prediction is changed to match the consensus room label provided by multiple human labelers. For example, if a photo of bedroom 1 is falsely assigned to the living room, one correction is needed to move it from the living room to bedroom 1.

There are photos that cannot be properly assigned to a named space. We classified miscellaneous photos, including close-up shots, images containing humans or animals, and photos of nearby shopping areas, restaurants, and parks, into the category labeled “Others”. Additionally, if a photo shows an empty space in a room such that we cannot determine its room location, we are allowed to designate it as “Unassigned”; such photos do not count in the accuracy calculation. This scenario occurs occasionally (as shown in Table 3) and is primarily used to let users decide in the most ambiguous cases. This evaluation served as the final launch criterion. Ultimately, we reduced the error rate to 5.28%, passing the internal evaluation standard at Airbnb, and Photo Tour was launched as a showcase feature in the November 2023 product release.
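A simplified version of this correction count for a single listing might look like the sketch below. It assumes that predicted and human-labeled room names are already aligned (for example, that "Bedroom 1" refers to the same physical room on both sides) and that photos the model marks "Unassigned" are excluded from the denominator. The function name and data layout are hypothetical.

```python
from typing import Dict

UNASSIGNED = "Unassigned"


def golden_error_rate(predicted: Dict[str, str], golden: Dict[str, str]) -> float:
    """Fraction of counted photos that need one correction to match the golden label."""
    corrections = 0
    counted = 0
    for photo_id, gold_room in golden.items():
        predicted_room = predicted.get(photo_id)
        if predicted_room == UNASSIGNED:
            continue  # left for the user to decide; excluded from the metric
        counted += 1
        if predicted_room != gold_room:
            corrections += 1  # one move fixes one misassigned photo
    return corrections / counted if counted else 0.0


# Hypothetical listing with four photos.
rate = golden_error_rate(
    predicted={"p1": "Bedroom 1", "p2": "Living room", "p3": "Unassigned", "p4": "Kitchen"},
    golden={"p1": "Bedroom 1", "p2": "Bedroom 1", "p3": "Kitchen", "p4": "Kitchen"},
)
```

Here photo p2 needs one move (living room to bedroom 1) and p3 is skipped, so one of the three counted photos requires a correction. A fuller implementation would also need to match predicted room clusters to labeled rooms before counting.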

Our exploration of using Vision Transformers to improve our photo tour product has been successful and rewarding. By incorporating pretraining, multi-task learning, ensemble learning, and knowledge distillation, we have significantly enhanced model accuracy. Pretraining provided a strong foundation, while multi-task learning enriched the model’s ability to interpret diverse visuals. Ensemble learning combined model strengths for robust predictions, and knowledge distillation enabled efficient deployment without sacrificing accuracy.

The AI-powered photo tour was launched as part of Airbnb’s 2023 Winter Release. Since then, we have been diligently monitoring the performance of this product and continue to refine our models for an even more seamless user experience.

We would like to thank everyone involved in the project. Special thanks to the entire Airbnb user, listing, and platform team for their relentless efforts in developing and launching the product and ensuring its continued excellence. Additionally, we extend our gratitude to the Airbnb Machine Learning Infra team for their crucial support in building the robust infrastructure that photo tour relies on.

If this type of work interests you, check out some of our related roles!