Our journey in making use of embedding-based retrieval methods to construct an correct and scalable candidate retrieval system for Airbnb Houses search
Authors: Mustafa (Moose) Abdool, Soumyadip Banerjee, Karen Ouyang, Do-Kyum Kim, Moutupsi Paul, Xiaowei Liu, Bin Xu, Tracy Yu, Hui Gao, Yangbo Zhu, Huiji Gao, Liwei He, Sanjeev Katariya
Search performs an important function in serving to Airbnb company discover the proper keep. The objective of Airbnb Search is to floor essentially the most related listings for every consumer’s question — however with thousands and thousands of accessible properties, that’s no simple activity. It’s particularly tough when searches embrace giant geographic areas (like California or France) or high-demand locations (like Paris or London). Latest improvements — similar to versatile date search, which permits company to discover stays with out mounted check-in and check-out dates — have added one more layer of complexity to rating and discovering the appropriate outcomes.
To deal with these challenges, we’d like a system that may retrieve related properties whereas additionally being scalable sufficient (when it comes to latency and compute) to deal with queries with a big candidate depend. On this weblog put up, we share our journey in constructing Airbnb’s first-ever Embedding-Based mostly Retrieval (EBR) search system. The objective of this method is to slender down the preliminary set of eligible properties right into a smaller pool, which might then be scored by extra compute-intensive machine studying fashions later within the search rating course of.
Determine 1: The final phases and scale for the assorted forms of rating fashions utilized in Airbnb Search
We’ll discover three key challenges in constructing this EBR system: (1) developing coaching information, (2) designing the mannequin structure, and (3) growing a web-based serving technique utilizing Approximate Nearest Neighbor (ANN) options.
Step one in constructing our EBR system was coaching a machine studying mannequin to map each properties and de-identified search queries into numerical vectors. To realize this, we constructed a coaching information pipeline (Determine 3) that leveraged contrastive studying — a method that includes figuring out pairs of positive- and negative-labeled properties for a given question. Throughout coaching, the mannequin learns to map a question, a optimistic dwelling, and a adverse dwelling right into a numerical vector, such that the similarity between the question and the optimistic dwelling is way larger than the similarity between the question and the adverse dwelling.
To assemble these pairs, we devised a sampling methodology based mostly on consumer journeys. This was an necessary design resolution, since customers on Airbnb usually bear a multi-stage search journey. Information exhibits that earlier than making a last reserving, customers are likely to carry out a number of searches and take varied actions — similar to clicking into a house’s particulars, studying critiques, or including a house to a wishlist. As such, it was essential to develop a method that captures this whole multi-stage journey and accounts for the varied forms of listings a consumer would possibly discover.
Diving deeper, we first grouped all historic queries of customers who made bookings, utilizing key question parameters similar to location, variety of company, and size of keep — our definition of a “journey.” For every journey, we analyzed all searches carried out by the consumer, with the ultimate booked itemizing because the optimistic label. To assemble (optimistic, adverse) pairs, we paired this booked itemizing with different properties the consumer had seen however not booked. Unfavorable labels had been chosen from properties the consumer encountered in search outcomes, together with these they’d interacted with extra intentfully — similar to by wishlisting — however finally didn’t e book. This alternative of adverse labels was key: Randomly sampling properties made the issue too simple and resulted in poor mannequin efficiency.
Determine 2: Instance of developing (optimistic, adverse) pairs for a given consumer journey. The booked house is at all times handled as a optimistic. Negatives are chosen from properties that appeared within the search outcome (and had been probably interacted with) however that the consumer didn’t find yourself reserving.
Determine 3: Instance of total information pipeline used to assemble coaching information for the EBR mannequin.
The mannequin structure adopted a conventional two-tower community design. One tower (the itemizing tower) processes options concerning the dwelling itemizing itself — similar to historic engagement, facilities, and visitor capability. The opposite tower (the question tower) processes options associated to the search question — such because the geographic search location, variety of company, and size of keep. Collectively, these towers generate the embeddings for dwelling listings and search queries, respectively.
A key design resolution right here was selecting options such that the itemizing tower could possibly be computed offline every day. This enabled us to pre-compute the house embeddings in a day by day batch job, considerably lowering on-line latency, since solely the question tower needed to be evaluated in real-time for incoming search requests.
Determine 4: Two-tower structure as used within the EBR mannequin. Word that the itemizing tower is computed offline day by day for all properties.
The ultimate step in constructing our EBR system was selecting the infrastructure for on-line serving. We explored quite a few approximate nearest neighbor (ANN) options and narrowed them down to 2 essential candidates: inverted file index (IVF) and hierarchical navigable small worlds (HNSW). Whereas HNSW carried out barely higher when it comes to analysis metrics — utilizing recall as our essential analysis metric — we finally discovered that IVF supplied one of the best trade-off between pace and efficiency.
The core purpose for that is the excessive quantity of real-time updates per second for Airbnb dwelling listings, as pricing and availability information is regularly up to date. This brought on the reminiscence footprint of the HNSW index to develop too giant. As well as, most Airbnb searches embrace filters, particularly geographic filters. We discovered that parallel retrieval with HNSW alongside filters resulted in poor latency efficiency.
In distinction, the IVF answer, the place listings are clustered beforehand, solely required storing cluster centroids and cluster assignments inside our search index. At serving time, we merely retrieve listings from the highest clusters by treating the cluster assignments as a normal search filter, making integration with our present search system fairly easy.
Determine 5: Total serving circulation utilizing IVF. Houses are clustered beforehand and, throughout on-line serving, properties are retrieved from the closest clusters to the question embedding.
On this method, our alternative of similarity operate within the EBR mannequin itself ended up having attention-grabbing implications. We explored each dot product and Euclidean distance; whereas each carried out equally from a mannequin perspective, utilizing Euclidean distance produced way more balanced clusters on common. This was a key perception, as the standard of IVF retrieval is extremely delicate to cluster dimension uniformity: If one cluster had too many properties, it might vastly cut back the discriminative energy of our retrieval system.
We hypothesize that this imbalance arises with dot product similarity as a result of it inherently solely considers the course of characteristic vectors whereas ignoring their magnitudes — whereas a lot of our underlying options are based mostly on historic counts, making magnitude an necessary issue.
Determine 6: Instance of the distribution of cluster sizes when utilizing dot product vs. Euclidean distance as a similarity measure. We discovered that Euclidean distance produced way more balanced cluster sizes.
The EBR system described on this put up was absolutely launched in each Search and E-mail Advertising and marketing manufacturing and led to a statistically-significant acquire in total bookings when A/B examined. Notably, the bookings carry from this new retrieval system was on par with a few of the largest machine studying enhancements to our search rating up to now two years.
The important thing enchancment over the baseline was that our EBR system successfully included question context, permitting properties to be ranked extra precisely throughout retrieval. This finally helped us show extra related outcomes to customers, particularly for queries with a excessive variety of eligible outcomes.
We want to particularly thank the complete Search and Information Infrastructure & ML Infrastructure org (led by Yi Li) and Advertising and marketing Expertise org (led by Michael Kinoti) for his or her nice collaborations all through this venture!