by: Charles Covey-Brandt
Airbnb recently completed our first large-scale, LLM-driven code migration, updating nearly 3.5K React component test files from Enzyme to use React Testing Library (RTL) instead. We’d originally estimated this would take 1.5 years of engineering time to do by hand, but, using a combination of frontier models and robust automation, we finished the entire migration in just 6 weeks.
In this blog post, we’ll highlight the unique challenges we faced migrating from Enzyme to RTL, how LLMs excel at solving this particular type of challenge, and how we structured our migration tooling to run an LLM-driven migration at scale.
In 2020, Airbnb adopted React Testing Library (RTL) for all new React component test development, marking our first steps away from Enzyme. Although Enzyme had served us well since 2015, it was designed for earlier versions of React, and the framework’s deep access to component internals no longer aligned with modern React testing practices.
However, because of the fundamental differences between these frameworks, we couldn’t simply swap one out for the other (read more about the differences here). We also couldn’t just delete the Enzyme files, as analysis showed this would create significant gaps in our code coverage. To complete this migration, we needed an automated way to refactor test files from Enzyme to RTL while preserving the intent of the original tests and their code coverage.
In mid-2023, an Airbnb hackathon team demonstrated that large language models could successfully convert hundreds of Enzyme files to RTL in just a few days.
Building on this promising result, in 2024 we developed a scalable pipeline for an LLM-driven migration. We broke the migration into discrete, per-file steps that we could parallelize, added configurable retry loops, and significantly expanded our prompts with additional context. Finally, we performed breadth-first prompt tuning for the long tail of complex files.
We started by breaking the migration down into a series of automated validation and refactor steps. Think of it like a production pipeline: each file moves through stages of validation, and when a check fails, we bring in the LLM to fix it.
We modeled this flow like a state machine, moving the file to the next state only after validation at the previous state passed:
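A minimal sketch of what a per-file state machine like this might look like (the step names and the validate hook here are illustrative assumptions, not our exact pipeline):

// Illustrative migration states; the real pipeline’s steps may differ.
type MigrationStep = 'refactor-to-rtl' | 'jest' | 'eslint' | 'tsc' | 'done';

const STEP_ORDER: MigrationStep[] = ['refactor-to-rtl', 'jest', 'eslint', 'tsc', 'done'];

interface FileState {
  path: string;
  step: MigrationStep;
}

// Advance a file to the next state only when validation for its current
// step passes; otherwise it stays put and becomes a candidate for an LLM fix.
async function advance(
  file: FileState,
  validate: (f: FileState) => Promise<boolean>,
): Promise<FileState> {
  if (file.step !== 'done' && (await validate(file))) {
    return { ...file, step: STEP_ORDER[STEP_ORDER.indexOf(file.step) + 1] };
  }
  return file;
}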
This step-based approach provided a solid foundation for our automation pipeline. It enabled us to track progress, improve failure rates for specific steps, and rerun files or steps when needed. The step-based approach also made it straightforward to run migrations on hundreds of files concurrently, which was important both for quickly migrating simple files and for chipping away at the long tail of files later in the migration.
Early on in the migration, we experimented with different prompt engineering strategies to improve our per-file migration success rate. However, building on the stepped approach, we found the most effective route to better outcomes was simply brute force: retry steps multiple times until they passed or we reached a limit. We updated our steps to use dynamic prompts for each retry, feeding the validation errors and the latest version of the file to the LLM, and built a loop runner that ran each step up to a configurable number of attempts.
With this simple retry loop, we found we could successfully migrate a large portion of our simple-to-medium complexity test files, with some finishing successfully after a few retries, and most within 10 attempts.
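In outline, that loop looked something like the sketch below (simplified; validateStep and runLLM are hypothetical stand-ins for our real helpers):

import fs from 'node:fs/promises';

// Hypothetical helpers: validateStep runs the current step’s check
// (jest, eslint, tsc, ...) and runLLM returns the model’s rewritten file.
declare function validateStep(
  path: string,
  step: string,
): Promise<{ ok: boolean; errors: string[] }>;
declare function runLLM(prompt: string): Promise<string>;

// Re-prompt the LLM with the latest file contents and validation errors
// until the step passes or we hit the configured attempt limit.
async function runStepWithRetries(
  filePath: string,
  step: string,
  maxAttempts: number,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await validateStep(filePath, step);
    if (result.ok) return true;

    const prompt = [
      `Fix this file so the "${step}" check passes:`,
      `ERRORS:\n${result.errors.join('\n')}`,
      `FILE:\n${await fs.readFile(filePath, 'utf8')}`,
    ].join('\n\n');

    await fs.writeFile(filePath, await runLLM(prompt));
  }
  return false;
}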
For test files up to a certain complexity, simply increasing our retry attempts worked well. However, to handle files with intricate test state setups or excessive indirection, we found the best approach was to push as much relevant context as possible into our prompts.
By the end of the migration, our prompts had expanded to anywhere between 40,000 and 100,000 tokens, pulling in as many as 50 related files, a whole host of manually written few-shot examples, as well as examples of existing, well-written, passing test files from within the same project.
Each prompt included:
- The source code of the component under test
- The test file we were migrating
- Validation failures for the step
- Related tests from the same directory (maintaining team-specific patterns)
- General migration guidelines and common solutions
Here’s how that looked in practice (significantly trimmed down for readability):
// Code example shows a trimmed down version of a prompt,
// including the raw source code from related files, imports,
// examples, the component source itself, and the test file to migrate.
const prompt = [
  'Convert this Enzyme test to React Testing Library:',
  `SIBLING TESTS:\n${siblingTestFilesSourceCode}`,
  `RTL EXAMPLES:\n${reactTestingLibraryExamples}`,
  `IMPORTS:\n${nearestImportSourceCode}`,
  `COMPONENT SOURCE:\n${componentFileSourceCode}`,
  `TEST TO MIGRATE:\n${testFileSourceCode}`,
].join('\n\n');
This rich-context approach proved highly effective for these more complex files: the LLM could better understand team-specific patterns, common testing approaches, and the overall architecture of the codebase.
We should note that, although we did some prompt engineering at this step, the main success driver we observed was choosing the right related files (finding nearby files, picking good example files from the same project, filtering the dependencies down to files relevant to the component, and so on), rather than getting the prompt wording perfect.
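As a rough illustration, here is one way the sibling-test lookup could work (the discovery heuristic is an assumption for illustration, not our exact logic):

import path from 'node:path';
import fs from 'node:fs/promises';

// Assumed heuristic: test files in the same directory tend to share
// team-specific patterns, so they make useful in-context examples.
async function findSiblingTests(testFilePath: string, limit = 5): Promise<string[]> {
  const dir = path.dirname(testFilePath);
  const entries = await fs.readdir(dir);
  return entries
    .filter((name) => /\.test\.(js|jsx|ts|tsx)$/.test(name))
    .map((name) => path.join(dir, name))
    .filter((candidate) => candidate !== testFilePath)
    .slice(0, limit);
}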
After building and testing our migration scripts with retries and rich context, we kicked off our first bulk run and successfully migrated 75% of our target files in just 4 hours.
That 75% success rate was really exciting to reach, but it still left us with nearly 900 files failing our step-based validation criteria. To tackle this long tail, we needed a systematic way to understand where the remaining files were getting stuck and to improve our migration scripts to address those issues. We also wanted to work breadth-first, aggressively chipping away at the remaining files without getting stuck on the most difficult migration cases.
To do this, we built two features into our migration tooling.
First, we built a simple system to give us visibility into common issues our scripts were facing by stamping files with an automatically generated comment recording the status of each migration step. Here’s what that code comment looked like:
// MIGRATION STATUS: {"enzyme":"done","jest":{"passed":8,"failed":2,"total":10,"skipped":0,"successRate":80},"eslint":"pending","tsc":"pending"}
And second, we added the ability to easily re-run single files or path patterns, filtered by the specific step they were stuck on:
$ llm-bulk-migration --step=fix-jest --match=project-abc/**
Using these two features, we could quickly run a feedback loop to improve our prompts and tooling:
- Run all remaining failing files to find common issues the LLM is getting stuck on (see the sketch after this list)
- Pick a sample of files (5 to 10) that exemplify a common issue
- Update our prompts and scripts to address that issue
- Re-run against the sample of failing files to validate our fix
- Repeat by running against all remaining files again
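A minimal sketch of that first step, aggregating the stamped MIGRATION STATUS comments to see which step each failing file is stuck on (the comment format matches the example above; the parsing code itself is illustrative):

import fs from 'node:fs/promises';

// Illustrative: tally which step each file is stuck on by reading the
// MIGRATION STATUS comment stamped at the top of each test file.
async function tallyStuckSteps(files: string[]): Promise<Map<string, number>> {
  const counts = new Map<string, number>();
  for (const file of files) {
    const source = await fs.readFile(file, 'utf8');
    const match = source.match(/\/\/ MIGRATION STATUS: (\{.*\})/);
    if (!match) continue;
    const status = JSON.parse(match[1]) as Record<string, unknown>;
    // A file is "stuck" at the first step that is still pending or has failures.
    const stuck = Object.entries(status).find(([, value]) =>
      value === 'pending' ||
      (typeof value === 'object' && value !== null &&
        ((value as { failed?: number }).failed ?? 0) > 0),
    );
    if (stuck) counts.set(stuck[0], (counts.get(stuck[0]) ?? 0) + 1);
  }
  return counts;
}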
After running this “sample, tune, sweep” loop for four days, we had pushed our completed files from 75% to 97% of the total, with just under 100 files remaining. By this point, we had retried many of these long-tail files anywhere from 50 to 100 times, and it seemed we were hitting a ceiling on what we could fix through automation. Rather than invest in more tuning, we opted to fix the remaining files by hand, working from the baseline (failing) refactors to reduce the time needed to get these files over the finish line.
With the validation and refactor pipeline, retry loops, and expanded context in place, we were able to automatically migrate 75% of our target files in 4 hours.
After 4 days of prompt and script refinement using the “sample, tune, sweep” strategy, we reached 97% of the 3.5K original Enzyme files.
And for the remaining 3% of files that didn’t complete through automation, our scripts provided a strong baseline for manual intervention, allowing us to finish migrating those final files in another week of work.
Most importantly, we were able to replace Enzyme while maintaining the original test intent and our overall code coverage. And even with high retry counts on the long tail of the migration, the total cost, including LLM API usage and six weeks of engineering time, proved far more efficient than our original estimate for a manual migration.
This migration underscores the power of LLMs for large-scale code transformation. We plan to extend this approach, develop more sophisticated migration tools, and explore new applications of LLM-powered automation to enhance developer productivity.
Want to help shape the future of developer tools? We’re hiring engineers who love solving complex problems at scale. Check out our careers page to learn more.
All product names, logos, and brands are property of their respective owners. All company, product, and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.