Written by: Michael Bachand, Xianwen Chen
At Airbnb, we run a comprehensive suite of continuous integration (CI) jobs before each iOS code change is merged. These jobs ensure that the main branch remains stable by executing critical developer workflows like building the iOS application and running tests. We also schedule jobs that perform periodic tasks like reporting metrics and uploading artifacts.
Many of our iOS CI jobs execute on Macs, which lets them run developer tools provided by Apple. CI jobs for all other platforms at Airbnb execute in containers on Amazon EC2 Linux instances. To fulfill the macOS requirement of iOS CI jobs, we historically maintained separate CI infrastructure outside of AWS specifically for iOS development. The introduction of Macs to AWS provided an opportunity for us to rethink our approach to iOS CI.
We designed the next iteration of our iOS CI system in late 2021, finished the migration to the new system in mid-2022, and polished the system through the end of 2022. CI for iOS and all other platforms at Airbnb already leveraged Buildkite for dispatching jobs. Now we also deploy iOS CI infrastructure to AWS using Terraform, which helps align CI for iOS with CI for other platforms at Airbnb.
In this article, we are excited to share with you details of the flexible and easy-to-maintain iOS CI system that we’ve implemented with Amazon EC2 Mac instances.
The Challenges with Running CI on Physical Macs
Historically, we ran Airbnb iOS CI on physical Macs. We enjoyed the speed of running CI without virtualization, but we paid a substantial maintenance cost to run CI jobs directly on physical hardware. An iOS infrastructure engineer logged into each of over 300 machines individually to perform administrative tasks like enrolling the Mac in our MDM (Mobile Device Management) tool and upgrading macOS. Manual maintenance requirements limited the scalability of the fleet and consumed engineer time that could be better spent on higher-value projects.
Our old CI machines were rarely restarted and too often drifted into a bad state. When this occurred, the best-case scenario was that an engineer could log into the machine, diagnose what configuration drift was causing issues, and manually bring the machine back to a good state. More commonly, we shut down the corrupted machine so that it could no longer accept new CI jobs. Periodically, we asked the vendor who managed our physical Macs to restore the corrupted machines to a clean installation of macOS. When the machines eventually came back online, we manually re-enrolled each machine in MDM to bring our fleet back to its full capacity.
Updating to a new version of Xcode was quite error-prone as well. We strive to roll out new Xcode versions regularly since many iOS engineers at Airbnb follow Swift and Xcode releases closely and are eager to adopt new language features and IDE improvements. However, the fixed capacity of our Mac fleet made it difficult for us to verify iOS CI jobs thoroughly against new versions; any machine allocated to testing a new version of Xcode could no longer accept CI jobs from the previous Xcode version. The risk of tackling each Xcode update was increased by the fact that rolling back to a previous version of Xcode across our fleet was not practical.
Upgrading CI with Custom macOS AMIs
When evaluating AWS, we were excited by the possibility of launching instances from Amazon Machine Images (AMIs). An AMI is a snapshot of an instance’s state, including its file system contents and other metadata. Amazon provides base AMIs for each macOS version and allows customers to create their own AMIs from running instances.
AMIs allow us to add new instances to our fleet without human intervention. An EC2 Mac bare-metal instance launched from a properly configured AMI is immediately ready to accept new work after initialization. When updating macOS, we no longer need to log into every machine in our fleet. Instead, we log into a single instance launched from the Amazon base AMI for the new macOS version. After performing a handful of manual configuration steps, like enabling automatic login, we create an Airbnb base AMI from that instance.
Initially, we powered our EC2 Mac fleet with manually created AMIs. An engineer would configure a single instance and create an AMI from that instance’s state. Then we could launch any number of additional instances from that AMI. This was a major improvement over managing physical machines since we could spin up an entire fleet of identical instances after configuring only a single instance successfully.
Now, we build AMIs using Packer. Packer programmatically launches and configures an EC2 instance using a template defined in the HashiCorp configuration language (HCL). Packer then creates an AMI from the configured EC2 instance. A Ruby wrapper script invokes Packer consistently and performs helpful validations like checking that the user has assumed the proper AWS role. We check the HCL template code into source control and all changes to our Packer template and companion scripts are made via GitHub pull requests.
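To make the wrapper idea concrete, here is a minimal sketch of a script that checks the assumed AWS role before assembling a Packer invocation. Our real wrapper is written in Ruby; this illustration uses Python, and the role name, template file name, and variable names are hypothetical. It only builds the command line and does not call AWS or Packer. In practice the caller ARN would come from something like `aws sts get-caller-identity`.

```python
from typing import Dict, List, Optional

# Hypothetical role name; the article does not say which role is required.
REQUIRED_ROLE = "ios-ci-ami-builder"


def role_from_arn(caller_arn: str) -> Optional[str]:
    """Return the role name from an STS assumed-role ARN, or None.

    Assumed-role ARNs have the shape
    arn:aws:sts::<account-id>:assumed-role/<role-name>/<session-name>
    """
    parts = caller_arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "sts":
        return None
    resource = parts[5].split("/")
    if len(resource) != 3 or resource[0] != "assumed-role":
        return None
    return resource[1]


def packer_command(template: str, variables: Dict[str, str]) -> List[str]:
    """Build a `packer build` argument list with -var key=value pairs."""
    cmd = ["packer", "build"]
    for key in sorted(variables):  # sorted for a reproducible command line
        cmd += ["-var", f"{key}={variables[key]}"]
    cmd.append(template)
    return cmd


def plan_build(caller_arn: str, template: str, variables: Dict[str, str]) -> List[str]:
    """Validate the caller's role, then return the command to run."""
    role = role_from_arn(caller_arn)
    if role != REQUIRED_ROLE:
        raise PermissionError(f"expected role {REQUIRED_ROLE!r}, got {role!r}")
    return packer_command(template, variables)
```

A caller would pass the returned list to `subprocess.run` once validation succeeds. Failing before Packer starts avoids launching an instance that the build could not finish configuring.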
We initially ran Packer from developer laptops, but the laptop needed to be awake and online for the duration of the Packer build. Eventually, we created a dedicated pipeline to build AMIs in the cloud. A developer can trigger a new build on this pipeline with a couple of clicks. A successful build will produce freshly baked and verified AMIs for both the x86 and Arm (Apple Silicon) CPU architectures within a few hours.
Defining CI Environments in Terraform
Our new CI system leveraging these AMIs consists of many environments, each of which can be managed independently. The central AWS component of each CI environment is an Auto Scaling group, which is responsible for launching the EC2 Mac instances. The number of instances in the Auto Scaling group is determined by the desired capacity property on the group and is bounded by min and max size properties.
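The relationship between the three capacity properties can be sketched as a small toy model. This is not AWS code or our Terraform module. It only illustrates the invariant that the desired capacity must lie between the min and max sizes, and the class and function names are made up for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AutoScalingGroup:
    """Toy model of the capacity properties on an Auto Scaling group."""

    min_size: int
    max_size: int
    desired_capacity: int

    def __post_init__(self) -> None:
        # The bounds themselves must be sensible.
        if not 0 <= self.min_size <= self.max_size:
            raise ValueError("need 0 <= min_size <= max_size")
        # The desired capacity must sit inside the bounds.
        if not self.min_size <= self.desired_capacity <= self.max_size:
            raise ValueError("desired_capacity must lie within [min_size, max_size]")


def with_desired_capacity(group: AutoScalingGroup, desired: int) -> AutoScalingGroup:
    """Return a copy with a new desired capacity; raises if out of bounds."""
    return replace(group, desired_capacity=desired)
```

For example, a group with `min_size=0` and `max_size=10` accepts a desired capacity of 5, while a request for 11 is rejected.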
An Auto Scaling group creates new instances using a launch template. The launch template specifies the configuration of each instance, including the AMI, and allows a “user data” script to run when the instance is launched. Launch templates can be versioned, and each Auto Scaling group is configured to launch instances from a specific version of its launch template.
Although the introduction of environments has made our CI topology more complex, we find that complexity manageable when our infrastructure is defined in code. All of our AWS infrastructure for iOS CI is specified in Terraform code that we check into source control. Each time we merge a pull request related to iOS CI, Terraform Enterprise will automatically apply our changes to our AWS account. We have defined a Terraform module that we can call whenever we want to instantiate a new CI environment.
An internal scaling service manages the desired capacity of each environment’s Auto Scaling group. This service, a modified fork of buildkite-agent-scaler, increases the desired capacity of an environment’s Auto Scaling group as CI job volume for that environment increases. We specify a maximum number of instances for each CI environment in part because On-Demand EC2 Mac Dedicated Hosts currently have a minimum host allocation and billing duration of 24 hours.
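A scale-up decision of this kind might look roughly like the sketch below. It is not the logic of buildkite-agent-scaler or of our fork. The inputs (a count of pending jobs and a number of agents per instance) and the function name are assumptions chosen for illustration. The sketch only ever raises the desired capacity, and it never exceeds the environment's maximum.

```python
import math


def next_desired_capacity(
    pending_jobs: int,
    current_desired: int,
    max_size: int,
    agents_per_instance: int = 1,  # assumption: how many jobs one instance runs at once
) -> int:
    """Return the desired capacity after accounting for pending jobs.

    Capacity only grows here; scale-in is handled elsewhere
    (for example, by instances terminating themselves).
    """
    if agents_per_instance < 1:
        raise ValueError("agents_per_instance must be at least 1")
    needed = math.ceil(pending_jobs / agents_per_instance)
    return min(max_size, max(current_desired, needed))
```

Capping at `max_size` matters for cost: with a 24-hour minimum allocation per Dedicated Host, a brief spike that launched many extra hosts would be billed for a full day.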
Each CI environment has a unique Buildkite queue name. Individual CI jobs can target instances in a specific environment by specifying the corresponding queue name. Jobs will fall back to the default CI environment when no queue name is explicitly specified.
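The fallback rule can be sketched in a few lines. The job is modeled as a plain dictionary whose `agents.queue` key mirrors the way Buildkite steps target a queue. The default queue name below is hypothetical, and in a real setup this routing is configured in the pipeline rather than implemented in application code.

```python
from typing import Any, Dict

# Hypothetical name for the default CI environment's queue.
DEFAULT_QUEUE = "ios-default"


def resolve_queue(job: Dict[str, Any]) -> str:
    """Return the queue a job targets, falling back to the default."""
    agents = job.get("agents") or {}
    return agents.get("queue") or DEFAULT_QUEUE
```

A job declared with `{"agents": {"queue": "ios-arm64-xcode-14.3"}}` resolves to that queue, while a job with no `agents` entry resolves to `DEFAULT_QUEUE`.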
Benefits of Our New iOS CI System
CI Environments Are Highly Flexible
With this new Terraform setup, we are able to support an arbitrary number of CI environments with minimal overhead. We create a new CI environment per CPU architecture and version of Xcode. We can even duplicate these environments across multiple versions of macOS when performing an operating system update across our fleet. We use dedicated staging environments to test CI jobs on instances launched from a new AMI before we roll out that AMI broadly.
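As a rough illustration of how environments multiply, the sketch below enumerates one environment name per combination of CPU architecture and Xcode version. The naming scheme and the function are hypothetical. In our system each environment is a call to a Terraform module, not a Python string.

```python
from itertools import product
from typing import Iterable, List


def environment_names(architectures: Iterable[str], xcode_versions: Iterable[str]) -> List[str]:
    """Return one (hypothetical) environment name per architecture and Xcode version."""
    return [
        f"ios-{arch}-xcode-{version}"
        for arch, version in product(architectures, xcode_versions)
    ]
```

Two architectures and two Xcode versions yield four environments. Duplicating them across two macOS versions during an operating system update would double that count again, which is why keeping the per-environment overhead low matters.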
When we are no longer regularly using a CI environment, we can specify a minimum capacity of zero when calling the Terraform module, which will set the same value on the underlying Auto Scaling group. Then the Auto Scaling group will only launch instances when its desired capacity is increased by the scaling service. In practice, we tend to delete older environments from our Terraform code. However, even once an environment has been wound down, reinstating that environment is as simple as reverting a couple of commits in Git and redeploying the scaling service.
Rotation of Instances Increases CI Consistency
To minimize the opportunity for EC2 instances to drift, we terminate all instances each night and replace them daily. This way, we can be confident that our CI fleet is in a known good state at the start of each day.
When an instance is terminated, the underlying Dedicated Host is scrubbed before a new instance can be launched on that host. We terminate instances at a time when CI demand is low to allow for the EC2 Mac scrubbing process to complete before we need to launch fresh instances on the same hosts. When an instance terminates itself overnight, it will decrement the desired capacity of the Auto Scaling group to which it belongs. As engineers start pushing commits the next day, the scaling service will increment the desired capacity on the appropriate Auto Scaling groups, causing new instances to be launched.
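The overnight bookkeeping can be modeled with simple arithmetic. This toy function is an illustration only, and its name and parameters are assumptions. It treats each self-terminating instance as lowering the desired capacity by one and simply stops at the group's minimum size.

```python
def capacity_after_overnight_termination(
    desired_capacity: int,
    terminating_instances: int,
    min_size: int = 0,
) -> int:
    """Return the desired capacity once instances have terminated themselves.

    Each terminating instance decrements the desired capacity by one.
    This toy model never goes below min_size.
    """
    if terminating_instances < 0:
        raise ValueError("terminating_instances cannot be negative")
    return max(min_size, desired_capacity - terminating_instances)
```

With a minimum size of zero, a group that ends the day at a desired capacity of 8 and terminates all 8 instances starts the next morning at 0, leaving it to the scaling service to raise capacity again as jobs arrive.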
When an instance does experience configuration drift, we can disconnect that instance from Buildkite with one click. The instance will remain running but will no longer accept new CI jobs. An engineer can log into the instance to investigate its state until the instance is eventually terminated at the end of the day. To keep overall CI capacity stable, we can manually add an additional instance to our fleet, or a replacement will be launched automatically if we terminate the instance early.
We Ship Xcode Versions More Quickly
We appreciate the new capabilities of our upgraded CI system. We can lease additional Dedicated Hosts from Amazon on demand to weather unexpected spikes in CI usage and to test software updates thoroughly. We roll out new AMIs gradually and can roll back painlessly if we encounter unexpected issues.
Together, these capabilities get Airbnb iOS developers access to Swift language features and Xcode IDE improvements more quickly. In fact, with the tailwind of our new CI system, we have seen the pace at which we update Xcode increase by over 20%. As of the time of writing, we have internally rolled out all available major and minor versions of Xcode 14 (14.0–14.3) as they have been released.
The Migration is Complete
Our new CI system ran over 10 million minutes of CI jobs in the last three months of 2022. After upgrading to EC2, we spend meaningfully fewer hours on maintenance despite a growing codebase and consistently high job volume. Our newfound ability to scale CI to meet the evolving needs of the Airbnb iOS community justifies the increased complexity of the rebuilt system.
After the migration to AWS, iOS CI benefits more from shared infrastructure that is already being used successfully within Airbnb. For example, the new iOS CI architecture enabled us to avoid implementing an iOS-specific solution for automatically scaling capacity. Instead, we leverage the aforementioned fork of buildkite-agent-scaler that Airbnb engineers had already converted to an internal Airbnb service complete with a dedicated deployment pipeline. Additionally, we used existing Terraform modules that are maintained by other teams to integrate with IAM and SSM.
We have found that EC2 Mac instances launched from custom AMIs provide many of the benefits of virtualization without the performance penalty of executing within a virtual machine. We consider AWS, Packer, and Terraform to be essential technologies for building a flexible CI system for large-scale iOS development in 2023.
Xianwen Chen, the technical lead of this project, designed the topology of the iOS CI system, implemented the design with Terraform, and later enabled creation of AMIs in the cloud. Michael Bachand built the initial version of our Packer tooling and used this tooling to create the first programmatically built AMIs capable of completing iOS CI jobs. Steven Hepting productionized our Packer tooling by adding support for Arm AMIs and evolving the Packer template so that all of Airbnb’s iOS CI jobs could run successfully on both CPU architectures.
We received invaluable support from numerous subject-matter experts at Airbnb who were very generous with their time. Many thanks to Brandon Kurtz for advising on content and voice through multiple revisions of this article.
If you are interested in joining us on our quest to make the best iOS app in the App Store, please see our careers page for open iOS roles.