Rebuilding Payment Orchestration at Airbnb

Introduction

Airbnb’s payment orchestration system is responsible for ensuring reliable money movement between hosts, guests, and Airbnb. In short, guests should be charged the right amount at the right time using their selected payment methods; hosts should be paid the right amount at the right time to their desired payout methods. For historical reasons, Airbnb’s billing data, payment APIs, payment orchestration, and user experiences were tightly coupled with the concept of a reservation for a stay. Unfortunately, this meant that a payment-related feature built for stays had to be rebuilt for other products (for example, Airbnb Experiences), and each implementation could have its own product-specific quirks. As you can imagine, this approach is neither scalable nor easy to maintain.

For several years, Airbnb has been migrating away from our monolithic Ruby on Rails application toward a service-oriented architecture (SOA). This migration has been discussed extensively in several Airbnb tech blog posts. We will gloss over some of the technical discussions common to those migrations and instead focus on some of the aspects that were unique to migrating our payments systems. While many teams at Airbnb chose to create a one-to-one replacement when migrating to SOA, the payments organization instead decided to use it as an opportunity to fundamentally redesign our services to provide a sound technical foundation for future growth. As a consequence of this decision, the migration process took longer to complete than a more straightforward one-to-one replacement.

Why Redesign?

Video: Airbnb CEO Brian Chesky tells a story about the origin of payments at Airbnb (https://www.youtube.com/watch?v=Mssx8PleeYc)

As Brian shared in the video above, support for on-platform payments has played a critical role in establishing trust among Airbnb’s hosts and guests. Airbnb has grown significantly since our first payment system was created over a decade ago and, with that growth, the scope and scale of payments at Airbnb have also grown and changed. Many of the original payment models were tied closely to reservations for a stay. This made sense in the early days of Airbnb as there was only one product, and the engineers working on payments at that time did an excellent job developing a solution that met the needs of guests and hosts. While these original payment models have proven extremely versatile and powerful, the tight coupling between stays and payments has led to increased complexity when adding new products like Experiences or features like the Resolution Center.

When planning for the SOA migration, Airbnb’s payments teams made a bold decision to fundamentally redesign the payments system. Our goal was to create a payment platform that would allow teams across Airbnb to quickly, easily, and safely integrate new features and products with payments. It’s not feasible to list all of the enhancements in a single blog post, so this post will focus on some design highlights affecting the new payment orchestration system: idempotency, platformization, and data immutability.

Idempotent Orchestration

As discussed in an earlier blog post, idempotency is a common technique to maintain consistency among distributed services. The new payment orchestration system was designed around Orpheus (the idempotency framework described in that post). Every major workflow is divided into a directed acyclic graph (DAG) of retryable idempotent steps, each with well-defined behavior. This allows the payment orchestration layer to maintain eventual consistency with other key services (such as the payment gateway layer and product fulfillment services). This approach has led to five 9s (99.999%) of consistency for payments.
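To make the idea of a DAG of retryable idempotent steps concrete, here is a minimal Python sketch. It is not Orpheus and does not reflect its API: the names `IdempotentWorkflow`, `TransientError`, and the three example steps are hypothetical, and an in-memory dictionary stands in for the durable store a real system would need.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class TransientError(Exception):
    """Hypothetical marker for a failure that is safe to retry (e.g., a timeout)."""


class IdempotentWorkflow:
    """Runs a DAG of steps, recording each result under an idempotency key."""

    def __init__(self, steps, deps):
        self.steps = steps    # step name -> callable(upstream_results) -> result
        self.deps = deps      # step name -> set of prerequisite step names
        self.completed = {}   # (idempotency_key, step name) -> recorded result
                              # (in-memory stand-in for a durable store)

    def run(self, idempotency_key, max_attempts=3):
        results = {}
        # static_order() yields every step after all of its prerequisites.
        for name in TopologicalSorter(self.deps).static_order():
            record = (idempotency_key, name)
            if record not in self.completed:
                self.completed[record] = self._attempt(name, results, max_attempts)
            # A replay with the same key reuses the recorded result.
            results[name] = self.completed[record]
        return results

    def _attempt(self, name, results, max_attempts):
        for attempt in range(1, max_attempts + 1):
            try:
                return self.steps[name](results)
            except TransientError:
                if attempt == max_attempts:
                    raise


# Example: the charge step fails once with a retryable error, then succeeds.
charge_calls = {"count": 0}
pending_failures = {"count": 1}

def validate(results):
    return "valid"

def charge(results):
    if pending_failures["count"] > 0:
        pending_failures["count"] -= 1
        raise TransientError("gateway timeout")
    charge_calls["count"] += 1
    return "charged"

def notify(results):
    return f"notified after {results['charge']}"

workflow = IdempotentWorkflow(
    steps={"validate": validate, "charge": charge, "notify": notify},
    deps={"validate": set(), "charge": {"validate"}, "notify": {"charge"}},
)
first = workflow.run("bill-123")
second = workflow.run("bill-123")  # replay: no step is executed again
```

Replaying the workflow with the same key returns the recorded results without charging twice. Note that the sketch only protects against replays after a result has been recorded; in a distributed setting each step would also need to be idempotent against the downstream service, for example by forwarding the idempotency key to it.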

The idempotency framework works well for both synchronous and asynchronous communication between services. For asynchronous communication, payments services primarily use a Kafka-based message bus to send “events” to one another. Because Kafka guarantees only at-least-once delivery, the same event may be delivered more than once; event processors use the idempotency framework to turn that at-least-once delivery guarantee into an exactly-once processing guarantee. The transactional integrity analysis tools described in this post provide an additional layer of confidence by ensuring consistency between events and transactional data sources.
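One common way to get exactly-once processing on top of at-least-once delivery is to deduplicate on a unique event identifier. The sketch below illustrates that general technique only; it does not use a Kafka client, and `Event`, `DedupingProcessor`, and the field names are assumptions made for illustration rather than Airbnb's actual event schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str       # unique per logical event; redeliveries reuse it
    bill_id: str
    amount_cents: int


class DedupingProcessor:
    """Applies each event's effect once, however many times it is delivered."""

    def __init__(self):
        self.processed_ids = set()  # stand-in for a durable idempotency table
        self.balances = {}          # bill_id -> total amount applied

    def handle(self, event: Event) -> bool:
        """Returns True if the event was applied, False if it was a duplicate."""
        if event.event_id in self.processed_ids:
            return False
        # In a real system the side effect and the processed-id record would
        # need to be committed atomically (e.g., in one database transaction).
        self.balances[event.bill_id] = (
            self.balances.get(event.bill_id, 0) + event.amount_cents
        )
        self.processed_ids.add(event.event_id)
        return True


# Simulated at-least-once delivery: event "e1" arrives twice.
deliveries = [
    Event("e1", "bill-1", 5000),
    Event("e2", "bill-1", 2500),
    Event("e1", "bill-1", 5000),
]
processor = DedupingProcessor()
applied = [processor.handle(e) for e in deliveries]
```

The duplicate delivery of `e1` is acknowledged but has no further effect, so the balance for `bill-1` reflects each logical event exactly once.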

Product-Agnostic Platform

The payments SOA migration decoupled product fulfillment, payment orchestration, and pricing

A significant disadvantage of our legacy payment data models is that they were closely tied to a single product, reservations for stays. For this reason, our new payment orchestration service was intentionally architected to avoid tightly coupling the payments system to any particular product. Instead, the new orchestration layer was designed around generic payment-specific workflows (e.g., validation, payment processing, financial reporting) with payment-specific logic and product-specific logic isolated from one another, with the exception of a few well-defined integration points. When combined with the generic billing and pricing APIs described in this blog post, this approach allows new products to integrate quickly and easily with existing generic payment flows, drastically reducing both engineering effort and time to delivery. Additionally, as new features are added to the payment systems, these features can be easily adopted by other products.
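The separation between generic payment workflows and product-specific logic can be pictured as a small interface that each product implements. The following is a sketch under stated assumptions, not the actual service design: `Bill`, `ProductIntegration`, `PaymentOrchestrator`, and the two example integrations are hypothetical names, and a list stands in for real payment processing.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


@dataclass(frozen=True)
class Bill:
    bill_id: str
    product_type: str
    amount_cents: int


class ProductIntegration(Protocol):
    """The only points where product-specific logic plugs into the flow."""

    def validate(self, bill: Bill) -> bool: ...
    def fulfill(self, bill: Bill) -> str: ...


class StaysIntegration:
    def validate(self, bill: Bill) -> bool:
        return bill.amount_cents > 0

    def fulfill(self, bill: Bill) -> str:
        return f"stay confirmed for {bill.bill_id}"


class ExperiencesIntegration:
    def validate(self, bill: Bill) -> bool:
        return bill.amount_cents > 0

    def fulfill(self, bill: Bill) -> str:
        return f"experience confirmed for {bill.bill_id}"


class PaymentOrchestrator:
    """Generic flow: validate, process the payment, then hand off to the product."""

    def __init__(self, integrations: Dict[str, ProductIntegration]):
        self.integrations = integrations
        self.ledger: List[Tuple[str, int]] = []  # stand-in for payment processing

    def process(self, bill: Bill) -> str:
        product = self.integrations[bill.product_type]
        if not product.validate(bill):
            return "rejected"
        self.ledger.append((bill.bill_id, bill.amount_cents))
        return product.fulfill(bill)


orchestrator = PaymentOrchestrator(
    {"stay": StaysIntegration(), "experience": ExperiencesIntegration()}
)
stay_result = orchestrator.process(Bill("bill-1", "stay", 12000))
experience_result = orchestrator.process(Bill("bill-2", "experience", 4500))
rejected_result = orchestrator.process(Bill("bill-3", "stay", 0))
```

In this arrangement, supporting a new product means registering one more integration; the orchestrator itself does not change.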

Data Immutability

Immutable data is easier to understand, audit, and reconcile. All of the new payment services were built around the idea of data immutability. For payment orchestration, data immutability manifests in two major forms: persistent events and versioning. Events are naturally append-only. It is the responsibility of the event consumer to determine if a new event represents a modification to an existing event. When an existing product is altered (e.g., adding another night to a stay), the modifications to the payment orchestration plan are modeled as a new version in a sequence of plans for that product. The combined information from all the versions provides a complete history of the intended and actual money movement related to that product.
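The versioning half of this idea can be sketched as an append-only sequence of frozen plan versions. The class and field names below (`PlanVersion`, `PlanHistory`, `line_items`) are illustrative assumptions, not the real data model.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PlanVersion:
    """One immutable snapshot of the intended money movement for a product."""
    version: int
    line_items: Tuple[Tuple[str, int], ...]  # (description, amount in cents)

    @property
    def total_cents(self) -> int:
        return sum(amount for _, amount in self.line_items)


class PlanHistory:
    """Append-only sequence of plan versions; earlier versions never change."""

    def __init__(self, product_id: str):
        self.product_id = product_id
        self._versions: List[PlanVersion] = []

    def append(self, line_items: Iterable[Tuple[str, int]]) -> PlanVersion:
        version = PlanVersion(len(self._versions) + 1, tuple(line_items))
        self._versions.append(version)
        return version

    @property
    def latest(self) -> PlanVersion:
        return self._versions[-1]

    def history(self) -> Tuple[PlanVersion, ...]:
        return tuple(self._versions)


# A two-night stay is extended by one night: a new version is appended
# instead of editing the original plan.
plans = PlanHistory("stay-1")
v1 = plans.append([("night 1", 10000), ("night 2", 10000)])
v2 = plans.append([("night 1", 10000), ("night 2", 10000), ("night 3", 10000)])
```

Because version 1 is never modified, comparing it with version 2 shows exactly what changed when the extra night was added, which is the property that makes auditing and reconciliation simpler.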

A Phased Migration

Various teams at Airbnb took different approaches to the migration towards a service-oriented architecture (SOA). Many teams chose to migrate functionality in small blocks, replacing the legacy implementation with an equivalent SOA one. Generally, with this approach, the existing system would be broken down into discrete, cohesive, functional blocks. Each block could be migrated mostly independently of the others. The behavior of each block would be well defined and the result could be trivially compared across both systems to ensure consistent results.

The Airbnb payments organization took a different approach for the migration of the various payments systems. Instead of small functional blocks, the migration for the payments systems was broken down into four major phases: Pricing, Payouts, Bookings, and Data Migration. The Pricing phase remodeled each of the product-specific pricing models into a generic model that could be used across all Airbnb products. The Payouts and Bookings phases fundamentally redesigned the way that money movement is orchestrated at Airbnb to more easily support new products, features, and business needs. The majority of the work related to payment orchestration was contained within these phases. The Data Migration phase migrated existing bookings from the legacy system to SOA, allowing the legacy system to be wound down and deprecated.

Within each phase, the migration was divided into smaller migrations, usually by feature or product. For example, in the Bookings phase, bookings for stays were migrated independently from bookings for experiences. When reasonable, those subphases were further broken down as well. The migration of bookings for stays was subdivided into over 30 milestones based on characteristics of the bookings. The relatively small scope of each milestone allowed engineers and data scientists to thoroughly test and validate each set of migrations. Additionally, the relatively independent nature of each milestone allowed many of them to be completed in parallel.

Maintaining Two Systems

The new payment orchestration system introduced a fundamentally redesigned data model based around the concept of a bill. Unlike the legacy model, the new data model is not tied to any specific product, but rather focuses on being sufficiently powerful, extensible, and generic to be useful for existing and future Airbnb products. One important consequence of fundamentally redesigning the payment data model was that it became non-trivial to convert from one data model to another.

In general, historical bookings and payouts were not moved from one system to another as part of the initial migration process. Rather, new bookings and payouts would be routed to SOA if they were deemed eligible. Otherwise, they would continue to be routed to the legacy system. Throughout most of the migration process, existing bookings would continue to proceed through their lifecycle in the legacy monolithic system. Only at the tail end of the migration were active bookings transitioned from the legacy system to SOA. As a result, engineering teams needed to maintain two parallel payment orchestration systems throughout virtually the entire migration process.

Most consumers of payments data don’t actually care whether the data is stored in the legacy or SOA system; they just want the data. In order to provide an easy and consistent experience for those client services, a new translation layer was built to transparently retrieve data from the correct underlying source and to seamlessly convert it into a unified data model that could be consumed by all clients. The translation layer proved incredibly valuable as it decoupled the work of the teams working on the migration from the work of the client teams.
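A layer of this kind can be thought of as a reader that picks the right backing store and normalizes whatever it finds. The sketch below is a simplified illustration: the record shapes, field names, and the `PaymentReader` class are all assumptions, since the post does not describe the actual schemas.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class UnifiedPayment:
    """The single shape that client services consume."""
    booking_id: str
    amount_cents: int
    currency: str
    source: str  # "legacy" or "soa"; kept here only for illustration


def from_legacy(row: dict) -> UnifiedPayment:
    # Hypothetical legacy shape: amount stored as a decimal string in major units.
    return UnifiedPayment(
        booking_id=str(row["reservation_id"]),
        amount_cents=int(Decimal(row["amount"]) * 100),
        currency=row["currency"],
        source="legacy",
    )


def from_soa(record: dict) -> UnifiedPayment:
    # Hypothetical SOA shape: amount already stored in minor units.
    return UnifiedPayment(
        booking_id=record["booking_id"],
        amount_cents=record["amount_cents"],
        currency=record["currency"],
        source="soa",
    )


class PaymentReader:
    """Fetches from whichever system owns the booking and normalizes the result."""

    def __init__(self, legacy_store: dict, soa_store: dict):
        self.legacy_store = legacy_store
        self.soa_store = soa_store

    def get(self, booking_id: str) -> UnifiedPayment:
        if booking_id in self.soa_store:
            return from_soa(self.soa_store[booking_id])
        return from_legacy(self.legacy_store[booking_id])


reader = PaymentReader(
    legacy_store={"1001": {"reservation_id": 1001, "amount": "120.50", "currency": "USD"}},
    soa_store={"2002": {"booking_id": "2002", "amount_cents": 8000, "currency": "EUR"}},
)
legacy_payment = reader.get("1001")
soa_payment = reader.get("2002")
```

Callers receive the same `UnifiedPayment` type either way, so a booking can move from one system to the other without any change on the client side.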

Nothing happens in a vacuum. While the migration was in progress, business needs arose and features had to be added to the payment orchestration system. For each feature, teams had to decide whether the changes should be implemented in only one system or in both. In many cases, this led to twice as much work in order to maintain a consistent user experience across both systems. In other cases, features were simply deferred or redesigned to avoid duplication of effort.

Finally, special care had to be taken to ensure that both systems behaved in the way that our guests and hosts expected. Ideally, guests and hosts wouldn’t even notice the difference apart from some improvements in performance. Additional tooling and workflows were created to ensure that Airbnb’s support ambassadors continued to provide a consistent experience for our guests and hosts regardless of which system was used to orchestrate payments.

One key learning from this experience was how critical it is to communicate with all stakeholders to ensure that everyone is aligned on timelines, constraints, and priorities. Maintaining two parallel systems over an extended period of time creates a lot of overhead and slows down iteration speeds for new features. It is vital to ensure that the broader organization is aligned on the timeline so that product teams aren’t unnecessarily slowed down by unexpected work related to a partially migrated system. Splitting the migration into phases helps reduce the time during which teams are impacted.

Commitment to Craft

Perhaps the most important part of the migration process was ensuring that the new system was built with Airbnb’s Commitment to Craft in mind and thoroughly validated before being rolled out. A dedicated team of quality assurance engineers performed comprehensive manual testing of hundreds of scenarios to help ensure consistency with the legacy system across a wide spectrum of use cases. In addition, an extensive set of unit tests, shallow integration tests, and end-to-end integration tests was created across the entire payments engineering organization to ensure the correct behavior of key payment flows. As an additional safeguard, whenever possible, asynchronous “matchup” jobs would compare the new data model and the old data model to validate that both codepaths produced consistent results.
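At its core, a matchup job of this kind is a comparison of two views of the same bookings. The following sketch shows the general shape under simplifying assumptions: both systems have already been reduced to a mapping from booking ID to an amount in cents, and the function name `matchup` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[int], Optional[int]]


def matchup(legacy: Dict[str, int], soa: Dict[str, int]) -> List[Mismatch]:
    """Returns (booking_id, legacy_value, soa_value) for every disagreement.

    A value of None means the booking is missing from that system.
    """
    mismatches: List[Mismatch] = []
    for booking_id in sorted(legacy.keys() | soa.keys()):
        legacy_value = legacy.get(booking_id)
        soa_value = soa.get(booking_id)
        if legacy_value != soa_value:
            mismatches.append((booking_id, legacy_value, soa_value))
    return mismatches


legacy_amounts = {"b1": 12050, "b2": 8000, "b3": 500}
soa_amounts = {"b1": 12050, "b2": 8100, "b4": 700}
report = matchup(legacy_amounts, soa_amounts)
```

Here the report flags `b2` (amounts differ), `b3` (missing from the new system), and `b4` (missing from the legacy system). Running such a comparison asynchronously keeps validation off the critical path of the payment flow itself.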

Conclusion

Payments systems are complex. Taking the time to thoughtfully redesign the system can lead to improvements in maintainability, extensibility, performance, and resiliency. However, there are also noteworthy disadvantages to a long-lived migration process. The process can lead to uncertainty among clients of the service and consume resources that might otherwise be spent creating new features or optimizing existing flows. It is possible to mitigate some of these concerns by dividing the migration into smaller, well-defined milestones and ensuring regular communication with stakeholders. A thorough testing and validation plan is vital for ensuring that the new service can seamlessly replace legacy systems. By following this approach, we were able to launch a new payment orchestration system that is faster, is easier to maintain, and can more easily support new products, features, and business needs.

Watch the recording of the Make Money Moves tech talk for a more in-depth discussion of the migration of payments services to SOA.