Recently we went through an exercise in which we moved all of our database masters between data centers, and we planned to do it online with minimal user impact. A move like this raises a variety of considerations, such as cache consistency and other pieces of shared state in stores like HBase, but the focus of this post will be primarily on MySQL.
During this move we had a number of constraints. As mentioned above, the move had to happen online, while serving production traffic, with minimal user impact; in aggregate we service hundreds of thousands of database queries per second. We also needed to encrypt all data in transit between data centers. MySQL replication supports encryption, but client connections to the servers themselves present several challenges. From a performance standpoint, the handshake required to establish an encrypted connection across a WAN can hurt latency when there is significant connection churn, and servicing read queries across a backhaul link adds latency of its own, which is never desirable.
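For reference, encrypting the replication stream itself is mostly a matter of replica-side configuration. The sketch below shows the general shape on a replica; the hostname, user, and certificate paths are placeholders rather than our actual configuration, and binlog coordinates are omitted for brevity.

```sql
-- Minimal sketch: point a replica at a remote master over TLS.
-- Hostname, credentials, and certificate paths are placeholders.
STOP SLAVE;

CHANGE MASTER TO
  MASTER_HOST     = 'master.remote-dc.example.com',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = '********',
  MASTER_SSL      = 1,
  MASTER_SSL_CA   = '/etc/mysql/ssl/ca.pem',
  MASTER_SSL_CERT = '/etc/mysql/ssl/client-cert.pem',
  MASTER_SSL_KEY  = '/etc/mysql/ssl/client-key.pem';
  -- (binlog coordinates / GTID auto-position omitted for brevity)

START SLAVE;

-- Confirm the replication connection is using TLS.
SHOW SLAVE STATUS\G
```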
We decided to tackle these issues in several ways. We were able to leverage a number of existing features of our applications and infrastructure, and we developed new automation to fill gaps in functionality. Our configuration and applications in various runtimes were able to support a read/write split (which may seem obvious to some, but isn't always easy to accomplish in every scenario). We used the read/write split, along with encrypted replication, to provide a local read replica. Some runtimes can set up a persistent encrypted connection to a remote master, which serviced write requests in those cases, as the per-connection handshake latency was amortized over a large number of queries. For runtimes with a high connection churn rate, such as PHP, we used a MySQL proxy, ProxySQL, which provided persistent, encrypted connections while meeting our performance requirements. We built automation to deploy proxies for numerous database pools, each servicing thousands of requests per second.
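To give a flavor of the proxy side, here is a trimmed-down sketch against ProxySQL's admin interface: a locally deployed proxy holding persistent, TLS-encrypted connections to the remote master, so short-lived PHP requests don't pay the WAN handshake cost. The hostgroup ID, hostname, and connection limit are illustrative, not our production configuration.

```sql
-- Illustrative ProxySQL admin sketch (hostname and values are placeholders).
-- The proxy runs locally and keeps a pool of persistent, TLS-encrypted
-- backend connections to the remote master.
INSERT INTO mysql_servers (hostgroup_id, hostname, port, use_ssl, max_connections)
VALUES (0, 'master.remote-dc.example.com', 3306, 1, 200);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```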
When performing the cutover, our workflow was as follows. In each data center there was a config that pointed to a local read slave, a remote master, and a local proxy with the master (remote or local) as a backend. When moving masters between data centers, our database automation, Jetpants (new release coming soon!), reparented all replicas, and our automation updated the proxy backend to point to the new master. This resulted in only seconds of read-only state per database pool and minimal user impact.
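As a rough illustration of that last step, updating the proxy backend amounts to something like the following against each local ProxySQL admin interface; the hostgroup ID and hostname are placeholders, and the reparenting itself is handled by Jetpants rather than by this snippet.

```sql
-- Hypothetical cutover step run by automation against each local ProxySQL:
-- swap the writer backend to the newly promoted master and apply at runtime.
UPDATE mysql_servers
   SET hostname = 'new-master.local-dc.example.com'
 WHERE hostgroup_id = 0;

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```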
More coming soon!