Peter Zaitsev: The Multi-Source GTID Replication Maze

Source: Planet MariaDB

In this blog post, we’ll look at how to navigate some of the complexities of multi-source GTID replication.

GTID replication is often a real challenge for DBAs, especially when it involves multi-source GTID replication. A while back, I came across a really interesting customer environment with shards where multi-master, multi-source, multi-threaded MySQL 5.6 MIXED replication was active. This is a highly complex environment with both pros and cons, introducing risks as a trade-off for specific customer requirements.

This is the setup of part of this environment:

I started looking into this setup when a statement broke replication between db1 and db10. Replication broke because a statement was executed on a schema that was not present on db10. This also meant that changes originating from db1 were no longer pushed down to db100 through db10, as we had stopped the replication thread (for the db1 channel).

On the other hand, replication was not stopped on db2, because the schema in question was present on db2. Replication between db2 and db20 broke as well, though, because the schema was not present on db20.

In order to fix the db1 -> db10 replication, four GTID sets were injected on db10.
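For context, "injecting" a GTID usually means committing an empty transaction under the missing GTID, so the slave considers it already executed and skips it. A minimal sketch, assuming a placeholder server UUID and sequence number (in a multi-source setup, STOP/START SLAVE may also need a channel or connection name):

```sql
-- On db10: mark the offending transaction as executed by
-- committing an empty transaction under its GTID.
-- The UUID:sequence below is a hypothetical placeholder.
STOP SLAVE;
SET GTID_NEXT = '11111111-2222-3333-4444-555555555555:1234';
BEGIN;
COMMIT;
SET GTID_NEXT = 'AUTOMATIC';
START SLAVE;
```

This is repeated once per missing GTID (four times in our case).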


After injecting the GTID sets, we started replication again and everything ran fine.

After that, we had to check the db2 -> db20 replication which, as I've already said, was broken as well. In this case, injecting only the first GTID transaction into db20, instead of all of those that caused issues on db10, was enough!

You may wonder how this is possible, right? The answer is that the rest of the transactions were replicated from db10 to db20, although not over the same channel.

Another strange thing is the fact that although the replication thread for the db2 -> db20 channel was stopped (broken), checking the slave status on db20 showed that Executed_Gtid_Set was advancing for all channels, even though Retrieved_Gtid_Set for the broken one was not! So what was happening there?
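This can be observed directly: the executed GTID set is a server-wide (global) property, while the retrieved GTID set is reported per channel. A quick sketch, assuming a hypothetical channel name (the FOR CHANNEL clause is MySQL syntax; MariaDB uses SHOW SLAVE 'name' STATUS instead):

```sql
-- Executed GTIDs belong to the whole server and are shared
-- by all replication channels:
SELECT @@GLOBAL.gtid_executed;

-- Retrieved GTIDs, by contrast, are tracked per channel:
SHOW SLAVE STATUS FOR CHANNEL 'db2_channel'\G
```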

This piqued my curiosity, so I decided to investigate further and construct scenarios around other strange things that could happen. An interesting one concerned replication filters. In our case, I thought: "What would happen in the following scenario?"

Let's say we write a row from db1 to db123.table789. This row is replicated to db10 (say, over channel1) and to db2 (say, over channel2). On channel1 we filter out the db123.% tables; on channel2 we don't. db1 writes the row and the entry to its binary log. db2 applies the row after reading the entry from the binary log, then writes the entry to its own binary log and replicates the change to db20. This change is also replicated to db10. So now, on db10 (depending on which channel sees the GTID first), the transaction either gets filtered on channel1 and is written to db10's own binary log as just a begin/commit with the actual DDL/DML removed, or, if it arrives first on channel2 (db1 -> db2, then db20 -> db10), it is NOT filtered out and is executed instead. Is this correct? It definitely ISN'T!
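For reference, when a GTID-mode slave filters out a transaction, it still logs an empty transaction group so that the GTID is accounted for in its executed set. In mysqlbinlog output this looks roughly like the following (schematic, with a placeholder GTID):

```sql
-- Schematic binlog excerpt on the filtering slave: the GTID is
-- preserved, but the transaction body is empty because the
-- replication filter removed the actual DML.
SET @@SESSION.GTID_NEXT = '11111111-2222-3333-4444-555555555555:1234';
BEGIN;
COMMIT;
```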

Points of interest

You can find answers to the above questions in the points of interest listed below. Although the official documentation isn't really clear about it, this is what happens with GTID replication and multi-source GTID replication:

  • As we know, GTID sets are unique across all nodes in a given cluster. In multi-source replication, Executed_Gtid_Set is common to all channels. This means that regardless of the originating channel, when a GTID transaction is executed it is recorded in every channel's Executed_Gtid_Set. Although this is logical (each GTID is unique, so if a transaction is going to affect a database it shouldn't be tied to a single channel, regardless of the channel it uses), the documentation doesn't provide much information about this.
  • When we have multi-source, multi-level replication, there are cases where GTID sets originating from one master can end up on a slave via different replication paths. It's not clear whether any special algorithm is applied (it doesn't seem that there is one); the method appears to be simply FIFO: the fastest path wins! This means that GTID sets can travel to the slave via different channels, depending on how fast the upper-level slaves can commit changes. In fact, the path doesn't really matter, as each GTID transaction is executed only once.
  • Replication filters are global, regardless of the channel. This means that each filter applies to all channels. This makes sense, as we can't define a replication filter per channel. To be able to debug such cases, adding a small replication delay per channel seems like a good idea.
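The last two points can be sketched as follows: filters are defined as server options and affect every channel, while a per-channel delay can make it visible which path delivered a transaction first. The filter pattern and delay value below are illustrative, and per-channel CHANGE MASTER syntax differs between MySQL and MariaDB:

```sql
-- A server-wide filter (a startup option, not per channel), e.g.:
--   --replicate-wild-ignore-table=db123.%
-- applies to ALL replication channels on this slave.

-- Delayed replication: apply changes only N seconds after they
-- were committed on the master; here 30 seconds, for debugging.
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 30;
START SLAVE;
```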

