Technology Sharing

Database Disaster Recovery | In-depth comparison between MySQL MGR and Alibaba Cloud PolarDB-X Paxos

2024-07-12


Open Source Ecosystem

As we all know, a MySQL master-slave setup (two nodes) generally achieves high data availability through asynchronous replication or semi-synchronous replication (Semi-Sync). However, in abnormal scenarios such as a data-center network failure or a host hang, the master-slave architecture is likely to end up with inconsistent data after an HA switch (RPO != 0 for short). Therefore, as long as the business data is of any importance, you should not choose a database product based on a two-node MySQL master-slave architecture; a multi-replica architecture with RPO = 0 is recommended instead.

Regarding the evolution of multi-replica technology with RPO = 0 in the MySQL ecosystem:

  • The MySQL official team launched the MySQL Group Replication (MGR) high-availability solution based on group replication; internally it encapsulates the Paxos protocol through the XCOM component to provide data-consistency guarantees.
  • Alibaba Cloud PolarDB-X, which grew out of the business polishing and verification of Alibaba's e-commerce Double Eleven and multi-site active-active deployments, was fully open sourced in October 2021 and fully embraces the MySQL open-source ecosystem. PolarDB-X is positioned as a centralized-and-distributed integrated database. Its data node (DN) uses the self-developed X-Paxos protocol and is highly compatible with MySQL 5.7/8.0. It not only provides financial-grade high availability, but also offers a highly scalable transaction engine, flexible operation/maintenance and disaster recovery, and low-cost data storage. Reference: "PolarDB-X Open Source | MySQL Triple Replica Based on Paxos".

PolarDB-X's concept of centralized-distributed integration: the data node (DN) can be deployed on its own as a centralized (Standard Edition) form that is fully compatible with the stand-alone database form. When the business grows to the point where distributed scale-out is required, the architecture is upgraded in place to the distributed form and the distributed components connect seamlessly to the original data nodes; no data migration or application-side changes are required, while the availability and scalability of the distributed form become available. Architecture description: "Centralized and Distributed Integration".

MySQL's MGR and PolarDB-X's standard version DN both use the Paxos protocol at the most basic level. So what are their specific performances and differences in actual use? This article will elaborate on them in detail from the aspects of architecture comparison, key differences, and test comparison.

MGR/DN abbreviation explanation: MGR represents the technical form of MySQL MGR, and DN represents the technical form of PolarDB-X single DN centralized (standard version).

TL;DR

The detailed comparative analysis is quite long, so you may want to read the summary and conclusions first; if interested, you can then follow the summary to the supporting details later in the article.

MySQL MGR is not recommended for general businesses and companies because it requires professional technical knowledge and an operation and maintenance team to use it well. This article also reproduces three long-standing "pitfalls" of MySQL MGR in the industry:

  • Pitfall 1: MySQL MGR's XCOM protocol works entirely in memory and by default does not meet the RPO=0 data-consistency guarantee (a test case reproducing the data-loss problem appears later in this article). You must explicitly configure a parameter to obtain that guarantee, and with MGR's current design you cannot have both performance and RPO=0.
  • Pitfall 2: MySQL MGR has poor performance under network latency. The article tested four network scenarios (including three data centers in the same city and three centers in two locations). Under the performance parameter configuration, the performance of cross-city is only 1/5 of that of the same city. If RPO=0 data guarantee is enabled, the performance is even worse. Therefore, MySQL MGR is more suitable for use in the same data center scenario, and cross-data center disaster recovery is not suitable.
  • Pitfall 3: In MGR's multi-replica architecture, a standby-node failure causes the traffic of the primary node (Leader) to drop to 0, which defies expectations. The article also tries enabling MGR's single-leader mode (closer to MySQL's earlier master-slave replication architecture) and simulates standby-replica downtime and recovery; even routine maintenance of a standby node causes the primary (Leader) traffic to drop to 0 for about 10 seconds, so overall operability is relatively poor. MySQL MGR therefore places relatively high demands on host operation and maintenance and requires a professional DBA team.

Compared with MySQL MGR, PolarDB-X Paxos has none of these pitfalls in data consistency, cross-data-center disaster recovery, or node operation and maintenance, but it does have some minor shortcomings, as well as advantages, in disaster recovery:

  • In a simple scenario in the same data center, the read-only performance under low concurrency and the pure write performance under high concurrency are about 5% lower than MySQL MGR. At the same time, there is room for further performance optimization in the network transmission of multiple copies.
  • Advantages: 100% compatible with the features of MySQL 5.7/8.0. At the same time, it has made many streamlined optimizations in the replication of multiple replicas of standby databases and the failover path. The high-availability switching RTO is <= 8 seconds. It performs well in the common 4-hour disaster recovery scenario in the industry. It can replace semi-sync, MGR, etc.

1. Architecture comparison

Glossary

MGR/DN abbreviation description:

  1. MGR: The technical form of MySQL MGR, the subsequent content is referred to as MGR
  2. DN: The technical form of Alibaba Cloud PolarDB-X centralized (standard edition), in which the distributed data node DN can be separated as a centralized (standard edition) form, which is fully compatible with stand-alone databases. The following content is referred to as: DN

MGR

MGR supports single-primary and multi-primary modes and fully reuses MySQL's replication stack, including Event, Binlog & Relay Log, Apply, Binlog Apply Recovery, and GTID. The key difference from DN is that MGR's transaction log reaches majority consensus before the primary commits the transaction.

  • Leader:
    1. Before the transaction is committed, the before_commit hook group_replication_trans_before_commit is called, entering MGR's majority replication
    2. MGR uses the Paxos protocol to synchronize the Binlog events cached on the THD to all online nodes
    3. After receiving a majority of responses, MGR determines that the transaction can be committed
    4. The THD enters the transaction group-commit process, writes the local Binlog, updates the Redo log, and replies to the client with an OK packet
  • Follower:
    1. MGR's Paxos engine continuously listens for protocol messages from the Leader
    2. After a complete Paxos consensus round, it confirms that this (batch of) events has reached majority within the cluster
    3. The received events are written to the Relay Log, which the applier threads then apply
    4. Applying the Relay Log goes through the complete group-commit process, so the standby eventually produces its own Binlog files

MGR adopts this flow because its design supports multi-primary mode, where every node can accept writes. A Follower in a single Paxos group therefore first converts the received log into a Relay Log and then, in the two-phase group-commit process, merges it with the write transactions it committed as a Leader to produce its own Binlog file and complete the final commit.
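To relate these steps to something observable, the current mode and member roles can be checked directly from SQL; a minimal sketch using standard MySQL 8.0 variables and the performance_schema table that also appears in the test case of section 3.3:

```sql
-- Is the group running in single-primary mode (ON) or multi-primary mode (OFF)?
SHOW GLOBAL VARIABLES LIKE 'group_replication_single_primary_mode';

-- Which member is currently the PRIMARY (Leader) and which are SECONDARY (Followers)?
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members
ORDER BY MEMBER_PORT;
```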

DN

DN reuses the basic data structure and function-level code of MySQL, but closely integrates log replication, log management, log playback, and crash recovery with the X-Paxos protocol to form its own set of majority replication and state machine mechanisms. The key difference from MGR is that the entry point for the DN transaction log majority to reach consensus is during the transaction submission process of the master database.

  • Leader:
    1. Enter the transaction group commit process. In the Flush phase of group commit, write the Events on each THD into the Binlog file, and then broadcast the log asynchronously to all Followers through X-Paxos.
    2. In the Sync phase of group commit, Binlog is persisted first, and then the X-Paxos persistence location is updated
    3. In the Commit phase of group commit, the group first waits for X-Paxos to receive a majority of responses, then commits the group of transactions, and finally replies to the client with an OK packet
  • Follower:
    1. X-Paxos continuously listens for protocol messages from the Leader
    2. Receive a (group of) Events, write to local Binlog, and reply
    3. Receive the next message, which carries the commit index of the site where the majority has been reached
    4. The SQL apply threads continuously apply the received Binlog in the background, but only up to the majority-committed position (Commit Index)

This design works because DN currently supports only single-primary mode, so the log at the X-Paxos protocol level is the Binlog itself and the Follower needs no Relay Log; the log content it persists is identical to the Leader's.

2. Key Differences

2.1. Paxos Protocol Efficiency

MGR

  • MGR's Paxos protocol is based on the Mencius protocol and belongs to the Multi-Paxos theory. The difference is that Mencius has made optimization improvements in reducing the master node load and improving throughput.
  • The Paxos protocol of MGR is implemented by the XCOM component, which supports multi-master and single-master mode deployment. In single-master mode, the Binlog on the Leader is atomically broadcast to the Follower node. The broadcast of each batch of messages (a transaction) is a standard Multi-Paxos process.
  • To reach majority for a transaction, XCOM needs at least three message interactions: Accept + AckAccept + Learn, i.e., at least 1.5 RTT of overhead. At most it needs the full sequence Prepare + AckPrepare + Accept + AckAccept + Learn, i.e., at most 2.5 RTT of overhead.
  • Since the Paxos protocol is highly cohesive within the XCOM module, the MySQL replication system is not aware of it. Therefore, the Leader must wait for the entire Paxos process to complete before committing the transaction locally, including Binlog persistence and group commit.
  • After the majority is reached, the Follower asynchronously persists the events to the Relay Log; the SQL thread then applies them, and group commit produces the Binlog
  • Since the log synchronized by Paxos is the not-yet-ordered Binlog taken before group commit, the order of Binlog events on the Leader may differ from the order of events in the Relay Log on a Follower node

DN

  • DN's Paxos protocol is based on the Raft protocol, which also belongs to the Multi-Paxos family. The difference is that Raft provides stronger leadership guarantees and engineering stability
  • DN's Paxos protocol is implemented by the X-Paxos component, which defaults to single-primary mode. In single-primary mode, the Binlog on the Leader is atomically broadcast to the Follower nodes, and each batch of messages follows a standard Raft process
  • To reach majority for a transaction, X-Paxos only needs one round of Append + AckAppend message interaction, i.e., 1 RTT of overhead
  • After the leader sends the log to the follower, as long as the majority is satisfied, it commits the transaction without waiting for the second phase of the Commit Index broadcast.
  • A Follower must persist the transaction log before the majority commit completes. This differs significantly from MGR's XCOM, which only needs to receive the message in XCOM memory
  • The Commit Index is carried over in subsequent messages and heartbeat messages. After the CommitIndex is pushed up, the Follower will apply the Event.
  • The order of the Binlog contents of the Leader and Follower is consistent, there are no holes in the Raft log, and the Batching/Pipeline mechanism is used to increase the log replication throughput.
  • Compared with MGR, the Leader therefore always pays only one network round trip at transaction commit, which is critical for latency-sensitive distributed applications
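To make the protocol difference concrete, a rough back-of-the-envelope estimate, assuming a 30 ms cross-city RTT and ignoring disk and queuing time (the RTT value is illustrative, not taken from the tests below):

```latex
\text{MGR (XCOM)}:\; 1.5 \times 30\,\mathrm{ms} = 45\,\mathrm{ms} \;\le\; T_{\mathrm{commit}} \;\le\; 2.5 \times 30\,\mathrm{ms} = 75\,\mathrm{ms}
\qquad
\text{DN (X-Paxos)}:\; T_{\mathrm{commit}} \approx 1 \times 30\,\mathrm{ms} = 30\,\mathrm{ms}
```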

2.2. RPO

Theoretically, both Paxos and Raft guarantee data consistency and guarantee that logs that have reached majority are not lost after crash recovery, but the concrete engineering implementations still differ.

MGR

XCOM fully encapsulates the Paxos protocol, and all of its protocol data is first cached in memory. By default, a transaction that reaches majority does not require the log to be persisted. If a majority of nodes go down, or the Leader fails, this leads to a serious RPO != 0 problem. Consider an extreme scenario:

  1. The MGR cluster consists of three nodes, ABC. AB is in the same city and C is in another city. A is the leader node and BC is the follower node.
  2. Transaction 001 is initiated on Leader A node. Leader A broadcasts the log of transaction 001 to BC node. The transaction is considered to be committed when the majority is satisfied through the Paxos protocol. AB nodes constitute the majority. Node C does not receive the log of transaction 001 due to cross-city network delay.
  3. The next moment, Leader A commits transaction 001 and returns a success message to the Client, indicating that transaction 001 has been committed to the database.
  4. At this moment, on the Follower of node B, the log of transaction 001 is still in the XCOM cache and has not yet been flushed to the RelayLog; at this moment, the Follower of node C has still not received the log of transaction 001 sent by the Leader of node A.
  5. At this time, nodes AB are down, node A cannot be restored for a long time, node B is quickly restarted and restored, and nodes BC continue to provide read and write services
  6. Since the transaction 001 log was not persisted to the RelayLog of node B and was not received by node C during the downtime, both nodes BC have actually lost transaction 001 and cannot retrieve it.
  7. In this scenario where the majority of servers are down, RPO!=0 is caused

Under the community default parameters, a transaction that reaches majority does not require log persistence, so RPO = 0 is not guaranteed; this can be seen as a performance trade-off in the XCOM implementation. To guarantee RPO = 0 absolutely, the read/write-consistency parameter group_replication_consistency must be set to AFTER, but then, in addition to the 1.5 RTT of network overhead, a transaction reaching majority also incurs one log IO, and performance suffers badly.
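For reference, a minimal sketch of the configuration trade-off described above, using the documented MySQL 8.0 variable (other values exist; see the MySQL manual for the full list):

```sql
-- Default value EVENTUAL: the commit is acknowledged once XCOM reaches majority in memory,
-- which is the configuration that does not guarantee RPO = 0
SET GLOBAL group_replication_consistency = 'EVENTUAL';

-- Stricter setting discussed above: the commit additionally waits until the change has been
-- applied on the other members, trading latency and throughput for the stronger guarantee
SET GLOBAL group_replication_consistency = 'AFTER';
```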

DN

PolarDB-X DN uses X-Paxos to implement the distributed protocol, which is deeply bound to MySQL's Group Commit process. When a transaction is committed, it is mandatory for the majority to confirm that the transaction is persisted before it can be truly committed. Here, the majority means that the Binlog of the master database is persisted to disk, and the IO thread of the standby database receives the log of the master database and persists it to its own Binlog. Therefore, even in extreme scenarios where all nodes fail, data will not be lost, and RPO=0 can be guaranteed.

2.3. RTO

The RTO is closely tied to the cost of a cold restart of the system, which comes down to the following basic capabilities: fault detection -> crash recovery -> leader election -> log catch-up.

2.3.1. Fault Detection

MGR

  • Each node periodically sends heartbeat packets to other nodes to detect whether other nodes are healthy. The heartbeat period is fixed at 1s and cannot be adjusted.
  • If the current node finds that other nodes have not responded after exceeding group_replication_member_expel_timeout (default 5s), it will be considered as a failed node and kicked out of the cluster.
  • For exceptions such as network disconnection or abnormal restart, after the network is restored, the single faulty node will automatically try to join the cluster and then catch up the logs.
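For completeness, the expel timeout named above is a per-member system variable; a brief sketch (the fixed 1 s heartbeat itself has no tuning parameter):

```sql
-- Seconds an unreachable member is tolerated before being expelled (default 5 in 8.0.21+)
SET GLOBAL group_replication_member_expel_timeout = 5;

-- While the suspicion is pending, the member shows up as UNREACHABLE here
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members;
```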

DN

  • The leader node periodically sends heartbeat packets to other nodes to check whether other nodes are healthy. The heartbeat period is 1/5 of the election timeout. The election timeout is controlled by the parameter consensus_election_timeout, which defaults to 5s, so the leader node heartbeat period defaults to 1s
  • If the Leader finds that other nodes are offline, it will continue to periodically send heartbeat packets to all other nodes to ensure that other nodes can be connected in time after the crash. However, the Leader node will no longer send transaction logs to the offline nodes.
  • Non-Leader nodes do not send heartbeat detection packets, but if a non-Leader node finds that it has not received the heartbeat of the Leader node for more than the consensus_election_timeout time, it triggers a re-election.
  • For exceptions such as network outages or abnormal restarts, the failed node will automatically join the cluster after the network is restored.
  • Therefore, in terms of fault detection, DN provides more operation and maintenance configuration interfaces, which can more accurately identify faults in cross-city deployment scenarios.
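As an illustration, the DN-side election timeout named above could be inspected and tuned roughly as follows; this is a sketch only, and both the unit and whether the PolarDB-X parameter can be changed at runtime are assumptions to verify against the PolarDB-X documentation:

```sql
-- Election timeout of the X-Paxos cluster; the Leader heartbeat period is 1/5 of it
SHOW GLOBAL VARIABLES LIKE 'consensus_election_timeout';

-- Illustrative example: widen the timeout for a high-latency cross-city deployment
-- (the value and runtime settability are assumptions, see the note above)
SET GLOBAL consensus_election_timeout = 10000;
```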

2.3.2. Crash Recovery

MGR

    • The Paxos protocol implemented by XCOM is memory-resident: reaching majority does not require persistence, and the protocol state lives in the memory of the surviving majority of nodes. If all nodes fail, the protocol state cannot be recovered, and manual intervention is required to restore the cluster after it restarts.
    • If only a single node crashes and needs to be recovered, but the Follower node lags behind the Leader node by a large number of transaction logs, and the transaction logs cached by the XCOM on the Leader have been cleared, you can only use the Global Recovery or Clone process.
    • The XCOM cache size is controlled by group_replication_message_cache_size, the default is 1GB
    • Global Recovery means that when a node rejoins the cluster, it recovers data by obtaining the required missing transaction logs (Binary Logs) from other nodes. This process relies on at least one node in the cluster retaining all the required transaction logs.
    • Clone relies on Clone Plugin and is used for recovery when the data volume is large or a lot of logs are missing. It copies a snapshot of the entire database to the crashed node and then performs the final synchronization through the latest transaction log.
    • The Global Recovery and Clone processes are usually automated, but in some special cases, such as network problems or the XCOM caches of the other two nodes are cleared, manual intervention is required.
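Two of the knobs involved here can be prepared ahead of time; a sketch using the variable named above and the standard clone plugin (plugin library name as documented for MySQL 8.0 on Linux):

```sql
-- Enlarge the XCOM message cache so a briefly disconnected member can still fetch missing
-- messages from memory instead of falling back to Global Recovery / Clone (default 1GB)
SET GLOBAL group_replication_message_cache_size = 2147483648;

-- Install the clone plugin on every member so snapshot-based recovery is available
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
```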

DN

    • The X-Paxos protocol uses Binlog persistence. During crash recovery, committed transactions are recovered first; pending transactions wait until the X-Paxos protocol layer has re-established the leader/follower relationship and are then committed or rolled back. The whole process is fully automated; even if all nodes fail, the cluster recovers automatically after a restart.
    • In the scenario where the Follower node lags behind the Leader node by a lot of transaction logs, as long as the Binlog file on the Leader is not deleted, the Follower node will definitely catch up.
    • Therefore, in terms of crash recovery, DN does not require any manual intervention.

2.3.3. Master Selection

In single-leader mode, the basic principle followed by the election of the leader in MGR's XCOM and DN X-Paxos, a strong leader mode, is the same - the logs that the cluster has reached consensus on cannot be rolled back. However, there are differences when it comes to logs that have not reached consensus.

MGR

  • In MGR, leader election is more about deciding which node serves as the Leader next. The elected Leader may not yet hold the latest consensus logs, so it must first fetch the latest logs from other nodes in the cluster and can serve reads and writes only after its logs have caught up.
  • The advantage is that Leader selection itself is policy-driven, for example by weight or by order; MGR controls each node's weight through the group_replication_member_weight parameter
  • The disadvantage is that the newly elected Leader may have a large replication lag, or a large apply lag, and must finish catching up before it can serve reads and writes, which lengthens the RTO.
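The weight policy mentioned above is set per member; a minimal sketch (the same parameter appears in the MTR option file in section 3.3):

```sql
-- Higher weight means preferred in the next primary election (range 0-100, default 50)
SET GLOBAL group_replication_member_weight = 80;
```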

DN

  • In DN, leader election is a protocol-level process: only a node whose logs are up to date with the cluster majority can be elected Leader, so the winner may previously have been a Follower or a Logger.
  • The Logger cannot provide read and write services. After it synchronizes the logs to other nodes, it will voluntarily give up the Leader role.
  • In order to ensure that the designated node becomes the Leader, DN uses an optimistic weight strategy + a mandatory weight strategy to limit the order of becoming the leader, and uses a strategic majority mechanism to ensure that the new leader can immediately provide read and write services with zero latency.
  • Therefore, in terms of master selection, DN not only supports strategic selection like MGR, but also supports mandatory weight strategy.

2.3.4. Log catch-up

Log catch-up deals with log replication lag between the primary and a standby: the standby needs to catch up on the logs it is missing. A node that restarts and recovers usually comes back as a standby that has already fallen behind the primary, so it must catch up. A node that is physically far from the Leader is usually not needed to form the majority, so it permanently lags and is constantly catching up. Both situations require specific engineering work to ensure replication lag is resolved in time.

MGR

  • The transaction logs are all in the XCOM cache, and the cache is only 1G by default, so when the Follower node that is far behind in replication requests the log, it is easy for the cache to have been cleared
  • At this point the lagging Follower is automatically expelled from the cluster, recovers through the Global Recovery or Clone process described above, and automatically rejoins once it has caught up. If there are network problems, or the XCOM caches of the other two nodes have also been purged, manual intervention is required
  • Why must the node be expelled from the cluster first? Because in multi-write mode a lagging node severely affects performance, and the Leader's cache is of no use to it; it is added back after catching up asynchronously
  • Why can't it simply read the Leader's local Binlog file? Because, as mentioned above, the XCOM protocol state is entirely in memory, and neither the Binlog nor the Relay Log contains any XCOM protocol information

DN

  • The data is all in the Binlog file. As long as the Binlog is not cleaned up, it can be sent on demand. There is no situation of being kicked out of the cluster.
  • In order to reduce the IO jitter caused by the master database reading old transaction logs from the Binlog file, DN first reads the most recently cached transaction logs from the FIFO Cache. The FIFO Cache is controlled by the parameter consensus_log_cache_size, which is 64M by default.
  • If the old transaction log in the FIFO Cache has been eliminated by the updated transaction log, DN will try to read the previously cached transaction log from the Prefetch Cache. The Prefetch Cache has a parameter consensus_prefetch_cache_size to control it, and the default value is 64M.
  • If the required old transaction log is not available in the Prefetch Cache, DN will try to initiate an asynchronous IO task to batch read several consecutive logs before and after the specified transaction log from the Binlog file, place them in the Prefetch Cache, and wait for the next retry of DN to read them.
  • Therefore, DN does not require any human intervention in catching up with the logs.
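By way of illustration only, the two DN caches described above could be resized as follows; the parameter names come from the text, but their runtime settability is an assumption to check against the PolarDB-X documentation:

```sql
-- FIFO cache of the most recently written consensus log entries (default 64M)
SET GLOBAL consensus_log_cache_size = 134217728;

-- Prefetch cache filled asynchronously from Binlog files for lagging Followers (default 64M)
SET GLOBAL consensus_prefetch_cache_size = 134217728;
```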

2.4. Standby database replay delay

The standby replay delay is the delay between the moment a transaction commits on the primary and the moment the same transaction has been applied on the standby. It measures the standby's log-apply performance, and it determines how long a standby that becomes the new primary after a failure needs to finish applying its backlog before it can serve reads and writes.

MGR

  • The MGR standby receives Relay Log files from the primary. To apply them, it re-reads the Relay Log and runs the complete two-phase group-commit process to produce the corresponding data and its own Binlog files
  • Transaction apply here is as expensive as transaction commit on the primary; with the default "dual-one" configuration (innodb_flush_log_at_trx_commit=1, sync_binlog=1), applying on the standby carries a large resource overhead

DN

  • The DN standby receives Binlog files from the primary. To apply them, it re-reads the Binlog and needs only a one-phase group-commit process to produce the corresponding data
  • Because DN supports complete crash recovery, the standby applier does not need innodb_flush_log_at_trx_commit=1, so in practice it is not constrained by the dual-one configuration
  • Therefore, in terms of standby replay delay, DN replays far more efficiently than MGR
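The "dual-one" settings referred to above are MySQL's two standard durability switches; the sketch below shows the two configurations being contrasted. The relaxed values for the DN standby are illustrative of the "dual-zero" idea in the text, not a recommendation:

```sql
-- "Dual-one": full durability on every commit, the default the MGR standby applier pays for
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;

-- "Dual-zero"-style relaxation the text says the DN standby applier can afford, because
-- X-Paxos already persists the consensus log and supports complete crash recovery
SET GLOBAL innodb_flush_log_at_trx_commit = 0;
SET GLOBAL sync_binlog = 0;
```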

2.5. Impact of Large Transactions

Large transactions not only affect the commit of ordinary transactions; in a distributed system they also affect the stability of the distributed protocol itself. In severe cases, a single large transaction can make the entire cluster unavailable for a long time.

MGR

  • MGR does not have any optimization for supporting large transactions. It only adds the parameter group_replication_transaction_size_limit to control the upper limit of large transactions. The default value is 143M and the maximum value is 2GB.
  • When the transaction log exceeds the large transaction limit, an error will be reported directly and the transaction cannot be submitted.
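For reference, the limit is an ordinary per-member variable; a minimal sketch (the default of roughly 143MB corresponds to 150000000 bytes):

```sql
-- Transactions whose replication payload exceeds this size fail with an error (default 150000000)
SET GLOBAL group_replication_transaction_size_limit = 150000000;

-- Setting it to 0 removes the limit, at the cost of the stability risks described above
SET GLOBAL group_replication_transaction_size_limit = 0;
```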

DN

  • To solve the protocol-instability problems that large transactions cause in a distributed system, DN splits both large transactions and large objects: the transaction log of a large transaction is divided, logically and then physically, into small chunks, and each chunk is committed with the complete Paxos majority guarantee
  • Thanks to this splitting scheme, DN imposes no restriction on transaction size; users can issue transactions of any size, and RPO = 0 is still guaranteed
  • For details, see "PolarDB-X Storage Engine Core Technology | Large Transaction Optimization"
  • Therefore, DN is essentially unaffected by large transactions

2.6. Deployment form

MGR

  • MGR supports single-master and multi-master deployment modes. In multi-master mode, each node can read and write. In single-master mode, the master database can read and write, while the standby database can only read.
  • MGR high availability deployment requires at least three nodes, that is, at least three copies of data and logs. Logger replication is not supported.
  • MGR does not support the expansion of read-only nodes, but supports the combination of MGR + master-slave replication mode to achieve similar topology expansion.

DN

  • DN supports single-primary mode deployment. In single-primary mode, the primary database is readable and writable, while the standby database is read-only.
  • DN high-availability deployment also requires at least three nodes, but it supports the log-replica (Logger) form: the Leader and Follower are full-featured replicas, while the Logger stores only logs, holds no data, and cannot be elected. Three-node high availability therefore needs only 2 copies of data plus 3 copies of logs in storage overhead, enabling low-cost deployment
  • DN supports deploying read-only nodes in the read-only replica (Learner) form. Compared with a full-featured replica, a Learner merely lacks voting rights; downstream systems subscribe to and consume the primary's data through Learner replicas

2.7. Feature Summary

| Category | Item | MGR | DN |
|---|---|---|---|
| Protocol efficiency | Transaction commit latency | 1.5~2.5 RTT | 1 RTT |
| Protocol efficiency | Majority persistence | Kept in XCOM memory only | Binlog persistence |
| Reliability | RPO=0 | Not guaranteed by default | Fully guaranteed |
| Reliability | Fault detection | All nodes probe each other, high network load; heartbeat period not adjustable | The primary periodically probes the other nodes; heartbeat parameters adjustable |
| Reliability | Majority crash recovery | Manual intervention | Automatic recovery |
| Reliability | Minority crash recovery | Mostly automatic; manual intervention needed in special cases | Automatic recovery |
| Reliability | Leader election | Election order can be freely specified | Election order can be freely specified |
| Reliability | Log catch-up | Lag must not exceed the 1GB XCOM cache | Possible as long as Binlog files are not deleted |
| Reliability | Standby replay delay | Two-phase group commit + dual-one settings, very slow | One-phase group commit + dual-zero settings, faster |
| Reliability | Large transactions | Limited to 143MB by default | No size limit |
| Deployment form | High-availability cost | Three full-featured replicas, 3 copies of data storage | Logger log replica, 2 copies of data storage |
| Deployment form | Read-only nodes | Implemented with master-slave replication | Protocol-native Learner read-only replica |

3. Test comparison

MGR was introduced in MySQL 5.7.17, but more MGR-related features are only available in MySQL 8.0, and MySQL 8.0.22 and later versions will be more stable and reliable overall. Therefore, we chose the latest version 8.0.32 of both parties for comparison testing.

Considering the differences in test environment, compilation method, deployment method, operating parameters, and test methods when comparing PolarDB-X DN and MySQL MGR, which may lead to inaccurate test comparison data, this article adopts the following methods in various details:

Test preparation (identical for PolarDB-X DN and MySQL MGR [1]):

| Item | PolarDB-X DN / MySQL MGR [1] |
|---|---|
| Hardware environment | The same physical machine with 96 cores, 754GB memory, and an SSD disk |
| Operating system | Linux 4.9.168-019.ali3000.alios7.x86_64 |
| Kernel version | A kernel baseline based on the community 8.0.32 version |
| Compilation | The same RelWithDebInfo build |
| Operating parameters | The same parameters as the PolarDB-X official 32C128G specification |
| Deployment | Single-primary mode |

Note:

  • MGR has flow control enabled by default, while PolarDB-X DN has flow control disabled by default. MGR's group_replication_flow_control_mode is therefore explicitly set to DISABLED, so that MGR performs at its best.
  • MGR shows an obvious shared-object (.so) read bottleneck during its checks, so replication_optimize_for_static_plugin_config is explicitly enabled for MGR, so that MGR's read-only performance is at its best.
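The two MGR adjustments in the note map to the following settings (a sketch; both are documented MySQL 8.0 system variables):

```sql
-- Disable MGR flow control so throughput is not throttled (best-performance configuration)
SET GLOBAL group_replication_flow_control_mode = 'DISABLED';

-- Reduce plugin (shared-object) lock overhead on the hot path when plugins are effectively static
SET GLOBAL replication_optimize_for_static_plugin_config = ON;
```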

Performance

Performance testing is the first thing everyone looks at when selecting a database. Here we use the official sysbench tool to build 16 tables with 10 million rows each and run OLTP-scenario performance tests, comparing the two systems across different OLTP scenarios and concurrency levels. Considering the variety of real deployments, we simulate the following four deployment scenarios.

  1. Three nodes are deployed in the same computer room, and the network delay between machines is 0.1ms when pinging each other
  2. Three centers in the same city, three nodes deployed in three data centers in the same region, with 1ms network latency when pinging between data centers (for example: three data centers in Shanghai)
  3. Three centers in two locations, three nodes deployed in three computer rooms in two locations, 1ms ping between computer rooms in the same city, and 30ms network latency between the same city and different locations (for example: Shanghai/Shanghai/Shenzhen)
  4. Three centers in three locations, three nodes deployed in three data centers in three locations (for example: Shanghai/Hangzhou/Shenzhen), the ping between Hangzhou and Shanghai has a network delay of about 5ms, and the longest distance from Hangzhou/Shanghai to Shenzhen has a network delay of 30ms

illustrate:

a. Considering the horizontal comparison of the performance of four deployment scenarios, the two-site three-center and three-site three-center deployment modes both adopt a three-replica deployment mode. The actual production business can be expanded to a five-replica deployment mode.

b. Considering the strict restrictions on RPO=0 when actually using high-availability database products, but the default configuration of MGR is RPO<>0, here in each deployment scenario, we continue to add MGR RPO<>0 and RPO=0 comparison tests.

  • MGR_0 denotes the data for MGR with RPO = 0
  • MGR_1 denotes the data for MGR with RPO <> 0 (the default configuration)
  • DN denotes the data for DN with RPO = 0

3.1.1. Same computer room

TPS comparison (columns are sysbench concurrency levels):

| Scenario | Series | 1 | 4 | 16 | 64 | 256 |
|---|---|---|---|---|---|---|
| oltp_read_only | MGR_1 | 688.42 | 2731.68 | 6920.54 | 11492.88 | 14561.71 |
| oltp_read_only | MGR_0 | 699.27 | 2778.06 | 7989.45 | 11590.28 | 15038.34 |
| oltp_read_only | DN | 656.69 | 2612.58 | 7657.03 | 11328.72 | 14771.12 |
| oltp_read_only | MGR_0 vs MGR_1 | 1.58% | 1.70% | 15.45% | 0.85% | 3.27% |
| oltp_read_only | DN vs MGR_1 | -4.61% | -4.36% | 10.64% | -1.43% | 1.44% |
| oltp_read_only | DN vs MGR_0 | -6.09% | -5.96% | -4.16% | -2.26% | -1.78% |
| oltp_read_write | MGR_1 | 317.85 | 1322.89 | 3464.07 | 5052.58 | 6736.55 |
| oltp_read_write | MGR_0 | 117.91 | 425.25 | 721.45 | 217.11 | 228.24 |
| oltp_read_write | DN | 360.27 | 1485.99 | 3741.36 | 5460.47 | 7536.16 |
| oltp_read_write | MGR_0 vs MGR_1 | -62.90% | -67.85% | -79.17% | -95.70% | -96.61% |
| oltp_read_write | DN vs MGR_1 | 13.35% | 12.33% | 8.00% | 8.07% | 11.87% |
| oltp_read_write | DN vs MGR_0 | 205.55% | 249.44% | 418.59% | 2415.07% | 3201.86% |
| oltp_write_only | MGR_1 | 761.87 | 2924.1 | 7211.97 | 10374.15 | 16092.02 |
| oltp_write_only | MGR_0 | 309.83 | 465.44 | 748.68 | 245.75 | 318.48 |
| oltp_write_only | DN | 1121.07 | 3787.64 | 7627.26 | 11684.37 | 15137.23 |
| oltp_write_only | MGR_0 vs MGR_1 | -59.33% | -84.08% | -89.62% | -97.63% | -98.02% |
| oltp_write_only | DN vs MGR_1 | 47.15% | 29.53% | 5.76% | 12.63% | -5.93% |
| oltp_write_only | DN vs MGR_0 | 261.83% | 713.78% | 918.76% | 4654.58% | 4652.96% |

From the test results we can see that:

  • In the read-only scenario, whether comparing MGR_1 (RPO<>0) or MGR_0 (RPO=0), the difference between DN and MGR is stable between -5% and 10%, which can be considered to be basically the same. Whether RPO is equal to 0 has no effect on read-only transactions
  • In mixed read-write and write-only transaction scenarios, the performance of DN (RPO=0) is improved by 5%~47% compared to MGR_1 (RPO<>0), and DN has a significant performance advantage at low concurrency, but not at high concurrency. This is because the protocol efficiency of DN is higher at low concurrency, but the performance hotspots of DN and MGR at high concurrency are all due to flushing.
  • Under the same RPO=0 premise, in mixed read-write and write-only transaction scenarios, the performance of DN is 2 to 46 times higher than that of MGR_0, and as the concurrency increases, the performance advantage of DN is strengthened. It is no wonder that MGR also abandons RPO=0 by default for performance.

3.1.2. Three centers in the same city

TPS comparison (columns are sysbench concurrency levels):

| Scenario | Series | 1 | 4 | 16 | 64 | 256 |
|---|---|---|---|---|---|---|
| oltp_read_only | MGR_1 | 695.69 | 2697.91 | 7223.43 | 11699.29 | 14542.4 |
| oltp_read_only | MGR_0 | 691.17 | 2708.6 | 7849.98 | 11636.94 | 14670.99 |
| oltp_read_only | DN | 645.11 | 2611.15 | 7628.39 | 11294.36 | 14647.22 |
| oltp_read_only | MGR_0 vs MGR_1 | -0.65% | 0.40% | 8.67% | -0.53% | 0.88% |
| oltp_read_only | DN vs MGR_1 | -7.27% | -3.22% | 5.61% | -3.46% | 0.72% |
| oltp_read_only | DN vs MGR_0 | -6.66% | -3.60% | -2.82% | -2.94% | -0.16% |
| oltp_read_write | MGR_1 | 171.37 | 677.77 | 2230 | 3872.87 | 6096.62 |
| oltp_read_write | MGR_0 | 117.11 | 469.17 | 765.64 | 813.85 | 812.46 |
| oltp_read_write | DN | 257.35 | 1126.07 | 3296.49 | 5135.18 | 7010.37 |
| oltp_read_write | MGR_0 vs MGR_1 | -31.66% | -30.78% | -65.67% | -78.99% | -86.67% |
| oltp_read_write | DN vs MGR_1 | 50.17% | 66.14% | 47.82% | 32.59% | 14.99% |
| oltp_read_write | DN vs MGR_0 | 119.75% | 140.01% | 330.55% | 530.97% | 762.86% |
| oltp_write_only | MGR_1 | 248.37 | 951.88 | 2791.07 | 5989.57 | 11666.16 |
| oltp_write_only | MGR_0 | 162.92 | 603.72 | 791.27 | 828.16 | 866.65 |
| oltp_write_only | DN | 553.69 | 2173.18 | 5836.64 | 10588.9 | 13241.74 |
| oltp_write_only | MGR_0 vs MGR_1 | -34.40% | -36.58% | -71.65% | -86.17% | -92.57% |
| oltp_write_only | DN vs MGR_1 | 122.93% | 128.30% | 109.12% | 76.79% | 13.51% |
| oltp_write_only | DN vs MGR_0 | 239.85% | 259.96% | 637.63% | 1178.61% | 1427.92% |

From the test results we can see that:

  • In the read-only scenario, whether comparing MGR_1 (RPO<>0) or MGR_0 (RPO=0), the difference between DN and MGR is stable between -7% and 5%, which can be considered to be basically the same. Whether RPO is equal to 0 has no effect on read-only transactions
  • In mixed read-write and write-only transaction scenarios, the performance of DN (RPO=0) is improved by 30%~120% compared to MGR_1 (RPO<>0), and DN has a significant performance advantage at low concurrency, but not at high concurrency. This is because the protocol efficiency of DN is higher at low concurrency, but the performance hotspots of DN and MGR at high concurrency are all due to flushing.
  • Under the same RPO=0 premise, in mixed read-write and write-only transaction scenarios, the performance of DN is improved by 1-14 times compared with MGR_0, and as the concurrency increases, the performance advantage of DN is strengthened. It is no wonder that MGR also abandons RPO=0 by default for performance.

3.1.3. Two Locations and Three Centers

TPS comparison (columns are sysbench concurrency levels):

| Scenario | Series | 1 | 4 | 16 | 64 | 256 |
|---|---|---|---|---|---|---|
| oltp_read_only | MGR_1 | 687.76 | 2703.5 | 7030.37 | 11580.36 | 14674.7 |
| oltp_read_only | MGR_0 | 687.17 | 2744.41 | 7908.44 | 11535.35 | 14656 |
| oltp_read_only | DN | 657.06 | 2610.58 | 7591.21 | 11174.94 | 14545.45 |
| oltp_read_only | MGR_0 vs MGR_1 | -0.09% | 1.51% | 12.49% | -0.39% | -0.13% |
| oltp_read_only | DN vs MGR_1 | -4.46% | -3.44% | 7.98% | -3.50% | -0.88% |
| oltp_read_only | DN vs MGR_0 | -4.38% | -4.88% | -4.01% | -3.12% | -0.75% |
| oltp_read_write | MGR_1 | 29.13 | 118.64 | 572.25 | 997.92 | 2253.19 |
| oltp_read_write | MGR_0 | 26.94 | 90.8 | 313.64 | 419.17 | 426.7 |
| oltp_read_write | DN | 254.87 | 1146.57 | 3339.83 | 5307.85 | 7171.95 |
| oltp_read_write | MGR_0 vs MGR_1 | -7.52% | -23.47% | -45.19% | -58.00% | -81.06% |
| oltp_read_write | DN vs MGR_1 | 774.94% | 866.43% | 483.63% | 431.89% | 218.30% |
| oltp_read_write | DN vs MGR_0 | 846.07% | 1162.74% | 964.86% | 1166.28% | 1580.79% |
| oltp_write_only | MGR_1 | 30.81 | 145.54 | 576.61 | 1387.64 | 3705.51 |
| oltp_write_only | MGR_0 | 28.68 | 108.86 | 387.48 | 470.5 | 476.4 |
| oltp_write_only | DN | 550.11 | 2171.64 | 5866.41 | 10381.72 | 14478.38 |
| oltp_write_only | MGR_0 vs MGR_1 | -6.91% | -25.20% | -32.80% | -66.09% | -87.14% |
| oltp_write_only | DN vs MGR_1 | 1685.49% | 1392.13% | 917.40% | 648.16% | 290.73% |
| oltp_write_only | DN vs MGR_0 | 1818.10% | 1894.89% | 1413.99% | 2106.53% | 2939.12% |

From the test results we can see that:

  • In the read-only scenario, whether comparing MGR_1 (RPO<>0) or MGR_0 (RPO=0), the difference between DN and MGR is stable between -4% and 7%, which can be considered to be basically the same. Whether RPO is equal to 0 has no effect on read-only transactions
  • In mixed read-write and write-only transaction scenarios, the performance of DN (RPO=0) is improved by 2 to 16 times compared to MGR_1 (RPO<>0), and DN has a significant performance advantage at low concurrency, but not at high concurrency. This is because the protocol efficiency of DN is higher at low concurrency, but the performance hotspots of DN and MGR at high concurrency are all due to flushing.
  • Under the same RPO=0 premise, in mixed read-write and write-only transaction scenarios, the performance of DN is improved by 8 to 29 times compared with MGR_0, and as the concurrency increases, the performance advantage of DN is strengthened. It is no wonder that MGR also abandons RPO=0 by default for performance.

3.1.4. Three locations and three centers

TPS comparison (columns are sysbench concurrency levels):

| Scenario | Series | 1 | 4 | 16 | 64 | 256 |
|---|---|---|---|---|---|---|
| oltp_read_only | MGR_1 | 688.49 | 2747.69 | 7853.91 | 11722.71 | 15292.73 |
| oltp_read_only | MGR_0 | 687.66 | 2756.3 | 8005.11 | 11567.89 | 15055.69 |
| oltp_read_only | DN | 656.06 | 2600.35 | 7657.85 | 11227.56 | 14562.86 |
| oltp_read_only | MGR_0 vs MGR_1 | -0.12% | 0.31% | 1.93% | -1.32% | -1.55% |
| oltp_read_only | DN vs MGR_1 | -4.71% | -5.36% | -2.50% | -4.22% | -4.77% |
| oltp_read_only | DN vs MGR_0 | -4.60% | -5.66% | -4.34% | -2.94% | -3.27% |
| oltp_read_write | MGR_1 | 26.01 | 113.98 | 334.95 | 693.34 | 2030.6 |
| oltp_read_write | MGR_0 | 23.93 | 110.17 | 475.68 | 497.92 | 511.99 |
| oltp_read_write | DN | 122.06 | 525.88 | 1885.7 | 3314.9 | 5889.79 |
| oltp_read_write | MGR_0 vs MGR_1 | -8.00% | -3.34% | 42.02% | -28.19% | -74.79% |
| oltp_read_write | DN vs MGR_1 | 369.28% | 361.38% | 462.98% | 378.11% | 190.05% |
| oltp_read_write | DN vs MGR_0 | 410.07% | 377.34% | 296.42% | 565.75% | 1050.37% |
| oltp_write_only | MGR_1 | 27.5 | 141.64 | 344.05 | 982.47 | 2889.85 |
| oltp_write_only | MGR_0 | 25.52 | 155.43 | 393.35 | 470.92 | 504.68 |
| oltp_write_only | DN | 171.74 | 535.83 | 1774.58 | 4328.44 | 9429.24 |
| oltp_write_only | MGR_0 vs MGR_1 | -7.20% | 9.74% | 14.33% | -52.07% | -82.54% |
| oltp_write_only | DN vs MGR_1 | 524.51% | 278.30% | 415.79% | 340.57% | 226.29% |
| oltp_write_only | DN vs MGR_0 | 572.96% | 244.74% | 351.15% | 819.15% | 1768.36% |

From the test results we can see that:

  • In the read-only scenario, whether comparing MGR_1 (RPO<>0) or MGR_0 (RPO=0), the difference between DN and MGR is stable between -5% and 0%, which can be considered to be basically the same. Whether RPO is equal to 0 has no effect on read-only transactions
  • In mixed read-write and write-only transaction scenarios, the performance of DN (RPO=0) is 2 to 5 times better than that of MGR_1 (RPO<>0), and DN has a significant performance advantage at low concurrency, but not at high concurrency. This is because the protocol efficiency of DN is higher at low concurrency, but the performance hotspots of DN and MGR at high concurrency are all due to flushing.
  • Under the same RPO=0 premise, in mixed read-write and write-only transaction scenarios, the performance of DN is 2 to 17 times higher than that of MGR_0, and as the concurrency increases, the performance advantage of DN is strengthened. It is no wonder that MGR also abandons RPO=0 by default for performance.

3.1.5. Deployment comparison

In order to clearly compare the performance changes under different deployment methods, we selected the TPS data of MGR and DN under different deployment methods under oltp_write_only scenario 256 concurrency in the above test, took the data of the computer room test as the baseline, and calculated the ratio of TPS data under different deployment methods compared with the baseline, so as to perceive the performance changes when deploying across cities.

| Deployment | MGR_1 TPS (256 concurrency) | DN TPS (256 concurrency) | DN advantage over MGR_1 |
|---|---|---|---|
| Same data center | 16092.02 | 15137.23 | -5.93% |
| Three centers in the same city | 11666.16 (72.50%) | 13241.74 (87.48%) | +13.50% |
| Two locations, three centers | 3705.51 (23.03%) | 14478.38 (95.64%) | +290.72% |
| Three locations, three centers | 2889.85 (17.96%) | 9429.24 (62.29%) | +226.28% |

From the test results we can see that:

  • With the expansion of deployment methods, the TPS of MGR_1 (RPO<>0) has dropped significantly. Compared with deployment in the same data center, the performance of deployment across data centers in the same city has dropped by 27.5%, and the performance of deployment across cities (two locations and three centers, three locations and three centers) has dropped by 77%~82%. This is due to the increase in RT in cross-city deployment.
  • DN (RPO=0) is comparatively stable. Compared with deployment in the same data center, performance in the same-city three-center and two-location three-center deployments drops by 4%~12%, and in the three-location three-center deployment with high network latency it drops by 37%. This, too, is due to the larger RT of cross-city deployment. However, thanks to DN's Batch & Pipeline mechanism, the impact of cross-city deployment can be offset by raising concurrency; for example, at >= 512 concurrency the throughput of the three-location three-center architecture can basically match that of the same-city and two-location three-center architectures.
  • This shows that cross-city deployment has a great impact on MGR_1 (RPO<>0)

3.1.6. Performance Jitter

In actual use, we care not only about the performance numbers but also about performance jitter; if throughput swings like a roller coaster, the real user experience is poor. We monitor the real-time TPS output, and since the sysbench tool itself does not emit a jitter metric, we use the mathematical coefficient of variation as the comparison indicator:

  • Coefficient of Variation (CV): The coefficient of variation is the standard deviation divided by the mean and is often used to compare the fluctuations of different data sets, especially when the means vary widely. The larger the CV, the greater the fluctuation of the data relative to the mean.
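Concretely, for a series of per-second TPS samples x_1, ..., x_n the indicator is computed as:

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}, \qquad
\mathrm{CV} = \frac{\sigma}{\mu}
```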

Taking the oltp_read_write scenario at 256 concurrency as an example, we analyzed the TPS jitter of MGR_1 (RPO<>0) and DN (RPO=0) in the four deployment modes: same data center, three centers in the same city, three centers in two locations, and three centers in three locations. The actual jitter curves are shown in the accompanying figure, and the jitter indicator data for each scenario is as follows:

| CV | Same data center | Three centers in the same city | Two locations, three centers | Three locations, three centers |
|---|---|---|---|---|
| MGR_1 | 10.04% | 8.96% | 6.02% | 8.63% |
| DN | 3.68% | 3.78% | 2.55% | 4.05% |
| MGR_1 / DN | 272.83% | 237.04% | 236.08% | 213.09% |

From the test results we can see that:

  • In the oltp_read_write scenario, the TPS of MGR is unstable, and it suddenly drops for no reason. This phenomenon has been found in multiple tests in multiple deployment scenarios. In comparison, DN is very stable.
  • Calculating the coefficient of variation CV, the CV of MGR is very large, 6%~10%, and even reaches a maximum value of 10% when the delay in the same data center is the smallest. The CV of DN is relatively stable at 2%~4%. The performance stability of DN is basically twice that of MGR.
  • This shows that the performance jitter of MGR_1 (RPO<>0) is relatively large

3.2. RTO

The core feature of a distributed database is high availability: the failure of any node in the cluster must not affect overall availability. For a typical same-data-center deployment of three nodes (one primary, two standbys), we tested availability in the following scenarios:

  • Interrupt the master database, then restart it, and observe the RTO time for the cluster to recover availability during the process.
  • Interrupt any standby database, then restart it and observe the availability of the primary database during the process.

3.2.1. Master database downtime + restart

With no load, kill the Leader, then monitor the status changes of each node in the cluster and whether the cluster is writable.

Leader killed (not restarted):

| Step | MGR (s) | DN (s) |
|---|---|---|
| Normal start | 0 | 0 |
| kill leader | 0 | 0 |
| Abnormal node detected | 5 | 5 |
| Cluster shrinks from 3 nodes to 2 nodes | 23 | 8 |

Leader killed and automatically restarted:

| Step | MGR (s) | DN (s) |
|---|---|---|
| Normal start | 0 | 0 |
| kill leader, automatically restarted | 0 | 0 |
| Abnormal node detected | 5 | 5 |
| Cluster shrinks from 3 nodes to 2 nodes | 23 | 8 |
| Cluster recovers from 2 nodes to 3 nodes | 37 | 15 |

From the test results, we can see that under no pressure:

  • The RTO of DN is 8-15 seconds. It takes 8 seconds to reduce to 2 nodes and 15 seconds to restore to 3 nodes.
  • The RTO of MGR is 23-37s. It takes 23s to reduce to 2 nodes and 37s to restore to 3 nodes.
  • RTO performance DN is better than MGR overall

3.2.2. Backup Database Downtime + Restart

Use sysbench to perform concurrent stress testing of 16 threads in the oltp_read_write scenario. At the 10th second in the figure, manually kill a standby node and observe the real-time TPS data output by sysbench.

From the test results chart we can see:

  • After the standby database was interrupted, the TPS of the MGR primary dropped sharply and took about 20 seconds to return to normal. According to the logs, two stages were involved: detecting that the faulty node had become unreachable, and expelling the faulty node from the MGR cluster. This test confirms a defect that has long circulated in the MGR community: even if only one of the three nodes is unavailable, the entire cluster jitters so severely that it is effectively unavailable for a period of time.
  • To address the problem that a single-node failure makes the whole MGR instance unavailable, the community introduced the MGR Paxos single-leader feature in 8.0.27, disabled by default. We therefore turned on group_replication_paxos_single_leader (see the configuration sketch after this list) and repeated the test. After the standby was interrupted, the primary's performance remained stable and even improved slightly, presumably because the network load decreased.
  • For DN, after the standby is interrupted, the primary's TPS immediately rises by about 20% and then stays stable, and the cluster remains available throughout. This is the opposite of MGR's behavior: with one standby gone, the primary only needs to send logs to the remaining standby, so sending and receiving network packets is more efficient.
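For reproducibility, the switch used in that retest is the documented MGR option; a sketch (changing it only takes effect after the whole group is rebooted, so in practice it is usually set in the configuration file before bootstrapping):

```sql
-- Allow the group communication engine to use a single consensus leader when the group
-- runs in single-primary mode (available from MySQL 8.0.27, OFF by default)
SET PERSIST group_replication_paxos_single_leader = ON;

-- The value the running group is actually using is exposed here (8.0.27+)
SELECT WRITE_CONSENSUS_SINGLE_LEADER_CAPABLE
FROM performance_schema.replication_group_communication_information;
```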

Continuing the test, we restarted the standby database and observed the changes in the TPS data of the primary database.

From the test results chart we can see:

  • MGR recovered from 2 nodes to 3 nodes in 5 seconds. However, the primary remained unavailable for about 12 seconds afterwards. Although the standby node eventually joined the cluster, its MEMBER_STATE stayed in RECOVERING, indicating that it was still catching up on data.
  • The standby restart was also verified with group_replication_paxos_single_leader enabled. MGR then recovered from 2 nodes to 3 nodes in 10 seconds, but there was still an unavailable period of about 7 seconds. It appears this parameter cannot completely solve the problem that, in single-primary MGR, a single node failure makes the entire instance unavailable for a time.
  • For DN, the standby recovers from 2 nodes to 3 nodes in 10 seconds, and the primary stays available. TPS fluctuates briefly because the restarted standby has a large replication lag and must pull the missing logs from the primary, which has a small impact on the primary; once the logs are caught up, performance is stable overall.

3.3. RPO

In order to construct the MGR majority failure RPO<>0 scenario, we use the MTR Case method provided by the community to perform fault injection testing on MGR. The designed case is as follows:

```
--echo
--echo ############################################################
--echo # 1. Deploy a 3 members group in single primary mode.
--source include/have_debug.inc
--source include/have_group_replication_plugin.inc
--let $group_replication_group_name= aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
--let $rpl_group_replication_single_primary_mode=1
--let $rpl_skip_group_replication_start= 1
--let $rpl_server_count= 3
--source include/group_replication.inc

--let $rpl_connection_name= server1
--source include/rpl_connection.inc
--let $server1_uuid= `SELECT @@server_uuid`
--source include/start_and_bootstrap_group_replication.inc

--let $rpl_connection_name= server2
--source include/rpl_connection.inc
--source include/start_group_replication.inc

--let $rpl_connection_name= server3
--source include/rpl_connection.inc
--source include/start_group_replication.inc

--echo
--echo ############################################################
--echo # 2. Init data
--let $rpl_connection_name = server1
--source include/rpl_connection.inc
CREATE TABLE t1 (c1 INT PRIMARY KEY);
INSERT INTO t1 VALUES(1);
--source include/rpl_sync.inc
SELECT * FROM t1;

--let $rpl_connection_name = server2
--source include/rpl_connection.inc
SELECT * FROM t1;

--let $rpl_connection_name = server3
--source include/rpl_connection.inc
SELECT * FROM t1;

--echo
--echo ############################################################
--echo # 3. Mock crash majority members
--echo # server 2 wait before write relay log
--let $rpl_connection_name = server2
--source include/rpl_connection.inc
SET GLOBAL debug = '+d,wait_in_the_middle_of_trx';

--echo # server 3 wait before write relay log
--let $rpl_connection_name = server3
--source include/rpl_connection.inc
SET GLOBAL debug = '+d,wait_in_the_middle_of_trx';

--echo # server 1 commit new transaction
--let $rpl_connection_name = server1
--source include/rpl_connection.inc
INSERT INTO t1 VALUES(2);
# server 1 commit t1(c1=2) record
SELECT * FROM t1;
select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;

--echo # server 1 crash
--source include/kill_mysqld.inc
--echo # sleep enough time for electing new leader
sleep 60;

--echo
--echo # server 3 check
--let $rpl_connection_name = server3
--source include/rpl_connection.inc
SELECT * FROM t1;
select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;

--echo # server 3 crash and restart
--source include/kill_and_restart_mysqld.inc
--echo # sleep enough time for electing new leader
sleep 60;

--echo
--echo # server 2 check
--let $rpl_connection_name = server2
--source include/rpl_connection.inc
SELECT * FROM t1;
select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;

--echo # server 2 crash and restart
--source include/kill_and_restart_mysqld.inc
--echo # sleep enough time for electing new leader
sleep 60;

--echo
--echo ############################################################
--echo # 4. Check alive members, lost t1(c1=2) record
--echo # server 3 check
--let $rpl_connection_name= server3
--source include/rpl_connection.inc
select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;
--echo # server 3 lost t1(c1=2) record
SELECT * FROM t1;

--echo
--echo # server 2 check
--let $rpl_connection_name = server2
--source include/rpl_connection.inc
select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;
--echo # server 2 lost t1(c1=2) record
SELECT * FROM t1;
```
The accompanying MTR option file (.cnf) of the case:

```
!include ../my.cnf

[mysqld.1]
loose-group_replication_member_weight=100

[mysqld.2]
loose-group_replication_member_weight=90

[mysqld.3]
loose-group_replication_member_weight=80

[ENV]
SERVER_MYPORT_3= @mysqld.3.port
SERVER_MYSOCK_3= @mysqld.3.socket
```

The results of the case operation are as follows:

```
############################################################
# 1. Deploy a 3 members group in single primary mode.
include/group_replication.inc [rpl_server_count=3]
Warnings:
Note  ####  Sending passwords in plain text without SSL/TLS is extremely insecure.
Note  ####  Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
[connection server1]
[connection server1]
include/start_and_bootstrap_group_replication.inc
[connection server2]
include/start_group_replication.inc
[connection server3]
include/start_group_replication.inc
############################################################
# 2. Init data
[connection server1]
CREATE TABLE t1 (c1 INT PRIMARY KEY);
INSERT INTO t1 VALUES(1);
include/rpl_sync.inc
SELECT * FROM t1;
c1
1
[connection server2]
SELECT * FROM t1;
c1
1
[connection server3]
SELECT * FROM t1;
c1
1
############################################################
# 3. Mock crash majority members
# server 2 wait before write relay log
[connection server2]
SET GLOBAL debug = '+d,wait_in_the_middle_of_trx';
# server 3 wait before write relay log
[connection server3]
SET GLOBAL debug = '+d,wait_in_the_middle_of_trx';
# server 1 commit new transaction
[connection server1]
INSERT INTO t1 VALUES(2);
SELECT * FROM t1;
c1
1
2
select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;
CHANNEL_NAME  MEMBER_HOST  MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE  MEMBER_VERSION  MEMBER_COMMUNICATION_STACK
group_replication_applier  127.0.0.1  13000  ONLINE  PRIMARY  8.0.32  XCom
group_replication_applier  127.0.0.1  13002  ONLINE  SECONDARY  8.0.32  XCom
group_replication_applier  127.0.0.1  13004  ONLINE  SECONDARY  8.0.32  XCom
# server 1 crash
# Kill the server
# sleep enough time for electing new leader
# server 3 check
[connection server3]
SELECT * FROM t1;
c1
1
select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;
CHANNEL_NAME  MEMBER_HOST  MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE  MEMBER_VERSION  MEMBER_COMMUNICATION_STACK
group_replication_applier  127.0.0.1  13002  ONLINE  PRIMARY  8.0.32  XCom
group_replication_applier  127.0.0.1  13004  ONLINE  SECONDARY  8.0.32  XCom
# server 3 crash and restart
# Kill and restart
# sleep enough time for electing new leader
# server 2 check
[connection server2]
SELECT * FROM t1;
c1
1
select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;
CHANNEL_NAME  MEMBER_HOST  MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE  MEMBER_VERSION  MEMBER_COMMUNICATION_STACK
group_replication_applier  127.0.0.1  13002  ONLINE  PRIMARY  8.0.32  XCom
group_replication_applier  127.0.0.1  13004  UNREACHABLE  SECONDARY  8.0.32  XCom
# server 2 crash and restart
# Kill and restart
# sleep enough time for electing new leader
############################################################
# 4. Check alive members, lost t1(c1=2) record
# server 3 check
[connection server3]
select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;
CHANNEL_NAME  MEMBER_HOST  MEMBER_PORT  MEMBER_STATE  MEMBER_ROLE  MEMBER_VERSION  MEMBER_COMMUNICATION_STACK
```
  84. group_replication_applier NULL OFFLINE
  85. # server 3 lost t1(c1=2) record
  86. SELECT * FROM t1;
  87. c1
  88. 1
  89. # server 2 check
  90. [connection server2]
  91. select CHANNEL_NAME,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION,MEMBER_COMMUNICATION_STACK from performance_schema.replication_group_members order by MEMBER_PORT;
  92. CHANNEL_NAME MEMBER_HOST MEMBER_PORT MEMBER_STATE MEMBER_ROLE MEMBER_VERSION MEMBER_COMMUNICATION_STACK
  93. group_replication_applier NULL OFFLINE
  94. # server 2 lost t1(c1=2) record
  95. SELECT * FROM t1;
  96. c1
  97. 1

The logic of the data-loss reproduction case is as follows:

  1. MGR runs 3 nodes in single-primary mode, Server 1/2/3, with Server 1 as the primary; one record with c1=1 is initialized
  2. Fault injection: Server 2/3 are made to hang just before writing the relay log
  3. Connect to Server 1 and insert the record with c1=2; the transaction commit returns success to the client
  4. Mock a crash of Server 1 (machine failure, unrecoverable and unreachable); the remaining Servers 2/3 still constitute a majority
  5. Servers 2/3 are restarted normally (quick recovery), but they cannot bring the cluster back to an available state
  6. Connect to Servers 2/3 and query the table: only the record with c1=1 is visible, i.e. Servers 2/3 have lost c1=2

The above case shows that for MGR, when a majority of members fail and the primary is unavailable, data can be lost after the standbys are recovered (RPO <> 0): records whose commit had already been acknowledged to the client are gone.

For DN, reaching a majority requires the logs to be persisted on a majority of nodes, so even in the above scenario no data is lost and RPO=0 is guaranteed.
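One way to double-check RPO in this kind of scenario is to capture the GTID that the primary returns for the acknowledged commit, then test whether the surviving members contain it after recovery. The following is only a hedged sketch of that idea; the GTID value is illustrative and not taken from the test above:

  -- On the client session against server 1, before the crash:
  SET SESSION session_track_gtids = OWN_GTID;   -- server returns the GTID of each commit in the OK packet
  INSERT INTO t1 VALUES(2);
  -- Suppose the tracked GTID is 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:5' (illustrative value).

  -- After servers 2/3 are recovered, run on each of them:
  SELECT GTID_SUBSET('aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:5',
                     @@GLOBAL.gtid_executed) AS txn_survived;
  -- 1 means the acknowledged transaction survived (RPO=0); 0 means it was lost (RPO<>0).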

3.4. Standby database playback delay

In MySQL's traditional primary-standby mode, the standby generally runs an IO thread and Apply (SQL) threads. After the Paxos protocol is introduced, the IO thread is replaced by the protocol layer, which synchronizes binlogs between primary and standby, so the standby's replication lag mainly depends on the cost of Apply replay on the standby; we call this the standby replay delay.

We use sysbench in the oltp_write_only scenario and, under 100 concurrency, measure how long the standby takes to finish replaying different numbers of events. The standby replay delay is determined by polling the APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP column of the performance_schema.replication_applier_status_by_worker table, checking in real time whether each worker is still applying and whether replay has finished.
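In practice the check boils down to polling a query along these lines; this is a hedged sketch of the approach rather than the exact script used in the test (the lag expression is illustrative):

  SELECT WORKER_ID,
         APPLYING_TRANSACTION,
         APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
         TIMESTAMPDIFF(SECOND,
                       APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
                       NOW()) AS apply_lag_seconds
    FROM performance_schema.replication_applier_status_by_worker
   WHERE APPLYING_TRANSACTION <> '';
  -- Replay is considered finished once this query returns no rows,
  -- i.e. every worker reports an empty APPLYING_TRANSACTION.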
[Chart: standby replay delay for MGR vs. DN as the number of written events grows]

From the test results chart we can see:

  • For the same amount of written data, the DN standby finishes replaying all logs far faster than MGR: DN needs only 3% to 4% of the time MGR takes. This is critical for the timeliness of a primary-standby switchover.
  • As the write volume grows, DN's advantage over MGR in standby replay latency is maintained and remains very stable.
  • Analyzing the cause of MGR's replay delay: MGR's standby replay strategy keeps group_replication_consistency at its default EVENTUAL, i.e. RO and RW transactions do not wait for earlier transactions to be applied before executing. This maximizes primary write performance, but the standby delay becomes large (primary write performance is obtained by sacrificing standby delay and RPO=0; enabling MGR's flow control can balance performance against standby delay, but primary performance then suffers). A sketch of inspecting and tightening this consistency level follows this list.
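As a reference, here is a minimal sketch of inspecting and tightening that consistency level; the trade-off described above applies, and these are standard MySQL 8.0 settings rather than tuned recommendations:

  SELECT @@GLOBAL.group_replication_consistency;       -- default: EVENTUAL
  SET GLOBAL group_replication_consistency = 'AFTER';  -- RW transactions wait until all members have
                                                       -- applied them: fresher standbys and RPO=0,
                                                       -- at the cost of primary write throughput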

3.5. Test Summary

| Category | Item | MGR | DN |
| --- | --- | --- | --- |
| Performance | Read transactions | On par | On par |
| Performance | Write transactions | With RPO<>0, performance is lower than DN; with RPO=0, performance is far lower than DN; cross-city deployment drops performance significantly, by 27% to 82% | Write performance is far higher than MGR; cross-city deployment drops performance only slightly, by 4% to 37% |
| Performance | Jitter | Severe jitter, in the 6%~10% range | Relatively stable at about 3%, roughly half of MGR |
| RTO | Primary crash | Anomaly detected in 5 s; group shrinks to two nodes in 23 s | Anomaly detected in 5 s; group shrinks to two nodes in 8 s |
| RTO | Primary restart | Anomaly detected in 5 s; three nodes restored in 37 s | Anomaly detected in 5 s; three nodes restored in 15 s |
| RTO | Standby crash | Primary traffic drops to 0 for 20 s; group_replication_paxos_single_leader must be explicitly enabled to alleviate this | Primary stays highly available |
| RTO | Standby restart | Primary traffic drops to 0 for 10 s; even explicitly enabling group_replication_paxos_single_leader does not help | Primary stays highly available |
| RPO | Case reproduction | RPO<>0 when a majority of nodes crash; performance and RPO=0 cannot both be achieved | RPO=0 |
| Standby delay | Standby replay time | Primary-standby delay is very large; performance and low standby delay cannot both be achieved | Total replay time is about 4% of MGR's, i.e. roughly 25x faster |
| Parameters | Key parameters | Six parameters typically need tuning (see list below) | Default configuration works; no professional customization required |

MGR key parameters that need tuning:

  • group_replication_flow_control_mode: flow control is enabled by default and needs to be disabled to improve performance
  • replication_optimize_for_static_plugin_config: static plugin optimization is disabled by default and needs to be enabled to improve performance
  • group_replication_paxos_single_leader: disabled by default; needs to be enabled to improve primary stability when a standby crashes
  • group_replication_consistency: defaults to EVENTUAL, which does not guarantee RPO=0; it must be set to AFTER to force RPO=0
  • group_replication_transaction_size_limit: defaults to about 143 MB and needs to be increased for large transactions
  • binlog_transaction_dependency_tracking: defaults to COMMIT_ORDER and should be changed to WRITESET under MGR to improve standby replay performance
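For convenience, the MGR tuning implied by the list above can be written as the following hedged sketch; the values are illustrative and should be validated against the actual workload, and group_replication_paxos_single_leader is omitted because it only takes effect after a full group reboot and is therefore normally set in my.cnf:

  SET PERSIST group_replication_flow_control_mode = 'DISABLED';       -- turn off flow control
  SET PERSIST replication_optimize_for_static_plugin_config = ON;     -- enable static plugin optimization
  SET PERSIST group_replication_transaction_size_limit = 2147483647;  -- raise the large-transaction cap
  SET PERSIST binlog_transaction_dependency_tracking = 'WRITESET';    -- speed up standby replay
  SET PERSIST group_replication_consistency = 'AFTER';                -- only if RPO=0 must be enforced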

4. Conclusion

After this in-depth technical analysis and performance comparison, PolarDB-X DN, relying on its self-developed X-Paxos protocol and a series of optimized designs, shows clear advantages over MySQL MGR in performance, correctness, availability, and resource overhead. MGR still holds an important position in the MySQL ecosystem, but using it well requires dealing with standby-crash jitter, cross-datacenter disaster-recovery performance fluctuations, and stability issues, and therefore calls for a professional technology and operations team.

When facing large-scale, high-concurrency, high-availability requirements, the PolarDB-X storage engine, with its technical strengths and solid performance, is better suited than MGR for out-of-the-box use. The PolarDB-X centralized (standard) edition built on DN strikes a good balance between functionality and performance, making it a highly competitive database solution.