MySQL Galera Cluster Troubleshooting
MariaDB and Percona MySQL use Galera Cluster to achieve high availability.
A cluster needs a minimum of three nodes so that a quorum of voting nodes can decide whether a node is unreachable or its data is out of sync, and evict it from the cluster.
Galera Cluster is based on synchronous replication: every node contains the same data (there is no sharding), and every commit must be acknowledged by all nodes before the transaction is saved.
This means that Galera Cluster scales reads, not writes.
When a user connects to the Galera Cluster through a load balancer, the connection is routed to a specific node and all of that session's queries run on that node.
Best practices and limitations:
- Use small transactions rather than large ones.
- Every table should have a primary key – otherwise replication will not work properly.
- Only tables that use the InnoDB engine are supported. MySQL has several storage engines for different use cases; InnoDB is the transactional, ACID-compliant engine. (A check for this and the primary-key rule is sketched after this list.)
- Lock commands are not supported – every MySQL instance runs standalone and only replicates to the other nodes, so locking records on one instance does not lock them on the other instances.
- Synchronous replication requires a fast and stable network between the nodes.
- The query cache is not supported, and it was removed entirely in MySQL 8 (the query cache stores query results and returns them immediately for identical queries).
- Every cluster has a load balancer in front, which directs connections to the active nodes. The load balancer best practice is one primary node, one passive node (used if the primary is down) and a third node for backups; this avoids many transaction conflicts between the nodes.
- Every user query runs on a specific node, so the slow log and the error log live on that machine; to investigate a query you need to check the node it ran on (see the sketch after this list for locating that node and its log files).
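As a quick way to verify the primary-key and InnoDB requirements above, the following sketch queries information_schema for tables that break either rule. It assumes the mysql command-line client is installed and can connect with the current user's credentials; adjust the excluded schemas as needed.

```sh
# List user tables that either do not use InnoDB or have no primary key.
mysql -e "
SELECT t.table_schema, t.table_name, t.engine
FROM   information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON  c.table_schema    = t.table_schema
       AND c.table_name      = t.table_name
       AND c.constraint_type = 'PRIMARY KEY'
WHERE  t.table_type = 'BASE TABLE'
  AND  t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
  AND (t.engine <> 'InnoDB' OR c.constraint_name IS NULL);"
```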
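Because logs are per node, the first step of an investigation is finding out which node the session actually landed on and where that node keeps its log files. A minimal sketch, assuming the mysql client connects through the load balancer and the instance is a Galera build (wsrep_node_name only exists there):

```sh
# Which node did the load balancer send this session to?
mysql -e "SELECT @@hostname AS host, @@wsrep_node_name AS galera_node;"

# On that machine, locate the error log and slow query log files.
mysql -e "SHOW VARIABLES WHERE Variable_name IN ('log_error', 'slow_query_log_file');"
```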
Fault and Recovery cases
- Server or MySQL service is down – if the MySQL service or the server is down for a short time, then after starting the server/service the Galera cluster will quickly sync this node from the write-set cache (gcache) of the other nodes. The instance will not accept connections until it is fully synced. This type of sync is called IST – Incremental State Transfer.
- Long shutdown or adding a new node – if the node was down for a long time, or when adding a new node, a full sync is needed: SST – State Snapshot Transfer.
  Delete the content of /var/lib/mysql/* and start the node; it will be synced from one of the other nodes (backup or standby), which acts as the donor and also does not accept connections while donating. (A recovery sketch follows this list.)
- Node is rejected by the cluster – if a node is rejected from the cluster due to inconsistent data, a network issue or another problem, repeat the SST procedure above to recover the node.
- MySQL error logs – the MySQL error log is located at /var/log/mysql/error.log.
- Galera cluster status – to check the Galera cluster status, run SHOW STATUS LIKE '%wsrep%' on the instance being checked (see the example after this list). The important status variables are:
  - wsrep_local_state_comment – should be Synced; other possible values are Donor/Desynced (the node is serving as a donor) and Joining/Joined (the node is joining the cluster).
  - wsrep_incoming_addresses – lists the addresses of all three nodes, e.g. 10.10.10.1, 10.10.10.2 and 10.10.10.3.
  - wsrep_cluster_status – should be Primary.
- MySQL config files –
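The SST recovery described above boils down to wiping the local data directory and letting the node pull a full copy from a donor. A minimal sketch, assuming a systemd-managed service named mysql, the default /var/lib/mysql data directory and the error log path mentioned above; run it on the failed or new node only, never on the last healthy node:

```sh
sudo systemctl stop mysql        # make sure the local instance is fully down
sudo rm -rf /var/lib/mysql/*     # remove the local data so a full SST is forced
sudo systemctl start mysql       # the node rejoins and copies everything from a donor

# Neither the joiner nor its donor accepts connections until the SST completes;
# follow the progress in the error log.
sudo tail -f /var/log/mysql/error.log
```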
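To verify the status variables described above (for example, after a node has rejoined via IST or finished an SST), run the SHOW STATUS command on the node being checked. A minimal sketch, assuming the mysql client can connect locally:

```sh
# Full wsrep status, as in the command above.
mysql -e "SHOW GLOBAL STATUS LIKE '%wsrep%';"

# Or just the key variables: the node should report Synced and the cluster Primary,
# with all node addresses listed.
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
          ('wsrep_local_state_comment', 'wsrep_incoming_addresses', 'wsrep_cluster_status');"
```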