Using Percona MySQL 5.6 with slave_parallel_workers=5 on Debian 8. Sometimes GTID replication breaks and I don't know why. I thought that GTIDs are executed in consecutive order, but this is what the slave status shows:
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: d22.local
Master_User: xyz
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.039232
Read_Master_Log_Pos: 219044
Relay_Log_File: mysqld-relay-bin.072392
Relay_Log_Pos: 90640
Relay_Master_Log_File: mysql-bin.036196
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB: xyz_etl
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table xyz.sessions; Can't find record in 'sessions', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.036196, end_log_pos 78709552
Skip_Counter: 0
Exec_Master_Log_Pos: 78708927
Relay_Log_Space: 1337994488
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Update_rows event on table xyz.sessions; Can't find record in 'sessions', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.036196, end_log_pos 78709552
Replicate_Ignore_Server_Ids:
Master_Server_Id: 22
Master_UUID: 0e7b97a8-a689-11e5-8b79-901b0e8b0f53
Master_Info_File: /var/lib/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 161219 20:32:20
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 0e7b97a8-a689-11e5-8b79-901b0e8b0f53:60397-45157441
Executed_Gtid_Set: 0e7b97a8-a689-11e5-8b79-901b0e8b0f53:1-42679868:42679870-42679876:42679878-42679879:42679881-42679890:42679892-42679908:42679910:42679913:42679916-42679917:42679919-42679927:42679929-42679932:42679934:42679936:42679938-42679939:42679944:42679946-42679950:42679952-42679955:42679957-42679964:42679966:42679969-42679970:42679972:42679974-42679977:42679979-42679980:42679984-42679986:42679988-42679990:42679994-42679996:42679998:42680000-42680001:42680003-42680006:42680009-42680011:42680013-42680018:42680021:42680024:42680026:42680030:42680032:42680035:42680038,
aea3618e-bacf-11e6-9506-b8ca3a67f830:1-10937274
Auto_Position: 1
1 row in set (0.00 sec)
I'm a bit confused. slave_parallel_workers is set to 0 now. But the error reported above is for GTID 42679909 instead of 42679868 as expected. What's the reason for this? And what are the correct steps to fix a broken replication like the one above?
What I don't understand is that the transaction with GTID 42679869 could, theoretically, be executed without problems. But doing a STOP SLAVE; START SLAVE; does not process it?!
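To see exactly which transactions are missing, the gaps in Executed_Gtid_Set can be listed programmatically. The following is a minimal Python sketch, not part of the original setup: the function names `parse_intervals` and `find_gaps` are made up for illustration, and the input is a shortened excerpt of the Executed_Gtid_Set shown above.

```python
def parse_intervals(gtid_set, uuid):
    """Return the sorted (start, end) intervals executed for one server UUID."""
    # Entries for different UUIDs are separated by commas; SHOW SLAVE STATUS
    # may wrap them across lines, so newlines are removed first.
    for entry in gtid_set.replace("\n", "").split(","):
        parts = entry.strip().split(":")
        if parts[0] != uuid:
            continue
        intervals = []
        for item in parts[1:]:
            # "a-b" is a range, a bare "a" is a single transaction.
            start, _, end = item.partition("-")
            intervals.append((int(start), int(end or start)))
        return sorted(intervals)
    return []


def find_gaps(gtid_set, uuid):
    """List the transaction numbers missing between executed intervals."""
    gaps = []
    intervals = parse_intervals(gtid_set, uuid)
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        gaps.extend(range(prev_end + 1, next_start))
    return gaps


MASTER_UUID = "0e7b97a8-a689-11e5-8b79-901b0e8b0f53"
# Shortened excerpt of the Executed_Gtid_Set from the status output.
executed = (
    MASTER_UUID + ":1-42679868:42679870-42679876:42679878-42679879:"
    "42679881-42679890:42679892-42679908:42679910,\n"
    "aea3618e-bacf-11e6-9506-b8ca3a67f830:1-10937274"
)
print(find_gaps(executed, MASTER_UUID))
# prints [42679869, 42679877, 42679880, 42679891, 42679909]
```

For this excerpt the first missing transaction is 42679869, and 42679909 is only the fifth gap.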
To answer my own question and help others, here are the steps I took:
1. Set slave_parallel_workers=0.
2. Look at Executed_Gtid_Set only and handle the gaps in the GTID list one after another with STOP SLAVE; SET GTID_NEXT="[...]"; BEGIN; COMMIT; SET GTID_NEXT="AUTOMATIC"; START SLAVE;
3. Set slave_parallel_workers back to its previous value.
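The steps above can be sketched as a small Python helper that prints the statements to run on the slave for a list of missing GTIDs. This is only an illustration under some assumptions: the function name `empty_transaction_statements` is hypothetical, and it batches all gaps between one STOP SLAVE and one START SLAVE rather than restarting after every gap. Note that an empty transaction only marks the GTID as executed; the original changes are skipped, so the slave data may still differ from the master afterwards.

```python
def empty_transaction_statements(uuid, gaps):
    """Build the SQL that marks each missing GTID as executed on the slave."""
    statements = ["STOP SLAVE;"]
    for number in gaps:
        # An empty BEGIN/COMMIT pair changes no data but records the GTID.
        statements.append("SET GTID_NEXT='%s:%d';" % (uuid, number))
        statements.append("BEGIN; COMMIT;")
    # Hand GTID assignment back to the server before restarting replication.
    statements.append("SET GTID_NEXT='AUTOMATIC';")
    statements.append("START SLAVE;")
    return statements


MASTER_UUID = "0e7b97a8-a689-11e5-8b79-901b0e8b0f53"
# First two gaps from the Executed_Gtid_Set in the status output.
statements = empty_transaction_statements(MASTER_UUID, [42679869, 42679877])
print("\n".join(statements))
```

The output can be reviewed and then pasted into a mysql client session on the slave.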