As an extension of my post yesterday about skipping corrupt queries in the relay log, I found out that my problem is due to some network problems between the servers which triggers a MySQL bug.
The connection and replication errors in my MySQL log looks like this:
080930 12:26:52 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013) 080930 12:26:52 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000249' position 747239037 080930 12:26:53 [Note] Slave: connected to master 'replicate@mysqltunnel:13306',replication resumed in log 'mysql-bin.000249' at position 747239037 080930 13:18:49 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013) 080930 13:18:49 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000249' position 783782486 080930 13:18:49 [ERROR] Slave: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '!' at line 6' on query. Default database: 'database'. Query: 'INSERT INTO `sometable` SET somecol = 3, comeothercol = 8, othervalue = NULL!', Error_code: 1064 080930 13:18:49 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000249' position 783781942 080930 13:18:50 [Note] Slave: connected to master 'replicate@mysqltunnel:13306', replication resumed in log 'mysql-bin.000249' at position 783782486
When there are network problems between the server, there was some issue where the master didn’t properly detect and notify the slave of the failure. This resulted in parts of queries missing, duplicated, or replaced by random bits in the relay log on the slave. When the slave tries to execute the corrupt query, it will likely generate an error that begins with:
Error You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near . .
This bug has been fixed in MySQL releases since February 2008, but still hasn’t made its way into the CentOS 5 repositories. Until then, that bug report contained a work-around which forces the slave to re-request the binary log from the master. Run ‘SHOW SLAVE STATUS’ and make note of the Master_Log_File and Exec_Master_Log_Pos columns. Then run ‘STOP SLAVE’ to suspend replication, and run this SQL:
CHANGE MASTER TO master_log_file='<Value from Relay_Master_Log_File>', master_log_pos=<Value from Exec_master_log_pos>;
After that, simply run ‘START SLAVE’ to have replication pick up from there again. That evidently has the slave re-request the rest of the master’s binary log, which it should (hopefully) get without becoming corrupt, and replication will continue as normal.
I guess the network connection between my servers is problematic lately. I’ve had to fix this several times in the past couple days. If that keeps up, I may add this fix to my Replication checking-script until I’m able to upgrade to a version of MySQL that contains this fix.
(I know it’s an old post, but your post turns up at the top of google results when searching for relay log corruption .. so maybe someone else can benefit from this)
One correction to the solution: be sure to use the Relay_Master_Log_File value instead of the Master_Log_File (for an explanation see: http://www.mysqlperformanceblog.com/2008/07/07/how-show-slave-status-relates-to-change-master-to/)
In our case the relay log was corrupted, but reading from the binary log of the master continued. As we noticed this problem only a few days after the corruption occured the Relay_Master_Log_File and Master_Log_File where completely different.
This solved my problem, Thank you.
That fixed it. Thank you sir!