I’ve had this error occur when something went terribly wrong on a master server and it had to be rebooted. It appears that the master’s binary log file was not completely written to the server before it was rebooted. However part of the master log file was transmitted to the slave before that time. The error is coming from the slave server when it tried to reconnect to the master. It is trying to connect to the master server and start copying the binary log file from a position which the master does not have.
The proper way to fix this error is to completely resync the data from the master and begin replication over again. However, in many cases, that is impractical, or not worth the hassle, so I was able to fix the problem by setting the slave back to a valid position on the master and then skipping forward over the missing entries.
First, you need to find out what the last valid position is on the master. Run ‘SHOW SLAVE STATUS’ on the slave to figure out which master log file it was reading when the master crashed. You’ll need the value of the Relay_Master_Log_File parameter. Also take note of the Exec_Master_Log_Pos value.
Next, go to the master, and ensure that log file exists in the MySQL data directory. The file size will probably be less than the Exec_Master_Log_Pos value that the slave had stored. To find the last valid position in the log file use the mysqlbinlog utility. The command
mysqlbinlog [LOGFILE] |tail -n 100
should show the final entries in the binary log. Find the last line in the file that looks similar to this:
#120512 13:30:44 server id 3 end_log_pos 8796393 Query thread_id=138884124 exec_time=0 error_code=0
The number after end_log_pos is the last valid entry that the master has available. The difference between the master log position that the slave had, and this highest one on the master will give you some idea how far ahead the slave got before the master crashed.
You can now go back to the slave and issue the commands:
STOP SLAVE; CHANGE MASTER TO MASTER_LOG_POS=8796393; START SLAVE;
This will tell the slave to try to start from that valid position, and continue from there.
There is a strong possibility that you’ll run into some duplicate key errors if the slave was very far ahead of where the master’s log ended. In that case, you can issue this command to bypass those one at a time (or more if you want to skip more than one)
STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
!! Note that if you have to do this, then your data between the master and slave is almost certainly inconsistent !! Be sure to understand your data to make sure that this is something you can live with.
I’ve had to deal with these issues as well when I lost a relay server, a server that sits between the master and multiple slaves. In my case, once I had brought up the new relay server, the end slaves were at a completely different position on the old relay server. By looking at the bin logs on the slaves, I was able to find the last transaction they ran and then able to find the same transaction on the new relay server’s bin logs (once it was synced up with the master again). In this case you will most likely also need to specify a new log file in addition to the position (MASTER_LOG_FILE = ‘master_log_name’) in order to get them replicating again.
Another hint for people is to use the “show processlist” command on both the slaves and master. On the master, it shows which slaves are connected and whether it has sent all of the changes yet. On the slave, it shows what transactions it is executing as it tries to catchup with the master. Also, the “Time” column shows how many seconds behind the master it is. Once it has caught up, that column shows how long it has been in sync.