How to deal with missing snapshot after ZooKeeper upgrade

Deal with a missing snapshot after upgrading ZooKeeper from version 3.4 to 3.5 or later.

I encountered the following error after upgrading ZooKeeper from 3.4.13 to 3.5.9.

Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]: [2021-08-03 22:24:07,002] ERROR Unable to load database on disk (org.apache.zookeeper.server.quorum.QuorumPeer)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]: java.io.IOException: No snapshot found, but there are log entries. Something is broken!
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:904)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:890)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]: [2021-08-03 22:24:07,004] ERROR Unexpected exception, exiting abnormally (org.apache.zookeeper.server.quorum.QuorumPeerMain)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]: java.lang.RuntimeException: Unable to run quorum server
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:941)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:890)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]: Caused by: java.io.IOException: No snapshot found, but there are log entries. Something is broken!
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:904)
Aug 03 22:24:07 kafka2.example.org zookeeper-server-start.sh[62023]:         ... 4 more

The solution is to append the snapshot.trust.empty=true option to the configuration, which makes ZooKeeper skip this check.

$ sudo -u kafka cat /opt/kafka/kafka/config/zookeeper.properties
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/kafka/zookeeper_data
clientPort=2181
server.1=kafka1.example.org:2888:3888
server.2=kafka2.example.org:2888:3888
server.3=kafka3.example.org:2888:3888
snapshot.trust.empty=true
4lw.commands.whitelist=*
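
When the same change has to be rolled out to every node in the ensemble, it helps to append the option idempotently so that it is never added twice. The following is only a minimal sketch: the helper name ensure_trust_empty is hypothetical, and it assumes the calling user can write to the properties file.

```shell
# Hypothetical helper: append snapshot.trust.empty=true to a ZooKeeper
# properties file unless the key is already set there.
ensure_trust_empty() {
    config_file="$1"
    if grep -q '^snapshot\.trust\.empty=' "$config_file"; then
        # Key already present (with whatever value); leave the file untouched.
        return 0
    fi
    # Make sure the new key starts on its own line if the file
    # does not end with a newline.
    if [ -n "$(tail -c 1 "$config_file")" ]; then
        printf '\n' >> "$config_file"
    fi
    printf '%s\n' 'snapshot.trust.empty=true' >> "$config_file"
}

# Usage (run as the kafka user, path as in the listing above):
#   ensure_trust_empty /opt/kafka/kafka/config/zookeeper.properties
```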

Restart ZooKeeper for the change to take effect.
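
Because the configuration above whitelists all four-letter-word commands, the srvr command can be used to confirm that the node came back up and rejoined the quorum. This is a sketch that assumes nc is installed and that the client port is 2181, as configured above; the helper name zk_mode is hypothetical.

```shell
# Hypothetical helper: read "srvr" output on stdin and print the value of
# its "Mode:" line (for example leader, follower or standalone).
zk_mode() {
    sed -n 's/^Mode: //p'
}

# Usage:
#   echo srvr | nc localhost 2181 | zk_mode
```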

A snapshot is created automatically when the node becomes the leader or once snapCount transactions have been logged (100,000 by default), so do not remove this option immediately after the upgrade.
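
Before removing the option, it is worth checking that a snapshot file actually exists. ZooKeeper writes snapshots as snapshot.<zxid> files in the version-2 subdirectory of dataDir. The check below is a sketch; the helper name has_snapshot is hypothetical.

```shell
# Hypothetical helper: succeed if at least one snapshot file exists
# under <dataDir>/version-2.
has_snapshot() {
    data_dir="$1"
    for f in "$data_dir"/version-2/snapshot.*; do
        # An unmatched glob stays literal, so test that the file really exists.
        if [ -f "$f" ]; then
            return 0
        fi
    done
    return 1
}

# Usage (dataDir as in the configuration above):
#   has_snapshot /opt/kafka/zookeeper_data && echo "snapshot present"
```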

Please read "Fails to load database with missing snapshot file but valid transaction log file" for more information.