Bulk line ending issue upgrading from tile38 1.29.1 to 1.33.4 #756
Comments
Found more details on this issue: it seems the problem arises after version 1.32.2. We have had to revert to that version while we look for a path to upgrade further.
To be clear, going from 1.32.2 to 1.32.3+ breaks your cluster? What OS are you using?
Essentially, when the version is 1.33.0, new nodes added to the cluster fail to catch up and keep hitting the bulk line ending issue. Note this is a cluster with a leader managed by sentinel nodes, running Amazon Linux 2 (AL2) from AWS on arm64.
In addition, when a node is master with v1.33.0 and a sentinel failover is triggered, the failover doesn't go smoothly in a live cluster and sometimes requires a restart of one of the nodes; either the followers or the master hang. We had to revert to 1.32.2. It seems that in 1.33.0 the protobuf and net libraries were updated.
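For readers not familiar with this kind of setup, here is a minimal sketch of the moving parts being described: a new node is told to replicate with FOLLOW, and application clients reach the current leader through Sentinel. The addresses, ports, and master name below are placeholders rather than our actual configuration, and go-redis is just one RESP client that can talk to Tile38:

```go
// Minimal sketch (placeholder addresses/ports/master name): attach a new
// Tile38 follower with FOLLOW and resolve the current leader via Sentinel.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Tell the new Tile38 node to replicate from the current leader.
	follower := redis.NewClient(&redis.Options{Addr: "10.0.0.12:9851"})
	if err := follower.Do(ctx, "FOLLOW", "10.0.0.10", 9851).Err(); err != nil {
		panic(err)
	}

	// Application clients discover the leader through Sentinel, so a
	// failover transparently repoints them at the newly promoted master.
	leader := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "tile38-master",             // placeholder master name
		SentinelAddrs: []string{"10.0.0.20:26379"}, // placeholder sentinel address
	})
	fmt.Println(leader.Do(ctx, "SERVER").Val())
}
```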
Thanks for the additional context. That's very helpful. I took a look at the diff between 1.32.2 and 1.33 and didn't find any obvious change that would cause a break like that. Some dependencies were updated, as you mentioned, and the workflow file was bumped to the latest Go version, 1.22. I don't recall encountering problems when upgrading those dependencies before. I'll try to reproduce the issue on my side. Are you using the Docker builds?
Not using the Docker builds. The way we fixed the bulk line ending issue and the failover issue was to downgrade as follows, during a maintenance window:
Note: I also went through the AOF format and ran an AOF validity check; our AOF passed the check.
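For context, Tile38 persists its AOF as a stream of RESP-encoded commands, so a crude framing check can be done with a short program along these lines. This is a rough sketch rather than the exact check we ran; the file path is a placeholder and it only verifies that every bulk string is terminated by \r\n, which is what the "invalid bulk line ending" error complains about:

```go
// Rough sketch of an AOF framing check: walk the file as a stream of RESP
// arrays of bulk strings and report the first place the framing breaks
// (e.g. a bulk string not terminated by \r\n). Placeholder file path.
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("appendonly.aof") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := bufio.NewReader(f)
	for cmd := 0; ; cmd++ {
		line, err := r.ReadString('\n')
		if err == io.EOF && line == "" {
			fmt.Println("ok:", cmd, "commands")
			return
		}
		if !strings.HasPrefix(line, "*") || !strings.HasSuffix(line, "\r\n") {
			panic(fmt.Sprintf("bad array header at command %d: %q", cmd, line))
		}
		n, _ := strconv.Atoi(strings.TrimSuffix(line[1:], "\r\n"))
		for i := 0; i < n; i++ {
			hdr, _ := r.ReadString('\n')
			if !strings.HasPrefix(hdr, "$") || !strings.HasSuffix(hdr, "\r\n") {
				panic(fmt.Sprintf("bad bulk header at command %d: %q", cmd, hdr))
			}
			size, _ := strconv.Atoi(strings.TrimSuffix(hdr[1:], "\r\n"))
			buf := make([]byte, size+2) // payload plus trailing \r\n
			if _, err := io.ReadFull(r, buf); err != nil {
				panic(err)
			}
			if string(buf[size:]) != "\r\n" {
				panic(fmt.Sprintf("invalid bulk line ending at command %d", cmd))
			}
		}
	}
}
```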
Unfortunately I still haven't been able to reproduce the issue. I tried upgrading and downgrading between 1.32, 1.33, and 1.34 as described above. I also tried other things that I thought might cause the issue, but to no avail.
Hi, during our (Samar's and mine; same company) experiments I created the following test scripts, which download and install a specified version of Tile38. The goal was to iterate over each version and find where the error message starts popping up. I am attaching the zip file here. HashiCorp Consul is also needed for the first-time leader-election logic in the script (start it in another shell). The experiment, in brief, is to run the provided scripts, start the three tile38-servers and their sentinels, and observe the outputs of all three servers. Then observe what happens when you press CTRL+C on the current leader: how the sentinels do their thing to switch the leader and how the error messages pop up. On this single-machine test setup, versions 1.33.1 (one thirty-three one) and onwards caused the error log quite predictably:
Hi Shantanu, after a little tweaking I was able to get the test scripts to run.
Once I got the cluster running I started pumping data into it, then killed the leader and let the sentinel switch to a new leader. I then began pumping data to the new leader, switched the previously killed server back on, and let the sentinel bring the cluster back up to a three-node system, all while data was still being pushed to the new leader. I did get some "Protocol error: invalid bulk line ending" errors as shown, but the follower remedied the issue and finally caught up. The output you show has "[INFO] caught up" at the end too, which should mean that the follower is fixed and ready to use.

This isn't exactly the same issue that Samar described. In Samar's output there are additional errors such as "follow: invalid argument 'obje*3'". It looks like in that case there may have been corruption in network transit or on disk, but it's difficult to tell. I also noticed "aof shrink ended" and "reloading aof commands". These indicate that the follower may have run an AOFSHRINK before restarting the process. I tried the same: ran AOFSHRINK, then killed the follower and restarted. And sure enough I got some parsing errors:
That went on for 22 seconds until:
The follower ended up healing itself and the cluster came back up again. Did the follower in the original issue, as described by Samar, ever get "caught up"? And if so, how long did it take? If not, did the parse error messages just continue for a very long time before you killed the process?
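For anyone trying to reproduce the above, the data-pumping and AOFSHRINK steps can be driven from any RESP client; here is a rough sketch (addresses, key names, and counts are placeholders, and killing/restarting the follower process is still done by hand):

```go
// Rough reproduction sketch: write points to the leader, then ask the
// follower to compact its AOF before it is killed and restarted by hand.
// Addresses, key names, and counts are placeholders.
package main

import (
	"context"
	"fmt"
	"math/rand"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	leader := redis.NewClient(&redis.Options{Addr: "127.0.0.1:9851"})
	follower := redis.NewClient(&redis.Options{Addr: "127.0.0.1:9852"})

	// Pump data into the leader so the followers have something to replicate.
	for i := 0; i < 100000; i++ {
		lat := rand.Float64()*180 - 90
		lon := rand.Float64()*360 - 180
		if err := leader.Do(ctx, "SET", "fleet", fmt.Sprintf("truck:%d", i),
			"POINT", lat, lon).Err(); err != nil {
			panic(err)
		}
	}

	// Compact the follower's AOF, then kill and restart the follower process
	// by hand to see whether it reconnects and catches up cleanly.
	fmt.Println(follower.Do(ctx, "AOFSHRINK").Val())
}
```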
Thanks for clarifying!
I think we can discuss internally and post replies here for further investigation.
@tidwall the follower in some cases did catch up, but other times it would keep loading the AOF perpetually: it would pull the AOF file, hit shrink issues, then start again from scratch. I noticed the AOF file would go from 1 GB to almost complete (~6-8 GB), then back to 1 GB, sometimes taking anywhere from minutes up to hours. If I left it running overnight some of the added followers would eventually join, but it's not guaranteed. Also, there are really two issues here:
I think you guys have seen the first issue; let me recreate the second and describe it here step by step.
Describe the bug
We upgraded from Tile38 version 1.29.1 to 1.33.4 by adding 1.33.4 nodes to the existing cluster and removing old nodes one at a time, ensuring the new nodes were caught up before removing the old ones. To replace the master we triggered a failover from sentinel and then removed the old master. After two hours of the nodes being stable and caught up, we noticed that latency had increased significantly and saw the error "follow: Protocol error: invalid bulk line ending" on the follower nodes. We had to bring down the follower nodes to stabilize the cluster, and every time we try to bring up follower nodes we get the same error. If I copy the AOF to a new cluster and add followers, the error is not there. Are there any steps we can take to allow following on the existing cluster?
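For illustration, a catch-up check along these lines can be scripted against each new follower before removing an old node. This is a minimal sketch: the address is a placeholder and the "caught_up" field name should be verified against the SERVER output of the Tile38 version in use:

```go
// Sketch of a rolling-upgrade catch-up check: poll a follower until it
// reports that it has caught up with the leader. The "caught_up" field
// name is an assumption; check your Tile38 version's SERVER output.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func caughtUp(ctx context.Context, addr string) (bool, error) {
	c := redis.NewClient(&redis.Options{Addr: addr})
	defer c.Close()

	// Switch the connection to JSON replies, then read the server stats.
	if err := c.Do(ctx, "OUTPUT", "json").Err(); err != nil {
		return false, err
	}
	raw, err := c.Do(ctx, "SERVER").Text()
	if err != nil {
		return false, err
	}
	var resp struct {
		Stats struct {
			CaughtUp bool `json:"caught_up"`
		} `json:"stats"`
	}
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		return false, err
	}
	return resp.Stats.CaughtUp, nil
}

func main() {
	ctx := context.Background()
	for {
		ok, err := caughtUp(ctx, "10.0.0.12:9851") // placeholder follower address
		fmt.Println("caught up:", ok, err)
		if ok {
			return
		}
		time.Sleep(2 * time.Second)
	}
}
```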
To Reproduce
Steps to reproduce the behavior:
Expected behavior
When follower nodes are added, they should connect to the master and catch up.
Logs
![IMG_8995 — screenshot of the follower error logs](https://private-user-images.githubusercontent.com/20006218/386441179-7e0e6216-404e-40ba-b3cd-87101404dc79.png)