-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix concurrency issues #993
Conversation
To work around this issue when subscriptions are inactive for more than 5 minutes: NomicFoundation/hardhat#2053 Use 100 millisecond polling; default polling interval of 4 seconds is too close to the 5 second timeout for `check eventually`.
confirm(0) doesn't wait at all, confirm(1) waits for the transaction to be mined
includes fixes for http polling and .confirm()
allow for a bit more time to withdraw funds
to ensure that the transaction has been processed before continuing with the test
there were two logic errors in this test: - a slot is freed anyway at the end of the contract - when starting the request takes a long time, the first slot can already be freed because there were too many missing proofs
currentTime() doesn't always correctly reflect the time of the next transaction
otherwise the windows runner in the CI won't be able to start the request before it expires
allow for a bit more time for a request to be submitted
d239164
to
ec962a1
Compare
windows ci is so slow, it can take up to 40 seconds just to submit a storage request to hardhat
ea0e636
to
523e8cb
Compare
reason: with the increased period length of 90 seconds, it can take longer to wait for a stable challenge at the beginning of a period.
apparently it takes windows 2-3 seconds to resolve "localhost" to 127.0.0.1 for every json-rpc connection that we make 🤦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm expecting this to make our tests even slower. But it does seem correct.
I've performed some tests to check if slowness on Windows can be related to IPv6. Results
🟢 - IPv6 is enabled or primary detailsServer
Commands# Clone
git clone https://github.com/codex-storage/nim-codex.git repos/nim-codex && cd repos/nim-codex
# Branch
git switch -C fix-concurrency-issues origin/fix-concurrency-issues
# Update
time make -j 8 update # 7m55.096s
# Build
time make -j 8 # 8m36.045s
# Tests
time make -j8 testAll # 66m21.431s
# --> localhost
find tests -type f -name "*.nim" -exec sed -i 's/127.0.0.1:8545/localhost:8545/g' {} \;
grep -r 'localhost:8545' tests
# --> 127.0.0.1
find tests -type f -name "*.nim" -exec sed -i 's/localhost:8545/127.0.0.1:8545/g' {} \;
grep -r '127.0.0.1:8545' tests Disable IPv6Guidance for configuring IPv6 in Windows for advanced users :: Prefer IPv4 over IPv6 in prefix policies
netsh interface ipv6 show prefixpolicies
netsh interface ipv6 set prefix ::/96 60 3
netsh interface ipv6 set prefix ::ffff:0:0/96 55 4
:: Permanently
:: Prefer IPv4 over IPv6
:: reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents /t REG_DWORD /d 0x20 /f
:: Disable IPv6
:: reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents /t REG_DWORD /d 0xFF /f
:: restart
:: shutdown -r -t 0
:: reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents :: check
ping localhost -n 1
:: Pinging WIN-IP4MNH4JT01 [127.0.0.1] with 32 bytes of data:
:: Reply from 127.0.0.1: bytes=32 time<1ms TTL=128 Enable IPv6:: Prefer IPv6 over IPv4 in prefix policies
netsh interface ipv6 show prefixpolicies
netsh interface ipv6 set prefix ::/96 1 3 store=persistent
netsh interface ipv6 set prefix ::ffff:0:0/96 35 4 store=persistent
:: delete the key
:: reg delete HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents /f
:: restart
:: shutdown -r -t 0
:: reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents :: check
ping localhost -n 1
:: Pinging WIN-IP4MNH4JT01 [::1] with 32 bytes of data:
:: Reply from ::1: time<1ms RPC request time# localhost - 0m0.326s
time curl -m 5 -X POST \
-s http://localhost:8545 \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
# 127.0.0.1 - 0m0.116s
time curl -m 5 -X POST \
-s http://127.0.0.1:8545 \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' We can fix that for CI, but it will require to update the docs and perform one more step, locally as well. - name: Prefer IPv4 over IPv6 in prefix policies
if: matrix.os == 'windows'
run: |
netsh interface ipv6 set prefix ::/96 60 3
netsh interface ipv6 set prefix ::ffff:0:0/96 55 4 |
Thanks @veaceslavdoina, interesting to see that it is indeed IPv6 related, and not just DNS.
I prefer to keep the windows environment on the CI as it is, because it reflects how a windows machine works out of the box. To ensure that users of Codex do not run into this issue, I think I'd rather update the docs and our CLI defaults to replace |
I agree that this is useful, but it means that tests take 3x as long, from sub-1h to 2.5h. In my opinion, that is a substantial tradeoff. Perhaps we could run a single test using default Windows settings, to cover the default Windows settings case and not slowdown the rest of the Windows tests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, well done mark 👏
nodes = 3, | ||
tolerance = 1).get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 I know this is probably faster but having 5 nodes has picked up on some issues in the past (around erasure coding).
It should be fast now, as we are using |
Still, integration tests on the last commit are taking 15-22 mins longer than macos and linux, respectively. (53m on Linux, 1h on macos, 1h15m on Windows). According to your data, disabling IPv6 would save around 10 minutes. |
I had same impression and did a test on GitHub Actions. Branch was based on the PR branch, but it was deleted already. Just a note, the tests above were performed on a Cloud VM, not GitHub Actions.
Just created a new branch again and run a new test, let's see the results. Will check the speed of |
The slow connection to localhost is solved in this PR by ensuring that we specify 127.0.0.1 instead. The tests are slower to run because we now do polling instead of websockets and because we now have a proof period of 90 seconds instead of 60 seconds. The good news is that the windows test runs are now no longer the bottleneck. So we could split the CI test runs on Mac and Linux just like we did for Windows and probably reduce the overall CI time to about 30 minutes. |
Btw, CI run on Linux fails now
nim-codex/.github/actions/nimbus-build-system/action.yml Lines 15 to 17 in 9b7f3f4
|
Probably just in some cases, because as it shown on my screenshot, integration tests on Windows are still For some reason, in some cases all tests on Linux and macOS might take |
CI configuration
Results
Conclussions
As Mark mentioned, it make sense to try to split tests for Linux and mcOS like we do it for Windows. Next stepsPerformed tests does not answer the question if IPv4/IPv6 settings improve run time on Windows with the IP over hostname we already using for RPC calls. Now we are performing next set of runs to check
|
Tests
🟢 - IPv6 is primary Conclusion
Based on the Usage limits, we can run concurrently Next stepsI will create a PR with CI split for Linux and macOS. |
@markspanbroek let's merge this so we can try it in a release? |
* Use http subscriptions instead of websocket for tests To work around this issue when subscriptions are inactive for more than 5 minutes: NomicFoundation/hardhat#2053 Use 100 millisecond polling; default polling interval of 4 seconds is too close to the 5 second timeout for `check eventually`. * use .confirm(1) instead of confirm(0) confirm(0) doesn't wait at all, confirm(1) waits for the transaction to be mined * speed up partial payout integration test * update nim-ethers to version 0.10.0 includes fixes for http polling and .confirm() * fix timing of marketplace tests allow for a bit more time to withdraw funds * use .confirm(1) in marketplace tests to ensure that the transaction has been processed before continuing with the test * fix timing issue in validation unit test * fix proof integration test there were two logic errors in this test: - a slot is freed anyway at the end of the contract - when starting the request takes a long time, the first slot can already be freed because there were too many missing proofs * fix intermittent error in contract tests currentTime() doesn't always correctly reflect the time of the next transaction * reduce number of slots in integration test otherwise the windows runner in the CI won't be able to start the request before it expires * fix timing in purchasing test allow for a bit more time for a request to be submitted * fix timing of request submission in test windows ci is so slow, it can take up to 40 seconds just to submit a storage request to hardhat * increase proof period to 90 seconds * adjust timing of integration tests reason: with the increased period length of 90 seconds, it can take longer to wait for a stable challenge at the beginning of a period. * increase CI timeout to 2 hours * Fix slow builds on windows apparently it takes windows 2-3 seconds to resolve "localhost" to 127.0.0.1 for every json-rpc connection that we make 🤦
Switches nim-ethers to http subscriptions instead of websocket subscriptions. This changes the timing of subscriptions and unearths some concurrency issues. This PR fixes those issues.
Includes:
.confirm(1)
, previously we used.confirm(0)
, which didn't workThis is split off from the work in #988