Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiprocessing: Use a private context for set_start_method #1270

Closed

Conversation

zmedico
Copy link
Member

@zmedico zmedico commented Feb 14, 2024

Wrap the multiprocessing module to return a private multiprocessing context instead. It's safe to call set_start_method on this and it will only affect portage internals, which makes it compatible with the pytest-xdist plugin.

Bug: https://bugs.gentoo.org/924416

Seeing pytest-xdist worker crashes less frequently now:
https://github.com/gentoo/portage/actions/runs/7897425289/job/21552979356

2024-02-14T06:42:44.7157130Z =================================== FAILURES ===================================
2024-02-14T06:42:44.7157559Z ______________ lib/portage/tests/process/test_spawn_returnproc.py ______________
2024-02-14T06:42:44.7157935Z [gw1] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python
2024-02-14T06:42:44.7158702Z worker 'gw1' crashed while running 'lib/portage/tests/process/test_spawn_returnproc.py::SpawnReturnProcTestCase::testSpawnReturnProcWait'

https://github.com/gentoo/portage/actions/runs/7897425289/job/21553620257?pr=1270

2024-02-14T07:14:15.4605629Z =================================== FAILURES ===================================
2024-02-14T07:14:15.4605883Z ______________ lib/portage/tests/locks/test_asynchronous_lock.py _______________
2024-02-14T07:14:15.4606199Z [gw1] linux -- Python 3.12.2 /opt/hostedtoolcache/Python/3.12.2/x64/bin/python
2024-02-14T07:14:15.4606935Z worker 'gw1' crashed while running 'lib/portage/tests/locks/test_asynchronous_lock.py::AsynchronousLockTestCase::testAsynchronousLockWaitKill'

Maybe pytest-dev/pytest-rerunfailures#158 could be a solution at this point, since it handles pytest-xdist -n worker crashes.

NOTE: Temporarily added the commits from #1267 to see if they trigger more pytest-xdist runner crashes.

The egencache usage in ResolverPlayground that was used to trigger
bug 924319 triggered a pickling error for AuxdbTestCase.test_anydbm
with the multiprocessing spawn start method, so fix the anydbm
cache module to omit the unpicklable database object from pickled
state, and regenerate it after unpickling.

Bug: https://bugs.gentoo.org/924319
Signed-off-by: Zac Medico <[email protected]>
Use a deallocate_config future to release self.settings when
it is no longer needed. It's necessary to manage concurrency
since commit c95fc64 because mutation of self.settings
is no longer limited to the EbuildMetadataPhase _start method,
where exclusive access was guaranteed within the main thread.

Add support to the isAlive() method to detect when the
EbuildMetadataPhase has started but the pid is not yet
available (due to async_check_locale usage from commit
c95fc64). This can be used to check if an
EbuildMetadataPhase instance has been successfully started
so that it can be relied upon to set the result of the
deallocate_config future.

Bug: https://bugs.gentoo.org/924319
Signed-off-by: Zac Medico <[email protected]>
For EbuildMetadataPhase consumers like MetadataRegen and
depgraph, store a pool of config instances in a config_pool
list, and return instaces to the list when the deallocate_config
future is done.

Bug: https://bugs.gentoo.org/924319
Fixes: c95fc64 ("EbuildPhase: async_check_locale")
Signed-off-by: Zac Medico <[email protected]>
Make the ResolverPlayground _create_ebuild_manifests method
call egencache --jobs, which reliably triggers the KeyError
from bug 924319 for multiple tests:

lib/portage/tests/bin/test_doins.py::DoIns::testDoInsFallback Exception in callback EbuildMetadataPhase._async_start_done(<Task finishe...Error('EAPI')>)
handle: <Handle EbuildMetadataPhase._async_start_done(<Task finishe...Error('EAPI')>)>
Traceback (most recent call last):
  File "/usr/lib/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "lib/_emerge/EbuildMetadataPhase.py", line 154, in _async_start_done
    future.cancelled() or future.result()
                          ^^^^^^^^^^^^^^^
  File "lib/_emerge/EbuildMetadataPhase.py", line 130, in _async_start
    retval = portage.doebuild(
             ^^^^^^^^^^^^^^^^^
  File "lib/portage/package/ebuild/doebuild.py", line 1030, in doebuild
    doebuild_environment(
  File "lib/portage/package/ebuild/doebuild.py", line 519, in doebuild_environment
    eapi = mysettings.configdict["pkg"]["EAPI"]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "lib/portage/util/__init__.py", line 1684, in __getitem__
    return UserDict.__getitem__(self, item_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/portage/cache/mappings.py", line 175, in __getitem__
    return self.data[key]
           ~~~~~~~~~^^^^^
KeyError: 'EAPI'

Bug: https://bugs.gentoo.org/924319
Signed-off-by: Zac Medico <[email protected]>
Wrap asyncio.Lock for compatibility with python 3.9 where the
deprecated loop parameter is required in order to avoid "got
Future <Future pending> attached to a different loop" errors.

The pordbapi async_aux_get method can use asyncio.Lock to
serialize access to its doebuild_settings attribute in order
to prevent issues like bug 924319.

Bug: https://bugs.gentoo.org/924319
Signed-off-by: Zac Medico <[email protected]>
For the portdbapi async_aux_get method, there is not a very
good place to store a config pool, so instead use asyncio.Lock
to manage access to the portdbapi doebuild_settings attribute
when using the main event loop in the main thread. For other
threads, clone a config instance since we do not have a
thread-safe config pool. This cloning is expensive, but since
portage internals do not trigger this case, it suffices for now
(an AssertionError ensures that internals do not trigger it).
For the main event loop running in the main thread, performance
with the asyncio.Lock should not be significantly different to
performance prior to commit c95fc64, since check_locale
results are typically cached and before there was only a single
shared doebuild_settings instance with access serialized via
the EbuildMetadataPhase _start method.

Update async_aux_get callers to use asyncio.ensure_future on
the returned coroutine when needed, since it used to return
a future instead of a coroutine, and sometimes a future is
needed for add_done_callback usage.

In the portdbapi async_fetch_map method, fix a broken reference
to "future" which should have been "aux_get_future", an error
discovered while testing this patch.

Bug: https://bugs.gentoo.org/924319
Fixes: c95fc64 ("EbuildPhase: async_check_locale")
Signed-off-by: Zac Medico <[email protected]>
Change config.environ() check_locale calls to async_check_locale
calls in the EbuildPhase _async_start method in order to eliminate
synchronous waiting for child processes in the main event loop
thread.

Bug: https://bugs.gentoo.org/923841
Signed-off-by: Zac Medico <[email protected]>
@zmedico zmedico marked this pull request as draft February 14, 2024 03:33
@zmedico zmedico force-pushed the bug_924416_private_multiprocessing_context branch 6 times, most recently from 4afc501 to 56d749f Compare February 14, 2024 06:35
@zmedico zmedico marked this pull request as ready for review February 14, 2024 06:36
@zmedico zmedico requested a review from thesamesam February 14, 2024 06:36
@zmedico zmedico marked this pull request as draft February 14, 2024 06:46
Wrap the multiprocessing module to return a private multiprocessing
context instead. It's safe to call set_start_method on this and
it will only affect portage internals, which makes it compatible
with the pytest-xdist plugin.

Bug: https://bugs.gentoo.org/924416
Signed-off-by: Zac Medico <[email protected]>
@zmedico zmedico force-pushed the bug_924416_private_multiprocessing_context branch 2 times, most recently from 8464a80 to d15cfc6 Compare February 14, 2024 17:27
Since pytest-xdist workers crash intermittently for the multiprocessing
spawn start method, use pytest-rerunfailures to detect and handle this
case. Only use pytest-rerunfailures for the spawn start-method since
that is the only case where we've observed intermittent pytest-xdist
worker crashes, and use --only-rerun 'worker .* crashed while running'
to ensure that rerun only triggers for worker crashes.

Bug: https://bugs.gentoo.org/924416
Signed-off-by: Zac Medico <[email protected]>
@zmedico zmedico force-pushed the bug_924416_private_multiprocessing_context branch from d15cfc6 to beeffdb Compare February 14, 2024 17:42
@zmedico
Copy link
Member Author

zmedico commented Feb 14, 2024

I'll close this since it seems like the private multiprocessing context does not really solve the pytest-xdist worker crashes. Opened #1273 to just handle it with pytest-rerunfailures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant