Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bittensor/axon.py: thread and exception handling #2227

Open
wants to merge 9 commits into
base: staging
Choose a base branch
from

Conversation

mvds00
Copy link

@mvds00 mvds00 commented Aug 13, 2024

Bug

Various issues were encountered trying to run and understand e2e tests:

  • if uvicorn fails to start, an uncaught exception is emitted to stderr
  • axon keeps spinning, waiting for self.started, indefinitely
  • exceptions are not propagated from threads
  • there is no way to (simply) test from the outside whether an axon started and/or runs
  • axon creates a thread that only creates another thread, which seems redundant

Description of the Change

This patch addresses some of these issues, in FastAPIThreadedServer:

  • add thread safe set/get_exception() to set/get exceptions
  • run_in_thread() yields the created thread, so that the code using it can check whether the thread is alive
  • uvicorn.Server.startup() is wrapped to set a thread-safe flag using self.set_started(True) to indicate startup succeeded
  • run_in_thread() times out after one second to prevent infinite loop in case self.get_started() never becomes True
  • run_in_thread() raises an exception if it fails to start the thread
  • _wrapper_run() tests whether the thread is still alive

and in class axon, the following are added:

  • @property axon.exception(), returning any exception
  • axon.is_running(), returning True when the axon is operational

The seemingly redundant thread is left in until feedback is received on the reasons for including it.

Alternate Designs

It is a deliberate choice to add status calls on class axon, and not expect the user instantiating an axon to look into implementation details such as axon.fast_server, even though they are not explicitly marked private by the naming convention used.

Possible Drawbacks

Not expected. Perhaps issues that are currently buried will suddenly pop up. This is then intended and an expected effect of improving error handling.

Verification Process

These changes are part of an effort to improve e2e tests, and as such they are part of various other developments and tested along with other changes.

Release Notes

N/A

Copy link
Contributor

@roman-opentensor roman-opentensor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments for changes. Also it required ruff formatting.
But looks pretty good!

_thread: threading.Thread = None
_started: bool = False

def set_exception(self, ex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls use more descriptive variable name instead of ex.
Also, pls add type annotation.

with self._lock:
return self._exception

def set_thread(self, thread):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls add type annotation.

with self._lock:
return self._thread

def set_started(self, started):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls add type annotation.

@property
def exception(self):
# for future use: setting self._exception to signal an exception
e = getattr(self,'_exception',None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls use more descriptive variable name instead of e

"""
thread = threading.Thread(target=self.run, daemon=True)
thread.start()
try:
while not self.started:
t0 = time.time()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls use more descriptive variable name instead of t0

# Our instantiator should be able to test axon.is_running() to see if all
# required threads etc are running.
def is_running(self):
t = self.fast_server.get_thread()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls use more descriptive variable name instead of t.

return e
return self.fast_server.get_exception()

# Our instantiator should be able to test axon.is_running() to see if all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to keep it in docstring. Then it will help not only the developer who reads the code, but also the one who calls help(function/method/property)

@@ -405,6 +458,24 @@ def info(self) -> "bittensor.AxonInfo":
placeholder2=0,
)

# Our instantiator should be able to test axon.exception to see if any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to keep it in docstring. Then it will help not only the developer who reads the code, but also the one who calls help(function/method/property)

@roman-opentensor roman-opentensor added the In Review Cortex Team Members Reviewing label Aug 13, 2024
@heeres heeres force-pushed the feature/mvds00/axon-threads-and-exception-handling branch from 7bb5ae9 to 25a7a47 Compare August 14, 2024 11:32
@mvds00
Copy link
Author

mvds00 commented Aug 14, 2024

Pushed fixes for first round of feedback, for now as a separate commit, plus some further tweaks.

thewhaleking
thewhaleking previously approved these changes Sep 3, 2024
Copy link
Contributor

@thewhaleking thewhaleking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with these changes. Just rebase, and we'll merge it in.

µ added 4 commits September 3, 2024 13:29
Various issues were encountered trying to run and understand e2e tests:
- if uvicorn fails to start, an uncaught exception is emitted to stderr
- axon keeps spinning waiting for self.started, indefinitely
- exceptions are not propagated from threads
- there is no way to (simply) test from the outside whether an axon
  started and/or runs
- axon creates a thread that only creates another thread, which seems
  redundant

This patch addresses some of these issues, in FastAPIThreadedServer:
- add thread safe set/get_exception() to set/get exceptions
- run_in_thread() yields the created thread, so that the code using it
  can check whether the thread is alive
- uvicorn.Server.startup() is wrapped to set a thread-safe flag using
  self.set_started(True) to indicate startup succeeded
- run_in_thread() times out after one second to prevent infinite loop in
  case self.get_started() never becomes True
- run_in_thread() raises an exception if it fails to start the thread
- _wrapper_run() tests whether the thread is still alive

and in class axon, the following are added:
- @Property axon.exception(), returning any exception
- axon.is_running(), returning True when the axon is operational

The seemingly redundant thread is left in until feedback is received on
the reasons for including it.
- ruff formatting
- docstrings
- variable names
- type annotations
- don't return ret = startup() which is annotated to return None
- moved time sleep intervals to singular global, for clarity
…after pip install

Although it seems common to have types.py (e.g. scalecodec, torch,
substrateinterface) this may lead to issues after applying pip install
-e to a package, as is suggested for bittensor (see git grep 'pip
install -e.* bittensor')

Issues were observed where circular imports would arise as Python's
native typing.py would include bittensor's types.py.
@heeres heeres force-pushed the feature/mvds00/axon-threads-and-exception-handling branch from 825145e to 1c73994 Compare September 3, 2024 13:29
thewhaleking
thewhaleking previously approved these changes Sep 3, 2024
Copy link
Contributor

@thewhaleking thewhaleking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed merge conflicts in #2459

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
In Review Cortex Team Members Reviewing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants