[Refinement] refine error handling, error report on status API now covers boot controller startup failure #259
The OTA error definitions are simplified, taking real-world conditions and the end user's perspective into account.
With this PR, only network-related errors, the OTA busy error (an OTA is already in progress), and the invalid OTA status error (an attempt to roll back when ota-status is FAILURE) are treated as recoverable, as users can resolve these errors themselves.
All other errors are listed as unrecoverable and require the user to contact our technical support (the user cannot know, or can hardly know, what to do).
# boot controller starts up
try:
    _bootctrl_inst = _bootctrl_cls()
except ota_errors.OTAError as e:
    logger.error(
        e.get_error_report(title=f"boot controller startup failed: {e!r}")
    )
The boot controller is launched before otaclient core (OTAClient) starts, and we capture and handle the boot controller startup error here.
The error is captured and parsed, the error report (including the full traceback) is logged, and the failure reason is available via the status API. When DEBUG_MODE is on, the full traceback is also available via the status API.
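The capture step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `StartupError` stands in for `ota_errors.OTAError`, and `StartupFailure` and `start_component` are hypothetical names introduced only for this example.

```python
import logging
import traceback
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

logger = logging.getLogger("startup_sketch")


class StartupError(Exception):
    """Hypothetical stand-in for ota_errors.OTAError."""


@dataclass
class StartupFailure:
    """Hypothetical record of a failed startup, kept for later status queries."""

    failure_reason: str
    failure_traceback: str


def start_component(
    name: str, factory: Callable[[], Any]
) -> Tuple[Optional[Any], Optional[StartupFailure]]:
    """Try to build a component; on failure, log the report and record it."""
    try:
        return factory(), None
    except StartupError as e:
        full_tb = traceback.format_exc()
        # The full traceback always goes to the log.
        logger.error("%s startup failed: %r\n%s", name, e, full_tb)
        failure = StartupFailure(
            failure_reason=f"{name} startup failed: {e!r}",
            failure_traceback=full_tb,
        )
        return None, failure
```

Because the failure is returned instead of re-raised, the caller can continue its own startup and still expose the recorded reason later.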
try:
    self._otaclient_inst = OTAClient(
        boot_controller=_bootctrl_inst,
        create_standby_cls=_standby_slot_creator,
        my_ecu_id=ecu_info.ecu_id,
        control_flags=control_flags,
        proxy=proxy,
    )
except ota_errors.OTAError as e:
    logger.error(
        e.get_error_report(title=f"otaclient core startup failed: {e!r}")
    )
An OTAClient startup failure is also captured and parsed here.
 async def get_status(self) -> wrapper.StatusResponseEcuV2:
-    return await self._run_in_executor(self._otaclient.status)
+    # otaclient is not started due to boot control startup failed
+    if self._otaclient_inst is None:
+        return self._otaclient_startup_failed_status
+
+    # otaclient core started, query status from it
+    return await self._run_in_executor(self._otaclient_inst.status)
If otaclient fails to start, the status API returns a pre-built response, which is generated by parsing the boot controller startup failure or the otaclient core startup failure.
If otaclient is started, the status report generated by the running otaclient instance is used.
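The two branches described in these comments can be sketched in isolation as below. This is a simplified illustration under stated assumptions: `StatusServicerSketch` and `FakeCore` are hypothetical names, plain dicts stand in for the real status response type, and the blocking status call is offloaded with the standard `loop.run_in_executor`.

```python
import asyncio
from typing import Any, Dict, Optional


class FakeCore:
    """Hypothetical stand-in for a started otaclient core."""

    def status(self) -> Dict[str, str]:
        return {"ota_status": "SUCCESS"}


class StatusServicerSketch:
    """Hypothetical servicer: answers status queries even if the core never started."""

    def __init__(self, core: Optional[Any], startup_failed_status: Dict[str, str]):
        self._core = core
        # Pre-built response, prepared once when startup failed.
        self._startup_failed_status = startup_failed_status

    async def get_status(self) -> Dict[str, str]:
        # Core not started: return the pre-built failure response.
        if self._core is None:
            return self._startup_failed_status
        # Core started: run the blocking status query in the default executor.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._core.status)
```

With this shape the server can always answer a status query, whether or not the core came up.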
Rebase the PR to clean up unwanted commits. Updated: Done.
… default use \n as splitter
66dec89 to 9f1b41f
DEBUG_MODE = False
About DEBUG_MODE: when it is enabled, otaclient currently behaves as follows: `failure_traceback` is also reported via the status API (normally it is only logged).
In the future, more features, such as logging to a file when DEBUG_MODE is on, might be supported.
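A minimal sketch of this gating, assuming a dict-shaped response: only `failure_traceback` and `DEBUG_MODE` come from the PR, while `build_failure_status`, the `ota_status` and `failure_reason` keys, and the `debug_mode` parameter are hypothetical names used for illustration.

```python
from typing import Dict


def build_failure_status(
    failure_reason: str, failure_traceback: str, *, debug_mode: bool = False
) -> Dict[str, str]:
    """Build a status response; the traceback is exposed only in debug mode."""
    status = {
        "ota_status": "FAILURE",
        "failure_reason": failure_reason,
        # Left empty by default so the response size stays predictable.
        "failure_traceback": "",
    }
    if debug_mode:
        status["failure_traceback"] = failure_traceback
    return status
```

The traceback is still expected to reach the log in both modes; the flag only controls what the status response carries.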
In this PR, why do you change this behavior? If this change is required, please write the reason, and explain why it was OK for the previous code to include the traceback.
Sure! So when For technical support like us, when end-user side has problem, the fastest way for us to locate the problem is to check the otaclient logging on cloudwatch, Also the length of But the With the above considerations, I decide to make
We can see the traceback in the cloudwatch log, right? (I think that is enough.)
If you think:
Do we need to enable the traceback in the status API by using the DEBUG_MODE flag?
Yes, the same contents will be logged using
When using Also if we are doing local testing using a local network, the status API's response length is not a problem, I think.
I left a comment, but basically OK!
otaclient/app/errors.py
Outdated
    module: OTAModules = OTAModules.API
    errcode: OTAErrorCode = OTAErrorCode.E_INVALID_STATUS_FOR_OTAROLLBACK
    desc: str = f"{_RECOVERABLE_DEFAULT_DESC}: current ota-status indicates it should not accept ota rollback"


class OTAErrorUnRecoverable(OTAError):
-class OTAErrorUnRecoverable(OTAError):
+class OTAErrorUnrecoverable(OTAError):
Or, if we use `UnRecoverable`, then we need to employ `UN_RECOVERABLE`. `Unrecoverable` is more natural, I think.
Updated in 16cc689
LGTM 👍
Commits from v3.6.0 to v3.6.1
* Bug fix
  1. [Fix] grub: fix unintended behavior when updating grub_default #253
  2. (Major) [Fix] grub: workaround for OTA failure/otaclient crashing if rootfs is not sda #258
* Refinement
  1. [Chore] minor update and fix to tools.status_monitor #256
  2. (Major) [Refinement] refine error handling, error report on status API now covers boot controller startup failure #259
* Chore
  1. [Chore] minor update and fix to tools.status_monitor #256
Description
Note
This PR is based on previous PR #258.
Error handling refinements
This PR introduces refinements to otaclient's error handling, including:
otaclient startup procedure refinements
otaclient core (`otaclient.OTAClient`) and the boot controller are launched separately. Exceptions raised during the startup of these two components are captured and handled, so that the whole otaclient startup procedure can finish and the gRPC server can be launched. The error information from startup is recorded and available via the status API.
DEBUG_MODE and `failure_traceback` in status API response
Also, previously `failure_traceback` was always included in the status API response. Considering the traffic caused by the unpredictable number of bytes in the `failure_traceback` field, this PR introduces a new config option `DEBUG_MODE` (default is False), and `failure_traceback` is only included when `DEBUG_MODE` is on.
Check list
Changes
OTA errors handling structure change
Check documentation for OTA errors handling design for more details.
OTA errors definition change and descriptions update
Most of the errors are now changed to unrecoverable OTA errors; only `E_NETWORK`, `E_OTAMETA_DOWNLOAD_FAILED`, `E_OTA_BUSY` and `E_INVALID_STATUS_FOR_OTAROLLBACK` are still recoverable errors.
Error descriptions are simpler and include the basic information needed to understand what is going on.
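The split between recoverable and unrecoverable errors can be illustrated with a small lookup. This is only a sketch of the rule stated above, not the PR's class-based implementation: the four recoverable names are taken from this description, while `ErrorCode`, `E_PLACEHOLDER_OTHER`, `RECOVERABLE_CODES` and `is_recoverable` are hypothetical.

```python
from enum import Enum, auto


class ErrorCode(Enum):
    """Hypothetical subset of error codes; the numeric values are arbitrary."""

    E_NETWORK = auto()
    E_OTAMETA_DOWNLOAD_FAILED = auto()
    E_OTA_BUSY = auto()
    E_INVALID_STATUS_FOR_OTAROLLBACK = auto()
    E_PLACEHOLDER_OTHER = auto()  # hypothetical placeholder for any other error


# Only errors the user can resolve without support are treated as recoverable.
RECOVERABLE_CODES = frozenset(
    {
        ErrorCode.E_NETWORK,
        ErrorCode.E_OTAMETA_DOWNLOAD_FAILED,
        ErrorCode.E_OTA_BUSY,
        ErrorCode.E_INVALID_STATUS_FOR_OTAROLLBACK,
    }
)


def is_recoverable(code: ErrorCode) -> bool:
    """Return True only for the codes listed as recoverable."""
    return code in RECOVERABLE_CODES
```

Treating "unrecoverable" as the default means a newly added error code needs no extra handling to land on the safe side.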
otaclient startup workflow change
Previously, otaclient core `OTAClient` launched the boot controller during its own startup, but did not handle errors raised by the boot controller.
With this PR, a new `otaclient.OTAServicer` is introduced as the "otaclient composer". It is in charge of launching the boot controller and otaclient core (`otaclient.OTAClient`) separately, and then composing them together as a fully functional otaclient instance.
Exceptions raised while launching otaclient core and the boot controller are captured and handled, and the error information is available via the status API.
Behavior changes
Does this PR introduce behavior change(s)?
Previous behavior
`failure_traceback` is always included in the status API response.
Behavior with this PR
`failure_traceback` is included in the status API response only when DEBUG_MODE is on.
Breaking change
Does this PR introduce breaking change?
Related links & tickets
RT4-7473