Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Refinement] refine error handling, error report on status API now covers boot controller startup failure #259

Merged
merged 27 commits into from
Nov 7, 2023

Conversation

Bodong-Yang
Copy link
Member

@Bodong-Yang Bodong-Yang commented Nov 2, 2023

Description

Note

This PR is based on previous PR #258.

Error handling refinements

This PR introduces refinements to otaclient's error handling, including:

  1. error report now covers boot controller startup failure, boot controller startup failed will not crash the otaclient anymore, and error information is available via status API,
  2. simplify and flatten the error handling structure, regulate the error handling policy regarding each otaclient components,
  3. refine and redefine the OTA errors definition, refine error description,
  4. refine the logging when OTA error occurs.

otaclient startup procedure refinements

otaclient core(otaclient.OTAClient) and boot controller are launched separately, exceptions during these two components startup is handled and captured, so that the whole otaclient startup procedure can finish and grpc server can be launched.
The error information during startup is recorded and available via status API.

DEBUG_MODE and failure_traceback in status API response

Also, previously failure_traceback is always included into the status API response, considering the traffic caused by unpredictable amount of bytes from failure_traceback field, this PR introduces a new config option DEBUG_MODE(default is False), and failure_traceback is only included when DEBUG_MODE is on.

Check list

  • test file(s) that covers the change(s) is implemented.
    1. test files is updated to make tests work.
  • local test is passed.
    1. no regression is observed.
  • VM test is passed.
    1. manually patch otaclient code to raise exception in arbitrary points(especially at boot controller),
    2. confirm otaclient doesn't crash, exception is catched and status API shows the error information.

Changes

OTA errors handling structure change

Check documentation for OTA errors handling design for more details.

OTA errors definition change and descriptions update

Most of the errors are changed to unrecoverable OTA error now, only E_NETWORK, E_OTAMETA_DOWNLOAD_FAILED and E_OTA_BUSY, E_INVALID_STATUS_FOR_OTAROLLBACK are still recoverable errors.

Error descriptions are simpler and include necessary basic information for better understanding what is going on.

otaclient startup workflow change

Previously, otaclient core OTAClient launches boot controller when it is launching, but not handling the error raised by boot controller.

With this PR, a new otaclient.OTAServicer is introduced as "otaclient composer". It is in charge of launch boot controller and otaclient core(otaclient.OTAClient) separately, and then compose then together as a fully functional otaclient instance.

Exceptions during launching otaclient core and boot controller are captured and handled, and error information is available via status API.

Behavior changes

Does this PR introduce behavior change(s)?

  • Yes, internal behaivor (will not impact user experience).
  • Yes, external behaivor (will impact user experience).

Previous behavior

  1. otaclient will crash if boot controller startup failed, grpc server and otaclient core will not start,
  2. failure_traceback is always included into status API response.

Behavior with this PR

  1. otaclient will not crash on boot controller startup failed, otaclient grpc server will still launch and error information can be queried from status API,
  2. failure_traceback is included into the status API response only when DEBUG_MODE is on.

Breaking change

Does this PR introduce breaking change?

  • No.

Related links & tickets

RT4-7473

@Bodong-Yang Bodong-Yang added the refinement Improve the performance, code quality(like improving code structure/readability/error handling/robus label Nov 2, 2023
@Bodong-Yang Bodong-Yang requested a review from obi-t4 November 2, 2023 06:12
@Bodong-Yang Bodong-Yang self-assigned this Nov 2, 2023
Copy link
Contributor

github-actions bot commented Nov 2, 2023

Coverage

Coverage Report
FileStmtsMissCoverMissing
otaclient/app
   __init__.py00100% 
   __main__.py330%16–19
   common.py2681495%55, 185, 200, 275–277, 287, 296–298, 404, 416, 535–539
   configs.py760100% 
   copy_tree.py81396%45, 96, 125
   downloader.py2714484%70, 83–84, 299, 304, 308, 326–327, 377–381, 400–410, 431–449, 458, 533–535, 551, 571–589
   ecu_info.py59788%78–79, 92, 97, 120–121, 130
   errors.py1130100% 
   interface.py17476%32, 43, 47, 51
   log_setting.py26773%27–28, 55–65
   main.py31294%43–44
   ota_client.py37513065%51–52, 74, 180–204, 241–244, 256–272, 282–285, 329–332, 337–344, 357–380, 386–389, 416–440, 454–461, 468, 471–477, 524–527, 535, 571, 576–598, 659–675, 686–702, 713, 732, 743–744, 761, 780, 784–785, 800
   ota_client_call.py38684%42–44, 80–82
   ota_client_service.py29390%58–60
   ota_client_stub.py39510873%79–84, 92–102, 106–128, 133–142, 145–152, 160–165, 207–209, 214, 249, 274, 277, 280, 384, 410, 412, 438, 488, 550, 620–621, 660, 680–691, 695–709, 713–716, 775, 862–871, 901–948, 951
   ota_metadata.py3153190%147, 152, 188–189, 199–203, 215, 273, 306–308, 325–328, 408, 411, 419–421, 434, 443–448, 721–732
   ota_status.py15287%36, 44
   proxy_info.py51786%82–87, 122–129, 137–138
   update_stats.py105298%162, 172
otaclient/app/boot_control
   __init__.py40100% 
   _cboot.py28814251%82–90, 99–112, 118–121, 125, 132, 136, 140, 144–148, 152–156, 163–175, 178–221, 227, 230, 233, 236, 239, 242, 245, 248–249, 252–253, 256–260, 263–264, 279, 282, 316, 333–336, 349–352, 362–365, 373, 412, 429–432, 441–444, 449–451, 460–466, 469, 476, 506–509, 548–551, 556–567, 572–578
   _common.py36914361%65–74, 83–84, 88–89, 97–98, 109–110, 115–120, 125–130, 139, 148, 152, 156, 160, 174–177, 187–191, 196, 200, 210–214, 222–234, 245–249, 253–257, 269–271, 283–285, 296–310, 320–341, 360–376, 393–404, 412–418, 513–523, 582–583, 637, 641–642, 645, 653–656, 712–713, 716, 730–738, 753–754, 828–831, 852–855, 868–869, 878
   _errors.py471960%43–69, 89–90
   _grub.py40311472%217, 265–268, 274–278, 315–316, 323–347, 356–362, 371–377, 456–462, 514, 520, 546, 568–573, 588–590, 619–629, 684–690, 709–712, 737–740, 763–766, 808, 814, 834–837, 849, 852, 855, 858, 862–864, 882–885, 913–916, 921–929, 934–942
   _rpi_boot.py26612553%85–87, 93–148, 172–174, 180–182, 195–197, 203–205, 218–243, 248, 252, 256, 260, 294, 324, 327–329, 337, 346–349, 359–362, 366–373, 413–415, 457–461, 480–483, 488, 491, 512–515, 520–528, 533–541, 555–558, 564–566, 569
   configs.py58297%46–47
   firmware.py32584%63, 65, 76–78
   protocol.py28582%53, 57, 61, 65, 73
   selecter.py392731%45–69, 77–94
otaclient/app/create_standby
   __init__.py12558%30–36
   common.py2164579%74, 77–78, 82–95, 141, 189–197, 200–203, 207, 218, 292–300, 312, 356–361, 377–378, 392–396, 421–422
   interface.py19384%55, 59, 63
   rebuild_mode.py99991%83–101, 128
otaclient/app/proto
   __init__.py31390%37, 44–45
   _common.py4194988%85, 163, 170, 182–184, 203, 208, 219, 256, 262, 267, 298, 302, 306, 363, 380, 403, 464, 471, 474, 494, 501, 503, 529, 535, 538, 540, 565, 571, 574, 576, 605, 609, 611, 625, 642, 670, 673, 677, 680, 708, 714, 761–766, 797
   _ota_metafiles_wrapper.py841385%37, 40–42, 112–116, 122–125
   _otaclient_v2_pb2_wrapper.py2763787%87, 90–93, 132, 175, 183, 197, 207, 210–213, 258, 261, 264–265, 285, 305, 385, 452, 505, 513–515, 519–522, 525–530, 551, 559, 573, 581, 595
   streamer.py43881%32, 47, 65–66, 71, 80–81, 99
   wrapper.py40100% 
otaclient/ota_proxy
   __init__.py46393%86, 90, 94
   __main__.py22220%16–79
   _consts.py150100% 
   cache_control.py68494%71, 91, 113, 121
   config.py180100% 
   db.py1461590%75, 81, 103, 113–116, 145–147, 166, 199, 208–209, 229, 258, 300
   errors.py100100% 
   orm.py1151190%37, 91, 96, 101, 107, 113, 140–141, 154, 231, 235
   ota_cache.py4047581%95–96, 215, 226, 253–255, 275, 291–294, 317–330, 359–363, 379, 440–441, 483–484, 554, 567–570, 620, 639–640, 672–673, 684, 721–728, 735–736, 741–743, 747, 756, 803, 811–813, 892–895, 899–903, 917–934, 965, 971, 998, 1027–1029
   server_app.py1383972%75, 78, 84, 100, 102, 161–170, 212–243, 256–266, 292–298, 312–314, 320–322
   utils.py23196%31
TOTAL6010129778% 

@Bodong-Yang Bodong-Yang changed the base branch from main to fix/grub_slot November 2, 2023 06:28
@Bodong-Yang Bodong-Yang marked this pull request as ready for review November 2, 2023 06:39
Copy link
Member Author

@Bodong-Yang Bodong-Yang Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTA errors definition is simplified considering the real world condition and rethinking on the end-user's side.

With this PR, only network related errors, OTA busy error(OTA already in-progress) and invalid OTA status(attempt to rollback when ota-status is FAILURE) are treated as recoverable, as these errors can be resolved by user themselves.

Other errors are all listed as unrecoverable, and require user to contact technical supports from us(user cannot or hard to know what to do).

Comment on lines +656 to +662
# boot controller starts up
try:
_bootctrl_inst = _bootctrl_cls()
except ota_errors.OTAError as e:
logger.error(
e.get_error_report(title=f"boot controller startup failed: {e!r}")
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boot controller is launched before otaclient core(OTAClient) starts, and we capture and handle the boot controller startup error here.

Error will be captured and parsed, error report(including full traceback) will be logged, failure reason will be available via status API. When DEBUG_MODE is on, full traceback will also be available via status API.

Comment on lines +678 to +689
try:
self._otaclient_inst = OTAClient(
boot_controller=_bootctrl_inst,
create_standby_cls=_standby_slot_creator,
my_ecu_id=ecu_info.ecu_id,
control_flags=control_flags,
proxy=proxy,
)
except ota_errors.OTAError as e:
logger.error(
e.get_error_report(title=f"otaclient core startup failed: {e!r}")
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTAClient startup failure also being captured and parsed here.

Comment on lines 797 to +803
async def get_status(self) -> wrapper.StatusResponseEcuV2:
return await self._run_in_executor(self._otaclient.status)
# otaclient is not started due to boot control startup failed
if self._otaclient_inst is None:
return self._otaclient_startup_failed_status

# otaclient core started, query status from it
return await self._run_in_executor(self._otaclient_inst.status)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If otaclient fails to start, status API will return the pre-built status API response, which is generated based on parsing boot controller startup failure/otaclient core startup failure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If otaclient is started, then use the status report generated by running otaclient instance.

Base automatically changed from fix/grub_slot to main November 6, 2023 02:11
@Bodong-Yang Bodong-Yang marked this pull request as draft November 6, 2023 02:13
@Bodong-Yang
Copy link
Member Author

Bodong-Yang commented Nov 6, 2023

Rebase the PR to cleanup unwanted commits.

Updated: Done.

@Bodong-Yang Bodong-Yang force-pushed the refine/boot_init_not_crash branch from 66dec89 to 9f1b41f Compare November 6, 2023 02:18
@Bodong-Yang Bodong-Yang marked this pull request as ready for review November 6, 2023 02:19
Comment on lines +187 to +188
DEBUG_MODE = False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About DEBUG_MODE, when enabled, currently otaclient will behave as follow:

  1. failure_traceback will also be reported via status API(normally only being logged).

In the future, more features like logging to file when DEBUG_MODE is on might be supported.

@obi-t4
Copy link
Contributor

obi-t4 commented Nov 7, 2023

Also, previously failure_traceback is always included into the status API response, considering the traffic caused by unpredictable amount of bytes from failure_traceback field, this PR introduces a new config option DEBUG_MODE(default is False), and failure_traceback is only included when DEBUG_MODE is on.

In this PR, why do you change this behavior? If this required changes, please write the reason, and why the previous code was OK to include trace back?

@Bodong-Yang
Copy link
Member Author

Also, previously failure_traceback is always included into the status API response, considering the traffic caused by unpredictable amount of bytes from failure_traceback field, this PR introduces a new config option DEBUG_MODE(default is False), and failure_traceback is only included when DEBUG_MODE is on.

In this PR, why do you change this behavior? If this required changes, please write the reason, and why the previous code was OK to include trace back?

Sure!

So when failure_traceback was introduced in status API response format v2 many months ago, I was thinking of exposing the full detailed error messages to the end-user side to help end-user debugging the problem.
But now I realize that, in normal case, we should not rely on end-user to debug/fix things on their own when unrecoverable errors occur, they should call for technical support and it is our responsibility to fix problem for them.
So the failure_traceback message included in the status API response is meaningless for most of the end-user(failure_reason that describing what is going on and how should the end-user do in brief is enough).

For technical support like us, when end-user side has problem, the fastest way for us to locate the problem is to check the otaclient logging on cloudwatch, failure_traceback in status API is not so useful for us.

Also the length of failure_traceback is unpredictable depending on what error is raised, I am worried about failure_traceback with unpredictable length will make the status API response to large and cause unwanted problem. (I didn't think of this potential issue when introducing this field).

But the failure_traceback can benefit the local OTA testing, when developer directly monitors the OTA progress with status_monitor helper tool.

With the above considerations, I decide to make failure_traceback in status API as developer oriented feature, and only enabled when DEBUG_MODE is on.

@obi-t4
Copy link
Contributor

obi-t4 commented Nov 7, 2023

@Bodong-Yang

the fastest way for us to locate the problem is to check the otaclient logging on cloudwatch

We can see the traceback on the cloudwatch log, right? (I think it is enough)

@obi-t4
Copy link
Contributor

obi-t4 commented Nov 7, 2023

If you think:

I am worried about failure_traceback with unpredictable length will make the status API response to large and cause unwanted problem.

Do we need to enable traceback in the status API by using DEBUG_MODE flag?
I think otaclient log is enough.

@Bodong-Yang
Copy link
Member Author

@Bodong-Yang

the fastest way for us to locate the problem is to check the otaclient logging on cloudwatch

We can see the traceback on the cloudwatch log, right? (I think it is enough)

Yes, the same contents will be logged using logger.error.

@Bodong-Yang
Copy link
Member Author

Bodong-Yang commented Nov 7, 2023

If you think:

I am worried about failure_traceback with unpredictable length will make the status API response to large and cause unwanted problem.

Do we need to enable traceback in the status API by using DEBUG_MODE flag? I think otaclient log is enough.

When using status_monitor helper tool(#251) with failure_traceback enabled, the failure_traceback can be directly shown in the status_monitor, I think it is a good developer oriented feature(if the otaclient being tested is running as background service).

Also if we are doing local testing using local network, status API's response length is not a problem I think.

Copy link
Contributor

@obi-t4 obi-t4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment but basicall OK!

module: OTAModules = OTAModules.API
errcode: OTAErrorCode = OTAErrorCode.E_INVALID_STATUS_FOR_OTAROLLBACK
desc: str = f"{_RECOVERABLE_DEFAULT_DESC}: current ota-status indicates it should not accept ota rollback"
class OTAErrorUnRecoverable(OTAError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class OTAErrorUnRecoverable(OTAError):
class OTAErrorUnrecoverable(OTAError):

or if we use UnRecoverable, then we need to employee UN_RECOVERABLE. Unrecoverable is natural I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 16cc689

Copy link
Contributor

@obi-t4 obi-t4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@Bodong-Yang Bodong-Yang merged commit 0b9d5af into main Nov 7, 2023
3 checks passed
@Bodong-Yang Bodong-Yang deleted the refine/boot_init_not_crash branch November 7, 2023 10:10
Bodong-Yang added a commit that referenced this pull request Nov 8, 2023
Commits from v3.6.0 to v3.6.1

* Bug fix
1. [Fix] grub: fix unintended behavior when updating grub_default #253
2. (Major) [Fix] grub: workaround for OTA failure/otaclient crashing if rootfs is not sda #258

* Refinement
1. [Chore] minor update and fix to tools.status_monitor #256
2. (Major) [Refinement] refine error handling, error report on status API now covers boot controller startup failure #259

* Chore
1. [Chore] minor update and fix to tools.status_monitor #256
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refinement Improve the performance, code quality(like improving code structure/readability/error handling/robus
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants