Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CH6178] Tracer/checkpoints (rebase) #1451

Open
wants to merge 22 commits into
base: develop
Choose a base branch
from
Open

Conversation

avtolstoy
Copy link
Member

submission notes
**Important:** Please sanitize/remove any confidential info like usernames, passwords, org names, product names/ids, access tokens, client ids/secrets, or anything else you don't wish to share.

Please Read and Sign the Contributor License Agreement ([Info here](https://github.com/spark/firmware/blob/develop/CONTRIBUTING.md)).

You may also delete this submission notes header if you'd like. Thank you for contributing!

Problem

This is a rebase of #1369 on develop with renaming according to [CH8404] and some additional backup SRAM overflow checks.

Solution

When a device hard faults, crashes, freezes or is otherwise misbehaving, it is presently difficult to know what the last action the device took is and whether the fault lies in application code, system code, peripheral code etc..

A checkpoints API would allow application and system firmware to record their current execution progress and provide an indication of where execution halted prior to the crash. This information is then published to the cloud on next successful connection.

It is also possible to get stack trace of all the threads using a naive stack unwinder, that just scans through the thread stack, looking for flash addresses (within system and user part bounds) and checks if there is a branching instruction at that location.

It would be useful to store per-thread stack traces in addition to checkpoint info whenever the device enters a hardfault, a panic state, or on demand. This will provide a better overview of the system state and help understand the cause of the crash or deadlock for difficult to reproduce issues.

System code may use the following macros defined in tracer_service.h:

  • TRACER_CHECKPOINT() - regular instruction address type checkpoint
  • TRACER_PANIC_CHECKPOINT() - special instruction address type checkpoint which should only be called from a panic handler, which will prevent further modification of diagnostic data
  • TRACER_CRASH_CHECKPOINT(pc) - special instruction address type checkpoint which should only be called from a crash (hardfault) handler, which will prevent further modification of diagnostic data. An instruction address needs to be manually passed as an argument.
  • TRACER_UPDATE() - forces a full system state update (including stacktraces).

All LOG statements, except for LOG_DUMP, make a call to TRACER_CHECKPOINT().
Both hardfault and panic handlers make a call to their respective macros: TRACER_CRASH_CHECKPOINT(), TRACER_PANIC_CHECKPOINT()

Application code may use the following macros defined in spark_wiring_tracer.h:

  • CHECKPOINT() - a standard variant of CHECKPOINT() macro. Implementation depends on a preprocessor macro TRACER_ELF_AVAILABLE.
    • When defined and equal to 1, CHECKPOINT() macro internally calls TRACER_CHECKPOINT(), making it an instruction address type checkpoint
    • When undefined or set to 0, CHECKPOINT() macro internally makes a call to TRACER_TEXT_CHECKPOINT(), which saves the location of the checkpoint in textual format: __FILENAME__:__LINE__
  • CHECKPOINT(text) - a specialized variant of CHECKPOINT() macro, which forces a textual checkpoint with the text passed as an argument

All Logger class calls, except for print, printf, write and dump, internally include a call to _LOG_CHECKPOINT() macro defined in spark_wiring_diagnostic.h, which is only enabled when TRACER_ELF_AVAILABLE == 1. It acquires an address of the calling function and uses that as the checkpoint instruction address.

Steps to Test

  • unit tests
  • app/checkpoint (see README.md)

Example App

  • app/checkpoint

References


Completeness

  • User is totes amazing for contributing!
  • Contributor has signed CLA (Info here)
  • Problem and Solution clearly stated
  • Run unit/integration/application tests on device
  • Added documentation
  • Added to CHANGELOG.md after merging (add links to docs and issues)

@avtolstoy avtolstoy mentioned this pull request Dec 21, 2017
6 tasks
Copy link
Contributor

@m-mcgowan m-mcgowan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An excellent piece of work and fantastic to have this feature available! We will release it in 0.8.0-rc.2 so please ensure corresponding docs are available. In general, I'd like to see more source code comments describing why we are done - I've added some inline comments for areas that I feel need some explanation.

os_thread_t os_thread_current();

typedef struct {
uint8_t reserved;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a size field here for extensibility?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a question in the original PR as well :)

It's currently only used within a single module and is not exported, so that shouldn't be a concern at the moment. If we later decide to export it, the first reserved becomes a version field as usual.

@@ -15,14 +15,14 @@ AR = $(GCC_ARM_PATH)$(GCC_PREFIX)gcc-ar
#

# C compiler flags
CFLAGS += -g3 -gdwarf-2 -Os -mcpu=cortex-m3 -mthumb
CFLAGS += -g3 -gdwarf-2 -Os -mcpu=cortex-m3 -mthumb -fomit-frame-pointer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you say a few words about the omit-frame-pointer option. is it only an optimization or does it have a functional purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This optimization is already enabled by -Os (actually by -O1, see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html), but it's better to explicitly specify it.

Disabling this option will lead to increased flash and stack usage (at the very least +16 bytes on each function call without tail-call optimization) and will interfere with stacktracer implemented in this PR. MbedTLS bignum assembly optimizations will also fail to build as R7 is not available and stores the frame pointer.

#include <malloc.h>
#include "timer_hal.h"

extern "C" void vTaskGetStackInfo( TaskHandle_t pxTask, void** stack_ptr, void** start_stack_ptr, void** end_stack_ptr );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be pushed to a FreeRTOS header?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return 0;
}

os_result_t os_thread_dump(os_thread_t thread, os_thread_dump_callback_t callback, void* ptr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document that the callback is executed in a critical section and should ensure it completes independently of the progress of other threads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
TracerService also only calls it with interrupts disabled.


int count = 0;

for (uint32_t* sp = (uint32_t*)info->stack_start; sp <= (uint32_t*)info->stack_end; sp++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

black magic! 🤘

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat documented blackmagic :)

} else {
// Insert new at the end
if (freeSpace() < (sizeof(ThreadEntry) + size(chkpt) + strnlen(info->name, maxThreadNameLength) + 1)) {
return TRACER_ERROR_NO_SPACE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use a circular buffer so that the most recent entries are preserved?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, however in theory we should not run out of space easily. There is at least 1k (~700 on Electron) bytes in system retained RAM and we do not have that many threads (especially on Electron).

We do not generate full stacktraces on every call (only when needed, e.g. using USB request or in hardfault/watchdog ISR) and every thread entry only has a header struct (8 bytes) + its name (up to 16 bytes) + a checkpoint stored (4 bytes).

@@ -26,6 +26,8 @@ CFLAGS += -DINCLUDE_PLATFORM=1
# platforms.h
ifeq ($(PLATFORM_ID),3)
INCLUDE_DIRS += $(PROJECT_ROOT)/platform/shared/inc
INCLUDE_DIRS += $(PROJECT_ROOT)/platform/MCU/gcc/inc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unexpected - could you explain why the gcc platform is brought in for all user apps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only for GCC platform: ifeq ($(PLATFORM_ID),3)

@@ -7,6 +7,6 @@ TARGET_TYPE = a

BUILD_PATH_EXT=$(COMMUNICATION_BUILD_PATH_EXT)

DEPENDENCIES = hal dynalib services wiring crypto
DEPENDENCIES = hal dynalib services wiring crypto platform
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine this is because of the traces in the logs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. platform_tracer.h mainly.

@@ -1,8 +1,8 @@

system_part3_start = 0x8060000;

system_part3_ram_end = 0x2001D800 /* 0x20200000-10K */;
system_part3_ram_start = 0x2001c000 /* end of SRAM - 16K */;
system_part3_ram_end = 0x2001D800 - 1K /* 0x20200000-10K-1K */;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the 1K reserved for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time of writing system-part3 RAM was overflowing, this change might no longer be needed.


int tracer_save_checkpoint__(tracer_checkpoint_t* chkpt, uint32_t flags, void* reserved) {
if (callbacks2.tracer_save_checkpoint) {
return callbacks2.tracer_save_checkpoint(chkpt, flags, reserved);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain please why we need two distinct callback stores?

Copy link
Member Author

@avtolstoy avtolstoy Jan 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One for system-part1, the other for system-part3. Ideally these should be moved into an .inc and built as part of module-specific source files or I guess both should be enabled simultaneously and built as weak, as otherwise these will probably cause issues with non-clean builds. 👍

@m-mcgowan m-mcgowan added this to the 0.8.0-rc.2 milestone Jan 18, 2018
avtolstoy and others added 22 commits January 19, 2018 05:05
…hen built by C (as opposed to C++) compiler. Replace const variables like OS_THREAD_INVALID_HANDLE in concurrent_hal.h with defines when being built by C compiler
…d of passing an integer argument to __builtin_return_address
@avtolstoy avtolstoy force-pushed the feature/tracer-rebase branch from 8e40494 to 95628c0 Compare January 18, 2018 22:05
@m-mcgowan m-mcgowan modified the milestones: 0.8.0-rc.2, 0.8.0-rc.3 Jan 24, 2018
@avtolstoy avtolstoy modified the milestones: 0.8.0-rc.3, 0.8.0-rc.4 Mar 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants