Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot deletion fixes #10

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

vladimiroltean
Copy link
Contributor

This is a collection of one bug fix potentially leading to lockups, a typo correction and the removal of one unused variable.

…_REGION_GET

mnlg.c is not reentrant - that is, when we call delete_snapshot() ->
mnlg_msg_prepare() and _mnlg_socket_sndrcv(), we alter the state of
the ctx->nlg devlink genetlink socket. Most problematic is nlg->buf
corrupted by mnlg_socket_recv_run().

But delete_snapshot_port_id() itself runs in the context of another
delete_snapshots() -> _mnlg_socket_sndrcv() callback:
delete_snapshots_cb(). This is servicing its own mnlg_socket_recv_run()
which has not yet finished.

The problem I'm seeing is actually with the sja1105_dump fork. But that
uses logic copied word for word from this project. I'm seeing a lockup
when delete_snapshot_port_id() is in the call path of delete_snapshots():

SEND devlink
    DEVLINK_CMD_REGION_GET
        DEVLINK_ATTR_BUS_NAME = "spi"
        DEVLINK_ATTR_DEV_NAME = "spi2.0"
Calling _mnlg_socket_recv_run() from delete_snapshots
mnlg_socket_recv_run called from delete_snapshots: before mnl_socket_recvfrom(), buf 0xaaab123442d0
mnlg_socket_recv_run called from delete_snapshots: after mnl_socket_recvfrom(), err = 116
RECEIVE devlink
    DEVLINK_CMD_REGION_GET
        DEVLINK_ATTR_BUS_NAME = "spi"
        DEVLINK_ATTR_DEV_NAME = "spi2.0"
        DEVLINK_ATTR_REGION_NAME = "static-config"
        DEVLINK_ATTR_REGION_SIZE = 95528
        [170] = 01:00:00:00
        DEVLINK_ATTR_REGION_SNAPSHOTS
            DEVLINK_ATTR_REGION_SNAPSHOT
                DEVLINK_ATTR_REGION_SNAPSHOT_ID = 42
RECEIVE devlink
    DEVLINK_CMD_UNSPEC
SEND devlink
    DEVLINK_CMD_REGION_DEL
        DEVLINK_ATTR_BUS_NAME = "spi"
        DEVLINK_ATTR_DEV_NAME = "spi2.0"
        DEVLINK_ATTR_REGION_NAME = "static-config"
        DEVLINK_ATTR_REGION_SNAPSHOT_ID = 42
Calling _mnlg_socket_recv_run() from delete_snapshot_port_id
mnlg_socket_recv_run called from delete_snapshot_port_id: before mnl_socket_recvfrom(), buf 0xaaab123442d0
mnlg_socket_recv_run called from delete_snapshot_port_id: after mnl_socket_recvfrom(), err = 36
RECEIVE devlink
    DEVLINK_CMD_UNSPEC
mnlg_socket_recv_run called from delete_snapshot_port_id: mnl_cb_run2() returned MNL_CB_STOP
mnlg_socket_recv_run called from delete_snapshots: mnl_cb_run2() returned MNL_CB_OK
mnlg_socket_recv_run called from delete_snapshots: before mnl_socket_recvfrom(), buf 0xaaab123442d0

^C

The normal way, I think, in which delete_snapshots() ->
mnlg_socket_recv_run() stops is that mnl_cb_run2() encounters NLMSG_DONE
and that makes err = 0, which breaks the loop. This no longer works when
the do {} while loop has another mnlg_socket_recv_run() embedded within
it: the top-most loop will not see that NLMSG_DONE, thus mnl_cb_run2()
will run the callback which returns MNL_CB_OK, and the loop runs once
more, blocking in the mnl_socket_recvfrom() call, because no more data
is expected from the kernel.

I have no idea why I wasn't able to reproduce this on mv88e6xxx.
Somehow, even with the incorrect-in-principle usage, mnl_cb_run2() does
not block on that last mnl_socket_recvfrom() call from the surplus
iteration, and somehow even still, eventually, calls mnlg_cb_stop().
It may have to do with the different number and geometry of devlink
regions between these 2 drivers. On mv88e6xxx there are many, small
regions. On sja1105 there is a single, huge one (95528 bytes).

Just queue the snapshots up until we finish parsing the netlink messages
from the DEVLINK_CMD_REGION_GET command, and delete them only afterwards.
That solves the lockup in sja1105_dump, and is good to have in the base
program as well.

Signed-off-by: Vladimir Oltean <[email protected]>
Replace "sanpshot" with "snapshot".

Signed-off-by: Vladimir Oltean <[email protected]>
What actually signaled this to me was the warning:

mv88e6xxx_dump.c: In function ‘port_dump’:
mv88e6xxx_dump.c:593:30: warning: too many arguments for format [-Wformat-extra-args]
  593 |         sprintf(region_name, "port", port);
      |                              ^~~~~~

Since nothing is actually done with the region_name string, delete it.

Signed-off-by: Vladimir Oltean <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant