Handle optimization failures in the fixed-lag smoother #213
Comments
If the entire fixed-lag smoother node shuts down and respawns, it will go back into its "ignition" phase. Depending on how that is configured, it will either:

(a) Load a default initial pose and start running immediately.

If this is the global localization source (providing map->odom transforms), the robot is probably not at the default location. Though I suppose someone could update the default pose parameter periodically to keep up with the robot. Not sure if anyone does that in practice. Someone could even make an ignition sensor that performs this logging action using the

If this is the relative localization source (providing odom->base_link transforms), then the absolute pose in that frame is less important. But the robot will experience a large jump in that frame. Unless that type of artifact is being checked for and handled, this jump will likely corrupt the global localization value.

Or (b) wait for a global localization/ignition measurement before running.

If the robot waits for a global localization source, then the robot will be lost but it will at least know that it is lost. I find a small bit of solace in that. That may not be the desirable outcome, but at least the robot won't wander off a cliff while it thinks it is somewhere else.
If we have a single bad constraint (one that returns NaNs in the Jacobian or something), then it will eventually leave the fixed-lag window. However, the way things leave the fixed-lag window is via a marginalization operation. If the constraint is returning invalid values for the costs or the Jacobian, then the marginalization step will also fail.

In such a case, with enough work, it may be possible to identify and remove the constraint that is causing problems. One could imagine evaluating the cost of each new constraint and looking for obvious problems (infs, NaNs, etc.). That could work in some situations. However, if the constraint causing problems is a motion model, or if the constraint is critical for having a fully-constrained graph (e.g. a prior on a visual landmark), then removing the problematic constraint will merely change your problem; it will not fix it.

If we have an incomplete set of constraints resulting in a critical disconnected graph, say a missing motion model constraint creating a break in the robot odometry chain, then the section after the break will be missing its marginal/prior on the first state in the chain. This will make the second chain under-constrained for all time. Let's see if I can make some ASCII art.
If we have a break in the chain:
Then the second sub-chain is missing its prior/marginal. Without that, the second sub-chain is under-constrained and will likely fail optimization...eventually (L-M optimization is very forgiving). This is true even if we wait and let the first sub-chain fall out the back of the fixed-lag window.
If we have an incomplete set of constraints resulting in a noncritical disconnected graph, say a visual landmark that is not connected to any robot poses, then it may be possible to "wait it out". The marginalization process should result in an empty constraint and the landmark would just disappear. That assumes the rest of the variables remain in a good working state.

I haven't explored what happens to the variable values when an optimization fails. Do the variable values remain unchanged? Or do they get corrupted while the optimizer is trying to find a solution?

tl;dr: There are many failure modes that are simply unrecoverable. It might be possible to recover from a few specific cases, but I'm not sure if it is worth trying.
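The idea above of evaluating each new constraint and looking for obvious problems could be sketched roughly as follows. This is only an illustration: `ConstraintEvaluation`, `allFinite`, and `isEvaluationSane` are hypothetical names, not part of fuse or Ceres, and the sketch assumes the residuals and Jacobian entries of a constraint have already been evaluated into plain vectors.

```cpp
#include <cmath>
#include <vector>

// Hypothetical container (not a fuse type): the residuals and the flattened
// Jacobian entries obtained by evaluating one constraint's cost function.
struct ConstraintEvaluation
{
  std::vector<double> residuals;
  std::vector<double> jacobian;
};

// Returns true only if every value is finite (no NaN, no +/-inf).
inline bool allFinite(const std::vector<double>& values)
{
  for (const double value : values)
  {
    if (!std::isfinite(value))
    {
      return false;
    }
  }
  return true;
}

// Screens an evaluated constraint for the "obvious problems" mentioned above.
inline bool isEvaluationSane(const ConstraintEvaluation& evaluation)
{
  return allFinite(evaluation.residuals) && allFinite(evaluation.jacobian);
}
```

As noted above, a check like this only detects the bad constraint. If that constraint is a motion model or a required prior, dropping it still leaves the graph under-constrained.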
As discussed in PR#208: @ayrton04 @efernandez
@efernandez 7 days ago Author Collaborator
I wonder if we should do something when the solver doesn't converge. I'd like to discuss the options and listen to your thoughts, so maybe we can do something on another PR.
One option is to check if (summary_.IsSolutionUsable()): http://ceres-solver.org/nnls_solving.html#_CPPv4NK5ceres6Solver7Summary16IsSolutionUsableEv
If it's not usable we could do one of the following:
Abort
Don't notify, but still do the rest, i.e. prepare the marginalization for the next cycle
Roll back the graph to the state before adding the transaction, but this isn't trivial
Note that IsSolutionUsable() also returns true if the solver doesn't converge and finishes because it reached the max time or iterations limit. In that case, the diagnostic warning should be enough, and we can simply continue as normal.
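To make the distinction concrete, here is a minimal sketch of how the termination type could map onto the options listed above. It does not link against Ceres: `TerminationType` is a stand-in that mirrors the values documented for `ceres::Solver::Summary::termination_type`, `isSolutionUsable` mirrors the documented rule (usable for CONVERGENCE, NO_CONVERGENCE, and USER_SUCCESS), and `Action` / `decideAction` are hypothetical names used only for illustration.

```cpp
// Stand-in for the termination types documented by Ceres (assumption: this
// sketch mirrors them locally instead of including the Ceres headers).
enum class TerminationType
{
  CONVERGENCE,
  NO_CONVERGENCE,
  FAILURE,
  USER_SUCCESS,
  USER_FAILURE
};

// Mirrors the documented rule: the solution is considered usable unless the
// solver reported FAILURE or USER_FAILURE.
inline bool isSolutionUsable(TerminationType type)
{
  return type == TerminationType::CONVERGENCE ||
         type == TerminationType::NO_CONVERGENCE ||
         type == TerminationType::USER_SUCCESS;
}

// Hypothetical policy for the optimization loop.
enum class Action
{
  Continue,         // normal case
  WarnAndContinue,  // hit the time or iteration limit; diagnostics warn
  Abort             // solution is not usable
};

inline Action decideAction(TerminationType type)
{
  if (!isSolutionUsable(type))
  {
    return Action::Abort;
  }
  if (type == TerminationType::NO_CONVERGENCE)
  {
    return Action::WarnAndContinue;
  }
  return Action::Continue;
}
```

In the real optimizer the check would presumably be made on the solver summary right after the solve call; the "abort" branch here is just one of the three options under discussion.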
@svwilliams svwilliams 16 hours ago Member
I agree we should be checking if the output is usable before using it. Though I'm not sure what should be done when it is not...which is probably why I didn't bother checking in the first place. 🙂
We could try "rolling back" the transaction. For the graph itself, it would not actually be that hard. We clone the graph anyway a few lines down (https://github.com/locusrobotics/fuse/blob/devel/fuse_optimizers/src/fixed_lag_smoother.cpp#L207). With a small modification, we could clone the graph before merging in the transaction, optimize the clone, and swap it in for the "real" graph if the optimization succeeds, or drop the clone if the optimization fails.
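A rough sketch of that clone, optimize, and swap idea, using toy stand-ins rather than the actual fuse classes. `Graph`, `Transaction`, `Optimizer`, and `tryApplyTransaction` are all hypothetical here, and constraints are reduced to integer IDs to keep the example short.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for the graph (not the fuse graph class).
struct Graph
{
  std::vector<int> constraints;

  std::unique_ptr<Graph> clone() const
  {
    return std::unique_ptr<Graph>(new Graph(*this));
  }
};

// Toy stand-in for a transaction: just the constraint IDs to add.
struct Transaction
{
  std::vector<int> added_constraints;
};

// Hypothetical optimizer callback: returns true if the solution is usable.
using Optimizer = std::function<bool(Graph&)>;

// Merge the transaction into a clone, optimize the clone, and swap it in only
// if the optimization succeeds. On failure the real graph is left untouched.
inline bool tryApplyTransaction(std::unique_ptr<Graph>& graph,
                                const Transaction& transaction,
                                const Optimizer& optimize)
{
  std::unique_ptr<Graph> candidate = graph->clone();
  candidate->constraints.insert(candidate->constraints.end(),
                                transaction.added_constraints.begin(),
                                transaction.added_constraints.end());
  if (!optimize(*candidate))
  {
    return false;  // drop the clone
  }
  graph = std::move(candidate);
  return true;
}
```

This only covers the graph itself; it says nothing about keeping the motion and sensor models in sync with a discarded transaction.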
However, that doesn't fix any of the supporting infrastructure. The motion models generally track which constraint segments have been sent to the optimizer, and only send new or modified constraints in the future. If the optimizer throws away a transaction without somehow telling the motion models about it, then the two could get out of sync. That would lead to missing constraints and potentially a disconnected graph and future failures.
It would definitely be possible to modify the motion model interface to require some sort of ACK from the optimizer for each motion model segment that was successfully incorporated...but that will be a big change.
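For illustration only, an ACK-style bookkeeping scheme might look something like the sketch below. `SegmentTracker` and its methods are hypothetical and do not correspond to the existing motion model interface; segments are identified by plain integer IDs as a simplifying assumption.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical bookkeeping for a motion model: a segment only counts as
// "sent" once the optimizer acknowledges that it was incorporated.
class SegmentTracker
{
public:
  // Returns the requested segments that have not been acknowledged yet.
  // Segments from a dropped transaction are therefore offered again.
  std::vector<std::uint64_t> generate(const std::vector<std::uint64_t>& requested) const
  {
    std::vector<std::uint64_t> to_send;
    for (const std::uint64_t id : requested)
    {
      if (acknowledged_.count(id) == 0)
      {
        to_send.push_back(id);
      }
    }
    return to_send;
  }

  // Called by the optimizer after a transaction was successfully incorporated.
  void acknowledge(const std::vector<std::uint64_t>& ids)
  {
    acknowledged_.insert(ids.begin(), ids.end());
  }

private:
  std::set<std::uint64_t> acknowledged_;
};
```

With this kind of scheme, a transaction that is thrown away is simply never acknowledged, so its segments are regenerated on the next request instead of silently going missing.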
And then there are things like visual landmark observations, where the sensor model may create special constraints when a visual landmark is first added to the system (e.g. a prior on the landmark position, or some sort of 3-view constraint to ensure the landmark position is solvable). If such a constraint was thrown out without also informing the sensor models, we again get out of sync and set ourselves up for future optimization failures.
Continuing on is basically what we do now. But based on earlier errors, once the graph cannot be optimized correctly, I'm not sure anything in the future will ever fix that. I suspect you will get a constant stream of failures moving forward.
I'm inclined to go with the "abort" option. Log a descriptive message of the failure and exit(), throw an exception, or similar. If the optimization fails, we should make this as noticeable as possible. This is likely caused by a configuration or usage error, and nothing we can do in code will fix it.
But I'm not sure how you feel about it. I can be convinced otherwise.
@efernandez efernandez 9 hours ago Author Collaborator
The sensor and motion models going out of sync if we roll back to the graph before updating it with the last transaction is a very good point. I admit I didn't think much about that, although I already anticipated it wouldn't be trivial to roll back the graph.
I agree aborting is the best (and simplest) thing we can do. In such a case, I could print the last transaction, blaming it.
Indeed, if the optimization failed, it should be because of something related to one or more constraints in the transaction, the initial values provided in the variables, or the implementation of the constraints' cost functions.
We would still have to narrow that down, but it sounds like we should be able to do that afterwards if we record the graph and transactions.
Actually, we need to print the transaction, because the last transaction hasn't been published yet. It sounds like we should still notify the publishers, so the bad graph and transaction are published. Then, in the optimization loop we only need to print the transaction timestamp when aborting.
I've updated this PR with a commit that does that.
Does this sound like a good plan to you?
@ayrton04 ayrton04 5 hours ago Member
Just thinking about this from a "robot in the real world" stance, if the node in question goes down due to the exception or exit call, then even if the node respawns, won't it be completely lost? Just wondering if logging (loudly) and rolling back the graph is a good idea. Won't it be the case that, depending on how things are configured, the "bad" constraint will eventually not be included in the optimization? I guess I'm just wondering if crashing is better/worse than attempting to get back on track and yelling. Obviously the inability to tell the models which constraint/transaction is faulty makes this difficult.
@efernandez efernandez 5 hours ago Author Collaborator
I thought about that, but I believe a failure here only happens if a cost function yields a bad result, like NaN. TBH I've not been able to test a failure mode because my current configuration just works. 😄
I think when a failure happens, the solver itself prints some diagnostic messages, and it usually crashes after, if NaNs pollute the state. This is speculation though. It'd be great if I could test a failure mode. Then, maybe what we can do is be more informative when it happens, in terms of transaction and graph. Ideally, rolling back the graph would be great, but it'd be quite difficult to do that atm, as already pointed out.
Any suggestion to force a failure? Maybe adding a division by zero in a cost function, or sth like that. Or just returning false always, so the cost function always fails to evaluate. 🤔
@ayrton04 ayrton04 5 hours ago Member
If this is just a one-off test with throwaway code, then maybe throw a quick static counter variable into one of the cost functions, and after it reaches N iterations, we do what you suggested and divide by 0 or something?
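A throwaway helper along those lines might look like the sketch below. It is not tied to any real fuse constraint: `FailAfterN` is a hypothetical functor with a toy residual `x - 1`, it follows the Ceres convention of returning `false` when evaluation fails, and it uses a per-instance counter rather than a static one so that each instance fails independently.

```cpp
#include <cmath>
#include <limits>

// Throwaway test helper (hypothetical): evaluates a toy residual normally for
// the first `limit` calls, then starts producing NaN and reporting failure.
struct FailAfterN
{
  explicit FailAfterN(int limit) : limit_(limit) {}

  // Writes the residual and returns false once the evaluation "fails".
  bool operator()(const double* x, double* residual) const
  {
    ++evaluation_count_;
    if (evaluation_count_ > limit_)
    {
      residual[0] = std::numeric_limits<double>::quiet_NaN();
      return false;
    }
    residual[0] = x[0] - 1.0;
    return true;
  }

  int limit_;
  mutable int evaluation_count_{0};  // per-instance, not static
};
```

Either failure path (the NaN residual or the `false` return) could be used on its own to see how the solver and the optimizer loop react.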