Role dispatchers delivery guarantee changes

@louthy louthy released this 12 Oct 20:18

Role dispatchers find all online nodes in a role and deliver messages to some or all of them, depending on the dispatcher policy (round-robin, broadcast, least-busy, etc.). They had a flaw: if there were zero nodes online, the tell would fail with the following error:

No processes in group - usually this means there are offline services.

This was in stark contrast to dispatching to a single named Process, where the message is queued until that process comes back online.

The problem with taking this approach for roles is that some nodes may never come back online, and so the persistent store could fill up. Roles are supposed to be dynamic in a way that a single ProcessId pointing at a known Process is not.

However, there's a middle ground:

  • Role dispatchers first use the Process.ClusterNodes property to see which nodes have been active in the past four seconds
    • If there are some, then the tell is sent only to the nodes currently online
    • This step was the entirety of the previous system
  • If there aren't any nodes online, then the Role dispatcher falls back to Process.ClusterNodes24, which lists the nodes that have been active in the past 24 hours
    • It first tries to find nodes active in the past hour, then within two hours, then three, etc., up to 24 hours
    • If some nodes have been active within one of those windows, then the tell is sent to their persisted queues, where it waits for the node(s) to start up

When those nodes restart (if they ever do), they will be able to process the messages as normal.
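The selection order described above can be modelled with a short, library-independent sketch. It is written in Python purely for illustration and is not the dispatcher's actual implementation: the function `select_targets`, the `last_seen` dictionary, and the returned mode strings are all hypothetical names. The source also does not say what happens when no node has been active within 24 hours, so the sketch simply returns an empty list in that case.

```python
from datetime import datetime, timedelta

# Windows taken from the release notes; the names are hypothetical.
ONLINE_WINDOW = timedelta(seconds=4)   # "active in the past four seconds"
MAX_FALLBACK_HOURS = 24                # the 24-hour node list


def select_targets(last_seen, now):
    """Pick the nodes in a role that a tell should go to.

    last_seen maps a node name to the time it was last known to be active.
    Returns (nodes, mode), where mode is "online" (deliver directly),
    "queued" (deliver to the nodes' persisted queues) or "none".
    """
    # Step 1: nodes active within the last four seconds count as online.
    online = sorted(n for n, t in last_seen.items() if now - t <= ONLINE_WINDOW)
    if online:
        return online, "online"

    # Step 2: widen the window one hour at a time, up to 24 hours, and
    # stop at the first window that contains any node.
    for hours in range(1, MAX_FALLBACK_HOURS + 1):
        window = timedelta(hours=hours)
        recent = sorted(n for n, t in last_seen.items() if now - t <= window)
        if recent:
            return recent, "queued"

    # Assumption: behaviour with no node active in 24 hours is not
    # described in the source, so the sketch reports nothing to send to.
    return [], "none"


if __name__ == "__main__":
    now = datetime(2020, 1, 1, 12, 0, 0)  # arbitrary reference time
    role = {
        "node-a": now - timedelta(minutes=90),
        "node-b": now - timedelta(hours=5),
    }
    print(select_targets(role, now))
```

In this model the hour-by-hour widening means a tell is queued only for the most recently active nodes, rather than for every node seen in the last day.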

This allows for periods of downtime without lost messages, which matters most when you are running only a single instance of a service.