Role dispatchers delivery guarantee changes
Role dispatchers find all online nodes in a role and deliver messages to some or all of them, depending on the dispatcher policy (round-robin, broadcast, least-busy, etc.). They had a flaw: if there were zero nodes online, the tell would fail with the following error:
No processes in group - usually this means there are offline services.
This was in stark contrast to dispatching to a single named Process, where the message would be queued until that process came back online.
The problem with taking this approach for roles is that some nodes may never come back online, and so the persistent store could fill up. Roles are supposed to be dynamic in a way that a single ProcessId pointing at a known Process is not.
However, there's a middle ground:
- Role dispatchers first use the Process.ClusterNodes property to see what nodes have been active in the past four seconds
  - If there are some, then the tell will be sent only to the nodes currently online
  - This was the entirety of the previous system
- If there aren't any nodes online, then the Role dispatcher will fall back to Process.ClusterNodes24, which has a list of the nodes that have been active in the past 24 hours
  - It will first try to find nodes active in the past hour, then within two hours, then three, etc. up to 24 hours
  - If some nodes have been active recently, then the tell will be sent to their persisted queue, where it waits for the node(s) to start up
When those nodes restart (if they ever do), they will be able to process the messages as normal.
This allows for periods of downtime without losing messages, which helps if, for example, you run only a single instance of a service.
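The fallback order described above can be modelled in a few lines. The following is a minimal sketch in Python, not the library's implementation: the Node type, the last_seen field, and the select_targets function are hypothetical names introduced for illustration. It also assumes that when no node has been active within 24 hours the original error is still raised, which the text above does not state explicitly. The dispatcher policy (round-robin, broadcast, etc.) would then choose among the returned candidates; that step is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ONLINE_WINDOW = timedelta(seconds=4)  # "active in the past four seconds"
MAX_FALLBACK_HOURS = 24               # widest fallback window considered


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a cluster node entry."""
    name: str
    last_seen: datetime  # assumed: time of the node's last heartbeat


class NoProcessesInGroup(Exception):
    """Models the error quoted earlier in this section."""


def select_targets(nodes, now):
    """Return (mode, candidates) for a tell sent to a role.

    mode is "deliver" when nodes are currently online, or "queue" when
    the message should go to the persisted queues of recently seen nodes.
    """
    # Step 1: nodes seen within the online window get the message directly.
    online = [n for n in nodes if now - n.last_seen <= ONLINE_WINDOW]
    if online:
        return "deliver", online

    # Step 2: widen the window one hour at a time, up to 24 hours, and
    # stop at the first window that contains any node.
    for hours in range(1, MAX_FALLBACK_HOURS + 1):
        window = timedelta(hours=hours)
        recent = [n for n in nodes if now - n.last_seen <= window]
        if recent:
            return "queue", recent

    # Assumption: with nothing seen in 24 hours, the original error applies.
    raise NoProcessesInGroup(
        "No processes in group - usually this means there are offline services."
    )
```

Widening the window hour by hour means the message is queued only for the most recently active nodes, which are the ones most likely to come back, rather than for every node seen in the last day.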