
feat: Add validator slot id groups #123

Closed · wants to merge 6 commits

Conversation

@emizzle (Collaborator) commented Jul 25, 2024

Related to codex-storage/nim-codex#457 and codex-storage/nim-codex#458.

To cover the entire SlotId (uint256) address space, each validator must validate a portion of that space. When a slot is filled, its SlotId is put into a bucket based on the value of the SlotId and the number of buckets (validators) configured. Similar to `myRequests` and `mySlots`, a function called `validationSlots` can be used to retrieve the `SlotId`s being validated for a particular bucket (validator index). This facilitates loading actively filled slots in need of validation when a validator starts.

The validators value in the network-level configuration specifies the minimum number of validators to be deployed on the network; there can be more validators than that on the network. In the Codex client, each validator opts to validate one bucket of the SlotId space by specifying an index in [0, validators - 1]. It is important that at least one validator per bucket is deployed on the network, or that the configuration value is set to 1. For example, if the validators configuration value is set to 3, there should be a minimum of 3 validators deployed on the network, with each one specifying a different validator bucket to cover.

These changes do not prevent any one validator from watching the entire SlotId space to potentially earn more.
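
For illustration, here is a minimal TypeScript sketch of how a SlotId could map to a bucket, assuming a simple modulo assignment over the configured number of validators; the exact formula used by the contract may differ:

```typescript
// Configured number of validator buckets (assumed value, set at the network level).
const validators = 3n;

// Map a SlotId (a uint256, represented here as a bigint) to a bucket index
// in [0, validators - 1]. A simple modulo assignment is assumed; the contract
// may instead partition the SlotId space into contiguous ranges.
function bucketOf(slotId: bigint): bigint {
  return slotId % validators;
}

// A validator that elected bucket index 1 only validates slots in that bucket.
const myBucket = 1n;
const isMine = (slotId: bigint): boolean => bucketOf(slotId) === myBucket;
```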

emizzle added 5 commits July 25, 2024 15:31
Related to nim-codex/457, nim-codex/458.

Rename:
- `validationSlots` to `myValidationSlots`
- `addToValidationSlots` to `addToMyValidationSlots`
- `removeFromValidationSlots` to `removeFromMyValidationSlots`
@AuHau (Member) commented Jul 25, 2024

So I am a bit surprised by this design. I understand that you are taking a similar approach to how we solved similar issues for other concepts (Slots, Requests), but this seems to me a bit of an over-engineered solution that is not necessary and will, IMHO, incur additional unnecessary costs (gas) for users of the network. Also, while we are going with this solution (Validators) for now, we know we will move away from it in the future. Moreover, I don't really like the static nature of the Validator groups, which are specified when the contract is deployed.

Let's discuss it on the call today.

@emizzle (Collaborator, Author) commented Jul 26, 2024

The alternative idea proposed by @AuHau (modified for detail and completeness; a rough code sketch of steps 2, 4, and 5 follows the list):

  1. [Local] Check the repostore for previous state downloads and set block b to the last downloaded block. If there is no previous block in the repostore, set block b to the block at which the latest contract was deployed or, if that doesn't exist, to 0. There are some unknowns around where to get/persist the latest contract deployment block number; one possible solution is to hardcode this value along with the contract address in the codebase.
  2. [RPC requests] Download past SlotFilled events from block b to latest.
  3. [Local] Store the latest block number in the repo store to prevent redownloading historical state for next node restart
  4. [Local] Check if the filled slot's SlotId is in the validator's range. Validator ranges, or buckets, will have to be hardcoded in the codebase, and each validator will elect an index via a CLI parameter. If no index is elected, a random one can be assigned, or we could allow validators to override this to validate all slots, though there may be downsides to that (startup time, memory footprint).
  5. [RPC requests] For all slots in range, check the state of each slot.
  6. [Local] If Filled: 1) store the filled slot ids in the repo store to prevent redownloading, 2) start validating
  7. [Local + RPC] Similar to the slot queue, subscribe to all contract events to maintain the current slot state in the repo store, along with the latest block number.
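
A rough TypeScript (ethers.js) sketch of steps 2, 4, and 5 above. The ABI fragment, event argument names, SlotId derivation, and slot-state enum value are assumptions for illustration and may not match the actual marketplace contract; step 3 (persisting the latest block) is omitted:

```typescript
import { ethers } from "ethers";

// Illustrative ABI fragment; the real marketplace contract's signatures may differ.
const abi = [
  "event SlotFilled(bytes32 indexed requestId, uint256 slotIndex)",
  "function slotState(bytes32 slotId) view returns (uint8)",
];

const MARKETPLACE = "0x0000000000000000000000000000000000000000"; // placeholder address
const provider = new ethers.JsonRpcProvider("http://localhost:8545");
const marketplace = new ethers.Contract(MARKETPLACE, abi, provider);

const validators = 3n; // hardcoded bucket count (step 4)
const myBucket = 1n;   // index elected via a CLI parameter (step 4)
const FILLED = 1n;     // assumed enum value for the "Filled" slot state

const bucketOf = (slotId: bigint): bigint => slotId % validators;

// Assumed derivation of a SlotId from the SlotFilled event arguments.
const slotIdOf = (log: ethers.EventLog): bigint =>
  BigInt(
    ethers.solidityPackedKeccak256(
      ["bytes32", "uint256"],
      [log.args.requestId, log.args.slotIndex]
    )
  );

async function loadSlotsToValidate(fromBlock: number): Promise<bigint[]> {
  // Step 2: download past SlotFilled events from block b to latest.
  const latest = await provider.getBlockNumber();
  const events = await marketplace.queryFilter("SlotFilled", fromBlock, latest);

  const mine: bigint[] = [];
  for (const ev of events) {
    const slotId = slotIdOf(ev as ethers.EventLog);

    // Step 4: keep only slots that fall into this validator's bucket.
    if (bucketOf(slotId) !== myBucket) continue;

    // Step 5: check the state of each slot (one RPC request per slot).
    const state: bigint = await marketplace.slotState(ethers.toBeHex(slotId, 32));
    if (state === FILLED) mine.push(slotId);
  }
  return mine;
}
```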

The potential downsides of this approach are:

  1. IMO, this is more of an over-engineered solution than what is proposed in the PR, for something that is going to be replaced by aggregators. Agreed, from a gas perspective this is the better solution; however, from the perspective of getting a solution in place so that we can have validators for production, I'd rather accept that trade-off and avoid the more involved development process described above.
  2. The number of RPC requests that need to be performed before the validator can start performing its duties could make startup very slow. As an example, each slot state request incurred a response time of 350 ms when testing against the current devnet, though that is mostly due to the physical distance between the requester (Australia) and the RPC server (Europe). Extrapolate that out to potentially thousands of slots (or more).
  3. Going by our experience with the slot queue, maintaining the slot state locally, while only an engineering challenge, is not particularly simple, meaning it will likely take more time to develop than is obvious.

To address the other point: the current PR approach is "static" in that the minimum number of validators is set at the network level in the contract. Yes, this is true, and changing it would require a new contract deployment. The new design idea outlined above would still require this value to be set, but it would need to be hardcoded in the codebase instead. Because this is a temporary solution, I don't see a network-level configuration setting as a show stopper. If the network grows large enough that validators have trouble keeping up with the number of slots they need to monitor, this can be worked around in a number of ways: adding more memory, using the existing `validatorMaxSlots` CLI param, subsetting the buckets locally, and eventually changing the validator config setting when upgrading the contract. The worst case (which is a good problem to have) is that the network grows large enough that it warrants fixing this in a more permanent way, either with the introduction of aggregators or, if they haven't been developed yet, with the design above.

Ultimately, as @AuHau stated, this is a temporary solution and will most likely be replaced by aggregators in the future. By my estimate, the amount of work involved in the approach above is much more than for the solution currently proposed in this PR. Because of this, I think we should keep what has already been proposed in this PR.

@emizzle (Collaborator, Author) commented Aug 2, 2024

I put some more thought into this and have attempted to map the pros and cons for each design.

Loading Validator slots on startup

|  | Local | On chain |
| --- | --- | --- |
| **Slot buckets** | 🟡 low chance of validation delays <sup>1, 2</sup><br>🟢 (no additional gas requirements)<br>🔴 must fetch, build, and maintain slot state <sup>3</sup><br>🟡 low chance of startup delays <sup>4</sup><br>🔴 may slow Codex client during startup <sup>5</sup><br>🟡 must run one validator per bucket to cover entire space <sup>6</sup><br>🔴 additional CLI param needed <sup>14</sup><br>🔴 additional Codex client complexity <sup>16</sup><br>**SCORE: -23** | 🟡 low chance of validation delays <sup>1, 2</sup><br>🔴 increased gas requirement for `fillSlot` <sup>7</sup><br>🟡 low chance of hitting RPC limits <sup>8</sup><br>🟡 must run one validator per bucket to cover entire space <sup>6</sup><br>🟢 (no startup delays) <sup>9</sup><br>🔴 change to validator buckets requires contract re-deploy<br>🔴 additional CLI param needed <sup>14</sup><br>**SCORE: -18** |
| **No slot buckets** | 🔴 increased chance of validation delays <sup>1, 11</sup><br>🔴 increased chance of startup delays <sup>4</sup><br>🔴 must fetch, build, and maintain slot state <sup>3</sup><br>🟢 (no additional CLI params) <sup>12</sup><br>🟡 must run one validator covering all slots <sup>13</sup><br>🔴 additional Codex client complexity <sup>16</sup><br>**SCORE: -21** | 🔴 increased chance of validation delays <sup>1, 11</sup><br>🔴 increased gas requirement for `fillSlot` <sup>7</sup><br>🔴 increased chance of hitting RPC limits <sup>15</sup><br>🟢 (no startup delays) <sup>9</sup><br>🟢 (no additional CLI params) <sup>12</sup><br>🟡 must run one validator covering all slots <sup>13</sup><br>**SCORE: -16** |

Legend

| Indicator | Weight | Description |
| --- | --- | --- |
| 🟢 (feat) | +0 | added for comparison only; does not add a benefit over the baseline implementation |
| 🟢 | +5 | adds a benefit to the baseline implementation |
| 🟡 | -1 | potential drawback to the baseline implementation |
| 🔴 | -5 | adds a drawback to the baseline implementation |

1 The number of slots to validate may exceed the validators' capacity to validate all required slots in a single period. If this happens, SPs may be able to get paid out at the end of contracts before validators have had a chance to verify that they didn't miss proofs. A potential solution is to disallow `freeSlot` from being called for some periods after the contract has finished.

2 Assumes the configured number of validators is large enough to keep the bucket size within the validators' capacity to validate slots in a single period. The chance of exceeding validator capacity is lower than if there were no slot buckets, because fewer slots require validation.

3 Slot state maintenance has a few disadvantages:

  1. fetching and building the slot state locally increases the startup time for the validator, with a fresh start being potentially very long.
  2. increased memory footprint
  3. must consider the complexities of (single-threaded) concurrency issues with simultaneous reading and writing of the slot state, similar to those experienced with the slot queue

4 Before a validator begins validating, it must check the current state of each slot. The more slots to validate, the longer it takes to start validating at startup. With buckets in place, the number of slots will be smaller than without buckets.

5 Building the slot state may rob needed CPU/IO from other parts of the Codex startup routines

6 To ensure that the entire SlotId space is covered by validators, a minimum of one validator per bucket must be running at all times. The number of SlotIds each validator must cover is smaller than with no buckets at all, in which case a single running validator must cover the entire space.

7 `fillSlot` stores the SlotId in the corresponding bucket, adding to the gas costs. However, the gas costs may decrease for `freeSlot` and `payoutSlot` as storage is being freed.

8 Compared to fetching all SlotIds when not limited in number by buckets

9 Compared to fetching and building the slot state locally

11 With no slot buckets to limit the number of slots to validate, there is an increased chance the validator will not be able to validate all slots in the network in a given period.

12 No additional validator CLI params are needed; only --validator-max-slots, which is already implemented, is needed.

13 There is no need to ensure that one validator for each bucket is running; however, in order to guarantee that the entirety of the SlotId space is covered, there must be one validator running that can handle validation of all the slots in the network.

14 When slot buckets are used, an additional CLI param is required to assign the validator to a particular bucket. If not specified, by default a random bucket could be assigned, or slots in all buckets could be assigned. In both cases, --validator-max-slots would limit the slots validated.

15 Compared to fetching SlotIds when limited in number by buckets

16 The local slot state structure is more complex and will take more time to implement and maintain than storing the state on chain.

Other considerations

  • When --validator-max-slots is set, we must take care to avoid validating only the first subset of filled slots. Possibly assign slots randomly (the assignment would change on each startup)?
  • Allow --validator-max-slots to be 0, indicating no maximum?

Scoring

Scores were drawn using the current client state as a baseline, to indicate whether a point was a supplementary benefit or a drawback relative to that baseline. All four options add the net benefit of allowing validators to fetch active slots in the network, so that point has been left out of the comparison. We might want to assign different weights to some of the points, e.g. the introduction of validation delays could have a higher weight.

@emizzle (Collaborator, Author) commented Aug 8, 2024

After a discussion, it was highlighted that, in order to prevent SPs from being locked into lengthy contracts, a maximum contract duration (initially 30 days) will be implemented. Due to this maximum, querying and processing 30 days of historical SlotFilled events is sufficient to cover active slots in the network. On startup, validators will issue one RPC request to retrieve all SlotFilled events over the past 30 days and build the list of SlotIds in memory. Only SlotIds that are in the assigned bucket will be added to the in-memory list. The validator will then process the SlotIds as they are processed now, in a loop, removing slots that are no longer filled. There will be no persistence of slots or synced blocks in the repo store, forcing the validator to re-create this list on each startup.
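
A minimal sketch of that startup flow, continuing the hypothetical TypeScript/ethers example from earlier in the thread (same `provider`, `marketplace`, `bucketOf`, `myBucket`, `slotIdOf`, and `FILLED` helpers). The ~12-second block time used to convert 30 days into a block range is an assumption; the real client would use the chain's actual block time and its own event APIs rather than ethers.js:

```typescript
// Translate the 30-day lookback window into a block range (assumed ~12 s blocks).
const SECONDS_PER_BLOCK = 12;
const LOOKBACK_BLOCKS = Math.floor((30 * 24 * 60 * 60) / SECONDS_PER_BLOCK);

async function buildSlotList(): Promise<Set<bigint>> {
  const latest = await provider.getBlockNumber();
  const fromBlock = Math.max(0, latest - LOOKBACK_BLOCKS);

  // One historical query for all SlotFilled events in the window; nothing is
  // persisted, so this list is rebuilt from scratch on every startup.
  const events = await marketplace.queryFilter("SlotFilled", fromBlock, latest);

  const slots = new Set<bigint>();
  for (const ev of events) {
    const slotId = slotIdOf(ev as ethers.EventLog);
    if (bucketOf(slotId) === myBucket) slots.add(slotId); // assigned bucket only
  }
  return slots;
}

// Periodic pass that drops slots that are no longer filled, mirroring the
// existing removeSlotsThatHaveEnded routine (one RPC request per slot).
async function removeSlotsThatHaveEnded(slots: Set<bigint>): Promise<void> {
  for (const slotId of slots) {
    const state: bigint = await marketplace.slotState(ethers.toBeHex(slotId, 32));
    if (state !== FILLED) slots.delete(slotId);
  }
}
```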

@markspanbroek @AuHau, I would like to point out that, even though the number of slots is limited to those filled in the past 30 days and to the assigned bucket, the number could still be substantial, and checking the state of each slot in the `removeSlotsThatHaveEnded` routine could take quite a lot of time. Empirically, in our current testnet setup, queries to the RPC server for fetching the slot state take on the order of 300 ms (due to my physical distance from the RPC server). If there are 1000 slots in the list, it will take 300 seconds (5 minutes) for the validator to complete one iteration of `removeSlotsThatHaveEnded`.

Did we figure out what we decided to do if there are too many slots for the validator to be able to validate in a given period?

@emizzle (Collaborator, Author) commented Aug 9, 2024

> Did we figure out what we decided to do if there are too many slots for the validator to be able to validate in a given period?

Add an error log if this occurs.

Another point is that the validator does not have an entire period to validate all slots: it has from the end of the last period until the `timeout` (in seconds) to complete validation.

One idea for an optimisation is to listen for ProofSubmitted events, which would remove slots from the validation slot list.
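
A sketch of that optimisation, again continuing the earlier hypothetical TypeScript/ethers example; the ProofSubmitted event signature and argument type are assumed and may not match the real contract:

```typescript
// Illustrative event fragment; the real contract's ProofSubmitted arguments may differ.
const proofAbi = ["event ProofSubmitted(bytes32 slotId)"];
const proofEvents = new ethers.Contract(MARKETPLACE, proofAbi, provider);

function watchProofSubmissions(slots: Set<bigint>): void {
  // Remove a slot from the in-memory validation list as soon as a proof for it
  // is submitted, avoiding a redundant slot-state RPC request for that slot.
  proofEvents.on("ProofSubmitted", (slotId: string) => {
    slots.delete(BigInt(slotId));
  });
}
```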

@AuHau (Member) commented Sep 26, 2024

@emizzle I believe this PR can be closed, as we are going with the local approach, right?

@emizzle (Collaborator, Author) commented Oct 4, 2024

Closing in favour of codex-storage/nim-codex#890

@emizzle closed this Oct 4, 2024