Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sequencer-relayer)!: submit blobs directly to celestia app #963

Merged
merged 30 commits into from
Apr 25, 2024

Conversation

Fraser999
Copy link
Contributor

@Fraser999 Fraser999 commented Apr 16, 2024

Summary

The sequencer-relayer has been updated to submit blobs directly to the Celestia app via its gRPC interface rather than using the Celestia node as a proxy.

Background

There have been issues experienced in the past when using the Celestia node for blob submissions related to sequences and timeouts. We want to avoid such issues by "cutting out the middle man" and controlling the flow of submissions directly inside the relayer.

Changes

This PR adds a lot of new proto files and their corresponding generated Rust sources, although they have been reduced to the minimal set of messages/rpcs required for this new feature.

Aside from these, most of the new code has been isolated to a new module celestia_client inside the relayer module. It should be reasonably self-contained, with a view to possibly extracting it to a separate crate in the future, and/or providing a generic client API should we wish to add handling for more chains in the future.

At a high level, the new client is constructed via a builder, which may need multiple attempts to create the client in the case that the Celestia app is unreachable. Retries are controlled via tryhard, maxing out at 30s and with infinite attempts. A shutdown signal can cancel this, allowing for graceful exit. The builder retrieves initial info from the Celestia app (e.g. chain ID, gas costs). The immutable chain ID is held as a member of the client, but the other values (being variable) are cached in the State struct and can be retrieved and cached again should an attempt to submit blobs fail.

The CelestiaClient only has one public method: try_submit. This is called in the existing relayer::write::submit_with_retry function in much the same way as previously. The main difference here is that the client's try_submit method is also passed a tokio watch receiver containing the optional last error. When first submitting a given batch of blobs, the watch channel holds None. On each failed attempt, the on_retry closure sends the error returned by try_submit. On a fresh batch of blobs, a new watch channel is constructed, again initialized with None. The purpose of providing the last error to the client is that it can be interrogated for a failure due to the fee being too low. Such an error holds information on the required fee, and on reattempt, the client uses this value in preference to its own calculated fee.

The general flow inside try_submit is as follows:

  • if the previous submit attempt failed, fetch the cost info from the Celestia app and cache it
  • fetch our account number and sequence number from the Celestia app
  • create, sign and send the blob transaction via the Celestia BroadcastTx method
  • poll the Celestia app once per second via the GetTx method

Testing

Most of the new functionality has unit tests. The CelestiaClient mainly has member methods calling gRPC requests, however handling the responses is delegated to free functions. As well as code clarity, this allows for comprehensive unit tests of these functions without the need for running/mocking a Celestia server. However, there are still missing tests for the try_submit method. That might be best done via the new astria-grpc-mock crate, or at a higher level by refactoring the existing black box tests (almost all of which have been removed for now).

I have run the sequencer against a local network running on k8s and observed the blob submission to behave as expected via the logging. This includes temporarily changing the fee to a low value for the first submission attempt, which caused the retry behaviour to identify and use the correct fee from the error response of the first attempt. I have also had the smoke test pass locally.

Breaking Changelist

  • ASTRIA_SEQUENCER_RELAYER_CELESTIA_BEARER_TOKEN has been replaced by ASTRIA_SEQUENCER_RELAYER_CELESTIA_APP_KEY_FILE which expects a path to a hex-encoded secp256k1 secret key corresponding to an account on the Celestia chain.
  • ASTRIA_SEQUENCER_RELAYER_CELESTIA_ENDPOINT has been replaced by ASTRIA_SEQUENCER_RELAYER_CELESTIA_APP_GRPC_ENDPOINT which represents the endpoint of the gRPC server on the Celestia app.

Further Discussion

  • integration testing/black box testing. [Discussed, agreed to restore them in follow-up. Ticketed at Restore black box tests removed from sequencer-relayer #1008.]
  • the polling period for GetTx after a successful BroadcastTx is hard-coded to one second. Seems like this should be configurable via an env var. [Discussed, and the period is now variable depending upon the result of GetTx. Also, given that the Celestia app will likely be managed by us, we can control its rate-limits, meaning there's less likelihood of needing to change the polling period in the relayer.]

Related Issues

Closes #921.

@github-actions github-actions bot added conductor pertaining to the astria-conductor crate proto pertaining to the Astria Protobuf spec sequencer-relayer pertaining to the astria-sequencer-relayer crate cd labels Apr 16, 2024
Comment on lines +122 to +145
impl Display for GrpcResponseError {
fn fmt(&self, formatter: &mut Formatter<'_>) -> fmt::Result {
write!(
formatter,
"status: {}, message: {}, metadata: {:?}",
self.0.code(),
self.0.message(),
self.0.metadata(),
)
}
}

impl From<Status> for GrpcResponseError {
fn from(status: Status) -> Self {
Self(status)
}
}

impl std::error::Error for GrpcResponseError {
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
self.0.source()
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost equivalent to just decorating the GrpcResponseError definition with #[error(transparent)]. However, the Display impl for Status delegates to the Debug impls of its member fields including the details: Bytes field and is overly verbose. Hence I just hand-rolled these trait impls.

@Fraser999 Fraser999 force-pushed the 921-write-to-celestia-app branch from 56858c6 to 6dda430 Compare April 18, 2024 23:24
@Fraser999 Fraser999 marked this pull request as ready for review April 18, 2024 23:25
@Fraser999 Fraser999 requested review from a team as code owners April 18, 2024 23:25
Copy link
Member

@SuperFluffy SuperFluffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work, thank you for the PR.

I only have relatively minor comments, none of which I consider stopping this PR from progressing.

I have 2 big concerns, both of which I consider outside the scope of this PR:

  1. My main concern is with confirm_submission never returning with a successful result and spinning indefinitely. At the moment it will simply block indefinitely without resource. How do we decide if we should finally abort the submission and try again? Does Celestia ever report "didn't include your blobs and never will, try again"?
  2. The fee calculation is risky in that it could return very strange results without us understanding what's happening. This could break silently with Celestia updating their logs in such a way that the parsing logic still returns a number, but one that makes no sense.

/// An error while encoding a Bech32 string.
#[derive(Error, Clone, Debug)]
#[error(transparent)]
pub(in crate::relayer) struct Bech32EncodeError(#[from] bech32::EncodeError);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the advantage of wrapping this as a transparent instead of directly using bech32::EncodeError (for example, so that BuilderError::EncodeAddress { address: AccountId, source: bech32::EncodeError }?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to avoid having a third party type in our public Rust API. That would make our API vulnerable to version changes in the dependency. I approached the celestia_client module as if it were a standalone crate, since we spoke about maybe moving in that direction in the future.

// How long to wait after starting `confirm_submission` before starting to log errors.
const START_LOGGING_DELAY: Duration = Duration::from_secs(15);
// The minimum duration between logging errors.
const LOG_ERROR_INTERVAL: Duration = Duration::from_secs(5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: to some extent this requires self.get_tx to return in time. If it never returns we will never get any logs (however, this is an issue with all of our RPCs).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, agreed, but outside the scope of this PR I believe.

@github-actions github-actions bot added the ci issues that are related to ci and github workflows label Apr 25, 2024
Copy link
Member

@joroshiba joroshiba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approval on behalf of infra / api, have not vetted rust changes.

@Fraser999 Fraser999 added this pull request to the merge queue Apr 25, 2024
Merged via the queue into astriaorg:main with commit 65a22ce Apr 25, 2024
36 checks passed
@Fraser999 Fraser999 deleted the 921-write-to-celestia-app branch April 25, 2024 22:20
github-merge-queue bot pushed a commit that referenced this pull request Apr 30, 2024
## Summary
A new configuration env var has been added to allow filtering data
submission to a subset of rollups.

## Background
This allows more flexibility in what is actually submitted to the
Celestia chain.

## Changes
A new env var was added, mapped to a new field
`Config::rollup_id_filter`. This represents a list of rollups whose
transactions should be included in blobs submitted to Celestia. Rollups
not in the list do not have their data submitted. However, if the list
is empty, no filtering is applied; all rollups have their data
submitted.

The collection of filtered rollup IDs is passed down to the
`write::conversion::convert` function where sequencer blocks are
converted to blobs. The blobs are filtered inside this function, partly
since this is relatively early in the flow, but also so that the
`ConversionInfo` struct can carry information about blobs being
excluded. This data is logged at the start of `write::submit_blobs` in
an info-level log message. It might be worthwhile adding metrics around
the number of excluded rollups per submission, but I didn't see a clear
benefit to that and it can be added in a follow-up PR if required.

## Testing
- Added unit tests covering parsing of the config string to a collection
of `RollupId`s.
- ~Added a black box test where a sequencer block is constructed with
several transactions, half from included rollups and half from excluded
ones. Existing black box tests ensure that when the filter is empty, no
filtering is done.~ These all got removed in #963, and should be
restored as part of #1008.

## Breaking Changelist
- Added `ASTRIA_SEQUENCER_RELAYER_ONLY_INCLUDE_ROLLUPS` which can be
empty or can be a comma-separated list.
github-merge-queue bot pushed a commit that referenced this pull request May 3, 2024
## Summary
The black box tests were removed in a recent PR. This PR reinstates them
using the new mock gRPC framework `astria-grpc-mock`.

## Background
The tests would have needed heavy refactoring in #963, and we wanted to
avoid adding a lot more code to that already-large PR. We decided to
temporarily delete them and reinstate them using `astria-grpc-mock` to
mock responses from the Celestia app.

## Changes
The tests removed in #963 and a single test briefly added in #1001 then
removed again have been restored. I also added a new test to check the
relayer shuts down in a timely manner.

Previously the tests leaned heavily on counts of blobs received by the
mock Celestia app. The current tests retain these checks, but also query
the `/status` endpoint of the sequencer-relayer to confirm its state.
There was also a single test previously which checked the state of the
postsubmit.json file. I didn't think this was necessary given that we're
querying the state in a more "black box" way now (via the http server),
but I didn't remove the function to perform this check
(`TestSequencerRelayer::assert_state_files_are_as_expected`) pending a
decision on whether to reinstate that check or not inside the
`later_height_in_state_leads_to_expected_relay` test.

As @SuperFluffy predicted, all the new protobuf packages added in #963
had to be added to the `pbjson_build` builder to generate the serde
impls required for using them in `astria-grpc-mock`. This accounts for
the vast majority of the new LoC in this PR.

I made a few small changes to the mock framework:
- added `Mock::up_to_n_times` to avoid failures due to a single-use mock
response being sent multiple times
- added `DynamicResponse` to support constructing mock responses which
can be passed the relevant gRPC request
- changed `MockGuard::wait_until_satisfied` to take `self` by ref rather
than value, since if it consumes self and `wait_until_satisfied` times
out, the `MockGuard` panics in its `Drop` impl, meaning we don't get a
good indication from e.g. a `tokio::time::timeout` as to what went wrong

Most of the new functionality lives in `MockCelestiaAppServer` and
`TestSequencerRelayer`. For the former, all gRPCs except for
`BroadcastTx` and `GetTx` are set up to always return valid-enough
responses - i.e. these don't need to be mounted individually in every
test case. In the case of `MockCelestiaAppServer`, the new functionality
is a combination of support for mounting mock responses and functions
with timeouts to query the `/status`, `/healthz` and `/readyz` http
endpoints of the sequencer-relayer.

## Testing
These changes are tests.

Closes #1008.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cd ci issues that are related to ci and github workflows conductor pertaining to the astria-conductor crate proto pertaining to the Astria Protobuf spec sequencer-relayer pertaining to the astria-sequencer-relayer crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

relayer: switch to using celestia-app for blob writing
4 participants