fast import: tenant broken if import cancelled / errors out before /status/pgdata reaches done state #10191

Open
Tracked by #10188
problame opened this issue Dec 18, 2024 · 2 comments
Labels
t/bug Issue Type: Bug

Comments

@problame
Contributor

If the import flow stops due to an error or cancellation (PS restart, tenant migration), the index part in S3 is left with an invalid disk_consistent_lsn, and the tenant will fail in load_remote_timelines in any location where it is subsequently attached:

Timeline 888a599bcfba087241276c6df8efc311/a6aedf56204960c7b5b9fdc857dff15c has invalid disk_consistent_lsn

More context: https://neondb.slack.com/archives/C033RQ5SPDH/p1734549772067199?thread_ts=1734368383.258759&cid=C033RQ5SPDH

When fixing this issue, also think about resilience to tenant migrations etc. in other places, and add regression tests for it.
There probably won't be time to get the kind of excellent test coverage we have for the timeline deletion flow, but think about it in the same way.

Evidence

The index part looks like this:

cs@cs-neon-mbp:[~/Desktop]: aws s3 cp s3://neon-dev-storage-us-east-2/pageserver/v1/tenants/888a599bcfba087241276c6df8efc311/timelines/a6aedf56204960c7b5b9fdc857dff15c/index_part.json-00000001 - | jq
{
  "version": 10,
  "import_pgdata": {
    "V1": {
      "InProgress": {
        "idempotency_key": "todo",
        "location": {
          "AwsS3": {
            "region": "us-east-2",
            "bucket": "neon-dev-bulk-import-us-east-2",
            "key": "import-pgdata/fast-import/v1/br-little-lab-w28y1bqu"
          }
        },
        "started_at": "2024-12-18T15:33:22.045125540"
      }
    }
  },
  "layer_metadata": {},
  "disk_consistent_lsn": "0/0",
  "metadata_bytes": {
    "disk_consistent_lsn": "0/0",
    "prev_record_lsn": null,
    "ancestor_timeline": null,
    "ancestor_lsn": "0/0",
    "latest_gc_cutoff_lsn": "0/0",
    "initdb_lsn": "0/0",
    "pg_version": 15
  },
  "lineage": {}
}
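
As a rough way to spot this state from outside the pageserver, here is a minimal Python sketch that inspects a downloaded index part. It assumes that the `InProgress` import marker and a `disk_consistent_lsn` of `0/0` are the relevant symptoms, as the dump above suggests. The function names are hypothetical and this is not the pageserver's actual validation logic.

```python
def parse_lsn(text: str) -> int:
    """Parse a PostgreSQL-style LSN such as "0/16B3748" into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def diagnose_index_part(index_part: dict) -> list[str]:
    """Return human-readable findings for an index_part.json document.

    Assumption: an LSN of 0/0 is what gets reported as "invalid";
    the field names are taken from the dump in this issue.
    """
    problems = []

    # The import marker is nested as import_pgdata -> V1 -> InProgress.
    v1 = (index_part.get("import_pgdata") or {}).get("V1") or {}
    if "InProgress" in v1:
        problems.append("import_pgdata is still InProgress")

    # disk_consistent_lsn appears both at the top level and in metadata_bytes.
    top_lsn = index_part.get("disk_consistent_lsn", "0/0")
    if parse_lsn(top_lsn) == 0:
        problems.append("top-level disk_consistent_lsn is 0/0")

    meta = index_part.get("metadata_bytes") or {}
    if parse_lsn(meta.get("disk_consistent_lsn", "0/0")) == 0:
        problems.append("metadata_bytes.disk_consistent_lsn is 0/0")

    return problems
```

Feeding it the JSON above (for example via `json.load(sys.stdin)` on the output of the `aws s3 cp ... -` command) would flag all three conditions.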
@NanoBjorn
Contributor

@problame @jcsp as a precautionary measure, cplane can remove such branches after a size-dependent timeout, wdyt? E.g. if we have not marked the branch ready within 2h * logical_size / 50gb, then we are unlikely to ever mark it ready and we can delete it?
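
To make the proposed rule concrete, here is a small Python sketch of the timeout arithmetic. The function name and parameters are hypothetical, and it assumes "50gb" means 50 GiB; the comment does not say which unit is meant.

```python
from datetime import timedelta

GIB = 1024 ** 3


def import_timeout(
    logical_size_bytes: int,
    hours_per_unit: float = 2.0,   # the "2h" in the proposal
    unit_bytes: int = 50 * GIB,    # the "50gb" in the proposal (GiB assumed)
) -> timedelta:
    """Timeout after which an unfinished import would be considered dead."""
    return timedelta(hours=hours_per_unit * logical_size_bytes / unit_bytes)
```

With these assumptions, `import_timeout(100 * GIB)` is 4 hours. Very small imports get a timeout close to zero, so some minimum would probably be needed; the proposal does not specify one.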

@problame
Contributor Author

With the current implementation / representation of the timeline importing state inside the pageserver, a delete will fail with a 404.
