
Scalability tests for beta releases #908

Open
alejandrox1 opened this issue Dec 10, 2019 · 34 comments
Assignees
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/release: Categorizes an issue or PR as relevant to SIG Release.
  • sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.
Milestone

Comments

@alejandrox1
Contributor

Current state of affairs:
We have the following jobs to gauge the quality of the current release

These run against the latest on the master branch of k/k.
These jobs provide critical signal during the release cycle.
However, after code freeze, when we reopen the master branch for the next release, we may occasionally cherry-pick multiple commits from master to the release-x.y branch.
During this period, between code thaw and the official release-x.y, we occasionally see failures in our master-informing scalability jobs and are unsure whether the changes that brought on the failure have been cherry-picked into the release-x.y branch.

The thing I want to bring up for discussion in this issue is the possibility of creating scalability jobs for the beta release (the version of the Kubernetes code from code thaw until the official release).
An additional caveat is that besides testing a certain portion of the Kubernetes source code (the contents of the release-X.Y branch from code thaw to release), we may also have to set up the tests to run with the equivalent version of https://github.com/kubernetes/perf-tests (to make sure changes to that repo don't obscure signal from k/k).
In short, what do you all think?
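
To make this a bit more concrete, below is a rough sketch of the kind of branch-pinned periodic I have in mind. It is only a sketch: the job name, interval, branch, image tag, and entrypoint are placeholders rather than an existing job, and the extra_refs wiring simply follows the usual kubernetes/test-infra Prow conventions.

```yaml
# Sketch only: name, interval, branches, image, and entrypoint are placeholders, not an existing job.
periodics:
- name: ci-kubernetes-e2e-gce-scale-performance-beta     # hypothetical job name
  interval: 24h
  decorate: true
  extra_refs:
  # Check out k/k at the release branch under test instead of master.
  - org: kubernetes
    repo: kubernetes
    base_ref: release-1.18                               # placeholder beta branch
  # Check out perf-tests at the matching version so changes there don't obscure k/k signal.
  - org: kubernetes
    repo: perf-tests
    base_ref: master                                     # placeholder; ideally a matching branch/tag
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-1.18   # placeholder image tag
      command:
      - runner.sh                                        # placeholder entrypoint; real args would mirror the 5k job
```

The second extra_refs entry is the part that matters for the perf-tests caveat above: whatever version of perf-tests we settle on would be pinned there alongside the release branch.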

Additional resources:

/cc @kubernetes/sig-scalability-feature-requests
/cc @kubernetes/release-team @kubernetes/release-engineering
/sig release
/sig scalability
/priority important-longterm
/milestone v1.18

@justaugustus
Member

xref 1.15 Retro AIs: #806

@justaugustus justaugustus transferred this issue from kubernetes/kubernetes Dec 10, 2019
@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/release Categorizes an issue or PR as relevant to SIG Release. labels Dec 10, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Dec 10, 2019
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Dec 10, 2019
@alejandrox1 alejandrox1 self-assigned this Feb 3, 2020
@alejandrox1
Contributor Author

/cc @wojtek-t @mm4tt
would love to hear your thoughts on this proposal

@wojtek-t
Member

While in general I support it, currently we don't have the resources to run this.
We're working on speeding up our tests (among other things, we're trying to merge the existing two tests into a single one: kubernetes/perf-tests#1008). We believe we will be able to speed up our 5k-node tests to hopefully take less than 8 hours.
Once this is done, we should be able to keep the 1-day frequency but run the job for both master and the k8s-beta release.

But that still requires a non-trivial amount of work on our side to speed up our tests.

@alejandrox1
Contributor Author

Thank you, @wojtek-t, for your comment. The work you are doing to speed up the tests would be of great help to us.

One thing that has come up a couple of times (please correct me if I'm wrong) is that Google is running all the scalability jobs, correct?
If possible, we would like to (at some point) move these jobs into CNCF infrastructure.

To do this we would need some idea of what it takes to run these tests, for example billing information that could be shared with us and wg-k8s-infra. That way we can ask and possibly start planning how to make the move.
What do you think?

@wojtek-t
Member

One thing that has come up a couple of times (please correct me if I'm wrong) is that Google is running all the scalability jobs, correct?
If possible, we would like to (at some point) move these jobs into CNCF infrastructure.

I think the release-blocking ones have already been moved, but TBH I don't know how to confirm it.
They are running in the kubernetes-scale project - do you know how to check whether it has already been transferred to CNCF?

@mm4tt

mm4tt commented Feb 17, 2020

Sorry for not jumping in earlier, I was OOO.
FTR, we run 100-node GCE and 500-node Kubemark tests continuously on all active release branches - https://k8s-testgrid.appspot.com/sig-scalability-gce. We've also recently started running the same tests as presubmits on those branches. They are not as sensitive as the 5k-node tests, but treating them (the 100-node GCE tests) as beta-release-blocking tests might be a good intermediate solution (if you don't do that already).
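
For reference, scoping a presubmit to a release branch is just the `branches` field on the Prow job definition. A minimal sketch follows; the job name, branch, and image here are illustrative placeholders, not the actual entries from kubernetes/test-infra.

```yaml
# Illustrative only: the real job definitions live in kubernetes/test-infra.
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-e2e-gce-100-performance        # placeholder name
    branches:
    - release-1.17                                       # run this presubmit only on the release branch
    always_run: true
    decorate: true
    spec:
      containers:
      - image: gcr.io/k8s-testimages/kubekins-e2e:latest-1.17   # placeholder image tag
        command:
        - runner.sh                                      # placeholder entrypoint
```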

@alejandrox1
Contributor Author

Sorry for the super delayed response on this: between release team work and checking on the current state of the infra I lost track of time, but a couple of details...

I think the release-blocking ones have already been moved, but TBH I don't know how to confirm it.
They are running in the kubernetes-scale project - do you know how to check whether it has already been transferred to CNCF?

Currently, CI runs (at least for GCP-based jobs) borrow projects/credentials from Boskos.
I went around asking wg-k8s-infra and it seems that all credentials (projects) are currently owned by Google.

bentheelder
Prow itself is on Google infra.
Most of the infra used from Prow is in google.com GCP projects.
A little bit is not. Including:

  • AWS accounts via CNCF (formerly Google, not for a year or so now)
  • GCB projects for some release automation, which are Google funded CNCF owned

Jobs execute arbitrary code though, so some of them could be using some other infra
Like I think we might actually still have some jobs running EKS tests :face_with_rolling_eyes:
And some of the windows testing from Azure involves images built out of band by some Azure owned process we know nothing about 🙃

alejandrox1
The pool of GCP creds that boskos uses, are all of those google-owned accounts?

bentheelder
Yes.
Also fwiw boskos only hands out project names, the same credential owns all of them :witnessprotectionparrot:

So I guess the work involved in moving the scalability tests onto CNCF resources would involve some tweaking of Boskos, in which case it would fall to us (SIG Scalability and SIG Release) to work with wg-k8s-infra and figure this out. wdyt @justaugustus @mm4tt @wojtek-t? Should we proceed with this and try to figure out a way to use CNCF resources for the 5k scalability job?


FTR, we run 100-node GCE and 500-node Kubemark tests continuously on all active release branches - https://k8s-testgrid.appspot.com/sig-scalability-gce. We've also recently started running the same tests as presubmits on those branches. They are not as sensitive as the 5k-node tests, but treating them (the 100-node GCE tests) as beta-release-blocking tests might be a good intermediate solution (if you don't do that already).

Thank you, Matt, for mentioning these. We on the release team do consider these jobs release blocking as well.

@alejandrox1 alejandrox1 modified the milestones: v1.18, v1.19 Mar 22, 2020
@mm4tt

mm4tt commented Mar 23, 2020

So I guess the work involved in moving the scalability tests onto CNCF resources would involve some tweaking of Boskos, in which case it would fall to us (SIG Scalability and SIG Release) to work with wg-k8s-infra and figure this out. wdyt Stephen Augustus, Matt Matejczyk, Wojciech Tyczynski? Should we proceed with this and try to figure out a way to use CNCF resources for the 5k scalability job?

Yeah, we should do that. I always thought these tests had already been transferred to CNCF. Let me know what I can do to help with the transfer.

@wojtek-t
Member

@alejandrox1 - are you really sure we rely on Boskos for the 5k-node tests?
We don't use a pool of projects - we explicitly set the project for those (there is exactly one predefined project):
https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L29
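
To illustrate the distinction: a Boskos-backed job asks kubetest to lease a project of some resource type, whereas the 5k-node job passes a fixed project directly. A rough sketch of the two shapes is below; the flags are the kubetest ones as I remember them and the resource-type name is an assumption, so treat the linked config as the source of truth.

```yaml
# Rough contrast only; see the linked sig-scalability-release-blocking-jobs.yaml for the real definition.
# 5k-node release-blocking job: the project is pinned explicitly, so no Boskos lease is involved.
args:
- --provider=gce
- --gcp-project=kubernetes-scale             # the single predefined project
# A Boskos-backed job would instead request a leased project by resource type, e.g.:
# - --gcp-project-type=scalability-project   # assumed resource-type name
```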

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 21, 2020
@wojtek-t
Member

/remove-lifecycle stale

I would like us to get there, but we won't in the 1.19 timeframe. Hopefully 1.20...

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 22, 2020
@alejandrox1
Contributor Author

Coming back to this one (excuse the delay).
A couple of things to mention - the 5k scalability job does not need Boskos, as @wojtek-t mentioned.

I think the way forward is to work with wg-k8s-infra.
There is an open issue on identifying the infrastructure needed to run scalability jobs on CNCF resources: kubernetes/k8s.io#851
So I guess we can collaborate on that, move the existing scalability job over to CNCF resources, and then work on this one.
/milestone v1.20

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.19, v1.20 Sep 18, 2020
@mm4tt

mm4tt commented Sep 21, 2020

Sounds good, let me know if there is anything I can help with.

@wojtek-t
Member

wojtek-t commented Nov 3, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 3, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2022
@wojtek-t
Member

wojtek-t commented Feb 1, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2022
@wojtek-t
Member

wojtek-t commented May 2, 2022

/remove-lifecycle stale
/kind bug
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 30, 2022
@k8s-triage-robot

The issue has been marked as an important bug and triaged.
Such issues are automatically marked as frozen when hitting the rotten state
to avoid missing important bugs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Aug 30, 2022
@jeremyrickard
Contributor

Picking this up for v1.26

/milestone v1.26

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.22, v1.26 Sep 6, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 18, 2024
@wojtek-t
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 31, 2024
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 30, 2025