
Scalability tests for beta releases #908

Open
alejandrox1 opened this issue Dec 10, 2019 · 34 comments
Assignees
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/release: Categorizes an issue or PR as relevant to SIG Release.
  • sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.
Milestone

Comments

@alejandrox1
Contributor

Current state of affairs:
We have the following jobs to gauge the quality of the current release

These run against the latest on the master branch of k/k.
These jobs provide critical signal during the release cycle.
However, after code freeze, when we reopen the master branch for the next release, we may occasionally cherry-pick multiple commits from master to the release-x.y branch.
During this period, between code thaw and the official release-x.y, we occasionally see failures in our master-informing scalability jobs and are unsure whether the changes that brought on the failure have been cherry-picked into the release-x.y branch.

The thing I want to bring up for discussion in this issue is the possibility of creating scalability jobs for the beta release (the version of the Kubernetes code from code thaw until the official release).
An additional caveat is that besides testing a certain portion of the Kubernetes source code (the contents of the release-X.Y branch from code thaw to release), we may also have to set up the tests to run with the equivalent version of https://github.com/kubernetes/perf-tests (to make sure changes to that repo don't obscure signal from k/k).
In short, what do you all think?
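
To make this a bit more concrete, below is a rough sketch of the kind of branch-pinned periodic I have in mind. It is only a sketch: the job name, interval, branch, image tag, and entrypoint are placeholders rather than an existing job, and the extra_refs wiring simply follows the usual kubernetes/test-infra Prow conventions.

```yaml
# Sketch only: name, interval, branches, image, and entrypoint are placeholders, not an existing job.
periodics:
- name: ci-kubernetes-e2e-gce-scale-performance-beta     # hypothetical job name
  interval: 24h
  decorate: true
  extra_refs:
  # Check out k/k at the release branch under test instead of master.
  - org: kubernetes
    repo: kubernetes
    base_ref: release-1.18                               # placeholder beta branch
  # Check out perf-tests at the matching version so changes there don't obscure k/k signal.
  - org: kubernetes
    repo: perf-tests
    base_ref: master                                     # placeholder; ideally a matching branch/tag
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-1.18   # placeholder image tag
      command:
      - runner.sh                                        # placeholder entrypoint; real args would mirror the 5k job
```

The second extra_refs entry is the part that matters for the perf-tests caveat above: whatever version of perf-tests we settle on would be pinned there alongside the release branch.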

Additional resources:

/cc @kubernetes/sig-scalability-feature-requests
/cc @kubernetes/release-team @kubernetes/release-engineering
/sig release
/sig scalability
/priority important-longterm
/milestone v1.18

@justaugustus
Member

xref 1.15 Retro AIs: #806

@justaugustus justaugustus transferred this issue from kubernetes/kubernetes Dec 10, 2019
@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/release Categorizes an issue or PR as relevant to SIG Release. labels Dec 10, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Dec 10, 2019
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Dec 10, 2019
@alejandrox1 alejandrox1 self-assigned this Feb 3, 2020
@alejandrox1
Contributor Author

/cc @wojtek-t @mm4tt
would love to hear your thoughts on this proposal

@wojtek-t
Member

While in general I support it, currently we don't have the resources to run this.
We're working on speeding up our tests (among other things, we're trying to merge the existing two tests into a single one: kubernetes/perf-tests#1008). We believe we will be able to speed up our 5k-node tests to hopefully take less than 8 hours.
Once this is done, we should be able to keep the 1-day frequency but run the job for both master and the k8s-beta release.

But that still requires a non-trivial amount of work on our side to speed up our tests.

@alejandrox1
Contributor Author

Thank you, @wojtek-t, for your comment. The work you are doing to speed up the tests would be of great help to us.

One thing that has come up a couple of times (please correct me if I'm wrong) is that Google is running all the scalability jobs, correct?
If possible, we would like to (at some point) move these jobs into CNCF infrastructure.

To do this we would need some idea of what it takes to run these tests, for example billing information that could be shared with us and wg-k8s-infra. That way we can ask and possibly start planning how to make the move.
What do you think?

@wojtek-t
Member

One thing that has come up a couple of times (please correct me if I'm wrong) is that Google is running all the scalability jobs, correct?
If possible, we would like to (at some point) move these jobs into CNCF infrastructure.

I think the release-blocking ones have already been moved, but TBH I don't know how to confirm it.
They are running in the kubernetes-scale project - do you know how to check whether it has already been transferred to CNCF?

@mm4tt

mm4tt commented Feb 17, 2020

Sorry for not jumping in earlier, I was OOO.
FTR, we run 100-node GCE and 500-node Kubemark tests continuously on all active release branches - https://k8s-testgrid.appspot.com/sig-scalability-gce. We've also recently started running the same tests as presubmits on those branches. They are not as sensitive as the 5k-node tests, but treating them (the 100-node GCE tests) as beta-release-blocking tests might be a good intermediate solution (if you don't do that already).
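
For reference, scoping a presubmit to a release branch is just the `branches` field on the Prow job definition. A minimal sketch follows; the job name, branch, and image here are illustrative placeholders, not the actual entries from kubernetes/test-infra.

```yaml
# Illustrative only: the real job definitions live in kubernetes/test-infra.
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-e2e-gce-100-performance        # placeholder name
    branches:
    - release-1.17                                       # run this presubmit only on the release branch
    always_run: true
    decorate: true
    spec:
      containers:
      - image: gcr.io/k8s-testimages/kubekins-e2e:latest-1.17   # placeholder image tag
        command:
        - runner.sh                                      # placeholder entrypoint
```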

@alejandrox1
Contributor Author

Sorry for the super delayed response on this: between release team work and checking on the current state of the infra I lost track of time, but a couple of details...

I think the release-blocking ones have already been moved, but TBH I don't know how to confirm it.
They are running in the kubernetes-scale project - do you know how to check whether it has already been transferred to CNCF?

Currently, CI runs (at least for GCP-based jobs) borrow projects/credentials from Boskos.
I went around asking wg-k8s-infra and it seems that all credentials (projects) are currently owned by Google.

bentheelder
Prow itself is on Google infra.
Most of the infra used from Prow is in google.com GCP projects.
A little bit is not. Including:

  • AWS accounts via CNCF (formerly Google, not for a year or so now)
  • GCB projects for some release automation, which are Google funded CNCF owned

Jobs execute arbitrary code though, so some of them could be using some other infra
Like I think we might actually still have some jobs running EKS tests :face_with_rolling_eyes:
And some of the windows testing from Azure involves images built out of band by some Azure owned process we know nothing about 🙃

alejandrox1
The pool of GCP creds that boskos uses, are all of those google-owned accounts?

bentheelder
Yes.
Also fwiw boskos only hands out project names, the same credential owns all of them :witnessprotectionparrot:

So I guess the work involved in moving the scalability tests onto CNCF resources would involve some tweaking of Boskos, in which case it would fall to us (SIG Scalability and SIG Release) to work with wg-k8s-infra and figure this out. wdyt @justaugustus @mm4tt @wojtek-t? Should we proceed with this and try to figure out a way to use CNCF resources for the 5k scalability job?


FTR, we run 100-node GCE and 500-node Kubemark tests continuously on all active release branches - https://k8s-testgrid.appspot.com/sig-scalability-gce. We've also recently started running the same tests as presubmits on those branches. They are not as sensitive as the 5k-node tests, but treating them (the 100-node GCE tests) as beta-release-blocking tests might be a good intermediate solution (if you don't do that already).

Thank you, Matt, for mentioning these. We on the release team do consider these jobs release blocking as well.

@alejandrox1 alejandrox1 modified the milestones: v1.18, v1.19 Mar 22, 2020
@mm4tt

mm4tt commented Mar 23, 2020

So I guess the work involved in moving the scalability tests onto CNCF resources would involve some tweaking of Boskos, in which case it would fall to us (SIG Scalability and SIG Release) to work with wg-k8s-infra and figure this out. wdyt Stephen Augustus, Matt Matejczyk, Wojciech Tyczynski? Should we proceed with this and try to figure out a way to use CNCF resources for the 5k scalability job?

Yeah, we should do that. I always thought these tests had already been transferred to CNCF. Let me know what I can do to help with the transfer.

@wojtek-t
Member

@alejandrox1 - are you really sure we rely on Boskos for the 5k-node tests?
We don't use a pool of projects - we explicitly set the project for those (there is exactly one predefined project):
https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L29
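
To illustrate the distinction: a Boskos-backed job asks kubetest to lease a project of some resource type, whereas the 5k-node job passes a fixed project directly. A rough sketch of the two shapes is below; the flags are the kubetest ones as I remember them and the resource-type name is an assumption, so treat the linked config as the source of truth.

```yaml
# Rough contrast only; see the linked sig-scalability-release-blocking-jobs.yaml for the real definition.
# 5k-node release-blocking job: the project is pinned explicitly, so no Boskos lease is involved.
args:
- --provider=gce
- --gcp-project=kubernetes-scale             # the single predefined project
# A Boskos-backed job would instead request a leased project by resource type, e.g.:
# - --gcp-project-type=scalability-project   # assumed resource-type name
```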

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 21, 2020
@wojtek-t
Member

/remove-lifecycle stale

I would like us to get there, but we won't in the 1.19 timeframe. Hopefully 1.20...

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 22, 2020
@alejandrox1
Contributor Author

Coming back to this one (excuse the delay).
A couple of things to mention - the 5k scalability job does not need Boskos, as @wojtek-t mentioned.

I think the way forward is to work with wg-k8s-infra.
There is an open issue on identifying the infrastructure needed to run scalability jobs on CNCF resources: kubernetes/k8s.io#851
So I guess we can collaborate on that, move the existing scalability job over to CNCF resources, and then work on this one.
/milestone v1.20

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.19, v1.20 Sep 18, 2020
@mm4tt

mm4tt commented Sep 21, 2020

Sounds good, let me know if there is anything I can help with.

@wojtek-t
Member

wojtek-t commented Nov 3, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 3, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2022
@wojtek-t
Member

wojtek-t commented Feb 1, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2022
@wojtek-t
Member

wojtek-t commented May 2, 2022

/remove-lifecycle stale
/kind bug
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 30, 2022
@k8s-triage-robot

The issue has been marked as an important bug and triaged.
Such issues are automatically marked as frozen when hitting the rotten state
to avoid missing important bugs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Aug 30, 2022
@jeremyrickard
Contributor

Picking this up for v1.26

/milestone v1.26

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.22, v1.26 Sep 6, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 18, 2024
@wojtek-t
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 31, 2024
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 30, 2025