Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests: flaky failures #464

Closed
inge4pres opened this issue May 10, 2024 · 4 comments · Fixed by #568
Closed

Tests: flaky failures #464

inge4pres opened this issue May 10, 2024 · 4 comments · Fixed by #568
Assignees
Labels
bug Something isn't working good first issue Good for newcomers impact:high urgency:low

Comments

@inge4pres
Copy link
Contributor

Describe the current state

During the development of #447 I noticed some flaky tests, passing locally most of the times but not in CI:

  • consumer_test.go:224 -> ProducedWithHeaders
    metrics_test.go:64: [Metrics Data not equal: Sum[int64] not equal: Sum DataPoints not equal:
        missing expected values:
  • consumer_test.go:464 --- FAIL: TestConsumerGracefulShutdown (0.03s)
    --- FAIL: TestConsumerGracefulShutdown/AtLeastOnceDelivery (0.01s)
    --- PASS: TestConsumerGracefulShutdown/AtMostOnceDelivery (0.02s)
    
            	Error Trace:	/home/runner/work/apm-queue/apm-queue/kafka/consumer_test.go:464
        	            				/home/runner/work/apm-queue/apm-queue/kafka/consumer_test.go:469
        	Error:      	Not equal: 
        	            	expected: 2
        	            	actual  : 1
        	Test:       	TestConsumerGracefulShutdown/AtLeastOnceDelivery
  • TestProducerMetrics

Describe the desired state

The tests are consistently reporting either a failure or a success.

@inge4pres inge4pres added bug Something isn't working good first issue Good for newcomers impact:high urgency:low labels May 10, 2024
@inge4pres
Copy link
Contributor Author

inge4pres commented May 13, 2024

For TestProducerMetrics: running it with

 go test -v -count=10 -run="TestProducerMetrics/DeadlineExceeded" ./kafka/...

most certainly produces a panic

=== RUN   TestProducerMetrics/DeadlineExceeded
    metrics_test.go:65: 
                Error Trace:    /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:65
                                                        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123
                Error:          Not equal: 
                                expected: 2
                                actual  : 1
                Test:           TestProducerMetrics/DeadlineExceeded
    metrics_test.go:67: producer.messages.count
    metrics_test.go:71: [Name not equal:
        expected: messaging.kafka.connect_errors.count
        actual: producer.messages.count Description not equal:
        expected: Total number of connection errors, by broker
        actual: The number of messages produced Metrics Data not equal: Sum[int64] not equal: Sum DataPoints not equal:
        missing expected values:
        metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[1]attribute.KeyValue{attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}}}}, StartTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Time:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Value:1, Exemplars:[]metricdata.Exemplar[int64](nil)}
        unexpected additional values:
        metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[7]attribute.KeyValue{attribute.KeyValue{Key:"error_reason", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"timeout", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.destination.name", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"default-topic", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.kafka.destination.partition", Value:attribute.Value{vtype:2, numeric:0x0, stringly:"", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}, attribute.KeyValue{Key:"namespace", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space", slice:interface {}(nil)}}, attribute.KeyValue{Key:"outcome", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"failure", slice:interface {}(nil)}}, attribute.KeyValue{Key:"topic", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space-default-topic", slice:interface {}(nil)}}}}}, StartTime:time.Date(2024, time.May, 13, 17, 24, 7, 997820327, time.Local), Time:time.Date(2024, time.May, 13, 17, 24, 7, 998078794, time.Local), Value:3, Exemplars:[]metricdata.Exemplar[int64](nil)}
        ]
--- FAIL: TestProducerMetrics (0.00s)
    --- FAIL: TestProducerMetrics/DeadlineExceeded (0.00s)
panic: runtime error: index out of range [1] with length 1 [recovered]
        panic: runtime error: index out of range [1] with length 1

goroutine 7 [running]:
testing.tRunner.func1.2({0xdab780, 0xc000338078})
        /usr/local/go/src/testing/testing.go:1545 +0x238
testing.tRunner.func1()
        /usr/local/go/src/testing/testing.go:1548 +0x397
panic({0xdab780?, 0xc000338078?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func1({0xf62858, 0xc00024a5b0}, 0xc0001ff6c0, {0xf604c8, 0xc000180c60}, {0xf62c80, 0xc0000bad20}, {0xc000055a80, 0x2, 0x2}, ...)
        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:70 +0x6be
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func2(0xc0001ff6c0?)
        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123 +0xd68
testing.tRunner(0xc0001ff6c0, 0xc000053590)
        /usr/local/go/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 6
        /usr/local/go/src/testing/testing.go:1648 +0x3ad
FAIL    github.com/elastic/apm-queue/v2/kafka   0.008s
FAIL

From what I could see, it appears that the commit 49cfc50 introduced even more flakyness.

The offending line i if !slices.Contains(ignore, metrics[i].Name) { because when the returned metrics are less than want, we can still get to the loop (require would halt the test, but assert is used so we keep executing).

@kyungeunni
Copy link
Contributor

One more flaky test I've discovered, TestTopicCreatorCreateTopics fails occasionally, with logs of data race detected.
example: https://github.com/elastic/apm-queue/actions/runs/9282217459/job/25539755922#step:4:1059

@simitt
Copy link
Contributor

simitt commented Aug 9, 2024

@rubvs can you confirm that this can be closed now that #537 got merged?

@rubvs
Copy link
Contributor

rubvs commented Aug 9, 2024

TestProducerMetrics are still failing with:

go test -v -count=10 -failfast ./kafka/...

However, TestConsumerGracefulShutdown seems to be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers impact:high urgency:low
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants