Skip to content
This repository has been archived by the owner on Mar 9, 2019. It is now read-only.

tests/timer failures on FreeBSD #48

Open
kev009 opened this issue Dec 23, 2013 · 12 comments
Open

tests/timer failures on FreeBSD #48

kev009 opened this issue Dec 23, 2013 · 12 comments

Comments

@kev009
Copy link
Contributor

kev009 commented Dec 23, 2013

I haven't dug into this yet. I'm not getting these consistently, sometime the run will be successful. The affinity failed message is always present ("nbio" and "job" in the message were added by me).

1387793688.893 err: /1 nbio failed to set thread 0x80240e200 affinity
1387793688.893 err: /2 nbio failed to set thread 0x80280e200 affinity
1387793688.893 err: /3 nbio failed to set thread 0x802c0e200 affinity
1387793688.894 err: /4 nbio failed to set thread 0x80300e200 affinity
1387793688.894 err: /5 nbio failed to set thread 0x80340e200 affinity
1387793688.894 err: /6 nbio failed to set thread 0x80380e200 affinity
1387793688.894 err: /7 nbio failed to set thread 0x803c0e200 affinity
1387793688.894 err: /8 nbio failed to set thread 0x80400e200 affinity
1387793688.894 err: /9 nbio failed to set thread 0x80440e200 affinity
1387793688.894 err: /10 nbio failed to set thread 0x80480e200 affinity
1387793688.894 err: /11 nbio failed to set thread 0x804c0e200 affinity
1387793688.895 err: dns/13 job failed to set thread 0x80540e200 affinity
1387793688.895 err: dns/12 job failed to set thread 0x80500e200 affinity
1387793688.893 err: phenom:sched/0 nbio failed to set thread 0x80209a200 affinity
#     Failed test (tests/timer.c:record_tick() at line 38)
#     Failed test (tests/timer.c:record_tick() at line 38)
# Looks like you failed 2 tests of 8.
# ELAPSED: 749ms

1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
ok 4 - 100ms resolution: diff=106
ok 5 - 100ms resolution: diff=106
not ok 6 - 100ms resolution: diff=213
not ok 7 - 100ms resolution: diff=214
ok 8 - PH_OK == ph_sched_run()
wez added a commit that referenced this issue Dec 23, 2013
It was succeeding but we were claiming failure

Reported in #48
@wez
Copy link
Contributor

wez commented Dec 23, 2013

I see these occasional timer interval failures in my fbsd VM and had chalked them up to VM timing issues.
The problem that the test is detecting is that we're late in dispatching something that should trigger approximately every 100ms.

I don't know (because I haven't delved in!) whether this lateness is due to timer granularity in the system, or whether the timer is being rolled into the next 100ms window by the Ensure that we never schedule in the past logic in timerwheel.c.

@kev009
Copy link
Contributor Author

kev009 commented Dec 23, 2013

Seeing it on HW too with Westmere timers. It's easy to trigger by making all the cores busy.

It does seem to be hitting the code you identified. Any intuition on a fix?

@wez
Copy link
Contributor

wez commented Dec 23, 2013

Instrumenting shows this:

./tests/timer.t
1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
1387834193.788 err: phenom:sched/0 timer=0x7fffffffd648 due={1387834193,788433} next={1387834193,888433}
1387834193.788 err: phenom:sched/0 due < next so make due = next
1387834193.788 err: phenom:sched/0 diff=0 (bucket 0)
1387834193.897 err: phenom:sched/0 tick
ok 4 - 100ms resolution: diff=109
1387834193.897 err: phenom:sched/0 timer=0x800a6d088 due={1387834198,788433} next={1387834193,988433}
1387834193.897 err: phenom:sched/0 diff=48 (bucket 0)
1387834193.897 err: phenom:sched/0 timer=0x7fffffffd648 due={1387834193,997860} next={1387834193,988433}
1387834193.897 err: phenom:sched/0 diff=0 (bucket 0)
1387834194.008 err: phenom:sched/0 tick
ok 5 - 100ms resolution: diff=110
1387834194.008 err: phenom:sched/0 timer=0x7fffffffd648 due={1387834194,108020} next={1387834194,88433}
1387834194.008 err: phenom:sched/0 diff=1 (bucket 0)
1387834194.117 err: phenom:sched/0 tick
1387834194.227 err: phenom:sched/0 tick
not ok 6 - 100ms resolution: diff=219
#     Failed test (tests/timer.c:record_tick() at line 38)
1387834194.227 err: phenom:sched/0 timer=0x7fffffffd648 due={1387834194,327806} next={1387834194,288433}
1387834194.227 err: phenom:sched/0 diff=1 (bucket 0)
1387834194.336 err: phenom:sched/0 tick
1387834194.447 err: phenom:sched/0 tick
not ok 7 - 100ms resolution: diff=219
#     Failed test (tests/timer.c:record_tick() at line 38)
ok 8 - PH_OK == ph_sched_run()
# Looks like you failed 2 tests of 8.
# ELAPSED: 660ms

looks like both of the late cases are when diff > 0 in compute_list

@wez
Copy link
Contributor

wez commented Dec 23, 2013

I'm not 100% clear on why this isn't behaving. My best guess is that we're hitting a combination of factors:

  • kqueue seems consistently 10ms late in my VM (maybe relevant: http://unix.derkeiler.com/Mailing-Lists/FreeBSD/hackers/2012-07/msg00307.html)
  • the due time for timeouts is based off ph_time_now, which is the current gettimeofday from the scheduler wakeup, and NOT the wheel tick times that the wheel intervals are based on
  • If the timer due time is sufficiently far away from the wheel due time, and kqueue is late, we could miss a tick and roll into the next one

I tried making the kevent setup for the timer use 1/2, 1/4, 1/8th granularities to make sure that we're waking up often enough, and they all tend to miss the precise due time, even when diff==0, and we can still be late.

It's also possible that the timerwheel code is subtly wrong, but it is weird that we only see these failures on freebsd

@kev009
Copy link
Contributor Author

kev009 commented Dec 24, 2013

Documenting for others, when running https://gist.github.com/kev009/8106940 I see at least 1 ms precision using kqueue which seems to match OS X using timerwheel. Looking into the second bullet.

@erikarn
Copy link

erikarn commented Oct 9, 2014

I think this was just fixed in FreeBSD-HEAD in r272528. Is anyone able to re-test it?

@kev009
Copy link
Contributor Author

kev009 commented Oct 10, 2014

@erikarn it looks a lot better in the happy case, within 1ms.

@wez in some unhappy case, it seems to be getting chucked into a different schedule (see final run). Any thoughts?

FreeBSD kev-ws-aurora 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r272152: Thu Sep 25 22:00:23 MST 2014     root@kev-ws-aurora:/usr/obj/usr/src/sys/AURORA  amd64

1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
ok 4 - 100ms resolution: diff=104
ok 5 - 100ms resolution: diff=106
ok 6 - 100ms resolution: diff=106
not ok 7 - 100ms resolution: diff=210
ok 8 - PH_OK == ph_sched_run()
1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
ok 4 - 100ms resolution: diff=104
ok 5 - 100ms resolution: diff=101
ok 6 - 100ms resolution: diff=103
ok 7 - 100ms resolution: diff=107
ok 8 - PH_OK == ph_sched_run()
1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
ok 4 - 100ms resolution: diff=106
ok 5 - 100ms resolution: diff=106
ok 6 - 100ms resolution: diff=103
ok 7 - 100ms resolution: diff=100
ok 8 - PH_OK == ph_sched_run()
1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
ok 4 - 100ms resolution: diff=106
ok 5 - 100ms resolution: diff=107
ok 6 - 100ms resolution: diff=106
ok 7 - 100ms resolution: diff=100
ok 8 - PH_OK == ph_sched_run()
1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
ok 4 - 100ms resolution: diff=107
ok 5 - 100ms resolution: diff=107
ok 6 - 100ms resolution: diff=106
not ok 7 - 100ms resolution: diff=213
ok 8 - PH_OK == ph_sched_run()


FreeBSD kev-ws-aurora 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r272902: Fri Oct 10 13:22:55 MST 2014     root@kev-ws-aurora:/usr/obj/usr/src/sys/AURORA  amd64


$ ./timer.t
1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
ok 4 - 100ms resolution: diff=100
ok 5 - 100ms resolution: diff=100
not ok 6 - 100ms resolution: diff=136
#     Failed test (tests/timer.c:record_tick() at line 38)
not ok 7 - 100ms resolution: diff=63
#     Failed test (tests/timer.c:record_tick() at line 38)
ok 8 - PH_OK == ph_sched_run()

1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
ok 4 - 100ms resolution: diff=100
ok 5 - 100ms resolution: diff=100
ok 6 - 100ms resolution: diff=100
ok 7 - 100ms resolution: diff=99
ok 8 - PH_OK == ph_sched_run()

1..8
ok 1 - PH_OK == ph_nbio_init(0)
ok 2 - PH_OK == ph_job_init(&timer)
ok 3 - PH_OK == ph_job_set_timer_at(&timer, last_tick)
ok 4 - 100ms resolution: diff=100
ok 5 - 100ms resolution: diff=99
not ok 6 - 100ms resolution: diff=139
#     Failed test (tests/timer.c:record_tick() at line 38)
not ok 7 - 100ms resolution: diff=160
#     Failed test (tests/timer.c:record_tick() at line 38)
ok 8 - PH_OK == ph_sched_run()

@kev009
Copy link
Contributor Author

kev009 commented Oct 21, 2014

@erikarn I think the other failures are related to load/power management. Not sure if there is a bug report or feature request there?

But I think this can be closed out for phenom.

@erikarn
Copy link

erikarn commented Oct 21, 2014

You can file a FreeBSD bug if you can reproduce a timer problem with power
management. :)

-a

On 21 October 2014 15:59, Kevin Bowling [email protected] wrote:

@erikarn https://github.com/erikarn I think the other failures are
related to load/power management. Not sure if there is a bug report or
feature request there?

But I think this can be closed out for phenom.


Reply to this email directly or view it on GitHub
#48 (comment).

@ghost
Copy link

ghost commented Aug 4, 2015

Thank you for reporting this issue and appreciate your patience. We've notified the core team for an update on this issue. We're looking for a response within the next 30 days or the issue may be closed.

@kev009
Copy link
Contributor Author

kev009 commented Aug 4, 2015

worst bot evar

@wez
Copy link
Contributor

wez commented Aug 4, 2015

:-/

@wez wez added the do-not-reap label Aug 5, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants