-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reverse mode apply iterate #1485
Conversation
|
Benchmark Results
Benchmark PlotsA plot of the benchmark results have been uploaded as an artifact to the workflow run for this PR. |
@vchuravy per signal convo, my best memory of where this left off (@gbaraldi please correct me). We find a segfault. Specifically we are apparently setting the name of a datatype to something new which is wrong (specifically tuple.name or something). Backtracing to the culprit we find that it is getting overwritten in the += of the reverse pass. Specifically we have a for loop of apply generics (for the iterate) and also an unstable get nth index of. These result in an allocation of shadow pointers from their results for use in the reverse pass. We shuold have a cache of Tuple{Ref{Float64},Ref{Float64}}. The tuple of size 2 is from the batch. The ref as that is what the shadow from the augmented primal of unstable get nth index of should return. In the reverse pass, we should have two iterations -- just like the fwd pass has two iterations. It is on the last iteration of the reverse pass [aka the theoretical index 0] which is the issue. The cache does not actually contain 2 (for the two iters of loop) Tuple{Ref{Float64},Ref{Float64}}'s. Instead it contains as first element (1=Tuple.name, 2=Tuple.name) and then (1=RefValue{1.0}, 2=RefValue{1.0}) [random float values added, idr what they were]. We had earlier spent a rabbit hole looking at the assembly of the indexing since LLVM appears to have split the induction variable into two counters, one which is negative and increments up and the other which goes down. One is used for indexing, the other for the revere pass loop exit. This was originally suspicious since the indexing one on the last reverse pass iteration (aka i==0) seemed to index at an offset of -8 possibly implying the loop was doing an incorrect iteration. However this negation was actually expected as it basically was LLVM strength reduction moving part of the indexing out of the loop. So the fundamental issue here is why is the 0th element of the cache busted. So far, no clue. It is possible this could be simplified a bit from julia end by basically turning it into explicit iterate calls and then doing some more precise type stability -- but maybe not. For obvious reasons of unknown loop size we use the exponential allocation which makes things more obnoxious to look at per the generated junk IR. However, it may be a bug in the julia side of the expontential copy? Just a guess though as we ended very very stuck. |
We crash due a data corruption:
Reverse executing
Second overwrite:
Since this memory is allocated in the system image (it's the name of Tuple) it smells more like a calling mistake.
|
So how did these values end up in the cache? |
So I find:
A bit fishy. The code got changed here but |
@vchuravy I don't think so? The calling conv of that is which dptr (and dptrs for extra batches) are the shadows, not vals? |
if !is_constant_value(gutils, ops[1])
inp = invert_pointer(gutils, ops[1], B)
inp = lookup_value(gutils, inp, B)
if width == 1
inps = [inp]
else
inps = LLVM.Value[]
for w in 1:width
push!(inps, extract_value!(B, inp, w-1))
end
end
else
inp = new_from_original(gutils, ops[1])
inp = lookup_value(gutils, inp, B)
inps = [inp]
end
vals = LLVM.Value[]
push!(vals, inps[1])
push!(vals, tape)
sym = new_from_original(gutils, ops[2])
sym = lookup_value(gutils, sym, B)
sym = (sizeof(Int) == sizeof(Int64) ? emit_box_int64! : emit_box_int32!)(B, sym)
sym = emit_apply_type!(B, Base.Val, [sym])
push!(vals, sym)
push!(vals, unsafe_to_llvm(Val(is_constant_value(gutils, orig))))
for v in inps[2:end]
push!(vals, v)
end
pushfirst!(vals, unsafe_to_llvm(idx_jl_getfield_rev)) so the unction, original shadow, tape [aka dret], type{val{sym}}, batched other shadows [from 2 ... n] the shadow tape does look ishy tho. |
@vchuravy okay the caching mechanism is fine, the function is actually returning garbage.
|
The input to idx_jl_getfield_aug is RefValue{DataType}(Tuple{Float64, Int64}) |
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1485 +/- ##
===========================================
+ Coverage 71.15% 96.35% +25.19%
===========================================
Files 30 9 -21
Lines 11224 411 -10813
===========================================
- Hits 7986 396 -7590
+ Misses 3238 15 -3223 ☔ View full report in Codecov by Sentry. |
No description provided.