Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elpi fail on OCaml-multicore #126

Closed
moyodiallo opened this issue Dec 17, 2021 · 13 comments
Closed

Elpi fail on OCaml-multicore #126

moyodiallo opened this issue Dec 17, 2021 · 13 comments

Comments

@moyodiallo
Copy link

Hi
I found the package doesn't work on OCaml multicore (ocaml-variants.4.12.0+domains) because of Segfault

[1]    2563302 segmentation fault (core dumped)  ./_build/install/default/bin/elpi -test -I  -I tests/sources/ ackermann.elpi

I did a test on OCaml nnpchecker (ocaml-variants.4.12.0+nnpchecker or ocaml-option-nnpchecker.1) and
found Out of heap pointer

make build
./_build/install/default/bin/elpi -test -I _build/install/default/bin/../lib/elpi/ -I tests/sources/ ackermann.elpi

Parsing time: 0.001

Compilation time: 0.002

Out-of-heap pointer at 0x7fe85d590610 of value 0xffff8017a27027c2. Cannot read head.
Out-of-heap pointer at 0x7fe85d5905c0 of value 0xffff8017a2702102. Cannot read head.
Out-of-heap pointer at 0x7fe85d590570 of value 0xffff8017a26ff99a. Cannot read head.
Out-of-heap pointer at 0x7fe85d590520 of value 0xffff8017a27187c2. Cannot read head.
Out-of-heap pointer at 0x7fe85d5904d0 of value 0xffff8017a2718102. Cannot read head.
Out-of-heap pointer at 0x7fe85d5b4630 of value 0xffff8017a2554322. Cannot read head.
Out-of-heap pointer at 0x7fe85d5b45f8 of value 0xffff8017a2a55662. Cannot read head.
Out-of-heap pointer at 0x7fe85d5b45c0 of value 0xffff8017a2a554fa. Cannot read head.
...

It seems to come from using Obj module, it's the same for some code. Could you review your Obj's usage please ?

This is important to consider for OCaml 5.00 (next coming): ocaml-health-check

@gares
Copy link
Contributor

gares commented Dec 17, 2021

I'm aware of the failure. There is one use of Obj which I'm getting (slowly) rid of in this branch: #118. I guess you have the right compiler set up, could you please run it on this branch?

@moyodiallo
Copy link
Author

There's Segfault with this branch #118 on those switches :

ocaml-base-compiler.4.12.0
ocaml-variants.4.12.0+domains
ocaml-option-nnpchecker.1

@gares
Copy link
Contributor

gares commented Dec 17, 2021

ok, thanks for testing.
any hints on what became unsafe?

@moyodiallo
Copy link
Author

It's about using Obj module: Obj.magic make the program unsafe.
This is a simple example:

type typeA = Name of string | Nat
let n1 = Name "fifo"
let n2 = Nat

let n = Obj.magic n2
let m = String.get n 0
---------------------------------------------
$ocamlopt test.ml
$./a.out
[2]    2616032 segmentation fault (core dumped)  ./a.out

A program without Obj would be better because internal objects in the runtime are not stable either

@gares
Copy link
Contributor

gares commented Dec 18, 2021

Sorry I was not clear. I know what Obj is. I was asking what chenged in the runtime of ocaml that made previously "legit" uses of magic illegal.

I will investigate this. I don't recall having many uses of Obj anyway.

@gares
Copy link
Contributor

gares commented Dec 19, 2021

I think I found the culprit, but I don't know how to fix it or to replace this code:

elpi/src/util.ml

Lines 614 to 712 in 332dbea

(* A map with opaque bodex data as key.
This data structure is faster than an associative list + List.assq
asyntotically since it keeps a cache with log(n) lookup time.
On standard OCaml backends we compuete the pointer of a boxed value and
turn it into an integer key and use that value as the key for the cache.
Since the Gc may move the boxed value a sanity check is performed at lookup
time and the cache is eventually updated on the bases of an authoritative
associative list.
*)
module PtrMap = struct
type 'a t = {
(* maps the key's address to the value. It also holds the key
itself so that we can check if the key was moved by the Gc
and fall back to the authoritative associative list *)
mutable cache : (Obj.t * 'a) IntMap.t;
(* We associate to the boxed key a value, but we also keep track of
its address. When it is found to be outdated, we remove the old
entry in the cache. All OCaml data is eventually moved by the Gc
at least once, so we keep the size of the cache close to the
size of the list, and not to its double, by puring outdated cache
entries. *)
authoritative : (Obj.t * ('a * int ref)) list;
}
let empty () = { cache = IntMap.empty; authoritative = [] }
let is_empty { authoritative } = authoritative = []
let address_of =
match Sys.backend_type with
| (Sys.Bytecode | Sys.Native) ->
fun (ro : Obj.t) : int -> begin
assert(Obj.is_block ro);
let a : int = Obj.magic ro in
~- a (* so that the Gc will not mistake it for a block *)
end
| Sys.Other _ ->
(* We don't know how the backend deals with memory, so we play safe.
In this way the cache is a 1 slot for the last used entry. *)
fun _ -> 46
let add o v { cache; authoritative } =
let ro = Obj.repr o in
let address = address_of ro in
{ cache = IntMap.add address (ro,v) cache;
authoritative = (ro,(v,ref address)) :: authoritative }
let linear_search_and_cache ro address cache authoritative orig =
let v, old_address = List.assq ro authoritative in
orig.cache <- IntMap.add address (ro,v) (IntMap.remove !old_address cache);
old_address := address;
v
let linear_scan_attempted = ref false
let find o ({ cache; authoritative } as orig) =
linear_scan_attempted := false;
let ro = Obj.repr o in
let address = address_of ro in
try
let ro', v = IntMap.find address cache in
if ro' == ro then v
else
let cache = IntMap.remove address cache in
linear_scan_attempted := true;
linear_search_and_cache ro address cache authoritative orig
with Not_found when not !linear_scan_attempted ->
linear_search_and_cache ro address cache authoritative orig
let remove o { cache; authoritative } =
let ro = Obj.repr o in
let address = address_of ro in
let _, old_address = List.assq ro authoritative in
let authoritative = List.remove_assq ro authoritative in
let cache = IntMap.remove address cache in
let cache =
if !old_address != address then IntMap.remove !old_address cache
else cache in
{ cache; authoritative }
let filter f { cache; authoritative } =
let cache = ref cache in
let authoritative = authoritative |> List.filter (fun (o,(v,old_address)) ->
let keep = f (Obj.obj o) v in
if not keep then begin
let address = address_of o in
cache := IntMap.remove address !cache;
if !old_address != address then cache := IntMap.remove !old_address !cache
end;
keep) in
{ cache = !cache; authoritative }
let pp f fmt { authoritative } =
pplist (fun fmt (_,(x,_)) -> f fmt x) ";" fmt authoritative
let show f m = Format.asprintf "%a" (pp f) m
end

In particular this is the offending part:

elpi/src/util.ml

Lines 644 to 655 in 332dbea

let address_of =
match Sys.backend_type with
| (Sys.Bytecode | Sys.Native) ->
fun (ro : Obj.t) : int -> begin
assert(Obj.is_block ro);
let a : int = Obj.magic ro in
~- a (* so that the Gc will not mistake it for a block *)
end
| Sys.Other _ ->
(* We don't know how the backend deals with memory, so we play safe.
In this way the cache is a 1 slot for the last used entry. *)
fun _ -> 46

This data structure is a faster List.assq: it uses the value of the address of a boxed data to build a non authoritative search tree and uses it instead of a linear scan. If the object was moved by the gc to another address we fall back to the linear scan and update the search tree.

I lack knowledge of the new Gc, would you mind putting in CC knowledgeable people?

@moyodiallo
Copy link
Author

A program without Obj would be better because internal objects in the runtime are not stable either

What I meant by that, using Obj.magic could easily seeing by the Gc like naked pointers. OCaml+domains doesn't admit naked pointers. The next OCaml version(5.00) will support domains by default.

@kayceesrk, @Engil : have you something to share here ?

@gares
Copy link
Contributor

gares commented Dec 19, 2021

FTR, I'm happy to replace this data structure with something else. Even better if the table keys were weak (but I did read somewhere that ephemerons are not available in 5.0...)

@kayceesrk
Copy link

@gares Ephemerons are available in 5.0. Some of the unsupportable functions in multicore have been removed. Please find the supported API here: https://github.com/ocaml-multicore/ocaml-multicore/blob/5.00/stdlib/ephemeron.mli. The removed functions have also been marked deprecated on trunk OCaml: https://github.com/ocaml/ocaml/blob/trunk/stdlib/ephemeron.mli.

@gares
Copy link
Contributor

gares commented Dec 20, 2021

Thanks ill give ephemerons a try then (they were not there I believe when I coded this).

But what about the int cast trick I was doing here? I found that 5.0 uses more tag bits to represent domains, but I could not find why this code is now broken. I'm curious.

@gares
Copy link
Contributor

gares commented Dec 28, 2021

I tried to port the code to ephemerons, but I forgot I can't, this is why I had this custom map.

The problem I have is that the boxed value I have cannot be hashed. In particular It is something like term option ref and the instances I need to put in a map are the ones where the ref points is None (it's a logic programming language, unification variables are mutable, they born unassigned and end up assigned eventually).

So I can't possibly provide a decent hash function, since all ref cells I need to use as keys contain 0. The old code was using the address of the ref cell as a hash value, and the rest of the code was coping with the fact that the GC could move the cell (something that happens for sure, but not very frequently, so after all the lookup was quick on the average).

So I'm afraid I need to repair the old code. @kayceesrk could you shed some light on why the old code is broken? (the address_of thing seems the culprit).

@gares
Copy link
Contributor

gares commented Dec 30, 2021

After some thinking I found a way, see PR #127. The performances are a bit weird, so I need to investigate more on this PR, but are OKish (there is some effect on code paths which are not really touched by the patch).

One very weird thing is that the elpi parser (written in camlp5) now needs a lot more of stack space.
See 831b06d
This is not a blocking issue, since I want to ditch camlp5 eventually, but looks very weird to me.
Is it a known "problem" of multicore? Does it ring a bell?

@gares gares closed this as completed Feb 3, 2022
@kayceesrk
Copy link

The increased stack space requirement doesn't ring a bell. It may be useful to open an issue on OCaml github repo when you get the chance. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants