PFind out of memory #22
Comments
I suggest we reproduce this in the main pwalk GitHub repository using one of their examples, like pdu, and then file the issue there.
For me it does not seem to crash, but when I increase the number of nodes it takes forever and I cannot finish; the find did not finish even in 80 minutes. Decreasing the number of files does not help much.
Seems I found the issue on my system. In io500_fixed.sh, around line 211, the pfind step is:
myrun "$command" $result_file
This command was never finishing, and I was thinking that it was just too slow. I tried an interactive job, and only then did I get an MPI error, but not through sbatch. Then I noticed that I was executing the srun from the root folder (io-500-dev) and not from the bin folder; entering the bin folder solved the issue. So if I do this:
cd bin
myrun "$command" $result_file
matches=$( grep MATCHED $result_file )
cd ..
pfind works (I have not tried it on more than 2 nodes). However, something must have changed, because a few days ago it was working without any issue.
When it runs, there should be an output line like [Exec] or something. What does it say? I thought we were passing the full path to pfind, so the 'cd' shouldn't make a difference. Unless the problem is that pfind can't find its pwalk library?
So, do you have a result for us? :)
Maybe we need Io500.sh to set PYTHONPATH to where pwalk is?
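As a rough sketch of the same idea (not the actual pfind source), the script itself could prepend its own directory to the module search path before importing, which would make the lib.parallelwalk import independent of the directory the job is launched from:

# Hypothetical sketch only: put the directory containing the pfind script first
# on sys.path so "from lib.parallelwalk import ParallelWalk" resolves no matter
# where srun/sbatch starts the process. The real pfind may handle this differently.
import os
import sys

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from lib.parallelwalk import ParallelWalk  # expects bin/lib/ to sit next to the script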
It is the full path to pfind all the time, but a few days ago it was working fine and yesterday it wasn't; the only difference is where I execute the pfind command from. Yes, it could just be an environment variable, but then why was it working before? That is the confusing part, and we probably cannot figure it out. I was trying to get the results yesterday but pfind was not working, so I will check now.
I am tuning and I have the following issue: I have created 8 million files, which is not that many, but my mdtest_easy takes just 30 seconds while mdtest_hard_write takes 700 seconds, so I have to adjust. In any case, I will not end up with many fewer files. pfind looks for all the files and takes too much time; I have been in this test for more than half an hour. My point is that if my system is slow with pfind but fast at creating files, then fine, pfind should give me a bad result, but it should still finish in a reasonable time. We should have an approach that does not search all the files. Initially I also thought of using one MPI process per node for pfind, since I am running 4 processes per node overall.
Great question. I don't know what to do here.
Yeah, that reminds me of the testing we did in the beginning...
I think this is now fixed by the stonewalling John added to the scripts? Some kind of stonewalling is also available inside the C version of pfind now. So this is probably fixed here?
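For context, "stonewalling" here means capping the traversal by wall-clock time and reporting whatever has been matched so far, instead of insisting on scanning every file. A minimal single-process sketch of that idea (purely illustrative; the real pfind is MPI-parallel and this is not its code):

# Illustrative sketch of a stonewalled find: stop scanning once a wall-clock
# budget is exceeded and report partial counts. Not the pfind implementation.
import os
import time

def stonewalled_find(root, matches, budget_seconds=300):
    deadline = time.time() + budget_seconds
    found = scanned = 0
    complete = True
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            scanned += 1
            if matches(name):
                found += 1
        if time.time() >= deadline:   # stonewall reached: stop the traversal early
            complete = False
            break
    return found, scanned, complete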
This was fixed by commit 462ace9, which enables the stonewall by default, so the issue should probably be closed?
I'm not convinced that the pfind out-of-memory problem is resolved (even with stonewalling). I still have to check the following theoretical setting: an extremely big directory combined with extremely slow stat() operations. One thread runs readdir() and creates jobs for stat(), and those jobs start to queue up. I'm not quite sure whether libcircle handles this case and stalls the creation of new jobs. The reason we do not see the problem is that, even at 1 kB per filename, 1 GB of memory corresponds to 1 million filenames, which is not the number of files we created. It is on my list to check when replacing libcircle.
We have a producer/consumer model for LFSCK traversal and repair of Lustre filesystems. The producer keeps track of how many items are in the queue, and if the queue gets too large it stops scanning until the consumer has reduced the backlog by some amount.
Cheers, Andreas
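A minimal threaded sketch of the throttling Andreas describes, assuming a bounded queue between a single readdir() producer and a pool of stat() consumers; the blocking put() is what stalls the producer until the backlog shrinks again (illustrative only, not libcircle or LFSCK code):

# Sketch of producer/consumer backpressure: the bounded queue blocks the
# readdir() producer once max_backlog paths are waiting for stat(), so the
# backlog (and its memory use) cannot grow without limit.
import os
import queue
import threading

def walk_with_backpressure(root, num_workers=4, max_backlog=10000):
    work = queue.Queue(maxsize=max_backlog)   # bounded queue: put() blocks when full
    sizes = []

    def consumer():
        while True:
            path = work.get()
            if path is None:                  # sentinel: shut this worker down
                return
            try:
                sizes.append(os.stat(path).st_size)
            except OSError:
                pass                          # file vanished or stat() failed

    workers = [threading.Thread(target=consumer) for _ in range(num_workers)]
    for w in workers:
        w.start()

    for dirpath, dirnames, filenames in os.walk(root):   # single producer thread
        for name in filenames:
            work.put(os.path.join(dirpath, name))        # stalls here while the backlog is full

    for _ in workers:                                    # one sentinel per worker
        work.put(None)
    for w in workers:
        w.join()
    return sizes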
Running with 200 nodes and 5 processes produces an error; with 100 nodes and 10 processes it does work.
The error was:
Traceback (most recent call last):
File "/home/dkrz/k202079/work/io-500/io-500-dev/bin/pfind", line 16, in
from lib.parallelwalk import ParallelWalk
File "", line 969, in _find_and_load
File "", line 954, in _find_and_load_unlocked
File "", line 896, in _find_spec
File "", line 1139, in find_spec
File "", line 1113, in _get_spec
File "", line 1225, in find_spec
File "", line 1264, in _fill_cache
OSError: [Errno 12] Cannot allocate memory: '/mnt/lustre01/work/k20200/k202079/io-500/io-500-dev/bin/lib'