
PFind out of memory #22

Open
JulianKunkel opened this issue Oct 26, 2017 · 12 comments

@JulianKunkel
Contributor

Running with 200 nodes and 5 processes each produces an error; with 100 nodes and 10 processes each it works. The error was:

Traceback (most recent call last):
  File "/home/dkrz/k202079/work/io-500/io-500-dev/bin/pfind", line 16, in <module>
    from lib.parallelwalk import ParallelWalk
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 954, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 896, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1139, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1113, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1225, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1264, in _fill_cache
OSError: [Errno 12] Cannot allocate memory: '/mnt/lustre01/work/k20200/k202079/io-500/io-500-dev/bin/lib'
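
For context on where the OSError comes from: the _bootstrap_external frames at the bottom are Python's FileFinder caching the contents of bin/lib with os.listdir() so it can locate lib/parallelwalk.py, so the ENOMEM is raised by that directory listing, not by pfind's own code. A minimal diagnostic sketch (my own suggestion, not part of pfind; the path is copied from the traceback) that runs the same call by hand on a suspect node:

    # Repeats the listing that FileFinder._fill_cache performs during the
    # failed import; running it on a suspect node shows whether the node
    # (or a ulimit on it) is genuinely out of memory.
    import errno
    import os

    path = '/mnt/lustre01/work/k20200/k202079/io-500/io-500-dev/bin/lib'
    try:
        print('%d entries listed OK' % len(os.listdir(path)))
    except OSError as e:
        if e.errno == errno.ENOMEM:
            print('reproduced the same [Errno 12] as pfind')
        else:
            raise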

@johnbent
Collaborator

I suggest we reproduce this in the upstream pwalk GitHub repo using one of their examples, like pdu, and then file the issue there.

@gmarkomanolis
Collaborator

For me it does not seem to crash, but when I increase the node count it takes forever and never finishes; the find did not complete even after 80 minutes. Decreasing the number of files does not help much.

@gmarkomanolis
Collaborator

I think I found the issue on my system.

In io500_fixed.sh, around line 211, the pfind invocation is:

myrun "$command" $result_file

This command never finished, and I assumed it was just too slow. Only when I tried an interactive job did I get an MPI error; through sbatch it never appeared.

Then I noticed I was executing srun from the root folder (io-500-dev) rather than the bin folder; entering the bin folder first solved the issue.

So if I do this:

cd bin
myrun "$command" $result_file
matches=$( grep MATCHED $result_file )
cd ..

pfind works (I have not tried more than 2 nodes). However, something must have changed, because a few days ago it was working without any issue.
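
If the working directory really is the variable here, a cwd-independent alternative to the cd workaround would be to anchor the import path inside the script itself. This is only a sketch under that assumption, not the project's actual code: prepend the directory containing bin/pfind to sys.path before the lib import, so "from lib.parallelwalk import ParallelWalk" resolves no matter where srun starts.

    # Hypothetical tweak for the top of bin/pfind, before its imports:
    # make the package lookup independent of the launcher's cwd by
    # pointing sys.path at the directory this script lives in.
    import os
    import sys

    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    from lib.parallelwalk import ParallelWalk  # resolves from any cwd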

@johnbent
Collaborator

johnbent commented Oct 27, 2017 via email

@johnbent
Collaborator

johnbent commented Oct 27, 2017 via email

@gmarkomanolis
Collaborator

The full path of pfind is used every time, but a few days ago it was working and yesterday it was not; the only difference is the directory from which I execute the pfind command. Yes, it could just be an environment variable, but then why was it working before? That is the confusing part, and we probably cannot figure it out.

I was trying to get results yesterday but pfind was not working, so I will check now.

@gmarkomanolis
Collaborator

While tuning, I ran into the following issue:

I have created 8 million files, which is not that many, but my mdtest_easy takes just 30 seconds and mdtest_hard_write 700 seconds, so I have to adjust; in any case I will not end up with many fewer files. pfind searches all of the files, and it takes too long; this run has been going for more than half an hour. My point is that if my system is slow with pfind but fast at creating files, fine, pfind should give me a bad result, but it should still finish in a reasonable time. We should have an approach that does not search all the files. Initially I thought of using one MPI process per node for the find, since I am running 4 processes per node overall.

@johnbent
Collaborator

johnbent commented Oct 27, 2017 via email

@JulianKunkel
Contributor Author

JulianKunkel commented Oct 30, 2017 via email

@adilger
Collaborator

adilger commented Jun 20, 2018

This was fixed by commit 462ace9, which enables stonewall by default; this issue should probably be closed?

@JulianKunkel
Contributor Author

JulianKunkel commented Jun 21, 2018 via email

@adilger
Collaborator

adilger commented Jun 21, 2018 via email
