Gridfs versus file system performance #301
Replies: 1 comment
-
Another thought on this. It would be VERY EASY for me to write a C/C++ function that would take an ensemble object, write only the sample data to a file with C fwrite, and return an index in a struct whose layout is to be determined. The python wrapper would then only need to reorganize the return into documents to be inserted into MongoDB, i.e. putting the right dir, dfile, and foff values with the correct waveform. The index would need to be designed so that process would be simple and error proof. A reader could be produced the same way, but its input would need to look much like the output of the writer: a python wrapper could pass a C function the file read index and a skeleton of the ensemble's data. I would bet that will really speed up file-based reads and writes. I suspect it would be a lot faster than a pure python solution, but I could be wrong on that.
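To make the idea concrete, here is a minimal sketch of the python-wrapper side, assuming the hypothetical C writer returns a per-member index of (foff, nbytes) pairs. The function name `index_to_documents` and the index layout are my assumptions for illustration, not an existing MsPASS API; only the dir/dfile/foff document keys come from the discussion above.

```python
# Hypothetical sketch: assume the C writer fwrites each member's samples back
# to back into one file and returns a list of (foff, nbytes) tuples, one per
# ensemble member, in member order.  The wrapper then only has to reshape
# that index into MongoDB document fragments.

def index_to_documents(dir_name, dfile, write_index):
    """Convert the assumed C writer index into per-member document fragments.

    dir_name, dfile : the directory and file the C writer wrote into
    write_index     : list of (foff, nbytes) tuples in member order
    """
    docs = []
    for foff, nbytes in write_index:
        docs.append(
            {
                "storage_mode": "file",
                "dir": dir_name,
                "dfile": dfile,
                "foff": foff,   # byte offset of this member's samples
                "nbytes": nbytes,
            }
        )
    return docs

# Example: three Seismogram sample blocks of 9600 bytes packed contiguously
index = [(0, 9600), (9600, 9600), (19200, 9600)]
docs = index_to_documents("/data/gathers", "source_0001.d", index)
```

The resulting dicts could then be merged into each member's Metadata before the usual MongoDB insert, which is the "reorganize the return into documents" step described above.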
-
Thought I'd pass along some numbers from a serial run I have running on my "quakes" machine. For reference, both the database and the file system in this workflow are driven from a RAID1 magnetic disk array (2 disks mirrored). All the job does is read TimeSeries data from gridfs as ensembles, run bundle, and then write the result out as files to the file system using save_ensemble_data with "file" as the storage mode. The files it is building are "common source gathers" with all the sample data packed into one file for each gather. Our current implementation, however, opens and closes the file after writing the sample data from each Seismogram object.
Here are some numbers:
Here is an (incomplete) sample of the output:
There is a lot more variance in the gridfs read time; I have no idea why. A rough guess from scanning the numbers is that the read time is around 1.5 times longer (on average) than the write time. That isn't horrible, BUT this is a serial job. We can test this with a parallel job later if you think it would be helpful, but I suspect the results are somewhat predictable: gridfs reads will be a throttle, although how that would trade off with writes in a parallel job is hard to predict. Incidentally, the overall throughput is not that bad for the writer. It is running about 20 Mb/s. I suspect strongly, however, that it could be sped up a lot if that time didn't include 1000+ open/close calls.
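The open/close overhead is easy to estimate independently of MsPASS. The sketch below writes the same 1000 equally sized binary segments twice: once with an open/close per segment (as the current implementation does) and once through a single open handle. Segment size and count are arbitrary choices for illustration, not the actual gather sizes from this run.

```python
# Quick self-contained estimate of per-segment open/close cost vs. a single
# open for the whole gather.  Not the MsPASS writer; just the I/O pattern.
import os
import tempfile
import time

def write_reopen_each(path, segments):
    """Append each segment with its own open/close, as save_ensemble_data does now."""
    t0 = time.perf_counter()
    for seg in segments:
        with open(path, "ab") as f:
            f.write(seg)
    return time.perf_counter() - t0

def write_single_open(path, segments):
    """Write all segments through one file handle."""
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for seg in segments:
            f.write(seg)
    return time.perf_counter() - t0

segments = [os.urandom(4096)] * 1000  # 1000 segments of 4 KiB each
with tempfile.TemporaryDirectory() as d:
    p_many = os.path.join(d, "many.d")
    p_one = os.path.join(d, "one.d")
    t_many = write_reopen_each(p_many, segments)
    t_one = write_single_open(p_one, segments)
    size_many = os.path.getsize(p_many)
    size_one = os.path.getsize(p_one)
```

On most systems the repeated open/close variant is noticeably slower, which would support the suspicion that the 1000+ open/close calls are a meaningful part of the write time, though the exact ratio depends on the OS and file system.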
I do think there are some clear things we need to fix with save_ensemble_data:
I suggest we change the names of the current args for save_ensemble_data: `dfile_list` should become `dfile`, and `dir_list` should become `dir`. The function should accept either a single string or a list of strings for both args. The first few lines of the method could essentially run the code above to create an internal list IF the input is a single string. If the type is list, it would just use it as it does now.
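That normalization step could look roughly like the sketch below. The helper name `normalize_path_arg` is mine for illustration; the str-or-list behavior is the proposal above.

```python
# Sketch of the proposed argument handling at the top of save_ensemble_data:
# expand a single string into a per-member list, pass an existing list through.

def normalize_path_arg(arg, nmembers):
    """Return a list of length nmembers for a dir/dfile style argument.

    arg      : a single str (applied to every member) or a list of str
    nmembers : number of members in the ensemble
    """
    if isinstance(arg, str):
        # One string given: every member goes to the same dir/dfile.
        return [arg] * nmembers
    if isinstance(arg, list):
        # A list is used as-is, but its length must match the ensemble.
        if len(arg) != nmembers:
            raise ValueError(
                "list argument length {} does not match number of members {}".format(
                    len(arg), nmembers
                )
            )
        return arg
    raise TypeError("argument must be a str or a list of str")

# Example: a single dfile string applied to every member of a 3-member ensemble
dfiles = normalize_path_arg("source_0001.d", 3)
```

With this in place, the rest of the method would be unchanged: it always sees per-member lists, whether the caller passed one string or many.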