why don't you use mmfile? #4
Comments
Thanks for the comment and the tip. I know about memory mapped files, but I hadn't seen the mmfile facilities in Phobos. Even so, I'd be inclined to be cautious about using them on machines and files that weren't my own. This is perhaps out of date info now, but in my prior experiences with memory mapped files it was always the case that system specific aspects mattered. And these utilities sometimes get used with quite large files, multiple gigabytes.

Related approaches I deliberately chose not to take are slurping entire files into memory and writing my own buffering code. This was partly philosophical. I wasn't setting out to create the absolute fastest utilities. I was trying to create utilities as they might be created by data scientists who might typically be using Python or similar, and see how D's performance stacked up. D actually did pretty well. I had to avoid auto-decoding, and the csv2tsv converter is slow for reasons I haven't figured out yet. But overall pretty good.

But back to the approach rationale: the way I wrote it avoids a couple of complications. One is that standard input and files work the same way, no special casing. The other is that reading entire files generally bypasses the system specific newline detection, so generic code needs to handle both forms of newline (e.g. CRLF on Windows, LF on Unix). And to be honest, I expect the underlying libraries to provide good buffering without needing to write my own.

There are definitely people who would disagree with these choices, and if my primary goal was the very fastest performing tools I would change a few things as well. As it is, they are actually pretty good. tsv-filter in particular runs very fast.
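To make the newline point concrete: once a file is read or mapped as raw bytes, the line splitting has to accept both endings itself. A minimal sketch of that, in Python rather than D and not taken from the tools discussed here; the name `split_lines` is hypothetical.

```python
def split_lines(raw: bytes) -> list:
    """Split a raw byte buffer into lines, accepting both LF and CRLF endings."""
    lines = raw.split(b"\n")
    # A trailing newline (or an empty buffer) leaves one empty element at the end.
    if lines and lines[-1] == b"":
        lines.pop()
    # CRLF input leaves a carriage return before each LF; strip it.
    return [line[:-1] if line.endswith(b"\r") else line for line in lines]
```

A line-oriented reader such as `byLine` normally hides this step, which is part of the trade-off described above.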
As I said, I just measured that it's twice as fast in my simple experiments, and I was interested in whether you had actually tried it.
"A possible benefit of memory-mapped files is a "lazy loading", thus using small amounts of RAM even for a very large file. Trying to load the entire contents of a file that is significantly larger than the amount of memory available can cause severe thrashing as the operating system reads from disk into memory and simultaneously writes pages from memory back to disk. Memory-mapping may not only bypass the page file completely, but the system only needs to load the smaller page-sized sections as data is being edited, similarly to demand paging scheme used for programs." (from Wikipedia)
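For illustration, the lazy loading described in the quote can be sketched as below. This is Python's standard `mmap` module rather than D's `std.mmfile`, and it is not the benchmark code referred to in this thread; the function name `count_lines_mmap` is made up for the sketch.

```python
import mmap
import os


def count_lines_mmap(path: str) -> int:
    """Count LF bytes in a file through a read-only memory mapping."""
    # Mapping an empty file raises ValueError, so handle that case first.
    if os.path.getsize(path) == 0:
        return 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are only read when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count
```

The scan never copies the file into a Python buffer. Whether that is faster than buffered line reading depends on the system and the file, as the discussion here suggests.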
If you see a file as a range, then there's no special casing ;-)
byLine uses |
@wilzbach - your code is mapping an entire file, which is completely unnecessary, IMHO.
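One way to avoid mapping a whole file at once is to map it one window at a time. A rough sketch under stated assumptions: it is Python, not D; the names `count_lines_windowed` and `_count_newlines` and the default window size are hypothetical; and it only counts LF bytes, so it never has to deal with a line that straddles two windows.

```python
import mmap
import os


def _count_newlines(buf) -> int:
    """Count LF bytes in a mapped window using repeated find calls."""
    count = 0
    pos = buf.find(b"\n")
    while pos != -1:
        count += 1
        pos = buf.find(b"\n", pos + 1)
    return count


def count_lines_windowed(path: str, window: int = 256 * mmap.ALLOCATIONGRANULARITY) -> int:
    """Count LF bytes by mapping the file one window at a time."""
    gran = mmap.ALLOCATIONGRANULARITY
    # Mapping offsets must be multiples of the allocation granularity,
    # so round the window down to such a multiple (at least one unit).
    window = max(gran, window - window % gran)
    size = os.path.getsize(path)
    total = 0
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            # The last window may be shorter than the others.
            length = min(window, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as mm:
                total += _count_newlines(mm)
            offset += length
    return total
```

Real line parsing over windows would also need to carry a partial line from the end of one window into the next, which this sketch does not attempt.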
Hi @dejlek -- FWIW, I did experiment with memory mapped files at one point a good bit after @wilzbach's original suggestion. I haven't used them at this point, but there are places in the tools where it warrants consideration. Mostly, at this point I haven't wanted to worry about the distinctions between reading infinite/indefinite size streams, streaming large vs small files, multiple files, and reading full files into memory. But also, I didn't see big performance wins on the tests I ran. I suspect this has more to do with the specific tests I ran than with the technique, but clearly it would take a bit more time investment to characterize the cases better. There are cases in the toolset where MM files would really seem to make sense. For example, a couple of the sampling methods provided by As to |
A couple of weeks ago I did a (noobish) benchmark that compared different D functions with C, C++ and Python.
By far the fastest was the following code:
I am just wondering whether you didn't know about this or whether there was a reason against it.
Memory-mapping also works well with extremely large files.
https://dlang.org/phobos/std_mmfile.html
https://en.wikipedia.org/wiki/Memory-mapped_file