-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow reading big text files #78
Comments
I did a quick test with Files.ReadString(rider, chunk), but if the file I'm reading from is bigger than my chunk ( Doing |
The Files module is specified by Oakwood and you can find the (admittedly sparse) detail on page 36 of: http://www.math.bas.bg/bantchev/place/oberon/oakwood-guidelines.pdf Files.ReadString and WriteString deal with Oberon formatted strings in files, that is: the string representation in the file is zero terminated, not newline terminated. As far as I can see the original Oberon systems did not provide convenience functions for handling plain text files - presumably because it used Oberon formatted files through modules Files and Texts. The voc compiler itself reads the .mod file character by character using Texts.Read(ch). Yes, I think a good way to count lines in plain text files quickly would be to use Files.ReadBytes, then scan the bytes for line feed. -- Dave. |
I've looked into where the time is spent and there are two opportunities for improvement which would bring Oberon inline with Dart.
Files.Read and Files.ReadLine can be optimised: This following updated version of Files.Read includes a fast path which knocks about 30% off ReadLine time: (The only change is the additional fast case code at the beginning)
More significantly the Readline code can be reimplemented to work directly with buffering rather than going through Read. I've made such an implementation which additionally handles lines longer than the buffer by splitting them where they overflow. This implementation dropped the ReadLine time by nearly 2/3:
It is of course a lot more complex than the original ReadLine and presumably there are bugs I have missed. Finally the cost of calling Strings.Length can be avoided by implementing it manually, inline. It's very straightforward, not least because the ReadLine function is guaranteed to provide a zero terminator, for example here's a program to count lines and characters:
Using the updated Files.Readline and the inline length counting my test of scanning a 1GB file reduced from 16.9 seconds to 6.4 seconds. I suspect that using Files.ReadBytes to read the file a large buffer at a time, then scanning the buffer, you should be able to do a little better still. |
@jkleiser it's interesting, try to IMPORT by aliasing Strings := oocStrings, or ooc2Strings, or ulmStrings, and see, if that would do the difference. I didn't compare the Length() implementation, but I will look in to the code to check. |
@norayr I'm afraid they all take their string parameter by value incurring the extra copy cost and are no better than plain Strings.Length. The issue is not just the cost of copying string (i.e. two memory access per character), but also the memory allocation cost, and subsequent garbage collection cost. One fix would be to have Strings.Length take its parameter by VAR (which avoids the copy), but then it would be illegal to pass a constant string to Strings.Length. In Oberon 2007, Wirth removed the copy cost by adding the rule that value ARRAY and RECORD parameters are read-only. This allows ARRAYs and RECORDs to be passed by address safely, without copy. Oberon2 doesn't do this though, and changing Oberon2 to use the Oberon 2007 rule would be a significant backward compatibility break. BUT Joseph Templ included an undocumented system flag feature in ooc (from which voc is derived). We already know that it can be used with pointers to create non-garbage collected arrays as these are used by the garbage collection code itself. I had a vague recollection of seeing it used in parameter passing too, and looked further: Setting a non-zero system flag on value ARRAY parameters is an undocumented feature for disabling value ARRAY parameter copying. It turns off the generation for that parameter of the call to __DUP (which does the copy) and the call to __DEL (which later frees the copy). It requires SYSTEM imported to parse without error. This solves our performance problem with Strings.Length. We only need to change
and we get a major performance improvement. (Don't confuse this with And there are a number of other Strings functions that currently suffer from expensive and redundant copying that can be fixed too. |
It has always bothered me that Pascal, Modula and Oberon compilers would copy value array parameters regardless of usage, I assumed it would be too difficult to check for writes in the compiler. But actually it's not difficult. I've checked in to the 'faster' branch quite a small change that automatically omits the copying of value array parameters unless there is code that modifies, or could modify them. For example if there are any assignments within the procedure body, or the bodies of nested procedures, or if the variable is passed to a VAR parameter of another procedure then the value array parameter needs to be copied. It turns out that this change is more than sufficient to get rid of most redundant copies. For example with this change, the followings Strings module procedures used to have unnecessary copies, but no longer: Length, Append, Insert, Replace, Extract, Pos, Match, ... The faster branch also includes the optimized Files.Read and Files.ReadLine implementations. |
in ofront you mean. (: hmmm, it's interesting, may be you remember the lines from oakwood:
so actually you have implemented this optimization. though i was always in doubt about this. i was inclined, that we should not think of programmer as of someone, who does not understand what he is doing, and allowing to explicitly mention, if the array has to be passed by reference or by value, is a plus. also, Ada has in/out/inout parameter passing, it always spoke to me. also, related discussions: https://groups.google.com/forum/#!msg/comp.lang.oberon/5iCjyr8U8kY/h6uQ_QhiVxEJ Oleg, instead, added read only parameter passing (marked as '-' dash symbol after the variable name) to his ofrontplus fork. He suggested me to take the same path, and I was also reluctant there, because I was not sure if that's a good solution either. David, do you think we need a compiler flag, which enables optimizations and experimental extensions? I remember there was already some difference implemeted by you regarding string assignments(oberon-07 style)? Overall, practically you have implemented the optimizations, oakwood guidelines mention? Then should we add it to master, you think? |
Yes, I also like the approach of marking the parameter as read only, I'll come back to that in a moment. Thanks for the related discussion link. But first I think you are proposing that a programmer already has sufficient tools available in choice of VAR vs value when you say: ' i was inclined, that we should not think of programmer as of someone, who does not understand what he is doing, and allowing to explicitly mention, if the array has to be passed by reference or by value, is a plus.' But this is not enough. Consider all the library API's that currently take a value parameter ARRAY OF CHAR so as to accept a literal string parameter. For example, to append the .conf extension to a filename, I can use:
Currently this causes the allocation of a temporary array of char, copying ".conf" into it, copying it in turn to the end of filename, then freeing the temporary array memory. No matter how well the developer of Strings.Append understands what is happening they must choose between a VAR and a value parameter, and neither solves our issue.
A deeper issue: Thanks to the related discussion you referenced, I though of the following pathological case. Let's say we want to double a string, e.g. so that "hello" becomes "hellohello". We might think of using Strings.Append like this:
Assume a common implementation of append that, after finding the 0 at the end of the target string, copies characters from the source to the target until it hits the zero at the end of the source. With any change that passes the first parameter by address (i.e. my optimization, or a new constant param kind) this loop will never find the 0 terminator of the source string because they are both the same string and the very first character copy overwrites the 0 terminator. This is annoying :-). Have you seen this issue before, say in component pascal? |
After more thought I am convinced that it would be bad to turn this on by default as it breaks the simple behaviour of a value parameter: that it always behaves exactly as a local copy. But we could still expose the copy disable with one of these syntaxes:
Of these only the 'IN' prefix seems to me to describe the behaviour well. In particular it does not misdirect the reader into thinking that the content of the array cannot change during procedure execution. Any thoughts? |
I like Ada's IN, OUT, and INOUT options. Do you think it's not hard to implement checks for OUT? Then I would vote to introduce experimental extensions - IN, OUT and INOUT, which is an alias for VAR. but all experimental extensions should be allowed by the use of the compiler flag. Otherwise, voc should compile only legit Oberon-2 code. |
I think it would be nice if we can have experimental features into voc, like INOUT from Ada, light-weight process from BEAM VM and use compiler flags to distinguish legit Oberon-2 code from “featured” codes.
…Sent from my iPhone
On Dec 20, 2019, at 4:05 PM, Norayr Chilingarian ***@***.***> wrote:
I like Ada's IN, OUT, and INOUT options.
by these programmer can declare the intentions, expectations: this data will not be overwritten, but will be passed by reference. this data cannot be read, but is here for you to write in to. and INOUT is similar to VAR in Oberon, i. e. gives read and write access.
Do you think it's not hard to implement checks for OUT? Then I would vote to introduce experimental extensions - IN, OUT and INOUT, which is an alias for VAR.
but all experimental extensions should be allowed by the use of the compiler flag. Otherwise, voc should compile only legit Oberon-2 code.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
I wrote a simple program 'filelines.mod' (below) to count the lines, and find the length of the longest line, in text files, mainly to get to know voc better. My program works fine, but when I compared the execution speed to the speed of a similar program written in Dart (also below), I was surprised to find that my Dart program was nearly 3 times faster! That was when reading from files of GB size.
My main question now is if there is something in my 'filelines.mod' solution that is very inefficient. Could it be e.g. that my use of Files.ReadLine and/or Strings.Length is a bad choice?
Oberon program:
Dart program:
The text was updated successfully, but these errors were encountered: