Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow reading big text files #78

Open
jkleiser opened this issue Nov 8, 2019 · 11 comments
Open

Slow reading big text files #78

jkleiser opened this issue Nov 8, 2019 · 11 comments

Comments

@jkleiser
Copy link

jkleiser commented Nov 8, 2019

I wrote a simple program 'filelines.mod' (below) to count the lines, and find the length of the longest line, in text files, mainly to get to know voc better. My program works fine, but when I compared the execution speed to the speed of a similar program written in Dart (also below), I was surprised to find that my Dart program was nearly 3 times faster! That was when reading from files of GB size.
My main question now is if there is something in my 'filelines.mod' solution that is very inefficient. Could it be e.g. that my use of Files.ReadLine and/or Strings.Length is a bad choice?
Oberon program:

MODULE filelines;

IMPORT Args, Files, Strings, Console;

VAR
  filename: ARRAY 256 OF CHAR;
  file: Files.File;
  rider: Files.Rider;
  line: ARRAY 4096 OF CHAR;
  lineLength, maxLength: INTEGER;
  lineCount: LONGINT;
  verbose: BOOLEAN;

BEGIN
  IF Args.argc > 1 THEN
    Args.Get(1, filename);
    verbose := Args.argc > 2;
    file := Files.Old(filename);
    IF file # NIL THEN
      Files.Set(rider, file, 0);
      maxLength := 0;
      lineCount := 0;
      WHILE ~rider.eof DO
        Files.ReadLine(rider, line);
        lineLength := Strings.Length(line);
        IF lineLength > maxLength THEN
          maxLength := lineLength
        END;
        IF ~(rider.eof & (lineLength = 0)) THEN
          lineCount := lineCount + 1;
        END;
        IF verbose THEN
          Console.String(line); Console.Ln
        END
      END;
      Console.Ln;
      Console.String("max. line length:"); Console.Int(maxLength, 6); Console.Ln;
      Console.String("line count:"); Console.Int(lineCount, 10); Console.Ln
    ELSE
      Console.String("* cannot open file *"); Console.Ln
    END
  ELSE
    Console.String("No file name given"); Console.Ln
  END
END filelines.

Dart program:

import "dart:io";
import "dart:convert";
import "dart:async";

main(List<String> arguments) {
  final filename = arguments.first;
  final file = new File(filename);
  Stream<List<int>> inputStream = file.openRead();
  final verbose = arguments.length > 1;
  
  int maxLength = 0;
  int lineCount = 0;
  
  inputStream
    .transform(utf8.decoder)
    .transform(new LineSplitter())
    .listen((String line) {
        if (line.length > maxLength) {
          maxLength = line.length;
        }
        lineCount++;
        if (verbose) {
          print(line);
        }
      },
      onDone: () {
        print("\nmax. line length: $maxLength (utf8 code points)");
        print("line count: $lineCount");
      },
      onError: (e) { print(e.toString()); }
    );
}
@jkleiser
Copy link
Author

jkleiser commented Nov 9, 2019

I did a quick test with Files.ReadString(rider, chunk), but if the file I'm reading from is bigger than my chunk (ARRAY 1024 OF CHAR), then I get a "Terminated by Halt(-2). Index out of range." That makes Files.ReadString pretty much useless to me. Am I missing something?

Doing Files.ReadBytes(rider, chunk, chunkLength) seems to be a better choice.

@dcwbrown
Copy link
Contributor

dcwbrown commented Nov 9, 2019

The Files module is specified by Oakwood and you can find the (admittedly sparse) detail on page 36 of: http://www.math.bas.bg/bantchev/place/oberon/oakwood-guidelines.pdf

Files.ReadString and WriteString deal with Oberon formatted strings in files, that is: the string representation in the file is zero terminated, not newline terminated.

As far as I can see the original Oberon systems did not provide convenience functions for handling plain text files - presumably because it used Oberon formatted files through modules Files and Texts.

The voc compiler itself reads the .mod file character by character using Texts.Read(ch).

Yes, I think a good way to count lines in plain text files quickly would be to use Files.ReadBytes, then scan the bytes for line feed.

-- Dave.

@dcwbrown
Copy link
Contributor

dcwbrown commented Nov 9, 2019

I've looked into where the time is spent and there are two opportunities for improvement which would bring Oberon inline with Dart.

  1. Files.ReadLine is implemented to read one character at a time using Files.Read. Files.Read itself has considerable overhead.

  2. Strings.Length has a signature of PROCEDURE Length* (s: ARRAY OF CHAR): INTEGER;. Because the string is passed by value, a copy is taken at each invocation, and this is taking a lot of time.

Files.Read and Files.ReadLine can be optimised:

This following updated version of Files.Read includes a fast path which knocks about 30% off ReadLine time: (The only change is the additional fast case code at the beginning)

PROCEDURE Read* (VAR r: Rider; VAR x: SYSTEM.BYTE);
    VAR offset: LONGINT; buf: Buffer;
  BEGIN
    IF (r.org = r.buf.org) & (r.offset < r.buf.size) THEN (* Fast case *)
      x := r.buf.data[r.offset];
      INC(r.offset)
    ELSE (* Slow case *)
      buf := r.buf; offset := r.offset;
      IF r.org # buf.org THEN
        Set(r, buf.f, r.org + offset); buf := r.buf; offset := r.offset
      END;
      Assert(offset <= buf.size);
      IF (offset < buf.size) THEN
        x := buf.data[offset]; r.offset := offset + 1
      ELSIF r.org + offset < buf.f.len THEN
        Set(r, r.buf.f, r.org + offset);
        x := r.buf.data[0]; r.offset := 1
      ELSE
        x := 0X; r.eof := TRUE
      END
    END
  END Read;

More significantly the Readline code can be reimplemented to work directly with buffering rather than going through Read. I've made such an implementation which additionally handles lines longer than the buffer by splitting them where they overflow.

This implementation dropped the ReadLine time by nearly 2/3:

 PROCEDURE ReadLine* (VAR r: Rider; VAR x: ARRAY OF CHAR);
  VAR i: INTEGER; offset, limit: LONGINT; buffer: Buffer; eoln: BOOLEAN; ch: CHAR;
  BEGIN
    i := 0;  limit := LEN(x)-1;
    offset := r.offset;  buffer := r.buf;
    eoln := r.eof OR (limit = 0);
    WHILE ~eoln DO
      IF (r.org # buffer.org) OR (offset >= buffer.size) THEN (* Refresh buffer *)
        IF r.org + offset < buffer.f.len THEN
          Set(r, buffer.f, r.org+offset);
          offset := r.offset;  buffer := r.buf;  eoln := r.eof;
        ELSE
          r.eof := TRUE;  eoln := TRUE;
        END
      END;
      WHILE (offset < buffer.size) & ~eoln DO
        ch := SYSTEM.VAL(CHAR, buffer.data[offset]);  INC(offset);
        IF (ch # 0X) & (ch # 0AX) THEN
          x[i] := ch;  INC(i);  eoln := i >= limit
        ELSE
          eoln := TRUE
        END
      END;
      r.offset := offset;
      IF i=limit THEN eoln := TRUE END
    END;
    IF (i>0) & (x[i-1] = 0DX) THEN DEC(i) END; (* Exclude CR if present at eol *)
    x[i] := 0X
  END ReadLine;

It is of course a lot more complex than the original ReadLine and presumably there are bugs I have missed.

Finally the cost of calling Strings.Length can be avoided by implementing it manually, inline. It's very straightforward, not least because the ReadLine function is guaranteed to provide a zero terminator, for example here's a program to count lines and characters:

MODULE readlines;
IMPORT Files, Out;
VAR f: Files.File; r: Files.Rider; l: ARRAY 10000 OF CHAR; i, lines, characters: LONGINT;
BEGIN lines := 0; characters := 0;
  f := Files.Old("testfile");
  IF f # NIL THEN
    Files.Set(r, f, 0);
    WHILE ~r.eof DO
      INC(lines);
      Files.ReadLine(r, l);
      i := 0;  WHILE l[i] # 0X DO INC(i) END;
      INC(characters, i)
    END
  END;
  Out.Int(lines,1); Out.Char(' '); Out.Int(characters,1); Out.Ln;
END readlines.

Using the updated Files.Readline and the inline length counting my test of scanning a 1GB file reduced from 16.9 seconds to 6.4 seconds.

I suspect that using Files.ReadBytes to read the file a large buffer at a time, then scanning the buffer, you should be able to do a little better still.

@norayr
Copy link
Member

norayr commented Nov 9, 2019

@jkleiser it's interesting, try to IMPORT by aliasing Strings := oocStrings, or ooc2Strings, or ulmStrings, and see, if that would do the difference. I didn't compare the Length() implementation, but I will look in to the code to check.

@dcwbrown
Copy link
Contributor

@norayr I'm afraid they all take their string parameter by value incurring the extra copy cost and are no better than plain Strings.Length.

The issue is not just the cost of copying string (i.e. two memory access per character), but also the memory allocation cost, and subsequent garbage collection cost.

One fix would be to have Strings.Length take its parameter by VAR (which avoids the copy), but then it would be illegal to pass a constant string to Strings.Length.

In Oberon 2007, Wirth removed the copy cost by adding the rule that value ARRAY and RECORD parameters are read-only. This allows ARRAYs and RECORDs to be passed by address safely, without copy. Oberon2 doesn't do this though, and changing Oberon2 to use the Oberon 2007 rule would be a significant backward compatibility break.

BUT

Joseph Templ included an undocumented system flag feature in ooc (from which voc is derived). We already know that it can be used with pointers to create non-garbage collected arrays as these are used by the garbage collection code itself.

I had a vague recollection of seeing it used in parameter passing too, and looked further: Setting a non-zero system flag on value ARRAY parameters is an undocumented feature for disabling value ARRAY parameter copying. It turns off the generation for that parameter of the call to __DUP (which does the copy) and the call to __DEL (which later frees the copy). It requires SYSTEM imported to parse without error.

This solves our performance problem with Strings.Length.

We only need to change

PROCEDURE Length* (s: ARRAY OF CHAR): INTEGER;
to
PROCEDURE Length* (s: ARRAY [1] OF CHAR): INTEGER;

and we get a major performance improvement.

(Don't confuse this with ARRAY 1 OF CHAR, the [1] is a system flag, not a dimension.)

And there are a number of other Strings functions that currently suffer from expensive and redundant copying that can be fixed too.

@dcwbrown
Copy link
Contributor

It has always bothered me that Pascal, Modula and Oberon compilers would copy value array parameters regardless of usage, I assumed it would be too difficult to check for writes in the compiler.

But actually it's not difficult.

I've checked in to the 'faster' branch quite a small change that automatically omits the copying of value array parameters unless there is code that modifies, or could modify them.

For example if there are any assignments within the procedure body, or the bodies of nested procedures, or if the variable is passed to a VAR parameter of another procedure then the value array parameter needs to be copied.

It turns out that this change is more than sufficient to get rid of most redundant copies.

For example with this change, the followings Strings module procedures used to have unnecessary copies, but no longer: Length, Append, Insert, Replace, Extract, Pos, Match, ...

The faster branch also includes the optimized Files.Read and Files.ReadLine implementations.

@norayr
Copy link
Member

norayr commented Nov 24, 2019

Joseph Templ included an undocumented system flag feature in ooc (from which voc is derived).

in ofront you mean. (:

hmmm, it's interesting, may be you remember the lines from oakwood:

There have been many requests to make ARRAY and RECORD parameters read-only to achieve the efficiency of passing by reference without the associated possibility for corruption of the calling parameter.  An attempt to make an assignment to any component of such a read only parameter is a compile-time error.  Such parameters could be marked with the standard read only "-" symbol.  For example:
	PROCEDURE Print (theText-: ARRAY OF CHAR) ;
Discussions with ETH suggest this is really a compiler code optimisation issue and on this basis it is recommended that this extension is not implemented.

so actually you have implemented this optimization.

though i was always in doubt about this. i was inclined, that we should not think of programmer as of someone, who does not understand what he is doing, and allowing to explicitly mention, if the array has to be passed by reference or by value, is a plus. also, Ada has in/out/inout parameter passing, it always spoke to me.

also, related discussions: https://groups.google.com/forum/#!msg/comp.lang.oberon/5iCjyr8U8kY/h6uQ_QhiVxEJ

Oleg, instead, added read only parameter passing (marked as '-' dash symbol after the variable name) to his ofrontplus fork.

He suggested me to take the same path, and I was also reluctant there, because I was not sure if that's a good solution either.

David, do you think we need a compiler flag, which enables optimizations and experimental extensions? I remember there was already some difference implemeted by you regarding string assignments(oberon-07 style)?

Overall, practically you have implemented the optimizations, oakwood guidelines mention? Then should we add it to master, you think?

@dcwbrown
Copy link
Contributor

Yes, I also like the approach of marking the parameter as read only, I'll come back to that in a moment.

Thanks for the related discussion link.

But first I think you are proposing that a programmer already has sufficient tools available in choice of VAR vs value when you say: ' i was inclined, that we should not think of programmer as of someone, who does not understand what he is doing, and allowing to explicitly mention, if the array has to be passed by reference or by value, is a plus.'

But this is not enough.

Consider all the library API's that currently take a value parameter ARRAY OF CHAR so as to accept a literal string parameter.

For example, to append the .conf extension to a filename, I can use:

   Strings.Append(".conf", filename)

Currently this causes the allocation of a temporary array of char, copying ".conf" into it, copying it in turn to the end of filename, then freeing the temporary array memory.

No matter how well the developer of Strings.Append understands what is happening they must choose between a VAR and a value parameter, and neither solves our issue.

  • A value parameter accepts a string literal as actual parameter but necessitates the redundant allocation,copy,deallocation sequence.
  • A VAR parameter avoids the allocation,copy,deallocation sequence but cannot accept a literal string actual parameter.

A deeper issue:

Thanks to the related discussion you referenced, I though of the following pathological case. Let's say we want to double a string, e.g. so that "hello" becomes "hellohello". We might think of using Strings.Append like this:

str: ARRAY 100 OF CHAR;
str := "Hello";
Strings.Append(str, str);

Assume a common implementation of append that, after finding the 0 at the end of the target string, copies characters from the source to the target until it hits the zero at the end of the source.

With any change that passes the first parameter by address (i.e. my optimization, or a new constant param kind) this loop will never find the 0 terminator of the source string because they are both the same string and the very first character copy overwrites the 0 terminator.

This is annoying :-).

Have you seen this issue before, say in component pascal?

@dcwbrown
Copy link
Contributor

After more thought I am convinced that it would be bad to turn this on by default as it breaks the simple behaviour of a value parameter: that it always behaves exactly as a local copy.

But we could still expose the copy disable with one of these syntaxes:

  • Retain the current undocumented [1] flag
  • Use the trailing '-' syntax from Oleg and the Oakwood comments
  • Use a leading 'CONST' syntax - I've seen this in Pascal compilers
  • Use a leading 'IN' syntax as in component pascal

Of these only the 'IN' prefix seems to me to describe the behaviour well. In particular it does not misdirect the reader into thinking that the content of the array cannot change during procedure execution.

Any thoughts?

@norayr
Copy link
Member

norayr commented Dec 19, 2019

I like Ada's IN, OUT, and INOUT options.
by these programmer can declare the intentions, expectations: this data will not be overwritten, but will be passed by reference. this data cannot be read, but is here for you to write in to. and INOUT is similar to VAR in Oberon, i. e. gives read and write access.

Do you think it's not hard to implement checks for OUT? Then I would vote to introduce experimental extensions - IN, OUT and INOUT, which is an alias for VAR.

but all experimental extensions should be allowed by the use of the compiler flag. Otherwise, voc should compile only legit Oberon-2 code.

@antranigv
Copy link
Member

antranigv commented Dec 20, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants