CHANGE_LOG.txt

02-05-2022:
* Add keywords to grammar file (BabyCobolTokens.g4)
* Copy grammar from original BabyCobol paper (BabyCobolGrammar.g4)
* Extend original grammar with divisions & paragraphs

-> Achieved: Simple parsing of the fib example files as given in the paper. Doesn't parse completely yet.

03-05-2022:
* Work on identification division
-> Encountered problem with the greedy algorithm. Always matches ~[.]+ with everything, not just the identification division.
   -> A possible solution is to switch to a non-greedy strategy (https://github.com/antlr/antlr4/blob/master/doc/wildcard.md)
      -> Eventually not necessary, but switched to a recursive grammar rule, which switches between arbitrary values.
         Really not a nice implementation but it works.

* Add support for sufficient qualification (FOR)
* Ability to fully parse (without any errors) the simple fib file.
* Add labels to grammar rules with multiple RHSs.
* Started working on a quick-and-dirty pretty-print program.
-> Is able to expand picture representations to their proper values, e.g. X(2) -> XX and 99V9(9) -> 99V999999999
-> For the visual effect sequence numbers are printed in the first six columns.
-> Depth values for the data division are recalculated and increase by one each depth.
   -> Not properly done yet... Currently the following problem persists:
      000000 IDENTIFICATION DIVISION.
      000001     PROGRAMID. TEST-REPRESENTATION.
      000002 DATA DIVISION.
      000003     1 TEST PICTURE IS 9XXXXX999.
      000004         2 BETA LIKE TEST.
      000005     1 OMEGA OCCURS 20 TIMES.
      000006             3 ALPHA PICTURE IS AAAXXXXX9.
      But 3 ALPHA ... should be one indent back, starting on the same column as 2 BETA ...
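
A minimal sketch of the two pretty-print steps above, not the actual implementation. The class and method names are
hypothetical. It assumes a repeat is written as one picture character followed by a count in parentheses (as in 9(9)),
and that an entry's depth is the number of enclosing entries with a smaller level number, plus one. Under that
assumption 3 ALPHA gets the same depth as 2 BETA.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrettyPrintSketch {
    // One picture character followed by a repeat count in parentheses, e.g. 9(9).
    private static final Pattern REPEAT = Pattern.compile("([A-Za-z0-9])\\((\\d+)\\)");

    // Expands every "c(n)" into n copies of c and leaves the rest untouched.
    static String expandPicture(String picture) {
        Matcher m = REPEAT.matcher(picture);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int count = Integer.parseInt(m.group(2));
            StringBuilder run = new StringBuilder();
            for (int i = 0; i < count; i++) {
                run.append(m.group(1));
            }
            m.appendReplacement(out, run.toString());
        }
        m.appendTail(out);
        return out.toString();
    }

    // Maps declared level numbers (in source order) to consecutive depths.
    static List<Integer> normalizeDepths(List<Integer> levels) {
        Deque<Integer> open = new ArrayDeque<>(); // level numbers of the enclosing entries
        List<Integer> depths = new ArrayList<>();
        for (int level : levels) {
            // Close every entry that is not a parent of the current one.
            while (!open.isEmpty() && open.peek() >= level) {
                open.pop();
            }
            open.push(level);
            depths.add(open.size());
        }
        return depths;
    }
}
```

For the example listing, the levels 1, 2, 1, 3 become the depths 1, 2, 1, 2.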

04-05-2022:
* Update the ANTLR grammar to what's available on the BabyCobol documentation site, since the paper has some mistakes
* Set new goal: Implement a pre-processor that conducts the full lexical analysis.
-> Attempted to write an ANTLR grammar, but for now switched to Java regex.

05-05-2022:
* Continued on column based parsing.
-> Base regex on the COBOL preprocessor as found on the ANTLR GitHub (https://github.com/antlr/grammars-v4/blob/c270f16dc0bc30fb15f7b1aac1d2239f8512d704/cobol85/java/CobolPreprocessor.java#L34)

06-05-2022:
* Continued on column based parsing
-> Start working on inserting continued lines in the same line.
-> Finished column based parsing in the Java preprocessor.
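
A rough sketch of column based splitting with a Java regex, not the preprocessor itself. The class name and the exact
column layout are assumptions: columns 1-6 sequence area, column 7 indicator, columns 8-11 area A, columns 12-72
area B, and everything after column 72 ignored. Every group is allowed to be shorter than its area, so short and empty
lines still match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColumnSplitSketch {
    // Groups: sequence (1-6), indicator (7), area A (8-11), area B (12-72), ignored rest (73+).
    private static final Pattern LINE =
            Pattern.compile("^(.{0,6})(.?)(.{0,4})(.{0,61})(.*)$");

    // Returns { sequence, indicator, areaA, areaB } for one line without its line terminator.
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            // Only happens when the input still contains a line terminator.
            throw new IllegalArgumentException("expected a single line without line terminator");
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }
}
```

Because all quantifiers are greedy and anchored from the left, each area is filled completely before the next one
receives any characters.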

09-05-2022:
* The preprocessor will communicate line numbers after the 72nd column with the sequence #![S]_[E]\r\n
-> Where S -> First line number of the line and E -> Last number of the line (in case of line merges)
-> Started on a custom error listener that shows correct line numbers and underlines errors.
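
A small sketch of the underlining part only, independent of ANTLR. The class, method and message format are
hypothetical. It assumes the listener already knows the original line number, the original line text and the zero-based
start and stop character positions of the offending token.

```java
class UnderlineSketch {
    // Builds a three-line report: message, the offending source line, and a caret underline.
    // start and stop are zero-based, inclusive character positions within lineText.
    static String underline(int lineNumber, String lineText, int start, int stop, String message) {
        StringBuilder sb = new StringBuilder();
        sb.append("line ").append(lineNumber).append(": ").append(message).append('\n');
        sb.append(lineText).append('\n');
        for (int i = 0; i < start; i++) {
            sb.append(' ');
        }
        for (int i = start; i <= stop; i++) {
            sb.append('^');
        }
        return sb.toString();
    }
}
```

Tabs in the source line are not expanded here, so the carets only line up for space-indented input.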

* By including the Line objects in the error listener, we do not have to provide the original line numbers in the
  to-be-parsed text.
* Give a warning when the clean printing removes information from the A section

* Found a section in the ANTLR reference p. 210 about keywords not being reserved.

12-05-2022:
* Start on implementing "keywords as identifiers" according to p. 210.
-> Looks easy, since it's just adding all keywords to the identifier rule. However, how do we fix the issue of
   ambiguity, since ANTLR chooses for us, right?

* Move atomic expressions out of normal statements.
* PROBLEM: How do we tell the parser to choose depending on the case?!

* White space insignificance
-> Tried to split tokens into subtokens consisting of single letters, didn't work. Whitespace still affected
   tokenization.
-> Put separate single char tokens as grammar rules.
   -> This works great! Only have to figure out how to handle the picture repr. and integers.
   -> Integers should have been fixed

16-05-2022:
* Started on using predicates in order to distinguish keywords from identifiers.
* Well, eventually decided to switch back, since during parsing one cannot get the value of a grammar rule on the
  right-hand side, since this can still be anything. (Yields a class cast error.)

17-05-2022:
* Possibly came up with a fix for yesterday's issue. So instead of redefining every token as a grammar rule consisting
  of separate tokens we are back at actual tokens. However, they are built up from individual character fragments
  that are non-greedily followed by whitespace.
  -> This especially took a lot of effort, since everything was being recognized as names, this was fixed by making sure
     that all the individual fragments have a non-greedy consumption of possible whitespace.
  -> It should now be possible to use predicates in order to check if the parser should detect the keyword or tell that
     it is an identifier.

18-05-2022:
* While testing the parser I figured out that ANTLR shows all the places where ambiguity takes place. Perhaps there are
  triggers fired during parsing upon which we could act.
* Start focussing on one thing at a time. For now do not continue on whitespace insignificance, but first focus on
  having the option to have keywords as identifiers.
* Just discovered that ambiguities are apparently reported!!! And can be dealt with in the error listener!!!
* Ambiguities are indeed DETECTED, but cannot be changed with the reportAmbiguity() function as declared in the
  BaseErrorListener.

19-05-2022:
* New idea, we can apparently not get the proper input in a predicate of a grammar rule before deciding things.
  However, if a predicate decides to fail afterwards, we can handle accordingly.
  -> Yeah handling things afterwards in this manner doesn't seem the best option.
* Perhaps now take a look at having the ability to use error recovery?!
* Since I am quite stuck, for now wait till I get the repositories from Vadim and start focussing on something else.

23-05-2022:
* Start on sufficient qualification.
* Finished the data structure that holds all variables according to the given order and levels.
* Created a tree visitor that collects the variables into the previously made data structure.
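
A sketch of what such a variable structure and a qualified lookup could look like. This is not the actual data
structure; all names are hypothetical. It assumes the parent of a field is the nearest earlier declaration with a
smaller level number, and that a qualified reference lists ancestor names from the inside out, where ancestors may be
skipped.

```java
import java.util.ArrayList;
import java.util.List;

class QualificationSketch {
    static final class Field {
        final int level;
        final String name;
        final Field parent;

        Field(int level, String name, Field parent) {
            this.level = level;
            this.name = name;
            this.parent = parent;
        }
    }

    private final List<Field> fields = new ArrayList<>(); // declarations in source order
    private Field last;

    // Adds a declaration; its parent is the nearest earlier field with a smaller level number.
    Field declare(int level, String name) {
        Field parent = last;
        while (parent != null && parent.level >= level) {
            parent = parent.parent;
        }
        Field field = new Field(level, name, parent);
        fields.add(field);
        last = field;
        return field;
    }

    // Returns all fields called name whose ancestors contain the qualifiers in the given order.
    // Zero results means undefined, more than one means the reference is not sufficiently qualified.
    List<Field> resolve(String name, String... qualifiers) {
        List<Field> matches = new ArrayList<>();
        for (Field field : fields) {
            if (!field.name.equalsIgnoreCase(name)) {
                continue;
            }
            int matched = 0;
            for (Field a = field.parent; a != null && matched < qualifiers.length; a = a.parent) {
                if (a.name.equalsIgnoreCase(qualifiers[matched])) {
                    matched++;
                }
            }
            if (matched == qualifiers.length) {
                matches.add(field);
            }
        }
        return matches;
    }
}
```

With two containers A and B that both hold a field X, resolving X alone yields two candidates, while resolving X
qualified by B yields exactly one.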

24-05-2022:
* Validate the qualification in the procedure division. Give error messages and warnings when appropriate.

25-05-2022:
* Finally received the ANTLR projects from Vadim.
  -> None of them has implemented the keywords == identifiers feature.
  -> Some of them have limited whitespace ignorance between tokens but not within them.
     -> Hence that feature isn't implemented either since one just does WS: [ ] -> skip;
* Conclusion: Perhaps a better idea to just let the idea die for now and focus on other features!
  -> We can of course still describe the idea in the paper, and where the journey has ended.
  -> TODO: Message Vadim on what to do, for now continue with sufficient qualification
* Got answer back during the meeting on what would happen with the example:

000001 ...
000002 PROCEDURE DIVISION.
000003     DISPLAY A.
000004   X.
000005     DISPLAY B.

Where X is parsed as a label "X" and not skipped over or parsed as "  X"

27-05-2022:
* Finished sufficient qualification.
* Continue on writing a pre-processor in ANTLR.

30-05-2022:
* Decided previously that it would be nice to support the COPY feature. This has now mostly been completed. Including
  giving errors in the correct locations, within the original files.
* However, after parsing has finished, we should still write a nice error method that also puts errors in a proper
  location when walking the tree with a tree listener or visitor.
* Found two bugs. Copy doesn't always replace all words, and apparently we cannot use LIKE clauses referring to fields
  in the data division...
  -> Both are fixed!
* Error messages always include the file name of the original file, and the correct line numbers. Additionally they have
  the erroring token underlined.
  -> TODO: Sometimes, however, we have an expression that is broken and not just a token. We still have to fix this,
           since only tokens are supported for now.

31-05-2022:
* Implemented keywords as identifiers, but only with limited functionality. Keywords that are used as identifiers cannot
  be implicitly defined. Additionally, if one wants to use such a keyword as an identifier, it cannot be fully uppercase
  (LOOp is allowed). Last but not least, if you want to use a keyword as a keyword that is also defined as an
  identifier, then it should be in uppercase.
  -> This limitation will keep existing, since we have to disable parse rules IN ADVANCE, and cannot backtrack till a
     certain point when parsing would fail. This is an unfortunate limitation of ANTLR. Backtracking is definitely
     done by the tool, however this is fully automated and cannot be altered by normal means.
* There was a small issue where sufficient qualification was not recognized any longer. But this happened due to possible
  ambiguity. The fix was to reorder the possible identifier statements.

02-06-2022:
* When using replace in the copy statement, only the content area is searched, and replacements are made only where a
  match is found. All other areas remain untouched.
* Still working on the division of the line when a copy statement is found. There is currently a problem with getting
  all the original lines outside the copy statement. But the input stream should solve this, since it contains the
  original input, including whitespace.

03-06-2022:
* Copy is now only replacing the statement itself, and not the entire line when the code is inserted. However, due to
  the internal workings the offset is usually now lost before the code enters the parser. This should however not
  matter since column based parsing is done by the pre-processor and is taken care of BEFORE the handling of the copy
  statement.
* Created a basic pretty printer, this can be used for quick debugging. Keywords can quickly be recognized since they're
  upper-cased, whereas identifiers are lower-cased.
* FIXME: The line numbers in error messages seem always to be 1. (At least for fib_bad.bc)

04-06-2022:
* Fixed the previously mentioned issue that line numbers were always 1. (Happened due to the new implementation for the
  handling of the copy statement.)
* FIXME: A new issue was found, probably the copy parser cannot deal with empty lines (better said, empty content areas)
         this should be fixed.
  -> Fixed. The pre-processor (Java regex) apparently does not like empty lines. If a line is empty, it is now simply
     skipped.

05-06-2022:
* Switched to automated testing for the parser with the help of JUnit. Files are read automatically, tested against
  possible error messages generated by its components.

06-06-2022:
* Implicit definitions are added to the variable structure.
* On 04-06 it was mentioned that the preprocessor didn't like empty lines. It couldn't handle an empty B area either.
  This is now completely fixed. The regex apparently required the columns to be filled; this is no longer needed in
  order to generate a match.
* Error messages contain the correct character positions inside a line, even if this line was continued.

07-06-2022:
* Figured out that keywords cannot be used as procedure names. If someone really wanted support for it, this could
  still be achieved by making all the grammar rules have an optional body after the first mentioned keyword. Then one
  could later on in a listener or visitor decide if the keyword indicates a statement or procedure name. Hence, this
  is something for future work.
* Since we will not cover white space insignificance anyhow, removed all the fragments for individual characters and
  changed all the tokens to just their strings.

08-06-2022:
* Added a test demonstrating that keywords cannot be used as procedure or paragraph names.
* Each test now prints how long the pre-processor, parser, and qualification checker took to execute.

09-06-2022:
* During the meeting with Vadim we discovered that it is actually quite easy to use keywords as paragraph names. This
  has now been implemented. Next I'll have a look at also being able to call these names in an unambiguous manner. And
  of course write some test cases for it.
* Procedure names can also now consist of a keyword name.

13-06-2022:
* TODO: Apparently I do not yet check if a variable is already defined or not.
* TODO: Is it actually possible to define an empty container?
  -> This is not desired right? -> Ask Vadim
* FIXME: The keyword STOP is not always properly recognized as a paragraph name, even though it is intended to be one.
         This makes sense since stop always only occurs on its own and hence cannot quickly be disambiguated. This could
         be fixed by making the parser explicitly aware of the area A and B. Since currently this is just being ignored.
  -> Test case:
         IDENTIFICATION DIVISION.
             T. T.
         DATA DIVISION.
             01 If PICTURE IS X.
             01 ELSe PICTURE IS 9.
             01 then PICTURE IS V.
         PROCEDURE DIVISION.
        * The line below should indicate a paragraph name, but instead it is recognized as a statement.
         STOP.
             IF if = then THEN DISPLAY then ELSE DISPLAY if END.
             STOP.
         MOVE.
             ACCEPT if.
         STOP

14-06-2022:
* A possible fix for the third point mentioned yesterday is to make the parser aware of the areas within the content
  area itself. A possible way to do this would be to make use of stropping.

22-06-2022:
* Added a grammar without predicates which is used in PerformanceTest.java.