02-05-2022:
* Add keywords to grammar file (BabyCobolTokens.g4)
* Copy grammar from original BabyCobol paper (BabyCobolGrammar.g4)
* Extend original grammar with divisions & paragraphs
-> Achieved: Simple parsing of the fib example files as given in the paper. Doesn't parse completely yet.
03-05-2022:
* Work on identification division
-> Encountered a problem with greedy matching: ~[.]+ always matches everything, not just the identification division.
-> A possible solution is to switch to a non-greedy strategy (https://github.com/antlr/antlr4/blob/master/doc/wildcard.md)
-> In the end this was not necessary: switched to a recursive grammar rule instead, which switches between arbitrary values.
Really not a nice implementation, but it works.
* Add support for sufficient qualification (FOR)
* Ability to fully parse (without any errors) the simple fib file.
* Add labels to grammar rules with multiple RHSs.
* Started working on a quick-and-dirty pretty-print program.
-> Is able to expand picture representations to their proper values: 2(X) -> XX and 99V9(9) -> 99V999999999
-> For the visual effect, sequence numbers are printed in the first six columns.
-> Level numbers in the data division are recalculated so that they increase by one per nesting depth.
-> Not properly done yet... Currently the following problem persists:
000000 IDENTIFICATION DIVISION.
000001 PROGRAMID. TEST-REPRESENTATION.
000002 DATA DIVISION.
000003 1 TEST PICTURE IS 9XXXXX999.
000004 2 BETA LIKE TEST.
000005 1 OMEGA OCCURS 20 TIMES.
000006 3 ALPHA PICTURE IS AAAXXXXX9.
But 3 ALPHA ... should be one indent back, starting on the same column as 2 BETA ...
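A minimal sketch of how the level recalculation could work, assuming the data division entries are available as a flat list of their original level numbers. The class name LevelNormalizer and the sample levels are hypothetical and not taken from the project:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LevelNormalizer {
    /**
     * Maps arbitrary level numbers (e.g. 1, 5, 10) to nesting depths 1, 2, 3, ...
     * Hypothetical helper: the real pretty printer may work on parse tree nodes instead.
     */
    public static int[] normalize(int[] levels) {
        // Original level numbers of the containers that are still open.
        Deque<Integer> open = new ArrayDeque<>();
        int[] depths = new int[levels.length];
        for (int i = 0; i < levels.length; i++) {
            // Close every open entry that cannot be an ancestor of the current one.
            while (!open.isEmpty() && open.peek() >= levels[i]) {
                open.pop();
            }
            open.push(levels[i]);
            // The depth is the number of open entries, including the current one.
            depths[i] = open.size();
        }
        return depths;
    }
}
```

With the original levels 1, 5, 1, 10 this gives the depths 1, 2, 1, 2, so the last entry ends up on the same depth as the second one.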
04-05-2022:
* Update the ANTLR grammar to what's available on the BabyCobol documentation site, since the paper has some mistakes
* Set new goal: Implement a pre-processor that conducts the full lexical analysis.
-> Attempted to write an ANTLR grammar, but for now switched to Java regex.
05-05-2022:
* Continued on column-based parsing.
-> Based the regex on the COBOL preprocessor as found in the ANTLR grammars repository on GitHub (https://github.com/antlr/grammars-v4/blob/c270f16dc0bc30fb15f7b1aac1d2239f8512d704/cobol85/java/CobolPreprocessor.java#L34)
06-05-2022:
* Continued on column-based parsing
-> Started working on merging continuation lines into the line they continue.
-> Finished column-based parsing in the Java preprocessor.
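A rough sketch of what regex-based column splitting could look like. It assumes the usual fixed-format layout (columns 1-6 sequence area, column 7 indicator, columns 8-11 area A, columns 12-72 area B, everything after that ignored). The class name ColumnSplitter and the pattern are illustrative and not the regex used in the project:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColumnSplitter {
    // Assumed layout: 6 sequence columns, 1 indicator column, 4 columns of area A,
    // 61 columns of area B, and the remainder after column 72.
    // Every group may be shorter than its maximum width, so short and empty lines still match.
    private static final Pattern LINE =
            Pattern.compile("^(.{0,6})(.?)(.{0,4})(.{0,61})(.*)$");

    /** Returns { sequence, indicator, areaA, areaB, rest } for a single line without line terminator. */
    public static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            // Only happens when the input contains a line terminator.
            throw new IllegalArgumentException("Not a single line: " + line);
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)};
    }
}
```

For example, split("000003 DATA DIVISION.") yields the sequence area "000003", a blank indicator, "DATA" in area A and " DIVISION." in area B.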
09-05-2022:
* The preprocessor will communicate line numbers after the 72nd column with the sequence #![S]_[E]\r\n
-> Where S is the first original line number of the line and E is the last (relevant in case of line merges).
-> Started on a custom error listener that shows correct line numbers and underlines errors.
* By including the Line objects in the error listener, we do not have to provide the original line numbers in the text to be parsed.
* Give a warning when the clean printing removes information from the A section
* Found a section in the ANTLR reference p. 210 about keywords not being reserved.
12-05-2022:
* Start on implementing "keywords as identifiers" according to p. 210.
-> Looks easy, since it is just a matter of adding all keywords to the identifier rule. However, how do we resolve the
resulting ambiguity, given that ANTLR chooses for us?
* Move atomic expressions out of normal statements.
* PROBLEM: How do we tell the parser to choose depending on the case?!
* White space insignificance
-> Tried to split tokens into subtokens consisting of single letters; this didn't work. Whitespace still affected
tokenization.
-> Put separate single-character tokens together as grammar rules.
-> This works great! Still have to figure out how to handle the picture representation and integers.
-> Integers should be fixed now.
16-05-2022:
* Started on using predicates in order to distinguish keywords from identifiers.
* Eventually decided to switch back, since during parsing one cannot get the value of a grammar rule on the
right-hand side, as this can still be anything. (Yields a class cast error.)
17-05-2022:
* Possibly came up with a fix for yesterday's issue. Instead of redefining every token as a grammar rule consisting
of separate tokens, we are back at actual tokens. However, they are built up from individual character fragments
that are non-greedily followed by whitespace.
-> This took a lot of effort, since everything was being recognized as names. It was fixed by making sure
that all the individual fragments have a non-greedy consumption of possible whitespace.
-> It should now be possible to use predicates in order to check if the parser should detect the keyword or tell that
it is an identifier.
18-05-2022:
* While testing the parser I figured out that ANTLR shows all the places where ambiguity takes place. Perhaps there are
triggers fired during parsing upon which we could act.
* Start focussing on one thing at a time: for now do not continue on whitespace insignificance, but first focus on having
the option to use keywords as identifiers.
* Just discovered that ambiguities are apparently reported!!! And can be dealt with in the error listener!!!
* Ambiguities are indeed DETECTED, but cannot be changed with the reportAmbiguity() function as declared in the
BaseErrorListener.
19-05-2022:
* New idea, we can apparently not get the proper input in a predicate of a grammar rule before deciding things.
However, if a predicate decides to fail afterwards, we can handle accordingly.
-> Yeah handling things afterwards in this manner doesn't seem the best option.
* Perhaps now take a look at having the ability to use error recovery?!
* Since I am quite stuck, for now wait till I get the repositories from Vadim and start focussing on something else.
23-05-2022:
* Start on sufficient qualification.
* Finished the data structure that holds all variables according to the given order and levels.
* Created a tree visitor that collects the variables into the previously made data structure.
24-05-2022:
* Validate the qualification in the procedure division. Give error messages and warnings when appropriate.
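A simplified sketch of the kind of check that sufficient qualification needs: a reference such as "C OF A" is fine when it resolves to exactly one field, undefined when it resolves to none, and ambiguous otherwise. The classes, the sample data and the result strings are hypothetical and not the project's data structure:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QualificationSketch {
    /** A data division entry with a link to its enclosing container (null at top level). */
    public static final class Field {
        final String name;
        final Field parent;

        public Field(String name, Field parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Hypothetical sample: two containers A and B that each hold a field named C. */
    public static List<Field> sample() {
        Field a = new Field("A", null);
        Field b = new Field("B", null);
        return Arrays.asList(a, new Field("C", a), b, new Field("C", b));
    }

    /** Collects all fields called name whose ancestors contain the qualifiers, innermost first. */
    static List<Field> resolve(List<Field> all, String name, List<String> qualifiers) {
        List<Field> hits = new ArrayList<>();
        for (Field f : all) {
            if (!f.name.equalsIgnoreCase(name)) {
                continue;
            }
            int matched = 0;
            // Walk outwards; qualifiers may skip intermediate containers.
            for (Field up = f.parent; up != null && matched < qualifiers.size(); up = up.parent) {
                if (up.name.equalsIgnoreCase(qualifiers.get(matched))) {
                    matched++;
                }
            }
            if (matched == qualifiers.size()) {
                hits.add(f);
            }
        }
        return hits;
    }

    /** Returns "ok", "undefined" or "ambiguous" for a (possibly qualified) reference. */
    public static String check(List<Field> all, String name, String... qualifiers) {
        int count = resolve(all, name, Arrays.asList(qualifiers)).size();
        if (count == 0) {
            return "undefined";
        }
        return count == 1 ? "ok" : "ambiguous";
    }
}
```

With the sample data, a bare reference to C is ambiguous, while C qualified with A resolves to exactly one field.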
25-05-2022:
* Finally received the ANTLR projects from Vadim.
-> None of them implement the keywords == identifiers feature.
-> Some of them have limited whitespace ignorance between tokens but not within them.
-> Hence that feature isn't implemented either since one just does WS: [ ] -> skip;
* Conclusion: Perhaps a better idea to just let the idea die for now and focus on other features!
-> We can of course still describe the idea in the paper, and where the journey has ended.
-> TODO: Message Vadim on what to do, for now continue with sufficient qualification
* Got answer back during the meeting on what would happen with the example:
000001 ...
000002 PROCEDURE DIVISION.
000003 DISPLAY A.
000004 X.
000005 DISPLAY B.
Where X is parsed as a label "X" and not skipped over or parsed as " X"
27-05-2022:
* Finished sufficient qualification.
* Continue on writing a pre-processor in ANTLR.
30-05-2022:
* Decided previously that it would be nice to support the COPY feature. This has now mostly been completed, including
reporting errors at the correct locations within the original files.
* However, after parsing has finished, we should still write a nice error method that also puts errors in a proper
location when walking the tree with a tree listener or visitor.
* Found two bugs. COPY doesn't always replace all words, and apparently LIKE clauses cannot refer to fields
in the data division...
-> Both are fixed!
* Error messages always include the file name of the original file and the correct line numbers. Additionally, they have
the offending token underlined.
-> TODO: Sometimes, however, we have an expression that is broken and not just a token. We still have to fix this,
since only tokens are supported for now.
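A small sketch of how an error message with an underlined token could be built from a file name, a line number, a zero-based column and the token length. The class name, the message layout and the sample values are assumptions, not the project's actual output format:

```java
public class ErrorUnderline {
    /**
     * Builds a three-line message: location and text, the source line, and a marker line.
     * The column is zero-based; a length below 1 still produces a single marker.
     */
    public static String format(String file, int line, int column, int length,
                                String source, String message) {
        StringBuilder sb = new StringBuilder();
        sb.append(file).append(':').append(line).append(':').append(column)
          .append(": ").append(message).append('\n');
        sb.append(source).append('\n');
        // Pad with spaces up to the start of the token.
        for (int i = 0; i < column; i++) {
            sb.append(' ');
        }
        // Underline the token itself.
        for (int i = 0; i < Math.max(1, length); i++) {
            sb.append('^');
        }
        return sb.toString();
    }
}
```

Underlining a whole broken expression instead of a single token would only need a different column and length, which fits the TODO above.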
31-05-2022:
* Implemented keywords as identifiers, but only with limited functionality. Identifiers that are also keywords cannot
be implicitly defined. Additionally, if one wants to use such a keyword as an identifier, it cannot be written in full
uppercase (LOOp is allowed). Last but not least, if you want to use a keyword as a keyword while it is also defined as an
identifier, then it should be in uppercase.
-> This limitation will remain, since we have to disable parse rules IN ADVANCE, and cannot backtrack to a
certain point when parsing would fail. This is an unfortunate limitation of ANTLR. Backtracking is definitely
done by the tool, but it is fully automated and cannot be altered by normal means.
* There was a small issue where sufficient qualification was not recognized any longer. But this happened due to possible
ambiguity. The fix was to reorder the possible identifier statements.
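The uppercase rule described in this entry can be sketched as a small helper that a predicate could consult. This assumes the set of explicitly declared names is known before the procedure division is parsed. The class and method names are hypothetical and the helper is only meant for keyword-shaped text:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class KeywordPredicate {
    // Explicitly declared identifiers, stored in uppercase for case-insensitive lookup.
    private final Set<String> declared = new HashSet<>();

    /** Registers a name that was explicitly defined in the data division. */
    public void declare(String name) {
        declared.add(name.toUpperCase(Locale.ROOT));
    }

    /**
     * A keyword-shaped word counts as an identifier only if it was declared
     * and is not written in full uppercase; otherwise it stays a keyword.
     */
    public boolean isIdentifier(String text) {
        String upper = text.toUpperCase(Locale.ROOT);
        return declared.contains(upper) && !text.equals(upper);
    }
}
```

After declare("LOOp"), both "LOOp" and "loop" are treated as identifiers, while "LOOP" remains the keyword.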
02-06-2022:
* When using replace in the copy statement, only the content area is searched, and matches are replaced only there. All
other areas remain untouched.
* Still working on splitting the line when a copy statement is found. There is currently a problem with getting all
the original lines outside the copy statement. But the input stream should solve this, since it contains the original
input, including whitespace.
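A minimal sketch of restricting a replacement to the content area. It assumes that the content area spans columns 8 to 72 and uses a plain literal replacement. The class name and both assumptions are illustrative and not the project's implementation:

```java
public class ContentAreaReplace {
    /**
     * Replaces every literal occurrence of from with to, but only inside the
     * assumed content area (columns 8-72). The sequence area, the indicator
     * column and anything after column 72 are copied unchanged.
     */
    public static String replace(String line, String from, String to) {
        if (line.length() <= 7) {
            return line; // no content area present
        }
        int end = Math.min(line.length(), 72);
        String prefix = line.substring(0, 7);    // columns 1-7
        String content = line.substring(7, end); // columns 8-72
        String suffix = line.substring(end);     // columns 73 and up
        return prefix + content.replace(from, to) + suffix;
    }
}
```

Note that a replacement of a different length shifts the text after it, so a real implementation would still have to check that the result fits within column 72.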
03-06-2022:
* Copy now replaces only the statement itself, and not the entire line, when the code is inserted. However, due to
the internal workings the offset is now usually lost before the code enters the parser. This should not
matter, since column-based parsing is done by the pre-processor and is taken care of BEFORE the handling of the copy
statement.
* Created a basic pretty printer, which can be used for quick debugging. Keywords can quickly be recognized since they're
upper-cased, whereas identifiers are lower-cased.
* FIXME: The line numbers in error messages seem always to be 1. (At least for fib_bad.bc)
04-06-2022:
* Fixed the previously mentioned issue that line numbers were always 1. (Happened due to the new implementation for the
handling of the copy statement.)
* FIXME: A new issue was found: the copy parser probably cannot deal with empty lines (or rather, empty content areas).
This should be fixed.
-> Fixed. The pre-processor (Java regex) apparently does not like empty lines. If a line is empty, it is now simply
skipped.
05-06-2022:
* Switched to automated testing for the parser with the help of JUnit. Files are read automatically and tested against
the error messages that may be generated by the parser's components.
06-06-2022:
* Implicit definitions are added to the variable structure.
* On 04-06 it was mentioned that the preprocessor didn't like empty lines. It couldn't handle an empty B area either. This is now
completely fixed. The regex apparently required the columns to be filled; this is no longer needed in order to
generate a match.
* Error messages contain the correct character positions inside a line, even if this line was continued.
07-06-2022:
* Figured out that keywords cannot be used as procedure names. If someone really wanted support for this, it could
still be achieved by making all the grammar rules have an optional body after the first mentioned
keyword. Then one could later on, in a listener or visitor, decide if the keyword indicates a statement or a procedure
name. Hence, this is something for future work.
* Since we will not cover white space insignificance anyhow, removed all the fragments for individual characters and
changed all the tokens to just their strings.
08-06-2022:
* Added a test demonstrating that keywords cannot be used as procedure or paragraph names.
* Each test now prints how long the pre-processor, parser, and qualification checker took to execute.
09-06-2022:
* During the meeting with Vadim we discovered that it is actually quite easy to use keywords as paragraph names. This
has now been implemented. Next I'll have a look at also being able to call these names in an unambiguous manner. And
of course write some test cases for it.
* Procedure names can also now consist of a keyword name.
13-06-2022:
* TODO: Apparently I do not yet check if a variable is already defined or not.
* TODO: Is it actually possible to define an empty container?
-> This is not desired right? -> Ask Vadim
* FIXME: The keyword STOP is not always properly recognized as a paragraph name, even though it is intended to be one.
This makes sense, since STOP always occurs on its own and hence cannot easily be disambiguated. This could
be fixed by making the parser explicitly aware of areas A and B, which are currently just ignored.
-> Test case:
IDENTIFICATION DIVISION.
T. T.
DATA DIVISION.
01 If PICTURE IS X.
01 ELSe PICTURE IS 9.
01 then PICTURE IS V.
PROCEDURE DIVISION.
* The line below should indicate a paragraph name, but instead it is recognized as a statement.
STOP.
IF if = then THEN DISPLAY then ELSE DISPLAY if END.
STOP.
MOVE.
ACCEPT if.
STOP
14-06-2022:
* A possible fix for the third point mentioned yesterday is to make the parser aware of the areas within the content
area itself. A possible way to do this would be to make use of stropping.
22-06-2022:
* Added a grammar without predicates which is used in PerformanceTest.java.