Dynamically changing RS #143

Open
microo8 opened this issue Aug 18, 2022 · 7 comments
@microo8

microo8 commented Aug 18, 2022

I've got a file where the first few bytes define some of the attributes of the file. The 9th byte is the record separator.

I need to read this file, set RS and then read the file "again" but now separated by this new record separator.

Input file (here the record separator is '):

UNA:+,? 'UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL'UNH+1+MSCONS:D:04B:UN:2.3'BGM+7+EC33218279A-1+9'DTM+137:202012310206:203'RFF+Z13:13018'NAD+MS+4042805000102::9'NAD+MR+4016001000655::9'UNS+D'NAD+DP'LOC+172+DE00108108359V0000000000000088446'DTM+163:202012300000?+01:303

This works on GNU awk:

BEGIN { RS=".{9}" }
NR==1 { $0=substr(RT,1,8); RS=substr(RT,9,1) }
{ print $0 }

output:

UNA:+,?
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303

but not on goawk:

UNA:+,? 








@benhoyt
Owner

benhoyt commented Aug 19, 2022

Interesting, thanks for the report! This is a tricky one. It seems that Gawk (and other AWKs) allow you to set RS at any time while reading an input file: RS is updated dynamically, and the rest (the unread part) of the file is read and parsed with the new value. However, GoAWK uses bufio.Scanner on each input file, which doesn't have an API that allows dynamically updating the separator as you read (some of the data already read would still be in its buffer).
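To illustrate the buffering problem, here is a minimal Go sketch. It is not GoAWK's actual code; the function name `leftoverAfterFirstScan` and the sample input are made up for illustration. It scans a single record with bufio.Scanner and then checks what is still unread in the underlying reader. For a small input, the scanner has already pulled everything into its own buffer, so simply creating a second scanner on the same reader with a new separator would find nothing left to read.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// leftoverAfterFirstScan scans one newline-terminated record from input using
// bufio.Scanner, then reads whatever is still unread in the underlying reader.
// (Hypothetical helper, for illustration only.)
func leftoverAfterFirstScan(input string) (first, leftover string) {
	r := strings.NewReader(input)
	sc := bufio.NewScanner(r) // default split function: lines
	sc.Scan()
	first = sc.Text()

	// The scanner reads ahead into its internal buffer, so for a short
	// input the underlying reader is already drained at this point.
	rest, _ := io.ReadAll(r)
	return first, string(rest)
}

func main() {
	first, leftover := leftoverAfterFirstScan("UNA:+,? 'A+1'B+2\nUNA:+,? 'A+1'B+2\n")
	fmt.Printf("first record:   %q\n", first)
	fmt.Printf("left in reader: %q\n", leftover)
}
```

The unread second line is not lost, but it lives only inside the scanner's private buffer, which bufio.Scanner does not expose.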

I can reproduce your case if I save your input file to rstest.in and the program to rstest.awk:

$ gawk -f rstest.awk rstest.in 
UNA:+,? 
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303

$ goawk -f rstest.awk rstest.in 
UNA:+,? 

... lots more blank lines ...

303

$

However, that program doesn't work in original-awk or mawk either, I guess because of the use of the Gawk-only RT variable. Here's a more portable program that shows the same "dynamic setting of RS" issue:

$ cat rstest2.awk
NR==1 { RS=substr($0,9,1) }
NR>1  { print $0 }
$ cat rstest.in rstest.in >rstest2.in
$ gawk -f rstest2.awk rstest2.in  # original-awk and mawk have the same output now
UNA:+,? 
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303
$ goawk -f rstest2.awk rstest2.in 
UNA:+,? 'UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL'UNH+1+MSCONS:D:04B:UN:2.3'BGM+7+EC33218279A-1+9'DTM+137:202012310206:203'RFF+Z13:13018'NAD+MS+4042805000102::9'NAD+MR+4016001000655::9'UNS+D'NAD+DP'LOC+172+DE00108108359V0000000000000088446'DTM+163:202012300000?+01:303
$ 

To work around this in GoAWK for now, I'd recommend actually reading (part of) the file twice. Note how rstest.in is specified twice on the command line. This works in GoAWK and other AWKs:

$ cat rstest3.awk 
NR==1   { RS=substr($0,9,1); next }
NR!=FNR { print $0 }
$ goawk -f rstest3.awk rstest.in rstest.in
UNA:+,? 
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303

$ 

That said, I think this is a bug (or at least a quirk) of GoAWK, so I'm going to leave it open. I'm not sure of the best way to fix it without revamping the use of bufio.Scanner. I think I'd need a scanner variant that can transfer the remaining/buffered bytes to a new scanner when RS changes dynamically.

@benhoyt
Owner

benhoyt commented Aug 19, 2022

@arnoldrobbins, any thoughts on this? Where is this behaviour (that one can change RS part way through a file) documented, or is it just assumed that this will work? I couldn't find it explicitly documented from a scan of RS in the Gawk manual, though I may have missed it.

@arnoldrobbins

It's just assumed it will work. RS is like any other variable that you can change at any time you like. I agree with your assessment, that this is a bug in GoAWK. In C this is handled fairly naturally; there's a buffer, RS matches the end of the text, and then you start again with whatever is in the current value of RS to find the next end of the buffer (with appropriate buffer management and filling from the file). HTH.
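The buffer-based approach described above can be sketched roughly as follows. This is a simplified illustration in Go, not Gawk's C code or GoAWK's implementation: the `recordReader` type and `splitDynamic` function are hypothetical names, and only single-byte separators are handled (no regex RS, no paragraph mode). The key point is that the separator is looked up on every call, so bytes already sitting in the buffer are split using whatever RS holds at that moment.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// recordReader is a hypothetical, simplified record reader that supports
// only single-byte separators.
type recordReader struct {
	br *bufio.Reader
	RS byte // current record separator; may be changed between calls to next
}

// next returns the next record, splitting on the current value of RS.
// Data already buffered by br is examined with the new separator.
func (rr *recordReader) next() (string, bool) {
	rec, err := rr.br.ReadString(rr.RS)
	if err != nil {
		// End of input (or a read error): return any final unterminated record.
		return rec, rec != ""
	}
	return rec[:len(rec)-1], true // strip the trailing separator
}

// splitDynamic mimics rstest2.awk: read record 1 with RS="\n", set RS to the
// 9th byte of that record, then return the remaining records.
func splitDynamic(input string) []string {
	rr := &recordReader{br: bufio.NewReader(strings.NewReader(input)), RS: '\n'}
	first, ok := rr.next()
	if !ok || len(first) < 9 {
		return nil
	}
	rr.RS = first[8] // 9th byte, as in substr($0,9,1)

	var records []string
	for {
		rec, ok := rr.next()
		if !ok {
			break
		}
		records = append(records, rec)
	}
	return records
}

func main() {
	for _, rec := range splitDynamic("UNA:+,? 'A+1'B+2\nUNA:+,? 'A+1'B+2\n") {
		fmt.Printf("%q\n", rec)
	}
}
```

Because bufio.Reader keeps its buffer across calls to ReadString, changing the delimiter between calls needs no special handling, which is the property bufio.Scanner lacks.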

@benhoyt
Owner

benhoyt commented Aug 19, 2022

I think what I'll do here (at some point) is copy the bufio.Scanner implementation into the GoAWK codebase, add a Buffered() io.Reader method (similar to encoding/json's Decoder.Buffered), and then use that when RS changes in the middle of reading a file. If Buffered() works out well, I'll propose adding Buffered to Go's bufio.Scanner.
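For reference, this is how the existing encoding/json pattern mentioned above works. The sketch below uses only the real Decoder.Buffered method; the function name `decodeThenRest` and the sample input are invented for illustration, and it says nothing about how a Scanner version would eventually be implemented. After decoding one value, the bytes the decoder has read ahead are chained with the rest of the underlying reader, so a different consumer can continue from exactly where the decoder stopped.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// decodeThenRest decodes one JSON string value from input, then returns
// everything after that value by combining the decoder's read-ahead buffer
// with the still-unread part of the underlying reader.
// (Hypothetical helper, for illustration only.)
func decodeThenRest(input string) (value, rest string, err error) {
	r := strings.NewReader(input)
	dec := json.NewDecoder(r)
	if err = dec.Decode(&value); err != nil {
		return "", "", err
	}

	// Buffered exposes data the decoder has read but not yet consumed.
	// MultiReader continues with whatever is still unread in r.
	remaining, err := io.ReadAll(io.MultiReader(dec.Buffered(), r))
	return value, string(remaining), err
}

func main() {
	value, rest, err := decodeThenRest("\"header\"\nA+1'B+2'")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("value: %q\n", value)
	fmt.Printf("rest:  %q\n", rest)
}
```

A Buffered method on a scanner would play the same role: when RS changes, the buffered bytes plus the remaining input could be handed to a new scanner configured with the new separator.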

@janxkoci

I remember this fun example in the Gawk book that uses RS+print to implement sed-like find-and-replace - the RS is updated in every cycle of the implicit loop while reading the input. The idea is credited to Mike Brennan, so probably it's portable to mawk at minimum.

@arnoldrobbins

Actually, at the moment, only gawk supports RT, which this program uses. Maybe one day RT will find its way into other awks.

@janxkoci

Oh, I missed that! Also, I just read the page again and RS is only set once (in the BEGIN block, which itself usually implies "once"). So I was wrong on multiple fronts 🤦
