Fixed-width tables don't account for special characters #41

Open
esimpsons3ti opened this issue Dec 3, 2024 · 0 comments

The current generation of a fixed-width index file relies on the absence of special characters (quotes, commas, and the like) to accurately calculate the location and size of each field. Below I am including a paraphrase of earlier discussion on the current state of the problem, as well as the two possible solutions.

There are a few issues here:
- If there is a quoted string, the fixed-width generator only takes the non-quoted string into account when determining table width.
- Some of the strings in an index file may be surrounded by quotes while others aren't.

Now with PDS3, the standard is:
- Put quotes around ALL strings
- The start byte of a column is the first character AFTER the starting quote
- The width of a column is determined by the number of characters INSIDE the quotes
- Trailing spaces should always be removed
- Quotes inside strings are not permitted
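
For reference, the PDS3 rules above pin down where each column starts and how wide it is. Here is a minimal sketch of that arithmetic, assuming every column is a quoted string, fields are separated by a single comma, every character is one byte, and byte offsets are 1-based. `pds3_column_offsets` is a hypothetical helper for illustration, not something in the current code:

```python
def pds3_column_offsets(widths):
    """Return (start_byte, width) for each quoted string column.

    Assumptions (not from the codebase): 1-based byte offsets, one byte per
    character, and a layout of "field","field",... with a single comma
    between fields.
    """
    offsets = []
    position = 1  # 1-based byte of the current field's opening quote
    for width in widths:
        # The start byte is the first character AFTER the opening quote.
        offsets.append((position + 1, width))
        # Skip the opening quote, the content, the closing quote, and the comma.
        position += width + 3
    return offsets


# Two columns, 7 and 4 characters wide: "JUPITER","IO  "
offsets = pds3_column_offsets([7, 4])
```

For the two-column example, the first column starts at byte 2 and the second at byte 12, since the quotes and the comma occupy bytes 1, 9, 10, and 11.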



PROPOSAL #1: I chatted with Mark a little, and we suggest that we do something more like the PDS3 concept
in PDS4 fixed-width (and only fixed-width) tables:

- All strings have quotes around them whether they need them or not
- The start byte of the column is the first character AFTER the starting quote
- The width of a column is the number of characters BETWEEN the quotes (including padding)
- Putting a quote inside the string will be represented as ""

A big advantage of this is it makes the tables easier to parse, because you don't have to worry about whether
a string is surrounded by "s or not - you are always just looking at the characters between the "s. It does mean
that you have to scan through all of the strings ahead of time, figure out which ones are going to need to escape
quotes, and add those escape characters to the overall column width before writing out the CSV/label.
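
The pre-scan described above could look roughly like this. It is a sketch only: the function names are made up for illustration, and it assumes one byte per character and space padding on the right.

```python
def escape_quotes(value: str) -> str:
    """Represent an embedded quote as "" (two quote characters)."""
    return value.replace('"', '""')


def column_width(values) -> int:
    """Width of the column once every value has had its quotes escaped."""
    return max(len(escape_quotes(value)) for value in values)


def format_field_padding_inside(value: str, width: int) -> str:
    """Proposal #1 layout: quote, escaped value padded to width, quote."""
    return '"' + escape_quotes(value).ljust(width) + '"'


values = ["JUPITER", "IO", 'THE "RED" SPOT']
width = column_width(values)  # must be computed before any row is written
fields = [format_field_padding_inside(value, width) for value in values]
```

Every formatted field ends up `width + 2` characters long, and the label's field length would be `width`, counted between the quotes.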

However, a downside is that all of the padding occurs inside the quotes.
This basically guarantees that trailing spaces must ALWAYS be stripped when reading a string column, and thus
trailing spaces can never be semantically meaningful. But stripping a column is not always possible - for example,
reading a column into Excel as a standard CSV (which should still work) will fail. At least I tried it with
LibreSheets and it kept the trailing spaces in the cells, even with "Trim Spaces" turned on. This means that,
for example, a cell that's in the table as "JUPITER  " will show up in the spreadsheet with the spaces, and can't
be compared to just the plain string JUPITER. A second problem is that if you're just reading the string between
the quotes, then having escaped quotes doesn't make any sense (and won't parse properly), because you only
need to escape quotes inside of other quotes, but you aren't looking at the other quotes.
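
Both problems can be reproduced in a few lines. This sketch uses Python's standard `csv` module as a stand-in for a plain CSV reader (it is not the spreadsheet import mentioned above), and the record and its 0-based offsets are made up for illustration:

```python
import csv

# One record in the Proposal #1 layout: padding sits inside the quotes.
record = '"JUPITER  ","THE ""RED"" SPOT"'

# Problem 1: a plain CSV reader unescapes the quotes but keeps the padding.
csv_row = next(csv.reader([record]))
planet = csv_row[0]

# Problem 2: a fixed-width reader that slices between the quotes never sees
# the outer quotes, so the doubled quotes come through unescaped.
start, width = 13, 16  # 0-based offset of the second field's content
sliced = record[start:start + width]
```

Here `planet` keeps its two trailing spaces, so it does not compare equal to `JUPITER`, and `sliced` still contains the doubled quotes around `RED`.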



PROPOSAL #2: This brings us to a second possibility, with different trade-offs. Put the padding AFTER the
second quote, instead of inside the string, and point the field start byte at the first quote instead of the first
character.

In this case every time we read a string we have to strip off the quotes, but we don't have any extra padding inside
the string so trailing spaces don't need to be handled separately. It also means the escaped quotes make sense.
This reads properly into LibreSheets with "Trim Spaces" turned on.
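
A sketch of what writing and reading under Proposal #2 might look like. The names are hypothetical, offsets here are 0-based Python indices rather than label byte numbers, and I am assuming the field width covers the quotes plus the padding, which the proposal above doesn't spell out:

```python
def format_field_padding_outside(value: str, width: int) -> str:
    """Proposal #2 layout: quoted, escaped value, then padding after the quote."""
    quoted = '"' + value.replace('"', '""') + '"'
    return quoted.ljust(width)


def read_field(record: str, start: int, width: int) -> str:
    """Slice the field, drop the outside padding, then the quotes, then unescape."""
    raw = record[start:start + width].rstrip(" ")
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        raw = raw[1:-1]
    return raw.replace('""', '"')


values = ["JUPITER", 'THE "RED" SPOT']
# Assumed width: the longest escaped value plus its two quotes.
width = max(len(v.replace('"', '""')) + 2 for v in values)
record = ",".join(format_field_padding_outside(v, width) for v in values)

first = read_field(record, 0, width)
second = read_field(record, width + 1, width)  # +1 skips the comma
```

Because the padding is outside the quotes, nothing inside the quotes has to be stripped, and the `""` escapes are only undone after the outer quotes have been seen.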

Note I want to avoid what might be the truly "correct" version, in which a fixed-width table doesn't need
outside quotes or escaped quotes at all (since you know the column width), because that means you can't
read the file with a plain CSV reader. It also still has the strip-trailing-spaces problem.

NOTE: Development on this issue is currently paused until a decision is made on which standard the code should follow.
