Binary resources #1127

Open
michaelhkay opened this issue Mar 28, 2024 · 12 comments
Labels
  • Enhancement: A change or improvement to an existing feature
  • Feature: A change that introduces a new feature
  • PRG-easy: Categorized as "easy" at the Prague f2f, 2024
  • PRG-required: Categorized as "required for 4.0" at the Prague f2f, 2024
  • XQUF: An issue related to the XQuery Update Facility

Comments

@michaelhkay
Contributor

michaelhkay commented Mar 28, 2024

We have some functions that accept binary input (parse-html, parse-csv) and others that don't (parse-xml, parse-json). There seems to be no obvious justification for the inconsistency.

Related to this:

(a) we have no functions to convert (encode/decode) between binary and string given an encoding

(b) we have no function to read a binary resource from a URI

Both of these are available in the EXPath bin library but should perhaps be promoted to the main spec.
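
As an illustration of point (a), here is a minimal sketch of the round trip that the EXPath Binary module already offers. It assumes a processor that implements that module (the namespace declaration is needed on some processors and harmless on others), and it is written as XQuery 3.1 rather than against any 4.0 draft function.

```xquery
xquery version "3.1";
declare namespace bin = "http://expath.org/ns/binary";

(: U+00E9 (e-acute) is a single codepoint but two octets in UTF-8 :)
let $text  := "&#xE9;"
let $bytes := bin:encode-string($text, "UTF-8")
return (
  bin:to-octets($bytes),                        (: 195, 169 :)
  bin:decode-string($bytes, "UTF-8") eq $text   (: true() :)
)
```

The open question in this issue is whether equivalents of these calls should live in the core function library instead of an optional module.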

@ChristianGruen
Contributor

> We have some functions that accept binary input (parse-html, parse-csv) and others that don't (parse-xml, parse-json). There seems to be no obvious justification for the inconsistency.

Related: #748

> (b) we have no function to read a binary resource from a URI

Related: #557

ChristianGruen added the Enhancement, XQUF and Feature labels and removed the Enhancement label on Mar 28, 2024
@dnovatchev
Contributor

dnovatchev commented Mar 28, 2024

I am trying to understand what this issue is about - please confirm or correct me if I am wrong.

Does the following code - converting an xs:base64Binary to "codepoints" - have something to do with this issue, or is it something completely different?

let $hexCodepoints := function($input as xs:hexBinary) as xs:integer*
{
  let $hexchars := map{
    "00": 0, "01": 1, "02": 2, "03": 3, "04": 4, "05": 5, "06": 6, "07": 7, "08": 8, "09": 9, "0A": 10, "0B": 11, "0C": 12, "0D": 13, "0E": 14, "0F": 15, 
    "10": 16, "11": 17, "12": 18, "13": 19, "14": 20, "15": 21, "16": 22, "17": 23, "18": 24, "19": 25, "1A": 26, "1B": 27, "1C": 28, "1D": 29, "1E": 30, "1F": 31, 
    "20": 32, "21": 33, "22": 34, "23": 35, "24": 36, "25": 37, "26": 38, "27": 39, "28": 40, "29": 41, "2A": 42, "2B": 43, "2C": 44, "2D": 45, "2E": 46, "2F": 47, 
    "30": 48, "31": 49, "32": 50, "33": 51, "34": 52, "35": 53, "36": 54, "37": 55, "38": 56, "39": 57, "3A": 58, "3B": 59, "3C": 60, "3D": 61, "3E": 62, "3F": 63, 
    "40": 64, "41": 65, "42": 66, "43": 67, "44": 68, "45": 69, "46": 70, "47": 71, "48": 72, "49": 73, "4A": 74, "4B": 75, "4C": 76, "4D": 77, "4E": 78, "4F": 79, 
    "50": 80, "51": 81, "52": 82, "53": 83, "54": 84, "55": 85, "56": 86, "57": 87, "58": 88, "59": 89, "5A": 90, "5B": 91, "5C": 92, "5D": 93, "5E": 94, "5F": 95, 
    "60": 96, "61": 97, "62": 98, "63": 99, "64": 100, "65": 101, "66": 102, "67": 103, "68": 104, "69": 105, "6A": 106, "6B": 107, "6C": 108, "6D": 109, "6E": 110, "6F": 111, 
    "70": 112, "71": 113, "72": 114, "73": 115, "74": 116, "75": 117, "76": 118, "77": 119, "78": 120, "79": 121, "7A": 122, "7B": 123, "7C": 124, "7D": 125, "7E": 126, "7F": 127, 
    "80": 128, "81": 129, "82": 130, "83": 131, "84": 132, "85": 133, "86": 134, "87": 135, "88": 136, "89": 137, "8A": 138, "8B": 139, "8C": 140, "8D": 141, "8E": 142, "8F": 143, 
    "90": 144, "91": 145, "92": 146, "93": 147, "94": 148, "95": 149, "96": 150, "97": 151, "98": 152, "99": 153, "9A": 154, "9B": 155, "9C": 156, "9D": 157, "9E": 158, "9F": 159, 
    "A0": 160, "A1": 161, "A2": 162, "A3": 163, "A4": 164, "A5": 165, "A6": 166, "A7": 167, "A8": 168, "A9": 169, "AA": 170, "AB": 171, "AC": 172, "AD": 173, "AE": 174, "AF": 175, 
    "B0": 176, "B1": 177, "B2": 178, "B3": 179, "B4": 180, "B5": 181, "B6": 182, "B7": 183, "B8": 184, "B9": 185, "BA": 186, "BB": 187, "BC": 188, "BD": 189, "BE": 190, "BF": 191, 
    "C0": 192, "C1": 193, "C2": 194, "C3": 195, "C4": 196, "C5": 197, "C6": 198, "C7": 199, "C8": 200, "C9": 201, "CA": 202, "CB": 203, "CC": 204, "CD": 205, "CE": 206, "CF": 207, 
    "D0": 208, "D1": 209, "D2": 210, "D3": 211, "D4": 212, "D5": 213, "D6": 214, "D7": 215, "D8": 216, "D9": 217, "DA": 218, "DB": 219, "DC": 220, "DD": 221, "DE": 222, "DF": 223, 
    "E0": 224, "E1": 225, "E2": 226, "E3": 227, "E4": 228, "E5": 229, "E6": 230, "E7": 231, "E8": 232, "E9": 233, "EA": 234, "EB": 235, "EC": 236, "ED": 237, "EE": 238, "EF": 239, 
    "F0": 240, "F1": 241, "F2": 242, "F3": 243, "F4": 244, "F5": 245, "F6": 246, "F7": 247, "F8": 248, "F9": 249, "FA": 250, "FB": 251, "FC": 252, "FD": 253, "FE": 254, "FF": 255 },
     $strInput := xs:string($input) 
   return
   (
    for $i in 1 to xs:integer(string-length($strInput) div 2),
        $j in 2 * $i -1
     return $hexchars(substring($strInput, $j, 2))
   )
},

$invertBase64Binary := function($input as xs:base64Binary) as xs:integer*
{
  let $hexBin := xs:hexBinary($input),
      $codePoints := $hexCodepoints($hexBin)
   return
     for $cp in $codePoints
      return 255 - $cp 
}
  return 
  (
    "Hex-Binary(YAYBBQEBBQA=): " || xs:hexBinary(xs:base64Binary("YAYBBQEBBQA=")),
    "==================================================================",
    "Hex Codepoints: ",
    $hexCodepoints(xs:hexBinary(xs:base64Binary("YAYBBQEBBQA="))),
    "==================================================================",
    "Inverted codepoints: ",
    $invertBase64Binary(xs:base64Binary("YAYBBQEBBQA="))
  )

This produces the following result:

Hex-Binary(YAYBBQEBBQA=): 6006010501010500
==================================================================
Hex Codepoints: 
96
6
1
5
1
1
5
0
==================================================================
Inverted codepoints: 
159
249
254
250
254
254
250
255

@michaelhkay
Contributor Author

I think your code is converting a hexBinary value into a sequence of octets, represented as integers in the range 0-255. There's nothing I can see here that is anything to do with characters or codepoints. (The use of the term codepoints in your post is quite misleading: Unicode codepoints require multiple octets.) A simpler implementation of this function in pure XQuery 4.0 might be

declare function hexStringToOctets($hex) {
  if ($hex) { parse-integer(substring($hex, 1, 2), 16), hexStringToOctets(substring($hex, 3)) }
};

That is of course sometimes a useful thing to be able to do; and it is one of the many functions available in the EXPath binary module defined at https://expath.org/spec/binary (see function bin:to-octets).

This issue isn't asking for all the functionality of the EXPath binary module to be added into F+O, nor was I thinking about this specific function. Rather, it's observing that with things like parse-html() we are implicitly invoking binary-to-string encoding and decoding, and the ability to read binary resources, and if these operations are required to implement those functions, then perhaps they should be exposed as public functions, on the basis that decomposing the functions we offer into their primitives is probably good design. This might be achieved by bringing into F&O:

  • bin:decode-string and bin:encode-string from the EXPath binary module
  • file:read-binary from the EXPath file module
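
For comparison, the octet conversion and inversion from the earlier post can be sketched with the EXPath Binary module's bin:to-octets and bin:not. This assumes a processor that implements the module; the input value is the one used in the earlier post, and the expected octets in the comments are the ones reported there.

```xquery
xquery version "3.1";
declare namespace bin = "http://expath.org/ns/binary";

let $in := xs:base64Binary("YAYBBQEBBQA=")
return (
  (: octets of the value: 96, 6, 1, 5, 1, 1, 5, 0 :)
  bin:to-octets($in),
  (: bitwise complement of every octet: 159, 249, 254, 250, 254, 254, 250, 255 :)
  bin:to-octets(bin:not($in))
)
```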

@Arithmeticus
Contributor

I happily agree. This functionality is key, in my opinion, to an important new QT4 feature, where a file that is mostly text but with "bad characters" can be read as binary, fixed, then cast to a string according to a given encoding.

I wrote a similar set of functions for TAN, converting between octets and UTF-8 ... https://github.com/textalign/TAN-2021/blob/master/functions/numerics/TAN-fn-octets.xsl ... in conjunction with functions supporting conversions across bits, base64Binary, octets: https://github.com/textalign/TAN-2021/blob/master/functions/numerics/TAN-fn-binary.xsl

Should the function have a feature allowing the implementer to detect the encoding and use the one it deems best?
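
A minimal sketch of that "read as binary, fix, then decode" workflow, assuming the EXPath Binary module is available. The octet sequence is made up for illustration, and dropping every 0xFF octet is only a stand-in for whatever repair a real application would need (0xFF never occurs in well-formed UTF-8).

```xquery
xquery version "3.1";
declare namespace bin = "http://expath.org/ns/binary";

(: Illustrative input: "a", a stray 0xFF octet, "b", "c" :)
let $raw   := (97, 255, 98, 99)
(: Repair step: remove the octet that cannot appear in UTF-8 :)
let $fixed := $raw[. ne 255]
(: Decode the repaired octets; yields the string "abc" :)
return bin:decode-string(bin:from-octets($fixed), "UTF-8")
```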

@dnovatchev
Contributor

> This issue isn't asking for all the functionality of the EXPath binary module to be added into F+O, nor was I thinking about this specific function. Rather, it's observing that with things like parse-html() we are implicitly invoking binary-to-string encoding and decoding, and the ability to read binary resources, and if these operations are required to implement those functions, then perhaps they should be exposed as public functions, on the basis that decomposing the functions we offer into their primitives is probably good design. This might be achieved by bringing into F&O:
>
>   • bin:decode-string and bin:encode-string from the EXPath binary module
>   • file:read-binary from the EXPath file module

Agreed.

Let us also specify these functions, which can simplify dealing with binary, hex and base64Binary:

  1. octets-from-bin

  2. octets-from-hex

  3. octets-from-base64Binary

I wouldn't be surprised if it might be possible to combine these three into a single function, using the union of the three types.
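
One way such a combined function might look today is sketched below. It assumes the EXPath Binary module; local:octets is a hypothetical name, and because XQuery 3.1 has no union syntax in sequence types, the sketch accepts xs:anyAtomicType and dispatches with typeswitch instead.

```xquery
xquery version "3.1";
declare namespace bin = "http://expath.org/ns/binary";

(: Hypothetical helper: octets of either binary type :)
declare function local:octets($in as xs:anyAtomicType) as xs:integer* {
  typeswitch ($in)
    case $b as xs:base64Binary return bin:to-octets($b)
    (: bin:to-octets takes xs:base64Binary, so cast hexBinary first :)
    case $h as xs:hexBinary return bin:to-octets(xs:base64Binary($h))
    default return error((), "Expected xs:hexBinary or xs:base64Binary")
};

(: Both calls describe the same two octets, so the result is 96, 6, 96, 6 :)
local:octets(xs:hexBinary("6006")),
local:octets(xs:base64Binary("YAY="))
```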

@ChristianGruen
Contributor

> Rather, it's observing that with things like parse-html() we are implicitly invoking binary-to-string encoding and decoding

For fn:parse-html, please note that it’s more than an implicit conversion: The encoding may be derived from the binary input, and applied to decode the remaining binary stream.

It could be helpful to extend fn:parse-xml to also accept xs:base64Binary and xs:hexBinary, and to derive the input encoding from the binary data. In principle, this applies to all parse functions, but it could be most interesting for XML and HTML input, which (contrary to JSON or CSV) can also have encoding directives embedded in the data.
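
To make the gap concrete, here is a sketch of what a caller has to do today when XML arrives as binary: decode it explicitly with an encoding that is already known, then parse the string. It assumes the EXPath Binary module, and support for encodings other than UTF-8 and UTF-16 (such as ISO-8859-1 here) may vary by implementation. The octets are illustrative.

```xquery
xquery version "3.1";
declare namespace bin = "http://expath.org/ns/binary";

(: An "a" element containing e-acute, encoded as ISO-8859-1: octet 233 is the e-acute :)
let $octets := (60, 97, 62, 233, 60, 47, 97, 62)
let $bytes  := bin:from-octets($octets)
(: The caller supplies the encoding; nothing is derived from the data :)
return parse-xml(bin:decode-string($bytes, "ISO-8859-1"))/a/string()
```

A binary-accepting fn:parse-xml, as suggested above, would instead work out the encoding from the data itself.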

@michaelhkay
Contributor Author

Perhaps we should:

  • mandate (or strongly encourage) implementation of the EXPath bin module in a 4.0 processor,
  • add a read-binary function modelled on file:read-binary
  • make parse-xml and parse-json accept binary input.

@ChristianGruen
Contributor

>   • mandate (or strongly encourage) implementation of the EXPath bin module in a 4.0 processor,

Encouraging it sounds good; I wouldn't mandate it. At least in our case, other non-standard modules are used much more frequently than this specific module.

What will be the primary use cases for binary conversions, apart from parsing and serializing data? We shouldn’t duplicate more and more functions from other modules (in our implementation, we already have two functions for converting binary data, one in the Binary Module, one in our custom and older Conversion Module).

>   • add a read-binary function modelled on file:read-binary

👍

>   • make parse-xml and parse-json accept binary input.

…as well as parse-csv, to be consistent.

@michaelhkay
Contributor Author

> Encouraging it sounds good; I wouldn't mandate it

I think the problem is that users are reluctant to use features that aren't guaranteed to be present in every implementation.

@ChristianGruen
Contributor

> I think the problem is that users are reluctant to use features that aren't guaranteed to be present in every implementation.

As Florent Georges, the maintainer of EXPath, has been almost unreachable for the last few years, chances are high that the public resources will get lost, so I wonder whether there’s a chance to move the Binary, File and possibly other modules into the W3 domain and make them mandatory in a second step? It could additionally give us the chance to revise the specs, and align them with XPath 4.

@cedporter
Contributor

Bringing them into the w3 domain would be ideal. EXPath is a helpful set of tools, but knowing any conformant 4.0 processor will have a given function is, as Michael said, what gives users the confidence to rely on a given feature.

@rhdunn
Contributor

rhdunn commented Apr 4, 2024

I made parse-html support binary as the HTML spec has rules for detecting the encoding and decoding a binary data stream. It therefore makes sense to support that in the API.

Additionally, the EXPath extensions support reading binary data even if it is not present in the core XQT specs. Vendors can also have their own binary extensions.

ndw added the PRG-easy and PRG-required labels on Jun 4, 2024