Ordered Maps: maps that retain insertion order #1651

Closed
ChristianGruen opened this issue Dec 10, 2024 · 55 comments · Fixed by #1703
Labels
Feature (a change that introduces a new feature), PR Pending (a PR has been raised to resolve this issue), XDM (an issue related to the XPath Data Model), XPath (an issue related to XPath), XQFO (an issue related to Functions and Operators)

Comments

@ChristianGruen
Contributor

Currently, XDM maps are “unordered”: An implementation is allowed to organize entries in a way that optimizes lookup, not order. The entries do not have a predictable order unless they are explicitly sorted.

There are cases in which it is helpful if the “insertion order” is preserved – i.e., the order in which new map entries are added to a map. While the insertion order is not relevant if a map is exclusively used for lookups, it may be beneficial if the input includes deliberately sorted key/value pairs, such as (often) in JSON data, configurations or key/value sequences.

I created this issue because there was some confusion, in #564 and on Slack, between this map flavor and “sorted maps”, which are discussed in issue #564: Sorted maps hold all map entries sorted by the key, using a comparator or (in its basic variant) fn:data#1.

PR #1609 attempts to solve both requirements at once.

@ChristianGruen added the XPath, XQFO, XDM and Feature labels on Dec 10, 2024
@ChristianGruen
Contributor Author

There are many ways in which maps are used in practice, but I think we can define two main categories of use:

  1. Records: few entries, with a focus on organizing data (the map constructor map { } is often used for those cases).
  2. Dictionaries: many entries, with a focus on lookup performance (generated with map:merge and map:build).

For 1., ordering is helpful and sometimes essential. For 2., it does not matter.

I would like to revive my idea to retain the order in some cases, but to drop it (by default) when updates are performed. In particular, I think we should:

  1. Make the order mandatory for the map constructor: It is often used to create maps with just a few entries. In addition, its syntax does not allow us to supply arguments that might control the map generation.
  2. Make the order mandatory for records: It is confusing if the contents of a record are returned in a way that differs from the order of a record declaration.
  3. Provide options to map:build and map:merge (and possibly to update operations like map:put, map:remove) to retain the order: map:build(1 to 5, options := { 'ordered': true() }). In all cases, we can expect map:put to retain the order if the value of an existing entry is changed.

If we do so, it will be up to processors to provide support for an implementation of an ordered map that supports updates, or to use a simple map that transfers the entries to an unordered map once updates are performed.

Finally, it is important to remember that in many cases maps are never updated after the creation – which means there is no need to store such entries in an immutable implementation.

@michaelhkay
Contributor

During discussion, @dnovatchev raised the question of defining ordered maps as a subtype of maps rather than as a property associated with the map instance. I'd like to consider the advantages and disadvantages of both approaches.

The most obvious benefit of using a subtype is that it enables functions to declare that they require an ordered map as the supplied argument, and to fail with a type error if supplied with an unordered map. Is that a feature we think we need? At the moment we aren't proposing any system functions or operators that require a map to be ordered.

Using a subtype would also obviate the need for an interrogative function to test whether a map is ordered, since this could be done using the "instance of" mechanism.

The main disadvantage of using subtyping is that it requires new syntax and semantics for a new item type, and new subtyping rules, and therefore more tests. There's potential complexity in deciding whether record types are always ordered, always unordered, or whether they can be either. Adding a property to maps, as currently proposed, is a lot simpler.

@Arithmeticus
Contributor

I'm inclined to favor subtyping for ordered maps, based on experience with other PLs. Yes, it will require work to set up in the specs, but once that happens, we have the means to introduce other map subtypes. (In fact, I think once that mechanism is in place, the CG will be more favorable to allowing sorted maps than they currently are.)

Obviously (or maybe not), the map functions should remain in a single namespace. That is, functions that would work only on ordered maps would remain in the map namespace, but the signature would require an error to be raised if non-ordered maps were used as input instead.

@dnovatchev
Contributor

dnovatchev commented Dec 11, 2024

There are many ways in which maps are used in practice, but I think we can define two main categories of use:

  1. Records: few entries, with a focus on organizing data (the map constructor map { } is often used for those cases).
  2. Dictionaries: many entries, with a focus on lookup performance (generated with map:merge and map:build).

For 1., ordering is helpful and sometimes essential

Absolutely not.

A record is the closest thing we have to an interface; the order of attributes/properties is not significant, in the same way that the order of keys in a map is not significant.

We have had records for more than a year and this is the first and only time someone has said that the order of entries in a record is significant.

@michaelhkay
Contributor

@Arithmeticus wrote:

I'm inclined to favor subtyping for ordered maps, based on experience with other PLs.

I think it's dangerous to apply that analogy too closely. The model used in Java and C# is to have a single interface for all maps, supported by different implementation classes. Our types in XDM are much more like interfaces: they define the set of operations available, not the implementation characteristics. This gives us many advantages that we don't want to lose, for example the ability to change the implementation characteristics of an existing value dynamically, behind the scenes, based on observed usage.

@michaelhkay
Contributor

@dnovatchev wrote:

We have had records for more than a year and this is the first and only time someone has said that the order of entries in a record is significant.

There is ample user feedback that ordering is important for records - not to make it information-bearing, but for human comprehension. It's primarily important for the same kind of reason that indentation is important: a great strength of XML and JSON over formats such as ASN.1 and protocol buffers is that they are human-readable. This is the reason that a lot of JSON-oriented software has moved in the direction of preserving order.

Incidentally, users also frequently ask this for XML attributes. It's not that they want to attach meaning to the order of attributes, it's that predictable order improves readability. If that's true for XML attributes where things are usually all on one line, it's vastly more true for JSON structures where rearranging the fields of a record in the serialized output might move one of the fields 10,000 lines away. Consistent ordering makes it easier to compare two structures, both for humans and for machines.

This whole exercise to define ordering for maps arose, if you recall, in response to feedback from users trying out early implementations.

@dnovatchev
Contributor

dnovatchev commented Dec 12, 2024

There is ample user feedback that ordering is important for records - not to make it information-bearing, but for human comprehension.

It is a well-known antipattern to have classes (records) with too many members (keys).

Users who have this problem are violating this principle and now we are trying to support them in this???

@michaelhkay
Contributor

It is a well-known antipattern to have classes (records) with too many members (keys).

You don't need many fields to encounter this problem. Consider a record representing a telephone bill with five fields: (name, address, phone number, itemised charges, call details). The last two fields might contain large amounts of data. If this is displayed in the order (address, call details, phone number, itemised charges, name), then it's going to be very difficult to find the name and phone number when looking at the data visually.

The user who designed this structure hasn't violated any design principles.

We've been experiencing this problem for years with some of our own data, for example SEF files, and the output of the Java parser that we use in one of our toolchains. It's certainly a problem our users have reported as well, and yes, it would be nice to help them solve it.

For another example, consider if we held the function catalog in JSON rather than XML, and every time we edited a function, the order of fields changed so that the examples might appear randomly before or after the function signature. Editing the catalog would become far more difficult.

@michaelhkay
Contributor

@ChristianGruen wrote:

I would like to revive my idea to retain the order in some cases, but to drop it (by default) when updates are performed. In particular, I think we should:

* Make the order mandatory for the map constructor: It is often used to create maps with just a few entries. In addition, its syntax does not allow us to supply arguments that might control the map generation.
* Make the order mandatory for records: It is confusing if the contents of a record are returned in a way that differs from the order of a record declaration.
* Provide options to map:build and map:merge (and possibly to update operations like map:put, map:remove) to retain the order: map:build(1 to 5, options := { 'ordered': true() }). In all cases, we can expect map:put to retain the order if the value of an existing entry is changed.

This is reasonably well aligned with the current PR. There are a couple of differences of detail:

(a) should the result of a map constructor be ordered? I have suggested no, but there are arguments for and against. We could introduce a constructor ordered-map{ .... } so the user has both options. We could also make map{...} deliver an unordered map, and have the bare brace constructor {...} default to ordered. However, there's some merit in the argument that you only get an ordered map if you ask for it, and the default is always unordered.

(b) should map:put() and map:remove() retain order if the input is an ordered map? I think they probably should. But an alternative would be to have a map:put() that destroys order and a map:append() that retains it. From an implementation perspective, if the data is already held in a data structure that maintains an ordering, then keeping the ordering on put() and remove() operations is easier than dropping it.
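For comparison only (this is Python's built-in dict, not the proposed XDM semantics), here is a small sketch of how one existing insertion-ordered map treats replacement, addition and removal. The record fields and values are made up for illustration.

```python
# Python dicts preserve insertion order (a language guarantee since Python 3.7).
record = {"name": "Ada", "address": "12 Example Street", "phone": "555-0100"}

# Replacing the value of an existing key keeps the key's original position.
record["address"] = "34 Sample Road"
order_after_replace = list(record)

# Adding a new key appends it at the end.
record["hatsize"] = 7
order_after_append = list(record)

# Removing a key and adding it again moves it to the end.
del record["name"]
record["name"] = "Ada"
order_after_readd = list(record)

print(order_after_replace)
print(order_after_append)
print(order_after_readd)
```

In this model a single "put" operation covers both cases: it behaves like a replacement for existing keys and like an append for new ones.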

@ChristianGruen
Contributor Author

(a) should the result of a map constructor be ordered?

I think it should be. For users who care about order, it would be the obvious behavior. Others will only care if they get worse performance – which will only be the case for large maps (Category 2, as named above). While it’s possible to write down map constructors with millions of entries in XPath/XQuery code, it will certainly be an edge case (and possibly make parsing the query more expensive than building the map). Using fn:parse-json and fn:json-doc will probably be better options in practice (even for those, I would prefer to have orderedness as default).

We could introduce a constructor ordered-map{ .... } so the user has both options. We could also make map{...} deliver an unordered map, and have the bare brace constructor {...} default to ordered.

It might confront users with performance considerations that they are not aware of. In particular, it gets confusing for users who read existing code. I think the JSON syntax has benefited from being as simple as possible.

(b) should map:put() and map:remove() retain order if the input is an ordered map? I think they probably should. But an alternative would be to have a map:put() that destroys order and a map:append() that retains it.

This distinction would seem very sensible to me.

From an implementation perspective, if the data is already held in a data structure that maintains an ordering, then keeping the ordering on put() and remove() operations is easier than dropping it.

It may depend. For example, a compact and ordered map representation (in terms of memory consumption) could be used at creation time. Such a map could, for example, store all hash values – keys, buckets, etc. – in flat arrays, thus making the storage of explicit references to the next hash entry obsolete. If such a map is updated later on, it could be transformed to an unordered map (based, e.g., on an HAMT). Usually, this needs to be done only once in the lifetime of the map, and the complexity is similar to resize/rehash operations in conventional maps, i.e., an operation that is usually pretty efficient. In other cases, if such an ordered map is never updated, it could possibly consume less memory than an updatable and unordered map.

I had a look at the query sum(map:build(1 to 1000000)?*). It consumes ~70 MB with BaseX and ~86 MB with Saxon, although it contains only 1 million entries. This is certainly an artificial example, but I think that the memory consumption could generally be lowered in both implementations if we take into consideration that many maps will never be updated once they are created.

But more generally, it would give implementers more freedom on how to proceed. It should be perfectly fine to only have one map implementation, and retain the order (or some defined order) in a map if updates are performed.

@michaelhkay
Contributor

michaelhkay commented Dec 12, 2024

I had a look at the query sum(map:build(1 to 1000000)?*). It consumes ~70 MB with BaseX and ~86 MB with Saxon, although it contains only 1 million entries. This is certainly an artificial example, but I think that the memory consumption could generally be lowered in both implementations if we take into consideration that many maps will never be updated once they are created.

Using a structure that doesn't require persistent update is one approach to improving things; the other is to avoid some of the overheads caused by type annotations and the equality matching semantics. We're currently using a map implementation that doesn't allow customized equality/hashcode computation except by wrapping each key value in a wrapper that has the required equals/hashCode methods, so there's a lot of waste.

We do a lot better for maps built with parse-json(), because we know all the keys will be xs:string instances and Java has the right equality semantics for that case; we also build a map in that case that isn't updateable without first copying to a different structure. But for map:build we currently construct a map that has overheads to support functionality that you're probably not going to use.

There's immense scope for improvements. For example, as I think you're suggesting, one can even initially build a map that doesn't have any "search by key" functionality, and add a hash index on first use. In many JSON transformations, it's likely that many maps will be copied directly from the input to the output without ever being queried or modified in any way.
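A minimal sketch of the "add a hash index on first use" idea, written in Python purely for illustration. The class name `LazyIndexedMap` and its methods are hypothetical and do not describe Saxon or BaseX internals; the sketch assumes that all keys are distinct and that the map is never updated.

```python
class LazyIndexedMap:
    """Insertion-ordered, read-only map whose hash index is built on first lookup.

    Hypothetical illustration: assumes distinct keys and no updates.
    """

    def __init__(self, pairs):
        self._pairs = list(pairs)   # (key, value) entries in insertion order
        self._index = None          # key -> position, created lazily

    def entries(self):
        # Copying or serializing the map never needs the index.
        return iter(self._pairs)

    def get(self, key, default=None):
        if self._index is None:
            # First lookup: build the key -> position index once.
            self._index = {k: i for i, (k, _) in enumerate(self._pairs)}
        pos = self._index.get(key)
        return default if pos is None else self._pairs[pos][1]

    @property
    def indexed(self):
        return self._index is not None


m = LazyIndexedMap([("name", "Ada"), ("phone", "555-0100")])
copied = list(m.entries())   # no index needed so far
phone = m.get("phone")       # triggers index construction
```

A map that is only copied from input to output never pays for the index; the cost is deferred to the first keyed access.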

@ChristianGruen
Contributor Author

There's immense scope for improvements. For example, as I think you're suggesting, one can even initially build a map that doesn't have any "search by key" functionality, and add a hash index on first use. In many JSON transformations, it's likely that many maps will be copied directly from the input to the output without ever being queried or modified in any way.

I had an implementation in mind that stores all data in flat arrays. We use such sets/maps for all kinds of things in BaseX: One array contains all entries in the order in which they have initially been added; the other arrays organize hash table buckets and next-pointers for entries that cause collisions. Our corresponding set implementation for byte arrays consumes 25% less memory than Java’s HashSet. Such a map could also be enhanced to be immutable/persistent. Appends would be cheap, but value replacements and deletions may become expensive.
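The layout described above can be sketched roughly as follows. This is a simplified, mutable Python illustration, not the BaseX code; the class name `CompactOrderedMap` and all of its members are hypothetical, and it supports only appends and value replacements (no deletions, no persistence).

```python
class CompactOrderedMap:
    """Hash map stored in flat lists; insertion order falls out of the layout."""

    def __init__(self, bucket_count=8):
        self.keys = []                       # keys in insertion order
        self.values = []                     # values, parallel to keys
        self.next = []                       # next entry in the same bucket, or -1
        self.buckets = [-1] * bucket_count   # first entry of each bucket, or -1

    def _find(self, key):
        i = self.buckets[hash(key) % len(self.buckets)]
        while i != -1:
            if self.keys[i] == key:
                return i
            i = self.next[i]
        return -1

    def _rehash(self, bucket_count):
        # Only the bucket and next-pointer lists are rebuilt; entries stay put.
        self.buckets = [-1] * bucket_count
        for i, key in enumerate(self.keys):
            b = hash(key) % bucket_count
            self.next[i] = self.buckets[b]
            self.buckets[b] = i

    def put(self, key, value):
        i = self._find(key)
        if i != -1:
            self.values[i] = value           # replacement keeps the position
            return
        if len(self.keys) >= len(self.buckets):
            self._rehash(2 * len(self.buckets))
        b = hash(key) % len(self.buckets)
        self.keys.append(key)                # a new entry is always appended
        self.values.append(value)
        self.next.append(self.buckets[b])    # chain to the previous bucket head
        self.buckets[b] = len(self.keys) - 1

    def get(self, key, default=None):
        i = self._find(key)
        return default if i == -1 else self.values[i]

    def items(self):
        return list(zip(self.keys, self.values))
```

Because entries are only ever appended to the `keys` and `values` lists, iterating over them yields insertion order without any extra bookkeeping.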

It’s an interesting idea to only generate the hash index if it’s required. It could work if we know in advance that all keys that are to be added are distinct (which, as you wrote, would be the case for valid JSON input).

@dnovatchev
Contributor

dnovatchev commented Dec 12, 2024

It is a well-known antipattern to have classes (records) with too many members (keys).

You don't need many fields to encounter this problem. Consider a record representing a telephone bill with five fields: (name, address, phone number, itemised charges, call details). The last two fields might contain large amounts of data. If this is displayed in the order (address, call details, phone number, itemised charges, name), then it's going to be very difficult to find the name and phone number when looking at the data visually.

This is a presentation problem. It has been solved successfully in different IDEs and tools, and it is not our job in this CG to provide any such IDE.

Here are a few examples:

Example 1. A person's medical problems. This is an array of "problems", and each "problem" has/contains "medications" and "labs", and these contain other complex data.

A specific instance:

{
  "problems": [
    {
      "Diabetes": [
        {
          "medications": [
            {
              "medicationsClasses": [
                {
                  "className": [
                    {
                      "associatedDrug": [
                        {
                          "name": "asprin",
                          "dose": "",
                          "strength": "500 mg"
                        }
                      ],
                      "associatedDrug#2": [
                        {
                          "name": "somethingElse",
                          "dose": "",
                          "strength": "500 mg"
                        }
                      ]
                    }
                  ],
                  "className2": [
                    {
                      "associatedDrug": [
                        {
                          "name": "asprin",
                          "dose": "",
                          "strength": "500 mg"
                        }
                      ],
                      "associatedDrug#2": [
                        {
                          "name": "somethingElse",
                          "dose": "",
                          "strength": "500 mg"
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ],
          "labs": [
            {
              "missing_field": "missing_value"
            }
          ]
        }
      ],
      "Asthma": [
        {}
      ]
    }
  ]
}

If the reader wants to easily find the "labs" for the "Diabetes" problem, it would be really difficult and time-consuming.

And this is where a good IDE / tool comes in handy:

image

With a single click the user compacts the huge "medications" object on line 6 and gets a perfect view of exactly what he wants to see - the "labs" object.

Another tool/IDE - the Notepad++ editor - gives us exactly the same nice user experience:

image

And yet another - the Visual Studio IDE - gives exactly the same nice and wanted user experience:

image

Example 2. Our own documentation - it is overwhelming just to navigate the TOC and reach the wanted sections. But well-designed HTML gives us this possibility:

image

Here the reader wants to ignore everything but to learn about "Functions and Operators on Boolean Values". He just clicks on the … character of Chapter 7 - and voilà - he gets exactly the wanted information.

What is common in these examples:

  1. The data sources are not ordered to satisfy a specific anticipated user viewing preference - the wise authors realized that in the general case it is impossible to correctly anticipate what a reader's information needs would be at different moments - and there are so many different readers, with so many different preferences and information needs - certainly satisfying an individual one will not solve the problems of the rest...
  2. There exists a multitude of tools that solve this problem for the reader. We don't have to take over the job of the IDE/tool developers - which they do much better than anyone else.

Conclusion:
There is a line which we shouldn't cross. We are not IDE developers. We are language and function-library designers.

Let's just do our job and leave to the most qualified people the IDE/tools/presentation.

@ChristianGruen added the PR Pending label on Dec 12, 2024
@ChristianGruen
Contributor Author

I had a look at the query sum(map:build(1 to 1000000)?*). It consumes ~70 MB with BaseX and ~86 MB with Saxon, although it contains only 1 million entries. This is certainly an artificial example, but I think that the memory consumption could generally be lowered in both implementations if we take into consideration that many maps will never be updated once they are created.

…works fine: sum(map:build(1 to 10_000_000)?*) is now processed in 1.2 seconds (before: more than 10 seconds), with 20% less memory consumption. All keys are stored in insertion order. Performance will certainly suffer if updates other than appends are performed (in which case I currently transform the representation to an HAMT). It will excel especially when a map is created only once.

@michaelhkay
Contributor

michaelhkay commented Dec 13, 2024

@dnovatchev wrote:

This is a presentation problem. It has been solved successfully in different IDEs and tools, and it is not our job in this CG to provide any such IDE.

I disagree. As I said before, one of the major benefits of XML and JSON is that they are human-readable; you don't need any software (like an IDE) sitting between you and the data; all you need is a text editor or a printer. We get lots of feedback from users indicating that they value that principle, including the feedback that led to this issue being raised in the first place. It would be a totally inappropriate response to those users to say "it's not a problem, just get yourself an IDE".

It's the job of this CG to ensure that users can continue to enjoy having human-readable data without needing an IDE to act as a viewer.

@dnovatchev
Contributor

dnovatchev commented Dec 13, 2024

@dnovatchev wrote:

This is a presentation problem. It has been solved successfully in different IDEs and tools, and it is not our job in this CG to provide any such IDE.

I disagree. As I said before, one of the major benefits of XML and JSON is that they are human-readable; you don't need any software (like an IDE) sitting between you and the data; all you need is a text editor or a printer. We get lots of feedback from users indicating that they value that principle, including the feedback that led to this issue being raised in the first place. It would be a totally inappropriate response to those users to say "it's not a problem, just get yourself an IDE".

It's the job of this CG to ensure that users can continue to enjoy having human-readable data without needing an IDE to act as a viewer.

This, unfortunately, is just wishful thinking.

We don't need to spend much time looking at the existing XML documents and JSON files that many users need to access, and to read at least in part, to see that users do have problems understanding and assimilating those files.

This is actually the main use case for the IDEs and tools that solve this problem for the users. The IDEs and tools do not address a non-existent problem; on the contrary, their functionality was necessary, and was asked for by a significant number of real users, to alleviate/solve this existing problem.

I absolutely wish that we could solve the problem by providing "ordered maps", but unfortunately, this is an illusion. The problem is that:

  1. Nobody can prevent anyone from writing badly-structured documents - the same way nobody can prevent "writers" from writing completely unreadable "novels" and "stories".
  2. An "ordered map" as proposed, has only a top-level order specified. A deeply-nested "ordered map" has no guarantees whatsoever that its content contains only "ordered maps". It would be practically infeasible to require that all values of an "ordered map" that are maps, must also be "ordered maps" themselves. For example, validating such a requirement would require visiting all values of a map, and all values of the values of the map that are maps themselves, and so on, which in the general case is proportional to the product of the number of keys in all contained maps.
  3. Due to (2) above, a user does not benefit from reading an "ordered map", because it will still be "unordered" deeper down its hierarchy.

It would be a totally inappropriate response to those users to say "it's not a problem, just get yourself an IDE".

Why? As we all know, and as demonstrated in my previous comment, there is a multitude of such IDEs and editors, and probably each of us is already using one - be it a more specialized one, like Oxygen or the BaseX UI, a more general developer's tool, such as Visual Studio or Visual Studio Code, or just a regular, general text editor, such as Notepad++ - each of these currently used by millions of people.

Why should anyone have to wait for years, trusting us that we would provide the "ultimate solution", such as "ordered maps"... when they can simply continue using their favorite editor, or install one in a matter of seconds?

Let us do our job as language and library designers, which hopefully we are good at, and leave presentation, UI, etc. to more "artistic persons" who are best at this. We are not jacks of all trades, or else we are doing a disservice to our users.

@michaelhkay
Contributor

michaelhkay commented Dec 13, 2024

An "ordered map" as proposed, has only a top-level order specified.

There's a misunderstanding there. The proposal is that if you say parse-json(..., {'retain-order':true()}), the maps at every level resulting from the parse will retain the JSON input order. The same will apply, I'm assuming, to the output of elements-to-maps (which is what motivated this proposal).
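As a point of comparison only, Python's standard json module already behaves this way: objects at every nesting level keep the key order of the input text. The sample document below is made up, and the snippet says nothing about how parse-json will be specified.

```python
import json

# Made-up input whose keys are deliberately not in alphabetical order.
text = '{"name": "A", "details": {"z": 1, "a": 2, "m": {"y": 0, "b": 0}}}'

parsed = json.loads(text)

# Key order is retained at the top level and in every nested object.
top = list(parsed)
nested = list(parsed["details"])
deep = list(parsed["details"]["m"])

# Serializing again reproduces the input order; sorting is an explicit opt-in.
round_trip = json.dumps(parsed)
sorted_text = json.dumps(parsed, sort_keys=True)
```

Here `round_trip` is identical to `text`, whereas `sorted_text` places "details" before "name" and reorders the nested keys alphabetically.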

Incidentally, I've spent much of today debugging a stylesheet that's processing the XML syntax tree output by a Java parser. The fact that the syntax tree doesn't respect input order of the Java code (e.g. else branches appear before then branches) has made the job vastly more difficult. There's no way any tooling is going to help with that - once order is lost, no general purpose tool can reinstate it.

@dnovatchev
Contributor

An "ordered map" as proposed, has only a top-level order specified.

There's a misunderstanding there.

Are you saying that it will be required that within an "ordered map" any deeper-level maps will be required to be "ordered maps"?

If not, then an "ordered map" clearly doesn't solve the user's presentation/readability problem.

Another problem, impossible to solve, is that it cannot be known in advance what map-keys are most important to what users (and different users generally have different preferences for such importance), thus there isn't a single, best ordering for the keys and values of a map.

@michaelhkay
Contributor

michaelhkay commented Dec 13, 2024

Are you saying that it will be required that within an "ordered map" any deeper-level maps will be required to be "ordered maps"?

No, there's no such constraint in the data model. But operations that create nested maps, such as parse-json() and xml-to-json(), will typically be defined to make the maps at every level either all ordered or all unordered. Manually constructed maps can do anything they like, for example $map1 => map:put('x', $map2) can add an ordered map to an unordered map, or vice versa.

there isn't a single, best ordering for the keys and values of a map.

Indeed so. That's an argument for putting it under user control.

Interesting question, if you have a map $M with fields (name, address, phone, hatsize), how do you create an ordered map that has the entries in that order? I would suggest map:of-pairs($M?pairs::("name", "address", "phone", "hatsize"), {'retain-order':true()}) but there may be better solutions.
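The same reordering idea can be sketched with Python's insertion-ordered dict, as an analogy only; it is not a statement about map:of-pairs or the proposed options. The field values are invented for the example.

```python
# An existing map whose entries are in some arbitrary order.
m = {"hatsize": 7, "phone": "555-0100", "name": "Ada", "address": "12 Example Street"}

# The order in which the entries should appear.
wanted = ("name", "address", "phone", "hatsize")

# Build a new map by inserting the entries in the wanted order.
ordered = {key: m[key] for key in wanted}

print(list(ordered))
```

The new map compares equal to the old one, since dict equality ignores order; only the iteration order differs.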

@ChristianGruen
Contributor Author

I gathered feedback within our group and from users, and the response was unequivocal (partially enthusiastic, I would say): Ordered maps would be very welcome.

The tendency was clear that a simple solution would be preferred, in a nutshell: orderedness by default, without any new options and tweaks. Insertion order has been preserved since ES6 (2015) and Python 3.7 (2018); it may be time to follow them.

My clear favorite now would be to:

  1. preserve the order in all new maps without introducing further options, and
  2. make order optional if updates are performed. Reasons:
  • Functions like map:put give no information about where a new entry will be added.
  • We would have to introduce new functions like map:append and map:insert-before (in analogy with array:insert-before), but they feel a bit incongruous for maps (would the index need to be positional, a key, anything else?)
  • For creating new ordered maps from existing maps with a specific order, map:filter or map:merge seem better candidates to me, and they already exist.

@michaelhkay
Contributor

orderedness by default

Certainly I think there are many cases where any extra costs would not be noticed. The exception is very large maps (where the keys are typically "data" rather than "metadata") and in these cases space saving can be critical. In fact I'd like to find other ways of saving space, such as allowing the keys() function to return any value that is atomic-equal to the original key, rather than requiring the original key to be maintained.

@ChristianGruen
Contributor Author

The exception is very large maps (where the keys are typically "data" rather than "metadata") and in these cases space saving can be critical.

After having played around a bit with alternative map variants, I am quite optimistic that good solutions can be implemented. It seems (by interpreting the previous replies in this thread) that the existing map implementations in our processors may be suboptimal anyway, and nevertheless they seem to have done their job fairly well in recent years.

But I would definitely like to hear the opinion of other implementers, too (such as @benibela, maybe @line-o?).

@ChristianGruen
Contributor Author

Ironically, Python’s compact (mutable) dictionary…

https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-compactdict

…bears resemblance to the map representation that we drafted 20 years ago (and I am sure others did it decades before us). The basic idea is simple: Avoid objects, use dense arrays, get insertion order as a side effect.
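To make the "dense arrays" idea concrete, here is a minimal sketch in Python of a map that keeps a sparse slot table pointing into a dense, append-only entry list. The class name `CompactMap`, linear probing, and the 2/3 load factor are assumptions chosen for illustration; this is not the representation used by CPython or by any XQuery processor, and deletion is deliberately left out.

```python
class CompactMap:
    """Toy insertion-ordered hash map: sparse slot table + dense entry list."""

    def __init__(self, capacity=8):
        self._slots = [None] * capacity  # sparse: slot -> position in _entries
        self._entries = []               # dense: (hash, key, value) in insertion order

    def _probe(self, key, h):
        # Linear probing: return the slot that holds `key`, or the first free slot.
        size = len(self._slots)
        i = h % size
        while True:
            pos = self._slots[i]
            if pos is None:
                return i
            entry_hash, entry_key, _ = self._entries[pos]
            if entry_hash == h and entry_key == key:
                return i
            i = (i + 1) % size

    def _resize(self):
        # Only the sparse table is rebuilt; the dense list (and its order) is untouched.
        self._slots = [None] * (len(self._slots) * 2)
        for pos, (h, key, _) in enumerate(self._entries):
            self._slots[self._probe(key, h)] = pos

    def put(self, key, value):
        h = hash(key)
        i = self._probe(key, h)
        pos = self._slots[i]
        if pos is not None:
            self._entries[pos] = (h, key, value)  # update in place, position kept
            return
        if (len(self._entries) + 1) * 3 > len(self._slots) * 2:
            self._resize()                        # keep the load factor below 2/3
            i = self._probe(key, h)
        self._slots[i] = len(self._entries)
        self._entries.append((h, key, value))

    def get(self, key, default=None):
        pos = self._slots[self._probe(key, hash(key))]
        return default if pos is None else self._entries[pos][2]

    def keys(self):
        return [key for _, key, _ in self._entries]
```

Iterating over `_entries` yields keys in insertion order without any extra bookkeeping, which is the "side effect" mentioned above.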

@dnovatchev
Contributor

dnovatchev commented Dec 13, 2024

For example, I currently have an enormous amount of lexicographical data, where the key is a word form and the value is a normalized lemma (dictionary entry). This would be ideal in a sorted map, which the CG is not yet favoring, but it could also be beneficial in an ordered map. (In a sense, once we get ordered maps we also have access to sorted maps, at least for those maps that do not need to have members removed or added. You can do the sort first, then build the ordered map based on that sort.)

@Arithmeticus Joel, this is an unjustified waste of space.
In the very distant past I led 2 projects that developed the first spelling checker for Bulgarian, and a similar spelling checker for Russian. At the time even secondary storage was a luxury and the goal of the projects was to be able to represent on a single floppy disk with capacity of 720KB (not all computers at that time had two floppy disk devices) all wordforms for a dictionary of 30 000 - 40 000 most frequent words.

We used linguistic, morphological studies that identified all ways of generation of word-forms from a base-word. For Bulgarian we used the work of Branimir Krastev, that identified 103 such types of wordforms generation. For Russian - the work of Evgeniy Zalizniak - about 700 types of wordforms generation.

This allowed us to keep in the dictionary file only a part of the base word and the number of the rule according to which its wordforms must be generated.

No doubt the same approach would be possible for other languages.

You will still need a (sorted) dictionary to keep just a part of the wordform together with the rule-number, but its total size will be many times smaller (bigger savings for highly inflected languages, such as Bulgarian, where verbs may have more than 10 wordforms and there are more than 30 different tenses, and smaller for less inflected languages such as English - but even in this case the savings will result in decreasing the memory requirements several times).
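The stem-plus-rule-number storage described above can be sketched as follows in Python. The two rules and the tiny lexicon are invented toy English data for illustration only; they are not taken from the Bulgarian or Russian rule sets mentioned in the comment.

```python
# Hypothetical rule table: rule number -> suffixes appended to a stored stem.
RULES = {
    1: ["", "s"],               # regular noun: cat, cats
    2: ["", "s", "ed", "ing"],  # regular verb: walk, walks, walked, walking
}

# Compact lexicon: only (stem, rule number) is stored per base word.
LEXICON = [("cat", 1), ("walk", 2)]


def word_forms(stem, rule):
    """Generate all word forms of a stem from its rule number."""
    return [stem + suffix for suffix in RULES[rule]]


def expand(lexicon):
    """Build the full word-form -> stem map that the compact lexicon replaces."""
    return {form: stem
            for stem, rule in lexicon
            for form in word_forms(stem, rule)}
```

Here two stored pairs stand in for six word forms; the saving grows with the number of forms each rule generates.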

I placed "sorted" above in parentheses, because searching in a sorted sequence is considerably slower than finding a key in a hashset - O(logN) vs. O(1). Sorted can come in handy if the data is on a device, but if it is completely in memory, then having it in a hashtable would perform faster.
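As a rough illustration of the two lookup strategies compared above, the following Python sketch contrasts binary search over a sorted key list, O(log N), with a hash lookup, O(1) on average. The word list and the helper name `in_sorted` are made up for the example.

```python
from bisect import bisect_left

# Hypothetical sample data.
words = sorted(["albatross", "bonobo", "cat", "dog", "emu"])
lemma_by_word = {w: w.upper() for w in words}


def in_sorted(sorted_keys, key):
    """Binary search: O(log N) comparisons on a sorted list."""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


found_by_search = in_sorted(words, "cat")  # O(log N)
found_by_hash = "cat" in lemma_by_word     # O(1) on average
```

Both answer the membership question; only the sorted list can additionally answer "which keys lie between X and Y" without visiting every key.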

@Arithmeticus
Contributor

@dnovatchev
You're under the false impression that my project/data set is like yours. It's not, and it's already efficient, because it's tied to a specific textual corpus. I could have made it more efficient by moving toward a trie model, but that's more work than is needed. What is demanded is a sorted order of keys, because of other downstream requirements that repurpose the key-value sequence, which I won't get into here.

That was just an example. Think about ordered maps whose keys are dateTimes, or decimals, or ..... I envision many cases where ordered keys in a map would be beneficial, which corroborates the enthusiasm @ChristianGruen sees in his customers when he tells them about the proposed feature.

@dnovatchev
Contributor

I envision many cases where ordered keys in a map would be beneficial, which corroborates the enthusiasm @ChristianGruen sees in his customers when he tells them about the proposed feature.

Well, I am not against having an ordered-map type - we just need to have this as something separate from the well-established map type that we have used and known for many years.

@ChristianGruen
Contributor Author

we just need to have this as something separate from the well-established map type that we have used and known for many years.

Why? If a user does not care about order, nothing would need to change for this user.

@dnovatchev
Contributor

we just need to have this as something separate from the well-established map type that we have used and known for many years.

Why? If a user does not care about order, nothing would need to change for this user.

Nothing???

Even reading the updated documentation for the well-known and often-used functions would be a scary experience causing confusion and uncertainty, and raising questions.

If I have to read through 20 or more function specifications and try to understand what has changed, is this "nothing" ???

Even after I will be relatively sure that I understand the changes, I still will have questions about the performance implications, ..., etc.

I believe that adding more properties and functionality to a well-established type is an example of bad design. We have the SOLID principles, and in this acronym "I" stands for "Interface Segregation".


@ChristianGruen
Contributor Author

Nothing???

I said “nothing would need to change”, and I did not refer to the PR; see my comments, specifically #1651 (comment).

@dnovatchev
Contributor

dnovatchev commented Dec 13, 2024

Nothing???

I said “nothing would need to change”, and I did not refer to the PR; see my comments, specifically #1651 (comment).

@ChristianGruen This comment clearly shows some of the new complexity that needs to be added, thank you.

Compare this to having an "ordered-map" separate from map, in which case really nothing is being added to the meaning and documentation of map.

@ChristianGruen
Contributor Author

This comment clearly shows some of the new complexity that needs to be added, thank you.

If a map would use insertion order by default, nothing would change for users who do not care about order. Others would benefit from it. I cannot see which complexity you refer to; indeed I think it would be the approach that avoids any additional complexity.

@dnovatchev
Contributor

This comment clearly shows some of the new complexity that needs to be added, thank you.

If a map would use insertion order by default, nothing would change for users who do not care about order. Others would benefit from it. I cannot see which complexity you refer to; indeed I think it would be the approach that avoids any additional complexity.

The need for users to read and understand the updated (actually enlarged) documentation?

Their possible confusion and frequent questions on a daily basis that someone must provide answers to?

The sudden decrease of performance of some of their maybe critical applications - just because a map suddenly became "ordered-map" by default!!!!!.

The need to scroll through functions that are specifically designed to operate upon/with "ordered-map"s - which will not arise at all in the case when an "ordered-map" were something different from maps and thus "ordered-map"s functions would have a separate chapter and would not be inserted in the same chapter as "Map Functions".

@ChristianGruen
Contributor Author

The need for users to read and understand the updated (actually enlarged) documentation?
Their possible confusion and frequent questions on a daily basis that someone must provide answers to?

I cannot follow you. Our practice has shown that the unorderedness has frequently caused confusion in the past. After all, not every user is a mathematician who has an abstract understanding of maps; many will just type in data and expect that they get the same representation back. Some are not even aware of the fact that the “braced data” they are working with is a map.

Maybe you need to specify which user base you have in mind. It seems to differ from the people who have commented on this discussion or on Slack so far.

The sudden decrease of performance of some of their maybe critical applications - just because a map suddenly became "ordered-map" by default!!!!!.

This could be a challenge, but one that implementers need to tackle (and you talk to two of them in this thread). If you follow the comments in this discussion, you may have observed that orderedness will not necessarily result in a decrease of performance, as the existing implementations can still be improved. The question is how much time is spent by an implementer to tackle these issues. As Michael indicated, it is often actual customer workloads that trigger optimizations.

@ChristianGruen
Contributor Author

PS @Arithmeticus If the input of your map is lexicographically sorted, you could simply write:

$lemma
=> map:filter(fn { . >= 'albatross' and . <= 'bonono' })

@michaelhkay
Contributor

$lemma => map:filter(fn { . >= 'albatross' and . <= 'bonono' })

Alternatively

$lemma?[ ?key >= 'albatross' and ?key <= 'bonono' ]

Though it's going to be a lot easier to optimise that kind of query if we know the map is sorted.
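Why a known sort order helps such a range query can be sketched in Python: a generic filter has to test every key, whereas sorted keys let binary search locate both bounds. The function names and the sample data in the usage note are hypothetical, and this says nothing about how an XQuery processor would actually optimise the expression.

```python
from bisect import bisect_left, bisect_right


def range_scan(mapping, low, high):
    """Works on any map: tests every key, O(N)."""
    return {k: v for k, v in mapping.items() if low <= k <= high}


def range_sorted(sorted_keys, mapping, low, high):
    """Needs sorted keys: O(log N) to find the bounds, then only the matching slice."""
    start = bisect_left(sorted_keys, low)
    stop = bisect_right(sorted_keys, high)
    return {k: mapping[k] for k in sorted_keys[start:stop]}
```

For a map with keys `aardvark`, `albatross`, `bat`, `bonobo`, `cat` and the bounds `'albatross'` and `'bonono'`, both functions select `albatross`, `bat` and `bonobo`.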

@line-o
Contributor

line-o commented Dec 14, 2024

I am trying to catch up with the discussion in this thread. Let me reiterate what I have said in the CG meetings:

I am in favour of switching to insertion-order for maps by default.

The fact that both ES and Python in recent history did switch to insertion-order by default is a firm indicator that this provides a good trade-off in user-expectation vs. performance and memory overhead.
I do follow the discussion around ES language changes closely and have not seen any cries for them to switch back for any reason.

I am currently pondering the necessity to destroy the order on map:put and maybe map:merge. Since you will always end up with a new map regardless of the operation, one could see this as insertion-order being reset at that point.
What I can think of is:

  • updating existing keys: Is this a new insertion or changed in place?
  • adding new keys: These would always be added to the end.
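For comparison, this is how Python's insertion-ordered `dict` answers the two questions above. It is shown as one existing precedent, not as a proposal for `map:put`.

```python
m = {"a": 1, "b": 2, "c": 3}

m["b"] = 20               # updating an existing key: changed in place
after_update = list(m)    # ['a', 'b', 'c']

m["d"] = 4                # adding a new key: appended at the end
after_add = list(m)       # ['a', 'b', 'c', 'd']

del m["a"]
m["a"] = 1                # delete, then re-insert: the key moves to the end
after_reinsert = list(m)  # ['b', 'c', 'd', 'a']
```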

I do fear that new methods for manipulating maps like map:insert-before will ultimately limit implementation options. But this is just a gut feeling at the moment.
I will need some more time tinkering with the ideas presented here.

@ChristianGruen
Contributor Author

I am currently pondering the necessity to destroy the order on map:put and maybe map:merge.

Some more general thoughts on map updates (thinking out loud): If we extend the notion of orderedness to all update operations, it is basically the decision to go one big step further and get rid of unorderedness in general. Perhaps that would be the most consistent way forward. On the other hand, it may trigger various new user requirements in the short term, such as positional insertions and deletes.

Our core languages (without XQUF) so far provide no functionality to update XML data (related: #1225). As they were initially designed for processing XML, I wonder whether we should pursue this in more depth at this point in time. Before pushing map updates further, I think we should focus on XML first, and define update operations that can possibly be applied to maps in a second step. Otherwise, people might end up using fn:elements-to-maps to convert data to a format that can be properly updated to finally convert it back to XML – which is probably something we would not want to envision.

@michaelhkay
Contributor

So I think we are starting to converge towards the following:

  • All maps have an ordering of their entries (exposed by keys(), pairs(), serialization, etc), but some functions and operators leave the order predictable and others leave it unpredictable.
  • There is no ordering property as such. It's not meaningful to ask whether an empty map or a singleton map is ordered or unordered.
  • Functions and operators that create a map "in bulk" create maps with predictable ordering (based on the order of the input) by default. This includes map constructors, record constructors, map:build(), map:merge(), map:of-pairs(), map:filter(), parse-json(), json-doc(), xml-to-json(), elements-to-maps().
  • Where appropriate, some of these functions offer an option {'retain-order':false()} which permits the implementation to create a map that does not retain the original order, allowing potential for improved performance (eg. a memory saving). Since this option makes the resulting order unpredictable, the implementation can ignore this option if it chooses.
  • A map created using map:put or map:remove has unpredictable order. If you want predictable order, use a different function such as map:merge or map:filter to achieve the same effect.
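The last bullet, expressing single-entry updates through bulk operations so that the result has a predictable order, can be sketched in Python as follows. The helper names are hypothetical, and the ordering shown is that of Python's own `dict`, not a statement about what `map:merge` or `map:filter` will guarantee.

```python
def put_ordered(mapping, key, value):
    """A 'put' written as a merge: an existing key keeps its position, a new key goes last."""
    return {**mapping, key: value}


def remove_ordered(mapping, key):
    """A 'remove' written as a filter: surviving entries keep their relative order."""
    return {k: v for k, v in mapping.items() if k != key}
```

Both helpers return a new map and leave the input untouched, which matches the immutable style of XDM maps.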

@dnovatchev
Contributor

dnovatchev commented Dec 14, 2024

So I think we are starting to converge towards the following:

  • All maps have an ordering of their entries (exposed by keys(), pairs(), serialization, etc), but some functions and operators leave the order predictable and others leave it unpredictable.
  • There is no ordering property as such. It's not meaningful to ask whether an empty map or a singleton map is ordered or unordered.
  • Functions and operators that create a map "in bulk" create maps with predictable ordering (based on the order of the input) by default. This includes map constructors, record constructors, map:build(), map:merge(), map:of-pairs(), map:filter(), parse-json(), json-doc(), xml-to-json(), elements-to-maps().
  • Where appropriate, some of these functions offer an option {'retain-order':false()} which permits the implementation to create a map that does not retain the original order, allowing potential for improved performance (eg. a memory saving). Since this option makes the resulting order unpredictable, the implementation can ignore this option if it chooses.
  • A map created using map:put or map:remove has unpredictable order. If you want predictable order, use a different function such as map:merge or map:filter to achieve the same effect.

To all who "wondered" why I said all this will make things more complicated? ==> Read the above and you will know why.

To a person who is not interested in ordering at all, this is all noise - that is better not to exist in what one is reading, unless he is specifically interested in this topic.

Besides all this, I find the above design quite unsatisfactory. With this design, a given map is firmly bound/connected to a given order, and this binding remains forever.

It would be preferable if a map had no ordering in itself, and could at any moment be associated with a different ordering, as needed/required by concrete/specific and changing needs.

Add to this the fact that an "ordered" map is only guaranteed to be ordered on its first level of nesting, and without guarantees that any maps that are values at any level are ordered... I don't see why on Earth I would ever wish to use such an (ugly) beast.

Please, do not alter the current, pristine map type - leave it as it is.

Use an ordering type (an array or a map itself) to associate with a map - in any case when this is needed.

You can associate the same map with different orderings at different moments - without the need to create a new map.
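A minimal Python sketch of this alternative: the map is treated as unordered, and each ordering is a separate list of keys that can be swapped at will. All names and data are illustrative assumptions.

```python
lemma = {"walk": "verb", "cat": "noun", "emu": "noun"}  # treated as unordered

# Two independent orderings over the same map; the map itself is never rebuilt.
by_key = sorted(lemma)                 # ['cat', 'emu', 'walk']
custom_order = ["emu", "walk", "cat"]  # any order a caller needs


def entries_in(order, mapping):
    """Read the map's entries through an external ordering."""
    return [(k, mapping[k]) for k in order]
```

The trade-off is that every external ordering has to be kept in sync by hand whenever keys are added to or removed from the map.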

@michaelhkay
Contributor

To a person who is not interested in ordering at all, this is all noise

One thing I have learnt from this discussion is that I have written quite a lot of Javascript without ever knowing that it maintained the order of maps (objects) by default; I would probably have noticed if the debugger and stringify() displayed the properties of an object in the "wrong" order, but I never noticed that they did what I would consider natural.

Yes, it may be noise, but the people who aren't interested in it are unlikely to notice it.

@dnovatchev
Contributor

dnovatchev commented Dec 14, 2024

@michaelhkay I updated my last comment above with additions that may be also important to read and think about.

In particular, I said:

"Besides all this, I find the above design quite unsatisfactory. With this design, a given map is firmly bound/connected to a given order, and this binding remains forever.

It would be preferable if a map had no ordering in itself, and could at any moment be associated with a different ordering, as needed/required by concrete/specific and changing needs.

Add to this the fact that an "ordered" map is only guaranteed to be ordered on its first level of nesting, and without guarantees that any maps that are values at any level are ordered... I don't see why on Earth I would ever wish to use such an (ugly) beast.

Please, do not alter the current, pristine map type - leave it as it is.

Use an ordering type (an array or a map itself) to associate with a map - in any case when this is needed.

You can associate the same map with different orderings at different moments - without the need to create a new map."

@MarkNicholls

So I think we are starting to converge towards the following:

  • All maps have an ordering of their entries (exposed by keys(), pairs(), serialization, etc), but some functions and operators leave the order predictable and others leave it unpredictable.
  • There is no ordering property as such. It's not meaningful to ask whether an empty map or a singleton map is ordered or unordered.
  • Functions and operators that create a map "in bulk" create maps with predictable ordering (based on the order of the input) by default. This includes map constructors, record constructors, map:build(), map:merge(), map:of-pairs(), map:filter(), parse-json(), json-doc(), xml-to-json(), elements-to-maps().
  • Where appropriate, some of these functions offer an option {'retain-order':false()} which permits the implementation to create a map that does not retain the original order, allowing potential for improved performance (eg. a memory saving). Since this option makes the resulting order unpredictable, the implementation can ignore this option if it chooses.
  • A map created using map:put or map:remove has unpredictable order. If you want predictable order, use a different function such as map:merge or map:filter to achieve the same effect.

I think this is a good starting point. My only reservation is having functions that don't preserve the order; that would feel unexpected to me as a user.

I think I would expect map:remove to preserve the order of non removed entries, and I'd probably expect map:put to update the current entry in its order, or put a new entry "at the end".

Is this a problem?

@ChristianGruen
Contributor Author

I think I would expect map:remove to preserve the order of non removed entries, and I'd probably expect map:put to update the current entry in its order, or put a new entry "at the end".

There has been some discussion on map updates in this issue (it may take a while to find all relevant comments though, it is a long thread). One formal detail that I can add: If a map entry is updated, it may look like a replacement, but it is actually a re-insertion (this is the case if the keys are regarded as equal, but if their types differ).
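The "equal keys of different types" case can be observed in Python, which makes the opposite choice to the re-insertion described above: the value is replaced, but the original key object and its position are kept. This only illustrates the design question; it is not the XDM behaviour.

```python
m = {1: "int one", 2: "two"}
m[1.0] = "float one"            # 1 == 1.0 and hash(1) == hash(1.0): same entry

items = list(m.items())         # [(1, 'float one'), (2, 'two')]
key_type = type(next(iter(m)))  # int: the original key survives the update
```

Under re-insertion semantics the surviving key would be the new one (here the float), which is exactly why the question of where the entry ends up is not trivial.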

@MarkNicholls

there's not much discussion of it.

looking at the JSON library I use to create ordered JSON, it's quite ugly, lots of "addAfterSelf" sort of stuff, but I would prefer all functions to result in some sort of deterministic order after them, even if that isn't especially flexible. I think having an ordered map that can easily lose its order is unintuitive and will be error-prone.

If merge creates an ordered map and there is a way to split maps, then the programmer can pretty much insert an entry wherever they want, by splitting the map at the new position.

There is something in scala/haskell called SplitAt for lists, so you'd split your map, and then merge it with the new entry. I think that gives you the control you need without a raft of new functions in a way that's conceptually consistent with what's already there.
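The split-then-merge idea can be sketched in Python on an insertion-ordered `dict`. The function names `split_at` and `insert_at` are hypothetical, and removing an already existing entry for the key before merging is an assumption about the desired behaviour.

```python
def split_at(mapping, index):
    """Split an ordered map into the first `index` entries and the rest."""
    items = list(mapping.items())
    return dict(items[:index]), dict(items[index:])


def insert_at(mapping, index, key, value):
    """Insert an entry at a position by splitting and re-merging."""
    head, tail = split_at(mapping, index)
    # Assumption: an existing entry for `key` is dropped so it appears only once,
    # at the requested position.
    head.pop(key, None)
    tail.pop(key, None)
    return {**head, key: value, **tail}
```

For example, `insert_at({"a": 1, "b": 2, "c": 3}, 1, "x", 9)` yields the key order `a, x, b, c`.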

@ChristianGruen
Contributor Author

There is something in scala/haskell called SplitAt for lists, so you'd split your map, and then merge it with the new entry. I think that gives you the control you need without a raft of new functions in a way that's conceptually consistent with what's already there.

True, this could be an option. Currently, this can already be achieved with a mixture of map:merge and map:filter function calls, but having a custom function may be more intuitive. As (I believe has been) indicated, the requirement of map updates gets complex pretty soon once we take order seriously. I agree it is an important task, and it is definitely worth looking at, but it could make sense to pursue this in a separate issue, which is why I have just created #1656 as a starting point.

@ChristianGruen
Contributor Author

The latest request for ordered JSON data on our mailing list (thanks @martin-honnen for the swift reply):
https://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg15987.html

@graydon2014

At the present time, I think XPath maps are much more capable than JSON, and I have a reflexive "making maps more like JSON won't improve them in general" response.

map:merge(
    for $x in db:get('data')//@id
      let $value as xs:string := $x/string()
      group by $value
      (: where $x[2] :)
      return map:entry($value,$x)
    )

takes about five times as long as it does with the where clause in place. Having neither version get less performant would be important to me. (In my experience, "Where in a content set are the nodes we've discovered we care about?" is one of the most useful things about maps.)

If there's a discussion of how map ordering interacts with "you've created multiple entries with the same key" above, I've missed it. It doesn't seem like the order in which entries with the same key arrived would be irrelevant in an ordered map.

For individual maps, I think there are cases for "I always care about order" and cases for "I never care about order"; there may be a case for "caring about order is itself a mistake". I don't think there are cases for "In this specific map, I care about order sometimes." This generally makes me think that conceptually, the specialization of ordered maps from maps would be preferable to making all maps ordered.

@michaelhkay
Contributor

If there's a discussion of how map ordering interacts with "you've created multiple entries with the same key" above, I've missed it. It doesn't seem like the order in which entries with the same key arrived would be irrelevant in an ordered map.

Indeed, we haven't spent much time talking about how control of ordering interacts with control of duplicates in functions like map:merge() and map:build(), and the current PR probably needs a few extra words in this area.
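One possible interaction between duplicate handling and ordering can be sketched in Python. The policy names are loosely borrowed from the `duplicates` option of `map:merge`; the rule that an entry stays at the position of its first occurrence is an assumption for illustration, not something the PR specifies.

```python
def merge_pairs(pairs, duplicates="use-first"):
    """Merge (key, value) pairs; each key stays where it first appeared."""
    if duplicates not in ("use-first", "use-last", "combine"):
        raise ValueError(f"unknown policy: {duplicates}")
    result = {}
    for key, value in pairs:
        if duplicates == "combine":
            # Collect all values of a key in arrival order.
            result.setdefault(key, []).append(value)
        elif key not in result or duplicates == "use-last":
            # use-last overwrites the value but does not move the key.
            result[key] = value
    return result
```

With the input `[("b", 1), ("a", 2), ("b", 3)]`, every policy yields the key order `b, a`. An equally defensible rule for `use-last` would move `b` behind `a`, which is the kind of detail the specification would need to spell out.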

@dnovatchev
Contributor

@dnovatchev You're under the false impression that my project/data set is like yours. It's not, and it's already efficient, because it's tied to a specific textual corpus. I could have made it more efficient by moving toward a trie model, but that's more work than is needed. What is demanded is a sorted order of keys, because of other downstream requirements that repurpose the key-value sequence, which I won't get into here.

That was just an example. Think about ordered maps whose keys are dateTimes, or decimals, or ..... I envision many cases where ordered keys in a map would be beneficial, which corroborates the enthusiasm @ChristianGruen sees in his customers when he tells them about the proposed feature.

@Arithmeticus ,

I wonder how you deal with homonyms? If you want all instances of a homonym to point to different base-words, then you will need a map that allows duplicate keys. Or, you may have a single key whose value is the sequence of all corresponding base-words. This may not be too efficient, and still it is not clear to which base-word to bind the key.

Very interesting subject-matter, indeed.

@line-o
Contributor

line-o commented Dec 17, 2024

If we extend the notion of orderedness to all update operations, it is basically the decision to go one big step further and get rid of unorderedness in general. Perhaps that would be the most consistent way forward.

I would like to see that change.

7 participants