Ordered Maps: maps that retain insertion order #1651
There are many ways in which maps are used in practice, but I think we can define two main categories of use:
For 1., ordering is helpful and sometimes essential. For 2., it does not matter. I would like to warm up my idea to retain the order in some cases, but to drop it (by default) when updates are performed. In particular, I think we should:
If we do so, it will be up to processors to provide support for an implementation of an ordered map that supports updates, or to use a simple map that transfers the entries to an unordered map once updates are performed. Finally, it is important to remember that in many cases maps are never updated after the creation – which means there is no need to store such entries in an immutable implementation. |
During discussion, @dnovatchev raised the question of defining ordered maps as a subtype of maps rather than as a property associated with the map instance. I'd like to consider the advantages and disadvantages of both approaches. The most obvious benefit of using a subtype is that it enables functions to declare that they require an ordered map as the supplied argument, and to fail with a type error if supplied with an unordered map. Is that a feature we think we need? At the moment we aren't proposing any system functions or operators that require a map to be ordered. Using a subtype would also obviate the need for an interrogative function to test whether a map is ordered, since this could be done using the "instance of" mechanism. The main disadvantage of using subtyping is that it requires new syntax and semantics for a new item type, and new subtyping rules, and therefore more tests. There's potential complexity in deciding whether record types are always ordered, always unordered, or whether they can be either. Adding a property to maps, as currently proposed, is a lot simpler. |
I'm inclined to favor subtyping for ordered maps, based on experience with other PLs. Yes, it will require work to set up in the specs, but once that happens, we have the means to introduce other map subtypes. (In fact, I think once that mechanism is in place, the CG will be more favorable to allowing sorted maps than they currently are.) Obviously (or maybe not), the map functions should remain in a single namespace. That is, functions that would work only on ordered maps would remain in the |
Absolutely not. A record is the closest thing we have to an interface, the order of attributes/properties is not significant - the same way the order of keys in a map is not significant. We have had records for more than a year and this is the first and only time someone is saying the order of entries in a record is significant. |
@Arithmeticus wrote:
I think it's dangerous to apply that analogy too closely. The model used in Java and C# is to have a single interface for all maps, supported by different implementation classes. Our types in XDM are much more like interfaces: they define the set of operations available, not the implementation characteristics. This gives us many advantages that we don't want to lose, for example the ability to change the implementation characteristics of an existing value dynamically, behind the scenes, based on observed usage. |
@dnovatchev wrote:
There is ample user feedback that ordering is important for records - not to make it information-bearing, but for human comprehension. It's primarily important for the same kind of reason that indentation is important: a great strength of XML and JSON over formats such as ASN.1 and protocol buffers is that it is human readable. This is the reason that a lot of JSON-oriented software has moved in the direction of preserving order. Incidentally, users also frequently ask this for XML attributes. It's not that they want to attach meaning to the order of attributes, it's that predictable order improves readability. If that's true for XML attributes where things are usually all on one line, it's vastly more true for JSON structures where rearranging the fields of a record in the serialized output might move one of the fields 10,000 lines away. Consistent ordering makes it easier to compare two structures, both for humans and for machines. This whole exercise to define ordering for maps arose, if you recall, in response to feedback from users trying out early implementations. |
It is a well-known antipattern to have classes (records) with too many members (keys). Users who have this problem are violating this principle and now we are trying to support them in this??? |
You don't need many fields to encounter this problem. Consider a record representing a telephone bill with five fields: (name, address, phone number, itemised charges, call details). The last two fields might contain large amounts of data. If this is displayed in the order (address, call details, phone number, itemised charges, name), then it's going to be very difficult to find the name and phone number when looking at the data visually. The user who designed this structure hasn't violated any design principles. We've been experiencing this problem for years with some of our own data, for example SEF files, and the output of the Java parser that we use in one of our toolchains. It's certainly a problem our users have reported as well, and yes, it would be nice to help them solve it. For another example, consider if we held the function catalog in JSON rather than XML, and every time we edited a function, the order of fields changed so that the examples might appear randomly before or after the function signature. Editing the catalog would become far more difficult. |
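To make the telephone-bill example concrete, here is a minimal Python sketch. It is not XDM; Python is used only because its dictionaries already preserve insertion order. The field values are invented for illustration.

```python
import json

# Hypothetical telephone-bill record; the field names follow the example
# above, the values are made up.
bill = {
    "name": "A. Customer",
    "address": "1 Example Street",
    "phone number": "555-0100",
    "itemised charges": [{"item": "line rental", "amount": 12.5}],
    "call details": [{"to": "555-0199", "seconds": 64}],
}

# Python dicts keep insertion order, so serialization emits the fields
# in the order the author wrote them.
as_written = json.dumps(bill, indent=2)

# Parsing keeps the order found in the text as well.
round_tripped = list(json.loads(as_written))

# Forcing a different order (here alphabetical) moves "name" behind the
# potentially huge "call details" and "itemised charges" fields.
alphabetical = list(json.loads(json.dumps(bill, sort_keys=True)))
```

With only five fields, the alphabetical variant already places the two bulky fields ahead of the name, which is the readability problem described above.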
@ChristianGruen wrote:
This is reasonably well aligned with the current PR. There are a couple of differences of detail: (a) should the result of a map constructor be ordered? I have suggested no, but there are arguments for and against. We could introduce a constructor (b) should |
I think it should be. For users who care about order, it would be the obvious behavior. Others will only care if they get worse performance – which will only be the case for large maps (Category 2, as named above). While it’s possible to write down map constructors with millions of entries in XPath/XQuery code, it will certainly be an edge case (and possibly make parsing the query more expensive than building the map). Using
It might confront users with performance considerations that they are not aware of. In particular, it gets confusing for users who read existing code. I think the JSON syntax has benefited from being as simple as possible.
This distinction would seem very sensible to me.
It may depend. For example, a compact and ordered map representation (in terms of memory consumption) could be used at creation time. Such a map could, for example, store all hash values – keys, buckets, etc. – in flat arrays, thus making the storage of explicit references to the next hash entry obsolete. If such a map is updated later on, it could be transformed to an unordered map (based, e.g., on an HAMT). Usually, this needs to be done only once in the lifetime of the map, and the complexity is similar to resize/rehash operations in conventional maps, i.e., an operation that is usually pretty efficient. In other cases, if such an ordered map is never updated, it could possibly consume less memory than an updatable and unordered map. I had a look at the query But more generally, it would give implementers more freedom on how to proceed. It should be perfectly fine to only have one map implementation, and retain the order (or some defined order) in a map if updates are performed. |
Using a structure that doesn't require persistent update is one approach to improving things; the other is to avoid some of the overheads caused by type annotations and the equality matching semantics. We're currently using a map implementation that doesn't allow customized equality/hashcode computation except by wrapping each key value in a wrapper that has the required equals/hashCode methods, so there's a lot of waste. We do a lot better for maps built with parse-json(), because we know all the keys will be xs:string instances and Java has the right equality semantics for that case; we also build a map in that case that isn't updateable without first copying to a different structure. But for map:build we currently construct a map that has overheads to support functionality that you're probably not going to use. There's immense scope for improvements. For example, as I think you're suggesting, one can even initially build a map that doesn't have any "search by key" functionality, and add a hash index on first use. In many JSON transformations, it's likely that many maps will be copied directly from the input to the output without ever being queried or modified in any way. |
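The "add a hash index on first use" idea can be sketched roughly as follows. This is an illustrative Python sketch, not the implementation of any processor; the class name `LazyIndexedMap` and its methods are hypothetical, and it assumes that all keys are distinct (as they would be for valid JSON input).

```python
class LazyIndexedMap:
    """Entries are kept in arrival order; the hash index is built on first lookup."""

    def __init__(self, pairs):
        # Assumption: keys are distinct, so no duplicate handling is needed.
        self._entries = list(pairs)
        self._index = None  # built lazily, only if someone looks up a key

    def entries(self):
        # Copying input to output never touches the index.
        return iter(self._entries)

    def get(self, key, default=None):
        if self._index is None:
            # First lookup: pay the indexing cost once.
            self._index = {k: i for i, (k, _) in enumerate(self._entries)}
        pos = self._index.get(key)
        return default if pos is None else self._entries[pos][1]

    @property
    def indexed(self):
        return self._index is not None
```

A map that is only copied from input to output never builds the index at all; the cost is deferred until the first `get`.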
I had an implementation in mind that stores all data in flat arrays. We use such sets/maps for all kinds of things in BaseX: One array contains all entries in the order in which they have initially been added; the other arrays organize hash table buckets and next-pointers for entries that cause collisions. Our corresponding set implementation for byte arrays consumes 25% less memory than Java’s HashSet. Such a map could also be enhanced to be immutable/persistent. Appends would be cheap, but value replacements and deletions may become expensive. It’s an interesting idea to only generate the hash index if it’s required. It could work if we know in advance that all keys that are to be added are distinct (which, as you wrote, would be the case for valid JSON input). |
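A rough Python sketch of such a flat-array layout is shown below. It is not the BaseX code: the class `CompactMap` and its field names are hypothetical, the structure is mutable rather than persistent, and deletions are left out. It only illustrates how insertion order falls out of the layout as a side effect.

```python
class CompactMap:
    """Hash map stored in flat lists; insertion order comes for free."""

    def __init__(self, capacity=8):
        self._keys = []                  # keys in insertion order
        self._values = []                # values, parallel to _keys
        self._next = []                  # next entry in the same bucket, or -1
        self._buckets = [-1] * capacity  # index of the head entry per bucket

    def _bucket(self, key):
        return hash(key) % len(self._buckets)

    def _find(self, key):
        # Walk the collision chain of the key's bucket.
        i = self._buckets[self._bucket(key)]
        while i != -1:
            if self._keys[i] == key:
                return i
            i = self._next[i]
        return -1

    def put(self, key, value):
        i = self._find(key)
        if i != -1:
            self._values[i] = value      # existing key keeps its position
            return
        if len(self._keys) >= len(self._buckets):
            self._rehash(2 * len(self._buckets))
        b = self._bucket(key)
        self._keys.append(key)
        self._values.append(value)
        self._next.append(self._buckets[b])   # chain to the previous head
        self._buckets[b] = len(self._keys) - 1

    def _rehash(self, capacity):
        # Only the bucket and next arrays change; entry order is untouched.
        self._buckets = [-1] * capacity
        for i, key in enumerate(self._keys):
            b = self._bucket(key)
            self._next[i] = self._buckets[b]
            self._buckets[b] = i

    def get(self, key, default=None):
        i = self._find(key)
        return default if i == -1 else self._values[i]

    def items(self):
        return list(zip(self._keys, self._values))
```

Appends only touch the ends of the arrays, which matches the observation that appends are cheap; supporting deletion would need tombstones or compaction, which is where the cost mentioned above would show up.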
This is a presentation problem. It has been solved successfully in different IDEs and tools, and it is not our job in this CG to provide any such IDE. Here are a few examples: Example1. A person's medical problems. This is an array of "problems", and each "problem" has/contains "medications" and "labs", and these contain other complex data. A specific instance:
If the reader wants to find the "labs" for the "Diabetes" problem easily, it would be really difficult and time-consuming. And this is where a good IDE / tool comes in handy: With a single click the user collapses the huge "medications" object on line 6 and gets a perfect view of exactly what he wants to see - the "labs" object. Another tool/IDE - the Notepad++ editor - gives us exactly the same nice user experience: And yet another - the Visual Studio IDE - gives exactly the same desired user experience: Example2. Our own documentation - it is overwhelming just to navigate the TOC and reach the wanted sections. But well-designed HTML gives us this possibility: Here the reader wants to ignore everything except "Functions and Operators on Boolean Values". He just clicks on the … character of Chapter 7 - and voilà - he gets exactly the wanted information. What is common in these examples:
Conclusion: Let's just do our job and leave to the most qualified people the IDE/tools/presentation. |
…works fine: |
@dnovatchev wrote:
I disagree. As I said before, one of the major benefits of XML and JSON is that they are human-readable; you don't need any software (like an IDE) sitting between you and the data; all you need is a text editor or a printer. We get lots of feedback from users indicating that they value that principle, including the feedback that led to this issue being raised in the first place. It would be a totally inappropriate response to those users to say "it's not a problem, just get yourself an IDE". It's the job of this CG to ensure that users can continue to enjoy having human-readable data without needing an IDE to act as a viewer. |
This, unfortunately, is just wishful thinking. We don't need to spend much time looking at the XML documents or JSON files that already exist, and that many users need to access and read at least in part, to see that users do have problems with understanding and assimilating those files. This is actually the main use-case for the IDEs and tools that solve this problem for the users. The IDEs and tools do not address a non-existent problem; on the contrary, their functionality was necessary, and asked for by a significant number of real users, to alleviate/solve this existing problem. I absolutely wish that we could solve the problem by providing "ordered maps", but unfortunately, this is an illusion. The problem is that:
Why? As we all know, and as demonstrated in my previous comment, there are a multitude of such IDEs and editors, and probably each of us is already using one - be it a more specialized one, like Oxygen or the BaseX UI, or a more general developer's tool, such as Visual Studio or Visual Studio Code, or just a regular, general text editor such as Notepad++ - each of these currently used by millions of people. Why should anyone have to wait for years, trusting us that we would provide the "ultimate solution", such as "ordered maps"... when they can simply continue using their favorite editor, or install one in a matter of seconds? Let us do our job as language and library designers, which hopefully we are good at, and leave presentation, UI, etc. to more "artistic persons" who are best at this. We are not jacks of all trades, or else we are doing a disservice to our users. |
There's a misunderstanding there. The proposal is that if you say Incidentally, I've spent much of today debugging a stylesheet that's processing the XML syntax tree output by a Java parser. The fact that the syntax tree doesn't respect input order of the Java code (e.g. |
Are you saying that it will be required that within an "ordered map" any deeper-level maps will be required to be "ordered maps"? If not, then an "ordered map" clearly doesn't solve the user's presentation/readability problem. Another problem, impossible to solve, is that it cannot be known in advance what map-keys are most important to what users (and different users generally have different preferences for such importance), thus there isn't a single, best ordering for the keys and values of a map. |
No, there's no such constraint in the data model. But operations that create nested maps, such as
Indeed so. That's an argument for putting it under user control. Interesting question, if you have a map |
I gathered feedback within our group and from users, and the response was unequivocal (partially enthusiastic, I would say): Ordered maps would be very welcome. The tendency was clear that a simple solution would be preferred, in a nutshell: orderedness by default, without any new options and tweaks. Insertion order is preserved since ES6 (2015) and Python 3.7 (2018); it may be time to follow them. My clear favorite now would be to:
Certainly I think there are many cases where any extra costs would not be noticed. The exception is very large maps (where the keys are typically "data" rather than "metadata") and in these cases space saving can be critical. In fact I'd like to find other ways of saving space, such as allowing the keys() function to return any value that is atomic-equal to the original key, rather than requiring the original key to be maintained. |
After having played around a bit with alternative map variants, I am quite optimistic that good solutions can be implemented. It seems (by interpreting the previous replies in this thread) that the existing map implementations in our processors may be suboptimal anyway, and nevertheless they seem to have done their job fairly well in the last years. But I would definitely like to hear the opinion of other implementers, too (such as @benibela, maybe @line-o?). |
Ironically, Python’s compact (mutable) dictionary… https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-compactdict …bears resemblance to the map representation that we have drafted 20 years ago (and I am sure others have done it decades before us). The basic idea is simple: Avoid objects, use dense arrays, get insertion order as a side effect. |
@Arithmeticus Joel, This is an unjustified waste of space. We used linguistic, morphological studies that identified all the ways of generating word forms from a base word. For Bulgarian we used the work of Branimir Krastev, which identified 103 such types of word-form generation. For Russian - the work of Evgeniy Zalizniak - about 700 types of word-form generation. This allowed us to keep in the dictionary file only a part of the base word and the number of the rule according to which its word forms must be generated. No doubt the same approach would be possible for other languages. You will still need a (sorted) dictionary to keep just a part of the word form together with the rule number, but its total size will be many times smaller (bigger savings for highly inflected languages, such as Bulgarian, where verbs may have more than 10 word forms and there are more than 30 different tenses, and smaller for less inflected languages such as English - but even in this case the savings will reduce the memory requirements several times over). I placed "sorted" above in parentheses, because searching in a sorted sequence is considerably slower than finding a key in a hash set - O(log N) vs. O(1). Sorted can come in handy if the data is on a device, but if it is completely in memory, then having it in a hash table would perform faster. |
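The stem-plus-rule-number storage scheme can be illustrated with a tiny Python sketch. The two English rules and the lexicon below are invented for illustration only; they are not taken from the Bulgarian or Russian rule sets mentioned above.

```python
# Hypothetical paradigm rules: rule number -> suffixes appended to the stem.
RULES = {
    1: ["", "s", "ed", "ing"],    # walk, walks, walked, walking
    2: ["e", "es", "ed", "ing"],  # bake, bakes, baked, baking
}

# The dictionary stores only a stem and a rule number per base word,
# instead of every word form. A plain dict gives hash-based lookup.
LEXICON = {
    "walk": ("walk", 1),
    "bake": ("bak", 2),
}

def word_forms(base):
    """Generate all word forms of a base word from its stem and rule."""
    stem, rule = LEXICON[base]
    return [stem + suffix for suffix in RULES[rule]]
```

Each base word costs one short entry regardless of how many forms its rule generates, which is where the space saving comes from.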
@dnovatchev That was just an example. Think about ordered maps whose keys are dateTimes, or decimals, or ..... I envision many cases where ordered keys in a map would be beneficial, which corroborates the enthusiasm @ChristianGruen sees in his customers when he tells them about the proposed feature. |
Well, I am not against having an ordered-map type - we just need to have this as something separate from the well-established map type that we have used and known for many years. |
Why? If a user does not care about order, nothing would need to change for this user. |
Nothing??? Even reading the updated documentation for the well-known and often-used functions would be a scary experience causing confusion and uncertainty, and raising questions. If I have to read through 20 or more function specifications and try to understand what has changed, is this "nothing" ??? Even after I am relatively sure that I understand the changes, I will still have questions about the performance implications, ..., etc. I believe that adding more properties and functionality to a well-established type is an example of bad design. We have the SOLID principles, and in this acronym "I" stands for "Interface Segregation" |
I said “ nothing would need to change”, and I did not refer to the PR; see my comments, specifically #1651 (comment). |
@ChristianGruen This comment clearly shows some of the new complexity that needs to be added, thank you. Compare this to having an "ordered-map" separate from map, in which case really nothing is being added to the meaning and documentation of map. |
If a map would use insertion order by default, nothing would change for users who do not care about order. Others would benefit from it. I cannot see which complexity you refer to; indeed I think it would be the approach that avoids any additional complexity. |
The need for users to read and understand the updated (actually enlarged) documentation? Their possible confusion and frequent questions on a daily basis that someone must provide answers to? The sudden decrease of performance of some of their maybe critical applications - just because a map suddenly became "ordered-map" by default!!!!!. The need to scroll through functions that are specifically designed to operate upon/with "ordered-map"s - which will not arise at all in the case when an "ordered-map" were something different from maps and thus "ordered-map"s functions would have a separate chapter and would not be inserted in the same chapter as "Map Functions". |
I cannot follow you. Our practice has shown that the unorderedness has frequently caused confusion in the past. After all, not every user is a mathematician who has an abstract understanding of maps; many will just type in data and expect that they get the same representation back. Some are not even aware of the fact that the “braced data” they are working with is a map. Maybe you need to specify which user base you have in mind. It seems to differ from the people who have commented on this discussion or on Slack so far.
This could be a challenge, but one that implementers need to tackle (and you talk to two of them in this thread). If you follow the comments in this discussion, you may have observed that orderedness will not necessarily result in a decrease of performance, as the existing implementations can still be improved. The question is how much time is spent by an implementer to tackle these issues. As Michael indicated, it is often actual customer workloads that trigger optimizations. |
PS @Arithmeticus If the input of your map is lexicographically sorted, you could simply write: $lemma
=> map:filter(fn { . >= 'albatross' and . <= 'bonono' }) |
Alternatively
Though it's going to be a lot easier to optimise that kind of query if we know the map is sorted. |
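Why knowing that the map is sorted helps can be sketched in Python. This is an analogy, not XPath: the sample data is invented and the function names `filter_range` and `sorted_range` are hypothetical. Without any ordering guarantee, the range filter has to test every key; with keys known to be sorted, the range can be located by binary search.

```python
from bisect import bisect_left, bisect_right

# Invented sample data; the keys happen to be written in sorted order.
lemma = {"aardvark": 1, "albatross": 2, "bison": 3, "bonobo": 4, "zebra": 5}

def filter_range(m, low, high):
    """No ordering assumption: every key is tested (linear scan)."""
    return {k: v for k, v in m.items() if low <= k <= high}

def sorted_range(sorted_keys, m, low, high):
    """Keys known to be sorted: find both ends by binary search."""
    start = bisect_left(sorted_keys, low)
    end = bisect_right(sorted_keys, high)
    return {k: m[k] for k in sorted_keys[start:end]}

scanned = filter_range(lemma, "albatross", "bonono")
searched = sorted_range(list(lemma), lemma, "albatross", "bonono")
```

Both calls select the same entries; the second only inspects a logarithmic number of keys to find the boundaries.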
I am trying to catch up with the discussion in this thread. Let me reiterate what I have said in the CG meetings: I am in favour of switching to insertion-order for maps by default. The fact that both ES and Python in recent history did switch to insertion-order by default is a firm indicator that this provides a good trade-off in user-expectation vs. performance and memory overhead. I am currently pondering the necessity to destroy the order on
I do fear that new methods for manipulating maps like |
Some more general thoughts on map updates (thinking out loud): If we extend the notion of orderedness to all update operations, it is basically the decision to go one big step further and get rid of unorderedness in general. Perhaps that would be the most consistent way forward. On the other hand, it may trigger various new user requirements in the short term, such as positional insertions and deletes. Our core languages (without XQUF) so far provide no functionality to update XML data (related: #1225). As they were initially designed for processing XML, I wonder whether we should pursue this in more depth at this point in time. Before pushing map updates further, I think we should focus on XML first, and define update operations that can possibly be applied to maps in a second step. Otherwise, people might end up using |
So I think we are starting to converge towards the following:
To all who "wondered" why I said all this will make things more complicated? ==> Read the above and you will know why. To a person who is not interested in ordering at all, this is all noise - which would be better absent from what one is reading, unless one is specifically interested in this topic. Besides all this, I find the above design quite unsatisfactory. With this design, a given map is firmly bound/connected to a given order, and this binding remains forever. It would be preferable if a map had no ordering in itself, and could at any moment be associated with a different ordering, as required by specific and changing needs. Add to this the fact that an "ordered" map is only guaranteed to be ordered on its first level of nesting, without guarantees that any maps that are values at any level are ordered... I don't see why on Earth I would ever wish to use such an (ugly) beast. Please, do not alter the current, pristine map type - leave it as it is. Use an ordering type (an array or a map itself) to associate with a map - in any case when this is needed. You can associate the same map with different orderings at different moments - without the need to create a new map. |
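The alternative suggested here, keeping the map itself unordered and pairing it with a separate ordering when needed, might look roughly like this in Python. It is a sketch only: the helper name `entries_in` and the sample record are hypothetical, and Python's own dict order is deliberately ignored.

```python
# Sample record; its own (insertion) order is not relied upon below.
record = {"address": "1 Example Street", "name": "A. Customer", "phone": "555-0100"}

def entries_in(m, ordering):
    """Present the entries of m in the order given by a separate key list."""
    return [(k, m[k]) for k in ordering if k in m]

# The same map can be paired with different orderings at different times,
# without creating a new map.
for_display = entries_in(record, ["name", "phone", "address"])
for_audit = entries_in(record, sorted(record))
```

The trade-off is that the ordering has to be carried alongside the map by the caller, rather than travelling with the map value.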
One thing I have learnt from this discussion is that I have written quite a lot of Javascript without ever knowing that it maintained the order of maps (objects) by default; I would probably have noticed if the debugger and stringify() displayed the properties of an object in the "wrong" order, but I never noticed that they did what I would consider natural. Yes, it may be noise, but the people who aren't interested in it are unlikely to notice it. |
@michaelhkay I updated my last comment above with additions that may be also important to read and think about. In particular, I said: "Besides all this, I find the above design quite unsatisfactory. With this design, a given map is firmly bound/connected to a given order, and this binding remains forever. It would be preferable if a map had no ordering in itself, and could at any moment be associated with a different ordering, as required by specific and changing needs. Add to this the fact that an "ordered" map is only guaranteed to be ordered on its first level of nesting, without guarantees that any maps that are values at any level are ordered... I don't see why on Earth I would ever wish to use such an (ugly) beast. Please, do not alter the current, pristine map type - leave it as it is. Use an ordering type (an array or a map itself) to associate with a map - in any case when this is needed. You can associate the same map with different orderings at different moments - without the need to create a new map." |
I think this is a good starting point, my only reservation is having functions that don't preserve the order, that would feel unexpected to me as a user. I think I would expect map:remove to preserve the order of non removed entries, and I'd probably expect map:put to update the current entry in its order, or put a new entry "at the end". Is this a problem? |
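The expectation described here (a removal keeps the remaining order, an update keeps the entry's position, a new key goes to the end) can be sketched with non-mutating Python helpers. The names `map_put` and `map_remove` are only loose analogues of map:put and map:remove; this shows one possible behavior, not what the specification defines.

```python
def map_put(m, key, value):
    """Return a new map; an existing key keeps its position, a new key goes last."""
    result = dict(m)      # copying a dict preserves its order
    result[key] = value   # assignment to an existing key does not move it
    return result

def map_remove(m, key):
    """Return a new map without key; the other entries keep their order."""
    return {k: v for k, v in m.items() if k != key}

original = {"a": 1, "b": 2, "c": 3}
updated = map_put(original, "b", 20)
extended = map_put(original, "d", 4)
shrunk = map_remove(original, "b")
```

Note that this sketch treats an update as an in-place replacement of the value, which is only one of the possible semantics.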
There has been some discussion on map updates in this issue (it may take a while to find all relevant comments though, it is a long thread). One formal detail that I can add: If a map entry is updated, it may look like a replacement, but it is actually a re-insertion (this is the case if the keys are regarded as equal, but if their types differ). |
There's not much discussion of it. Looking at the JSON library I use to create ordered JSON, it's quite ugly, with lots of "addAfterSelf" sort of stuff, but I would prefer all functions to result in some sort of deterministic order, even if that isn't especially flexible. I think having an ordered map that can easily lose its order is unintuitive and will be error-prone. If merge creates an ordered map and there is a way to split maps, then the programmer can pretty much insert an entry wherever they want, by splitting the map at the new position. There is something in Scala/Haskell called splitAt for lists, so you'd split your map, and then merge it with the new entry. I think that gives you the control you need without a raft of new functions, in a way that's conceptually consistent with what's already there. |
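The split-then-merge idea can be sketched in Python, where dicts preserve insertion order. The helpers `split_at` and `insert_at` are hypothetical names modelled on splitAt for lists; they are not proposed functions.

```python
def split_at(m, n):
    """Split an ordered map into its first n entries and the rest."""
    items = list(m.items())
    return dict(items[:n]), dict(items[n:])

def insert_at(m, n, key, value):
    """Insert (or move) an entry to position n by splitting and merging."""
    # Drop any existing entry for the key so it cannot appear twice.
    rest = {k: v for k, v in m.items() if k != key}
    head, tail = split_at(rest, n)
    # Merging in this order yields head, then the new entry, then tail.
    return {**head, key: value, **tail}

m = {"a": 1, "b": 2, "d": 4}
with_c = insert_at(m, 2, "c", 3)
```

Positional insertion then needs only a split and an order-preserving merge, rather than a family of "add after" operations.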
True, this could be an option. Currently, this can already be achieved with a mixture of |
The latest request for ordered JSON data on our mailing list (thanks @martin-honnen for the swift reply): |
At the present time, I think XPath maps are much more capable than JSON and have a reflexive "making maps more like JSON won't improve them in general" response.
takes about five times as long as it does with the If there's a discussion of how map ordering interacts with "you've created multiple entries with the same key" above, I've missed it. It doesn't seem like the order in which entries with the same key arrived would be irrelevant in an ordered map. For individual maps, I think there are cases for "I always care about order" and cases for "I never care about order"; there may be a case for "caring about order is itself a mistake". I don't think there are cases for "In this specific map, I care about order sometimes." This generally makes me think that conceptually, the specialization of ordered maps from maps would be preferable to making all maps ordered. |
Indeed, we haven't spent much time talking about how control of ordering interacts with control of duplicates in functions like |
I wonder how you deal with homonyms? If you want all instances of a homonym to point to different base words, then you will need a map that allows duplicate keys. Or, you may have a single key whose value is the sequence of all corresponding base words. This may not be too efficient, and still it is not clear to which base word to bind the key. Very interesting subject matter, indeed. |
I would like to see that change. |
Currently, XDM maps are “unordered”: An implementation is allowed to organize entries in a way that optimizes lookup, not order. The entries do not have a predictable order unless they are explicitly sorted.
There are cases in which it is helpful if the “insertion order” is preserved – i.e., the order in which new map entries are added to a map. While the insertion order is not relevant if a map is exclusively used for lookups, it may be beneficial if the input includes deliberately sorted key/value pairs, such as (often) in JSON data, configurations or key/value sequences.
I created this issue because there was some confusion in #564, and on Slack, about this map flavor and “sorted maps”, which are discussed in issue #564: Sorted maps hold all map entries sorted by the key, using a comparator or (in its basic variant) fn:data#1. PR #1609 attempts to solve both requirements at once.