Enum constraint with arrays in table schema #549

Closed
jungshadow opened this issue Dec 8, 2017 · 18 comments

Comments

@jungshadow

I'm using an array field with an enum constraint that consists of an array of strings:

...
{
    "name": "ApplicationRequestStatusType",
    "type": "array",
    "title": "Application Request Status",
    "description": "Specifies the current status of the application. …",
    "constraints": {
        "required": true,
        "enum": [
            "duplicate",
            "invalid",
            "missing-ssn",
            "missing-state-id-number",
            "pending",
            "valid"
        ]
    }
}
...

I'm trying to save a data package using Package, but I'm running into validation errors. The error being thrown is: Field "ApplicationRequestStatusType" can't cast value "duplicate" for type "array" with format "default".

From my Gitter conversation with @roll:

@jungshadow It's interesting that the specs say the enum constraint is applicable to an array field type. But as an implementer I'm confused here. As it stands it should work, but not the way you expect: because it uses a general approach (as for other types), every enum item should be an array. And for users I think this behaviour doesn't really make sense if it's not clarified in the specs. @rufuspollock should the specs specify a special approach for treating an enum constraint for arrays/objects? Or should it just be a different constraint, like constraints.itemEnum?
It's also related to the typed arrays discussion
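A minimal sketch of the "general approach" described in the quote, written in plain Python rather than taken from the tableschema library: every enum item is cast with the field's own type before comparison, so a bare string cannot be cast for an `array` field. The helper names `cast_array` and `check_enum` are hypothetical.

```python
import json

def cast_array(value):
    """Cast a raw value to a list, mimicking an `array` field (hypothetical helper)."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            parsed = None
        if isinstance(parsed, list):
            return parsed
    raise ValueError(f"can't cast value {value!r} for type 'array'")

def check_enum(value, enum_items):
    """Cast every enum item with the field's type, then require an exact match."""
    allowed = [cast_array(item) for item in enum_items]
    return cast_array(value) in allowed
```

Under these assumptions, `check_enum(["duplicate"], ["duplicate", "invalid"])` raises `ValueError` while casting the enum items, whereas `check_enum(["duplicate"], [["duplicate"], ["invalid"]])` returns `True`.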

I believe I'm implementing this correctly and it looks like both types of enum constraints (e.g. array of strings and array of arrays) are supported based on the table-schema.json file:

              "constraints": {
                "title": "Constraints",
                "description": "The following constraints apply for `array` fields.",
                "type": "object",
                "properties": {
                  ...
                  },
                  "unique": {
                    ...
                  },
                  "enum": {
                    "oneOf": [
                      {
                        "type": "array",
                        "minItems": 1,
                        "uniqueItems": true,
                        "items": {
                          "type": "string"
                        }
                      },
                      {
                        "type": "array",
                        "minItems": 1,
                        "uniqueItems": true,
                        "items": {
                          "type": "array"
                        }
                      }
                    ]
                  },
                  ...
                }
              },

Is my assessment accurate that my previous snippet is technically correct? My use case is to constrain an array (preferably a set, in this case) to a list of potential values (the strings in the example above). Thanks!
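To make the two `oneOf` branches in the profile excerpt concrete, here is a small standard-library sketch that reports which branch an `enum` value would satisfy. It only mirrors the keywords visible in the excerpt (`type`, `minItems`, `uniqueItems`, `items.type`); the function name `enum_branch` is hypothetical and this is not how the library validates descriptors.

```python
import json

def enum_branch(enum):
    """Return which `oneOf` branch an enum matches: 'strings', 'arrays', or None."""
    # "type": "array" and "minItems": 1
    if not isinstance(enum, list) or len(enum) < 1:
        return None
    # "uniqueItems": true (items compared by their JSON serialization)
    if len({json.dumps(item) for item in enum}) != len(enum):
        return None
    if all(isinstance(item, str) for item in enum):
        return "strings"
    if all(isinstance(item, list) for item in enum):
        return "arrays"
    return None

print(enum_branch(["duplicate", "invalid"]))      # strings
print(enum_branch([["duplicate"], ["invalid"]]))  # arrays
```

Passing this structural check only says the descriptor has an accepted shape. It does not by itself guarantee that each enum item can be cast to the field's type, which is where the error quoted above comes from.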

@roll
Member

roll commented Dec 10, 2017

@jungshadow
So the specs say:

enum - The value of the field must exactly match a value in the enum array

Based on this wording, for the array field type the enum constraint should look like:

{
    "name": "ApplicationRequestStatusType",
    "type": "array",
    "title": "Application Request Status",
    "description": "Specifies the current status of the application. …",
    "constraints": {
        "required": true,
        "enum": [
            ["duplicate"],
            ["invalid"],
            ["missing-ssn"],
            ["missing-state-id-number"],
            ["pending"],
            ["valid"]
        ]
    }
}

And valid data values will be:

["duplicate"]
["invalid"]
["missing-ssn"]
["missing-state-id-number"]
["pending"]
["valid"]
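The exact-match reading can be illustrated with a few lines of plain Python. This is only a sketch of the semantics, not library code, and `matches_enum` is a hypothetical name: the whole array value has to equal one enum entry, so multi-item arrays are rejected.

```python
ENUM = [
    ["duplicate"], ["invalid"], ["missing-ssn"],
    ["missing-state-id-number"], ["pending"], ["valid"],
]

def matches_enum(value, enum=ENUM):
    """Exact-match semantics: the whole array must equal one enum entry."""
    return value in enum

print(matches_enum(["pending"]))                 # True
print(matches_enum(["missing-ssn", "pending"]))  # False
```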

Could you share your data package to better understand the use case?

@jungshadow
Author

jungshadow commented Dec 10, 2017

Thanks for the update, @roll!

So the specs say:

enum - The value of the field must exactly match a value in the enum array

I admit, though, that the structure is a bit confusing. First, does the file I mentioned above, table-schema.json in tableschema/profiles/, contradict the spec (assuming I'm reading it correctly)? Second, I can see having the strings wrapped in arrays if you'd like to have specific groups of items as potential values. For example (NB: the following assumes that an application request can be a duplicate, can have a missing SSN and be pending, can be generally invalid, can be missing an SSN and a state identifier, can be pending, and can be valid):

{
    "name": "ApplicationRequestStatusType",
    "type": "array",
    "title": "Application Request Status",
    "description": "Specifies the current status of the application. …",
    "constraints": {
        "required": true,
        "enum": [
            ["duplicate"],
            ["missing-ssn", "pending"],
            ["invalid"],
            ["missing-ssn", "missing-state-id-number"],
            ["pending"],
            ["valid"]
        ]
    }
}

(NB: worth noting that the above scenario isn't indicative of real state policy, just a hypothetical example)

If any combination of the enum items is acceptable, wrapping every string value in an array feels cumbersome, but maybe I'm missing a key component of the thought process.

Could you share your data package to better understand the use case?

Certainly. The datapackage is here and the documentation for the various fields is here.

@roll
Member

roll commented Dec 10, 2017

Based on this description the type of the field could be a string. And if it's a string the initial enum will work.

The value of ApplicationRequestStatusType must be one of the following:

duplicate
invalid
mismatch-voter-signature
missing-ssn
missing-state-id-number
missing-voter-signature
pending
valid
other

@jungshadow
Author

@roll I apologize if I'm not being entirely clear (and it looks like I need to update the description in the documentation, so thanks for pointing that out). I want the ability to capture multiple values from the enum, so the array field type is the most appropriate for my use case, but I don't want to limit to only certain combinations of values. My use case above was only to illustrate how I thought wrapping the enum strings in arrays would work, not that I need it to work as such.

@roll
Member

roll commented Dec 11, 2017

@jungshadow
No worries. I was also thinking that multiple values are probably needed. The problem is that, for now, the specs don't cover this use case. enum is a constraint on the field value, and your use case needs a constraint on each item within the field value. So it could probably be something like constraints.itemEnum.
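A rough sketch of what an item-level check could look like. `itemEnum` is only the name floated in this thread and is not part of the spec, and `check_item_enum` is a hypothetical helper: each item of the array is tested against the allowed list, independently of which combination it appears in.

```python
def check_item_enum(value, item_enum):
    """Return the items of `value` not present in `item_enum` (empty list means valid)."""
    allowed = set(item_enum)
    return [item for item in value if item not in allowed]

ITEM_ENUM = ["duplicate", "invalid", "missing-ssn",
             "missing-state-id-number", "pending", "valid"]

print(check_item_enum(["missing-ssn", "pending"], ITEM_ENUM))  # []
print(check_item_enum(["pending", "unknown"], ITEM_ENUM))      # ['unknown']
```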

@jungshadow
Author

The problem is that, for now, the specs don't cover this use case.

Bummer. As a stop-gap, if I wrap all the enum strings in arrays as in your first example above, will that accomplish what I'm looking to do? As an example, will the following row validate (see the sixth value)?

'da347b903c8dcb62...','47673a7d20346b72...','','2016-03-14', '2016-03-15','[missing-ssn,missing-voter-signature]','online','untracked','','','2016-10-06','untracked','other','2016-09-16','','mail','','2016-11-08','2016 General Election','55-31000','fips','City Of Green Bay','Wisconsin','United States','military','False'

If so, great! Personally, I'm still advocating for the separation of duties I outlined above (i.e. enum array of strings for any combinations of values and an enum array of arrays for particular combinations of values), but I'm happy with a working solution for now. Thanks for all the help, @roll!

@roll
Member

roll commented Dec 12, 2017

@jungshadow
No. It will work only for single values in the ApplicationRequestStatusType field. E.g. for the provided row the enum constraint should also contain [missing-ssn,missing-voter-signature].
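If every combination really had to be listed in the enum, the entries could at least be generated rather than written by hand. The following is a sketch under two assumptions that are not from the spec: the data always lists statuses in sorted order, and repeated statuses are not allowed. It uses the six statuses from the first example, and `all_combinations` is a hypothetical helper.

```python
from itertools import combinations

STATUSES = ["duplicate", "invalid", "missing-ssn",
            "missing-state-id-number", "pending", "valid"]

def all_combinations(values):
    """Every non-empty combination, each in sorted order, as enum entries."""
    ordered = sorted(values)
    return [list(combo)
            for size in range(1, len(ordered) + 1)
            for combo in combinations(ordered, size)]

enum = all_combinations(STATUSES)
print(len(enum))  # 63
```

The count grows as 2^n - 1, so this quickly becomes unwieldy, and a value such as `["pending", "missing-ssn"]` (unsorted) would still fail an exact match.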

@jungshadow
Author

@roll That's a bummer. This project is somewhat of a fact-finding mission. Currently, we're not sure how many or which of the enum values would invalidate a ballot application or a ballot. Is there some way that y'all would consider my use case for inclusion into the spec?

@roll
Member

roll commented Dec 15, 2017

@jungshadow
As a quick fix, off the top of my head, the only idea I have is to store this field as a string and use a pattern constraint like:

name: ApplicationRequestStatusType
type: string
constraints:
  pattern: '^((missing-ssn|missing-voter-signature|...)\|?)+$' # not tested

With data like:

ApplicationRequestStatusType
missing-ssn
missing-ssn|missing-voter-signature
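Since the pattern is marked as untested, here is a quick check with Python's `re` module. The status list below is a three-item subset chosen only for illustration (the elided part of the original pattern is left alone), and the `strict` variant is a suggested alternative, not something proposed in the thread.

```python
import re

# Illustrative subset of statuses; the full list is not reproduced here.
STATUSES = ["missing-ssn", "missing-voter-signature", "pending"]
alternation = "|".join(re.escape(status) for status in STATUSES)

# Same shape as the pattern above: the "|" separator is optional.
loose = re.compile(rf"^(({alternation})\|?)+$")
# Variant that requires a "|" between items and forbids a trailing one.
strict = re.compile(rf"^({alternation})(\|({alternation}))*$")

for value in ["missing-ssn|missing-voter-signature", "missing-ssnpending", "missing-ssn|"]:
    print(value, bool(loose.match(value)), bool(strict.match(value)))
```

With the optional separator, the loose shape also accepts run-together values such as `missing-ssnpending` and a trailing `|`; the strict variant rejects both. Neither pattern prevents the same status from appearing twice.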

Regarding possible spec changes, I'm cc'ing @pwalsh and @rufuspollock here.

@jungshadow
Author

I appreciate the quick fix, @roll! This isn't a knock on the fix and maybe I'm being overly pedantic, but this functionally seems more like an array than a string. Interested in @pwalsh's and @rufuspollock's thoughts on this, too.

@akariv
Member

akariv commented Dec 25, 2017

This is a perfect application of the 'itemType' property I suggested here: #409

So - you'd have an array, and you'll be able to state the inner item type (in this case, a string with an enum constraint).

@jungshadow
Author

@akariv Sounds like that would work perfectly. I noticed @rufuspollock mentioned it may go into 1.1. Is there a rough timeline for it?

Thinking aloud, I'm still wondering about the usefulness of a set type (or constraint) that's distinct from array. The former would allow only distinct objects while the latter would allow repeated objects. My use case would benefit from the former.
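The set-versus-array distinction boils down to a uniqueness check on the items. A minimal sketch, assuming the items are hashable values such as strings; `is_set_like` is a hypothetical name, not an existing constraint:

```python
def is_set_like(items):
    """True when no item repeats (items must be hashable, e.g. strings)."""
    return len(set(items)) == len(items)

print(is_set_like(["missing-ssn", "pending"]))  # True
print(is_set_like(["pending", "pending"]))      # False
```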

@jungshadow
Author

Hi @roll, @rufuspollock, and @pwalsh! Wanted to quickly bump this thread. Any ideas on this or a possible timeline for #409, which may similarly work for my purposes? Thanks!

@rufuspollock
Contributor

@jungshadow right now we don't have an ETA on new features for 1.1 but my guess is H2 this year (the real constraint is resourcing work on this). What really helps us is:

  • A PR for a specific pattern for a given approach (which can then become part of spec later)
  • A PR (or forked repo) implementing said feature in key libraries so people can use it

@jungshadow
Author

@rufuspollock I completely understand and I'd be happy to do either of the above, but it would be helpful to have some idea on which implementation to focus. Would it be more helpful to keep it narrowly focused on my use case or should I concentrate on the idea in #409?

@jungshadow
Author

Considering the discussion in frictionlessdata/tableschema-js#152, I wanted to quickly bump this, again. Is #409 the recommended approach or should we carve out a case for this particular issue? /cc @roll @rufuspollock

@roshcagra

Any update on adding this?

@rufuspollock
Contributor

DUPLICATE / MERGING. Closing in favour of #409 since doing that appropriately would resolve this.

5 participants