From 523940be99a5745e4bad79511fa9ae9cbcd6439f Mon Sep 17 00:00:00 2001 From: roll Date: Tue, 2 Apr 2024 16:56:21 +0100 Subject: [PATCH 01/20] Added "Data Representation" section --- content/docs/specifications/glossary.md | 86 +++++++++++++++++++++++++ 1 file changed, 86 insertions(+) diff --git a/content/docs/specifications/glossary.md b/content/docs/specifications/glossary.md index 1512008a..ce42fb11 100644 --- a/content/docs/specifications/glossary.md +++ b/content/docs/specifications/glossary.md @@ -46,3 +46,89 @@ Example of a relative path that this will work both as a relative path on disk a :::caution[Security] `/` (absolute path) and `../` (relative parent path) are forbidden to avoid security vulnerabilities when implementing data package software. These limitations on resource `path` ensure that resource paths only point to files within the data package directory and its subdirectories. This prevents data package software being exploited by a malicious user to gain unintended access to sensitive information. For example, suppose a data package hosting service stores packages on disk and allows access via an API. A malicious user uploads a data package with a resource path like `/etc/passwd`. The user then requests the data for that resource and the server naively opens `/etc/passwd` and returns that data to the caller. ::: + +### Data Representation + +In order to talk about data representation and processing of tabular data from data sources, it is useful to introduce the concepts of the `physical`, `native`, and `logical` representation of data. + +#### Physical Representation + +The `physical` representation of data refers to the representation of data in any form that is used to store data, for example, in a CSV or JSON serialized file on a disk. Usually, the data stored is some binary format but strictly speaking not limited to it in the context of the Data Package standard. 
+ +For example, here is a hexadecimal representation of a CSV file encoded using "UTF-8" encoding and stored on a disk: + +```text title=table.csv +69 64 7C 6E 61 6D 65 0A 31 7C 61 70 70 6C 65 0A 32 7C 6F 72 61 6E 67 65 +``` + +For reference, here is the same file in textual form: + +```text +id|name +1|apple +2|orange +``` + +#### Native Representation + +The `native` representation of data refers to the representation of data in a form that is produced by a format-specific driver in some computational environment. The Data Package Standard itself does not define any data formats and relies on existing data formats and corresponding drivers at the implementation level. + +Given the Data Resource definition below: + +```json +{ + "path": "table.csv", + "format": "csv", + "dialect": { + "delimiter": "|" + } +} +``` + +The data from the example CSV in `native` representation will be: + +```javascript +{id: "1", name: "apple"} +{id: "2", name: "orange"} +``` + +Note that, after being handled by a CSV reader that takes the dialect information into account, the data has been transformed from a binary form into a data structure. In a real implementation it could be a data stream, a data frame, or another form. + +#### Logical Representation + +The `logical` representation of data refers to the "ideal" representation of the data in terms of the Data Package standard types, data structures, and relations, all as defined by the specifications. We could say that the specifications are about the logical representation of data, as well as about ways in which to handle serialization and deserialization between the `physical` representation of data and the `logical` representation of data. 
+ +Given the Data Resource definition below: + +```json +{ + "path": "table.csv", + "format": "csv", + "dialect": { + "delimiter": "|" + }, + "schema": { + "fields": [ + { "name": "id", "type": "integer" }, + { "name": "name", "type": "string" } + ] + } +} +``` + +The data from the example CSV in `logical` representation will be: + +```javascript +{id: 1, name: "apple"} +{id: 2, name: "orange"} +``` + +Note that, after being handled by a post-processor that takes the schema information into account, the data has been transformed from a partially typed data structure into a fully typed data structure that is compliant with the provided Table Schema. + +:::tip[Data Formats] +The example above uses the CSV format, which has only one native data type, i.e. `string`. Other popular data formats like JSON or Parquet have more native data types that in many cases make data in `native` and `logical` form closer to each other, or, sometimes, even identical. +::: + +:::note[Implementation Note] +Due to the diversity of data formats and computational environments, there is no clear boundary between Table Dialect and Table Schema metadata and their roles in `physical-to-native` and `native-to-logical` transformation. It is recommended to maximize the usage of an available data format driver to get `native` data as close as possible to `logical` data and to do post-processing for all unsupported features. 
+::: From aa2a10b265c75aa1e222f657cfb9fb4093897031 Mon Sep 17 00:00:00 2001 From: roll Date: Tue, 2 Apr 2024 17:04:33 +0100 Subject: [PATCH 02/20] Moved tabular data to definitions --- content/docs/specifications/glossary.md | 30 ++++++++++++++ content/docs/specifications/table-schema.md | 43 ++------------------- 2 files changed, 34 insertions(+), 39 deletions(-) diff --git a/content/docs/specifications/glossary.md b/content/docs/specifications/glossary.md index ce42fb11..db6c9d4b 100644 --- a/content/docs/specifications/glossary.md +++ b/content/docs/specifications/glossary.md @@ -47,6 +47,36 @@ Example of a relative path that this will work both as a relative path on disk a `/` (absolute path) and `../` (relative parent path) are forbidden to avoid security vulnerabilities when implementing data package software. These limitations on resource `path` ensure that resource paths only point to files within the data package directory and its subdirectories. This prevents data package software being exploited by a malicious user to gain unintended access to sensitive information. For example, suppose a data package hosting service stores packages on disk and allows access via an API. A malicious user uploads a data package with a resource path like `/etc/passwd`. The user then requests the data for that resource and the server naively opens `/etc/passwd` and returns that data to the caller. ::: +### Tabular Data + +Tabular data consists of a set of rows. Each row has a set of fields (columns). We usually expect that each row has the same set of fields and thus we can talk about _the_ fields for the table as a whole. + +In case of tables in spreadsheets or CSV files we often interpret the first row as a header row, giving the names of the fields. By contrast, in other situations, e.g. tables in SQL databases, the field names are explicitly designated. 
+ +To illustrate, here's a classic spreadsheet table: + +```text +field field + | | + | | + V V + + A | B | C | D <--- Row (Header) + ------------------------------------ + valA | valB | valC | valD <--- Row + ... +``` + +In JSON, a table would be: + +```json +[ + { "A": value, "B": value, ... }, + { "A": value, "B": value, ... }, + ... +] +``` + ### Data Representation In order to talk about data representation and processing of tabular data from data sources, it is useful to introduce the concepts of the `physical`, `native`, and `logical` representation of data. diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 8bcbc5b2..3f325f96 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -27,47 +27,12 @@ Table Schema is a simple language- and implementation-agnostic way to declare a ## Concepts -### Tabular Data +This specification heavily relies on the following concepts: -Tabular data consists of a set of rows. Each row has a set of fields (columns). We usually expect that each row has the same set of fields and thus we can talk about _the_ fields for the table as a whole. +- [Tabular Data](../glossary/#tabular-data) +- [Data Representation](../glossary/#data-representation) -In case of tables in spreadsheets or CSV files we often interpret the first row as a header row, giving the names of the fields. By contrast, in other situations, e.g. tables in SQL databases, the field names are explicitly designated. - -To illustrate, here's a classic spreadsheet table: - -```text -field field - | | - | | - V V - - A | B | C | D <--- Row (Header) - ------------------------------------ - valA | valB | valC | valD <--- Row - ... -``` - -In JSON, a table would be: - -```json -[ - { "A": value, "B": value, ... }, - { "A": value, "B": value, ... }, - ... 
-] -``` - -### Data Representation - -In order to talk about the representation and processing of tabular data from text-based sources, it is useful to introduce the concepts of the _physical_ and the _logical_ representation of data. - -The _physical representation_ of data refers to the representation of data as text on disk, for example, in a CSV or JSON file. This representation can have some _type_ information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form). - -The _logical representation_ of data refers to the "ideal" representation of the data in terms of primitive types, data structures, and relations, all as defined by the specification. We could say that the specification is about the logical representation of data, as well as about ways in which to handle conversion of a physical representation to a logical one. - -In this document, we'll explicitly refer to either the _physical_ or _logical_ representation in places where it prevents ambiguity for those engaging with the specification, especially implementors. - -For example, `constraints` `SHOULD` be tested on the logical representation of data, whereas a property like `missingValues` applies to the physical representation of the data. +In this document, we will explicitly refer to either the `native` or `logical` representation of data in places where it prevents ambiguity for those engaging with the specification, especially implementors. 
## Descriptor From 478d8ada7d3063442b41ba2b68ec575c31adac22 Mon Sep 17 00:00:00 2001 From: roll Date: Tue, 2 Apr 2024 17:17:09 +0100 Subject: [PATCH 03/20] Replace lexical mentions --- content/docs/specifications/table-schema.md | 24 ++++++++++----------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 3f325f96..2c96c742 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -386,7 +386,7 @@ Supported formats: The field contains numbers of any kind including decimals. -The lexical formatting follows that of decimal in [XMLSchema](https://www.w3.org/TR/xmlschema-2/#decimal): a non-empty finite-length sequence of decimal digits separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: '-1.23', '12678967.543233', '+100000.00', '210'. +If [native representation](../glossary/#native-representation) is a string, formatting follows that of decimal in [XMLSchema](https://www.w3.org/TR/xmlschema-2/#decimal): a non-empty finite-length sequence of decimal digits separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: '-1.23', '12678967.543233', '+100000.00', '210'. 
The following special string values are permitted (case need not be respected): @@ -398,7 +398,7 @@ A number `MAY` also have a trailing: - exponent: this `MUST` consist of an E followed by an optional + or - sign followed by one or more decimal digits (0-9) -This lexical formatting `MAY` be modified using these additional properties: +If [native representation](../glossary/#native-representation) is a string, formatting `MAY` be modified using these additional properties: - **decimalChar**: A string whose value is used to represent a decimal point within the number. The default value is ".". - **groupChar**: A string whose value is used to group digits within the number. This property does not have a default value. A common value is "," e.g. "100,000". @@ -410,7 +410,7 @@ The field contains integers - that is whole numbers. Integer values are indicated in the standard way for any valid integer. -This lexical formatting `MAY` be modified using these additional properties: +If [native representation](../glossary/#native-representation) is a string, formatting `MAY` be modified using these additional properties: - **groupChar**: A string whose value is used to group digits within the integer. This property does not have a default value. A common value is "," e.g. "100,000". - **bareNumber**: a boolean field with a default of `true`. If `true` the physical contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text. @@ -436,14 +436,14 @@ The field contains a valid JSON array. 
### `list` -The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. In the lexical representation, the field `MUST` contain a string with values separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections. +The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. If [native representation](../glossary/#native-representation) is a string,, the field `MUST` contain a string with values separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections. `format`: no options (other than the default). The list field can be customised with these additional properties: -- **delimiter**: specifies the character sequence which separates lexically represented list items. If not present, the default is `,` (comma). -- **itemType**: specifies the list item type in terms of existent Table Schema types. If present, it `MUST` be one of `string`, `integer`, `boolean`, `number`, `datetme`, `date`, and `time`. If not present, the default is `string`. A data consumer `MUST` process list items as it were individual values of the corresponding data type. Note, that on lexical level only default formats are supported, for example, for a list with `itemType` set to `date`, items have to be in default form for dates i.e. `yyyy-mm-dd`. +- **delimiter**: specifies the character sequence which separates, if [native representation](../glossary/#native-representation) is a string. If not present, the default is `,` (comma). +- **itemType**: specifies the list item type in terms of existent Table Schema types. If present, it `MUST` be one of `string`, `integer`, `boolean`, `number`, `datetme`, `date`, and `time`. 
If not present, the default is `string`. A data consumer `MUST` process list items as it were individual values of the corresponding data type. Note, that if [native representation](../glossary/#native-representation) is a string, only default formats are supported, for example, for a list with `itemType` set to `date`, items have to be in default form for dates i.e. `yyyy-mm-dd`. ### `datetime` @@ -451,7 +451,7 @@ The field contains a date with a time. Supported formats: -- **default**: The lexical representation `MUST` be in a form defined by [XML Schema](https://www.w3.org/TR/xmlschema-2/#dateTime) containing required date and time parts, followed by optional milliseconds and timezone parts, for example, `2024-01-26T15:00:00` or `2024-01-26T15:00:00.300-05:00`. +- **default**: If [native representation](../glossary/#native-representation) is a string, `MUST` be in a form defined by [XML Schema](https://www.w3.org/TR/xmlschema-2/#dateTime) containing required date and time parts, followed by optional milliseconds and timezone parts, for example, `2024-01-26T15:00:00` or `2024-01-26T15:00:00.300-05:00`. - **\<PATTERN\>**: values in this field can be parsed according to `<PATTERN>`. `<PATTERN>` `MUST` follow the syntax of [standard Python / C strptime](https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior). Values in the this field `SHOULD` be parsable by Python / C standard `strptime` using `<PATTERN>`. Example for `"format": ""%d/%m/%Y %H:%M:%S"` which would correspond to a date with time like: `12/11/2018 09:15:32`. - **any**: Any parsable representation of the value. The implementing library can attempt to parse the datetime via a range of strategies. An example is `dateutil.parser.parse` from the `python-dateutils` library. It is `NOT RECOMMENDED` to use `any` format as it might cause interoperability issues. @@ -461,7 +461,7 @@ The field contains a date without a time. Supported formats: -- **default**: The lexical representation `MUST` be `yyyy-mm-dd` e.g. 
`2024-01-26` +- **default**: If [native representation](../glossary/#native-representation) is a string, `MUST` be `yyyy-mm-dd` e.g. `2024-01-26` - **\<PATTERN\>**: The same as for `datetime` - **any**: The same as for `datetime` @@ -471,17 +471,17 @@ The field contains a time without a date. Supported formats: -- **default**: The lexical representation `MUST` be `hh:mm:ss` e.g. `15:00:00` +- **default**: If [native representation](../glossary/#native-representation) is a string, `MUST` be `hh:mm:ss` e.g. `15:00:00` - **\<PATTERN\>**: The same as for `datetime` - **any**: The same as for `datetime` ### `year` -A calendar year as per [XMLSchema `gYear`](https://www.w3.org/TR/xmlschema-2/#gYear). Usual lexical representation is `YYYY`. There are no format options. +A calendar year as per [XMLSchema `gYear`](https://www.w3.org/TR/xmlschema-2/#gYear). Usual [native representation](../glossary/#native-representation) as a string is `YYYY`. There are no format options. ### `yearmonth` -A specific month in a specific year as per [XMLSchema `gYearMonth`](https://www.w3.org/TR/xmlschema-2/#gYearMonth). Usual lexical representation is: `YYYY-MM`. There are no format options. +A specific month in a specific year as per [XMLSchema `gYearMonth`](https://www.w3.org/TR/xmlschema-2/#gYearMonth). Usual [native representation](../glossary/#native-representation) as a string is `YYYY-MM`. There are no format options. ### `duration` @@ -489,7 +489,7 @@ A duration of time. We follow the definition of [XML Schema duration datatype](http://www.w3.org/TR/xmlschema-2/#duration) directly and that definition is implicitly inlined here. -To summarize: the lexical representation for duration is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) extended format PnYnMnDTnHnMnS, where nY represents the number of years, nM the number of months, nD the number of days, 'T' is the date/time separator, nH the number of hours, nM the number of minutes and nS the number of seconds. 
The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator `MAY` be omitted if their value is zero, and lower order elements `MAY` also be omitted for reduced precision. +If [native representation](../glossary/#native-representation) is a string, the duration is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) extended format `PnYnMnDTnHnMnS`, where `nY` represents the number of years, `nM` the number of months, `nD` the number of days, `T` is the date/time separator, `nH` the number of hours, `nM` the number of minutes and `nS` the number of seconds. The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator `MAY` be omitted if their value is zero, and lower order elements `MAY` also be omitted for reduced precision. ### `geopoint` From a98c8eb119d199fb2296006859790fea3dbf8fa0 Mon Sep 17 00:00:00 2001 From: roll Date: Tue, 2 Apr 2024 17:23:30 +0100 Subject: [PATCH 04/20] Replace physical mentions --- content/docs/specifications/table-schema.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 2c96c742..8d154c09 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -402,7 +402,7 @@ If [native representation](../glossary/#native-representation) is a string, form - **decimalChar**: A string whose value is used to represent a decimal point within the number. The default value is ".". - **groupChar**: A string whose value is used to group digits within the number. This property does not have a default value. A common value is "," e.g. "100,000". -- **bareNumber**: a boolean field with a default of `true`. If `true` the physical contents of this field `MUST` follow the formatting constraints already set out. 
If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text. +- **bareNumber**: a boolean field with a default of `true`. If `true` the contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text. ### `integer` @@ -413,13 +413,13 @@ Integer values are indicated in the standard way for any valid integer. If [native representation](../glossary/#native-representation) is a string, formatting `MAY` be modified using these additional properties: - **groupChar**: A string whose value is used to group digits within the integer. This property does not have a default value. A common value is "," e.g. "100,000". -- **bareNumber**: a boolean field with a default of `true`. If `true` the physical contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. 
Note that it is entirely up to implementors what, if anything, they do with stripped text. +- **bareNumber**: a boolean field with a default of `true`. If `true` the contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text. ### `boolean` The field contains boolean (true/false) data. -In the physical representations of data where boolean values are represented with strings, the values set in `trueValues` and `falseValues` are to be cast to their logical representation as booleans. `trueValues` and `falseValues` are arrays which can be customised to user need. The default values for these are in the additional properties section below. +If [native representation](../glossary/#native-representation) is a string, the values set in `trueValues` and `falseValues` are to be cast to their logical representation as booleans. `trueValues` and `falseValues` are arrays which can be customised to user need. The default values for these are in the additional properties section below. The boolean field can be customised with these additional properties: @@ -436,7 +436,7 @@ The field contains a valid JSON array. ### `list` -The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. If [native representation](../glossary/#native-representation) is a string,, the field `MUST` contain a string with values separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. 
In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections. +The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. If [native representation](../glossary/#native-representation) is a string, the field `MUST` contain a string with values separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections. `format`: no options (other than the default). @@ -560,7 +560,7 @@ Note, that for the CSV data source the `id` field is interpreted as a string bec The `constraints` property on Table Schema Fields can be used by consumers to list constraints for validating field values. For example, validating the data in a [Tabular Data Resource](https://specs.frictionlessdata.io/tabular-data-package/) against its Table Schema; or as a means to validate data being collected or updated via a data entry interface. -All constraints `MUST` be tested against the logical representation of data, and the physical representation of constraint values `MAY` be primitive types as possible in JSON, or represented as strings that are castable with the `type` and `format` rules of the field. +All constraints `MUST` be tested against the logical representation of data, and the native representation of constraint values `MAY` be primitive types as possible in JSON, or represented as strings that are castable with the `type` and `format` rules of the field. A constraints descriptor `MUST` be a JSON `object` and `MAY` contain one or more of the following properties: @@ -569,7 +569,7 @@ A constraints descriptor `MUST` be a JSON `object` and `MAY` contain one or more - **Type**: boolean - **Fields**: all -Indicates whether this field cannot be `null`. If required is `false` (the default), then `null` is allowed. 
See the section on `missingValues` for how, in the physical representation of the data, strings can represent `null` values. +Indicates whether this field cannot be `null`. If required is `false` (the default), then `null` is allowed. See the section on `missingValues` for how, in the native representation of the data, strings can represent `null` values. ### `unique` From 542c45c068d68171692be8e037d192474b87f4d6 Mon Sep 17 00:00:00 2001 From: roll Date: Tue, 2 Apr 2024 17:54:20 +0100 Subject: [PATCH 05/20] Normalized native representation for types --- content/docs/specifications/table-schema.md | 88 ++++++++++++++++----- 1 file changed, 70 insertions(+), 18 deletions(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 8d154c09..32acc76a 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -382,11 +382,19 @@ Supported formats: - **binary**: A base64 encoded string representing binary data. - **uuid**: A string that is a uuid. +**Native Representation** + +Values `MUST` be represented as strings. + ### `number` The field contains numbers of any kind including decimals. -If [native representation](../glossary/#native-representation) is a string, formatting follows that of decimal in [XMLSchema](https://www.w3.org/TR/xmlschema-2/#decimal): a non-empty finite-length sequence of decimal digits separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: '-1.23', '12678967.543233', '+100000.00', '210'. +**Native Representation** + +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings following the rules below. 
+ +Formatting follows that of decimal in [XMLSchema](https://www.w3.org/TR/xmlschema-2/#decimal): a non-empty finite-length sequence of decimal digits separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: '-1.23', '12678967.543233', '+100000.00', '210'. The following special string values are permitted (case need not be respected): @@ -398,7 +406,7 @@ A number `MAY` also have a trailing: - exponent: this `MUST` consist of an E followed by an optional + or - sign followed by one or more decimal digits (0-9) -If [native representation](../glossary/#native-representation) is a string, formatting `MAY` be modified using these additional properties: +Formatting `MAY` be modified using these additional properties: - **decimalChar**: A string whose value is used to represent a decimal point within the number. The default value is ".". - **groupChar**: A string whose value is used to group digits within the number. This property does not have a default value. A common value is "," e.g. "100,000". @@ -408,9 +416,11 @@ If [native representation](../glossary/#native-representation) is a string, form The field contains integers - that is whole numbers. -Integer values are indicated in the standard way for any valid integer. +**Native Representation** -If [native representation](../glossary/#native-representation) is a string, formatting `MAY` be modified using these additional properties: +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings following the rules below. + +Integer values are indicated in the standard way for any valid integer. Formatting `MAY` be modified using these additional properties: - **groupChar**: A string whose value is used to group digits within the integer. 
This property does not have a default value. A common value is "," e.g. "100,000". - **bareNumber**: a boolean field with a default of `true`. If `true` the contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text. @@ -419,7 +429,11 @@ If [native representation](../glossary/#native-representation) is a string, form The field contains boolean (true/false) data. -If [native representation](../glossary/#native-representation) is a string, the values set in `trueValues` and `falseValues` are to be cast to their logical representation as booleans. `trueValues` and `falseValues` are arrays which can be customised to user need. The default values for these are in the additional properties section below. +**Native Representation** + +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings following the rules below. + +The values set in `trueValues` and `falseValues` are to be cast to their logical representation as booleans. `trueValues` and `falseValues` are arrays which can be customised to user need. The default values for these are in the additional properties section below. The boolean field can be customised with these additional properties: @@ -430,15 +444,27 @@ The boolean field can be customised with these additional properties: The field contains a valid JSON object. +**Native Representation** + +If supported, values `MUST` be natively represented by a data format. 
If not supported, values `MUST` be strings that are valid serialized JSON objects. + ### `array` The field contains a valid JSON array. +**Native Representaiton** + +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be strings that are valid serialized JSON arrays. + ### `list` -The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. If [native representation](../glossary/#native-representation) is a string, the field `MUST` contain a string with values separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections. +The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. + +**Native Representaiton** -`format`: no options (other than the default). +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings following the rules below. + +The field `MUST` contain a string with values separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections. The list field can be customised with these additional properties: @@ -449,9 +475,11 @@ The list field can be customised with these additional properties: The field contains a date with a time. -Supported formats: +**Native Representaiton** -- **default**: If [native representation](../glossary/#native-representation) is a string, `MUST` be in a form defined by [XML Schema](https://www.w3.org/TR/xmlschema-2/#dateTime) containing required date and time parts, followed by optional milliseconds and timezone parts, for example, `2024-01-26T15:00:00` or `2024-01-26T15:00:00.300-05:00`. 
+If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings in one of the following formats: + +- **default**: values `MUST` be in a form defined by [XML Schema](https://www.w3.org/TR/xmlschema-2/#dateTime) containing required date and time parts, followed by optional milliseconds and timezone parts, for example, `2024-01-26T15:00:00` or `2024-01-26T15:00:00.300-05:00`. - **\**: values in this field can be parsed according to ``. `` `MUST` follow the syntax of [standard Python / C strptime](https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior). Values in the this field `SHOULD` be parsable by Python / C standard `strptime` using ``. Example for `"format": ""%d/%m/%Y %H:%M:%S"` which would correspond to a date with time like: `12/11/2018 09:15:32`. - **any**: Any parsable representation of the value. The implementing library can attempt to parse the datetime via a range of strategies. An example is `dateutil.parser.parse` from the `python-dateutils` library. It is `NOT RECOMMENDED` to use `any` format as it might cause interoperability issues. @@ -459,9 +487,11 @@ Supported formats: The field contains a date without a time. -Supported formats: +**Native Representaiton** -- **default**: If [native representation](../glossary/#native-representation) is a string, `MUST` be `yyyy-mm-dd` e.g. `2024-01-26` +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings in one of the following formats: + +- **default**: values `MUST` be `yyyy-mm-dd` e.g. `2024-01-26` - **\**: The same as for `datetime` - **any**: The same as for `datetime` @@ -469,33 +499,49 @@ Supported formats: The field contains a time without a date. -Supported formats: +**Native Representaiton** -- **default**: If [native representation](../glossary/#native-representation) is a string, `MUST` be `hh:mm:ss` e.g. 
`15:00:00` +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings in one of the following formats: + +- **default**: values `MUST` be `hh:mm:ss` e.g. `15:00:00` - **\**: The same as for `datetime` - **any**: The same as for `datetime` ### `year` -A calendar year as per [XMLSchema `gYear`](https://www.w3.org/TR/xmlschema-2/#gYear). Usual [native representation](../glossary/#native-representation) representation as a string is `YYYY`. There are no format options. +A calendar year. + +**Native Representaiton** + +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings as per [XMLSchema `gYear`](https://www.w3.org/TR/xmlschema-2/#gYear). Usual representation as a string is `YYYY`. ### `yearmonth` -A specific month in a specific year as per [XMLSchema `gYearMonth`](https://www.w3.org/TR/xmlschema-2/#gYearMonth). Usual [native representation](../glossary/#native-representation) as a string is `YYYY-MM`. There are no format options. +A specific month in a specific year. + +**Native Representaiton** + +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings as per [XMLSchema `gYearMonth`](https://www.w3.org/TR/xmlschema-2/#gYearMonth). Usual representation as a string is `YYYY-MM`. ### `duration` A duration of time. +**Native Representaiton** + +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings following the rules below. + We follow the definition of [XML Schema duration datatype](http://www.w3.org/TR/xmlschema-2/#duration) directly and that definition is implicitly inlined here. 
-If [native representation](../glossary/#native-representation) is a string, the duration is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) extended format `PnYnMnDTnHnMnS`, where `nY` represents the number of years, `nM` the number of months, `nD` the number of days, `T` is the date/time separator, `nH` the number of hours, `nM` the number of minutes and `nS` the number of seconds. The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator `MAY` be omitted if their value is zero, and lower order elements `MAY` also be omitted for reduced precision. +The duration is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) extended format `PnYnMnDTnHnMnS`, where `nY` represents the number of years, `nM` the number of months, `nD` the number of days, `T` is the date/time separator, `nH` the number of hours, `nM` the number of minutes and `nS` the number of seconds. The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator `MAY` be omitted if their value is zero, and lower order elements `MAY` also be omitted for reduced precision. ### `geopoint` The field contains data describing a geographic point. -Supported formats: +**Native Representaiton** + +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings in one of the following formats: - **default**: A string of the pattern "lon, lat", where each value is a number, and `lon` is the longitude and `lat` is the latitude (note the space is optional after the `,`). E.g. `"90.50, 45.50"`. - **array**: A JSON array, or a string parsable as a JSON array, of exactly two items, where each item is a number, and the first item is `lon` and the second @@ -506,7 +552,9 @@ Supported formats: The field contains a JSON object according to GeoJSON or TopoJSON spec. 
-Supported formats: +**Native Representaiton** + +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings in one of the following formats: - **default**: A geojson object as per the [GeoJSON spec](http://geojson.org/). - **topojson**: A topojson object as per the [TopoJSON spec](https://github.com/topojson/topojson-specification/blob/master/README.md) @@ -556,6 +604,10 @@ While this JSON data file will have logical values as below: Note, that for the CSV data source the `id` field is interpreted as a string because CSV supports only one data type i.e. string, and for the JSON data source the `id` field is interpreted as an integer because JSON supports a numeric data type and the value was declared as an integer. Also, for the Table Schema above a `type` property for each field can be omitted as it is a default field type. +**Native Representaiton** + +Values `MUST` be natively represented by a data format. + ## Field Constraints The `constraints` property on Table Schema Fields can be used by consumers to list constraints for validating field values. For example, validating the data in a [Tabular Data Resource](https://specs.frictionlessdata.io/tabular-data-package/) against its Table Schema; or as a means to validate data being collected or updated via a data entry interface. 
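The rule repeated throughout this patch ("If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings following the rules below") can be illustrated for the `integer` type. This is a minimal sketch, not part of the specification or of any Data Package library. The function name `readInteger` and the choice to return `null` for values that cannot be read are assumptions made for illustration only.

```javascript
// Hypothetical helper: turns a native value of an `integer` field into a logical integer.
// Returning null for unreadable values is an assumption; the patch does not define error handling.
function readInteger(nativeValue, { groupChar, bareNumber = true } = {}) {
  // The data format supports numbers natively (e.g. JSON): use the value as is.
  if (typeof nativeValue === "number") {
    return Number.isInteger(nativeValue) ? nativeValue : null;
  }
  // Otherwise only a string representation is handled.
  if (typeof nativeValue !== "string") return null;

  let text = nativeValue;
  if (!bareNumber) {
    // Strip leading and trailing non-numeric characters, e.g. "€95" or "95%".
    text = text.replace(/^[^\d+-]+/, "").replace(/[^\d]+$/, "");
  }
  if (groupChar) {
    // Remove digit grouping, e.g. "100,000" with groupChar ",".
    text = text.split(groupChar).join("");
  }
  return /^[+-]?\d+$/.test(text) ? Number.parseInt(text, 10) : null;
}

console.log(readInteger(100));
console.log(readInteger("100,000", { groupChar: "," }));
console.log(readInteger("EUR 95", { bareNumber: false }));
```

With the default `bareNumber: true`, a value such as `95%` is rejected by this sketch, because the string must already follow the plain integer formatting.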
From 285aad7293b29451fe5ef764bf6cf30948430979 Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 09:52:44 +0100 Subject: [PATCH 06/20] Minor changes --- content/docs/specifications/glossary.md | 6 +++--- content/docs/specifications/table-schema.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/specifications/glossary.md b/content/docs/specifications/glossary.md index db6c9d4b..02f86fea 100644 --- a/content/docs/specifications/glossary.md +++ b/content/docs/specifications/glossary.md @@ -91,7 +91,7 @@ For example, here is a hexadecimal representation of a CSV file encoded using "U 69 64 7C 6E 61 6D 65 0A 31 7C 61 70 70 6C 65 0A 32 7C 6F 72 61 6E 67 65 ``` -For a reference the same file in textual form: +For a reference, the file contents after being decoded to a textual form: ```text id|name @@ -103,7 +103,7 @@ id|name The `native` representation of data refers to the representation of data in a form that is produced by a format-specific driver in some computational environment. The Data Package Standard itself does not define any data formats and relies on existent data formats and corresponding drivers on the implementations level. -Having the Data Resource definition as below: +Having a Data Resource definition as below: ```json { @@ -128,7 +128,7 @@ Note that handled by a CSV reader that took into account the dialect information The `logical` representation of data refers to the "ideal" representation of the data in terms of the Data Package standard types, data structures, and relations, all as defined by the specifications. We could say that the specifications is about the logical representation of data, as well as about ways in which to handle serialization and deserialization between `physical` representation of data and the `logical` representation of data. 
-Having the Data Resource definition as below: +Having a Data Resource definition as below: ```json { diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 32acc76a..102c201d 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -32,7 +32,7 @@ This specification havily relies on the following concepts: - [Tabular Data](../glossary/#tabular-data) - [Data Representation](../glossary/#data-representation) -In this document, we will explicitly refer to either the `native` or `logical` representation of data in places where it prevents ambiguity for those engaging with the specification, especially implementors. +In this document, we will explicitly refer to either the [Native Representation](../glossary/#native-representation) or [Logical Representation](../glossary/#logical-representation) of data in places where it prevents ambiguity for those engaging with the specification, especially implementors. Note, that this specification does not deal in any way with [Physical Representation](../glossary/#physical-representation) of data. ## Descriptor From d7466ffe90d3a39365d50b621f7c6606f8819136 Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 10:03:17 +0100 Subject: [PATCH 07/20] Updated boolean --- content/docs/specifications/table-schema.md | 12 ++++-------- profiles/dictionary/schema.yaml | 4 ---- 2 files changed, 4 insertions(+), 12 deletions(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 102c201d..2cf3e244 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -427,18 +427,14 @@ Integer values are indicated in the standard way for any valid integer. Formatti ### `boolean` -The field contains boolean (true/false) data. +The field contains boolean data i.e. logical `true` or logical `false`. 
**Native Representaiton** -If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings following the rules below. - -The values set in `trueValues` and `falseValues` are to be cast to their logical representation as booleans. `trueValues` and `falseValues` are arrays which can be customised to user need. The default values for these are in the additional properties section below. - -The boolean field can be customised with these additional properties: +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as defined by the `trueValues` and `falseValues` properties that can be customized to user need: -- **trueValues**: `[ "true", "True", "TRUE", "1" ]` -- **falseValues**: `[ "false", "False", "FALSE", "0" ]` +- **trueValues**: An array of native values to be interpreted as logical `true`. The default is `[ "true", "True", "TRUE", "1" ]`. +- **falseValues**: An array of native values to be interpreted as logical `false`. The default is `[ "false", "False", "FALSE", "0" ]`. 
### `object` diff --git a/profiles/dictionary/schema.yaml b/profiles/dictionary/schema.yaml index ab49784a..98ceaf52 100644 --- a/profiles/dictionary/schema.yaml +++ b/profiles/dictionary/schema.yaml @@ -210,14 +210,10 @@ tableSchemaForeignKey: tableSchemaTrueValues: type: array minItems: 1 - items: - type: string default: ["true", "True", "TRUE", "1"] tableSchemaFalseValues: type: array minItems: 1 - items: - type: string default: ["false", "False", "FALSE", "0"] tableSchemaMissingValues: type: array From 954a220b68fa0a6ae5b2550574308ff54cfdd2c2 Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 10:11:11 +0100 Subject: [PATCH 08/20] Updated list --- content/docs/specifications/table-schema.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 2cf3e244..389cb7db 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -454,18 +454,17 @@ If supported, values `MUST` be natively represented by a data format. If not sup ### `list` -The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. +The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections. -**Native Representaiton** +The list field can be customised with this additional property: -If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings following the rules below. +- **itemType**: specifies the list item type in terms of existent Table Schema types. If present, it `MUST` be one of `string`, `integer`, `boolean`, `number`, `datetme`, `date`, and `time`. If not present, the default is `string`. 
A data consumer `MUST` process list items as it were individual values of the corresponding data type. -The field `MUST` contain a string with values separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections. +**Native Representaiton** -The list field can be customised with these additional properties: +If supported, values `MUST` be natively represented by a data format. If not supported, the field `MUST` contain a string with list items separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. The list items `MUST` be serialized in default format of the corresponding `itemType`. The delimiter can be customised with this additional property: -- **delimiter**: specifies the character sequence which separates, if [native representation](../glossary/#native-representation) is a string. If not present, the default is `,` (comma). -- **itemType**: specifies the list item type in terms of existent Table Schema types. If present, it `MUST` be one of `string`, `integer`, `boolean`, `number`, `datetme`, `date`, and `time`. If not present, the default is `string`. A data consumer `MUST` process list items as it were individual values of the corresponding data type. Note, that if [native representation](../glossary/#native-representation) is a string, only default formats are supported, for example, for a list with `itemType` set to `date`, items have to be in default form for dates i.e. `yyyy-mm-dd`. +- **delimiter**: specifies the character sequence which separates list items. If not present, the default is `,` (comma). 
### `datetime` From 22c5b903037abb415ddbf7a6f0ea8c26b64b5a6b Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 10:18:07 +0100 Subject: [PATCH 09/20] Minor updates --- content/docs/specifications/table-schema.md | 32 ++++++++++----------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 389cb7db..390c0f18 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -468,7 +468,7 @@ If supported, values `MUST` be natively represented by a data format. If not sup ### `datetime` -The field contains a date with a time. +The field contains a date with a time and an optional timezone. **Native Representaiton** @@ -504,7 +504,7 @@ If supported, values `MUST` be natively represented by a data format. If not sup ### `year` -A calendar year. +The field contains a calendar year. **Native Representaiton** @@ -512,7 +512,7 @@ If supported, values `MUST` be natively represented by a data format. If not sup ### `yearmonth` -A specific month in a specific year. +The field containts a specific month in a specific year. **Native Representaiton** @@ -520,19 +520,17 @@ If supported, values `MUST` be natively represented by a data format. If not sup ### `duration` -A duration of time. +The field contains a duration of time. **Native Representaiton** -If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings following the rules below. - -We follow the definition of [XML Schema duration datatype](http://www.w3.org/TR/xmlschema-2/#duration) directly and that definition is implicitly inlined here. +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings as per [XML Schema `duration`](http://www.w3.org/TR/xmlschema-2/#duration). 
-The duration is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) extended format `PnYnMnDTnHnMnS`, where `nY` represents the number of years, `nM` the number of months, `nD` the number of days, `T` is the date/time separator, `nH` the number of hours, `nM` the number of minutes and `nS` the number of seconds. The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator `MAY` be omitted if their value is zero, and lower order elements `MAY` also be omitted for reduced precision. +The duration `MUST` be in the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) extended format `PnYnMnDTnHnMnS`, where `nY` represents the number of years, `nM` the number of months, `nD` the number of days, `T` is the date/time separator, `nH` the number of hours, `nM` the number of minutes and `nS` the number of seconds. The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator `MAY` be omitted if their value is zero, and lower order elements `MAY` also be omitted for reduced precision. ### `geopoint` -The field contains data describing a geographic point. +The field contains data describing a geographic point i.e. `lon` and `lat` values that are floating point numbers. **Native Representaiton** @@ -541,22 +539,24 @@ If supported, values `MUST` be natively represented by a data format. If not sup - **default**: A string of the pattern "lon, lat", where each value is a number, and `lon` is the longitude and `lat` is the latitude (note the space is optional after the `,`). E.g. `"90.50, 45.50"`. - **array**: A JSON array, or a string parsable as a JSON array, of exactly two items, where each item is a number, and the first item is `lon` and the second item is `lat` e.g. `[90.50, 45.50]` -- **object**: A JSON object with exactly two keys, `lat` and `lon` and each value is a number e.g. 
`{"lon": 90.50, "lat": 45.50}` +- **object**: A JSON object with exactly two keys, `lon` and `lat` and each value is a number e.g. `{"lon": 90.50, "lat": 45.50}` ### `geojson` -The field contains a JSON object according to GeoJSON or TopoJSON spec. +The field contains a JSON object according to GeoJSON or TopoJSON specifications. -**Native Representaiton** - -If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings in one of the following formats: +Supported formats: - **default**: A geojson object as per the [GeoJSON spec](http://geojson.org/). -- **topojson**: A topojson object as per the [TopoJSON spec](https://github.com/topojson/topojson-specification/blob/master/README.md) +- **topojson**: A topojson object as per the [TopoJSON spec](https://github.com/topojson/topojson-specification/blob/master/README.md). + +**Native Representaiton** + +If supported, values `MUST` be natively represented by a data format. If not supported, values `MUST` be represented as strings that are valid serialized JSON objects. ### `any` -The field contains values of a unspecified or mixed type. A data consumer `MUST NOT` perform any processing on this field's values and `MUST` interpret them as it is in the data source. This data type is directly modelled on the concept of the `any` type of strongly typed object-oriented languages like [TypeScript](https://www.typescriptlang.org/docs/handbook/2/everyday-types.html#any). +The field contains values of a unspecified or mixed type. A data consumer `MUST NOT` perform any processing on values and `MUST` interpret them as it is in [Native Representaiton](../glossary/#native-representation) of data. This data type is directly modelled on the concept of the `any` type of strongly typed object-oriented languages like [TypeScript](https://www.typescriptlang.org/docs/handbook/2/everyday-types.html#any). 
For example, having a Table Schema below: From 4c25362f6d2c0fa8b4b70c4517bd13e975f19da6 Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 10:36:05 +0100 Subject: [PATCH 10/20] Updated missingValues --- content/docs/specifications/table-schema.md | 9 ++++----- profiles/dictionary/schema.yaml | 2 -- 2 files changed, 4 insertions(+), 7 deletions(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 390c0f18..0fa59092 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -88,18 +88,17 @@ A Table Schema descriptor `MAY` contain a property `fieldsMatch` that `MUST` be Many datasets arrive with missing data values, either because a value was not collected or it never existed. Missing values may be indicated simply by the value being empty in other cases a special value may have been used e.g. `-`, `NaN`, `0`, `-9999` etc. -`missingValues` dictates which string values `MUST` be treated as `null` values. This conversion to `null` is done before any other attempted type-specific string conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value. +The `missingValues` property configures which native values `MUST` be treated as logical `null` values. If provided, the `missingValues` property `MUST` be an `array` of values. -`missingValues` `MUST` be an `array` where each entry is a `string`. +This conversion to `null` is done before any other attempted type-specific conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value. -**Why strings**: `missingValues` are strings rather than being the data type of the particular field. 
This allows for comparison prior to casting and for fields to have missing value which are not of their type, for example a `number` field to have missing values indicated by `-`. - -Examples: +Examples of the `missingValues` property: ```text "missingValues": [""] "missingValues": ["-"] "missingValues": ["NaN", "-"] +"missingValues": [-9999] ``` #### `primaryKey` diff --git a/profiles/dictionary/schema.yaml b/profiles/dictionary/schema.yaml index 98ceaf52..e2f2869e 100644 --- a/profiles/dictionary/schema.yaml +++ b/profiles/dictionary/schema.yaml @@ -217,8 +217,6 @@ tableSchemaFalseValues: default: ["false", "False", "FALSE", "0"] tableSchemaMissingValues: type: array - items: - type: string default: - "" description: Values that when encountered in the source, should be considered From 91491348e8c3f267d277b7a774fb8c0773a9dc5c Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 10:44:23 +0100 Subject: [PATCH 11/20] Minor update --- content/docs/specifications/table-schema.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 0fa59092..8c14f8c6 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -555,7 +555,7 @@ If supported, values `MUST` be natively represented by a data format. If not sup ### `any` -The field contains values of a unspecified or mixed type. A data consumer `MUST NOT` perform any processing on values and `MUST` interpret them as it is in [Native Representaiton](../glossary/#native-representation) of data. This data type is directly modelled on the concept of the `any` type of strongly typed object-oriented languages like [TypeScript](https://www.typescriptlang.org/docs/handbook/2/everyday-types.html#any). +The field contains values of a unspecified or mixed type. 
A data consumer `MUST NOT` perform any processing on values and `MUST` interpret them as it is in native representation of data. This data type is directly modelled on the concept of the `any` type of strongly typed object-oriented languages like [TypeScript](https://www.typescriptlang.org/docs/handbook/2/everyday-types.html#any). For example, having a Table Schema below: From 217f535a3b8aa4cc5e39b8bfff2d928e78969349 Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 10:46:47 +0100 Subject: [PATCH 12/20] Removed invalid json --- content/docs/specifications/table-schema.md | 1 - 1 file changed, 1 deletion(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 8c14f8c6..999e1e08 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -359,7 +359,6 @@ The corresponding Table Schema is: "type": "string", "rdfType": "http://schema.org/Country" } - ... } } ``` From e25c79ec37959028cba362516409c0d723982b21 Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 10:47:39 +0100 Subject: [PATCH 13/20] Revert "Removed invalid json" This reverts commit 217f535a3b8aa4cc5e39b8bfff2d928e78969349. --- content/docs/specifications/table-schema.md | 1 + 1 file changed, 1 insertion(+) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 999e1e08..8c14f8c6 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -359,6 +359,7 @@ The corresponding Table Schema is: "type": "string", "rdfType": "http://schema.org/Country" } + ... 
} } ``` From 3600ae9e8c9d0a6c40d0079de7ea3b830570f1dd Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 11:08:51 +0100 Subject: [PATCH 14/20] Fixed typo --- content/docs/specifications/table-schema.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index fa3aa6f8..b71ab7a6 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -457,7 +457,7 @@ The field contains data that is an ordered one-level depth collection of primiti The list field can be customised with this additional property: -- **itemType**: specifies the list item type in terms of existent Table Schema types. If present, it `MUST` be one of `string`, `integer`, `boolean`, `number`, `datetme`, `date`, and `time`. If not present, the default is `string`. A data consumer `MUST` process list items as it were individual values of the corresponding data type. +- **itemType**: specifies the list item type in terms of existent Table Schema types. If present, it `MUST` be one of `string`, `integer`, `boolean`, `number`, `datetime`, `date`, and `time`. If not present, the default is `string`. A data consumer `MUST` process list items as it were individual values of the corresponding data type. 
**Native Representaiton** From 32a672788602ab2ae03865b89edf1454bed7da50 Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 11:09:57 +0100 Subject: [PATCH 15/20] Updated wording --- content/docs/specifications/table-schema.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index b71ab7a6..cc11554f 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -461,7 +461,7 @@ The list field can be customised with this additional property: **Native Representaiton** -If supported, values `MUST` be natively represented by a data format. If not supported, the field `MUST` contain a string with list items separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. The list items `MUST` be serialized in default format of the corresponding `itemType`. The delimiter can be customised with this additional property: +If supported, values `MUST` be natively represented by a data format. If not supported, the field `MUST` contain a string with list items separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. The list items `MUST` be serialized using a default format of the corresponding `itemType`. The delimiter can be customised with this additional property: - **delimiter**: specifies the character sequence which separates list items. If not present, the default is `,` (comma). 
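The `list` wording in the patches above (native collections are used as is, strings are split by `delimiter` and each item is read in the default format of `itemType`) can be sketched as follows. The names `readList` and `itemParsers` are hypothetical, only three item types are covered, and returning `null` for an unreadable item is an assumption rather than something the patch defines.

```javascript
// Hypothetical per-item parsers for the default formats of a few item types.
const itemParsers = {
  string: (text) => text,
  integer: (text) => (/^[+-]?\d+$/.test(text) ? Number.parseInt(text, 10) : null),
  boolean: (text) =>
    ["true", "True", "TRUE", "1"].includes(text)
      ? true
      : ["false", "False", "FALSE", "0"].includes(text)
        ? false
        : null,
};

// Hypothetical helper: turns a native value of a `list` field into an array of logical items.
function readList(nativeValue, { delimiter = ",", itemType = "string" } = {}) {
  // The data format supports collections natively (e.g. a JSON array): use the value as is.
  if (Array.isArray(nativeValue)) return nativeValue;
  if (typeof nativeValue !== "string") return null;

  const parse = itemParsers[itemType];
  if (!parse) throw new Error(`itemType not covered by this sketch: ${itemType}`);

  // Split on the delimiter and read every item in the default format of its type.
  return nativeValue.split(delimiter).map((item) => parse(item));
}

console.log(readList("value1,value2"));
console.log(readList("1;2;3", { delimiter: ";", itemType: "integer" }));
```

How an empty string should be read (an empty list or a list with one empty item) is not stated in the patch, so the sketch leaves the behaviour of `String.prototype.split` unchanged there.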
From 48dea00e7439a13d3895735083f58ed4a889d130 Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 11:37:19 +0100 Subject: [PATCH 16/20] Minor updates --- content/docs/specifications/glossary.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/specifications/glossary.md b/content/docs/specifications/glossary.md index 02f86fea..0434efb2 100644 --- a/content/docs/specifications/glossary.md +++ b/content/docs/specifications/glossary.md @@ -49,7 +49,7 @@ Example of a relative path that this will work both as a relative path on disk a ### Tabular Data -Tabular data consists of a set of rows. Each row has a set of fields (columns). We usually expect that each row has the same set of fields and thus we can talk about _the_ fields for the table as a whole. +Tabular data consists of a list of rows. Each row has a list of fields (columns). We usually expect that each row has the same list of fields and thus we can talk about the fields for the table as a whole. In case of tables in spreadsheets or CSV files we often interpret the first row as a header row, giving the names of the fields. By contrast, in other situations, e.g. tables in SQL databases, the field names are explicitly designated. @@ -101,7 +101,7 @@ id|name #### Native Representation -The `native` representation of data refers to the representation of data in a form that is produced by a format-specific driver in some computational environment. The Data Package Standard itself does not define any data formats and relies on existent data formats and corresponding drivers on the implementations level. +The `native` representation of data refers to the representation of data in a form that is produced by a format-specific driver in some computational environment. The Data Package Standard itself does not define any data formats and relies on existent data formats (such as CSV, JSON, or SQL) and corresponding drivers on the implementations level. 
Having a Data Resource definition as below: @@ -115,7 +115,7 @@ Having a Data Resource definition as below: } ``` -The data from the exemplar CSV in `native` representation will be: +The data from the CSV example above will be in `native` representation (we use a JavaScript-based example for illustration): ```javascript {id: "1", name: "apple"} @@ -146,7 +146,7 @@ Having a Data Resource definition as below: } ``` -The data from the exemplar CSV in `logical` representation will be: +The data from the CSV example above will be in `logical` representation (we use a JavaScript-based example for illustration): ```javascript {id: 1, name: "apple"} From e7def485ef4836c0bd768b8d37069bd3cdc7e0e7 Mon Sep 17 00:00:00 2001 From: roll Date: Wed, 3 Apr 2024 11:39:06 +0100 Subject: [PATCH 17/20] Minor updates --- content/docs/specifications/glossary.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/specifications/glossary.md b/content/docs/specifications/glossary.md index 0434efb2..c525ddf7 100644 --- a/content/docs/specifications/glossary.md +++ b/content/docs/specifications/glossary.md @@ -115,7 +115,7 @@ Having a Data Resource definition as below: } ``` -The data from the CSV example above will be in `native` representation (we use a JavaScript-based example for illustration): +The data from the CSV example above will be in `native` representation (we use a JavaScript-based environment for illustration): ```javascript {id: "1", name: "apple"} @@ -146,7 +146,7 @@ Having a Data Resource definition as below: } ``` -The data from the CSV example above will be in `logical` representation (we use a JavaScript-based example for illustration): +The data from the CSV example above will be in `logical` representation (we use a JavaScript-based environment for illustration): ```javascript {id: 1, name: "apple"} From 11b9a03f4106f4c7cc12331351c793c659e76b8a Mon Sep 17 00:00:00 2001 From: roll Date: Thu, 4 Apr 2024 10:12:11 +0100 Subject: [PATCH 18/20] 
Fixed typo --- content/docs/specifications/table-schema.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index cc11554f..232b3cdd 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -27,7 +27,7 @@ Table Schema is a simple language- and implementation-agnostic way to declare a ## Concepts -This specification havily relies on the following concepts: +This specification heavily relies on the following concepts: - [Tabular Data](../glossary/#tabular-data) - [Data Representation](../glossary/#data-representation) From 0dfc4f8b9660bd3222bd63c7d22e8980f8271d1c Mon Sep 17 00:00:00 2001 From: roll Date: Thu, 4 Apr 2024 10:13:52 +0100 Subject: [PATCH 19/20] Added native data in descriptor note --- content/docs/specifications/table-schema.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 232b3cdd..51825c1d 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -34,6 +34,8 @@ This specification heavily relies on the following concepts: In this document, we will explicitly refer to either the [Native Representation](../glossary/#native-representation) or [Logical Representation](../glossary/#logical-representation) of data in places where it prevents ambiguity for those engaging with the specification, especially implementors. Note, that this specification does not deal in any way with [Physical Representation](../glossary/#physical-representation) of data. +Note, that whenever a native value is allowed to be provided in this spec, the most similar JSON type should be used to represent it. If no such type exists (e.g. in case there's a native date value), a string representation of that value should be provided. 
Such mappings between native types and JSON types, and the string representations described above are file format specific and left for implementors to decide (unless defined explicitly in this specification or its appendixes). + ## Descriptor A Table Schema is represented by a descriptor. The descriptor `MUST` be a JSON `object` (JSON is defined in [RFC 4627](http://www.ietf.org/rfc/rfc4627.txt)). From d46eaaed8dd5761aa789a9235aa2f06a788f4a0b Mon Sep 17 00:00:00 2001 From: roll Date: Fri, 12 Apr 2024 09:25:34 +0100 Subject: [PATCH 20/20] Added list to min/maxLenght constraints --- content/docs/specifications/table-schema.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 9e16be0c..5a9cc450 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -633,14 +633,14 @@ If `true`, then all values for that field `MUST` be unique within the data file ### `minLength` - **Type**: integer -- **Fields**: collections (string, array, object) +- **Fields**: collections (string, list, array, object) An integer that specifies the minimum length of a value. ### `maxLength` - **Type**: integer -- **Fields**: collections (string, array, object) +- **Fields**: collections (string, list, array, object) An integer that specifies the maximum length of a value.
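As a rough illustration of the constraint change in the last hunk, the sketch below checks `minLength` and `maxLength` against a collection value, with lists treated like strings, arrays, and objects. `check_length` is a hypothetical helper, and two things are assumptions of this sketch rather than statements from the patch: that "length" maps to Python's `len()` (characters, items, or keys), and that both bounds are inclusive.

```python
def check_length(value, min_length=None, max_length=None):
    """Return True if the collection's length satisfies both constraints."""
    # Assumption: length means number of characters (string),
    # items (list/array) or keys (object), i.e. Python's len().
    length = len(value)
    # Assumption: bounds are inclusive.
    if min_length is not None and length < min_length:
        return False
    if max_length is not None and length > max_length:
        return False
    return True


print(check_length(["a", "b"], min_length=1, max_length=2))
print(check_length("abcd", max_length=3))
```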