Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Certain concepts incorrectly trigger "unknown concept" error (64 character limit) #49

Open
jheeffer opened this issue Apr 29, 2020 · 2 comments
Labels
bug Something isn't working

Comments

@jheeffer
Copy link
Member

jheeffer commented Apr 29, 2020

Some concepts do exist in the dataset, and are returned in the availability queries, but on datapoint queries they fail with "unknown concept".

query:

{
  "select": {
    "key": [
      "geo",
      "time"
    ],
    "value": [
      "debt_servicing_costs_percent_of_exports_and_net_income_from_abroad"
    ]
  },
  "from": "datapoints"
}

url:
https://big-waffle.gapminder.org/sg-master/75ab7c3?_select_key@=geo&=time;&value@=debt/_servicing/_costs/_percent/_of/_exports/_and/_net/_income/_from/_abroad;;&from=datapoints

Here's the availability query for that dataset:
https://big-waffle.gapminder.org/sg-master/75ab7c3?_select_key@=key&=value;&value@;;&from=datapoints.schema

Which contains <geo,time>,debt_servicing_costs_percent_of_exports_and_net_income_from_abroad

One hypothesis: it seems to be mostly concepts with long id's, might be some undocumented concept id length limitation?

@jheeffer
Copy link
Member Author

jheeffer commented May 29, 2020

Most likely it's a maximum of 64 characters:
https://dev.mysql.com/doc/refman/8.0/en/identifier-length.html (see column)

debt_servicing_costs_percent_of_exports_and_net_income_from_abroad is 67 characters

The column does not exist in the big waffle datapoints table after loading SG.
Loading in SG does not give any errors in the log (it just says "finished loading").

We will need a renaming function that also prevents collissions (however unlikely) when creating and querying tables. Maybe we substring anything longer than 60 chars to 60 and suffix a 4 character hash (probably by base64 encoding) of the complete string?

Overview of non-crypto hashes in (node.)js: https://github.com/joliss/fast-js-hash-benchmark

@jheeffer
Copy link
Member Author

jheeffer commented May 30, 2020

There was already code implemented for long column names, including a test case. However the code checked for > 64 characters

if (columnName.length > 64) {

while the test case used a concept < 64 characters.

select: { key: ['country', 'gas', 'time'], value: ['emissions_as_an_extremely_long_indicator_name_of_60plus_chars'] },

Meaning the test case passed because it didn't actually check the long concept implementation.

The problem seems to be at the CONNECT engine. It correctly creates a table with the shortened column name but then gives this error when trying to select the data: Error Code: 1296. Got error 122 'Missing field 1 in TTbb13f11f7172454ad327054ecb50fc35 line 1' from CONNECT

When renaming the header in the csv file to the shortened name, data is loaded correctly (can select) but the rest of the script fails because the csv file has an unexpected header.

My hypothesis is that this is due to a mismatch between csv table header and CONNECT engine table header. CONNECT tries to read the shortened name from csv but it doesn't exist there.

I thought CONNECT engine might allow longer column names than INNODB but a test showed that the CONNECT engine also can't work with long names. It's a MariaDB wide limitation: https://mariadb.com/kb/en/identifier-names/

Identifier name 'a_really_long_concept_name_that_is_too_long_for_many_databases_since_they_have_a_maximum_column_l...' is too long
  sql: CREATE TABLE TTbb13f11f7172454ad327054ecb50fc35 ( `geo` VARCHAR(14), `time` INTEGER, `a_really_long_concept_name_that_is_too_long_for_many_databases_since_they_have_a_maximum_column_length` INTEGER)
    engine=CONNECT table_type=CSV file_name='/mnt/c/Users/Jasper/Documents/github/gapminder/big-waffle/test/ddf--testdata/v0/ddf--datapoints--a_really_long_concept_name_that_is_too_long_for_many_databases_since_they_have_a_maximum_column_length--by--geo--time.csv' header=1 sep_char=',' quoted=1; - parameters:[]

So it seems it's either:

  1. We don't allow > 64 character concept id's
  2. We edit csv files before loading them into maria
  3. We find another way to load csv files
  4. We find a hidden feature in CONNECT which allows mapping a csv column to a different sql column

@angiehjort angiehjort added the bug Something isn't working label Oct 26, 2021
@angiehjort angiehjort changed the title Certain concepts incorrectly trigger "unknown concept" error Certain concepts incorrectly trigger "unknown concept" error (64 character limit) Oct 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants