Mistakes in Thesaurus #1616

Open
matthewkumar opened this issue Apr 24, 2020 · 10 comments
Labels
controlled vocab
extended-DCE: relating to features or work not in MVP of DCE; or that are in the extended or advanced "sandbox"
on hold: currently inactive tasks or projects

Comments

@matthewkumar
Collaborator

If you notice any mistakes in the thesaurus that need to be manually overridden, please list them below.

@matthewkumar
Collaborator Author

animal.csv
line 124
'calandra lark' should normalize to 'lark'

@ronikaufman
Collaborator

Some files are missing the prefLabel_en column: measurement.csv, material.csv, sensory.csv.
And definition.csv is empty.
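A check like this could be scripted so it does not have to be done by eye. The following is a minimal sketch, not the project's actual tooling: only the `prefLabel_en` column name and the file names come from this thread, while the `verbatim_term` column and the sample contents are made up for illustration.

```python
import csv
import io

REQUIRED_COLUMN = "prefLabel_en"  # column name taken from the comment above


def check_thesaurus_csv(name, text):
    """Return a list of human-readable problems for one CSV file's contents."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [f"{name}: file is empty"]
    problems = []
    if REQUIRED_COLUMN not in rows[0]:
        problems.append(f"{name}: missing column {REQUIRED_COLUMN}")
    if len(rows) == 1:
        problems.append(f"{name}: header row only, no entries")
    return problems


# Hypothetical file contents, for illustration only.
samples = {
    "animal.csv": "verbatim_term,prefLabel_en\ncalandra lark,lark\n",
    "material.csv": "verbatim_term\nlarge steel dice\n",
    "definition.csv": "",
}
for name, text in samples.items():
    for problem in check_thesaurus_csv(name, text):
        print(problem)
```

For real files, the text would be read from disk (for example with `pathlib.Path(name).read_text()`) before being passed in.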

@njr2128 added the controlled vocab and extended-DCE labels Apr 27, 2020
@matthewkumar
Collaborator Author

All fixed except definition

@danachaillard
Collaborator

In animal.csv, 'smalls female lizards' normalizes to 'smalls female lizard' instead of 'lizard'.

@matthewkumar
Collaborator Author

I imagine 'smalls' is being read as a noun rather than an adjective, which throws off the semantic head identification. I wonder why 'smalls' got added to the English version of the manuscript.
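To make the suspected failure concrete, here is a toy sketch. It is not the project's normalizer: the part-of-speech table is hand-written, and the rule "reduce to the noun only when exactly one noun is found, otherwise just singularize the last word" is an assumption chosen because it reproduces the outputs reported in this issue.

```python
# Hand-written toy tags; a real pipeline would get these from a POS tagger.
TOY_POS = {
    "small": "ADJ", "female": "ADJ", "large": "ADJ",
    "smalls": "NOUN",  # the suspected mis-tagging
    "lizards": "NOUN", "steel": "NOUN", "dice": "NOUN",
    "goat": "NOUN", "kid": "NOUN",
}


def singularize(word):
    """Very naive singular form: strip one trailing 's'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(term):
    """Reduce a term to its head noun when the head is unambiguous."""
    tokens = term.lower().split()
    if not tokens:
        return ""
    nouns = [t for t in tokens if TOY_POS.get(t) == "NOUN"]
    if len(nouns) == 1:
        return singularize(nouns[0])
    # Zero or several nouns: no clear head, so keep the whole phrase.
    return " ".join(tokens[:-1] + [singularize(tokens[-1])])


print(normalize("small female lizards"))   # one noun, reduced to it
print(normalize("smalls female lizards"))  # two "nouns", phrase kept
```

Under these assumptions, the extra "noun" is enough to stop the phrase from being reduced, which matches the behaviour reported for 'smalls female lizards'.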

@ps2270
Collaborator

ps2270 commented May 16, 2020 via email

@danachaillard
Collaborator

danachaillard commented May 16, 2020 via email

@danachaillard
Collaborator

A few other things I found while making the word clouds. I'm sorry I don't have the line numbers in the CSV files, though.

In animals: terms containing two nouns do not default to the head noun (e.g. "goat kid"). That's the weakness of the way we do this. I don't see a way to fix it apart from keeping a list of all the exceptions. Also, there is a "viper color" tagged as an animal; I don't know what to do with that.

In body parts: do we want "joint of the little finger" to point to "finger"? Or will that be handled in the abstracting step?

In materials: many things go wrong. Some I don't understand: "large steel dice", for example, is not changed. Some are just too long or too semantically complicated, such as "latten mad brittle by the calamine". I don't know what to think of "tin is base".

In weapons: we have the double-noun problem again with "arquebus powder". "arquebus a croc" doesn't work either, but that's probably because it's in French, so it doesn't find the name?

@matthewkumar
Collaborator Author

Re: smalls little lizards
It's good that this is the only instance, and we could fix it by adding it to our list of manual corrections. However, it raises a point Naomi, Terry, and I have discussed: how we should represent del tags throughout the repository. I'll make sure to bring this case up next time we speak.

@danachaillard
It's probably beyond the scope of this project to correctly normalize every term that appears in the manuscript. Our best hope is to maximize the number of terms we can normalize while minimizing the number of errors. In general, I'd say it's better to avoid generating errors than to generate something for everything. I agree that terms like 'large steel dice' would ideally be normalized, but I'm not sure what protocol we would follow to make that happen. What have you been thinking?
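As a rough sketch of the "manual corrections first, otherwise be conservative" idea: the two override entries below are the ones reported in this issue, while the function name, the `automatic` callback, and the convention of returning `None` for "leave unnormalized" are assumptions made for illustration, not the repository's actual code.

```python
# Manual corrections reported in this issue; consulted before any automatic rule.
MANUAL_OVERRIDES = {
    "calandra lark": "lark",
    "smalls female lizards": "lizard",
}


def normalize_with_overrides(term, automatic):
    """Apply a manual correction if one exists, else defer to `automatic`.

    `automatic` is any callable taking the cleaned term and returning a
    normalized form, or None when it is not confident enough to answer.
    """
    key = " ".join(term.lower().split())  # lowercase, collapse whitespace
    if key in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[key]
    return automatic(key)


def cautious(term):
    """Stand-in automatic normalizer that declines to guess."""
    return None


print(normalize_with_overrides("Calandra  lark", cautious))
print(normalize_with_overrides("large steel dice", cautious))
```

Returning `None` instead of a guess keeps uncertain terms out of the normalized output, so they can be reviewed by hand later rather than silently mis-normalized.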

@danachaillard
Collaborator

It depends on how the method that finds the 'noun' works, because it seems to fail in every instance where one of the adjectives is also a noun. What does the method return when it doesn't find one? Could it give us all the nouns it finds?
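One possible shape for such a method, sketched with a hand-written noun list rather than the real tagger (the function names and the word list are hypothetical): it returns every candidate noun, and an empty list when it finds none, so ambiguous terms can be flagged for review.

```python
# Hypothetical noun list standing in for a real part-of-speech tagger.
TOY_NOUNS = {"goat", "kid", "viper", "color", "arquebus", "powder", "lizards"}


def candidate_nouns(term, nouns=TOY_NOUNS):
    """Return every token recognised as a noun, in order; [] if none."""
    return [tok for tok in term.lower().split() if tok in nouns]


def describe(term):
    """Summarise whether a term has no head, one head, or several candidates."""
    found = candidate_nouns(term)
    if not found:
        return f"{term!r}: no noun found"
    if len(found) == 1:
        return f"{term!r}: head is {found[0]!r}"
    return f"{term!r}: ambiguous, candidates {found}"


for term in ["female lizards", "goat kid", "arquebus powder", "tin is base"]:
    print(describe(term))
```

With output like this, the double-noun cases could be collected into a review list automatically instead of being discovered one at a time.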

@njr2128 added the on hold label Feb 26, 2024
6 participants