
Wrong Result in search query #556

mohitgoyal201617 opened this issue Jun 7, 2016 · 8 comments

@mohitgoyal201617

mohitgoyal201617 commented Jun 7, 2016

I have inserted some OpenAddresses data. When I search for NH-11 I get a result, but when I search for NH 11 I do not get a result.
This text is present in the street field.

@orangejulius
Member

Hey @mohitgoyal201617,
Thanks for reporting this! We definitely have some areas to improve around parsing text with separating characters like -. I think we may have some work being done now that will help, but it's generally something we are getting better at over time.

To help us out, can you send along some full queries that demonstrate both the good and bad behavior? You can use our compare tool to look at the queries, and send along the links to that tool. Thanks!

@mohitgoyal201617
Author

I have modified punctuation.js (in the schema) and commented out '-' in the allowed characters.
I have cloned pelias-api and started it using npm start, so I just want to know how to deploy the schema module that I cloned and modified.

@orangejulius
Member

orangejulius commented Jun 8, 2016

Cool, let us know how that works. I can't remember if we've tried something similar.

From the pelias/schema directory, you can run this to reset the schema:

WARNING: This will remove all the data you've imported into the pelias index in Elasticsearch, and then you'll have to re-index it.

node scripts/drop_index.js
node scripts/create_index.js

@mohitgoyal201617
Author

mohitgoyal201617 commented Jun 9, 2016

It worked. Now it treats NH-14 as NH14.
If my '-' correction is fine, how can I push it into pelias/schema?

One more thing
I tried to add one character filter

"peliasIndexOneEdgeGram" : {
  "type": "custom",
  "tokenizer" : "peliasNameTokenizer",
  "char_filter" : ["punctuation", "specialChar"],
  "filter": [
    "lowercase",
    "asciifolding",
    "trim",
    "full_token_address_suffix_expansion",
    "ampersand",
    "remove_ordinals",
    "removeAllZeroNumericPrefix",
    "peliasOneEdgeGramFilter",
    "unique",
    "notnull"
  ]
},
"char_filter": {
  "punctuation" : {
    "type" : "mapping",
    "mappings" : punctuation.blacklist.map(function(c){
      return c + '=>';
    })
  },
  "alphanumeric" : {
    "type" : "pattern_replace",
    "pattern": "[^a-zA-Z0-9]",
    "replacement": ""
  },
  "numeric" : {
    "type" : "pattern_replace",
    "pattern": "[^0-9]",
    "replacement": " "
  },
  "specialChar" : {
    "type": "pattern_replace",
    "pattern": "NH\\s*",
    "replacement": "NH"
  }
}

I checked by adding a test in analyzer_peliasIndexOneEdgeGram.js.
Input: nh 14
Output: ['n','nh','nh1','nh14']
But when I loaded data into Elasticsearch and searched using pelias/api (text=nh14),
the search results contained many records matching just nh with a high score, while the record containing nh14 had a low score.
Could you please explain this?
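For context on the scoring question above: a one-edge-gram filter indexes every prefix of a token, so any document containing a token that starts with "nh" also indexes the short gram "nh" and can match the query. A minimal sketch of the expansion (an illustration only, not pelias code):

```javascript
// Illustration: how a one-edge-gram filter expands a token into all of
// its prefixes. Every street containing an "nh..." token also indexes
// the gram "nh", which is why a query for "nh14" can match many
// unrelated "nh" records as well.
function oneEdgeGrams(token) {
  var grams = [];
  for (var i = 1; i <= token.length; i++) {
    grams.push(token.substring(0, i));
  }
  return grams;
}

console.log(oneEdgeGrams('nh14')); // [ 'n', 'nh', 'nh1', 'nh14' ]
```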

@orangejulius
Member

Hey @mohitgoyal201617,
First, I edited your comment above a little bit to improve the code formatting. Let me know if I got it wrong, and I would definitely suggest using similar formatting in the future :)

I'm glad adding handling for the - character worked for you. @missinglink, do you think we should add this filtering so that NH-14 and NH14 housenumbers both get indexed as NH14?

@mohitgoyal201617 Your changes as posted in your comment are (obviously) a bit too specific to be merged, but we would gladly help you create a pull request against pelias/schema that works across more cases. We can also help you figure out why the scoring isn't as you expect. Our gitter chat room might be a better place than Github comments, let me know.

@missinglink
Member

missinglink commented Jun 14, 2016

hi @mohitgoyal201617 I'm assuming you're referring to "National Highway No. 14 Route: Radhanpur to Beawar" (India)?

there are two ways of handling it, you can either split or combine:

split:
[ "NH-14" ] -> [ "NH", "14" ]

combine:
[ "NH-14" ] -> [ "NH14" ]

it looks like you're using combine which I think is correct for this situation.

Looking at your replacement function, it doesn't seem to be correct; I'm surprised it's working for you.

"char_filter": {
  "specialChar" : {
    "type":"pattern_replace",
    "pattern":"NH\\s*",
    "replacement":"NH"
  }
}

The way I read this regex it says "anything starting with NH followed by zero or more whitespace characters should be replaced with the text "NH".

Don't you want something like this?

"filter": {
  "indianMotorwayFilter" : {
    "type":"pattern_replace",
    "pattern":"nh-([0-9]{2})",
    "replacement":"nh$1"
  }
}

... which says "anything starting with "nh", followed by a "-" and then two digits is replaced with "nh" followed by the digits"?
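Sketched in plain JavaScript (Elasticsearch actually applies Java regular expressions in pattern_replace, but the semantics are the same for this pattern):

```javascript
// The proposed pattern: "nh", then "-", then exactly two captured digits,
// re-emitted without the hyphen.
var pattern = /nh-([0-9]{2})/;

console.log('nh-14'.replace(pattern, 'nh$1')); // "nh14"
console.log('nh-1'.replace(pattern, 'nh$1'));  // "nh-1" (one digit: no match)
```

Note that `{2}` requires exactly two digits, so a single-digit value like nh-1 would be left untouched by this pattern as written.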

note: I would suggest changing it from a char_filter to a filter and updating the analyzer to:

"peliasIndexOneEdgeGram" : {
  "type": "custom",
  "tokenizer" : "peliasNameTokenizer",
  "char_filter" : ["punctuation"],
  "filter": [
    "lowercase",
    "asciifolding",
    "trim",
    "indianMotorwayFilter",
    "full_token_address_suffix_expansion",
    "ampersand",
    "remove_ordinals",
    "removeAllZeroNumericPrefix",
    "peliasOneEdgeGramFilter",
    "unique",
    "notnull"
  ]
},

The char_filter in Elasticsearch runs before the tokenizer (and therefore before any of the filters) and is better suited to working with single characters; the filter section is a better place for this because you want to operate on whole tokens.

note: the tokens should be lowercased by this point.
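The ordering point above can be illustrated with a simplified model of the filter chain (an illustration only, not the actual Elasticsearch machinery):

```javascript
// Simplified model: token filters run in the order listed in the analyzer.
// The pattern "nh-([0-9]{2})" is lowercase-only, so it can only match
// after the lowercase filter has already run.
var lowercase = function (token) { return token.toLowerCase(); };
var indianMotorwayFilter = function (token) {
  return token.replace(/nh-([0-9]{2})/, 'nh$1');
};

// lowercase first, as in the analyzer definition above:
var result = [lowercase, indianMotorwayFilter]
  .reduce(function (t, f) { return f(t); }, 'NH-14');
console.log(result); // "nh14"

// With the order reversed, the pattern never matches the uppercase token:
var wrong = [indianMotorwayFilter, lowercase]
  .reduce(function (t, f) { return f(t); }, 'NH-14');
console.log(wrong); // "nh-14"
```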

@mohitgoyal201617
Author

Yes, that's right.
One more thing: the current addressit is not able to extract region/state
information correctly for Indian addresses. Can you suggest some other
address normalizer which can be plugged in?


@orangejulius
Member

You are totally right. AddressIt is not very flexible globally. Fortunately, we have ongoing work to use libpostal, a machine learning project for global address parsing trained on OSM data. @trescube is working on integrating it into Pelias right now, and I believe initial versions for testing will be available soon.
