
Wrong Result in search query #556

mohitgoyal201617 opened this issue Jun 7, 2016 · 8 comments

@mohitgoyal201617

mohitgoyal201617 commented Jun 7, 2016

I have inserted some OpenAddresses data. When I search for NH-11 I get a result, but when I search for NH 11 I do not get a result.
This text is present in the street field.

@orangejulius
Member

Hey @mohitgoyal201617,
Thanks for reporting this! We definitely have some areas to improve around parsing text with separating characters like -. I think we may have some work being done now that will help, but it's generally something we are getting better at over time.

To help us out, can you send along some full queries that demonstrate both the good and bad behavior? You can use our compare tool to look at the queries, and send along the links to that tool. Thanks!

@mohitgoyal201617
Author

I have modified punctuation.js (in the schema) and commented out '-' in the allowed characters.
I have cloned pelias-api and started it using npm start, so I just want to know how to deploy the schema module that I cloned and modified.

@orangejulius
Member

orangejulius commented Jun 8, 2016

Cool, let us know how that works. I can't remember if we've tried something similar.

From the pelias/schema directory, you can run this to reset the schema:

WARNING: This will remove all the data you've imported into the pelias index in Elasticsearch, and then you'll have to re-index it.

node scripts/drop_index.js
node scripts/create_index.js

@mohitgoyal201617
Author

mohitgoyal201617 commented Jun 9, 2016

It worked. Now it treats NH-14 as NH14.
If my '-' correction is fine, how can I push it into pelias/schema?

One more thing
I tried to add one character filter

"peliasIndexOneEdgeGram" : {
  "type": "custom",
  "tokenizer" : "peliasNameTokenizer",
  "char_filter" : ["punctuation", "specialChar"],
  "filter": [
    "lowercase",
    "asciifolding",
    "trim",
    "full_token_address_suffix_expansion",
    "ampersand",
    "remove_ordinals",
    "removeAllZeroNumericPrefix",
    "peliasOneEdgeGramFilter",
    "unique",
    "notnull"
  ]
},
"char_filter": {
  "punctuation" : {
    "type" : "mapping",
    "mappings" : punctuation.blacklist.map(function(c){
      return c + '=>';
    })
  },
  "alphanumeric" : {
    "type" : "pattern_replace",
    "pattern": "[^a-zA-Z0-9]",
    "replacement": ""
  },
  "numeric" : {
    "type" : "pattern_replace",
    "pattern": "[^0-9]",
    "replacement": " "
  },
  "specialChar" : {
    "type": "pattern_replace",
    "pattern": "NH\\s*",
    "replacement": "NH"
  }
}

I checked by adding a test in analyzer_peliasIndexOneEdgeGram.js.
Input: nh 14
Output: ['n','nh','nh1','nh14']
But when I loaded data into Elasticsearch and searched using pelias/api (text=nh14),
the search results contained many records matching just nh with a high score, while the record containing nh14 had a low score.
Could you please explain this?
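For context on the scoring question above: a one-edge-gram filter indexes every prefix of a token, so any document containing a token that starts with "nh" also indexes the short gram "nh" and can match the query. A minimal sketch of the expansion (an illustration only, not pelias code):

```javascript
// Illustration: how a one-edge-gram filter expands a token into all of
// its prefixes. Every street containing an "nh..." token also indexes
// the gram "nh", which is why a query for "nh14" can match many
// unrelated "nh" records as well.
function oneEdgeGrams(token) {
  var grams = [];
  for (var i = 1; i <= token.length; i++) {
    grams.push(token.substring(0, i));
  }
  return grams;
}

console.log(oneEdgeGrams('nh14')); // [ 'n', 'nh', 'nh1', 'nh14' ]
```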

@orangejulius
Member

Hey @mohitgoyal201617,
First, I edited your comment above a little bit to improve the code formatting. Let me know if I got it wrong, and I would definitely suggest using similar formatting in the future :)

I'm glad adding handling for the - character worked for you. @missinglink, do you think we should add this filtering so that NH-14 and NH14 housenumbers both get indexed as NH14?

@mohitgoyal201617 Your changes as posted in your comment are (obviously) a bit too specific to be merged, but we would gladly help you create a pull request against pelias/schema that works across more cases. We can also help you figure out why the scoring isn't as you expect. Our gitter chat room might be a better place than Github comments, let me know.

@missinglink
Member

missinglink commented Jun 14, 2016

hi @mohitgoyal201617 I'm assuming you're referring to "National Highway No. 14 Route: Radhanpur to Beawar" (India)?

there are two ways of handling it, you can either split or combine:

split:
[ "NH-14" ] -> [ "NH", "14" ]

combine:
[ "NH-14" ] -> [ "NH14" ]

it looks like you're using combine which I think is correct for this situation.

Looking at your replacement function, it doesn't seem to be correct; I'm surprised it's working for you.

"char_filter": {
  "specialChar" : {
    "type":"pattern_replace",
    "pattern":"NH\\s*",
    "replacement":"NH"
  }
}

The way I read this regex it says "anything starting with NH followed by zero or more whitespace characters should be replaced with the text "NH".

Don't you want something like this?

"filter": {
  "indianMotorwayFilter" : {
    "type":"pattern_replace",
    "pattern":"nh-([0-9]{2})",
    "replacement":"nh$1"
  }
}

... which says "anything starting with "nh", followed by a "-" and then two digits is replaced with "nh" followed by the digits"?
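Sketched in plain JavaScript (Elasticsearch actually applies Java regular expressions in pattern_replace, but the semantics are the same for this pattern):

```javascript
// The proposed pattern: "nh", then "-", then exactly two captured digits,
// re-emitted without the hyphen.
var pattern = /nh-([0-9]{2})/;

console.log('nh-14'.replace(pattern, 'nh$1')); // "nh14"
console.log('nh-1'.replace(pattern, 'nh$1'));  // "nh-1" (one digit: no match)
```

Note that `{2}` requires exactly two digits, so a single-digit value like nh-1 would be left untouched by this pattern as written.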

note: I would suggest changing it from a char_filter to a filter and updating the analyzer to:

"peliasIndexOneEdgeGram" : {
  "type": "custom",
  "tokenizer" : "peliasNameTokenizer",
  "char_filter" : ["punctuation"],
  "filter": [
    "lowercase",
    "asciifolding",
    "trim",
    "indianMotorwayFilter",
    "full_token_address_suffix_expansion",
    "ampersand",
    "remove_ordinals",
    "removeAllZeroNumericPrefix",
    "peliasOneEdgeGramFilter",
    "unique",
    "notnull"
  ]
},

The char_filter in Elasticsearch runs before the tokenizer (and therefore before any of the filters) and is better suited to working with single characters; the filter section is a better place for this because you want to operate on whole tokens.

note: the tokens should be lowercased by this point.
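The ordering point above can be illustrated with a simplified model of the filter chain (an illustration only, not the actual Elasticsearch machinery):

```javascript
// Simplified model: token filters run in the order listed in the analyzer.
// The pattern "nh-([0-9]{2})" is lowercase-only, so it can only match
// after the lowercase filter has already run.
var lowercase = function (token) { return token.toLowerCase(); };
var indianMotorwayFilter = function (token) {
  return token.replace(/nh-([0-9]{2})/, 'nh$1');
};

// lowercase first, as in the analyzer definition above:
var result = [lowercase, indianMotorwayFilter]
  .reduce(function (t, f) { return f(t); }, 'NH-14');
console.log(result); // "nh14"

// With the order reversed, the pattern never matches the uppercase token:
var wrong = [indianMotorwayFilter, lowercase]
  .reduce(function (t, f) { return f(t); }, 'NH-14');
console.log(wrong); // "nh-14"
```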

@mohitgoyal201617
Author

Yes, that's right.
One more thing: the current addressit is not able to extract region/state
information correctly for Indian addresses. Can you suggest some other
address normalizer which can be plugged in?


@orangejulius
Member

You are totally right. AddressIt is not very flexible globally. Fortunately, we have ongoing work to use libpostal, a machine learning project for global address parsing trained on OSM data. @trescube is working on integrating it into Pelias right now, and I believe initial versions for testing will be available soon.
