Review handling of accented characters and/or stop words #35

Open
dcabo opened this issue Oct 22, 2019 · 9 comments

Comments

@dcabo
Member

dcabo commented Oct 22, 2019

At the moment, "Bárcenas" and "Barcenas" return different search results. We can configure an "analyzer" at index time to remove accents and stop words, based on the language. Documented here.

@esebastian esebastian self-assigned this Dec 10, 2019
@dcabo dcabo added this to the Launch milestone Feb 4, 2020
@dcabo dcabo modified the milestones: Launch, Verba 1.1 Feb 18, 2020
@esebastian
Contributor

This has turned out to be slightly more challenging than expected, as there are multiple trade-offs we need to consider.

The basics are that we need to leverage the stop and asciifolding token filters, which respectively remove stop words from the text and fold any diacritics in words. But in order to do that we need to create a custom analyzer, because it isn't possible to simply add a filter to an existing built-in analyzer.
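A minimal sketch of what that could look like, assuming we recreate the captions index (the analyzer name here is made up, and the stop filter defaults to the English stopword list unless configured otherwise):

```
PUT /captions
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "asciifolding"]
        }
      }
    }
  }
}
```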

At the moment (as we didn't specify anything when we created the captions index) we're using the built-in standard analyzer, which provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and includes the lowercase token filter.

We could configure a diacritics-insensitive analyzer by recreating the built-in standard analyzer from its actual configuration and adding the asciifolding token filter. However, given that all the text we're working with is in Spanish, it would make sense to use the built-in spanish analyzer as the starting point instead. The catch is that the spanish analyzer uses a specific Spanish stemmer (spanish or light_spanish), which can produce some potentially undesirable results, as the search terms would be reduced to their root forms.
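For illustration, recreating the spanish analyzer along the lines of the reference implementation in the Elasticsearch docs and slotting asciifolding into the chain would look roughly like this (the keyword_marker filter from the docs is omitted for brevity, and the filter names are illustrative):

```
PUT /captions
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "rebuilt_spanish": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "spanish_stop", "asciifolding", "spanish_stemmer"]
        }
      }
    }
  }
}
```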

For reference, given the text "Is this a déjà vu? La piragua baja por el río Segura y la Niña Pastori canta por soleares en el château; mientras la cigüeña vuela hacia el sur.", the standard-based custom analyzer would return the tokens [is, this, a, deja, vu, la, piragua, baja, por, el, rio, segura, y, la, nina, pastori, canta, por, soleares, en, el, chateau, mientras, la, ciguena, vuela, hacia, el, sur] while the spanish-based custom analyzer would return [is, this, deja, vu, piragu, baja, rio, segur, nina, pastori, cant, solear, chateau, mientr, ciguen, vuel, haci, sur].
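Both token streams can be reproduced with the _analyze API by listing the chain inline; for example, for the standard-based variant:

```
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Is this a déjà vu? La piragua baja por el río Segura y la Niña Pastori canta por soleares en el château; mientras la cigüeña vuela hacia el sur."
}
```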

Having the search terms reduced to their root forms would mean that searching for the term "cantó" would return results including "canto", but also "canta", "cante" or "cantos".

There is also another angle to consider, related to the way we're currently highlighting the query terms in the search results. At the moment we apply a simple regular expression on the client side of the app to wrap the query terms in custom markup, but that regular expression would miss the folded (diacritics-free) terms, as well as words that differ from the query terms but share the same root:

(Screenshot: search results where the client-side regular expression fails to highlight folded terms and stemmed variants of the query)

A possible solution would be to leverage the built-in highlighting capabilities of Elasticsearch, but that poses a different kind of challenge: highlighting works by returning text snippets in a separate key of the search response, and those snippets are limited to 100 characters by default.
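For reference, enabling highlighting is just a matter of adding a highlight section to the search request; a minimal sketch against our captions index:

```
GET /captions/_search
{
  "query": {
    "match": { "text": "canto" }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}
```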

A search for the term "canto" would return the following response (I'm removing the non-relevant parts):

```
{
  [...]
  "hits": {
    [...]
    "hits": [
      {
        [...]
        "_source": {
          [...]
          "text": "Se verá si estos días se rebaja la tensión y deja de sonar el canto de los hombres enojados Así dice la letra que estos estudiantes hongkoneses han inventando sobre el himno nacional de China en un video que ha circulado por las redes hasta que ha dejado de estar disponible.\n",
          [...]
        },
        "highlight": {
          "text": [
            "Se verá si estos días se rebaja la tensión y deja de sonar el <mark>canto</mark> de los hombres enojados Así dice"
          ]
        },
       [...]
      },
      {
        [...]
        "_source": {
          [...]
          "text": "Así canta Kevin Spacy la bamba, con la tuna de Derecho de Sevilla.\n",
          [...]
        },
        "highlight": {
          "text": [
            "Así <mark>canta</mark> Kevin Spacy la bamba, con la tuna de Derecho de Sevilla."
          ]
        },
       [...]
      },
      {
        [...]
        "_source": {
          [...]
          "text": "No podía ser otro lugar sino el emblemático y legendario Luna Park, donde Gardel cantó por última vez en Buenos Aires, el escenario de la final del mundial de tango, que como en ediciones anteriores mantiene su habitual presencia internacional, con participantes de 36 países, entre ellos España.\n",
          [...]
        },
        "highlight": {
          "text": [
            "No podía ser otro lugar sino el emblemático y legendario Luna Park, donde Gardel <mark>cantó</mark> por última vez"
          ]
        },
       [...]
      },
    ]
  },
  [...]
}
```

The length of the text snippets can be configured, and we could even get the whole field content in the highlighting results. But apart from other considerations like performance (highlighting seems to incur some penalties, at least without specific tuning), we would be forced to reimplement the parsing of the results on the client side of the app.
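For reference, setting number_of_fragments to 0 in the highlight section disables snippeting and returns the whole field content; a sketch of the relevant part of the request:

```
"highlight": {
  "fields": {
    "text": {
      "number_of_fragments": 0
    }
  }
}
```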

So from my point of view, the way to go would be to configure a custom analyzer that uses the built-in Spanish stopwords list for the stop token filter and also includes the asciifolding one, while disregarding any stemmer.
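Something along these lines (the analyzer and filter names are made up; note there is no stemmer in the chain):

```
PUT /captions
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        }
      },
      "analyzer": {
        "captions_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "spanish_stop", "asciifolding"]
        }
      }
    }
  }
}
```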

That wouldn't prevent cases like searching for "niña" and getting results containing "Nina", but all the matching words in the results would be an exact match of the query terms (diacritics aside), and it would be relatively simple to tweak the regular expression on the client side of the app to take all the diacritic variants into account when highlighting the query terms.

@dcabo
Member Author

dcabo commented Feb 27, 2020

@EvaBelmonte @MAGavilanes @javidevega @carmen-tm summing up what Eduardo said: we can now change the search engine configuration so that it ignores accents. It took longer than expected because, among other things, we have to touch how we render the results, but it's done. There may be cases where someone searches for "Cantó" and gets results for "canto", but I think those are niche cases and the overall benefit is greater: "climático" vs "climatico" and all that.

There's just one slightly annoying thing: for Elastic, the tilde on the eñe is also an accent, and it removes it. That is, "niña" and "Nina" are the same. Eduardo and I couldn't find examples where this is a real problem, but just mentioning it.

Are you all OK with making the change, despite the eñe issue?

@civio civio deleted a comment from dcabo Feb 27, 2020
@esebastian
Contributor

esebastian commented Feb 27, 2020

@EvaBelmonte @MAGavilanes @javidevega @carmen-tm @dcabo Besides ignoring accents, we've also configured the search engine to ignore the standard list of Spanish stop words.

The list of words the search engine treats as stop words can be seen here. It's quite extensive, so it wouldn't hurt for you to take a look and see whether the list works for us as it is, whether we should adjust the words it contains (which is possible), or whether we simply shouldn't ignore them at all.
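For reference, adjusting the list would just mean replacing the _spanish_ preset with an explicit array in the stop filter definition; a hypothetical sketch (the words shown are only a few entries from the standard list):

```
"filter": {
  "spanish_stop": {
    "type": "stop",
    "stopwords": ["de", "la", "que", "el", "en", "y"]
  }
}
```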

@mgavilanes

Would it be feasible for the search engine to ignore accents and case by default, but to have two toggles so you can decide whether you want it to be case or accent sensitive? When fine-tuning mentions of, for example, the party 'Ciudadanos' (a proper noun), to distinguish it from the plural noun ('ciudadanos de todo el mundo', 'citizens of the whole world'), capitalization does part of the work (barring transcription errors). I can do it in a CSV or, if the user is more easygoing, ask Verba to do the work. What do you think?

@dcabo
Member Author

dcabo commented Feb 28, 2020 via email

@EvaBelmonte

I've reviewed the stop words and they make sense. I'm not 100% sure about the change, but what I am sure of is that if it's made, we'll have to change the queries in the cards, which are there with their accents and have been used to show how Verba works. In other words: besides changing them in the cards, we'd have to communicate the change to users (on the web, on Twitter, on GitHub...).

@dcabo
Member Author

dcabo commented Mar 2, 2020 via email

@esebastian
Contributor

@dcabo Shall we put this on standby, or close it now and reopen it if we pick it up again in the future?

@dcabo
Member Author

dcabo commented Apr 14, 2020

Pff, Verba, blast from the past, haha. I still need to write to the people who offered to help 😞 Leave it for now, let's see if things get back to normal at some point; I do want to do this.
