Review handling of accented characters and/or stop words #35

Open
dcabo opened this issue Oct 22, 2019 · 9 comments

Comments

@dcabo
Member

dcabo commented Oct 22, 2019

At the moment, "Bárcenas" and "Barcenas" return different search results. We can configure an "analyzer" at index time to remove accents and stop words, based on the language. Documented here.

@esebastian esebastian self-assigned this Dec 10, 2019
@dcabo dcabo added this to the Launch milestone Feb 4, 2020
@dcabo dcabo modified the milestones: Launch, Verba 1.1 Feb 18, 2020
@esebastian
Contributor

This has turned out to be slightly more challenging than expected, as there are multiple trade-offs we need to consider.

The basics are that we need to leverage the stop and asciifolding token filters, which respectively remove stop words from the text and fold any diacritics in words. But in order to do that we need to create a custom analyzer, because it isn't possible to simply add a filter to an existing built-in analyzer.
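A minimal sketch of what that could look like, assuming we recreate the captions index (the analyzer name here is made up, and the stop filter defaults to the English stopword list unless configured otherwise):

```
PUT /captions
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "asciifolding"]
        }
      }
    }
  }
}
```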

At the moment (as we didn't specify anything when we created the captions index) we're using the built-in standard analyzer, which provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and includes the lowercase token filter.

We could configure a diacritics-insensitive analyzer by recreating the built-in standard analyzer from its actual configuration and adding the asciifolding token filter. However, given that all the text we're working with is in Spanish, it would make sense to use the built-in spanish analyzer as the starting point instead. The catch is that the spanish analyzer uses a specific Spanish stemmer (spanish or light_spanish), which can produce some potentially undesirable results, as the search terms would be reduced to their root forms.
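For illustration, recreating the spanish analyzer along the lines of the reference implementation in the Elasticsearch docs and slotting asciifolding into the chain would look roughly like this (the keyword_marker filter from the docs is omitted for brevity, and the filter names are illustrative):

```
PUT /captions
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "rebuilt_spanish": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "spanish_stop", "asciifolding", "spanish_stemmer"]
        }
      }
    }
  }
}
```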

For reference, given the text "Is this a déjà vu? La piragua baja por el río Segura y la Niña Pastori canta por soleares en el château; mientras la cigüeña vuela hacia el sur.", the standard-based custom analyzer would return the tokens [is, this, a, deja, vu, la, piragua, baja, por, el, rio, segura, y, la, nina, pastori, canta, por, soleares, en, el, chateau, mientras, la, ciguena, vuela, hacia, el, sur] while the spanish-based custom analyzer would return [is, this, deja, vu, piragu, baja, rio, segur, nina, pastori, cant, solear, chateau, mientr, ciguen, vuel, haci, sur].
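Both token streams can be reproduced with the _analyze API by listing the chain inline; for example, for the standard-based variant:

```
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Is this a déjà vu? La piragua baja por el río Segura y la Niña Pastori canta por soleares en el château; mientras la cigüeña vuela hacia el sur."
}
```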

Having the search terms reduced to their root forms would mean that searching for the term "cantó" would return results including "canto", but also "canta", "cante" or "cantos".

There is also another angle to consider, related to the way we're currently highlighting the query terms in the search results. At the moment we apply a simple regular expression on the client side of the app to wrap the query terms in custom markup, but that regular expression would miss the folded (diacritics-free) terms, as well as words that differ from the query terms but share the same root:

(Screenshot: search results where the client-side regular expression fails to highlight folded terms and stemmed variants of the query)

A possible solution would be to leverage the built-in highlighting capabilities of Elasticsearch, but that poses a different kind of challenge: highlighting works by returning text snippets in a separate key of the search response, and those snippets are limited to 100 characters by default.
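For reference, enabling highlighting is just a matter of adding a highlight section to the search request; a minimal sketch against our captions index:

```
GET /captions/_search
{
  "query": {
    "match": { "text": "canto" }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}
```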

A search for the term "canto" would return the following response (I'm removing the non-relevant parts):

```
{
  [...]
  "hits": {
    [...]
    "hits": [
      {
        [...]
        "_source": {
          [...]
          "text": "Se verá si estos días se rebaja la tensión y deja de sonar el canto de los hombres enojados Así dice la letra que estos estudiantes hongkoneses han inventando sobre el himno nacional de China en un video que ha circulado por las redes hasta que ha dejado de estar disponible.\n",
          [...]
        },
        "highlight": {
          "text": [
            "Se verá si estos días se rebaja la tensión y deja de sonar el <mark>canto</mark> de los hombres enojados Así dice"
          ]
        },
       [...]
      },
      {
        [...]
        "_source": {
          [...]
          "text": "Así canta Kevin Spacy la bamba, con la tuna de Derecho de Sevilla.\n",
          [...]
        },
        "highlight": {
          "text": [
            "Así <mark>canta</mark> Kevin Spacy la bamba, con la tuna de Derecho de Sevilla."
          ]
        },
       [...]
      },
      {
        [...]
        "_source": {
          [...]
          "text": "No podía ser otro lugar sino el emblemático y legendario Luna Park, donde Gardel cantó por última vez en Buenos Aires, el escenario de la final del mundial de tango, que como en ediciones anteriores mantiene su habitual presencia internacional, con participantes de 36 países, entre ellos España.\n",
          [...]
        },
        "highlight": {
          "text": [
            "No podía ser otro lugar sino el emblemático y legendario Luna Park, donde Gardel <mark>cantó</mark> por última vez"
          ]
        },
       [...]
      },
    ]
  },
  [...]
}
```

The length of the text snippets can be configured, and we could even get the whole field content in the highlighting results. But apart from other considerations like performance (highlighting seems to incur some penalties, at least without specific tuning), we would be forced to reimplement the parsing of the results on the client side of the app.
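For reference, setting number_of_fragments to 0 in the highlight section disables snippeting and returns the whole field content; a sketch of the relevant part of the request:

```
"highlight": {
  "fields": {
    "text": {
      "number_of_fragments": 0
    }
  }
}
```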

So from my point of view, the way to go would be to configure a custom analyzer that uses the built-in Spanish stopwords list for the stop token filter and also includes the asciifolding one, while disregarding any stemmer.
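Something along these lines (the analyzer and filter names are made up; note there is no stemmer in the chain):

```
PUT /captions
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        }
      },
      "analyzer": {
        "captions_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "spanish_stop", "asciifolding"]
        }
      }
    }
  }
}
```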

That wouldn't prevent cases like searching for "niña" and getting results containing "Nina", but all the matching words in the results would be an exact match of the query terms (diacritics aside), and it would be relatively simple to tweak the regular expression on the client side of the app to take all the diacritic variants into account when highlighting the query terms.

@dcabo
Member Author

dcabo commented Feb 27, 2020

@EvaBelmonte @MAGavilanes @javidevega @carmen-tm summing up what Eduardo said: we can now change the search engine configuration so that it ignores accents. It took longer than expected because, among other things, we have to touch how we render the results, but it's done. There may be cases where someone searches for "Cantó" and gets results for "canto", but I think those are niche cases and the overall benefit is greater: "climático" vs "climatico" and all that.

There's just one slightly annoying thing: for Elastic, the tilde on the eñe is also an accent, and it removes it. That is, "niña" and "Nina" are the same. Eduardo and I couldn't find examples where this is a real problem, but just mentioning it.

Are you all OK with making the change, despite the eñe issue?

@civio civio deleted a comment from dcabo Feb 27, 2020
@esebastian
Contributor

esebastian commented Feb 27, 2020

@EvaBelmonte @MAGavilanes @javidevega @carmen-tm @dcabo Besides ignoring accents, we've also configured the search engine to ignore the standard list of Spanish stop words.

The list of words the search engine treats as stop words can be seen here. It's quite extensive, so it wouldn't hurt for you to take a look and see whether the list works for us as it is, whether we should adjust the words it contains (which is possible), or whether we simply shouldn't ignore them at all.
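For reference, adjusting the list would just mean replacing the _spanish_ preset with an explicit array in the stop filter definition; a hypothetical sketch (the words shown are only a few entries from the standard list):

```
"filter": {
  "spanish_stop": {
    "type": "stop",
    "stopwords": ["de", "la", "que", "el", "en", "y"]
  }
}
```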

@mgavilanes

Would it be feasible for the search engine to ignore accents and case by default, but to have two toggles so you can decide whether you want it to be case or accent sensitive? When fine-tuning mentions of, for example, the party 'Ciudadanos' (a proper noun), to distinguish it from the plural noun ('ciudadanos de todo el mundo', 'citizens of the whole world'), capitalization does part of the work (barring transcription errors). I can do it in a CSV or, if the user is more easygoing, ask Verba to do the work. What do you think?

@dcabo
Member Author

dcabo commented Feb 28, 2020 via email

@EvaBelmonte

I've reviewed the stop words and they make sense. I'm not 100% sure about the change, but what I am sure of is that if it's made, we'll have to change the queries in the cards, which are there with their accents and have been used to show how Verba works. In other words: besides changing them in the cards, we'd have to communicate the change to users (on the web, on Twitter, on GitHub...).

@dcabo
Member Author

dcabo commented Mar 2, 2020 via email

@esebastian
Contributor

@dcabo Shall we put this on standby, or close it now and reopen it if we pick it up again in the future?

@dcabo
Member Author

dcabo commented Apr 14, 2020

Pff, Verba, blast from the past, haha. I still need to write to the people who offered to help 😞 Leave it for now, let's see if things get back to normal at some point; I do want to do this.
