v0.4

@parpalak parpalak released this 06 Jan 12:14
· 79 commits to master since this release

Release features

  • Updated DB structure (a PdoStorage::erase() call is required when updating):
    1. Improved indexing speed and reduced index disk usage in the DB (by about 1.5 times).
    2. Added storage of some meta-information (currently the word count) for indexed texts.
  • Revised the algorithm for calculating relevance. The following factors are now taken into account:
    1. The abundance of words when calculating pairwise relevance (proximity relevance).
    2. The size of the indexed text (see below).
  • Improved the algorithm for choosing sentences for snippets (the abundance of words is taken into account, see #20).
  • Refinements to the Russian stemmer.

The size of the indexed text affects relevance
In this release, the size of the indexed text itself has some impact on relevance. Texts of medium size (300 to 350 words) are preferred, although factors like the number of occurrences and word frequency are more important. This is based on the assumption that a text that is too short cannot fully convey a thought or concept, and a text that is too long may contain multiple thoughts or concepts. This is how word count affects the increase in relevance:

[Graph: increase in relevance as a function of word count]
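
The exact curve is only shown in the graph and is not given in the text, so the following is a minimal sketch of the idea rather than the library's actual formula. It assumes a bell-shaped boost on a logarithmic word-count scale that peaks at 325 words (the middle of the preferred 300 to 350 range). The function names `size_factor` and `adjusted_relevance`, and the parameters `peak`, `sigma`, and `amplitude`, are hypothetical and chosen only for illustration.

```python
import math


def size_factor(word_count: int, peak: float = 325.0,
                sigma: float = 1.0, amplitude: float = 0.2) -> float:
    """Hypothetical multiplicative boost for a text of `word_count` words.

    Assumption: a Gaussian bump in log(word_count) centred on `peak`.
    The real release may use a different shape and different constants.
    """
    if word_count <= 0:
        # No words, no boost.
        return 1.0
    # Distance from the preferred size on a logarithmic scale, so that
    # "half as long" and "twice as long" are penalised symmetrically.
    deviation = math.log(word_count / peak)
    return 1.0 + amplitude * math.exp(-(deviation ** 2) / (2 * sigma ** 2))


def adjusted_relevance(base_relevance: float, word_count: int) -> float:
    """Scale a relevance score (from occurrences, frequency, proximity)
    by the size-based factor. The base score still dominates because the
    boost is bounded by 1 + amplitude."""
    return base_relevance * size_factor(word_count)


if __name__ == "__main__":
    for wc in (20, 325, 5000):
        print(wc, round(size_factor(wc), 3))
```

Keeping the boost small and bounded reflects the note above that the number of occurrences and word frequency remain the more important factors.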