Commit

Update README.md
Rafi TRAD authored Oct 8, 2020
1 parent bd4a663 commit f8b1681
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -1,5 +1,7 @@
# Authorial Clustering of Shorter Texts With Non-parametric Topic Models

The digital age has engulfed us in data of various types, but not all data are innocuous. Adverse data can be detrimental to individuals and nations alike and thus require serious intervention. In that regard, authenticating data and ascribing them to their producers can help marginalise harmful data and advance digital forensics. In the context of textual data, topic modelling has proved useful in one authorship analysis task, namely authorship verification, and this study is the first to assess the feasibility of topic modelling for the related authorial clustering task.
Authorial clustering involves grouping documents written by the same author or team of authors without any prior positive examples of an author’s writing style or thematic preferences. For authorial clustering on shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important: very high-dimensional feature spaces lead to data sparsity and the curse of dimensionality, while feature selection may lead to information loss. I programmed a high-level framework that uses a compact data representation in a latent feature space derived with non-parametric topic modeling. Authorial clusters are then identified in two scenarios: (a) fully unsupervised and (b) semi-supervised, where a small number of shorter texts are known to belong to the same author (must-link constraints) or to different authors (cannot-link constraints).
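
To make the must-link and cannot-link constraints concrete, here is a minimal sketch in Python. It is not the framework's code: the function names and the toy constraint lists are hypothetical. It assumes only that a constraint is a pair of document indices and that must-link constraints are transitive. It merges must-linked texts with union-find and then reports any cannot-link pair that ends up in the same group.

```python
from collections import defaultdict


def must_link_groups(n_docs, must_link):
    """Merge documents joined by must-link constraints using union-find."""
    parent = list(range(n_docs))

    def find(i):
        # Walk up to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for doc in range(n_docs):
        groups[find(doc)].append(doc)
    return sorted(groups.values())


def violated_cannot_links(groups, cannot_link):
    """Return the cannot-link pairs that fall inside a single group."""
    group_of = {doc: idx for idx, group in enumerate(groups) for doc in group}
    return [(a, b) for a, b in cannot_link if group_of[a] == group_of[b]]


# Hypothetical toy problem: six short texts, indexed 0..5.
must_link = [(0, 1), (1, 2), (4, 5)]
cannot_link = [(0, 3), (2, 4)]

groups = must_link_groups(6, must_link)
conflicts = violated_cannot_links(groups, cannot_link)
print(groups)     # [[0, 1, 2], [3], [4, 5]]
print(conflicts)  # []
```

A constrained clustering step could start from these groups as seeds and refuse any merge that would place a cannot-link pair in the same cluster. How the framework itself enforces constraints is not stated here.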

We propose a language-independent approach to authorial clustering of shorter, paragraph-length texts. The approach uses non-parametric topic models with a straightforward term weighting scheme to infer a low-dimensional, less noisy latent semantic space representation (LSSR) of texts, which enables some traditional clustering algorithms to work more effectively. A simple workflow and an elaborate workflow are assessed against naive and state-of-the-art baselines, using 120 authorial clustering problems that span three languages and two genres.
Experiments with 120 collections in three languages and two genres show that the topic-based latent feature space provides a promising level of performance while reducing the dimensionality by a factor of 1500 compared with state-of-the-art approaches. I also found that while prior knowledge of the precise number of authors (i.e. authorial clusters) adds little to clustering quality, even a little knowledge of constraints on authorial cluster membership leads to clear performance improvements on this difficult task.

In the end, thorough experimentation with standard metrics indicates that there remains ample room for improvement in authorial clustering, especially with shorter texts.
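
The text above does not name the standard metrics it uses. As an illustration only, the sketch below computes BCubed precision, recall and F1, a metric commonly used to score a clustering against known authors. The choice of BCubed, the function name `bcubed` and the toy labels are all assumptions, not details taken from this project.

```python
def bcubed(predicted, truth):
    """Return BCubed (precision, recall, F1) for two flat clusterings.

    `predicted` and `truth` are equal-length lists of labels, one per document.
    """
    n = len(predicted)
    precision = recall = 0.0
    for i in range(n):
        # Documents placed in the same predicted cluster as document i.
        same_cluster = [j for j in range(n) if predicted[j] == predicted[i]]
        # Documents that truly share an author with document i.
        same_author = [j for j in range(n) if truth[j] == truth[i]]
        # Cluster mates of i that really share its author (includes i itself).
        correct = sum(1 for j in same_cluster if truth[j] == truth[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_author)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy problem: four texts by two authors, with one text misplaced.
truth = ["a", "a", "b", "b"]
predicted = [0, 0, 0, 1]
p, r, f = bcubed(predicted, truth)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.75 0.706
```

Precision falls when a cluster mixes authors, and recall falls when one author's texts are split across clusters, so the F1 score penalises both kinds of error.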
