Commit f8b1681 (1 parent: bd4a663), authored by Rafi TRAD on Oct 8, 2020
Showing 1 changed file with 4 additions and 2 deletions.
@@ -1,5 +1,7 @@
# Authorial Clustering of Shorter Texts With Non-parametric Topic Models

The digital age has engulfed us in data of various types, but not all data are innocuous. Adverse data can be detrimental to individuals and nations alike and thus require serious intervention. In that regard, authenticating data and ascribing them to their producers can aid in marginalising harmful data and advancing digital forensics. In the context of textual data, topic modelling has proved useful in one authorship analysis task, namely authorship verification, and this study is the first to assess the feasibility of topic modelling for the related authorial clustering task.
Authorial clustering involves grouping documents written by the same author or team of authors without any prior positive examples of an author’s writing style or thematic preferences. For authorial clustering on shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important: very high-dimensional feature spaces lead to data sparsity and suffer from the curse of dimensionality, while feature selection may lead to information loss. I programmed a high-level framework which utilizes a compact data representation in a latent feature space derived with non-parametric topic modeling. Authorial clusters are then identified in two scenarios: (a) fully unsupervised and (b) semi-supervised, where a small number of shorter texts are known to belong to the same author (must-link constraints) or to different authors (cannot-link constraints).

We propose a language-independent approach to authorial clustering on shorter, paragraph-long texts. The approach utilises non-parametric topic models with a straightforward term weighting scheme to infer a low-dimensional, less noisy latent semantic space representation (LSSR) of texts, which enables some traditional clustering algorithms to work more effectively. A simple and an elaborate workflow are assessed against naive and state-of-the-art baselines, using 120 authorial clustering problems which span three languages and two genres.
Experiments with 120 collections in three languages and two genres show that the topic-based latent feature space provides a promising level of performance while reducing the dimensionality by a factor of 1500 compared to state-of-the-art approaches. I also found that while prior knowledge of the precise number of authors (i.e. authorial clusters) does not contribute much additional quality, even a little knowledge of constraints on authorial cluster membership leads to clear performance improvements on this difficult task.

In the end, thorough experimentation with standard metrics indicates that there remains ample room for improvement in authorial clustering, especially with shorter texts.
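
The must-link and cannot-link constraints mentioned in the README text can be illustrated with a small sketch. It is not taken from the repository: the function names `must_link_groups` and `violated_constraints`, the integer text indices, and the toy labels are all assumptions made for illustration. It only shows how pairwise constraints might be represented and checked against a candidate clustering, not how the framework itself uses them.

```python
def must_link_groups(n_texts, must_link):
    """Merge texts joined by must-link pairs into groups (union-find).

    Hypothetical helper: texts are identified by indices 0..n_texts-1.
    """
    parent = list(range(n_texts))

    def find(i):
        # Follow parent pointers to the group representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n_texts):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def violated_constraints(labels, must_link, cannot_link):
    """Return the constraint pairs that a cluster labelling breaks."""
    broken_must = [(a, b) for a, b in must_link if labels[a] != labels[b]]
    broken_cannot = [(a, b) for a, b in cannot_link if labels[a] == labels[b]]
    return broken_must, broken_cannot


# Toy example (assumed data): five short texts, three pairwise constraints.
must_link = [(0, 1), (1, 2)]    # texts 0, 1 and 2 share an author
cannot_link = [(0, 3)]          # texts 0 and 3 have different authors
labels = [0, 0, 1, 0, 2]        # a candidate clustering to check

print(must_link_groups(5, must_link))
print(violated_constraints(labels, must_link, cannot_link))
```

Must-link pairs are transitive, so merging them into groups first gives the smallest units a constrained clustering algorithm may not split. The violation check is one simple way to audit any labelling against the supplied constraints.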