Authorship Verification of Yorùbá Blog Posts using Character N-grams
Date
2020
Publisher
ICMCECS (IEEE)
Abstract
Authorship verification is the task of determining whether two or more documents were written by the same author. N-grams are sequences of elements that appear one after another in a text; the elements can be words, POS tags, characters, or other units. Authorship verification is challenging because it must decide whether the text in question closely matches the style of the target author. This paper presents an authorship verification task on Yorùbá blog posts. Character n-gram features were extracted from the corpus, and inductive learning techniques were applied to build feature-based models for automatic author identification. The K-means clustering algorithm was used because supervised algorithms cannot be applied to the one-class setting of the dataset. The clustering was evaluated with the Silhouette Coefficient, a measure suited to unlabeled data. The coefficient obtained is positive, which indicates that the data points are strongly related within the dataset. This is interpreted as a "yes" answer to the verification question: the posts were written by the same author.
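The pipeline described in the abstract (character n-grams, K-means, Silhouette Coefficient) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the n-gram length of 3, the relative-frequency weighting, the Euclidean distance, and the choice of seeding centroids with the first k vectors are all assumptions made here for the example.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text (n=3 is an assumed default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def vectorize(texts, n=3):
    """Map texts to relative-frequency vectors over a shared n-gram vocabulary."""
    counts = [char_ngrams(t, n) for t in texts]
    vocab = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # guard against texts shorter than n
        vectors.append([c[g] / total for g in vocab])
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k=2, iterations=20):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda j: distance(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(vectors, labels):
    """Mean silhouette coefficient over all points; values lie in [-1, 1]."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores = []
    for i, v in enumerate(vectors):
        same = [distance(v, w) for j, w in enumerate(vectors)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        a = sum(same) / len(same)  # mean distance within the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(distance(v, w) for w, lab in zip(vectors, labels) if lab == other)
            / labels.count(other)
            for other in clusters if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

With a list of post texts in a hypothetical variable `posts`, the sketch would be used as `vectors = vectorize(posts)`, `labels = kmeans(vectors, k=2)`, and `silhouette(vectors, labels)`. A coefficient near 1 means points sit much closer to their own cluster than to the nearest other one, a value near 0 means the clusters overlap, and a negative value suggests points assigned to the wrong cluster.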
Description
The article presents the authorship verification of Yorùbá online posts. The experimental results show a perfect relationship in the dataset. The study indicates that the contents (data) of the posts are from the same author and not from different authors, as earlier thought. The authors suggest that further techniques can be applied for additional evaluation. They also found that using unprocessed data will, most of the time, give a low or misleading result. Further work should add more data, generate additional features, and use other ML techniques to confirm the result.
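The remark about unprocessed data can be illustrated with a small preprocessing sketch. The record does not say which preprocessing steps the study applied, so the steps below (Unicode NFC normalization, lowercasing, whitespace collapsing) and the function name `preprocess` are assumptions chosen for illustration. Normalization matters for Yorùbá in particular because a tone-marked letter can be encoded either as one precomposed character or as a base letter plus a combining mark, and the two encodings would otherwise produce different character n-grams for identical-looking text.

```python
import re
import unicodedata


def preprocess(text):
    """Normalize a post before n-gram extraction (assumed steps, not from the paper)."""
    # Compose base letters with their combining tone marks into single code points.
    text = unicodedata.normalize("NFC", text)
    # Fold case so capitalization does not split otherwise identical n-grams.
    text = text.lower()
    # Collapse runs of whitespace, including newlines, to a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Applying `preprocess` to every post before extracting n-grams keeps encoding and layout differences from being mistaken for differences in writing style.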
Citation
7