During 2011/12, the Department of Arts and Culture of the South African Government funded a small-scale project on genre classification for document management.
The primary goal of this project was to investigate optimal approaches to genre classification for the ten indigenous South African languages, based on resources under development for the National Centre for HLT (NCHLT).
Secondarily, the project aimed to foster collaboration between the South African and Low Countries’ HLT communities (both academic and commercial), build human capacity, generate new knowledge, and explore new frontiers for commercial application of HLTs in South Africa.
To accomplish these primary and secondary aims, the following tasks were undertaken:
- We investigated appropriate ontologies and optimal supervised and unsupervised machine learning methods for the development of genre classifiers, specifically for resource-scarce languages (information captured in a master’s dissertation, and in a scholarly publication - see outputs);
- We developed genre classifiers (and their associated resources) for the ten official indigenous languages of South Africa (available here);
- We implemented these classifiers as a web-based demo, where users can either upload a document or provide a URL for classification (depending on the chosen genre classification ontology - see here); and
- We organised a training event on "New Applications of Automatic Text Categorization", presented by Prof Walter Daelemans on 25 January 2012 at the CSIR, Pretoria (description available here, and slides available here or here).
- Protocol: Manual classification
- Protocol: Evaluation
- Info: Data sizes and classification ontology
- Data: Final versions
- Publications: Conference publication
- Snyman, DP, Van Huyssteen, GB & Daelemans, W. 2011. Automatic genre classification for resource scarce languages. In: Proceedings of the 2011 Conference of the Pattern Recognition Association of South Africa. ISBN: 978-0-620-51914-4. 22-25 November. Vanderbijlpark, South Africa. pp. 132-137. [pdf]
As part of the project, a tutorial on New Applications of Automatic Text Categorization was hosted in January 2012.
Tutorial: New Applications of Automatic Text Categorisation
(i) Text profiling: predicting age, gender, personality, and region of the author of the text.
(ii) Intrinsic plagiarism detection: finding passages in text not written by the author.
(iii) Deception detection: determining whether reviews and reports are truthful, detecting pedophile grooming in social networks, etc.
To achieve this, we need document representations that differ from those used in other applications: instead of (patterns of) content words, we need other linguistic categories. Some of the tasks also require special-purpose machine learning algorithms, such as Koppel et al.'s unmasking algorithm.
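As a rough illustration of a document representation that is not built on content words, the sketch below turns a text into relative frequencies of function words. It is only a minimal example under stated assumptions: the word list `FUNCTION_WORDS` and the helper `style_profile` are hypothetical names chosen for illustration and are not taken from the tutorial, and real profiling systems would typically use much richer linguistic categories.

```python
import re
from collections import Counter

# Hypothetical, illustrative list of English function words (an assumption,
# not a resource from the tutorial or the project).
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "i", "you", "that", "it"]


def style_profile(text):
    """Return the relative frequency of each function word in `text`.

    The result is a list aligned with FUNCTION_WORDS, so two documents can be
    compared position by position regardless of their topic.
    """
    # Lowercase the text and keep only alphabetic tokens (plus apostrophes).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Avoid division by zero for empty input.
    total = len(tokens) or 1
    return [counts[word] / total for word in FUNCTION_WORDS]


profile = style_profile("The cat sat in the hat, and I saw it.")
for word, freq in zip(FUNCTION_WORDS, profile):
    print(f"{word}: {freq:.2f}")
```

Because the vector ignores content words, it reflects how a text is written rather than what it is about, which is the kind of signal tasks such as author profiling rely on. Vectors like this could then be passed to any standard classifier.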