Genre Classification



During 2011/12, the Department of Arts and Culture of the South African Government funded a small-scale project on genre classification for document management.

The primary goal of this project was to investigate optimal approaches to genre classification for the ten indigenous South African languages, based on resources under development for the National Centre for HLT (NCHLT).

Secondarily, the project aimed to foster collaboration between the South African and Low Countries’ HLT communities (both academic and commercial), build human capacity, generate new knowledge, and explore new frontiers for commercial application of HLTs in South Africa.

In order to accomplish the above-mentioned primary and secondary aims, the following tasks were undertaken:

  • We investigated appropriate ontologies and optimal supervised and unsupervised machine learning methods for the development of genre classifiers, specifically for resource-scarce languages (documented in a master's dissertation and in a scholarly publication - see outputs);
  • We developed genre classifiers (and their associated resources) for the ten official indigenous languages of South Africa (available here);
  • We implemented these classifiers as a web-based demo, where users can either upload a document or provide a URL for classification (depending on the chosen genre classification ontology - see here); and
  • We organised a training event on "New Applications of Automatic Text Categorization", presented by Prof Walter Daelemans on 25 January 2012 at the CSIR, Pretoria (description available here, and slides available here or here).
The project was executed and managed by Trifonius, in collaboration with a number of partners.

Outputs are available here, including:
  • Protocol: Manual classification
  • Protocol: Evaluation
  • Info: Data sizes and classification ontology
  • Data: Final versions
  • Publications: Conference publication
    • Snyman, DP, Van Huyssteen, GB & Daelemans, W. 2011. Automatic genre classification for resource scarce languages. In:  Proceedings of the 2011 Conference of the Pattern Recognition Association of South Africa. ISBN: 978-0-620-51914-4. 22-25 November. Vanderbijlpark, South Africa. pp. 132-137. [pdf]

As part of the project, a tutorial on New Applications of Automatic Text Categorization was hosted in January 2012.

The final version of a web-based demonstrator is available here.

Tutorial: New Applications of Automatic Text Categorisation

Presenter: Prof Walter Daelemans (University of Antwerp, Belgium)
Date: 25 January 2012
Time: 09:00-16:00
Place: Knowledge Commons, CSIR, Pretoria
Cost: Free
Tutorial description: New Applications of Automatic Text Categorization
Automatic text categorization is a mature language technology that is able to sort documents into different categories on the basis of examples. Its applications range from e-mail routing and spam filtering to topic detection and text genre assignment. A text categorization system incorporates an approach to document representation (mostly a set of relevant terms or n-grams of words found in the document) and a machine learning method. In the first part of the tutorial, this basic architecture was described at an introductory level, and an overview of state-of-the-art document representation and machine learning methods was presented.
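The two-component architecture described above (a bag-of-words document representation plus a machine learner) can be illustrated with a minimal sketch. It uses a Naive Bayes learner, a common baseline choice; the genre labels and example documents below are invented for illustration and are not project data.

```python
# Minimal sketch of a text categorization system: bag-of-words
# representation + Naive Bayes learner. Toy data, for illustration only.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(labeled_docs):
    """Collect per-genre word counts and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, genre in labeled_docs:
        doc_counts[genre] += 1
        for w in tokenize(text):
            word_counts[genre][w] += 1
            vocab.add(w)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the genre with the highest (add-one smoothed) log probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_genre, best_score = None, float("-inf")
    for genre in doc_counts:
        score = math.log(doc_counts[genre] / total_docs)  # log prior
        genre_total = sum(word_counts[genre].values())
        for w in tokenize(text):
            score += math.log((word_counts[genre][w] + 1) /
                              (genre_total + len(vocab)))
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre

docs = [
    ("the court ruled that the act is unconstitutional", "legal"),
    ("parliament passed the act after the court hearing", "legal"),
    ("the team won the match in the final minute", "sport"),
    ("the coach praised the team after the match", "sport"),
]
model = train(docs)
print(classify("the court and the act", model))
print(classify("the team and the match", model))
```

In a real system, the tokenizer, feature set (e.g. word or character n-grams), and learner would each be swappable; the project compared several such configurations for the resource-scarce setting.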
In the second part of the tutorial, we focused in more technical detail on a new application area of this technology: automatic profiling of text. In this application, we are interested in which metadata we can infer from a document. More specifically, we are interested in how far we can get with text categorisation techniques in tasks like the following:
(i) Text profiling: predicting age, gender, personality, and region of the author of the text.
(ii) Intrinsic plagiarism detection: finding passages in text not written by the author.
(iii) Deception detection: finding out whether reviews and reports are truthful, detecting pedophile grooming in social networks etc.
In order to achieve this, we need document representations that are different from those used in other applications: instead of (patterns of) content words, we need other linguistic categories, and for some of the tasks we need special-purpose machine learning algorithms, such as Koppel et al.'s unmasking algorithm.
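The shift away from content words can be sketched with a toy style-feature extractor: it counts only function words, whose relative frequencies are largely topic-independent and therefore more revealing of authorial style. The function-word list below is a small illustrative sample, not the feature set used in the tutorial.

```python
# Sketch of a style-based document representation for text profiling:
# relative frequencies of a few English function words, rather than
# content words. Illustrative only; real systems use far richer features.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "is", "it", "not"]

def style_vector(text):
    """Return relative frequencies of the listed function words."""
    tokens = text.lower().split()
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return [tokens.count(w) / n for w in FUNCTION_WORDS]

print(style_vector("It is not the case that the review is fake."))
```

A vector like this could be fed to any of the learners discussed in the first part of the tutorial, replacing the bag-of-content-words representation.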
This workshop was hosted and organised by Trifonius, and was made possible through funding by the National Centre for Human Language Technologies of the Department of Arts and Culture, and a financial contribution by the Human Language Technology Competence Area of the CSIR. The workshop was attended by eleven scholars and students.
Slides of the workshop are available here.