- Dutch Language Union (Belgium, The Netherlands)
- Department of Arts and Culture (South Africa)
Gerhard B van Huyssteen – Project leader and linguistics
- CTexT (Centre for Text Technology), North-West University, South Africa
Walter Daelemans – Computational linguistics
- CLiPs (Computational Linguistics and Psycholinguistics), University of Antwerp, Belgium
- CTexT (Centre for Text Technology), North-West University, Potchefstroom, South Africa
During 2011/12, the Department of Arts and Culture of the South African Government funded a small-scale project on genre classification for document management.
During the project, the following tasks were undertaken:
- We investigated appropriate ontologies and optimal supervised and unsupervised machine learning methods for the development of genre classifiers, specifically for resource-scarce languages (information captured in a master’s dissertation, and in a scholarly publication);
- We developed genre classifiers (and their associated resources) for the ten official indigenous languages of South Africa (available here);
- We implemented these classifiers as a web-based demo, where users can either upload a document or provide a URL for classification (depending on the chosen genre classification ontology); and
- We organised a training event on "New Applications of Automatic Text Categorization", presented by Prof Walter Daelemans on 25 January 2012 at the CSIR, Pretoria.
The primary aim of this project was to develop resources (including annotation protocols, and training and testing data) for the development of:
- automatic genre classifiers for ten South African languages.
Secondary aims included:
- to report on the research and development process in the form of:
  - one Master’s degree dissertation;
  - at least two scholarly papers, to be published in relevant journals or peer-reviewed conference proceedings;
  - various annotation protocols, made available publicly;
- to contribute towards human capital development and growth of the pool of experts in descriptive linguistics and computational linguistics in South Africa, Belgium and The Netherlands by offering bursaries, grants or contract work to undergraduate and post-graduate students;
- to extend the collaboration network between Trifonius, North-West University (NWU) and University of Antwerp (UA), by introducing young scholars and students to each other;
- to identify new research issues as they unfold in the research and development process; and
- to contribute to the HLT-enabling of the languages of South Africa.
Snyman, DP, Van Huyssteen, GB & Daelemans, W. 2014. Outomatiese Genreklassifikasie vir Afrikaans [Automatic genre classification for Afrikaans]. DOI: 10.4102/satnt.v33i1.759. Suid-Afrikaanse Tydskrif vir Natuurwetenskap en Tegnologie. 33(1): 12 pp.
Snyman, D, Van Huyssteen, GB & Daelemans, W. 2012. Cross-Lingual Genre Classification for Closely Related Languages. In: Proceedings of the Twenty-Third Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-620-54601-0. 29-30 November. Pretoria, South Africa. pp. 133-137.
Snyman, DP, Van Huyssteen, GB & Daelemans, W. 2011. Automatic genre classification for resource scarce languages. In: Proceedings of the 2011 Conference of the Pattern Recognition Association of South Africa. ISBN: 978-0-620-51914-4. 22-25 November. Vanderbijlpark, South Africa. pp. 132-137.
Genre classification corpora for South African languages 1.0. (Project leader, with Walter Daelemans as co-project leader, and Dirk Snyman as main collaborator and scientific programmer). Potchefstroom: NWU.
- Corpora that can be used to train genre classifiers for South African languages.
- Afrikaans Genre Classification Corpus (ISLRN: 666-908-651-526-7)
- isiNdebele Genre Classification Corpus (ISLRN: 248-916-003-745-6)
- isiXhosa Genre Classification Corpus (ISLRN: 418-998-894-930-1)
- isiZulu Genre Classification Corpus (ISLRN: 457-135-629-106-1)
- Sesotho Genre Classification Corpus (ISLRN: 469-495-440-934-0)
- Sesotho sa Leboa Genre Classification Corpus (ISLRN: 676-872-880-082-8)
- Setswana Genre Classification Corpus (ISLRN: 921-735-738-409-8)
- Siswati Genre Classification Corpus (ISLRN: 718-674-341-027-9)
- Tshivenda Genre Classification Corpus (ISLRN: 098-827-706-093-4)
- Xitsonga Genre Classification Corpus (ISLRN: 210-849-527-713-3)
- Cite as: Snyman, D, Van Huyssteen, GB & Daelemans, W. 2012. Genre classification corpora for South African languages 1.0. Potchefstroom: North-West University. Available at gcsal.sf.net.
(ii) Intrinsic plagiarism detection: finding passages in text not written by the author.
(iii) Deception detection: finding out whether reviews and reports are truthful, detecting pedophile grooming in social networks, etc.
To achieve this, we need document representations that differ from those used in other applications: instead of (patterns of) content words, we need other linguistic categories. Some of the tasks also call for special-purpose machine learning algorithms, such as Koppel et al.'s unmasking algorithm.
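As a rough illustration of a representation that is not based on content words, the sketch below turns a text into relative frequencies of a few function words. It is only a minimal example under stated assumptions: the word list `FUNCTION_WORDS` and the helper `function_word_profile` are hypothetical names chosen here, the list is a tiny English sample, and nothing in it reflects the features or algorithms actually used in the project or in the training event.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny English function-word list (illustration only).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "i")


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in `text`.

    The result is a fixed-length list (one value per vocabulary word), so
    texts of different lengths become comparable feature vectors.
    """
    # Very simple tokenisation: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    return [counts[word] / total for word in vocabulary]


profile = function_word_profile("The cat sat in the hat, and it is a fact.")
```

A vector like `profile` could then be passed to any standard classifier. Because it ignores topic-bearing words, it tends to capture style rather than subject matter, which is the kind of signal the tasks above depend on.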
Final version of web-based demonstrator.