Call for papers for the NUSA special issue
"Linguistic studies using large annotated corpora"

Editors: Hiroki Nomoto and David Moeljadi

Background

Corpora have been widely used in modern linguistic research. Two notable trends in corpus development in recent years are a significant increase in size and the addition of various kinds of annotation. Billion-word corpora are not uncommon nowadays. Efforts have been made to enrich raw texts with linguistic information such as morphology, parts of speech, constituent structure, semantic dependency, information- and discourse-structural status and so on. However, these developments, which took place primarily in the field of natural language processing, have not been fully utilised in linguistic research on the languages of Nusantara.

This special issue of NUSA: Linguistic studies of languages in and around Indonesia is intended to encourage researchers to explore the available resources and share ways of using them to investigate old and new empirical and theoretical topics.

Important dates

~~15 March~~ 10 May 2019	Manuscript submission deadline
Mid May 2019 (late June 2019 for articles submitted after 15 March)	Notification of the editorial decision
1 August 2019	Final manuscript deadline
September 2019	Publication online

Examples of large annotated corpora

All manuscripts should explicitly state what resource(s) they use and how they utilise the annotations. In addition to using the annotated corpora listed below, authors can also build their own annotated corpus by annotating a raw corpus with a morphological dictionary (e.g. MALINDO Morph (Nomoto et al. 2018a)), a POS tagger (e.g. MorphInd (Larasati et al. 2011) or the Rule-Based POS Tagger Bahasa Indonesia (Rashel et al. 2014)), an HPSG grammar (e.g. INDRA (Moeljadi et al. 2015)), etc.

MALINDO Conc (Nomoto et al. 2018b)
[Reclassified version of the Leipzig Corpora Collection (Goldhahn et al. 2012; Nomoto et al. to appear); morphological annotation; Malay, Indonesian; 3 million words]
Korpus Indonesia (KOIN) (Kwary 2018)
[part-of-speech annotation; Indonesian; 3.7 million words]
One Million POS Tagged Corpus of Bahasa Indonesia
[part-of-speech annotation; Indonesian; 1 million words]
Data from the Jakarta Field Station, Department of Linguistics, Max Planck Institute for Evolutionary Anthropology, 1999-2015
[word gloss annotation; Jakarta Indonesian and other languages in Indonesia; 4.5 million words]

Use of open resources is recommended to ensure the replicability of the findings and equality amongst researchers from different financial backgrounds.

For authors

For style files (LaTeX and Microsoft Word) and enquiries, please contact Hiroki Nomoto (nomoto ⟨ΑΤ⟩ tufs.ac.jp) or David Moeljadi (davidmoeljadi ⟨ΑΤ⟩ gmail.com).
We may be able to provide English-language proofreading for selected authors of accepted papers who are not native English speakers and are not affiliated with institutions whose main language of instruction/administration is English.

References

Goldhahn, Dirk, Thomas Eckart & Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12).
Kwary, Deny A. 2018. Towards the first online Indonesian National Corpus. The 4th Asia Pacific Corpus Linguistics Conference (APCLC 2018).
Larasati, Septina Dian, Vladislav Kuboň & Daniel Zeman. 2011. Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus. In Cerstin Mahlow et al. (eds.) Systems and Frameworks for Computational Morphology, 119-129. Berlin: Springer.
Moeljadi, David, Francis Bond & Sanghoun Song. 2015. Building an HPSG-based Indonesian Resource Grammar (INDRA). In Proceedings of the Grammar Engineering Across Frameworks (GEAF) Workshop, 53rd Annual Meeting of the ACL and 7th IJCNLP, 9-16.
Nomoto, Hiroki, Hannah Choi, David Moeljadi & Francis Bond. 2018a. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. In Kiyoaki Shirai (ed.) Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36-43.
Nomoto, Hiroki, Shiro Akasegawa & Asako Shiohara. 2018b. Building an open online concordancer for Malay/Indonesian. The 22nd International Symposium on Malay/Indonesian Linguistics (ISMIL). [slides]
Nomoto, Hiroki, Shiro Akasegawa & Asako Shiohara. to appear. Reclassification of the Leipzig Corpora Collection for Malay and Indonesian. NUSA 65.
Rashel, Fam, Andry Luthfi, Arawinda Dinakaramani & Ruli Manurung. 2014. Building an Indonesian rule-based part-of-speech tagger. In International Conference on Asian Language Processing (IALP2014). IEEE.
