On Privacy Preservation and Document-based Active Learning for Named Entity Recognition

Olsson, Fredrik (2009) On Privacy Preservation and Document-based Active Learning for Named Entity Recognition. In: ACM First International Workshop on Privacy and Anonymity for Very Large Datasets, 6 Nov 2009, Hong Kong.

Full text not available from this repository.


The preservation of the privacy of persons mentioned in text requires the ability to automatically recognize and identify names. Named entity recognition is a mature field, and most current approaches are based on supervised machine learning techniques. Such learning requires labeled examples on which to train; training examples are usually provided to the learner in the form of annotated corpora. Creating and annotating corpora is a tedious, meticulous and error-prone process; obtaining good training examples is a hard task in itself. This paper describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. Experimental results show that BootMark requires a human annotator to manually annotate fewer documents to produce a named entity recognizer with a given performance than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The investigation further indicates that the primary gain obtained by BootMark compared to passive learning is in terms of higher recall. Thus, it is argued, the recognizers are suitable for use in privacy preservation applications.

Item Type: Conference or Workshop Item (Paper)
Additional Information: Workshop held in conjunction with The 18th ACM Conference on Information and Knowledge Management (CIKM 2009)
ID Code: 3712
Deposited By: Fredrik Olsson
Deposited On: 07 Dec 2009 11:28
Last Modified: 07 Dec 2009 11:28
