Abstract
A remarkable amount of Twitter messages are generated every second. Detecting the location entities mentioned in these messages is useful in text mining applications. Therefore, techniques for extracting the location entities from the Twitter textual content are needed. In this work, we approach this task in a similar manner to the Named Entity Recognition (NER) task, but we focus only on locations, while NER systems detect names of persons, organizations, locations, and sometimes more (e.g., dates, times). But, unlike NER systems, we address a deeper task: classifying the detected locations into names of cities, provinces/states, and countries in order to map them into physical locations. We approach the task in a novel way, consisting in two stages. In the first stage, we train Conditional Random Fields (CRF) models that are able to detect the locations mentioned in the messages. We train three classifiers: one for cities, one for provinces/states, and one for countries, with various sets of features. Since a dataset annotated with this kind of information was not available, we collected and annotated our own dataset to use for training and testing. In the second stage, we resolve the remaining ambiguities, namely, cases when there exists more than one place with the same name. We proposed a set of heuristics able to choose the correct physical location in these cases. Our two-stage model will allow a social media monitoring system to visualize the places mentioned in Twitter messages on a map of the world or to compute statistics about locations. This kind of information can be of interest to business or marketing applications.
http://ift.tt/2ohWvJG
Δεν υπάρχουν σχόλια:
Δημοσίευση σχολίου