Information Extraction from text data

Najah-Imane BENTABET

1. Motivation behind Information Extraction

Newspapers, blogs, and web pages are a rich and diverse source of textual information. However, the information they contain cannot realistically be extracted, recorded, and indexed by hand, mainly because of its sheer volume. Moreover, extracting some information requires specific knowledge or a technical background. This is the case in the financial domain, where we want to automatically extract key information from specific documents (e.g., PDFs, Word documents) published by banks, funds, and other financial institutions. To scale knowledge extraction to the large volume of available text, and to build extractors specific to a given field, new methods based on machine learning have been developed and deployed [1, 2].

2. From information extraction to relation extraction

Information Extraction (IE) is a broad subject covering multiple tasks, mainly Named Entity Recognition (NER), Co-reference Resolution, Relation Extraction (RE), and Event Extraction. RE uses NER and co-reference resolution to extract binary semantic relations, whereas event extraction (i.e., extracting who did what to whom, when, and where) can be represented as a complex combination of relations, and thus may use RE techniques. Accordingly, RE is the pivot of IE.

What is relation extraction? A relation extractor automatically identifies the semantics (i.e., the meaning) of a sentence and then stores it in a structured, computer-readable format, in what is known as a Knowledge Base. For example, a relation extractor would map the sentence


John Lennon, the lead singer of the Beatles, was born in England (1)

to the structured fact has_nationality(John Lennon, England), and the sentence:

Google bought Youtube in 2006 for US$1.65 billion (2)

to the fact has_acquired(Google, Youtube).
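As a minimal sketch of what "structured and computer-readable" can mean, a fact such as has_acquired(Google, Youtube) can be stored as a typed triple and queried. The `Fact` class, the set-based knowledge base, and the `objects_of` helper are illustrative assumptions, not a format prescribed here.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """A binary relation between two entities (hypothetical container)."""
    relation: str
    subject: str
    obj: str


# A knowledge base modeled, for illustration only, as a set of facts.
knowledge_base = {Fact("has_acquired", "Google", "Youtube")}


def objects_of(kb, relation, subject):
    """Return every object linked to `subject` by `relation`, sorted."""
    return sorted(f.obj for f in kb if f.relation == relation and f.subject == subject)


print(objects_of(knowledge_base, "has_acquired", "Google"))
```

Once facts are stored this way, a question such as "what did Google acquire?" becomes a simple lookup rather than a text-understanding problem.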


Main difficulty of IE. The main difficulty of IE is that most text data is unstructured, i.e., the format of the data is not indicative of its meaning. For instance, there are many different ways to convey the fact has_acquired(Google, Youtube). Sentence 2 conveys it, and so do the following sentences:

Youtube was purchased by Google.     (3)
Google confirms YouTube acquisition. (4)
Google has acquired Youtube, an online video sharing service. (5)
Google’s Youtube takeover clears final hurdle. (6)

The versatility of natural (i.e., human-understandable) language makes it difficult for a machine to interpret.
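To make this difficulty concrete, here is a small sketch of a single hand-written surface pattern applied to three of the sentences above. The regular expression and the `extract_acquisition` helper are illustrative assumptions; the point is only that a rule tied to one wording misses the paraphrases.

```python
import re

# One hand-written surface pattern: "<X> bought <Y>" (illustrative assumption).
BOUGHT = re.compile(r"^(?P<buyer>\w+) bought (?P<target>\w+)")

sentences = [
    "Google bought Youtube in 2006 for US$1.65 billion",
    "Youtube was purchased by Google.",
    "Google has acquired Youtube, an online video sharing service.",
]


def extract_acquisition(sentence):
    """Return ("has_acquired", buyer, target) if the pattern fires, else None."""
    match = BOUGHT.search(sentence)
    if match is None:
        return None
    return ("has_acquired", match.group("buyer"), match.group("target"))


results = [extract_acquisition(s) for s in sentences]
for sentence, result in zip(sentences, results):
    print(result, "<-", sentence)
```

Only the first sentence yields the fact; the other two express the same acquisition but return None.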

3. Methods in the state of the art

A relation extractor takes as input a raw sentence with two marked named entities. In sentence 2, for example, the pair of marked entities is Google and Youtube. The input sentence is then processed in two steps. First, a vector representation of the sentence is computed (cf. Section 3.1); second, a semantic relation is assigned to the sentence based on this representation (cf. Section 3.2).

3.1 Vector representation of a sentence

Feature-based representation. This type of representation is binary, sparse, and computed using tools from Computational Linguistics (CL). These tools include, among others:

  • Named Entity Recognizer: a tool that tags a token as an organization (ORG), a person (PER), a location (LOC), etc.
  • Part-Of-Speech (POS) tagger: a tool that tags each word of a sentence with its part of speech (verb, noun, proper noun, etc.)
  • Dependency parser: a tool that represents a sentence as a tree whose nodes are the words of the sentence and whose edges are the grammatical relations between these words (e.g., subject_of, preposition).

In the binary sparse representation, each bin indicates whether a given feature is present. These features are computed with the help of CL tools and are selected on a trial-and-error basis. For example, we can choose to represent our sentences with the following features:

  • NER tags of the pair of entities
  • POS sequence of the sentence
  • sequence of words between the pair of entities
  • the dependency path between the pair of entities

This binary representation is sparse and high-dimensional. It also requires heavy pre-processing, as it relies on several CL tools, the dependency parser being the most computationally demanding.
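The following sketch shows how such a binary vector could be assembled. The annotations are hand-written and approximate, standing in for the output of real CL tools; the tag names, the dependency-path notation, and the helper names are all assumptions made for illustration.

```python
def feature_strings(example):
    """Name the four kinds of features listed above for one sentence."""
    return {
        "ner_pair=" + "-".join(example["ner_pair"]),
        "pos_seq=" + " ".join(example["pos"]),
        "between=" + " ".join(example["between"]),
        "dep_path=" + example["dep_path"],
    }


# Hand-written, approximate annotations (not produced by an actual tagger or parser).
examples = [
    {   # "Youtube was purchased by Google."
        "ner_pair": ("ORG", "ORG"),
        "pos": ["PROPN", "AUX", "VERB", "ADP", "PROPN", "PUNCT"],
        "between": ["was", "purchased", "by"],
        "dep_path": "nsubjpass^purchased>agent>pobj",  # hypothetical notation
    },
    {   # "Google has acquired Youtube, an online video sharing service."
        "ner_pair": ("ORG", "ORG"),
        "pos": ["PROPN", "AUX", "VERB", "PROPN", "PUNCT", "DET",
                "ADJ", "NOUN", "NOUN", "NOUN", "PUNCT"],
        "between": ["has", "acquired"],
        "dep_path": "nsubj^acquired>dobj",  # hypothetical notation
    },
]

# One bin per distinct feature seen in the data.
vocabulary = sorted(set().union(*(feature_strings(e) for e in examples)))
index = {name: i for i, name in enumerate(vocabulary)}


def binary_vector(example):
    """Set the bin of each feature present in the example to 1."""
    vector = [0] * len(index)
    for name in feature_strings(example):
        if name in index:
            vector[index[name]] = 1
    return vector


for example in examples:
    print(binary_vector(example))
```

With only two sentences the vocabulary already has seven bins, and each sentence activates four of them. On a real corpus almost every sentence contributes new word sequences and paths, which is why the vectors become very long and mostly zero.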

Dense representation. Dense representations generally stem from neural networks (convolutional, recurrent, or recursive [3]). Instead of relying on off-the-shelf tools that are trained and optimized separately from the RE task, as is the case for the feature-based representation, the dense representation is optimized (i.e., learned) during the training of the relation extractor. We refer the reader to [4, 5, 6, 7] for architectures that have been successfully used to perform relation extraction. Work is still ongoing to find better, easier-to-train architectures.
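As a rough illustration of what a dense sentence vector looks like, the sketch below runs a one-layer convolution with max pooling over word embeddings in plain Python. It is not the architecture of any cited paper: the dimensions are arbitrary, and the embeddings and filters are random here, whereas in a real model they would be learned jointly with the relation extractor.

```python
import random

random.seed(0)
EMB_DIM, N_FILTERS, WINDOW = 4, 3, 2  # arbitrary sizes chosen for the sketch

tokens = "Youtube was purchased by Google".split()

# Untrained parameters: random stand-ins for what training would learn.
embeddings = {w: [random.uniform(-1, 1) for _ in range(EMB_DIM)] for w in tokens}
filters = [[random.uniform(-1, 1) for _ in range(WINDOW * EMB_DIM)]
           for _ in range(N_FILTERS)]


def sentence_vector(tokens):
    """Convolve each filter over sliding word windows, then max-pool."""
    # Each window is the concatenation of WINDOW consecutive word embeddings.
    windows = [
        sum((embeddings[t] for t in tokens[i:i + WINDOW]), [])
        for i in range(len(tokens) - WINDOW + 1)
    ]
    # One output value per filter: its strongest response over all positions.
    return [
        max(sum(w * x for w, x in zip(f, window)) for window in windows)
        for f in filters
    ]


vector = sentence_vector(tokens)
print(len(vector), vector)
```

Unlike the binary representation, the output has a small fixed size (here three real numbers) regardless of sentence length, and every component is generally non-zero.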

3.2 Paradigms for relation extraction

RE can be done under different paradigms depending on the data at hand: a raw corpus of text, a set of labeled and tagged sentences, a knowledge base, etc. Depending on the paradigm, the semantic relation assigned to an input sentence may be either a pre-defined relation type (e.g., has_nationality) or an integer id.

Supervised RE (a.k.a. Relation Classification). In supervised relation extraction, the training dataset is composed of sentences tagged with a pair of named entities and labeled with the semantic relation they convey. The number of relations is finite, and the list R of relations considered is specified in advance by the user. The task consists of training a model that correctly maps an unseen sentence to one of the semantic relations in R: it is a classification problem whose labels are the semantic relations considered.
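A toy version of this classification setting is sketched below. The classifier (a multiclass perceptron that also updates on ties) is our choice for brevity and is not prescribed by the text; the training pairs are invented for illustration, and each sentence is reduced to the words between its two marked entities.

```python
from collections import defaultdict

R = ["has_acquired", "has_nationality"]  # relation list fixed in advance

# Toy training set: (words between the two marked entities, gold relation).
train = [
    ("bought", "has_acquired"),
    ("has acquired", "has_acquired"),
    ("was purchased by", "has_acquired"),
    ("was born in", "has_nationality"),
    ("is a citizen of", "has_nationality"),
]

weights = {r: defaultdict(float) for r in R}  # one weight per (relation, word)


def score(relation, words):
    return sum(weights[relation][w] for w in words.split())


def predict(words):
    """Assign the relation in R with the highest score."""
    return max(R, key=lambda r: score(r, words))


for _ in range(10):  # a few passes over the toy data
    for words, gold in train:
        rival = max((r for r in R if r != gold), key=lambda r: score(r, words))
        if score(gold, words) <= score(rival, words):  # error or tie: update
            for w in words.split():
                weights[gold][w] += 1.0
                weights[rival][w] -= 1.0

print(predict("has purchased"), predict("born in"))
```

The two queries are word sequences that never appear verbatim in the training set; the model labels them from the individual words it has seen, which is the sense in which it generalizes to unseen sentences.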

Distant supervision. Distant supervision [8] is an alternative to Relation Classification (RC) that does not require hand-labeled datasets. As in RC, sentences are mapped to semantic relations belonging to a predefined list R of relation types. Unlike RC, however, labels are assigned to the training sentences automatically, typically by aligning the text with facts from an existing knowledge base. Hence, the training datasets are larger than those used in the supervised setting, but also noisier.
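One common way to assign labels automatically, sketched here under simplifying assumptions, is to tag any sentence that mentions both entities of a known fact with that fact's relation. Entity matching is reduced to substring search, and the last two corpus sentences are made up to illustrate noise and non-matches.

```python
# Known facts: (entity 1, entity 2) -> relation type.
kb = {("Google", "Youtube"): "has_acquired"}

corpus = [
    "Youtube was purchased by Google.",
    "Google has acquired Youtube, an online video sharing service.",
    "Google and Youtube are both popular websites.",  # made up: mentions both, no acquisition
    "Google was founded in 1998.",                    # made up: mentions only one entity
]


def distant_labels(corpus, kb):
    """Label every sentence containing both entities of a known fact."""
    labeled = []
    for sentence in corpus:
        for (e1, e2), relation in kb.items():
            if e1 in sentence and e2 in sentence:  # naive substring matching
                labeled.append((sentence, e1, e2, relation))
    return labeled


training_set = distant_labels(corpus, kb)
for item in training_set:
    print(item)
```

Three sentences receive the label has_acquired without any manual annotation, but the third one does not actually express an acquisition. This is the kind of noise referred to above.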

Bootstrapping. Bootstrapping [9] is a semi-supervised method that targets one relation type, r, at a time and outputs a large set of entity pairs related by r. Initially, it requires only a few entity pairs linked by r.
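A minimal sketch of the bootstrapping loop follows, assuming a toy corpus of short sentences and a single seed pair. Patterns are simply the words found between two known entities; real systems score and filter patterns to limit drift, which is omitted here. The corpus and the `bootstrap` function are illustrative assumptions.

```python
import re

# Toy corpus for the relation r = has_acquired.
corpus = [
    "Google has acquired Youtube",
    "Facebook has acquired Instagram",
    "Facebook bought Instagram",
    "Microsoft bought Skype",
]


def bootstrap(corpus, seeds, rounds=2):
    """Alternate between harvesting patterns and harvesting entity pairs."""
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        # 1) Collect the text found between the entities of known pairs.
        for e1, e2 in pairs:
            for sentence in corpus:
                m = re.fullmatch(re.escape(e1) + r" (.+) " + re.escape(e2), sentence)
                if m:
                    patterns.add(m.group(1))
        # 2) Apply every pattern to the corpus to find new pairs.
        for pattern in patterns:
            for sentence in corpus:
                m = re.fullmatch(r"(\w+) " + re.escape(pattern) + r" (\w+)", sentence)
                if m:
                    pairs.add((m.group(1), m.group(2)))
    return pairs, patterns


pairs, patterns = bootstrap(corpus, seeds={("Google", "Youtube")})
print(sorted(pairs))
print(sorted(patterns))
```

Starting from one pair, the first round learns the pattern "has acquired" and finds (Facebook, Instagram); the second round learns "bought" from that new pair and finds (Microsoft, Skype).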

Open IE. An open information extractor processes raw corpora of text to output a diverse set of relational tuples, for example (Google, has acquired, Youtube) [10]. The main drawback of this paradigm is that it does not specify the relation type conveyed by each extracted triplet.
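The sketch below uses a deliberately naive rule to show the shape of Open IE output: a run of auxiliary, verb, and preposition tokens sitting between two proper nouns is taken as the relation phrase. The POS tags are hand-written rather than produced by a tagger, and the rule is an assumption made for illustration, not the method of [10].

```python
RELATION_TAGS = {"AUX", "VERB", "ADP"}


def extract_triples(tagged):
    """tagged: list of (token, coarse POS tag) pairs for one sentence."""
    triples = []
    for i, (token, tag) in enumerate(tagged):
        if tag != "PROPN":
            continue
        # Extend the candidate relation phrase as far as the tags allow.
        j = i + 1
        while j < len(tagged) and tagged[j][1] in RELATION_TAGS:
            j += 1
        phrase = tagged[i + 1:j]
        has_verb = any(t == "VERB" for _, t in phrase)
        if has_verb and j < len(tagged) and tagged[j][1] == "PROPN":
            triples.append((token, " ".join(w for w, _ in phrase), tagged[j][0]))
    return triples


# Hand-tagged versions of two example sentences.
s1 = [("Google", "PROPN"), ("has", "AUX"), ("acquired", "VERB"),
      ("Youtube", "PROPN"), (",", "PUNCT"), ("an", "DET"), ("online", "ADJ"),
      ("video", "NOUN"), ("sharing", "NOUN"), ("service", "NOUN"), (".", "PUNCT")]
s2 = [("Youtube", "PROPN"), ("was", "AUX"), ("purchased", "VERB"),
      ("by", "ADP"), ("Google", "PROPN"), (".", "PUNCT")]

print(extract_triples(s1))
print(extract_triples(s2))
```

The two sentences yield triples with different relation phrases ("has acquired" and "was purchased by") and opposite argument order, and nothing in the output says that both correspond to the single relation type has_acquired.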

Unsupervised RE. An unsupervised relation extractor clusters a set of relational triplets so that triplets conveying the same semantic relation end up in the same cluster [11].
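As a simple stand-in for such a clustering (a heuristic of our own choosing, not the generative model of [11]), the sketch below merges relation phrases that connect the same unordered pair of entities. The triplets are toy data and `cluster_phrases` is a hypothetical helper.

```python
from collections import defaultdict

triples = [
    ("Google", "has acquired", "Youtube"),
    ("Youtube", "was purchased by", "Google"),
    ("Facebook", "has acquired", "Instagram"),
    ("Google", "is based in", "California"),
]


def cluster_phrases(triples):
    """Group relation phrases that link the same unordered entity pair."""
    parent = {}

    def find(x):
        # Union-find lookup with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    by_pair = defaultdict(list)
    for e1, phrase, e2 in triples:
        find(phrase)  # register the phrase
        by_pair[frozenset((e1, e2))].append(phrase)
    for phrases in by_pair.values():
        for other in phrases[1:]:
            parent[find(other)] = find(phrases[0])

    clusters = defaultdict(set)
    for phrase in parent:
        clusters[find(phrase)].add(phrase)
    return sorted(sorted(c) for c in clusters.values())


clusters = cluster_phrases(triples)
for cluster_id, phrases in enumerate(clusters):
    print(cluster_id, phrases)
```

Each cluster is identified only by an integer id: the method groups "has acquired" with "was purchased by" but does not name the resulting relation.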

4. Practical uses of IE

Information extraction, and specifically relation extraction, makes it possible to construct a large knowledge base from scratch, or to populate existing knowledge bases (such as Yago [12], DBPedia [13], WikiData [14]). Having a comprehensive knowledge base at hand is crucial, as it is the backbone of many useful engines, such as Question-Answering (QA) systems [15]. With a powerful QA system, one can further develop appealing tools such as:

  • personalized bots
  • bots specific to a certain domain (finance, risk assessment [16], …)
  • tools for quick access to information regarding competitors
  • tools that use extracted information to support decision making


[1] A. Fader, S. Soderland, and O. Etzioni, “Identifying relations for open information extraction,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, (Stroudsburg, PA, USA), pp. 1535–1545, Association for Computational Linguistics, 2011.
[2] N. Bach and S. Badaskar, “A review of relation extraction,” 2007.
[3] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng, “Semantic compositionality through recursive matrix-vector spaces,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP CoNLL ’12, (Stroudsburg, PA, USA), pp. 1201–1211, Association for Computational Linguistics, 2012.
[4] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, et al., “Relation classification via convolutional deep neural network,” in COLING, pp. 2335–2344, 2014.
[5] D. Zeng, K. Liu, Y. Chen, and J. Zhao, “Distant supervision for relation extraction via piecewise convolutional neural networks,” in EMNLP (L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton, eds.), pp. 1753–1762, The Association for Computational Linguistics, 2015.
[6] Y. Lin, S. Shen, Z. Liu, H. Luan, and M. Sun, “Neural relation extraction with selective attention over instances,” in ACL, 2016.
[7] X. Jiang, Q. Wang, P. Li, and B. Wang, “Relation extraction with multi-instance multi-label convolutional neural networks,” in COLING, 2016.
[8] M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision for relation extraction without labeled data,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL ’09, (Stroudsburg, PA, USA), pp. 1003–1011, Association for Computational Linguistics, 2009.
[9] M. A. Hearst, “Automatic acquisition of hyponyms from large text corpora,” in Proceedings of the 14th Conference on Computational Linguistics – Volume 2, COLING ’92, (Stroudsburg, PA, USA), pp. 539–545, Association for Computational Linguistics, 1992.
[10] O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam, “Open information extraction: The second generation,” in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence – Volume One, IJCAI’11, pp. 3–10, AAAI Press, 2011.
[11] L. Yao, A. Haghighi, S. Riedel, and A. McCallum, “Structured relation discovery using generative models,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, (Stroudsburg, PA, USA), pp. 1456–1466, Association for Computational Linguistics, 2011.
[12] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A core of semantic knowledge unifying wordnet and wikipedia,” 2007.
[13] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer, “DBpedia – a large-scale, multilingual knowledge base extracted from wikipedia,” Semantic Web Journal, vol. 6, no. 2, pp. 167–195, 2015.
[14] D. Vrandecic and M. Krötzsch, “Wikidata: A free collaborative knowledgebase,” Commun. ACM, vol. 57, pp. 78–85, Sept. 2014.
[15] D. Ravichandran and E. Hovy, “Learning surface text patterns for a question answering system,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, (Stroudsburg, PA, USA), pp. 41–47, Association for Computational Linguistics, 2002.
[16] P. Capet, T. Delavallade, T. Nakamura, A. Sandor, C. Tarsitano, and S. Voyatzi, A Risk Assessment System with Automatic Extraction of Event Types, pp. 220–229. Boston, MA: Springer US, 2008.