Artificial Intelligence
Natural language processing
Semantic web

Extraction of semantic relations from descriptions of Quebec heritage sites

Student: François Ferry

Supervisor: Michel Gagnon

Co-supervisor(s): Amal Zouaq

Much of the information on the Web and in databases exists only as raw text. Although raw text is easy for humans to understand, it is much harder for machines to process. Structuring this data is therefore a major challenge, and meeting it would make the data more accessible and exploitable. Many methods exist for extracting information from raw text. The most popular rely on machine learning and word representations, which capture information such as word semantics and word distribution.

In this project, we work with data from the Repertoire of Cultural Heritage of Quebec. This repertoire brings together the real estate properties, persons, movable heritage and intangible cultural heritage of Quebec. Its current classification no longer meets the needs of the Ministry of Culture and Communications of Quebec. To help redesign the knowledge base, we therefore propose an application that extracts relations between real estate properties and persons or groups of persons. Each property has a historical synthesis that describes its history and cites the persons who played a role in it. Our goal is to process these syntheses to extract those relations. Ultimately, the application should help build the future knowledge base. The input data for our problem are, for each property, a historical synthesis and a list of the persons related to that property.

Our research question is whether a machine learning-based approach is sufficient to extract relations from the syntheses. For each pair ⟨real estate, person⟩, we first isolate the context around each mention of the person in the historical synthesis of the real estate. While browsing the data, we found that the information describing the relation is very often located near the mention of the person. We define the context either as a fixed number of words surrounding the mention or as the sentence containing the mention. We then use a word representation model to transform the context into a vector, which gives us one vector for each pair ⟨real estate, person⟩. This vector is then given to a supervised machine learning algorithm (a support vector machine or a multilayer perceptron), which predicts the relation it represents. These algorithms are trained on data extracted from the Repertoire of Cultural Heritage of Quebec and tested on a manually annotated corpus (extracted from the repertoire and annotated by us).
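The two context definitions and the vectorization step can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names, the regular-expression tokenizer, the naive sentence splitter, the toy embedding table and the example sentence are all assumptions made for illustration. In particular, averaging word vectors is only one possible way to turn a context into a single vector; the description above does not say how the vector is composed.

```python
import re


def tokenize(text):
    """Split text into word tokens (letters, digits, apostrophes, hyphens)."""
    return re.findall(r"[\w'-]+", text)


def window_contexts(text, mention, size=5):
    """For each occurrence of `mention`, return up to `size` tokens
    before it and up to `size` tokens after it (mention excluded)."""
    tokens = tokenize(text)
    target = tokenize(mention)
    n = len(target)
    contexts = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            left = tokens[max(0, i - size):i]
            right = tokens[i + n:i + n + size]
            contexts.append(left + right)
    return contexts


def sentence_contexts(text, mention):
    """Return every sentence containing `mention`.
    Naive splitter: breaks after '.', '!' or '?', so abbreviations
    such as 'St.' would be split incorrectly."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if mention in s]


def average_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens found in `embeddings`
    (a hypothetical dict: lowercase word -> list of floats)."""
    vectors = [embeddings[t.lower()] for t in tokens if t.lower() in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Invented example text, not taken from the repertoire.
synthesis = ("The house was built in 1850 by Jean Tremblay for his family. "
             "It was later sold to the parish.")
toy_embeddings = {"by": [1.0, 0.0], "for": [0.0, 1.0]}  # placeholder values

window = window_contexts(synthesis, "Jean Tremblay", size=2)[0]
sentence = sentence_contexts(synthesis, "Jean Tremblay")[0]
vector = average_vector(window, toy_embeddings, dim=2)
```

The resulting vector, one per pair ⟨real estate, person⟩, is what would be passed to the supervised classifier. A small window keeps only the words closest to the mention, while the sentence variant keeps more of the surrounding text at the cost of more noise.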
