In this talk, I will present FaQuAD: a reading comprehension (RC) dataset in the domain of Brazilian higher education. RC is a complex natural language understanding task whose input comprises a reading passage (usually a paragraph) and a question related to this passage. The task consists of finding the answer to the question within the given reading passage (the context). The correct answer is always a span of the context. FaQuAD follows the format of the well-known SQuAD dataset [Rajpurkar et al. 2016]. It comprises 900 questions related to contexts taken from 39 documents: 18 official documents from the Computer Science College at UFMS and 21 Wikipedia articles related to the Brazilian higher education system. Unlike many question answering (QA) datasets, which are built from predefined question-answer pairs, FaQuAD is based on contexts: the system needs to interpret both the question and the context in order to return the best answer. As far as we know, FaQuAD is the first Portuguese reading comprehension dataset with this challenging format. Additionally, I will describe a deep learning model [Seo et al. 2016] that solves this task by means of transfer learning from an unsupervised language model [Peters et al. 2018]. This model (called BiDAF) can benefit from pre-trained representations at two levels: word representations and contextual representations. The word representation layer is based on the GloVe model [Pennington et al. 2014], while the contextual representation layer is based on ELMo [Peters et al. 2018]. We report on several ablation tests that assess different aspects of both the model and the dataset.
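To make the span-based format concrete, here is a minimal sketch of one record in the SQuAD v1.1 JSON layout, together with a check that the answer really is a span of its context. The Portuguese sentence, the question, the title, and the id are invented for illustration and are not taken from FaQuAD; the helper names `iter_answers` and `is_valid_span` are hypothetical as well.

```python
# Illustrative only: the text below is made up, not an actual FaQuAD entry.
context = "O curso de Ciência da Computação tem duração de quatro anos."
answer_text = "quatro anos"

# One record in the SQuAD v1.1 layout: data -> paragraphs -> qas -> answers.
record = {
    "data": [{
        "title": "exemplo",  # hypothetical document title
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "ex-0001",  # hypothetical question id
                "question": "Qual é a duração do curso de Ciência da Computação?",
                "answers": [{
                    "text": answer_text,
                    # Character offset of the answer inside the context.
                    "answer_start": context.index(answer_text),
                }],
            }],
        }],
    }]
}


def iter_answers(dataset):
    """Yield (context, question, answer) triples from a SQuAD-style dict."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield paragraph["context"], qa["question"], answer


def is_valid_span(ctx, answer):
    """Return True if the answer text occurs in ctx at the stated offset."""
    start = answer["answer_start"]
    return ctx[start:start + len(answer["text"])] == answer["text"]


if __name__ == "__main__":
    for ctx, question, answer in iter_answers(record):
        print(question, "->", answer["text"], "| valid span:", is_valid_span(ctx, answer))
```

A check like `is_valid_span` is a cheap sanity test to run over a whole dataset file before training, since a model that predicts start and end positions depends on the offsets matching the context exactly.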
Everyone is welcome!