Summary of Algorithmes et structures de données pour l'indexation de grands volumes de données textuelles

Algorithms and Data Structures for Indexing Large Volumes of Textual Data

We will explore algorithmic strategies for building an index that gives efficient access to large amounts of unstructured textual data. At the core of an information retrieval system we often find a data structure called the "inverted index", which is used to retrieve the documents that contain a given word. For all information retrieval models, from the simpler ones (e.g., the vector space model) to the more advanced (e.g., language models), performance depends mainly on the decisions made when building the index. Thus, we will first study the elements needed to construct the index (i.e., the specific algorithms and data structures), and then implement and test an index on a realistic dataset.
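
As a rough illustration of the inverted index idea described above, here is a minimal in-memory sketch. The function names (`tokenize`, `build_inverted_index`, `search`), the lowercase alphanumeric tokenizer, and the toy documents are all assumptions made for this example and are not taken from the course material; a realistic index would also need compression, on-disk storage, and ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    candidate_sets = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*candidate_sets))


# Toy collection, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
]
index = build_inverted_index(docs)
print(search(index, "cat"))      # documents 0 and 1 contain "cat"
print(search(index, "dog cat"))  # only document 1 contains both terms
```

Keeping each posting list sorted by document id is a common design choice, since it allows lists to be intersected by a linear merge and compressed as gaps between ids.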

We will also consider how vector space models can, to some extent, capture the meaning of words. In particular, we will study hyperdimensional computing and random indexing.
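
To make random indexing more concrete, here is a small sketch under stated assumptions: each word receives a sparse random "index vector" with a few +1 and -1 entries, and a word's "context vector" is the sum of the index vectors of its neighbours within a sliding window. The dimension, the number of non-zero entries, the window size, and the toy sentences are all arbitrary choices for illustration, not values from the course.

```python
import math
import random

DIM = 512      # vector dimensionality (assumed value)
NONZERO = 8    # non-zero entries per index vector (assumed value)
WINDOW = 2     # neighbours considered on each side (assumed value)


def make_index_vector(rng):
    """Build a sparse ternary vector: NONZERO random positions, half +1 and half -1."""
    vec = [0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def random_indexing(sentences, seed=0):
    """Return a context vector per word, accumulated from neighbours' index vectors."""
    rng = random.Random(seed)
    index_vectors = {}
    context_vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for tok in tokens:
            if tok not in index_vectors:
                index_vectors[tok] = make_index_vector(rng)
                context_vectors[tok] = [0] * DIM
        for i, tok in enumerate(tokens):
            lo = max(0, i - WINDOW)
            hi = min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                ctx = context_vectors[tok]
                for k, value in enumerate(index_vectors[tokens[j]]):
                    ctx[k] += value
    return context_vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus, invented for illustration only.
sentences = ["cat drinks milk", "dog drinks milk", "heavy stone falls"]
vectors = random_indexing(sentences)
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["stone"]))
```

In this toy corpus "cat" and "dog" occur with exactly the same neighbours, so their context vectors coincide, while "stone" shares no neighbours with "cat" and should score much lower. Because the index vectors are nearly orthogonal with high probability, the vocabulary never needs a full term-by-term co-occurrence matrix.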