Font Size:
Analyzing Binary Data: Is your Model Really Pairwise?
##manager.scheduler.building##: Edificio Santa Maria
##manager.scheduler.room##: Auditorio San Agustin
Date: 2019-07-10 12:00 PM – 03:45 PM
Last modified: 2019-06-14
Abstract
Uncovering the main patterns hidden within data containing random errors is a central problem in science. To address this question, information theory, with its Minimum Description Length (MDL) principle, provides a precise prescription on how to decide, between potential explanations of the data, which is the best. The choice is based not only on how well a model fits the data, but also on the simplicity of the explanation it provides. However, in practice, such a prescription is difficult to follow as 1) model complexity is hard to compute and 2) the number of potential models is huge, which makes brute force search impossible. For instance, the number of models one must consider in order to capture all possible patterns in a binary dataset grows super-exponentially with the number of variables in the system. For that reason, it is common to reduce the space of possible models by selecting only among pairwise models, whose number only grows exponentially. But is it correct to assume that real data are always well captured by pairwise models? In this work, we tackle the problem of model selection for binary datasets within the largest class of models, the family of all spin models with high order interactions. We propose a heuristic algorithm to perform an efficient search in this huge space while keeping the number of parameters of the model as low as possible. This ensures that the complexity of the model stays small since, to first order, complexity increases with the number of parameters. Finally, we compare on real data the performance of the model we find with that of the best pairwise model.