Doctoral thesis defence of Mouloud Belbahri : GERAD

Title: Apprentissage basé sur le Qini pour la prédiction de l'effet causal conditionnel

Date: June 21, 2021
Time: 10:00

Zoom Meeting ID: 537 868 5982 Passcode: 965224

Jury members:
President: Martin Bilodeau
Supervisor: Alejandro Murua
Co-supervisor: Vahid Partovi Nia
Member: Pierre Duchesne
External member: Thierry Duchesne Date: 21 juin 2021
Heure: 10:00

Numéro d'identification de la rencontre Zoom: 537 868 5982 Mot de passe: 965224

Membres du jury:
Président-rapporteur: Martin Bilodeau
Directeur de recherche: Alejandro Murua
Co-directeur de recherche: Vahid Partovi Nia
Membre du jury: Pierre Duchesne
Examinateur externe: Thierry Duchesne
University Dean: Sean Horan

Abstract: Uplift models deal with cause-and-effect inference for a specific factor, such as a marketing intervention. In practice, these models are built on individual data from randomized experiments. A targeted group contains individuals who are subject to an action; a control group serves for comparison. Uplift modeling is used to order the individuals with respect to the value of a causal effect, e.g., positive, neutral, or negative. First, we propose a new way to perform model selection in uplift regression models. Our methodology is based on the maximization of the Qini coefficient. Because model selection corresponds to variable selection, the task is haunting and intractable if done in a straightforward manner when the number of variables to consider is large. To realistically search for a good model, we conceived a searching method based on an efficient exploration of the regression coefficients space combined with a lasso penalization of the log-likelihood. There is no explicit analytical expression for the Qini surface, so unveiling it is not easy. Our idea is to gradually uncover the Qini surface in a manner inspired by surface response designs. The goal is to find a reasonable local maximum of the Qini by exploring the surface near optimal values of the penalized coefficients. We openly share our codes through the R Package tools4uplift. Though there are some computational methods available for uplift modeling, most of them exclude statistical regression models. Our package intends to fill this gap. This package comprises tools for: i) quantization, ii) visualization, iii) variable selection, iv) parameters estimation and v) model validation. This library allows practitioners to use our methods with ease and to refer to methodological papers in order to read the details. Uplift is a particular case of causal inference. Causal inference tries to answer questions such as “What would be the result if we gave this patient treatment A instead of treatment B ?" . The answer to this question is then used as a prediction for a new patient. In the second part of the thesis, it is on the prediction that we have placed more emphasis. Most vii existing approaches are adaptations of random forests for the uplift case. Several split criteria have been proposed in the literature, all relying on maximizing heterogeneity. However, in practice, these approaches are prone to overfitting. In this work, we bring a new vision to uplift modeling. We propose a new loss function defined by leveraging a connection with the Bayesian interpretation of the relative risk. Our solution is developed for a specific twin neural network architecture allowing to jointly optimize the marginal probabilities of success for treated and control individuals. We show that this model is a generalization of the uplift logistic interaction model. We modify the stochastic gradient descent algorithm to allow for structured sparse solutions. This helps fitting our uplift models to a great extent. We openly share our Python codes for practitioners wishing to use our algorithms. We had the rare opportunity to collaborate with industry to get access to data from largescale marketing campaigns favorable to the application of our methods. We show empirically that our methods are competitive with the state of the art on real data and through several simulation setting scenarios.