Thesis by Ekaterina Tomilina (2022-2025)

Multi-omics regulation network inference via the Gaussian copula

Systems biology relies on the analysis of large, complex and highly heterogeneous datasets. Understanding how the different types of omics data depend on and regulate one another is a genuine challenge. This thesis uses copula theory to construct, test and apply a joint statistical model that accounts for this heterogeneity.

  • Dates: 2022-2025
  • Research laboratory: MaIAGE
  • Thesis directors: Gildas MAZO (INRAE, MaIAGE), Florence JAFFREZIC (INRAE, GABI), Andrea RAU (INRAE, GABI)
  • Metaprogramme axis: Axis 1 (Deciphering the functions of living matter at multiple scales: regulation and integration of biological processes)

Summary

The study of multi-omic regulatory networks is a key challenge in biology. The term multi-omic refers to the different -omic levels of an organism (proteomics, genomics, metabolomics, etc.). Each level plays a particular role in molecular biology processes, and their interactions drive biological responses in living organisms. A better understanding of the mechanisms underlying these networks could therefore, for instance, improve insight into diseases such as cancer.

A first major obstacle is the heterogeneity of the data (continuous, discrete, mixed, etc.): classical network inference methods are often limited to a single type of data. A second major obstacle is high dimension, which arises when the number of variables exceeds the number of observations. This raises the issue of variable selection, in order to keep only the most important variables in the network.

In this thesis, we propose the use of a Gaussian copula model to represent multi-omics data. This model captures the dependencies between observed variables through a latent Gaussian structure, parameterized by a correlation matrix that naturally encodes a network. The properties of this model, as well as many inference methods for the correlation coefficients, are well established when all observed variables are continuous. We therefore focus primarily on adapting the model to the case where discrete variables are also present.

For continuous variables, several methods exist for estimating the copula correlation coefficients. The task is less straightforward in the presence of discrete variables, as it often requires assumptions on the nature of the marginal distributions. We propose a maximum likelihood estimation method. To avoid high computational costs, we adopt a pairwise likelihood approach, in which the likelihood is built from pairs of variables rather than from the full joint distribution. Moreover, by adopting a semi-parametric framework, we remove the need for assumptions on the marginal distributions.
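
To make the estimation idea concrete, here is a minimal Python sketch for a single pair made of one continuous and one binary variable. It is an illustration under stated assumptions, not the estimator implemented in the thesis or its R package: the margins are handled through ranks (normal scores) and an empirical threshold, and the latent correlation is obtained by maximizing the likelihood of that one pair. The function names `normal_scores` and `latent_corr_cont_binary`, as well as the simulated data, are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, rankdata


def normal_scores(x):
    """Rank-based latent Gaussian scores; no parametric margin is assumed."""
    n = len(x)
    return norm.ppf(rankdata(x) / (n + 1))


def latent_corr_cont_binary(x, y):
    """Latent correlation for one (continuous x, binary y) pair.

    Assumed model (illustrative): y = 1 when a latent standard Gaussian
    exceeds a threshold tau, and x is a monotone transform of another
    latent standard Gaussian correlated with the first one.
    """
    z = normal_scores(x)
    p1 = np.clip(np.mean(y), 1e-6, 1 - 1e-6)
    tau = norm.ppf(1 - p1)  # empirical threshold: P(latent > tau) = p1

    def neg_loglik(rho):
        # Conditional on z, the second latent variable is N(rho*z, 1-rho^2),
        # so P(y = 1 | z) = Phi((rho*z - tau) / sqrt(1 - rho^2)).
        a = (rho * z - tau) / np.sqrt(1 - rho**2)
        return -np.sum(np.where(y == 1, norm.logcdf(a), norm.logcdf(-a)))

    res = minimize_scalar(neg_loglik, bounds=(-0.999, 0.999), method="bounded")
    return res.x


# Simulated example with a known latent correlation of 0.6 (toy data).
rng = np.random.default_rng(0)
latent = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=4000)
x = np.exp(latent[:, 0])              # continuous variable, non-Gaussian margin
y = (latent[:, 1] > 0.3).astype(int)  # binary variable
rho_hat = latent_corr_cont_binary(x, y)
print(f"estimated latent correlation: {rho_hat:.3f}")
```

In a pairwise approach, a routine of this kind would be run for each pair of variables and the resulting coefficients assembled into a correlation matrix.
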
We also investigate the independence properties of the model and show that latent correlations encode dependencies between groups of observed variables. Furthermore, we extend to the case where binary variables are present an interpretation of the extreme values of the correlation coefficients that was so far known only in the fully continuous framework.

In a third step, we study the structure of latent conditional correlations while performing variable selection to address the high-dimensional setting. Thanks to the Gaussian structure, this task amounts to inverting the correlation matrix. To achieve this, we apply a penalized inversion method to our pairwise semi-parametric maximum likelihood estimator.

Finally, we illustrate our methodology on a bull fertility multi-omic dataset from INRAE, using the R package developed as part of this thesis.
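
The penalized inversion step can be sketched as follows. The summary does not name the penalty, so this example assumes an l1 (graphical lasso) penalty solved with a basic ADMM loop in NumPy. The function names, the toy correlation matrix and the tuning values `lam` and `rho` are illustrative choices, not those of the thesis.

```python
import numpy as np


def sparse_precision(S, lam, rho=1.0, n_iter=500):
    """l1-penalized inverse of a correlation matrix S (graphical lasso, ADMM).

    lam is the penalty on off-diagonal entries; rho is the ADMM step
    parameter. Both are illustrative tuning values.
    """
    p = S.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    off_diag = ~np.eye(p, dtype=bool)
    for _ in range(n_iter):
        # Theta-update: closed form through an eigendecomposition.
        eigval, Q = np.linalg.eigh(rho * (Z - U) - S)
        theta = (eigval + np.sqrt(eigval**2 + 4.0 * rho)) / (2.0 * rho)
        Theta = (Q * theta) @ Q.T
        # Z-update: soft-threshold the off-diagonal entries only.
        A = Theta + U
        Z = A.copy()
        Z[off_diag] = np.sign(A[off_diag]) * np.maximum(
            np.abs(A[off_diag]) - lam / rho, 0.0
        )
        # Dual update.
        U = U + Theta - Z
    return Z


def partial_correlations(K):
    """Conditional (partial) correlations read off a precision matrix K."""
    d = np.sqrt(np.diag(K))
    pcor = -K / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Toy latent correlation matrix with a chain structure: 0 - 1 - 2 - 3.
idx = np.arange(4)
S = 0.4 ** np.abs(idx[:, None] - idx[None, :])

K = sparse_precision(S, lam=0.1)
pcor = partial_correlations(K)
edges = [(i, j) for i in range(4) for j in range(i + 1, 4) if pcor[i, j] != 0]
print("selected edges:", edges)
```

Here the toy matrix has a tridiagonal inverse, so one would expect only adjacent pairs to be kept as edges. In practice the matrix `S` would be an estimate of the latent correlation matrix, and `lam` would need to be chosen by some model selection criterion.
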

Publications

Journal articles

  • Ekaterina Tomilina, Gildas Mazo, Florence Jaffrézic. A semi-parametric Gaussian copula model for heterogeneous network inference: an application to multi-omics data. 2024. ⟨hal-04847648⟩
  • Ekaterina Tomilina, Florence Jaffrézic, Gildas Mazo. Gaussian copula correlation network analysis with application to multi-omics data. 2025. ⟨hal-04847648v3⟩
  • Ekaterina Tomilina, Gildas Mazo, Florence Jaffrézic. Multi-omics network inference with a Gaussian copula model. 2025. ⟨hal-05173829⟩

Conference papers

  • Ekaterina Tomilina, Gildas Mazo, Florence Jaffrézic. Méthodes à copules pour l'inférence de réseaux de régulation multi-omiques. Colloque Jeunes Probabilistes et Statisticiens, groupe Modélisation Aléatoire et Statistique de la Société de Mathématiques Appliquées et Industrielles, Oct 2023, Saint Pierre d'Oléron, France. ⟨hal-04308489⟩
  • Ekaterina Tomilina, Gildas Mazo, Florence Jaffrézic. Gaussian copula estimation for heterogeneous data. European Meeting of Statisticians, Jul 2023, Warsaw, Poland. ⟨hal-04308470⟩
  • Ekaterina Tomilina, Gildas Mazo, Florence Jaffrézic. Copula-based models for multi-omic network inference. Compstat 2024, Aug 2024, Giessen, Germany. ⟨hal-04683480⟩
  • Ekaterina Tomilina, Gildas Mazo, Florence Jaffrézic. Copula-based models for multi-omics network inference. Journée des Statistiques 2024, Société Française de Statistique, May 2024, Bordeaux, France. ⟨hal-04598167⟩
