Thesis by Sihan Xie (2023-2026)

DeepSelectGene: Deep Learning for Genotype Data and its Application in Genomic Selection

Thesis by Sihan Xie (GABI, 2023-2026). Deep learning (DL) methods are increasingly used to build models that predict phenotypes from genotype data, both in the study of human diseases and for production traits in the genomic selection of domestic animals. These models must be trained on large data sets, which are not always available. This thesis will address this limitation by proposing a novel method for simulating genotype data.

  • Co-funded thesis
  • Starting date: December 2023
  • Research laboratory: GABI (Animal Genetics and Integrative Biology)
  • Thesis director: Eric BARREY (UMR GABI, INRAE)
  • Supervisors: Blaise HANCZAR (IBISC, Université d’Evry Val d’Essonne), Julien CHIQUET (MIA Paris-Saclay, INRAE)
  • Metaprogramme axis: Axis 2 (Predicting phenotypes and their responses to changes in stress fields)

Summary:

Deep learning (DL) methods are beginning to be used as predictive models for phenotypes based on genotype data, in the context of human diseases and of production traits for genomic selection in domestic animals. These models need to be trained on large data sets of genotype-phenotype pairs, which are not always available: for some species, only a few thousand animals have been genotyped.

This thesis project will address this limitation by using two types of DL methods in succession. First, a generative DL method, for example generative adversarial networks (GANs), will be trained on a small but representative real data set in order to simulate genotype data. This will artificially increase the size of the database needed to properly train a second, predictive DL model, which predicts a phenotype from a genotype (50K-800K SNPs). This second DL model, which will have a simpler structure, will need to be optimized based on our initial exploratory studies on the subject, carried out during the GenIALearn project on bovine genomic selection (MP DIGIT-BIO 2022-2024).
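As a purely illustrative sketch of the first step (enlarging a small genotype data set with simulated individuals), the toy example below uses a deliberately simple stand-in generator, not a GAN: each synthetic genotype is assembled from contiguous SNP blocks copied from randomly chosen real animals. The function names, the 0/1/2 allele-count coding, the block size and the random toy data are all assumptions made for this example and do not come from the project itself.

```python
import random


def simulate_genotypes(real, n_new, block_size=5, seed=0):
    """Stand-in generator (NOT a GAN, hypothetical helper).

    Each synthetic genotype is a mosaic of contiguous SNP blocks,
    each block copied from one randomly chosen real animal.
    """
    rng = random.Random(seed)
    n_snps = len(real[0])
    synthetic = []
    for _ in range(n_new):
        genotype = []
        for start in range(0, n_snps, block_size):
            donor = rng.choice(real)  # one real animal per block
            genotype.extend(donor[start:start + block_size])
        synthetic.append(genotype)
    return synthetic


def allele_frequencies(genotypes):
    """Frequency of the counted allele at each SNP (genotypes coded 0/1/2)."""
    n = len(genotypes)
    return [sum(column) / (2 * n) for column in zip(*genotypes)]


# Toy "real" data set (made up): 20 animals x 30 SNPs, allele counts 0/1/2.
data_rng = random.Random(42)
real = [[data_rng.choice([0, 1, 2]) for _ in range(30)] for _ in range(20)]

# Enlarge the data set with 200 simulated genotypes.
synthetic = simulate_genotypes(real, n_new=200)
augmented = real + synthetic

# A basic sanity check one might run on simulated data: compare per-SNP
# allele frequencies between the real and the simulated genotypes.
real_freq = allele_frequencies(real)
synthetic_freq = allele_frequencies(synthetic)
max_gap = max(abs(r - s) for r, s in zip(real_freq, synthetic_freq))
print(f"{len(augmented)} genotypes, largest allele-frequency gap: {max_gap:.3f}")
```

In a real setting the generator would be a trained generative model, and the simulated genotypes would be checked against the real data on far richer criteria (for example linkage disequilibrium patterns) before being used to train the predictive model.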

In summary, this thesis will propose a novel method for simulating genotype data that serves two purposes: i) improving our understanding of the genetic determinism involved in phenotype formation; ii) generating additional quasi-real data essential for training the DL prediction model. As a result, this DL model for phenotype prediction can be applied with just a few thousand genotype-phenotype pairs and will subsequently be improved as the database is progressively enriched. This topic is highly interdisciplinary, bringing together genetics, genomics, statistics, and data science, and is at the forefront of AI applications in genomics.
