Contextualization of plant metabolomics data using knowledge graphs enhanced by Big Data, AI and Semantic Web technologies

The SEED project expands the Metabolomic Semantic Data Lake (MSD), an electronic infrastructure that combines the Semantic Web with Big Data technologies to enable large-scale processing of the scientific literature on metabolomics. First developed to improve access to knowledge on the links between metabolism and human health, this infrastructure has now been adapted to cover plant metabolism. SEED will refine its use of artificial intelligence, enabling publications to be annotated automatically using ontologies. The project will also expand the range of sources included in the data lake and will draw on four case studies to validate and illustrate the new associations discovered between metabolites, biomarkers and plants.

Context and key challenges

The scientific literature contains vast quantities of information on the central role of certain metabolites in many aspects of plants and plant-based products, not least their resistance to disease, interactions with the environment, and organoleptic properties. This knowledge is essential to an understanding of the mechanisms that determine the characteristics of plants, their capacity to adapt to different stresses and environmental conditions, and their suitability for processing into food.

Despite this wealth of available knowledge, reusing it effectively in research projects remains a major challenge. The difficulty arises in part from the variety of experimental approaches used to study metabolites and from the great natural variability of plant products, but it also stems from the relative lack of annotated or organized data in plant research. Unlike the biomedical sector, where knowledge standardization and indexing are further advanced, plant research still suffers from considerable data fragmentation, making it hard to integrate and exploit existing datasets.

Standardized methodologies and innovative tools are therefore needed to classify, annotate and organize the full body of knowledge available in the scientific literature. Creating organized and interoperable knowledge bases for published scientific data would make it easier to carry out scientific monitoring, identify molecular biomarkers and gain an in-depth understanding of plant metabolism and its properties.

The SEED project therefore aims to further develop the Metabolomic Semantic Data Lake, an innovative e-infrastructure devoted to the production and consolidation of knowledge graphs. The MSD platform is designed to contextualize experimental data from metabolomics platforms by leveraging Big Data, AI and Semantic Web technologies to perform a synthetic and automated analysis of the scientific literature.

The platform also integrates automatic annotation methods to bridge the gaps in keyword annotation in some specialized areas. A major product of the platform is the FORVM Plants knowledge graph, which links experimental data with the scientific corpus to identify and analyze metabolic biomarkers in plants.
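As a rough illustration of what linking metabolites to the literature in a knowledge graph can look like, the sketch below stores a few subject-predicate-object triples and counts, for each plant, the papers that mention a given metabolite. All identifiers, predicate names and helper functions are hypothetical; nothing here reflects the actual FORVM Plants schema or data.

```python
# Hypothetical triples (subject, predicate, object); identifiers are
# illustrative only and do not come from FORVM Plants.
TRIPLES = {
    ("metabolite:glucosinolate", "mentioned_in", "paper:P1"),
    ("metabolite:glucosinolate", "mentioned_in", "paper:P2"),
    ("metabolite:quercetin", "mentioned_in", "paper:P2"),
    ("paper:P1", "about_plant", "plant:Brassica_napus"),
    ("paper:P2", "about_plant", "plant:Brassica_napus"),
    ("paper:P2", "about_plant", "plant:Solanum_lycopersicum"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Return every object linked to `subject` through `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def plants_linked_to(metabolite, triples=TRIPLES):
    """Count, per plant, the papers that mention the metabolite."""
    counts = {}
    for paper in objects(metabolite, "mentioned_in", triples):
        for plant in objects(paper, "about_plant", triples):
            counts[plant] = counts.get(plant, 0) + 1
    return counts


if __name__ == "__main__":
    print(plants_linked_to("metabolite:glucosinolate"))
```

A production knowledge graph would typically be expressed in RDF and queried with SPARQL; the two-hop traversal above only mirrors the general idea of following links from a metabolite to papers and from papers to plants.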

Goals and methodology

The goal of the SEED project is to consolidate the FORVM Plants knowledge graph, applying this tool to three case studies on plant metabolomics and a fourth study on the reactivity of plant polyphenols during the transformation of plant products into food. The project will proceed in two main stages:

The first stage will seek to enhance data annotation using ontologies from the Planteome project along with the TransformON ontology. It will concentrate on the creation of a dataset designed to refine a semantic similarity encoding model. In this first stage, the documentary data accessible through ISTEX, France’s scientific digital library, will also be integrated, expanding the size and diversity of the corpus for analysis.
In its second stage, the project will focus on the provision of answers to the biological questions raised in the four case studies, making use of the FORVM Plants knowledge graph.
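The ontology-based annotation planned for the first stage can be pictured with a deliberately simple stand-in: instead of a trained semantic similarity model, the sketch below scores a sentence against ontology term labels using cosine similarity over word counts. The term identifiers, labels, threshold and function names are all assumptions made for illustration; they are not taken from Planteome, TransformON or the project's model.

```python
import math
import re
from collections import Counter

# Hypothetical ontology terms; IDs and labels are invented for this sketch.
ONTOLOGY_TERMS = {
    "TERM:0001": "drought stress response",
    "TERM:0002": "fruit sugar content",
    "TERM:0003": "polyphenol oxidation during processing",
}


def tokens(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def annotate(sentence, terms=ONTOLOGY_TERMS, threshold=0.2):
    """Return IDs of terms whose label is similar enough to the sentence,
    ordered from most to least similar."""
    sent = tokens(sentence)
    scored = [(cosine(sent, tokens(label)), tid) for tid, label in terms.items()]
    return [tid for score, tid in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    print(annotate("Oxidation of polyphenol compounds was observed during juice processing"))
```

Word-count similarity only captures literal overlap; a learned encoding model is intended to match text and ontology terms that share meaning without sharing vocabulary, which is why a dedicated training dataset matters.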

The project builds on the results of Maxime DELMAS’s thesis, Construire, exploiter et étendre un graphe de connaissances pour l’étude des liens entre métabolisme et santé (building, exploiting and extending a knowledge graph for the study of associations between metabolism and health), co-funded by DIGIT-BIO. These results are also currently being used in a thesis by Meije MATHÉ, which is accredited by DIGIT-BIO.

The proof of concept for the FORVM Plants knowledge graph is emerging as a key resource for the development of new standards in an implementation study coordinated by Franck GIACOMONI (UNH/AlimH). The study, ‘Next level of reproducible, comparable and integrable Metabolomics’ (2025-2027), is part of the European ELIXIR program. It seeks to improve data processing solutions for metabolomics, using a standardized semantic model to support innovative solutions for health and systems biology.

Contact - Coordination:

Olivier FILANGI (IGEPP)

Partnerships

INRAE participants

Department	Units	Expertise
BAP	IGEPP	Data processing, Big Data, Semantic Web, Response to abiotic stress in Brassicaceae
BAP	BFP	Fruit metabolism and physiology and their impacts on growth, biomass production and nutritional quality
Transform	BIA	Semantic Web, Knowledge modelling, Phytochemical ontologies, metabolomics of transformed plant products

External partners

Institute	Expertise
CNRS (LRSV)	Untargeted metabolomics, phytochemistry, multivariate analysis

Modification date: 29 June 2026 | Publication date: 05 June 2026 | By: Marjorie Domergue
