Conference paper. Year: 2024

A Transformer-based NLP Pipeline for Enhanced Extraction of Botanical Information Using CamemBERT on French Literature

Abstract

This research investigates centuries-old French botanical literature, focusing on floras: comprehensive guides that describe the plant species of a specific region. Despite their significance, these texts remain largely unexplored by AI methods. Our objective is to bridge this gap by constructing a specialized French botanical dataset sourced from the flora of New Caledonia. We propose a transformer-based Named Entity Recognition (NER) pipeline, leveraging distant supervision and CamemBERT, for the automated extraction and structuring of botanical information. For species name extraction, the NER model achieves a precision of 0.94, a recall of 0.98, and an F1-score of 0.96; for fine-grained extraction of botanical morphological terms, the CamemBERT-based NER model attains a precision of 0.93, a recall of 0.96, and an F1-score of 0.94. This work contributes to the exploration of valuable botanical literature by showing that AI models can automate information extraction from complex and diverse texts.
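The abstract names distant supervision without describing how it is applied. As a rough illustration only, and not the authors' implementation, the sketch below shows one common form of the idea: a lexicon of known species names is matched against tokenized text to produce BIO training labels without manual annotation. The function name `distant_bio_labels`, the `SPECIES` label, the two-entry lexicon, and the example sentence are all assumptions made for this sketch.

```python
from typing import List, Sequence, Set, Tuple


def distant_bio_labels(
    tokens: Sequence[str],
    lexicon: Set[Tuple[str, ...]],
    label: str = "SPECIES",
) -> List[str]:
    """Tag tokens with BIO labels by longest-match lookup in a lexicon.

    Lexicon entries are lowercase token tuples, e.g. ("araucaria", "columnaris").
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in lexicon:
                matched = n
                break
        if matched:
            labels[i] = f"B-{label}"
            for j in range(i + 1, i + matched):
                labels[j] = f"I-{label}"
            i += matched
        else:
            i += 1
    return labels


# Hypothetical lexicon and sentence, for illustration only.
LEXICON = {("araucaria", "columnaris"), ("agathis", "lanceolata")}
TOKENS = ["Araucaria", "columnaris", "est", "un", "arbre", "de", "grande", "taille", "."]
LABELS = distant_bio_labels(TOKENS, LEXICON)

for token, tag in zip(TOKENS, LABELS):
    print(f"{token}\t{tag}")
```

Labels produced this way are noisy (the lexicon can miss variants or match ambiguous words), which is why they are typically used to fine-tune a contextual model such as CamemBERT rather than as the final extractor.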
Main file
csit140605.pdf (1.58 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04536866, version 1 (08-04-2024)

Cite

Ayoub Nainia, Régine Vignes-Lebbe, Eric Chenin, Maya Sahraoui, Hajar Mousannif, et al. A Transformer-based NLP Pipeline for Enhanced Extraction of Botanical Information Using CamemBERT on French Literature. 5th International Conference on NLP & Information Retrieval (NLPI 2024), Mar 2024, Sydney, Australia. pp. 59-78, ⟨10.5121/csit.2024.140605⟩. ⟨hal-04536866⟩