
Hello, I'm Théo Sourget

Théo Sourget

Research Assistant at IT University of Copenhagen - PURRlab.

Passionate about Data Science and AI, I obtained my Master's degree from the Université de Rouen in 2023. During my studies, I specialized in medical image analysis, with projects on the classification and segmentation of medical images using deep learning models. These projects also covered the training of transformer models on small datasets, as well as the effects of data augmentation and transfer learning.

Skills

Professional experience

IT University of Copenhagen - PURRlab

October 2023 - Present, Copenhagen (Denmark)

Research Assistant

October 2023 - Present

  • Study how public medical image datasets are referenced in research papers
Teaching Assistant

October 2023 - Present

  • Help students during practical work

Data scientist intern
Capgemini Engineering - Medic@

April 2023 - September 2023, Illkirch-Graffenstaden (France)

Responsibilities:
  • Detection, segmentation, and numbering of teeth in dental panoramic x-rays.
  • Comparison of Mask-RCNN and Detection Transformer (DETR); a minimal inference sketch follows this list.
  • Comparison of “classical” data augmentation techniques with augmentation based on generating new panoramic images.
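For context, here is a minimal sketch of how DETR-style detection can be run with the Hugging Face transformers library. The COCO-pretrained checkpoint, input file name, and threshold below are illustrative stand-ins, not the model or data used during the internship.

```python
# Minimal DETR inference sketch (public COCO checkpoint, not the internship's dental model).
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("panoramic_xray.png").convert("RGB")  # hypothetical input file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into thresholded detections in (x0, y0, x1, y1) format.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]

for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(f"{model.config.id2label[label.item()]}: {score:.2f} at {box.tolist()}")
```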

See-d

April 2021 - July 2021, Vannes (France)

Developer

June 2021 - July 2021

  • Continuation of the internship
  • Creation of a dashboard with Qlik Sense using the previously created API
Developer Intern

April 2021 - June 2021

  • Development of a storage-related data analysis website with Python and Streamlit (a minimal sketch follows this list)
  • Deployment with Docker
  • Presentation of the tool to the team
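As an illustration, a minimal sketch of what such a Streamlit data-analysis app can look like; the CSV layout and column names are hypothetical, not the actual internship code.

```python
# streamlit_app.py - minimal sketch of a storage-usage dashboard (hypothetical data layout).
import pandas as pd
import streamlit as st

st.title("Storage usage overview")

uploaded = st.file_uploader("Upload a storage report (CSV)", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)  # expected columns (hypothetical): "volume", "used_gb"
    st.dataframe(df)
    st.bar_chart(df.set_index("volume")["used_gb"])
```

Run locally with `streamlit run streamlit_app.py`; deployment would then wrap this script in a Docker image that runs the same command.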

Education

Master's degree in Data Science and Engineering (SID)
Grade: 16.215 out of 20 (Graduated with Highest Honors)
Université de Rouen
2020-2021
Bachelor's degree in Computer Science and Data Science
Grade: 14.958 out of 20 (Graduated with Honors)
IUT de Vannes
2018-2020
Two-year diploma in Computer Science
Grade: 13.835 out of 20

Projects

Citation finder
2023

Website using multiple APIs, such as OpenAlex, to search for papers referencing a given paper and for papers matching keywords and concepts.
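As an illustration, a hedged sketch of the kind of OpenAlex query such a tool can build on: the `cites:` filter returns works that reference a given paper. The work ID below is only an example, not a specific paper from this project.

```python
# Sketch: list papers that cite a given work via the OpenAlex API.
import requests

work_id = "W2741809807"  # example OpenAlex work ID, not one used in this project
url = "https://api.openalex.org/works"
params = {"filter": f"cites:{work_id}", "per-page": 5}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()

for work in response.json()["results"]:
    print(work["display_name"], "-", work["publication_year"])
```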

Research project Master 2
2022 - 2023

Comparison of SegFormer and U-Net for semantic segmentation on the CAMUS dataset (cardiac ultrasound images).
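For context, a minimal sketch of running SegFormer for semantic segmentation with the transformers library; the ADE20k-pretrained checkpoint and input file are public stand-ins, not the model trained on CAMUS.

```python
# Sketch: SegFormer semantic segmentation inference (public checkpoint, not the CAMUS model).
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

checkpoint = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(checkpoint)

image = Image.open("ultrasound_frame.png").convert("RGB")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_classes, H/4, W/4)

# Upsample to the input resolution and take the per-pixel argmax as the predicted mask.
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
mask = upsampled.argmax(dim=1)[0]
print(mask.shape, mask.unique())
```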

Research project Master 1
2021 - 2022

Classification of eye fundus images for glaucoma detection with convolutional neural networks.
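A hedged sketch of the general approach: a convolutional network with a binary head for glaucoma vs. healthy. The ResNet-18 backbone, transfer learning setup, and random batch below are illustrative choices, not the project's exact architecture or data.

```python
# Sketch: binary eye-fundus classifier via transfer learning (illustrative, not the project's exact code).
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # glaucoma vs. healthy

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One hypothetical training step on a batch of fundus images of shape (N, 3, 224, 224).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```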

Bachelor's degree project
2020 - 2021

Analysis of a dataset on the 2011-2012 season of the Premier League.

Astronomical observation website
2020

Development of an astronomical observation website with React-Flask-MongoDB.

Applicharge
2019-2020

Development of a website and a mobile application for sports management with React/React Native.

Publications

Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data’s public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper, we conduct an analysis of publicly available machine learning datasets on CCPs, discussing datasets’ context, and identifying limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly in the potentially harmful downstream effects from poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.
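As an illustration of the kind of metadata audit described above, here is a sketch using the huggingface_hub client to check whether public datasets declare a license. The search term is hypothetical, and this check is far simpler than the paper's full analysis of sharing, documentation, and maintenance practices.

```python
# Sketch: check license tags of public datasets on the Hugging Face Hub (illustrative search term).
from huggingface_hub import HfApi

api = HfApi()
for ds in api.list_datasets(search="chest x-ray", limit=10):
    info = api.dataset_info(ds.id)
    licenses = [t for t in (info.tags or []) if t.startswith("license:")]
    print(ds.id, "->", licenses if licenses else "no license tag")
```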

Medical imaging papers often focus on methodology, but the quality of the algorithms and the validity of the conclusions are highly dependent on the datasets used. As creating datasets requires a lot of effort, researchers often rely on publicly available datasets; however, there is no widely adopted standard for citing the datasets used in scientific papers, which makes tracking dataset usage difficult. In this work, we present two open-source tools we created to help detect dataset usage: a pipeline using OpenAlex and full-text analysis, and a PDF annotation software used in our study to manually label the presence of datasets. We applied both tools in a study of the usage of 20 publicly available medical datasets in papers from MICCAI and MIDL. We compute the proportion and the evolution between 2013 and 2023 of three types of presence in a paper: cited, mentioned in the full text, and both cited and mentioned. Our findings show that usage is concentrated on a limited set of datasets. We also highlight differing citation practices, which make automated tracking difficult.
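A minimal sketch of the full-text side of such a pipeline: scanning a paper's text for known dataset names. The name list and text below are illustrative placeholders; the released tools handle aliases and citation matching more carefully.

```python
# Sketch: naive detection of dataset mentions in a paper's full text (illustrative name list).
import re

dataset_names = ["CheXpert", "MIMIC-CXR", "CAMUS", "BraTS"]  # illustrative subset
paper_text = "We train our model on CheXpert and evaluate on MIMIC-CXR."  # placeholder text

mentions = {
    name: len(re.findall(rf"\b{re.escape(name)}\b", paper_text, flags=re.IGNORECASE))
    for name in dataset_names
}
print({name: count for name, count in mentions.items() if count > 0})
```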

Detection Transformer for Teeth Detection, Segmentation, and Numbering in Oral Rare Diseases: Focus on Data Augmentation and Inpainting Techniques

In this work, we focused on deep learning image processing in the context of oral rare diseases, which pose challenges due to limited data availability. A crucial step involves teeth detection, segmentation, and numbering in panoramic radiographs. To this end, we used a dataset consisting of 156 panoramic radiographs from individuals with rare oral diseases, labeled by experts. We trained the Detection Transformer (DETR) neural network for teeth detection, segmentation, and numbering across the 52 teeth classes. In addition, we used data augmentation techniques, including geometric transformations. Finally, we generated new panoramic images using inpainting techniques with stable diffusion, by removing teeth from a panoramic radiograph and integrating teeth into it. The results showed a mAP exceeding 0.69 for DETR without data augmentation. The mAP improved to 0.82 when data augmentation techniques were used. Furthermore, we observed promising performance when using new panoramic radiographs generated with the inpainting technique, with a mAP of 0.76.
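For illustration, a sketch of the inpainting step with the diffusers library; the checkpoint, prompt, and input files are public placeholders rather than the pipeline actually applied to the radiographs in this work.

```python
# Sketch: image inpainting with Stable Diffusion via diffusers (placeholder checkpoint and files).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("panoramic.png").convert("RGB")   # hypothetical radiograph
mask = Image.open("teeth_mask.png").convert("RGB")   # white where content should be regenerated

result = pipe(prompt="a tooth", image=image, mask_image=mask).images[0]
result.save("panoramic_inpainted.png")
```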

The U-Net model, introduced in 2015, is established as the state-of-the-art architecture for medical image segmentation, along with its variants UNet++, nnU-Net, V-Net, etc. Vision transformers made a breakthrough in the computer vision world in 2021. Since then, many transformer-based or hybrid architectures (combining convolutional blocks and transformer blocks) have been proposed for image segmentation, challenging the predominance of U-Net. In this paper, we ask whether transformers could overtake U-Net for medical image segmentation. We compare SegFormer, one of the most popular transformer architectures for segmentation, to U-Net using three publicly available medical image datasets that cover various modalities and organs: the segmentation of cardiac structures in ultrasound images from the CAMUS challenge, the segmentation of polyps in endoscopy images, and the segmentation of instruments in colonoscopy images from the MedAI challenge. We compare them in light of various metrics (segmentation performance, training time) and show that SegFormer can be a true competitor to U-Net and should be carefully considered for future tasks in medical image segmentation.
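As a concrete example of the segmentation-performance side of such a comparison, here is a small sketch of the Dice score, one common way to score predicted masks against ground truth; the metric implementation and toy masks are illustrative, not the paper's exact evaluation code.

```python
# Sketch: Dice score between a predicted and a ground-truth binary mask.
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|), computed on binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy example: two 4x4 masks that overlap on half their area, so the expected Dice is 0.5.
pred = np.zeros((4, 4), dtype=int)
pred[:2, :2] = 1
target = np.zeros((4, 4), dtype=int)
target[:2, 1:3] = 1
print(f"Dice: {dice_score(pred, target):.3f}")
```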

Recent posts