Harold Benoit

Github / Linkedin / CV / Email: (my first name)_(my last name)@hotmail.ch

Currently, I am responsible for the evaluation and supervised finetuning (SFT) of LLMs for the Swiss AI research initiative, co-led by and .
The initiative can be roughly summarized as training LLMs from scratch with 10 million GPU hours of compute, and conducting research along the way.



I received my M.Sc. degree in Data Science (ranked 3rd in year) at , where I also previously completed my B.Sc. in Computer Science.




With enough scale and compute, models tend to converge to the same point and approximate their datasets. A reasonable hypothesis is that, given sufficiently extensive high-quality data, any standard architecture can exhibit advanced capabilities.

Thus, my research tends to be "data-focused", exploring scalable ways to identify or synthesize high-quality data, with the aim of making models more general and adaptable to new environments.

Controlled Training Data Generation with Diffusion Models
Teresa Yeo*, Andrei Atanov*, Harold Benoit^, Aleksandr Alekseev^, Ruchira Ray, Pooya Akhoondi, Amir Zamir
In review, 2024
arXiv / Github / project page

We propose a framework for automatically identifying the optimal input signal for a generative model, for the purpose of creating tailored synthetic training data for model finetuning.
Tailored here means that the data is (1) previously unseen by the finetuned model, and thus relevant for learning, and (2) close to a target deployment domain.

Unraveling the Key Components of OOD Generalization via Diversification
Harold Benoit*, Liangze Jiang*, Andrei Atanov*, Oğuzhan Fatih Kar, Mattia Rigotti, Amir Zamir
ICLR, 2024
arXiv / OpenReview

We distill the critical design factors of current state-of-the-art methods (multi-hypothesis/diversification methods) for handling spurious correlations.