King's part of global cross-sector initiative to standardise machine learning datasets

In partnership with MLCommons®, an organisation bridging academia and industry to create better datasets for safer, more responsible artificial intelligence (AI) models, King’s is announcing the release of Croissant, a metadata format designed to standardise machine learning (ML) datasets and improve the quality of future AI.

Data is at the core of every AI and ML model. However, there is currently no standardised method of organising and arranging the data and files that make up the datasets used to train these models. This means that many ML datasets lack sufficient machine-readable documentation for people to use them responsibly. Without this information, finding, understanding, and using these datasets safely and ethically can be time-consuming and may introduce inaccuracy into a model.

By providing metadata in a standardised way without the need to make changes to the data itself, Croissant promises to change the game in AI ethics, where high-quality, well-documented datasets are essential to the safe function of a model.

“For too long people have trained or applied their algorithms to whatever data was available, with mixed results and little consideration for the link between data quality and model performance, outcomes and impacts. Data is a critical element of any model's performance, and as some experts suggest it will run out, making the need to harness it even more important” – Professor Elena Simperl

Croissant also aims to make data more easily accessible and discoverable, as it enables datasets to be loaded into different AI platforms without the need for the lengthy process of reformatting. By taking this step, Croissant hopes to spread best practice no matter what platform is used.

The new format extends schema.org, an existing machine-readable standard used by over 40 million datasets. Building on schema.org means Croissant datasets can be found through industry-standard search engines such as Google Dataset Search and integrated into popular ML frameworks used by industry and academia, like TensorFlow and PyTorch.
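As a rough illustration of what "extending schema.org" can look like, the sketch below builds a small JSON-LD style record in plain Python. It assumes a simplified layout in which `sc:` terms come from schema.org and `cr:` terms stand in for Croissant's ML-specific additions; the helper name `build_metadata`, the dataset name, and the example URL are all hypothetical, and the output is not a validated Croissant file.

```python
import json


def build_metadata(name, description, file_url):
    """Return a simplified, illustrative dataset description.

    "sc:" terms are schema.org vocabulary; "cr:" terms are placeholders for
    Croissant's extensions. This is an assumption-laden sketch, not the
    official format.
    """
    return {
        "@context": {
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        # The dataset itself is described with the schema.org Dataset type.
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        # The files are described alongside the data, which stays unchanged.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "contentUrl": file_url,
                "encodingFormat": "text/csv",
            }
        ],
    }


metadata = build_metadata(
    "example-dataset",
    "A toy dataset used for illustration.",
    "https://example.org/data.csv",  # hypothetical location
)
print(json.dumps(metadata, indent=2))
```

The point of the sketch is that the description lives next to the data rather than inside it, so any tool that reads the record can locate and interpret the files without the dataset being reformatted.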

The Croissant editor also allows practitioners to inspect, create, or modify Croissant descriptions for their datasets, helping to create a standardised format across industries and teams. The format is also receiving support from major repositories of ML data, including Kaggle, OpenML and Hugging Face.

“Croissant allows more people to do more with data – a key aim of the department now in action... it is a privilege to collaborate with world-class machine learning scientists and engineers around the globe making an enormous contribution to the AI data ecosystem.” – Professor Elena Simperl

Omar Benjelloun, software engineer at Google and Croissant working group co-chair, said: “The development of Croissant was grounded in the needs of ML practitioners, and the technical requirements of ML tools, platforms, and datasets.

“Our goal with Croissant is to unlock real value for users by enabling the tools they use to work seamlessly together, while keeping the format as simple and intuitive as possible.”

Croissant is made possible thanks to efforts by the MLCommons Croissant working group, which includes contributors from these organisations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Google, Harvard, Hugging Face, Kaggle, King's College London - Open Data Institute, Meta, NASA, Open University of Catalonia - Luxembourg Institute of Science and Technology, and TU Eindhoven.

 

Read the blog from Google here.
Read the blog from the Open Data Institute here.

In this story

Elena Simperl

Professor of Computer Science
