Language models are increasingly surprising: examples such as OpenAI's GPT-2 and GPT-3 come to mind, text generators capable of maintaining meaning and coherence, in some cases producing text indistinguishable from that written by humans. Nevertheless, this class of models suffers from two defects:

Almost all of them are proprietary developments whose code remains inaccessible to the research community.

They tend to forget that there are (many) languages besides English.

Now, both things are going to change thanks to the launch of the ‘BigScience Large Open-science Open-access Multilingual Language Model’, better known as ‘BLOOM’. Development of this AI began in 2021, with the human and financial support (100 million dollars) of the machine learning startup Hugging Face (which you may know, for example, for hosting the DALL-E Mini generator on its website), although Nvidia, Microsoft and the CNRS (the ‘French CSIC’) have also collaborated.

Speaking to VentureBeat, Teven Le Scao, research engineer at Hugging Face, explained that the company made use of Nvidia’s ‘Megatron’ and Microsoft’s ‘DeepSpeed’ open-source projects (both based on the PyTorch machine learning framework), created to allow data scientists to train large language models.

BLOOM is trained to generate text in a total of 59 languages: 46 natural languages (including Spanish, Catalan and Basque) and 13 programming languages. Its 176 billion parameters (which exceed, if only slightly, the mark previously set by GPT-3) required 117 days of training (from March 11 to July 6) on the French supercomputer Jean Zay.



Examples of BLOOM in use. Note the one on the right, which shows how it is capable of ‘translating’ between two variants of the same language (the Spanish of Spain and that of Argentina).

And this is SPECTACULAR news! That the community has taken ~2 years to release an open source model similar to GPT-3 is incredible. Something that gives hope for a future where AI remains open for research and use. — Carlos Santana (@DotCSV) July 12, 2022

BLOOM has also been released under its own open license, based on the ‘Responsible AI’ license, which allows use “as open as possible” without giving up some control over how the AI is used: “We are trying to define what open source means in the context of big AI models, because they don’t really work like software does,” Le Scao said.

Anyone can download it. And, in theory, run it

But beware: the fact that BLOOM is released under a free license does not mean that using it will necessarily be free of charge. We are used to this class of AI models (as well as image-based ones like DALL-E) looking like mere web applications, but if OpenAI charges for the use of GPT-3, it is because the model makes heavy use of expensive physical infrastructure…

…so Hugging Face could charge for BLOOM usage if it wanted to (for now, it just asks you to register on its website). However, the license prevents the company from holding a monopoly on the model: any other entity with access to the same kind of hardware will now be able to launch its own instance of BLOOM.
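To give a sense of why "the same hardware" is no small requirement, here is a rough back-of-the-envelope sketch of how much memory is needed just to hold BLOOM's 176 billion parameters. The bytes-per-parameter figures are our own illustrative assumptions about common numeric formats, not numbers published by Hugging Face, and the estimate ignores everything else a running model needs (activations, caches, and so on).

```python
# Rough estimate of the memory needed to store BLOOM's weights only.
PARAMETERS = 176_000_000_000  # 176 billion, the figure quoted in the article

# Assumed storage size per parameter, in bytes (illustrative assumptions).
BYTES_PER_PARAM = {
    "float32": 4,
    "float16/bfloat16": 2,
    "int8": 1,
}


def weights_size_gb(parameters: int, bytes_per_param: int) -> float:
    """Return the size of the weights in gigabytes (1 GB = 10**9 bytes)."""
    return parameters * bytes_per_param / 1e9


def estimate_all(parameters: int = PARAMETERS) -> dict:
    """Map each assumed numeric format to its estimated weight size in GB."""
    return {
        name: weights_size_gb(parameters, size)
        for name, size in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for name, gb in estimate_all().items():
        print(f"{name}: ~{gb:,.0f} GB")
```

Even under the most compact of these assumptions, the weights alone come to far more memory than a typical desktop computer offers, which is why running the model "in theory" and running it in practice are two different things.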

“All the experiments that researchers and professionals have always wanted to run […] are now possible. BLOOM is the seed of a living family of models that we intend to nurture, not a single model, and we stand ready to support community efforts to expand it.”



We tried generating our own text (in black, what we wrote; in blue, what BLOOM wrote)… and we have bad news: the AI believes that by August we should already have started the new school year.

