The ‘open source’ artificial intelligence that programs best in C was trained by reading questions and answers from Stack Overflow


A month ago, the artificial intelligence company DeepMind (a subsidiary of Google/Alphabet) announced to the world the creation of AlphaCode, an AI capable of performing like an average developer when faced with programming problems.

This was, of course, well received in the technology industry, as it opened up a whole range of possibilities for AI working as an assistant to human users (not necessarily programmers). However, the problem with AI models like AlphaCode is, as a group of researchers from Carnegie Mellon University explains, that

“the most powerful [AI-based] programming models are not publicly available: this precludes the application of such models outside of large companies, and limits research in this field for organizations with fewer resources.”

In fact, a study carried out in 2020 by the startup AI21 Labs put the cost of training a code generation model with 1.5 billion parameters (i.e., about half the size of PolyCoder) at between $80,000 and $1.6 million. Solutions like GitHub Copilot, for their part, have 12 billion parameters.

PolyCoder was created to democratize research on programming AIs

For this reason, these researchers (Frank Xu, Uri Alon, Graham Neubig and Vincent Hellendoorn) had been working for some time on an open source programming AI, accessible to all types of users and organizations, that would be capable of democratizing the creation and study of programming AIs, a field so far dominated by DeepMind and OpenAI.

And that’s where PolyCoder comes in: their new automated code generation model, which is based on the popular GPT-2 and has been trained on 631 GB of data and 38.9 million code files (sourced from GitHub repositories) to ‘learn’ to code in 12 programming languages.

It can generate code in C, C#, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala and TypeScript, although its own creators point out that it particularly excels at writing code in C. In fact, it is able to code in C more accurately than other known models, including Codex (the AI model behind GitHub’s Copilot feature).

A peculiarity of PolyCoder is that it was trained not only on code files, but also on natural language information taken from Stack Overflow, the developer Q&A website:

“A promising approach to developing robust code generation models seems to be to train them on various sources of programming knowledge, not just code from a wide mix of programming languages, but programming-related text from all over the web.”

The datasets used to train Codex, for example, have not been made publicly available, and its API output follows a ‘black box’ model, which prevents researchers from fine-tuning the AI model or studying certain aspects of it, such as its interpretability.

“To some extent, we hope that our open source efforts can convince others to follow our lead. The key is that the community should be able to train these models themselves…

…[but] our model has already pushed the limit of what can be trained on a single server: any larger model already requires a cluster of servers, dramatically increasing cost.”

Another of these researchers’ objectives in releasing PolyCoder as open source is to prevent this class of models from being pushed to generate programs with bugs or malicious code (as is already the case with Copilot), especially if these result in hard-to-detect vulnerabilities.

Via | VentureBeat & ZDNet

