This document presents a curated collection of research papers, datasets, and tools dedicated to the application of machine learning techniques in code intelligence.
Code intelligence applies machine learning to extract knowledge from large-scale code repositories, with the aim of developing intelligent tools that improve the quality and productivity of computer programming.
The list includes the publication year for each paper (or the submission year for pre-prints and arXiv articles), the name of the first author, and the publication venue. Additionally, if the code associated with the research is available, it is linked via a corresponding hyperlink.
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | A Survey of Machine Learning for Big Code and Naturalness | Allamanis et al. | CSUR | Code |
2021 | A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research | Watson et al. | TOSEM | Code |
2020 | Synergy between Machine/Deep Learning and Software Engineering: How Far Are We? | Wang et al. | arXiv | Code |
2020 | A Survey on Deep Learning for Software Engineering | Yang et al. | CSUR | Code |
2020 | Deep Learning & Software Engineering: State of Research and Future Directions | Devanbu et al. | arXiv | Code |
2021 | CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation | Lu et al. | arXiv | Code |
To represent source code, we need to first determine what to represent. Various work has proposed to extract code features from multiple perspectives, including code tokens, intermediate representation, abstract syntax tree, as well as many kinds of flow graphs.
Code tokens, which shape the textual appearance of source code, comprise function names, keywords, and variable identifiers. These tokens are simple yet effective for representing the semantics of programs. Most approaches process code by breaking the program into a sequence of tokens based on specific delimiters, such as spaces or the capitalization patterns within identifiers (e.g., camel-case identifiers like SortList and intArray).
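As a concrete illustration (a minimal sketch, not any particular paper's tokenizer), the following snippet splits identifiers into sub-tokens on snake_case and camel-case boundaries:

```python
import re

def subtokenize(identifier):
    """Split an identifier into sub-tokens on snake_case and
    camel-case boundaries, a common pre-processing step."""
    parts = identifier.replace("_", " ")
    # Break "intArray" -> "int Array" and "HTMLParser" -> "HTML Parser".
    parts = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", parts)
    parts = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", parts)
    return parts.lower().split()

print(subtokenize("SortList"))  # ['sort', 'list']
print(subtokenize("intArray"))  # ['int', 'array']
```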
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2017 | Synthesizing benchmarks for predictive modeling | Cummins et al. | CGO | Code |
2015 | Toward deep learning software repositories | White et al. | ICSE | Code |
2016 | Summarizing source code using a neural attention model | Iyer et al. | ACL | Code |
2016 | A convolutional attention network for extreme summarization of source code | Allamanis et al. | ICML | Code |
2019 | Open Vocabulary Learning on Source Code with a Graph-Structured Cache | Cvitkovic et al. | ICML | Code |
2021 | A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code | Chirkova et al. | NAACL | Code |
2020 | Learning and Evaluating Contextual Embedding of Source Code | Kanade et al. | ICML | Code |
2020 | Codebert: A pre-trained model for programming and natural languages | Feng et al. | EMNLP | Code |
2020 | Big code != big vocabulary: Open-vocabulary models for source code | Karampatsis et al. | ICSE | Code |
Multiple methods have been proposed to analyze the API sequences in programs. One line of work mines API usage patterns from large code corpora to demonstrate how an API is used. Another line of work is API recommendation, which aims to recommend or generate a sequence of APIs for users.
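As a toy illustration of the pattern-mining flavor (the corpus and `min_support` threshold are made up for this sketch; real approaches mine millions of methods), frequent adjacent API-call pairs can be counted directly:

```python
from collections import Counter

# Each entry is the API-call sequence extracted from one method (toy data).
corpus = [
    ["File.open", "File.read", "File.close"],
    ["File.open", "File.read", "Parser.parse"],
    ["Socket.connect", "Socket.send", "Socket.close"],
]

def frequent_api_bigrams(sequences, min_support=2):
    """Count adjacent API-call pairs and keep those occurring at least
    `min_support` times -- a crude stand-in for usage-pattern mining."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return {pair: c for pair, c in counts.items() if c >= min_support}

print(frequent_api_bigrams(corpus))  # {('File.open', 'File.read'): 2}
```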
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2015 | How can I use this method? | Moreno et al. | ICSE | Code |
2017 | An unsupervised approach for discovering relevant tutorial fragments for APIs | Jiang et al. | ICSE | Code |
2017 | DeepAM: Migrate APIs with Multi-Modal Sequence to Sequence Learning | Gu et al. | IJCAI | Code |
2016 | Deep API learning | Gu et al. | FSE | Code |
2017 | Exploring API Embedding for API Usages and Applications | Nguyen et al. | ICSE | Code |
2019 | SAR: learning cross-language API mappings with little knowledge | Bui et al. | FSE | Code |
The Abstract Syntax Tree (AST) is a tree-structured intermediate representation of code that describes the syntactic structure of a program. In an AST, the leaf nodes typically correspond to the tokens of variables and method names in the source code, while the non-leaf nodes represent the syntactic structure of the code, such as function definitions and branch statements. As a result, ASTs are useful for capturing both the lexical information (e.g., variable names) and the syntactic structure of the source code. In practice, ASTs can be extracted with several open-source tools, e.g., the tree-sitter parser and LLVM Clang.
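For a self-contained example (using Python's built-in `ast` module rather than the external parsers named above), the split between structural non-leaf nodes and lexical leaves is easy to see:

```python
import ast

source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)

# Non-leaf nodes carry syntactic structure (FunctionDef, Return, BinOp);
# identifier leaves carry lexical information (the names a and b).
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        print("function:", node.name)   # function: add
    elif isinstance(node, ast.Name):
        print("identifier:", node.id)   # identifier: a, then b
```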
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2016 | Convolutional neural networks over tree structures for programming language processing | Mou et al. | AAAI | Code |
2020 | Modeling programs hierarchically with stack-augmented LSTM | Liu et al. | JSS | Code |
2019 | A novel neural source code representation based on abstract syntax tree | Zhang et al. | ICSE | Code |
2018 | Deep code comment generation | Hu et al. | ICPC | Code |
2019 | code2vec: Learning distributed representations of code | Alon et al. | PLDI | Code |
2019 | code2seq: Generating Sequences from Structured Representations of Code | Alon et al. | ICLR | Code |
2020 | Structural language models of code | Alon et al. | ICML | Code |
2017 | A syntactic neural model for general-purpose code generation | Yin et al. | ACL | Code |
2018 | Tree-to-tree neural networks for program translation | Chen et al. | ICLR | Code |
The Intermediate Representation (IR) is a well-formed structure, independent of programming languages and machine architectures, that compilers use to accurately represent source code during its translation into low-level machine code. The IR can express the operations of the target machine. It is natural to enhance code embeddings with IRs, whose limited vocabulary significantly alleviates the out-of-vocabulary (OOV) issue.
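The vocabulary benefit can be made concrete with a small sketch (the IR text is hand-written for illustration; a real pipeline would obtain it from a compiler, e.g., via `clang -S -emit-llvm`). Canonicalizing the named values of LLVM-style IR collapses arbitrary identifiers into a tiny, reusable token set:

```python
import re

# Hand-written LLVM-style IR, for illustration only.
ir = """
%1 = load i32, i32* %x
%2 = add i32 %1, 1
store i32 %2, i32* %x
"""

def normalize_registers(ir_text):
    """Map each named value to a canonical ID so the token vocabulary
    stays small -- the property that curbs out-of-vocabulary tokens."""
    mapping = {}
    def rename(match):
        name = match.group(0)
        mapping.setdefault(name, "%v{}".format(len(mapping)))
        return mapping[name]
    return re.sub(r"%[A-Za-z0-9_.]+", rename, ir_text)

print(normalize_registers(ir))
# %v0 = load i32, i32* %v1
# %v2 = add i32 %v0, 1
# store i32 %v2, i32* %v1
```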
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Neural code comprehension: A learnable representation of code semantics | Ben-Nun et al. | NeurIPS | Code |
2020 | IR2Vec: LLVM IR based Scalable Program Embeddings | VenkataKeerthy et al. | TACO | Code |
2020 | Compiler-based graph representations for deep learning models of code | Brauckmann et al. | CC | Code |
2021 | ProGraML: Graph-based Deep Learning for Program Optimization and Analysis | Cummins et al. | ICML | Code |
2021 | How could Neural Networks understand Programs? | Peng et al. | ICML | Code |
Many approaches have been proposed to convert programs into graphs that better capture the rich structural information within them, including the Control-Flow Graph (CFG), Data-Flow Graph (DFG), and Code Property Graph (CPG). The CFG represents the computation and control flow of a program; each node represents a basic block, and each edge represents a control-flow transition between blocks. The DFG is a directed graph that illustrates data relationships among operations: each node has input and output data ports, and each edge links an output port of one node to an input port of another.
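The sketch below encodes a minimal CFG as an adjacency map of basic blocks (a toy structure written by hand, not the output of a real analyzer) and traverses it:

```python
# CFG for:  if (x > 0) { y = 1 } else { y = -1 }; print(y)
# Nodes are basic blocks; edges are control-flow transitions.
cfg = {
    "entry": ["then", "else"],  # conditional branch on x > 0
    "then":  ["exit"],          # y = 1
    "else":  ["exit"],          # y = -1
    "exit":  [],                # print(y)
}

def reachable(start="entry"):
    """Depth-first traversal -- the kind of explicit walk that graph
    neural networks replace with learned message passing over edges."""
    seen, stack = set(), [start]
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(cfg[block])
    return seen

print(sorted(reachable()))  # ['else', 'entry', 'exit', 'then']
```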
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Learning to represent programs with graphs | Allamanis et al. | ICLR | Code |
2017 | Smartpaste: Learning to adapt source code | Allamanis et al. | arXiv | Code |
2018 | Generative code modeling with graphs | Brockschmidt et al. | ICLR | Code |
2020 | Flow2Vec: value-flow-based precise code embedding | Sui et al. | OOPSLA | Code |
2021 | ProGraML: Graph-based Deep Learning for Program Optimization and Analysis | Cummins et al. | ICML | Code |
2021 | PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair | Chen et al. | NeurIPS | Code |
2017 | Intelligent development environment and software knowledge graph | Lin et al. | JCST | Code |
2020 | Graph4code: A machine interpretable knowledge graph for code | Abdelaziz et al. | arXiv | Code |
2020 | Exploiting Code Knowledge Graph for Bug Localization via Bi-directional Attention | Zhang et al. | ICPC | Code |
In addition to the code features already widely explored above, several other kinds of features, such as symbolic execution traces and code edits, are used in specific scenarios.
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Code vectors: Understanding programs through embedded abstracted symbolic traces | Henkel et al. | FSE | Code |
2019 | Learning to Represent Edits | Yin et al. | ICLR | Code |
2019 | Neural Networks for Modeling Source Code Edits | Zhao et al. | arXiv | Code |
2020 | Cc2vec: Distributed representations of code changes | Hoang et al. | ICSE | Code |
2019 | On Learning Meaningful Code Changes via Neural Machine Translation | Tufano et al. | ICSE | Code |
2021 | Copy that! Editing Sequences by Copying Spans | Panthaplackel et al. | AAAI | Code |
2020 | A Structural Model for Contextual Code Changes | Brody et al. | OOPSLA | Code |
2021 | Learning Structural Edits via Incremental Tree Transformations | Yao et al. | ICLR | Code |
To leverage multiple code features at once, several approaches have been developed that represent source code in a hybrid fashion.
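A minimal sketch of the idea is late fusion: encode each modality separately, then combine the vectors (here random NumPy vectors stand in for trained encoders, and the projection matrix would be learned in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two modality encoders, e.g., a token-sequence encoder
# and an AST/graph encoder (random vectors for illustration).
token_embedding = rng.standard_normal(128)
ast_embedding = rng.standard_normal(128)

# Late fusion: concatenate the modality vectors, then project.
hybrid = np.concatenate([token_embedding, ast_embedding])
projection = rng.standard_normal((256, 64))  # learned in a real model
code_vector = np.tanh(hybrid @ projection)

print(code_vector.shape)  # (64,)
```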
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Deep code search | Gu et al. | ICSE | Code |
2016 | Deep learning code fragments for code clone detection | White et al. | ASE | Code |
2018 | Deepsim: deep learning code functional similarity | Zhao et al. | FSE | Code |
2018 | Improving automatic source code summarization via deep reinforcement learning | Wan et al. | ASE | Code |
2019 | Multi-modal attention network learning for semantic source code retrieval | Wan et al. | ASE | Code |
Classifying source code into different classes (e.g., by functionality or programming language) is important for many tasks, such as code categorization, programming language identification, code prediction, and vulnerability detection. Various studies have been conducted to classify code snippets into categories based on their functionalities.
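In its simplest flavor, a character n-gram classifier can separate languages from a handful of labelled snippets (a toy scikit-learn sketch; the data is illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy snippets labelled by language; real datasets are far larger.
snippets = [
    "def main():\n    print('hi')",
    "import os\nos.listdir('.')",
    "public static void main(String[] args) { }",
    'System.out.println("hi");',
]
labels = ["Python", "Python", "Java", "Java"]

# Character n-grams suit code, where punctuation is highly informative.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(),
)
clf.fit(snippets, labels)
print(clf.predict(['println("x");']))  # likely ['Java']
```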
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2016 | Convolutional neural networks over tree structures for programming language processing | Mou et al. | AAAI | Code |
2018 | Adapting neural text classification for improved software categorization | Leclair et al. | ICSME | Code |
2019 | Bilateral dependency neural networks for cross-language algorithm classification | Bui et al. | SANER | Code |
2018 | SCC: Automatic classification of code snippets | Alreshedy et al. | SCAM | Code |
2020 | SCC++: predicting the programming language of questions and snippets of Stack Overflow | Alrashedy et al. | JSS | Code |
Detecting vulnerabilities or bugs in programs is essential for assuring software quality and saves much effort and time in software development. Although many tools have been developed for vulnerability detection, e.g., Clang Static Analyzer, Coverity, Fortify, Flawfinder, Infer, and SVF, most of them are based on static analysis. Recently, a growing number of works have employed deep learning to discover vulnerabilities.
Code completion is a core feature of most modern IDEs. It offers developers a list of likely code hints based on the available context.
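At its statistical core, completion can be as simple as ranking the tokens most often seen after the current one; the bigram sketch below (toy corpus, illustration only) shows the baseline that neural models generalize with far richer context:

```python
from collections import Counter, defaultdict

# Train a toy bigram model over a tokenized "corpus" of one line.
corpus_tokens = "for i in range ( n ) : print ( i )".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
    bigrams[prev][nxt] += 1

def complete(prev_token, k=3):
    """Suggest the k tokens most frequently observed after prev_token."""
    return [tok for tok, _ in bigrams[prev_token].most_common(k)]

print(complete("("))  # ['n', 'i']
```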
Programming languages with dynamic typing, such as Python and JavaScript, allow rapid prototyping and can dramatically reduce development time. However, without type information, unexpected run-time errors are prone to occur, which may introduce bugs and produce low-quality code. Current work on type inference, which aims to infer variable types automatically, mainly falls into two categories: static-analysis-based and learning-based approaches.
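The intuition behind the learning-based category can be caricatured with a purely lexical predictor (the cue table is hypothetical; real systems learn from identifier names, usage contexts, and dataflow rather than a hand-written lookup):

```python
# Hypothetical lexical cues mapping name fragments to likely types.
TYPE_HINTS = {
    "count": "int", "num": "int", "index": "int",
    "name": "str", "msg": "str", "path": "str",
    "is": "bool", "has": "bool", "flag": "bool",
    "items": "list", "values": "list",
}

def predict_type(var_name):
    """Guess a variable's type from fragments of its name."""
    for cue, typ in TYPE_HINTS.items():
        if cue in var_name.lower():
            return typ
    return "Any"

print(predict_type("user_count"))  # int
print(predict_type("is_valid"))    # bool
print(predict_type("file_path"))   # str
```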
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | MaxSMT-based type inference for Python 3 | Hassan et al. | CAV | Code |
2004 | Faster than C: Static type inference with Starkiller | Salib et al. | PyCon Proceedings | Code |
2015 | Predicting program properties from big code | Raychev et al. | POPL | Code |
2016 | Python probabilistic type inference with natural language support | Xu et al. | FSE | Code |
2018 | Deep learning type inference | Hellendoorn et al. | FSE | Code |
2019 | NL2Type: Inferring JavaScript Function Types from Natural Language Information | Malik et al. | ICSE | Code |
2020 | Typewriter: Neural type prediction with search-based validation | Pradel et al. | FSE | Code |
2020 | Lambdanet: Probabilistic type inference using graph neural networks | Wei et al. | ICLR | Code |
2020 | OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints | Pandi et al. | arXiv | Code |
2020 | Typilus: neural type hints | Allamanis et al. | PLDI | Code |
2021 | Type4Py: Deep Similarity Learning-Based Type Inference for Python | Mir et al. | arXiv | Code |
Code search aims to retrieve a code snippet given a natural-language query (NL-to-code) or a code query (code-to-code). NL-to-code search retrieves code fragments from a codebase whose semantics are similar to the natural-language query. In contrast, the input of code-to-code search is source code rather than a natural-language description, and the objective is to find code snippets in a codebase that are semantically related to the input code.
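A minimal NL-to-code baseline (a sketch assuming scikit-learn and a three-function toy codebase) matches the query against raw code tokens by TF-IDF cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy codebase to search over (illustration only).
codebase = [
    "def read_file(path): return open(path).read()",
    "def sort_list(xs): return sorted(xs)",
    "def send_request(url): return requests.get(url)",
]
query = "sort a list of numbers"

# Vectorize code and query in the same space, then rank by similarity.
vec = TfidfVectorizer(token_pattern=r"[A-Za-z]+")
matrix = vec.fit_transform(codebase + [query])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
print(codebase[scores.argmax()])  # the sort_list function
```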
Numerous software engineering activities, including code reuse, vulnerability detection, and code search, rely on detecting similar code snippets (code clones). There are four main types of code clones: Type-1 clones are identical except for whitespace, layout, and comments; Type-2 clones are identical except for variable, type, literal, and function names; Type-3 clones are almost identical except for a few statements that have been added or removed; and Type-4 clones are heterogeneous code snippets with similar functionality but differing code structures or syntax. Various works have been proposed to handle the different clone types.
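The sketch below checks the first two types via normalization (a crude approximation: it renames identifiers only, whereas full Type-2 detection also normalizes types and literals):

```python
import re

def normalize(code, rename_identifiers=False):
    """Normalize a snippet for clone checking: Type-1 ignores comments
    and whitespace; Type-2 additionally ignores identifier names."""
    code = re.sub(r"#.*", "", code)           # strip comments
    code = re.sub(r"\s+", " ", code).strip()  # collapse whitespace
    if rename_identifiers:                    # naive: renames keywords too
        code = re.sub(r"\b[A-Za-z_]\w*\b", "ID", code)
    return code

a = "total = price + tax  # sum"
b = "amount = cost + fee"
print(normalize(a) == normalize(b))              # False: not Type-1 clones
print(normalize(a, True) == normalize(b, True))  # True: Type-2 clones
```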
Inspired by text generation work in NLP, many approaches have been put forward to generate a description or function name that summarizes the semantics of source code.
Translating programs from a deprecated programming language to a modern one is important for software maintenance. Many methods based on neural machine translation have been proposed for program translation.
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2013 | Lexical statistical machine translation for language migration | Nguyen et al. | FSE | Code |
2015 | Using machine translation for converting python 2 to python 3 code | Aggarwal et al. | Technical Report | Code |
2015 | Divide-and-conquer approach for multi-phase statistical migration for source code | Nguyen et al. | ASE | Code |
2018 | Tree-to-tree neural networks for program translation | Chen et al. | ICLR | Code |
2017 | DeepAM: Migrate APIs with Multi-Modal Sequence to Sequence Learning | Gu et al. | IJCAI | Code |
2020 | Unsupervised translation of programming languages | Lachaux et al. | NeurIPS | Code |
Program synthesis is the task of generating source code from high-level specifications (e.g., program descriptions or input-output examples). Given natural-language inputs, current approaches typically generate programs through machine translation.
Automatically localizing and repairing bugs in programs can save much manual effort in software development. One line of work learns the patterns of how programmers edit source code, which can be used to check for syntax errors during compilation. Another line of work focuses on repairing programs by generating patches.
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Learning to optimize tensor programs | Chen et al. | NeurIPS | Code |
2020 | FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System | Zheng et al. | ASPLOS | Code |
2020 | Ansor: Generating high-performance tensor programs for deep learning | Zheng et al. | OSDI | Code |
2013 | Predictive modeling in a polyhedral optimization space | Park et al. | IJPP | Code |
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2021 | ProGraML: Graph-based Deep Learning for Program Optimization and Analysis | Cummins et al. | ICML | Code |
2020 | Deep program structure modeling through multi-relational graph-based learning | Ye et al. | PACT | Code |
2020 | Designing PairBuddy – A Conversational Agent for Pair Programming | Robe et al. | arXiv | Code |
2021 | On the Evaluation of Commit Message Generation Models: An Experimental Study | Tao et al. | ICSME | Code |
2018 | Large-scale and language-oblivious code authorship identification | Abuhamad et al. | CCS | Code |
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2019 | Open Vocabulary Learning on Source Code with a Graph-Structured Cache | Cvitkovic et al. | ICML | Code |
2020 | Big code != big vocabulary: Open-vocabulary models for source code | Karampatsis et al. | ICSE | Code |
2021 | A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code | Chirkova et al. | NAACL | Code |
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2021 | Disentangled Code Representation Learning for Multiple Programming Languages | Zhang et al. | ACL | Code |
2022 | Multilingual training for Software Engineering | Ahmed et al. | ICSE | Code |
2019 | Clcdsa: cross language code clone detection using syntactical features and api documentation | Nafi et al. | ASE | Code |
2019 | Bilateral dependency neural networks for cross-language algorithm classification | Bui et al. | SANER | Code |
2019 | SAR: learning cross-language API mappings with little knowledge | Bui et al. | FSE | Code |
2021 | Interactive Cross-language Code Retrieval with Auto-Encoders | Chen et al. | ASE | Code |
2022 | Cross-Domain Deep Code Search with Few-Shot Meta Learning | Chai et al. | ICSE | Code |
2022 | Cross-Language Binary-Source Code Matching with Intermediate Representations | Gui et al. | SANER | Code |
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2021 | Vulnerability Detection with Fine-grained Interpretations | Li et al. | FSE | Code |
2021 | Interpreting deep learning-based vulnerability detector predictions based on heuristic searching | Zou et al. | TOSEM | Code |
2021 | Interpretable Program Synthesis | Zhang et al. | CHI | Code |
2021 | PyExplainer: Explaining the Predictions of Just-In-Time Defect Models | Pornprasit et al. | ASE | Code |