This document presents a curated collection of research papers, datasets, and tools dedicated to the application of machine learning techniques in code intelligence.
Code intelligence applies machine learning to extract knowledge from large-scale code repositories, with the aim of developing intelligent tools that improve the quality and productivity of computer programming.
The list includes the publication year for each paper (or the submission year for pre-prints and arXiv articles), the name of the first author, and the publication venue. Additionally, if the code associated with the research is available, it is linked via a corresponding hyperlink.
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | A Survey of Machine Learning for Big Code and Naturalness | Allamanis et al. | CSUR | Code |
2021 | A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research | Watson et al. | TOSEM | Code |
2020 | Synergy between Machine/Deep Learning and Software Engineering: How Far Are We? | Wang et al. | arXiv | Code |
2020 | A Survey on Deep Learning for Software Engineering | Yang et al. | CSUR | Code |
2020 | Deep Learning & Software Engineering: State of Research and Future Directions | Devanbu et al. | arXiv | Code |
2021 | CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation | Lu et al. | arXiv | Code |
To represent source code, we need to first determine what to represent. Various work has proposed to extract code features from multiple perspectives, including code tokens, intermediate representation, abstract syntax tree, as well as many kinds of flow graphs.
Code tokens, which shape the textual appearance of source code, comprise function names, keywords, and variable identifiers. These tokens are simple yet effective for representing the semantics of programs. Most approaches process code by breaking the program into a sequence of tokens based on specific delimiters, such as spaces or the capitalization patterns within identifiers (e.g., camel-case identifiers like SortList and intArray).
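As a concrete illustration (a minimal sketch, not any particular paper's tokenizer), the following snippet splits identifiers into sub-tokens on snake_case and camel-case boundaries:

```python
import re

def subtokenize(identifier):
    """Split an identifier into sub-tokens on snake_case and
    camel-case boundaries, a common pre-processing step."""
    parts = identifier.replace("_", " ")
    # Break "intArray" -> "int Array" and "HTMLParser" -> "HTML Parser".
    parts = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", parts)
    parts = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", parts)
    return parts.lower().split()

print(subtokenize("SortList"))  # ['sort', 'list']
print(subtokenize("intArray"))  # ['int', 'array']
```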
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2017 | Synthesizing benchmarks for predictive modeling | Cummins et al. | CGO | Code |
2015 | Toward deep learning software repositories | White et al. | ICSE | Code |
2016 | Summarizing source code using a neural attention model | Iyer et al. | ACL | Code |
2016 | A convolutional attention network for extreme summarization of source code | Allamanis et al. | ICML | Code |
2019 | Open Vocabulary Learning on Source Code with a Graph-Structured Cache | Cvitkovic et al. | ICML | Code |
2021 | A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code | Chirkova et al. | NAACL | Code |
2020 | Learning and Evaluating Contextual Embedding of Source Code | Kanade et al. | ICML | Code |
2020 | Codebert: A pre-trained model for programming and natural languages | Feng et al. | EMNLP | Code |
2020 | Big code != big vocabulary: Open-vocabulary models for source code | Karampatsis et al. | ICSE | Code |
Multiple methods have been proposed to analyze the API sequences in programs. One line of work mines API usage patterns from large code corpora to demonstrate how an API is used. Another line of work is API recommendation, which aims to recommend or generate a sequence of APIs for users.
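As a toy illustration of the pattern-mining flavor (the corpus and `min_support` threshold are made up for this sketch; real approaches mine millions of methods), frequent adjacent API-call pairs can be counted directly:

```python
from collections import Counter

# Each entry is the API-call sequence extracted from one method (toy data).
corpus = [
    ["File.open", "File.read", "File.close"],
    ["File.open", "File.read", "Parser.parse"],
    ["Socket.connect", "Socket.send", "Socket.close"],
]

def frequent_api_bigrams(sequences, min_support=2):
    """Count adjacent API-call pairs and keep those occurring at least
    `min_support` times -- a crude stand-in for usage-pattern mining."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return {pair: c for pair, c in counts.items() if c >= min_support}

print(frequent_api_bigrams(corpus))  # {('File.open', 'File.read'): 2}
```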
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2015 | How can I use this method? | Moreno et al. | ICSE | Code |
2017 | An unsupervised approach for discovering relevant tutorial fragments for APIs | Jiang et al. | ICSE | Code |
2017 | DeepAM: Migrate APIs with Multi-Modal Sequence to Sequence Learning | Gu et al. | IJCAI | Code |
2016 | Deep API learning | Gu et al. | FSE | Code |
2017 | Exploring API Embedding for API Usages and Applications | Nguyen et al. | ICSE | Code |
2019 | SAR: learning cross-language API mappings with little knowledge | Bui et al. | FSE | Code |
The Abstract Syntax Tree (AST) is a tree-structured intermediate representation of code that describes the syntactic structure of a program. In an AST, the leaf nodes typically correspond to the tokens of variables and method names in the source code, while the non-leaf nodes represent the syntactic structure of the code, such as function definitions and branch statements. As a result, ASTs are useful for capturing both the lexical information (e.g., variable names) and the syntactic structure of the source code. In practice, ASTs can be extracted with several open-source tools, e.g., the tree-sitter parser and LLVM Clang.
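For a self-contained example (using Python's built-in `ast` module rather than the external parsers named above), the split between structural non-leaf nodes and lexical leaves is easy to see:

```python
import ast

source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)

# Non-leaf nodes carry syntactic structure (FunctionDef, Return, BinOp);
# identifier leaves carry lexical information (the names a and b).
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        print("function:", node.name)   # function: add
    elif isinstance(node, ast.Name):
        print("identifier:", node.id)   # identifier: a, then b
```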
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2016 | Convolutional neural networks over tree structures for programming language processing | Mou et al. | AAAI | Code |
2020 | Modeling programs hierarchically with stack-augmented LSTM | Liu et al. | JSS | Code |
2019 | A novel neural source code representation based on abstract syntax tree | Zhang et al. | ICSE | Code |
2018 | Deep code comment generation | Hu et al. | ICPC | Code |
2019 | code2vec: Learning distributed representations of code | Alon et al. | PLDI | Code |
2019 | code2seq: Generating Sequences from Structured Representations of Code | Alon et al. | ICLR | Code |
2020 | Structural language models of code | Alon et al. | ICML | Code |
2017 | A syntactic neural model for general-purpose code generation | Yin et al. | ACL | Code |
2018 | Tree-to-tree neural networks for program translation | Chen et al. | ICLR | Code |
The Intermediate Representation (IR) is a well-formed structure, independent of programming languages and machine architectures, that compilers use to accurately represent source code during its translation into low-level machine code. The IR can express the operations of the target machine. It is natural to enhance code embeddings with IRs, whose limited vocabulary significantly alleviates the out-of-vocabulary (OOV) issue.
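The vocabulary benefit can be made concrete with a small sketch (the IR text is hand-written for illustration; a real pipeline would obtain it from a compiler, e.g., via `clang -S -emit-llvm`). Canonicalizing the named values of LLVM-style IR collapses arbitrary identifiers into a tiny, reusable token set:

```python
import re

# Hand-written LLVM-style IR, for illustration only.
ir = """
%1 = load i32, i32* %x
%2 = add i32 %1, 1
store i32 %2, i32* %x
"""

def normalize_registers(ir_text):
    """Map each named value to a canonical ID so the token vocabulary
    stays small -- the property that curbs out-of-vocabulary tokens."""
    mapping = {}
    def rename(match):
        name = match.group(0)
        mapping.setdefault(name, "%v{}".format(len(mapping)))
        return mapping[name]
    return re.sub(r"%[A-Za-z0-9_.]+", rename, ir_text)

print(normalize_registers(ir))
# %v0 = load i32, i32* %v1
# %v2 = add i32 %v0, 1
# store i32 %v2, i32* %v1
```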
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Neural code comprehension: A learnable representation of code semantics | Ben-Nun et al. | NeurIPS | Code |
2020 | IR2Vec: LLVM IR based Scalable Program Embeddings | VenkataKeerthy et al. | TACO | Code |
2020 | Compiler-based graph representations for deep learning models of code | Brauckmann et al. | CC | Code |
2021 | ProGraML: Graph-based Deep Learning for Program Optimization and Analysis | Cummins et al. | ICML | Code |
2021 | How could Neural Networks understand Programs? | Peng et al. | ICML | Code |
Many approaches have been proposed to convert programs into graphs that better capture the rich structural information within them, including the Control-Flow Graph (CFG), Data-Flow Graph (DFG), and Code Property Graph (CPG). The CFG represents the computation and control flow of a program; each node represents a basic block, and each edge represents a control-flow transition between blocks. The DFG is a directed graph that illustrates data relationships among operations: each node has input and output data ports, and each edge links an output port of one node to an input port of another.
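The sketch below encodes a minimal CFG as an adjacency map of basic blocks (a toy structure written by hand, not the output of a real analyzer) and traverses it:

```python
# CFG for:  if (x > 0) { y = 1 } else { y = -1 }; print(y)
# Nodes are basic blocks; edges are control-flow transitions.
cfg = {
    "entry": ["then", "else"],  # conditional branch on x > 0
    "then":  ["exit"],          # y = 1
    "else":  ["exit"],          # y = -1
    "exit":  [],                # print(y)
}

def reachable(start="entry"):
    """Depth-first traversal -- the kind of explicit walk that graph
    neural networks replace with learned message passing over edges."""
    seen, stack = set(), [start]
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(cfg[block])
    return seen

print(sorted(reachable()))  # ['else', 'entry', 'exit', 'then']
```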
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Learning to represent programs with graphs | Allamanis et al. | ICLR | Code |
2017 | Smartpaste: Learning to adapt source code | Allamanis et al. | arXiv | Code |
2018 | Generative code modeling with graphs | Brockschmidt et al. | ICLR | Code |
2020 | Flow2Vec: value-flow-based precise code embedding | Sui et al. | OOPSLA | Code |
2021 | ProGraML: Graph-based Deep Learning for Program Optimization and Analysis | Cummins et al. | ICML | Code |
2021 | PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair | Chen et al. | NeurIPS | Code |
2017 | Intelligent development environment and software knowledge graph | Lin et al. | JCST | Code |
2020 | Graph4code: A machine interpretable knowledge graph for code | Abdelaziz et al. | arXiv | Code |
2020 | Exploiting Code Knowledge Graph for Bug Localization via Bi-directional Attention | Zhang et al. | ICPC | Code |
In addition to the code features already widely explored above, several other kinds of features, such as symbolic execution traces and code edits, are used in specific scenarios.
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Code vectors: Understanding programs through embedded abstracted symbolic traces | Henkel et al. | FSE | Code |
2019 | Learning to Represent Edits | Yin et al. | ICLR | Code |
2019 | Neural Networks for Modeling Source Code Edits | Zhao et al. | arXiv | Code |
2020 | Cc2vec: Distributed representations of code changes | Hoang et al. | ICSE | Code |
2019 | On Learning Meaningful Code Changes via Neural Machine Translation | Tufano et al. | ICSE | Code |
2021 | Copy that! Editing Sequences by Copying Spans | Panthaplackel et al. | AAAI | Code |
2020 | A Structural Model for Contextual Code Changes | Brody et al. | OOPSLA | Code |
2021 | Learning Structural Edits via Incremental Tree Transformations | Yao et al. | ICLR | Code |
To leverage multiple code features at once, several approaches have been developed that represent source code in a hybrid fashion.
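A minimal sketch of the idea is late fusion: encode each modality separately, then combine the vectors (here random NumPy vectors stand in for trained encoders, and the projection matrix would be learned in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two modality encoders, e.g., a token-sequence encoder
# and an AST/graph encoder (random vectors for illustration).
token_embedding = rng.standard_normal(128)
ast_embedding = rng.standard_normal(128)

# Late fusion: concatenate the modality vectors, then project.
hybrid = np.concatenate([token_embedding, ast_embedding])
projection = rng.standard_normal((256, 64))  # learned in a real model
code_vector = np.tanh(hybrid @ projection)

print(code_vector.shape)  # (64,)
```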
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Deep code search | Gu et al. | ICSE | Code |
2016 | Deep learning code fragments for code clone detection | White et al. | ASE | Code |
2018 | Deepsim: deep learning code functional similarity | Zhao et al. | FSE | Code |
2018 | Improving automatic source code summarization via deep reinforcement learning | Wan et al. | ASE | Code |
2019 | Multi-modal attention network learning for semantic source code retrieval | Wan et al. | ASE | Code |
Classifying source code into different classes (e.g., by functionality or programming language) is important for many tasks, such as code categorization, programming language identification, code prediction, and vulnerability detection. Various studies have been conducted to classify code snippets into categories based on their functionalities.
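In its simplest flavor, a character n-gram classifier can separate languages from a handful of labelled snippets (a toy scikit-learn sketch; the data is illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy snippets labelled by language; real datasets are far larger.
snippets = [
    "def main():\n    print('hi')",
    "import os\nos.listdir('.')",
    "public static void main(String[] args) { }",
    'System.out.println("hi");',
]
labels = ["Python", "Python", "Java", "Java"]

# Character n-grams suit code, where punctuation is highly informative.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(),
)
clf.fit(snippets, labels)
print(clf.predict(['println("x");']))  # likely ['Java']
```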
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2016 | Convolutional neural networks over tree structures for programming language processing | Mou et al. | AAAI | Code |
2018 | Adapting neural text classification for improved software categorization | Leclair et al. | ICSME | Code |
2019 | Bilateral dependency neural networks for cross-language algorithm classification | Bui et al. | SANER | Code |
2018 | SCC: Automatic classification of code snippets | Alreshedy et al. | SCAM | Code |
2020 | SCC++: predicting the programming language of questions and snippets of Stack Overflow | Alrashedy et al. | JSS | Code |
Detecting vulnerabilities or bugs in programs is essential for assuring software quality and saves much effort and time in software development. Although many tools have been developed for vulnerability detection, e.g., Clang Static Analyzer, Coverity, Fortify, Flawfinder, Infer, and SVF, most of them are based on static analysis. Recently, a growing number of works have employed deep learning to discover vulnerabilities.
Code completion is a core feature of most modern IDEs. It offers developers a list of likely code hints based on the available context.
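At its statistical core, completion can be as simple as ranking the tokens most often seen after the current one; the bigram sketch below (toy corpus, illustration only) shows the baseline that neural models generalize with far richer context:

```python
from collections import Counter, defaultdict

# Train a toy bigram model over a tokenized "corpus" of one line.
corpus_tokens = "for i in range ( n ) : print ( i )".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
    bigrams[prev][nxt] += 1

def complete(prev_token, k=3):
    """Suggest the k tokens most frequently observed after prev_token."""
    return [tok for tok, _ in bigrams[prev_token].most_common(k)]

print(complete("("))  # ['n', 'i']
```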
Programming languages with dynamic typing, such as Python and JavaScript, allow rapid prototyping and can dramatically reduce development time. However, without type information, unexpected run-time errors are prone to occur, which may introduce bugs and produce low-quality code. Current work on type inference, which aims to infer variable types automatically, mainly falls into two categories: static-analysis-based and learning-based approaches.
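The intuition behind the learning-based category can be caricatured with a purely lexical predictor (the cue table is hypothetical; real systems learn from identifier names, usage contexts, and dataflow rather than a hand-written lookup):

```python
# Hypothetical lexical cues mapping name fragments to likely types.
TYPE_HINTS = {
    "count": "int", "num": "int", "index": "int",
    "name": "str", "msg": "str", "path": "str",
    "is": "bool", "has": "bool", "flag": "bool",
    "items": "list", "values": "list",
}

def predict_type(var_name):
    """Guess a variable's type from fragments of its name."""
    for cue, typ in TYPE_HINTS.items():
        if cue in var_name.lower():
            return typ
    return "Any"

print(predict_type("user_count"))  # int
print(predict_type("is_valid"))    # bool
print(predict_type("file_path"))   # str
```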
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | MaxSMT-based type inference for Python 3 | Hassan et al. | CAV | Code |
2004 | Faster than C: Static type inference with Starkiller | Salib et al. | PyCon Proceedings | Code |
2015 | Predicting program properties from big code | Raychev et al. | POPL | Code |
2016 | Python probabilistic type inference with natural language support | Xu et al. | FSE | Code |
2018 | Deep learning type inference | Hellendoorn et al. | FSE | Code |
2019 | NL2Type: Inferring JavaScript Function Types from Natural Language Information | Malik et al. | ICSE | Code |
2020 | Typewriter: Neural type prediction with search-based validation | Pradel et al. | FSE | Code |
2020 | Lambdanet: Probabilistic type inference using graph neural networks | Wei et al. | ICLR | Code |
2020 | OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints | Pandi et al. | arXiv | Code |
2020 | Typilus: neural type hints | Allamanis et al. | PLDI | Code |
2021 | Type4Py: Deep Similarity Learning-Based Type Inference for Python | Mir et al. | arXiv | Code |
Code search aims to retrieve a code snippet given a natural-language query (NL-to-code) or a code query (code-to-code). NL-to-code search retrieves code fragments from a codebase whose semantics are similar to the natural-language query. In contrast, the input of code-to-code search is source code rather than a natural-language description, and the objective is to find code snippets in a codebase that are semantically related to the input code.
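A minimal NL-to-code baseline (a sketch assuming scikit-learn and a three-function toy codebase) matches the query against raw code tokens by TF-IDF cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy codebase to search over (illustration only).
codebase = [
    "def read_file(path): return open(path).read()",
    "def sort_list(xs): return sorted(xs)",
    "def send_request(url): return requests.get(url)",
]
query = "sort a list of numbers"

# Vectorize code and query in the same space, then rank by similarity.
vec = TfidfVectorizer(token_pattern=r"[A-Za-z]+")
matrix = vec.fit_transform(codebase + [query])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
print(codebase[scores.argmax()])  # the sort_list function
```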
Numerous software engineering activities, including code reuse, vulnerability detection, and code search, rely on detecting similar code snippets (code clones). There are four main types of code clones: Type-1 clones are identical except for whitespace, layout, and comments; Type-2 clones are identical except for variable, type, literal, and function names; Type-3 clones are almost identical except for a few statements that have been added or removed; and Type-4 clones are heterogeneous code snippets with similar functionality but differing code structures or syntax. Various works have been proposed to handle the different clone types.
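The sketch below checks the first two types via normalization (a crude approximation: it renames identifiers only, whereas full Type-2 detection also normalizes types and literals):

```python
import re

def normalize(code, rename_identifiers=False):
    """Normalize a snippet for clone checking: Type-1 ignores comments
    and whitespace; Type-2 additionally ignores identifier names."""
    code = re.sub(r"#.*", "", code)           # strip comments
    code = re.sub(r"\s+", " ", code).strip()  # collapse whitespace
    if rename_identifiers:                    # naive: renames keywords too
        code = re.sub(r"\b[A-Za-z_]\w*\b", "ID", code)
    return code

a = "total = price + tax  # sum"
b = "amount = cost + fee"
print(normalize(a) == normalize(b))              # False: not Type-1 clones
print(normalize(a, True) == normalize(b, True))  # True: Type-2 clones
```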
Inspired by text generation work in NLP, many approaches have been put forward to generate a description or function name that summarizes the semantics of source code.
Translating programs from a deprecated programming language to a modern one is important for software maintenance. Many methods based on neural machine translation have been proposed for program translation.
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2013 | Lexical statistical machine translation for language migration | Nguyen et al. | FSE | Code |
2015 | Using machine translation for converting python 2 to python 3 code | Aggarwal et al. | Technical Report | Code |
2015 | Divide-and-conquer approach for multi-phase statistical migration for source code | Nguyen et al. | ASE | Code |
2018 | Tree-to-tree neural networks for program translation | Chen et al. | ICLR | Code |
2017 | DeepAM: Migrate APIs with Multi-Modal Sequence to Sequence Learning | Gu et al. | IJCAI | Code |
2020 | Unsupervised translation of programming languages | Lachaux et al. | NeurIPS | Code |
Program synthesis is the task of generating source code from high-level specifications (e.g., program descriptions or input-output examples). Given natural-language inputs, current approaches typically generate programs through machine translation.
Automatically localizing and repairing bugs in programs can save much manual effort in software development. One line of work learns the patterns of how programmers edit source code, which can be used to check for syntax errors during compilation. Another line of work focuses on repairing programs by generating patches.
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2018 | Learning to optimize tensor programs | Chen et al. | NeurIPS | Code |
2020 | FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System | Zheng et al. | ASPLOS | Code |
2020 | Ansor: Generating high-performance tensor programs for deep learning | Zheng et al. | OSDI | Code |
2013 | Predictive modeling in a polyhedral optimization space | Park et al. | IJPP | Code |
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2021 | ProGraML: Graph-based Deep Learning for Program Optimization and Analysis | Cummins et al. | ICML | Code |
2020 | Deep program structure modeling through multi-relational graph-based learning | Ye et al. | PACT | Code |
2020 | Designing PairBuddy – A Conversational Agent for Pair Programming | Robe et al. | arXiv | Code |
2021 | On the Evaluation of Commit Message Generation Models: An Experimental Study | Tao et al. | ICSME | Code |
2018 | Large-scale and language-oblivious code authorship identification | Abuhamad et al. | CCS | Code |
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2019 | Open Vocabulary Learning on Source Code with a Graph-Structured Cache | Cvitkovic et al. | ICML | Code |
2020 | Big code != big vocabulary: Open-vocabulary models for source code | Karampatsis et al. | ICSE | Code |
2021 | A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code | Chirkova et al. | NAACL | Code |
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2021 | Disentangled Code Representation Learning for Multiple Programming Languages | Zhang et al. | ACL | Code |
2022 | Multilingual training for Software Engineering | Ahmed et al. | ICSE | Code |
2019 | Clcdsa: cross language code clone detection using syntactical features and api documentation | Nafi et al. | ASE | Code |
2019 | Bilateral dependency neural networks for cross-language algorithm classification | Bui et al. | SANER | Code |
2019 | SAR: learning cross-language API mappings with little knowledge | Bui et al. | FSE | Code |
2021 | Interactive Cross-language Code Retrieval with Auto-Encoders | Chen et al. | ASE | Code |
2022 | Cross-Domain Deep Code Search with Few-Shot Meta Learning | Chai et al. | ICSE | Code |
2022 | Cross-Language Binary-Source Code Matching with Intermediate Representations | Gui et al. | SANER | Code |
Year | Title | Author | Venue | Code |
---|---|---|---|---|
2021 | Vulnerability Detection with Fine-grained Interpretations | Li et al. | FSE | Code |
2021 | Interpreting deep learning-based vulnerability detector predictions based on heuristic searching | Zou et al. | TOSEM | Code |
2021 | Interpretable Program Synthesis | Zhang et al. | CHI | Code |
2021 | PyExplainer: Explaining the Predictions of Just-In-Time Defect Models | Pornprasit et al. | ASE | Code |