Source code analysis dataset [PDF]
The data in this article pair source code with three artifacts from 108,568 projects downloaded from GitHub, each having a redistributable license and at least 10 stars.
Ben Gelman+3 more
doaj +2 more sources
Vulnerability Prediction From Source Code Using Machine Learning
As the role of information and communication technologies gradually increases in our lives, software security becomes a major issue in protecting against malicious attempts and avoiding irreparable damage to the system.
Zeki Bilgin+5 more
doaj +2 more sources
The Stack: 3 TB of permissively licensed source code [PDF]
Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI), not only for natural language processing but also for code understanding and generation.
Denis Kocetkov+12 more
semanticscholar +1 more source
DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection [PDF]
We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source code from the corresponding projects.
Yizheng Chen+4 more
semanticscholar +1 more source
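The commit-extraction step described in the snippet could be sketched as a keyword filter over commit messages. This is a simplification and an assumption: DiverseVul's actual pipeline crawls security issue websites and is considerably more involved, and the keyword list below is hypothetical.

```python
import re

# Security-related keywords often used to spot vulnerability-fixing
# commits; this list is an illustrative assumption, not DiverseVul's.
SECURITY_KEYWORDS = re.compile(
    r"cve-\d{4}-\d+|vulnerab|overflow|use[- ]after[- ]free|xss|injection",
    re.IGNORECASE,
)

def is_vulnerability_fix(commit_message: str) -> bool:
    """Flag a commit whose message mentions a security issue."""
    return bool(SECURITY_KEYWORDS.search(commit_message))

commits = [
    "Fix buffer overflow in parser (CVE-2021-1234)",
    "Update README formatting",
    "Prevent SQL injection in login handler",
]
flagged = [msg for msg in commits if is_vulnerability_fix(msg)]
```

In a real pipeline the flagged commits would then be resolved to their parent projects so that both the pre-fix (vulnerable) and post-fix versions of the touched functions can be extracted.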
An Empirical Comparison of Pre-Trained Models of Source Code [PDF]
While a large number of pre-trained models of source code have been successfully developed and applied to a variety of software engineering (SE) tasks in recent years, our understanding of these pre-trained models is arguably fairly limited.
Changan Niu+5 more
semanticscholar +1 more source
NatGen: generative pre-training by “naturalizing” source code [PDF]
Pre-trained Generative Language models (e.g., PLBART, CodeT5, SPT-Code) for source code yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to ...
Saikat Chakraborty+4 more
semanticscholar +1 more source
VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection [PDF]
This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects.
Hazim Hanif, S. Maffeis
semanticscholar +1 more source
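The custom tokenisation the snippet mentions could look, at its simplest, like a regex lexer that splits C/C++ source into identifiers, literals, and operators while discarding comments. This is only a toy sketch under that assumption; VulBERTa's actual pipeline is more elaborate.

```python
import re

# A toy lexer for C-like source. VulBERTa's real tokeniser is more
# sophisticated; this sketch only illustrates the general idea.
TOKEN_RE = re.compile(
    r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)       # comments (dropped)
    |(?P<string>"(?:\\.|[^"\\])*")        # string literals
    |(?P<number>\d+(?:\.\d+)?)            # numeric literals
    |(?P<ident>[A-Za-z_]\w*)              # identifiers and keywords
    |(?P<op>[{}()\[\];,.]|[-+*/%=<>!&|]+) # punctuation and operators
    """,
    re.VERBOSE | re.DOTALL,
)

def tokenize(code: str) -> list[str]:
    """Return surface tokens, skipping comments and whitespace."""
    return [m.group(0) for m in TOKEN_RE.finditer(code)
            if m.lastgroup != "comment"]

tokens = tokenize('int x = add(a, 2); // sum')
```

The resulting token stream would then be mapped to vocabulary IDs and fed to the RoBERTa encoder during pre-training.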
CoditT5: Pretraining for Source Code and Natural Language Editing [PDF]
Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits.
Jiyang Zhang+4 more
semanticscholar +1 more source
A Transformer-based Approach for Source Code Summarization [PDF]
Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range ...
Wasi Uddin Ahmad+3 more
semanticscholar +1 more source
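The pairwise token modelling the snippet refers to is scaled dot-product self-attention: every token's representation becomes a weighted mix of all token vectors, so distant tokens can interact directly. A minimal sketch, omitting the learned query/key/value projections and multiple heads that the actual model uses:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(embeddings):
    """Scaled dot-product self-attention over token embeddings.

    Each output vector is a convex combination of all token vectors,
    so every token attends to every other regardless of distance.
    Toy version: no learned Q/K/V projections, single head.
    """
    d = len(embeddings[0])
    out = []
    for q in embeddings:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, embeddings))
                    for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)
```

Because the attention weights in each row sum to one, every output coordinate stays within the range spanned by the input embeddings.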
Semantic similarity metrics for evaluating source code summarization [PDF]
Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs.
S. Haque+3 more
semanticscholar +1 more source
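A semantic similarity metric of the kind the snippet describes could, in its simplest form, be a cosine similarity between vector representations of the generated and reference summaries. The bag-of-words vectors below are a stand-in assumption; metrics in this line of work typically compare learned sentence embeddings instead.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two summaries.

    Word counts are only a stand-in here; real semantic metrics
    compare learned sentence embeddings.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

reference = "returns the sum of two integers"
generated = "computes the sum of two numbers"
score = cosine_similarity(reference, generated)
```

Unlike exact n-gram overlap (BLEU-style scoring), a vector-based metric gives partial credit when a generated summary paraphrases the reference.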