cs.SE – Recent Articles

What is a “bug”? On subjectivity, epistemic power, and implications for software research

February 14, 2024February 14, 2024 cs.HC updates on arXiv.org Edit

Considerable effort in software research and practice is spent on bugs. Finding, reporting, tracking, triaging, attempting to fix them automatically, detecting "bug smells" -these comprise a substantial portion of large projects' time and development cost, and are of significant interest to researchers in Software Engineering, Programming Languages, and beyond. But, what is a bug, exactly? While segmentation faults rarely spark joy, most bugs are not so clear cut. Per the Oxford English Dictionary, the word "bug" has been a colloquialism for an engineering "defect" at least since the 1870s. Most modern software-oriented definitions speak to a disconnect between what a developer intended and what a program actually does. Formal verification, from its inception, has developed means to identify deviations from a formal specification, expected to more or less fully encode desired behavior. However, software is rarely accompanied by full and formal specifications, and this intention is instead treated as implicit or partially-documented at best. The International Software Testing Qualifications board writes: "A human being can make an error (mistake), which produces a defect (fault, bug) in the program code, or in a document. If a defect in code is executed, the system may fail to do what it should do (or do something it shouldn't), causing a failure. Defects may result in failures, but not all [do]". Most sources forsake this precision. The influential paper "Finding bugs is easy" begins by saying "bug patterns are code idioms that are often errors"-with no particular elaboration. Other work relies on imperfect practical proxies for specifications. For example, in automatic program repair research, a bug corresponds to a failing test case: when the test passes, the bug is considered fixed. However, when we interrogate fairly straightforward definitions, they start to break down...

Human-Centered AI Product Prototyping with No-Code AutoML: Conceptual Framework, Potentials and Limitations

February 14, 2024February 14, 2024 cs.HC updates on arXiv.org Edit

This paper evaluates No-Code AutoML as a solution for challenges in AI product prototyping, characterized by unpredictability and inaccessibility to non-experts, and proposes a conceptual framework. This complexity of AI products hinders seamless execution and interdisciplinary collaboration crucial for human-centered AI products. Relevant to industry and innovation, it affects strategic decision-making and investment risk mitigation. Current approaches provide limited insights into the potential and feasibility of AI product ideas. Employing Design Science Research, the study identifies challenges and integrates no-code AutoML as a solution by presenting a framework for AI product prototyping with No-code AutoML. A case study confirms its potential in supporting non-experts, offering a structured approach to AI product development. The framework facilitates accessible and interpretable prototyping, benefiting academia, managers, and decision-makers. Strategic integration of no-code AutoML enhances efficiency, empowers non-experts, and informs early-stage decisions, albeit with acknowledged limitations.

Ranking LLM-Generated Loop Invariants for Program Verification

February 14, 2024February 14, 2024 cs.CL updates on arXiv.org Edit

Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier. The source code and the experimental data for this paper are available in \url{https://github.com/microsoft/NeuralInvariantRanker}.

ACCESS: Prompt Engineering for Automated Web Accessibility Violation Corrections

February 13, 2024February 13, 2024 cs.HC updates on arXiv.org Edit

With the increasing need for inclusive and user-friendly technology, web accessibility is crucial to ensuring equal access to online content for individuals with disabilities, including visual, auditory, cognitive, or motor impairments. Despite the existence of accessibility guidelines and standards such as Web Content Accessibility Guidelines (WCAG) and the Web Accessibility Initiative (W3C), over 90% of websites still fail to meet the necessary accessibility requirements. For web users with disabilities, there exists a need for a tool to automatically fix web page accessibility errors. While research has demonstrated methods to find and target accessibility errors, no research has focused on effectively correcting such violations. This paper presents a novel approach to correcting accessibility violations on the web by modifying the document object model (DOM) in real time with foundation models. Leveraging accessibility error information, large language models (LLMs), and prompt engineering techniques, we achieved greater than a 51% reduction in accessibility violation errors after corrections on our novel benchmark: ACCESS. Our work demonstrates a valuable approach toward the direction of inclusive web content, and provides directions for future research to explore advanced methods to automate web accessibility.

Mercury: An Efficiency Benchmark for LLM Code Synthesis

February 13, 2024February 13, 2024 cs.CL updates on arXiv.org Edit

Despite advancements in evaluating Large Language Models (LLMs) for code synthesis, benchmarks have predominantly focused on functional correctness, overlooking the importance of code efficiency. We present Mercury, the first benchmark designated for assessing the code efficiency of LLM code synthesis tasks. Mercury consists of 1,889 programming tasks covering diverse difficulty levels alongside test case generators generating unlimited cases for comprehensive evaluation. Unlike existing benchmarks, Mercury integrates a novel metric Beyond@K to measure normalized code efficiency based on historical submissions, leading to a new evaluation indicator for code synthesis, which encourages generating functionally correct and computationally efficient code, mirroring the real-world software development standard. Our findings reveal that while LLMs demonstrate the remarkable capability to generate functionally correct code, there still exists a substantial gap in their efficiency output, underscoring a new frontier for LLM research and development.

Neural Models for Source Code Synthesis and Completion

February 13, 2024February 13, 2024 cs.CL updates on arXiv.org Edit

Natural language (NL) to code suggestion systems assist developers in Integrated Development Environments (IDEs) by translating NL utterances into compilable code snippet. The current approaches mainly involve hard-coded, rule-based systems based on semantic parsing. These systems make heavy use of hand-crafted rules that map patterns in NL or elements in its syntax parse tree to various query constructs and can only work on a limited subset of NL with a restricted NL syntax. These systems are unable to extract semantic information from the coding intents of the developer, and often fail to infer types, names, and the context of the source code to get accurate system-level code suggestions. In this master thesis, we present sequence-to-sequence deep learning models and training paradigms to map NL to general-purpose programming languages that can assist users with suggestions of source code snippets, given a NL intent, and also extend auto-completion functionality of the source code to users while they are writing source code. The developed architecture incorporates contextual awareness into neural models which generate source code tokens directly instead of generating parse trees/abstract meaning representations from the source code and converting them back to source code. The proposed pretraining strategy and the data augmentation techniques improve the performance of the proposed architecture. The proposed architecture has been found to exceed the performance of a neural semantic parser, TranX, based on the BLEU-4 metric by 10.82%. Thereafter, a finer analysis for the parsable code translations from the NL intent for CoNaLA challenge was introduced. The proposed system is bidirectional as it can be also used to generate NL code documentation given source code. Lastly, a RoBERTa masked language model for Python was proposed to extend the developed system for code completion.

Using Large Language Models for Student-Code Guided Test Case Generation in Computer Science Education

February 13, 2024February 13, 2024 cs.CL updates on arXiv.org Edit

In computer science education, test cases are an integral part of programming assignments since they can be used as assessment items to test students' programming knowledge and provide personalized feedback on student-written code. The goal of our work is to propose a fully automated approach for test case generation that can accurately measure student knowledge, which is important for two reasons. First, manually constructing test cases requires expert knowledge and is a labor-intensive process. Second, developing test cases for students, especially those who are novice programmers, is significantly different from those oriented toward professional-level software developers. Therefore, we need an automated process for test case generation to assess student knowledge and provide feedback. In this work, we propose a large language model-based approach to automatically generate test cases and show that they are good measures of student knowledge, using a publicly available dataset that contains student-written Java code. We also discuss future research directions centered on using test cases to help students.

Unlocking the Secrets of Software Configuration Landscapes-Ruggedness, Accessibility, Escapability, and Transferability

February 12, 2024February 12, 2024 cs.NE updates on arXiv.org Edit

Modern software systems are often highly configurable to tailor varied requirements from diverse stakeholders. Understanding the mapping between configurations and the desired performance attributes plays a fundamental role in advancing the controllability and tuning of the underlying system, yet has long been a dark hole of knowledge due to their black-box nature and the enormous combinatorial configuration space. In this paper, using $86$M evaluated configurations from three real-world systems on $32$ running workloads, we conducted one of its kind fitness landscape analysis (FLA) for configurable software systems. With comprehensive FLA methods, we for the first time show that: $i)$ the software configuration landscapes are fairly rugged, with numerous scattered local optima; $ii)$ nevertheless, the top local optima are highly accessible, featuring significantly larger basins of attraction; $iii)$ most inferior local optima are escapable with simple perturbations; $iv)$ landscapes of the same system with different workloads share structural similarities, which can be exploited to expedite heuristic search. Our results also provide valuable insights on the design of tailored meta-heuristics for configuration tuning; our FLA framework along with the collected data, build solid foundation for future research in this direction.

AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

February 12, 2024February 12, 2024 cs.CL updates on arXiv.org Edit

Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score for the Bugs2Fix task and by 3 points in exact match score for Java-C# Transpilation in CodeXGLUE. Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.

Uncertainty Awareness of Large Language Models Under Code Distribution Shifts: A Benchmark Study

February 12, 2024February 12, 2024 cs.CL updates on arXiv.org Edit

Large Language Models (LLMs) have been widely employed in programming language analysis to enhance human productivity. Yet, their reliability can be compromised by various code distribution shifts, leading to inconsistent outputs. While probabilistic methods are known to mitigate such impact through uncertainty calibration and estimation, their efficacy in the language domain remains underexplored compared to their application in image-based tasks. In this work, we first introduce a large-scale benchmark dataset, incorporating three realistic patterns of code distribution shifts at varying intensities. Then we thoroughly investigate state-of-the-art probabilistic methods applied to CodeLlama using these shifted code snippets. We observe that these methods generally improve the uncertainty awareness of CodeLlama, with increased calibration quality and higher uncertainty estimation~(UE) precision. However, our study further reveals varied performance dynamics across different criteria (e.g., calibration error vs misclassification detection) and trade-off between efficacy and efficiency, highlighting necessary methodological selection tailored to specific contexts.