About me
I am a final-year Ph.D. student in Computer Science at The Hong Kong University of Science and Technology (HKUST), advised by Prof. Shuai Wang. I am currently visiting the Advanced Software Technologies (AST) Lab at ETH Zurich, advised by Prof. Zhendong Su. Prior to my Ph.D. studies, I obtained my B.S. in Computer Science from Harbin Institute of Technology, Shenzhen. I also participated in the Cornell, Maryland, Max Planck Pre-doctoral Research School.
My Email: zligo at connect dot ust dot hk
Publications
Differentiation-Based Extraction of Proprietary Data from Fine-tuned LLMs
Zongjie Li, Daoyuan Wu, Shuai Wang, Zhendong Su
Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS 2025) CCF-A
Taipei, China, 2025
Keywords: LLM, Data Extraction
Abstract:
The increasing demand for domain-specific and human-aligned Large Language Models (LLMs) has led to the widespread adoption of Supervised Fine-Tuning (SFT) techniques. SFT datasets often comprise carefully curated instruction-response pairs, making them valuable targets for potential extraction. This paper studies this critical research problem for the first time. We start by formally defining and formulating the problem, then explore various attack goals, types, and variants based on the unique properties of SFT data in real-world scenarios. Based on our analysis of the behavior of direct extraction, we develop a novel extraction method specifically designed for SFT models, called Differentiated Data Extraction (DDE), which exploits the confidence levels of fine-tuned models and their behavioral differences from pre-trained base models.
Through extensive experiments across multiple domains and scenarios, we demonstrate the feasibility of SFT data extraction using DDE. Our results show that DDE consistently outperforms existing extraction baselines in all attack settings. To counter this new attack, we propose a defense mechanism that mitigates DDE attacks with minimal impact on model performance. Overall, our research reveals hidden data leak risks in fine-tuned LLMs and provides insights for developing more secure models.
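To make the core intuition concrete, here is a minimal, hypothetical sketch of differentiation-based scoring: rank candidate texts by how much more confident the fine-tuned model is than its pre-trained base. The callables finetuned_logprobs and base_logprobs are assumptions standing in for real model APIs; this is not the paper's DDE implementation.

```python
# Illustrative sketch (not the paper's implementation): rank candidate
# continuations by how much more confident the fine-tuned model is than
# its pre-trained base, a signal that the text may stem from SFT data.
from typing import Callable, List

def differential_score(tokens: List[str],
                       finetuned_logprobs: Callable[[List[str]], List[float]],
                       base_logprobs: Callable[[List[str]], List[float]]) -> float:
    """Average per-token log-probability gap between the two models."""
    ft = finetuned_logprobs(tokens)    # hypothetical: log p_ft(token_i | prefix)
    base = base_logprobs(tokens)       # hypothetical: log p_base(token_i | prefix)
    gaps = [f - b for f, b in zip(ft, base)]
    return sum(gaps) / max(len(gaps), 1)

def rank_candidates(candidates, finetuned_logprobs, base_logprobs, top_k=10):
    """Keep the candidates the fine-tuned model is disproportionately sure about."""
    scored = [(differential_score(c, finetuned_logprobs, base_logprobs), c)
              for c in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:top_k]
```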
API-guided Dataset Synthesis to Finetune Large Code Models
Zongjie Li, Daoyuan Wu, Shuai Wang, Zhendong Su
Proceedings of the ACM on Programming Languages (OOPSLA 2025) CCF-A
Singapore, 2025
Keywords: LLM, Data Synthesis
Abstract:
Large code models (LCMs), pre-trained on vast code corpora, have demonstrated remarkable performance across a wide array of code-related tasks. Supervised fine-tuning (SFT) plays a vital role in aligning these models with specific requirements and enhancing their performance in particular domains. However, synthesizing high-quality SFT datasets poses a significant challenge due to the uneven quality of datasets and the scarcity of domain-specific datasets. Inspired by APIs as high-level abstractions of code that encapsulate rich semantic information in a concise structure, we propose DataScope, an API-guided dataset synthesis framework designed to enhance the SFT process for LCMs in both general and domain-specific scenarios. DataScope comprises two main components: Dsel and Dgen. On the one hand, Dsel employs API coverage as a core metric, enabling efficient dataset synthesis in general scenarios by selecting subsets of existing (uneven-quality) datasets with higher API coverage. On the other hand, Dgen recasts domain dataset synthesis as a process of using API-specified high-level functionality and deliberately-constituted code skeletons to synthesize concrete code.
Extensive experiments demonstrate DataScope's effectiveness, with models fine-tuned on its synthesized datasets outperforming those tuned on unoptimized datasets five times larger. Furthermore, a series of analyses on model internals, relevant hyperparameters, and case studies provide additional evidence for the efficacy of our proposed methods. These findings underscore the significance of dataset quality in SFT and advance the field of LCMs by providing an efficient, cost-effective framework for constructing high-quality datasets, which in turn lead to more powerful and tailored LCMs for both general and domain-specific scenarios.
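As a rough illustration of the API-coverage idea behind Dsel, the sketch below greedily selects the samples that add the most previously unseen APIs. It assumes API names have already been extracted from each sample (e.g., by parsing its code) and is not DataScope's actual selection algorithm.

```python
# Illustrative greedy selection by API coverage: repeatedly pick the sample
# that contributes the most previously unseen API calls.
from typing import Dict, List, Set

def select_by_api_coverage(samples: Dict[str, Set[str]], budget: int) -> List[str]:
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(samples)
    while remaining and len(chosen) < budget:
        sid, apis = max(remaining.items(), key=lambda kv: len(kv[1] - covered))
        if not apis - covered:        # no remaining sample adds new APIs; stop early
            break
        chosen.append(sid)
        covered |= apis
        del remaining[sid]
    return chosen

# Toy example: three SFT samples with the APIs they exercise.
demo = {"s1": {"json.loads", "re.sub"}, "s2": {"json.loads"}, "s3": {"os.walk", "re.sub"}}
print(select_by_api_coverage(demo, budget=2))   # ['s1', 's3']
```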
On Extracting Specialized Code Abilities from Large Language Models: A Feasibility Study
Zongjie Li, Chaozheng Wang, Pingchuan Ma, Chaowei Liu, Shuai Wang, Daoyuan Wu, Cuiyun Gao, Yang Liu
Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE 2024) CCF-A
Lisbon, Portugal, 2024
Keywords: Code Models, Model Extraction
Abstract:
Recent advances in large language models (LLMs) significantly boost their usage in software engineering. However, training a well-performing LLM demands a substantial workforce for data collection and annotation. Moreover, training datasets may be proprietary or partially open, and the process often requires a costly GPU cluster. The intellectual property value of commercial LLMs makes them attractive targets for imitation attacks, but creating an imitation model with comparable parameters still incurs high costs. This motivates us to explore a practical and novel direction: slicing commercial black-box LLMs using medium-sized backbone models.
In this paper, we explore the feasibility of launching imitation attacks on LLMs to extract their specialized code abilities, such as "code synthesis" and "code translation." We systematically investigate the effectiveness of launching code ability extraction attacks under different code-related tasks with multiple query schemes, including zero-shot, in-context, and Chain-of-Thought. We also design response checks to refine the outputs, leading to an effective imitation training process. Our results show promising outcomes, demonstrating that with a reasonable number of queries, attackers can train a medium-sized backbone model to replicate specialized code behaviors similar to the target LLMs. We summarize our findings and insights to help researchers better understand the threats posed by imitation attacks, including revealing a practical attack surface for generating adversarial code examples against LLMs.
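A hedged sketch of the query-and-filter loop described above: build a prompt under one of several schemes, query the black-box target, and keep only responses that pass a response check before imitation training. query_target and passes_response_check are hypothetical placeholders, not the paper's pipeline.

```python
# Illustrative query-and-filter loop (a sketch, not the paper's pipeline):
# collect (task, response) pairs from a black-box code LLM and keep only
# responses that pass a lightweight check before imitation training.
from typing import Callable, Iterable, List, Tuple

PROMPT_SCHEMES = {
    "zero_shot": lambda task: task,
    "in_context": lambda task: ("Example:\n# add two numbers\n"
                                "def add(a, b): return a + b\n\n" + task),
    "chain_of_thought": lambda task: task + "\nReason step by step in comments, then write the code.",
}

def collect_imitation_data(tasks: Iterable[str],
                           query_target: Callable[[str], str],            # hypothetical black-box API call
                           passes_response_check: Callable[[str], bool],  # hypothetical check, e.g. "does it parse?"
                           scheme: str = "zero_shot") -> List[Tuple[str, str]]:
    build_prompt = PROMPT_SCHEMES[scheme]
    dataset = []
    for task in tasks:
        response = query_target(build_prompt(task))
        if passes_response_check(response):      # discard malformed or off-task outputs
            dataset.append((task, response))
    return dataset
```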
Split and Merge: Aligning Position Biases in LLM-based Evaluators
Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, Yang Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) Top Conference in NLP
Miami, Florida, USA, 2024
Keywords: LLM Fairness, Evaluation
Abstract:
Large language models (LLMs) have shown promise as automated evaluators for assessing the quality of answers generated by AI systems. However, these LLM-based evaluators exhibit position bias, or inconsistency, when used to evaluate candidate answers in pairwise comparisons, favoring either the first or second answer regardless of content. To address this limitation, we propose PORTIA, an alignment-based system designed to mimic human comparison strategies to calibrate position bias in a lightweight yet effective manner. Specifically, PORTIA splits the answers into multiple segments, aligns similar content across candidate answers, and then merges them back into a single prompt for evaluation by LLMs. We conducted extensive experiments with six diverse LLMs to evaluate 11,520 answer pairs. Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested, achieving an average relative improvement of 47.46%. Remarkably, PORTIA enables less advanced GPT models to achieve 88% agreement with the state-of-the-art GPT-4 model at just 10% of the cost. Furthermore, it rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%. Subsequent human evaluations indicate that the PORTIA-enhanced GPT-3.5 model can even surpass the standalone GPT-4 in terms of alignment with human evaluators. These findings highlight PORTIA's ability to correct position bias, improve LLM consistency, and boost performance while keeping cost-efficiency. This represents a valuable step toward a more reliable and scalable use of LLMs for automated evaluations across diverse applications.
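To illustrate the split-align-merge idea in a deliberately naive form, the sketch below cuts both candidate answers into the same number of chunks and interleaves them in a single judge prompt, so neither answer always occupies the first position. PORTIA itself aligns semantically similar segments rather than splitting by sentence count; everything here is an assumption for illustration.

```python
# Toy split-align-merge sketch (not PORTIA itself): interleave corresponding
# segments of two answers into one evaluation prompt for an LLM judge.
def split_into_chunks(text: str, k: int):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    size = max(1, len(sentences) // k)
    return [". ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)][:k]

def merged_judge_prompt(question: str, answer_a: str, answer_b: str, k: int = 3) -> str:
    a_chunks, b_chunks = split_into_chunks(answer_a, k), split_into_chunks(answer_b, k)
    parts = [f"Question: {question}", "Compare the two answers segment by segment."]
    for i, (a, b) in enumerate(zip(a_chunks, b_chunks), 1):
        parts.append(f"[Segment {i}] Answer A: {a}")
        parts.append(f"[Segment {i}] Answer B: {b}")
    parts.append("Which answer is better overall, A or B?")
    return "\n".join(parts)
```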
Protecting Intellectual Property of Large Language Model-Based Code Generation APIs via Watermarks
Zongjie Li, Chaozheng Wang, Shuai Wang, Cuiyun Gao
Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS 2023) CCF-A
Copenhagen, Denmark, 2023
Keywords: LLM, Watermark, Intellectual Property Protection
Abstract:
The rise of large language model-based code generation (LLCG) has enabled various commercial services and APIs. Training LLCG models is often expensive and time-consuming, and the training data are often large-scale and even inaccessible to the public. As a result, the risk of intellectual property (IP) theft over the LLCG models (e.g., via imitation attacks) has been a serious concern. In this paper, we propose the first watermark (WM) technique to protect LLCG APIs from remote imitation attacks. Our proposed technique is based on replacing tokens in an LLCG output with their "synonyms" available in the programming language. A WM is thus defined as the stealthily tweaked distribution among token synonyms in LLCG outputs. We design six WM schemes (instantiated into over 30 WM passes) which rely on conceptually distinct token synonyms available in programming languages. Moreover, to check the IP of a suspicious model (decide if it is stolen from our protected LLCG API), we propose a statistical tests-based procedure that can directly check a remote, suspicious LLCG API. We evaluate our WM technique on LLCG models fine-tuned from two popular large language models, CodeT5 and CodeBERT. The evaluation shows that our approach is effective in both WM injection and IP check. The inserted WMs do not undermine the usage of normal users (i.e., high fidelity) and incur negligible extra cost. Moreover, our injected WMs exhibit high stealthiness and robustness against powerful attackers; even if they know all WM schemes, they can hardly remove WMs without largely undermining the accuracy of their stolen models.
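A minimal flavor of the synonym-substitution idea: choose between semantically equivalent code spellings using a keyed hash, so the choice distribution itself encodes a watermark that a verifier holding the key can test for statistically. The synonym table and keying below are illustrative assumptions, not one of the paper's six WM schemes.

```python
# Illustrative keyed synonym-substitution watermark for code tokens
# (a conceptual sketch, not the paper's WM passes).
import hashlib

# Semantically equivalent spellings a code generator could choose between.
SYNONYMS = {
    "i += 1": ("i += 1", "i = i + 1"),
    "x * 2":  ("x * 2", "2 * x"),
}

def watermarked_variant(token: str, context: str, key: bytes) -> str:
    """Pick a variant deterministically from a keyed hash, skewing the distribution."""
    if token not in SYNONYMS:
        return token
    digest = hashlib.sha256(key + context.encode("utf-8")).digest()
    return SYNONYMS[token][digest[0] % 2]

def watermark_hit_rate(outputs, key: bytes) -> float:
    """Verifier side: fraction of synonym sites matching the keyed choice.
    A suspicious API whose hit rate is far above chance likely imitated the WM."""
    hits = total = 0
    for context, token, observed in outputs:   # (context, canonical token, emitted text)
        if token in SYNONYMS:
            total += 1
            hits += observed == watermarked_variant(token, context, key)
    return hits / total if total else 0.0
```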
Evaluating C/C++ Vulnerability Detectability of Query-Based Static Application Security Testing Tools
Zongjie Li, Zhibo Liu, Wai Kin Wong, Pingchuan Ma, Shuai Wang
IEEE Transactions on Dependable and Secure Computing (TDSC 2024) CCF-A
Keywords: Vulnerability Detection, Static Analysis
Abstract:
In recent years, query-based static application security testing (Q-SAST) tools such as CodeQL have gained popularity due to their ability to codify vulnerability knowledge into SQL-like queries and search for vulnerabilities in the database derived from the software. The industry has made considerable progress in building Q-SAST tools, facilitating their integration into the continuous integration (CI) pipeline, and sustaining an active community. However, we do not have a systematic understanding of their vulnerability detection capability in comparison to conventional SAST tools. We conduct the first in-depth study of Q-SAST to demystify their C/C++ vulnerability detectability. Our study is conducted from three complementary aspects. We first use a synthetic CWE test suite and a real-world CVE test suite, totaling almost 30K programs with known CWE/CVE, to assess popular (commercial) Q-SAST and industry-leading SAST (requiring no queries). Then, we gather defect-fixing pull requests (PRs) since the release dates of three popular Q-SAST tools, characterizing historically-fixed defects and comparing them to pitfalls exposed in our CWE/CVE study. To enhance vulnerability detection, we design SAST-MT, a metamorphic testing framework to detect false positives (FPs) and false negatives (FNs) of Q-SAST. Findings of SAST-MT can be used to easily expose the root causes of Q-SAST's FPs and FNs. We summarize lessons from our study that can benefit both users and developers of Q-SAST.
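The metamorphic-testing idea behind SAST-MT can be sketched simply: apply a semantics-preserving transformation to a test program, run the SAST tool on both versions, and flag any change in the reported findings as a candidate FP or FN. run_sast is a hypothetical wrapper around a tool's CLI, and the whole-word rename is a placeholder transform; this is not SAST-MT's implementation.

```python
# Metamorphic-testing sketch (not SAST-MT itself): a semantics-preserving
# rewrite of a test program should leave a tool's findings unchanged; any
# difference points at a candidate false positive or false negative.
import re
from typing import Callable, Set

def rename_identifier(source: str, old: str, new: str) -> str:
    """Whole-word rename; semantics-preserving as long as `new` is unused."""
    return re.sub(rf"\b{re.escape(old)}\b", new, source)

def check_metamorphic_relation(source: str,
                               run_sast: Callable[[str], Set[str]],  # hypothetical tool wrapper
                               old: str, new: str) -> Set[str]:
    """Return findings that appear in only one of the two equivalent programs."""
    before = run_sast(source)
    after = run_sast(rename_identifier(source, old, new))
    return before ^ after          # symmetric difference = inconsistent findings
```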
CCTEST: Testing and Repairing Code Completion Systems
Zongjie Li, Chaozheng Wang, Zhibo Liu, Haoxuan Wang, Dong Chen, Shuai Wang, Cuiyun Gao
45th IEEE/ACM International Conference on Software Engineering (ICSE 2023) CCF-A
Melbourne, Australia, 2023
Keywords: LLM, Code Completion, Testing
Abstract:
Code completion, a highly valuable topic in the software development domain, has been increasingly promoted for use by recent advances in large language models (LLMs). To date, visible LLM-based code completion frameworks such as GitHub Copilot and GPT are trained using deep learning over vast quantities of unstructured text and open source code. As the paramount component and the cornerstone in daily programming tasks, code completion has largely boosted professionals' efficiency in building real-world software systems. In contrast to this flourishing market, we find that code completion systems often output suspicious results, and to date, an automated testing and enhancement framework for code completion systems is not available. This research proposes CCTEST, a framework to test and repair code completion systems in blackbox settings. CCTEST features a set of novel mutation strategies, namely program structure-correlated (PSC) mutations, to generate mutated code completion inputs. Then, it detects inconsistent outputs, representing possibly erroneous cases, from all the completed code cases. Moreover, CCTEST repairs the code completion outputs by selecting the output that mostly reflects the "average" appearance of all output cases, as the final output of the code completion systems. We detected a total of 33,540 inputs (with a true positive rate of 86%) that can trigger erroneous cases from eight popular LLM-based code completion systems. With repairing, we show that the accuracy of code completion systems is notably increased by 40% and 67% with respect to BLEU score and Levenshtein edit similarity.
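The detect-then-repair step can be sketched as follows: given the completions produced for an input and its mutants, flag cases where they disagree and return the completion closest to the "average" of the set. Simple string similarity stands in for the paper's metrics; everything here is illustrative rather than CCTEST's code.

```python
# Illustrative consensus repair (a sketch of the idea, not CCTEST's code):
# among completions for an input and its mutants, return the one with the
# highest average similarity to all others, i.e. closest to the "average" output.
from difflib import SequenceMatcher
from typing import List

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def consensus_completion(completions: List[str]) -> str:
    """Pick the completion most similar, on average, to every other completion."""
    def avg_sim(c: str) -> float:
        others = [o for o in completions if o is not c]
        return sum(similarity(c, o) for o in others) / max(len(others), 1)
    return max(completions, key=avg_sim)

def looks_inconsistent(completions: List[str], threshold: float = 0.8) -> bool:
    """Flag the case when some completion strays far from the consensus."""
    best = consensus_completion(completions)
    return any(similarity(best, c) < threshold for c in completions)
```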
REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes
Chaozheng Wang*, Zongjie Li*, Yun Peng, Shuzheng Gao, Sirong Chen, Shuai Wang, Cuiyun Gao, Michael R. Lyu
2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE Industry 2023)
Kirchberg, Luxembourg, 2023 Best Paper Award
Keywords: LLM, Vulnerability Detection
Abstract:
Software plays a crucial role in our daily lives, and therefore the quality and security of software systems have become increasingly important. However, vulnerabilities in software still pose a significant threat, as they can have serious consequences. Recent advances in automated program repair have sought to automatically detect and fix bugs using data-driven techniques. Sophisticated deep learning methods have been applied to this area and have achieved promising results. However, existing benchmarks for training and evaluating these techniques remain limited, as they tend to focus on a single programming language and have relatively small datasets. Moreover, many benchmarks tend to be outdated and lack diversity, focusing on a specific codebase. Worse still, the quality of bug explanations in existing datasets is low, as they typically use imprecise and uninformative commit messages as explanations.
To address these issues, we propose REEF, an automated collection framework for REal-world vulnErabilities and Fixes from open-source repositories. We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs. Furthermore, we propose a neural language model-based approach to generate high-quality vulnerability explanations, which is key to producing informative fix messages. Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations. The collected dataset contains 4,466 CVEs with 30,987 patches (covering 236 CWEs) across 7 programming languages, together with detailed metadata, and surpasses existing benchmarks in scale, coverage, and quality. Evaluations by human experts further confirm that our framework produces high-quality vulnerability explanations.
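As a rough illustration of metric-based filtering for vulnerability-fix pairs, the sketch below keeps small, well-documented patches that reference a CVE/CWE identifier. The record fields and thresholds are assumptions for illustration, not REEF's actual metrics.

```python
# Illustrative quality filter for crawled vulnerability-fix pairs
# (assumed record fields; not REEF's real filtering metrics).
import re
from typing import Dict, Iterable, List

def keep_pair(pair: Dict) -> bool:
    small_patch = pair.get("lines_changed", 10**9) <= 100                      # focused fix
    has_id = bool(re.search(r"CVE-\d{4}-\d+|CWE-\d+", pair.get("message", "")))  # traceable to an advisory
    informative = len(pair.get("message", "").split()) >= 10                   # non-trivial explanation
    return small_patch and has_id and informative

def filter_pairs(pairs: Iterable[Dict]) -> List[Dict]:
    return [p for p in pairs if keep_pair(p)]
```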
Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings
Zongjie Li, Pingchuan Ma, Huaijin Wang, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu
44th IEEE/ACM International Conference on Software Engineering (ICSE 2022) CCF-A
Pittsburgh, PA, USA, 2022
Keywords: LLM, Code Embedding, Compiler
Abstract:
Neural program embeddings have demonstrated considerable promise in a range of program analysis tasks, including clone identification, program repair, code completion, and program synthesis. However, most existing methods generate neural program embeddings directly from program source code, learning from features such as tokens, abstract syntax trees, and control flow graphs. This paper takes a fresh look at how to improve program embeddings by leveraging compiler intermediate representation (IR). We first demonstrate simple yet highly effective methods for enhancing embedding quality by training embedding models on both source code and the LLVM IR generated at default optimization levels (e.g., -O2). We then introduce IRGen, a framework based on genetic algorithms (GA), to identify (near-)optimal sequences of optimization flags that can significantly improve embedding quality.
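To convey the flavor of the genetic search over optimization-flag sequences, here is a minimal sketch under stated assumptions: embedding_quality is a hypothetical fitness function (e.g., downstream clone-detection accuracy when the model is trained on IR produced with those flags), and the flag pool is a small illustrative subset of LLVM passes. This is not IRGen's implementation.

```python
# Illustrative genetic search over optimization-flag sequences (a sketch of
# the IRGen idea, not its implementation).
import random
from typing import Callable, List, Sequence

FLAG_POOL = ["-mem2reg", "-instcombine", "-gvn", "-licm", "-loop-unroll", "-sccp"]

def random_sequence(length: int = 6) -> List[str]:
    return [random.choice(FLAG_POOL) for _ in range(length)]

def crossover(a: Sequence[str], b: Sequence[str]) -> List[str]:
    cut = random.randrange(1, len(a))
    return list(a[:cut]) + list(b[cut:])

def mutate(seq: Sequence[str], rate: float = 0.2) -> List[str]:
    return [random.choice(FLAG_POOL) if random.random() < rate else f for f in seq]

def evolve(embedding_quality: Callable[[List[str]], float],   # hypothetical fitness function
           pop_size: int = 20, generations: int = 30) -> List[str]:
    population = [random_sequence() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=embedding_quality, reverse=True)
        parents = population[: pop_size // 2]            # keep the fittest half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=embedding_quality)
```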
Awards
[2025] Our project CipherInsight won the Tech Fest Hong Kong Awards 2025
[2024] Our proposal on Robustness & Fairness in LLMs was accepted by OpenAI
[2023] Our paper REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes received a Distinguished Paper Award
[2021] Our paper Static Inference Meets Deep Learning: A Hybrid Type Inference Approach for Python was nominated for a Distinguished Paper Award
Education
Sep. 2017 – Jun. 2021, B.S. Degree in Computer Science, Harbin Institute of Technology, Shenzhen
Aug. 2020, The Cornell, Maryland, Max Planck Pre-doctoral Research School 2020
Sep. 2021 – Now, Ph.D. in Computer Science, The Hong Kong University of Science and Technology
Feb. 2024 – Now, Visiting Ph.D. Student, ETH Zurich
Internships
Jul. 2020 – Jan. 2021, Computer Vision Intern, Vertical Platform and Software Validation, Intel, Shenzhen
Mar. 2021 – Sep. 2021, Research Intern, Keen Lab, Tencent, Shanghai
Professional Service
Conference Reviewer: IEEE S&P 2023 | ICSE-SEET 2022 | AsiaCCS 2022 | ICSE 2021
Artifact Evaluation Committee: USENIX ATC 2022 | WiSec 2022 | OSDI 2022 | ISSTA 2022