Papers
Please see Google Scholar for more recent works and details.
†: equal contribution.
2025
- [Submitted] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process. Zhijun Chen†, Zeyu Ji†, Qianren Mao, and 10 more authors. 2025.
- [arXiv 2025] Harnessing Multiple Large Language Models: A Survey on LLM Ensemble. Zhijun Chen, Jingzheng Li, Pengpeng Chen, and 7 more authors. arXiv preprint arXiv:2502.18036, 2025.
LLM Ensemble – which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during downstream inference, to benefit from their individual strengths – has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemble. This paper presents the first systematic review of recent developments in LLM Ensemble. First, we introduce our taxonomy of LLM Ensemble and discuss several related research problems. Then, we provide a more in-depth classification of the methods under the broad categories of "ensemble-before-inference, ensemble-during-inference, ensemble-after-inference", and review all relevant methods. Finally, we introduce related benchmarks and applications, summarize existing studies, and suggest several future research directions. A curated list of papers on LLM Ensemble is available at https://github.com/junchenzhi/Awesome-LLM-Ensemble.
@article{chen2025harnessing,
  title = {Harnessing Multiple Large Language Models: A Survey on LLM Ensemble},
  author = {Chen, Zhijun and Li, Jingzheng and Chen, Pengpeng and Li, Zhuoran and Sun, Kai and Luo, Yuankai and Mao, Qianren and Yang, Dingqi and Sun, Hailong and Yu, Philip S},
  journal = {arXiv preprint arXiv:2502.18036},
  year = {2025},
}
- [AAAI 2025] Implicit Word Reordering with Knowledge Distillation for Cross-Lingual Dependency Parsing. Zhuoran Li, Chunming Hu, Junfan Chen, and 2 more authors. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
Word order difference between source and target languages is a major obstacle to cross-lingual transfer, especially in the dependency parsing task. Current works are mostly based on order-agnostic models or word reordering to mitigate this problem. However, such methods either do not leverage grammatical information naturally contained in word order or are computationally expensive as the permutation space grows exponentially with the sentence length. Moreover, the reordered source sentence with an unnatural word order may act as a form of noise that harms model learning. To this end, we propose an Implicit Word Reordering framework with Knowledge Distillation (IWR-KD). This framework is inspired by the observation that deep networks are good at learning feature linearization corresponding to meaningful data transformations, e.g., word reordering. To realize this idea, we introduce a knowledge distillation framework composed of a word-reordering teacher model and a dependency parsing student model. We verify our proposed method on Universal Dependency Treebanks across 31 different languages and show that it outperforms a series of competitors, together with experimental analysis to illustrate how our method works towards training a robust parser.
@inproceedings{li2025implicit,
  title = {Implicit Word Reordering with Knowledge Distillation for Cross-Lingual Dependency Parsing},
  author = {Li, Zhuoran and Hu, Chunming and Chen, Junfan and Chen, Zhijun and Zhang, Richong},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume = {39},
  number = {23},
  pages = {24530--24538},
  year = {2025},
}
- [arXiv 2025] Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation. Qianren Mao, Qili Zhang, Hanwen Hao, and 8 more authors. arXiv preprint arXiv:2504.19101, 2025.
Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution for enhancing the accuracy and credibility of Large Language Models (LLMs), particularly in Question & Answer tasks. This is achieved by incorporating proprietary and private data from integrated databases. However, private RAG systems face significant challenges due to the scarcity of private domain data and critical data privacy issues. These obstacles impede the deployment of private RAG systems, as developing privacy-preserving RAG systems requires a delicate balance between data security and data availability. To address these challenges, we regard federated learning (FL) as a highly promising technology for privacy-preserving RAG services. We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). This framework facilitates collaborative training of client-side RAG retrieval models. The parameters of these models are aggregated and distributed on a central server, ensuring data privacy without direct sharing of raw data. In FedE4RAG, knowledge distillation is employed for communication between the server and client models. This technique improves the generalization of local RAG retrievers during the federated learning process. Additionally, we apply homomorphic encryption within federated learning to safeguard model parameters and mitigate concerns related to data leakage. Extensive experiments conducted on a real-world dataset have validated the effectiveness of FedE4RAG. The results demonstrate that our proposed framework can markedly enhance the performance of private RAG systems while maintaining robust data privacy protection.
@article{mao2025privacy,
  title = {Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation},
  author = {Mao, Qianren and Zhang, Qili and Hao, Hanwen and Han, Zhentao and Xu, Runhua and Jiang, Weifeng and Hu, Qi and Chen, Zhijun and Zhou, Tyler and Li, Bo and others},
  journal = {arXiv preprint arXiv:2504.19101},
  year = {2025},
}
- [arXiv 2025] Safety2Drive: Safety-Critical Scenario Benchmark for the Evaluation of Autonomous Driving. Jingzheng Li, Tiancheng Wang, Xingyu Peng, and 4 more authors. arXiv preprint arXiv:2505.13872, 2025.
Autonomous Driving (AD) systems demand high levels of safety assurance. Despite significant advancements in AD demonstrated on open-source benchmarks like Longest6 and Bench2Drive, existing datasets still lack regulatory-compliant scenario libraries for closed-loop testing to comprehensively evaluate the functional safety of AD. Meanwhile, real-world AD accidents are underrepresented in current driving datasets. This scarcity leads to inadequate evaluation of AD performance, posing risks to safety validation and practical deployment. To address these challenges, we propose Safety2Drive, a safety-critical scenario library designed to evaluate AD systems. Safety2Drive offers three key contributions. (1) Safety2Drive comprehensively covers the test items required by standard regulations and contains 70 AD function test items. (2) Safety2Drive supports safety-critical scenario generalization. It can inject safety threats such as natural environment corruptions and adversarial attacks across camera and LiDAR sensors. (3) Safety2Drive supports multi-dimensional evaluation. In addition to the evaluation of AD systems, it also supports the evaluation of various perception tasks, such as object detection and lane detection. Safety2Drive provides a paradigm from scenario construction to validation, establishing a standardized test framework for the safe deployment of AD.
@article{li2025safety2drive,
  title = {Safety2Drive: Safety-Critical Scenario Benchmark for the Evaluation of Autonomous Driving},
  author = {Li, Jingzheng and Wang, Tiancheng and Peng, Xingyu and Chen, Jiacheng and Chen, Zhijun and Li, Bing and Liu, Xianglong},
  journal = {arXiv preprint arXiv:2505.13872},
  year = {2025},
}
- [arXiv 2025] Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios. Jingzheng Li, Xianglong Liu, Shikui Wei, and 6 more authors. arXiv preprint arXiv:2503.23708, 2025.
Autonomous driving has made significant progress in both academia and industry, including performance improvements in perception tasks and the development of end-to-end autonomous driving systems. However, the safety and robustness assessment of autonomous driving has not received sufficient attention. Current evaluations of autonomous driving are typically conducted in natural driving scenarios. However, many accidents occur in edge cases, also known as safety-critical scenarios. These safety-critical scenarios are difficult to collect, and there is currently no clear definition of what constitutes a safety-critical scenario. In this work, we explore the safety and robustness of autonomous driving in safety-critical scenarios. First, we provide a definition of safety-critical scenarios, including static traffic scenarios such as adversarial attack scenarios and natural distribution shifts, as well as dynamic traffic scenarios such as accident scenarios. Then, we develop an autonomous driving safety testing platform to comprehensively evaluate autonomous driving systems, encompassing not only the assessment of perception modules but also system-level evaluations. Our work systematically constructs a safety verification process for autonomous driving, providing technical support for the industry to establish a standardized test framework and reduce risks in real-world road deployment.
@article{li2025towards,
  title = {Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios},
  author = {Li, Jingzheng and Liu, Xianglong and Wei, Shikui and Chen, Zhijun and Li, Bing and Guo, Qing and Yang, Xianqi and Pu, Yanjun and Wang, Jiakai},
  journal = {arXiv preprint arXiv:2503.23708},
  year = {2025},
}
2024
- [arXiv 2024] XRAG: eXamining the Core–Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation. Qianren Mao, Yangyifei Luo, Qili Zhang, and 8 more authors. arXiv preprint arXiv:2412.15529, 2024.
Retrieval-augmented generation (RAG) synergizes the retrieval of pertinent data with the generative capabilities of Large Language Models (LLMs), ensuring that the generated output is not only contextually relevant but also accurate and current. We introduce XRAG, an open-source, modular codebase that facilitates exhaustive evaluation of the performance of foundational components of advanced RAG modules. These components are systematically categorized into four core phases: pre-retrieval, retrieval, post-retrieval, and generation. We systematically analyse them across reconfigured datasets, providing a comprehensive benchmark for their effectiveness. As the complexity of RAG systems continues to escalate, we underscore the critical need to identify potential failure points in RAG systems. We formulate a suite of experimental methodologies and diagnostic testing protocols to dissect the failure points inherent in RAG engineering. Subsequently, we proffer bespoke solutions aimed at bolstering the overall performance of these modules. Our work thoroughly evaluates the performance of advanced core components in RAG systems, providing insights into optimizations for prevalent failure points.
@article{mao2024xrag,
  title = {XRAG: eXamining the Core--Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation},
  author = {Mao, Qianren and Luo, Yangyifei and Zhang, Qili and Luo, Yashuo and Cao, Zhilong and Zhang, Jinlong and Hao, HanWen and Chen, Zhijun and Jiang, Weifeng and Liu, Junnan and others},
  journal = {arXiv preprint arXiv:2412.15529},
  year = {2024},
}
- [IJCAI 2024] Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching. Zhuoran Li, Chunming Hu, Junfan Chen, and 3 more authors. arXiv preprint arXiv:2406.13361, 2024.
Code-switching is a data augmentation scheme that mixes words from multiple languages into source-language text. It has achieved considerable generalization performance on cross-lingual transfer tasks by aligning cross-lingual contextual word representations. However, uncontrolled and over-replaced code-switching would introduce noisy samples into model training. In other words, excessive code-switching text samples will hurt the models’ cross-lingual transferability. To this end, we propose a Progressive Code-Switching (PCS) method to gradually generate moderately difficult code-switching examples for the model to discriminate from easy to hard. The idea is to progressively incorporate the multilingual knowledge previously learned from easier code-switching data to guide model optimization on succeeding harder code-switching data. Specifically, we first design a difficulty measurer to measure the impact of replacing each word in a sentence based on the word relevance score. Then a code-switcher generates the code-switching data of increasing difficulty via a controllable temperature variable. In addition, a training scheduler decides when to sample harder code-switching data for model training. Experiments show our model achieves state-of-the-art results on three different zero-shot cross-lingual transfer tasks across ten languages.
@article{li2024improving,
  title = {Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching},
  author = {Li, Zhuoran and Hu, Chunming and Chen, Junfan and Chen, Zhijun and Guo, Xiaohui and Zhang, Richong},
  journal = {arXiv preprint arXiv:2406.13361},
  year = {2024},
}
- [Pattern Recognition] Parallel Disentangling Network for Human–Object Interaction Detection. Yamin Cheng, Hancong Duan, Chen Wang, and 1 more author. Pattern Recognition, 2024.
Human–object interaction (HOI) detection aims to localize and classify triplets of human, object and interaction from a given image. Earlier two-stage methods suffer both from mutually independent training processes and the interference of redundant negative human–object pairs. Prevailing one-stage transformer-based methods are free from the above problems by tackling HOI in an end-to-end manner. However, one-stage transformer-based methods entangle a single query across different tasks, i.e., human–object detection and interaction classification, which degrades performance. In this paper, we propose a new transformer-based approach that disentangles human–object detection and interaction classification in parallel, in a triplet-wise manner. To make each query focus clearly on one specific task, we exhaustively disentangle HOI by expanding the naive query of the vanilla transformer in parallel into three explicit queries. Then, we introduce a semantic communication layer to preserve the consistent semantic association of each HOI through mixing the feature representations of each query triplet of the correspondence constraint. Extensive experiments demonstrate that our proposed framework outperforms the existing methods and achieves state-of-the-art performance, with a significant reduction in parameters and FLOPs.
@article{cheng2024parallel,
  title = {Parallel Disentangling Network for Human--Object Interaction Detection},
  author = {Cheng, Yamin and Duan, Hancong and Wang, Chen and Chen, Zhijun},
  journal = {Pattern Recognition},
  volume = {146},
  pages = {110021},
  year = {2024},
  publisher = {Elsevier},
}
- [ESWA 2024] SimMix: Local Similarity-Aware Data Augmentation for Time Series. Pin Liu, Yuxuan Guo, Pengpeng Chen, and 4 more authors. Expert Systems with Applications, 2024.
We find that local similarity is an essential factor for data augmentation in deep learning tasks concerning time series data, the applications of which are prevalent in various domains such as smart healthcare, intelligent transportation, smart finance, etc. With empirical and theoretical analysis, we find deep learning models achieve excellent performance only when the data augmentation method performs with appropriate intensity of local similarity—during the data augmentation process, too large/small intra-class local similarity will decrease the performance of deep learning models. With this discovery, we propose a time series augmentation method based on intra-class Similarity Mixing (SimMix), which accurately controls the intensity by quantifying and adjusting the similarity between augmented samples and original samples. With a PAC (i.e., Probably Approximately Correct) theoretical foundation, we design a cutmix strategy for non-equal length segments to eliminate semantic information loss and noise introduction defects in traditional methods. Through extensive validation on 10 real-world datasets, we demonstrate that the proposed method can outperform the state-of-the-art by a large margin.
@article{liu2024simmix,
  title = {SimMix: Local Similarity-Aware Data Augmentation for Time Series},
  author = {Liu, Pin and Guo, Yuxuan and Chen, Pengpeng and Chen, Zhijun and Wang, Rui and Wang, Yuzhu and Shi, Bin},
  journal = {Expert Systems with Applications},
  volume = {255},
  pages = {124793},
  year = {2024},
  publisher = {Elsevier},
}
2023
- [KDD 2023] Neural-Hidden-CRF: A Robust Weakly-Supervised Sequence Labeler. Zhijun Chen, Hailong Sun, Wanhao Zhang, and 3 more authors. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023.
Recommended as a Best Paper Candidate
We propose a neuralized undirected graphical model called Neural-Hidden-CRF to solve the weakly-supervised sequence labeling problem. Under the umbrella of probabilistic undirected graph theory, the proposed Neural-Hidden-CRF embedded with a hidden CRF layer models the variables of word sequence, latent ground truth sequence, and weak label sequence with the global perspective that undirected graphical models particularly enjoy. In Neural-Hidden-CRF, we can capitalize on the powerful language model BERT or other deep models to provide rich contextual semantic knowledge to the latent ground truth sequence, and use the hidden CRF layer to capture the internal label dependencies. Neural-Hidden-CRF is conceptually simple and empirically powerful. It obtains new state-of-the-art results on one crowdsourcing benchmark and three weak-supervision benchmarks, including outperforming the recent advanced model CHMM by 2.80 F1 points and 2.23 F1 points in average generalization and inference performance, respectively.
@inproceedings{chen2023neural,
  title = {Neural-Hidden-CRF: A Robust Weakly-Supervised Sequence Labeler},
  author = {Chen, Zhijun and Sun, Hailong and Zhang, Wanhao and Xu, Chunyi and Mao, Qianren and Chen, Pengpeng},
  booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {274--285},
  year = {2023},
}
- [ICDE 2023] Learning from Noisy Crowd Labels with Logics. Zhijun Chen, Hailong Sun, Haoqian He, and 1 more author. In 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023.
This paper explores the integration of symbolic logic knowledge into deep neural networks for learning from noisy crowd labels. We introduce Logic-guided Learning from Noisy Crowd Labels (Logic-LNCL), an EM-like iterative logic knowledge distillation framework that learns from both noisy labeled data and logic rules of interest. Unlike traditional EM methods, our framework contains a “pseudo-E-step” that distills from the logic rules a new type of learning target, which is then used in the “pseudo-M-step” for training the classifier. Extensive evaluations on two real-world datasets for text sentiment classification and named entity recognition demonstrate that the proposed framework improves the state-of-the-art and provides a new solution to learning from noisy crowd labels.
@inproceedings{chen2023learning,
  title = {Learning from Noisy Crowd Labels with Logics},
  author = {Chen, Zhijun and Sun, Hailong and He, Haoqian and Chen, Pengpeng},
  booktitle = {2023 IEEE 39th International Conference on Data Engineering (ICDE)},
  pages = {41--52},
  year = {2023},
  organization = {IEEE},
}
- [IJCAI 2023] Black-Box Data Poisoning Attacks on Crowdsourcing. Pengpeng Chen, Yongqiang Yang, Dingqi Yang, and 3 more authors. In IJCAI, 2023.
Understanding the vulnerability of label aggregation against data poisoning attacks is key to ensuring data quality in crowdsourced label collection. State-of-the-art attack mechanisms generally assume full knowledge of the aggregation models while failing to consider the flexibility of malicious workers in selecting which instances to label. Such a setup limits the applicability of the attack mechanisms and impedes further improvement of their success rate. This paper introduces a black-box data poisoning attack framework that finds the optimal strategies for instance selection and labeling to attack unknown label aggregation models in crowdsourcing. We formulate the attack problem on top of a generic formalization of label aggregation models and then introduce a substitution approach that attacks a substitute aggregation model in place of the unknown model. Through extensive validation on multiple real-world datasets, we demonstrate the effectiveness of both instance selection and model substitution in improving the success rate of attacks.
@inproceedings{chen2023black,
  title = {Black-Box Data Poisoning Attacks on Crowdsourcing},
  author = {Chen, Pengpeng and Yang, Yongqiang and Yang, Dingqi and Sun, Hailong and Chen, Zhijun and Lin, Peng},
  booktitle = {IJCAI},
  pages = {2975--2983},
  year = {2023},
}
2022
- [AAAI 2022] Adversarial Learning from Crowds. Pengpeng Chen, Hailong Sun, Yongqiang Yang, and 1 more author. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
Learning from Crowds (LFC) seeks to induce a high-quality classifier from training instances, which are linked to a range of possible noisy annotations from crowdsourcing workers under their various levels of skills and their own preconditions. Recent studies on LFC focus on designing new methods to improve the performance of the classifier trained from crowdsourced labeled data. To this day, however, the security aspects of LFC systems remain under-explored. In this work, we seek to bridge this gap. We first show that LFC models are vulnerable to adversarial examples: small changes to input data can cause classifiers to make prediction mistakes. Second, we propose an approach, A-LFC, for training a robust classifier from crowdsourced labeled data. Our empirical results on three real-world datasets show that the proposed approach can substantially improve the performance of the trained classifier even in the presence of adversarial examples. On average, A-LFC has 10.05% and 11.34% higher test robustness than the state-of-the-art in the white-box and black-box attack settings, respectively.
@inproceedings{chen2022adversarial,
  title = {Adversarial Learning from Crowds},
  author = {Chen, Pengpeng and Sun, Hailong and Yang, Yongqiang and Chen, Zhijun},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume = {36},
  number = {5},
  pages = {5304--5312},
  year = {2022},
}
2021
- [APWeb 2021] Data Poisoning Attacks on Crowdsourcing Learning. Pengpeng Chen, Hailong Sun, and Zhijun Chen. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, 2021.
Understanding and assessing the vulnerability of crowdsourcing learning against data poisoning attacks is key to ensuring the quality of classifiers trained from crowdsourced labeled data. Existing studies on data poisoning attacks only focus on exploring the vulnerability of crowdsourced label collection. In fact, rather than the quality of the labels themselves, the performance of the trained classifier is the main concern in crowdsourcing learning. Nonetheless, the impact of data poisoning attacks on the final classifiers remains underexplored to date. We aim to bridge this gap. First, we formalize the problem of poisoning attacks, where the objective is to sabotage the trained classifier maximally. Second, we transform the problem into a bilevel min-max optimization problem for the typical learning-from-crowds model and design an efficient adversarial strategy. Extensive validation on real-world datasets demonstrates that our attack can significantly decrease the test accuracy of trained classifiers. We verified that the labels generated with our strategy can be transferred to attack a broad family of crowdsourcing learning models in a black-box setting, indicating its applicability and its potential to be extended to the physical world.
@inproceedings{chen2021data,
  title = {Data Poisoning Attacks on Crowdsourcing Learning},
  author = {Chen, Pengpeng and Sun, Hailong and Chen, Zhijun},
  booktitle = {Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data},
  pages = {164--179},
  year = {2021},
  organization = {Springer},
}
- [IJCAI Workshop 2021] Learning from Multiple Annotators by Incorporating Instance Features. Jingzheng Li, Hailong Sun, Jiyi Li, and 3 more authors. arXiv preprint arXiv:2106.15146, 2021.
Learning from multiple annotators aims to induce a high-quality classifier from training instances, each of which is associated with a set of possibly noisy labels provided by multiple annotators under the influence of their varying abilities and own biases. In modeling the probability transition process from latent true labels to observed labels, most existing methods adopt class-level confusion matrices of annotators, under which observed labels do not depend on the instance features and are determined solely by the true labels. This may limit the performance that the classifier can achieve. In this work, we propose the noise transition matrix, which incorporates the influence of instance features on annotators’ performance based on confusion matrices. Furthermore, we propose a simple yet effective learning framework, which consists of a classifier module and a noise transition matrix module in a unified neural network architecture. Experimental results demonstrate the superiority of our method in comparison with state-of-the-art methods.
@article{li2021learning,
  title = {Learning from Multiple Annotators by Incorporating Instance Features},
  author = {Li, Jingzheng and Sun, Hailong and Li, Jiyi and Chen, Zhijun and Tao, Renshuai and Ge, Yufei},
  journal = {arXiv preprint arXiv:2106.15146},
  year = {2021},
}
- [IJCAI 2020] Structured Probabilistic End-to-End Learning from Crowds. Zhijun Chen†, Huimin Wang†, Hailong Sun, and 4 more authors. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2021.
End-to-end learning from crowds has recently been introduced as an EM-free approach to training deep neural networks directly from noisy crowdsourced annotations. It models the relationship between true labels and annotations with a specific type of neural layer, termed the crowd layer, which can be trained using pure backpropagation. Parameters of the crowd layer, however, can hardly be interpreted as annotator reliability, as compared with the more principled probabilistic approach. The lack of probabilistic interpretation further prevents extensions of the approach to account for important factors of annotation processes, e.g., instance difficulty. This paper presents SpeeLFC, a structured probabilistic model that incorporates the constraints of probability axioms for parameters of the crowd layer, which allows annotator reliability to be modeled explicitly while benefiting from the end-to-end training of neural networks. Moreover, we propose SpeeLFC-D, which further takes into account instance difficulty. Extensive validation on real-world datasets shows that our methods improve the state-of-the-art.
@inproceedings{chen2021structured,
  title = {Structured Probabilistic End-to-End Learning from Crowds},
  author = {Chen, Zhijun and Wang, Huimin and Sun, Hailong and Chen, Pengpeng and Han, Tao and Liu, Xudong and Yang, Jie},
  booktitle = {Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence},
  pages = {1512--1518},
  year = {2021},
}