UniASM: Binary Code Similarity Detection without Fine-tuning - Zhuanzhi (专知) Papers


Code similarity detection · Similarity · Fine-tuning · Code · Clone detection

April 6, 2023

UniASM: Binary Code Similarity Detection without Fine-tuning


Yeming Gu, Hui Shu, Fan Hu

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Binary code similarity detection (BCSD) is widely used in binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that learning-based binary code embedding models perform better than traditional feature-based approaches. In this paper, we propose a novel transformer-based binary code embedding model named UniASM to learn representations of binary functions. We design two new training tasks that make the spatial distribution of the generated vectors more uniform, so the vectors can be used directly in BCSD without any fine-tuning. In addition, we present a new tokenization approach for binary functions, which enriches the semantic information carried by each token and mitigates the out-of-vocabulary (OOV) problem. We conduct an in-depth analysis of the factors affecting model performance through ablation experiments and obtain some new and valuable findings. The experimental results show that UniASM outperforms the state-of-the-art (SOTA) approach on the evaluation dataset. The average Recall@1 scores in the cross-compiler, cross-optimization-level, and cross-obfuscation settings are 0.77, 0.72, and 0.72, respectively. In addition, in the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
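To make the Recall@1 metric in the abstract concrete, here is a minimal sketch of how such a score can be computed once each function has been mapped to a vector. It is an illustration only, not the authors' evaluation code: the helper names `cosine` and `recall_at_1`, the use of cosine similarity as the ranking measure, and the three-dimensional toy vectors are all assumptions made for this example. Query `i` is assumed to be a variant of pool function `i` (for instance, the same source function built with a different compiler or optimization level).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(queries, pool):
    """Fraction of queries whose most similar pool vector is the true match.

    Assumption for this sketch: the true match of queries[i] is pool[i].
    """
    hits = 0
    for i, q in enumerate(queries):
        # Rank the pool by similarity to the query and keep only the top result.
        best = max(range(len(pool)), key=lambda j: cosine(q, pool[j]))
        if best == i:
            hits += 1
    return hits / len(queries)


# Hypothetical embeddings standing in for the output of an embedding model.
pool = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.0, 0.6]]

score = recall_at_1(queries, pool)
print(f"Recall@1 = {score:.2f}")
```

In this toy data the third query lies closer to the first pool vector than to its true match, so two of the three queries are retrieved correctly. Because the ranking uses the raw vectors directly, the sketch mirrors the setting described in the abstract, where embeddings are compared without a task-specific fine-tuning step.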

