Plagiarism Detection in the Bengali Language: A Text Similarity-Based Approach

Plagiarism means taking another person's work without giving them credit for it. It is one of the most serious problems in academia and among researchers. Although multiple tools are available to detect plagiarism in a document, most of them are domain-specific and designed for English texts, yet plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India, with 300 million native speakers and 37 million second-language speakers. Plagiarism detection requires a large corpus for comparison. Bengali literature has a history of 1300 years, and most Bengali literature books have not yet been properly digitized. Because no suitable corpus existed for our purpose, we collected Bengali literature books from the National Digital Library of India, extracted their text with a comprehensive methodology, and constructed our own corpus. In our experiments, text extraction using OCR reached an average accuracy between 72.10% and 79.89%. The Levenshtein distance algorithm is used to determine plagiarism. We have built a web application for end users and successfully tested it for plagiarism detection in Bengali texts. In the future, we aim to construct a corpus with more books for more accurate detection.
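The abstract names Levenshtein distance as the similarity measure but does not describe how it is applied. The following is a minimal sketch of the standard edit-distance computation in Python, not the authors' implementation. The function names `levenshtein` and `similarity`, and the normalization by the longer string's length, are assumptions made for illustration. The sketch also compares Unicode code points, so a Bengali letter with a combining vowel sign counts as more than one symbol.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-symbol insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j symbols of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means identical text.
    This normalization is an assumption, not taken from the paper."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

In a detection setting, a submitted passage could be scored against each corpus passage with `similarity`, and pairs above some chosen threshold flagged for review. The threshold and the passage granularity would need to be tuned, and neither is specified in the abstract.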

