Research of Distinct Algorithm of Short Text Based on Simhash

Yun ZHANG; Zong-ze JIN; Wei-min MU; Wei-ping WANG

doi:10.12783/dtetr/oect2017/16127

Abstract

With the growth of social networks, the microblog has become a typical application of the big data era. De-duplicating microblog posts raises two difficulties: Chinese is more varied and flexible in expression than English, and Simhash performs poorly on microblog text. This paper analyzes microblog data, which is characterized by large volume and short texts, and introduces a new algorithm for it. Taking Simhash as the baseline, we improve it and propose the B-Simhash algorithm. We also address microblog data quality through special character processing. Experiments show that B-Simhash handles short text better than the baseline, and that adding special character processing improves the results further. The experiments yield higher precision and recall, and the efficiency of microblog de-duplication also improves.
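
For orientation, the baseline the abstract refers to can be sketched as follows. This is a minimal illustration of standard Simhash with a Hamming-distance comparison, not the authors' B-Simhash, whose details the abstract does not give. The cleaning rules (dropping URLs, @mentions, and punctuation), the character-bigram features, the 64-bit fingerprint, and the threshold of 3 are all assumptions chosen for the example.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed; taken from the first 8 bytes of MD5)


def clean(text):
    """Hypothetical special-character processing: drop URLs, @mentions, punctuation."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def features(text, n=2):
    """Character n-grams, which need no word segmenter (an assumption for short Chinese text)."""
    s = text.replace(" ", "")
    if len(s) < n:
        return [s] if s else []
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def simhash(text):
    """Standard Simhash: each feature votes +1 or -1 on every bit position."""
    votes = [0] * BITS
    for feat in features(clean(text)):
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(a, b, threshold=3):
    """Treat two posts as duplicates if their fingerprints are within the threshold."""
    return hamming(simhash(a), simhash(b)) <= threshold
```

Because short posts yield few features, a single changed word can flip many fingerprint bits, which is one plausible reason plain Simhash struggles on microblog text. Cleaning before hashing at least ensures that posts differing only in links, mentions, or punctuation map to the same fingerprint.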

Keywords

Simhash, Text de-duplication, Data cleaning

ENGINEERING and TECHNOLOGY RESEARCH
