Research of Distinct Algorithm of Short Text Based on Simhash
Abstract
With the growth of social networks, microblogging has become a typical application of the big data era. De-duplicating microblog posts, however, faces two difficulties: Chinese is more varied and flexible in expression than English, and Simhash performs poorly on microblog text. This paper analyzes microblog data, which is characterized by large volume and short texts, and introduces a new algorithm. Using Simhash as the baseline, we improve it and propose the B-Simhash algorithm. We also address the data quality of microblog posts through special character processing. Experiments show that B-Simhash performs better on short text, and that adding special character processing yields better results than omitting it. The experiments report higher precision and recall, and the efficiency of microblog de-duplication also improves.
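For readers unfamiliar with the baseline, the following is a minimal sketch of standard Simhash fingerprinting with a Hamming-distance comparison. It is not the B-Simhash algorithm, whose details the abstract does not give. The function names, the 64-bit width, the use of MD5 as the per-token hash, and the character-bigram tokenizer are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width (assumed; the abstract does not state one)


def char_bigrams(text):
    """Hypothetical tokenizer: overlapping character bigrams of the text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def simhash(tokens):
    """Return a 64-bit Simhash fingerprint for an iterable of string tokens."""
    weights = [0] * BITS
    for token, count in Counter(tokens).items():
        # Stable 64-bit hash per token: first 8 bytes of MD5 (illustrative choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +count or -count on every bit position.
            weights[i] += count if (h >> i) & 1 else -count
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = simhash(char_bigrams("the quick brown fox jumps over the lazy dog"))
    b = simhash(char_bigrams("the quick brown fox jumped over the lazy dog"))
    print(hamming_distance(a, b))
```

Two texts are treated as near-duplicates when the Hamming distance between their fingerprints falls at or below a chosen threshold. The abstract does not state a threshold, so it would have to be tuned on the data at hand.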
Keywords
Simhash, Text de-duplication, Data cleaning
DOI
10.12783/dtetr/oect2017/16127