I have the code below, which Liang Sun implemented:
#Created by Liang Sun in 2013
import re
import collections
import hashlib

class Simhash(object):
    def __init__(self, value):
        self.f = 64
        self.reg = ur'[\w\ufb50-\ufdff]'
        self.value = None
        if isinstance(value, Simhash):
            self.value = value.value
        elif isinstance(value, basestring):
            self.build_by_text(unicode(value))
        elif isinstance(value, collections.Iterable):
            self.build_by_features(value)
        elif isinstance(value, long):
            self.value = value
        elif isinstance(value, Simhash):
            self.value = value.hash
        else:
            raise Exception('Bad parameter')

    def _slide(self, content, width=2):
        return [content[i:i+width] for i in xrange(max(len(content)-width+1, 1))]

    def _tokenize(self, content):
        ans = []
        content = ''.join(re.findall(self.reg, content))
        ans = self._slide(content)
        return ans

    def build_by_text(self, content):
        features = self._tokenize(content)
        return self.build_by_features(features)

    def build_by_features(self, features):
        features = set(features) # remove duplicated features
        hashs = [int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16) for w in features]
        v = [0]*self.f
        for h in hashs:
            for i in xrange(self.f):
                mask = 1 << i
                v[i] += 1 if h & mask else -1
        ans = 0
        for i in xrange(self.f):
            if v[i] >= 0:
                ans |= 1 << i
        self.value = ans

    def distance(self, another):
        x = (self.value ^ another.value) & ((1 << self.f) - 1)
        ans = 0
        while x:
            ans += 1
            x &= x-1
        return ans
I want to use this code for Arabic text. I asked Liang Sun about this, and he said I should replace self.reg = ur'[\w\ufb50-\ufdff]'
with the Arabic code point range. I searched and found the Arabic Unicode block on Wikipedia, but I don't know how to use it.
Any help is appreciated.
There is no single "Arabic code-point range". Instead, there are 7 blocks specific to Arabic, plus other blocks that Arabic may use. See Arabic script in Unicode for a nice description of them.
If you want to match the Arabic characters available in ISO-8859-6, you only want part of one of those blocks, 0621-0652.
If you want to match the main Arabic-script letters, that's the blocks 0600-06FF, 0750-077F, and 08A0-08FF. (Only the first of these dates back to Unicode 1.0; the other two were added in later versions.)
If you want the contextual variants, you also need the two "presentation forms" blocks, FB50-FDFF and FE70-FEFF (although some of these characters are not actually used by Arabic, only by other languages that use the Arabic script; then again, you tagged your question Farsi…). The fact that your original code was matching FB50-FDFF implies that you need these.
Finally, as of Unicode 6.1, there are two additional ranges that you may or may not need, primarily useful for mathematics, 10E60-10E7F and 1EE00-1EEFF.
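To make the ranges above a bit more concrete, here is a small sketch that reports which of the seven blocks a character falls in. It assumes Python 3 (the code in the question is Python 2), and ARABIC_BLOCKS and arabic_block are hypothetical names made up for illustration; the ranges are the ones listed above, labelled with their Unicode block names.

```python
# Hypothetical lookup table: the seven ranges listed above, with Unicode block names.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x10E60, 0x10E7F, "Rumi Numeral Symbols"),
    (0x1EE00, 0x1EEFF, "Arabic Mathematical Alphabetic Symbols"),
]


def arabic_block(ch):
    """Return the name of the Arabic block containing ch, or None if there is none."""
    cp = ord(ch)
    for start, end, name in ARABIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None


if __name__ == "__main__":
    # U+0628 is ARABIC LETTER BEH; 'a' is outside every Arabic block.
    for ch in ("\u0628", "\ufe8d", "a"):
        print("U+%04X" % ord(ch), arabic_block(ch))
```

Running it over a sample of your own text is a cheap way to see which blocks you actually need before settling on a character class.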
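I'm going to guess that you need the first 5 blocks, but not the last two, so, instead of this:
    self.reg = ur'[\w\ufb50-\ufdff]'
… do this:
    self.reg = ur'[\w\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff\ufb50-\ufdff\ufe70-\ufeff]'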
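If you want to try a character class covering those five blocks outside the Simhash class, the following is a rough sketch. It assumes Python 3, where the ur prefix no longer exists and the re module itself understands \uXXXX escapes; re.ASCII is there to keep \w limited to ASCII, which is how it behaves in Python 2 without re.UNICODE. ARABIC_CHAR and keep_matching are hypothetical names, not part of the Simhash code.

```python
import re

# Assumption: Python 3. re.ASCII keeps \w ASCII-only, as in Python 2 without re.UNICODE.
# The explicit ranges are the first five blocks discussed above.
ARABIC_CHAR = re.compile(
    r'[\w\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff\ufb50-\ufdff\ufe70-\ufeff]',
    re.ASCII,
)


def keep_matching(text):
    """Mimic the ''.join(re.findall(...)) step in _tokenize: drop unmatched characters."""
    return ''.join(ARABIC_CHAR.findall(text))


if __name__ == "__main__":
    # Four Arabic letters, then ASCII punctuation, a space, and an English word.
    sample = '\u0633\u0644\u0627\u0645, world!'
    print(keep_matching(sample))
```

Punctuation and whitespace are dropped and everything else is glued together, which is the behavior of the original _tokenize with a wider set of accepted characters.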
However, I'm not sure this really solves your problem. The original code was using re.findall with the presentation forms to break the text into tokens, maybe as a hacky way of splitting on end characters (which will only work on text encoded in a very particular, and obsolete, way…). Changing it so that findall matches every run of Arabic characters will give you a very different result.
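To see how different the result can be, here is a sketch (Python 3 assumed; RANGES, CHAR, RUN and shingles are hypothetical names) that compares two readings of "match Arabic characters": the per-character pattern joined back into one string and fed to the two-character sliding window, as _tokenize does, versus a pattern with + that returns one match per run.

```python
import re

# The five ranges discussed above, shared by both patterns.
RANGES = r'\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff\ufb50-\ufdff\ufe70-\ufeff'
CHAR = re.compile('[' + RANGES + ']')   # one character per match
RUN = re.compile('[' + RANGES + ']+')   # one run of characters per match


def shingles(content, width=2):
    """Same sliding window as _slide in the question, written for Python 3."""
    return [content[i:i + width] for i in range(max(len(content) - width + 1, 1))]


# Two Arabic words separated by a space.
text = '\u0633\u0644\u0627\u0645 \u0639\u0644\u064a'

# Per-character matching: the space is dropped, so the window slides across the word boundary.
per_char = shingles(''.join(CHAR.findall(text)))

# Run matching: each word comes back as a single token.
per_run = RUN.findall(text)

if __name__ == "__main__":
    print(len(per_char), per_char)
    print(len(per_run), per_run)
```

The first approach produces overlapping two-character features, including one that straddles the two words; the second produces whole words. Simhash values built from these two feature sets are not comparable, so whichever you pick, use it consistently for every document you hash.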