
Distance and similarity with N-Grams

Author: Shota Odisharia
Annotation:

Nowadays, almost every program that processes text data (and not only text data) must offer some form of search over that text. In many applications, especially those that interact with humans, an approximate-search service is, depending on the application's specification, indispensable. This, of course, further complicates the problem stated above. One of the main difficulties of approximate text search is that the degree of similarity between two words is very hard to determine and estimate, because it depends on how humans interpret and perceive the words. It is therefore probably impossible to give an exact classification of word similarity. However, many algorithms that compute the similarity of phrases are already known and widely used. Among them, probably the best known are those based on the edit distance, also known as the Levenshtein distance, and on the longest common subsequence (LCS). The main goal of this paper is to define a word-similarity algorithm that inherits the advantages of the known algorithms while avoiding their drawbacks. In the paper, we present the N-gram similarity and distance algorithms, along with the code for computing them. We provide formal and recursive definitions. We show that the edit distance and LCS are special cases of N-gram similarity and distance. We provide new measures of word similarity based on the concept of N-gram similarity and distance.
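For readers unfamiliar with the baseline measures the annotation refers to, the following is a minimal illustrative sketch in Python of the unit-cost Levenshtein distance and of one common N-gram measure for N = 2 (a Dice coefficient over character bigrams). These are standard textbook formulations shown only for orientation; they are assumptions of this summary, not the paper's own N-gram similarity and distance definitions or its code.

def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams (a common 2-gram measure)."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a), grams(b)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(levenshtein("kitten", "sitting"))               # 3
print(round(bigram_similarity("night", "nacht"), 2))  # 0.25

The pair "night"/"nacht" illustrates why a single measure is not enough: the edit distance sees three substitutions, while the bigram measure finds only one shared bigram ("ht"), which is the kind of disagreement that motivates a unified N-gram framework.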


Lecture files:

Magister - Odisharia [ka]
