Sunday, April 17, 2011

Analyzing search keyphrase similarities with different methods

Have you ever wondered what kind of similar keyphrases people use when searching for something, and how they compare?
I have :)

Using a subset of the keyphrases used by visitors of (some 550 items), I wrote a simple R program that tries to assign potentially similar phrases to each keyphrase.

I employed three methods for this task:
  1. Levenshtein distance - see source code here
  2. length of longest common substring (LCS) - source code of my implementation here
  3. phrases starting with the base phrase - very crude implementation available here
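
To make the three measures concrete, here is a minimal Python sketch of each one. It is not the R source linked above: the function names and the sample phrases are made up for illustration, and the prefix check in particular is only a guess at what "phrases starting with the base phrase" means in practice.

```python
def levenshtein(a, b):
    """Edit distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def longest_common_substring_length(a, b):
    """Length of the longest run of characters that appears
    contiguously in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # extend the common run that ended at the previous characters
                curr.append(prev[j - 1] + 1)
                best = max(best, curr[-1])
            else:
                curr.append(0)
        prev = curr
    return best


def phrases_starting_with(base, phrases):
    """All phrases that begin with base, excluding base itself."""
    return [p for p in phrases if p != base and p.startswith(base)]


# Hypothetical sample phrases, not taken from the analyzed data set.
phrases = ["r tutorial", "r tutorial pdf", "r tutorail", "python tutorial"]
base = "r tutorial"
for other in phrases:
    if other != base:
        print(other,
              levenshtein(base, other),
              longest_common_substring_length(base, other))
print(phrases_starting_with(base, phrases))
```

A low Levenshtein distance tends to catch misspellings, a long common substring catches phrases sharing a core term, and the prefix check catches refinements of the same query, which is one reason the three approaches can disagree.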
Here is a small selection of the results of this exercise:

You can download all the results here.

As you will notice, each of the methods gives different results. No single set of results is definitively better than the others. Combining them in some way seems the most interesting solution.

Correcting spelling mistakes before analyzing dependencies seems important for improving the results. Other methods could be added as well.
