Sunday, May 15, 2011

Testing LSA in R

Latent Semantic Analysis (LSA) is a text-mining technique for identifying concepts in documents.

It can be used, among other things, to query documents for concepts related to given keywords, or to estimate the similarity between groups of terms (such as phrases).

Since an R package implementing LSA is available, I decided to put it to a very simple test in the two areas mentioned above:
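
To make the idea concrete, here is a minimal base-R sketch of what LSA boils down to: a truncated singular value decomposition of a term-document matrix. The matrix `tdm`, its counts, and the choice `k <- 2` are all made up for illustration and are not the data used in the test below.

```r
# Toy term-document matrix (invented counts): rows are terms, columns are documents.
tdm <- matrix(
  c(2, 0, 1, 0,    # cat
    1, 0, 2, 0,    # dog
    0, 3, 0, 0,    # sushi
    0, 1, 0, 0,    # hamster
    0, 0, 1, 2),   # monster
  nrow = 5, byrow = TRUE,
  dimnames = list(c("cat", "dog", "sushi", "hamster", "monster"),
                  c("D1", "D2", "D3", "D4"))
)

k <- 2                           # number of latent dimensions to keep (arbitrary)
s <- svd(tdm)                    # tdm = U %*% diag(d) %*% t(V)
tk <- s$u[, 1:k, drop = FALSE]   # term vectors in the latent space
sk <- s$d[1:k]                   # the k largest singular values
dk <- s$v[, 1:k, drop = FALSE]   # document vectors in the latent space

# Rank-k reconstruction: the smoothed matrix that LSA works with.
tdm_k <- tk %*% diag(sk, nrow = k) %*% t(dk)
dimnames(tdm_k) <- dimnames(tdm)
round(tdm_k, 2)
```

Dropping the smaller singular values is what merges co-occurring terms into shared "concepts"; how many dimensions to keep is a tuning decision.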

1. LSA-based search


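> # Fold the query into the LSA space, using the matrix's terms as vocabulary.
> q <- fold_in(query("sushi hamster", rownames(myNewMatrix)), myLSAspace)

> # Cosine similarity between the query and each document (column).
> qd <- 0
> for (i in 1:ncol(myNewMatrix)) {
+   qd[i] <- cosine(as.vector(q), as.vector(myNewMatrix[,i]))
+ }

> # Spearman rank correlation between the query and each document, for comparison.
> cor(q,myNewMatrix,method="spearman")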
                      D1 D2         D3         D4
SUSHI HAMSTER -0.5882353  1 -0.7647059 -0.5882353

> qd
[1]  0.05031141  0.99187467 -0.22654816  0.22207399
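
The per-document loop can also be written as a single `apply()` call, which keeps the document names attached to the scores. The sketch below is self-contained: `cosine_sim` is a hand-written stand-in for the package's `cosine()`, and `q_vec` and `docs` are invented placeholders for the folded-in query and the document matrix.

```r
# Hand-rolled cosine similarity (stand-in for the package function).
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Hypothetical query vector and document matrix over the same five terms.
q_vec <- c(0, 0, 1, 1, 0)
docs <- matrix(c(2, 1, 0, 0, 0,
                 0, 0, 3, 1, 0,
                 1, 2, 0, 0, 1,
                 0, 0, 0, 0, 2),
               nrow = 5,
               dimnames = list(NULL, c("D1", "D2", "D3", "D4")))

# One score per document column, without an explicit loop.
scores <- apply(docs, 2, function(d) cosine_sim(q_vec, d))

# Documents ordered from most to least similar to the query.
ranking <- names(sort(scores, decreasing = TRUE))
ranking[1]
```

Sorting the scores turns the similarity vector directly into a search result list, which is usually what you want from a query.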


2. Phrase similarity


> # Fold both phrases into the LSA space and return their cosine similarity.
> ComparePhrases <- function(p1, p2) {
+   q1 <- fold_in(query(p1, rownames(myNewMatrix)), myLSAspace)
+   q2 <- fold_in(query(p2, rownames(myNewMatrix)), myLSAspace)
+   cosine(as.vector(q1), as.vector(q2))
+ }

> ComparePhrases("cat dog","monster dog")
         [,1]
[1,] 0.995967

> ComparePhrases("sushi hamster","monster dog")
           [,1]
[1,] -0.1948665
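
For intuition about what happens to a phrase before the comparison, here is a rough base-R sketch. It assumes that "folding in" means projecting a bag-of-words vector onto the top-k left singular vectors of the term-document matrix; this illustrates the general technique and is not the package's actual code. The vocabulary, counts, and all function names ending in `_sketch` are hypothetical.

```r
vocab <- c("cat", "dog", "sushi", "hamster", "monster")

# Invented term-document counts: one column per document.
tdm <- matrix(c(2, 1, 0, 0, 0,
                0, 0, 3, 1, 0,
                1, 2, 0, 0, 1,
                0, 0, 0, 0, 2),
              nrow = 5, dimnames = list(vocab, paste0("D", 1:4)))

k <- 2                                   # latent dimensions kept (arbitrary)
tk <- svd(tdm)$u[, 1:k, drop = FALSE]    # top-k left singular vectors

# Count how often each vocabulary term occurs in the phrase;
# words outside the vocabulary are ignored.
phrase_to_vector <- function(phrase) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  as.numeric(table(factor(words, levels = vocab)))
}

# Project the phrase's term counts into the k-dimensional latent space.
fold_in_sketch <- function(phrase) {
  drop(crossprod(tk, phrase_to_vector(phrase)))
}

# Cosine similarity of two phrases in the latent space.
compare_phrases_sketch <- function(p1, p2) {
  a <- fold_in_sketch(p1)
  b <- fold_in_sketch(p2)
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

compare_phrases_sketch("cat dog", "monster dog")
compare_phrases_sketch("sushi hamster", "monster dog")
```

On this toy corpus the first pair scores higher than the second, in line with the pattern in the real results above. A phrase with no vocabulary words would produce a zero vector and hence `NaN`, so a real implementation would need to guard against that.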


The complete source code is available here.
 
