Monday, May 30, 2011

Google Prediction API v1.2 for R

Google recently released a new version (v1.2) of the Google Prediction API (see the announcement from Google I/O 2011).

As a result, the previously available implementation of the Google Prediction API for R has stopped working :(

I've spent some time adapting it to the new version of the API, and I've also made some small extensions and modifications for Windows/R 2.13.0.

You can find the source code and R package at:

Sample usage of the package:

# install package
install.packages("googlepredictionapi_0.12.tar.gz", repos=NULL, type="source")
# load the package (name taken from the tarball above)
library(googlepredictionapi)
#--- initialize
# turn off SSL check
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
# put your own email, password and API key below
myEmail <- "***"
myPassword <- "***"
myAPIkey <- "***"
# put the path to python.exe on your computer and the path to the gsutil directory
myPython <- "c:/Python27/python.exe"
myGSUtilPath <- "c:/gsutil/"
myVerbose <- FALSE
#--- work
# upload a local CSV file to Google Storage and initiate training; the local file must be in the R working directory
my.model <- PredictionApiTrain(data="./language_id_pl.txt",remote.file="gs://prediction_example/prediction_models/languages")
# alternative: initiate training of a model already uploaded to Google Storage
my.model <- PredictionApiTrain(data="gs://prediction_example/prediction_models/languages",tillDone=FALSE) # tillDone: keep checking until the model is trained
# check whether the model is trained; not needed if tillDone=TRUE was set above
result <- PredictionApiCheckTrainingStatus("prediction_example","prediction_models/languages",verbose=TRUE)
# you can adapt the result returned by PredictionApiCheckTrainingStatus to the 'predictionapimodel' class used in predictions
my.model <- WrapModel(result)
# check new data against the model (I have added some Polish-language texts to the Google Prediction API 'Hello World' example)
predict(my.model,"'Prezydent Obama spotkał się z parlamentarzystami'")
# please note, this package returns all labels and scores for the given input in the following format:
# [1] "Polish"   "French"   "Spanish"  "English"  "0.36195"  "0.26396"  "0.260067" "0.114022"
# some other prediction request
predict(my.model,"'This is a test'")
# list objects in a Google Storage bucket
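
The flat result format shown above (all labels first, then all scores, in one character vector) can be awkward to work with. Here is a minimal sketch of one way to split it into a data frame, assuming the result always has that layout. `SplitPrediction` is a hypothetical helper written for this example and is not part of the package.

```r
# Hypothetical helper (not part of the googlepredictionapi package):
# splits a flat result vector into labels and numeric scores.
# Assumes the first half holds labels and the second half holds scores.
SplitPrediction <- function(res) {
  n <- length(res) / 2
  stopifnot(n >= 1, n == floor(n))
  out <- data.frame(label = res[seq_len(n)],
                    score = as.numeric(res[n + seq_len(n)]),
                    stringsAsFactors = FALSE)
  # highest score first
  out[order(out$score, decreasing = TRUE), ]
}

# Example using the output quoted above
res <- c("Polish", "French", "Spanish", "English",
         "0.36195", "0.26396", "0.260067", "0.114022")
pred <- SplitPrediction(res)
print(pred)
pred$label[1]  # label with the highest score
```

In real use you would pass the value returned by `predict(my.model, ...)` instead of the hard-coded vector.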

Sunday, May 15, 2011

Testing LSA in R

Latent Semantic Analysis (LSA) is a technique used in text mining for identifying concepts in documents.

It can be used, among other things, to query documents for concepts related to specified keywords, or to estimate similarities between groups of terms (such as phrases).
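
To make the idea concrete, here is a minimal sketch of LSA in base R, using `svd()` directly rather than any LSA package. The term-document matrix, the choice of k = 2 and the helper `cosine_sim` are all made up for illustration.

```r
# Toy term-document matrix (rows = terms, columns = documents); made-up counts
tdm <- matrix(c(2, 1, 0, 0,
                1, 2, 0, 0,
                0, 0, 2, 1,
                0, 0, 1, 2,
                1, 0, 0, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(c("cat", "dog", "sushi", "rice", "pet"),
                              c("D1", "D2", "D3", "D4")))

# Truncated SVD: keep only the k largest singular values ("concepts")
k <- 2
s <- svd(tdm)
Sk <- diag(s$d[1:k], nrow = k)
Vk <- s$v[, 1:k, drop = FALSE]

# Document coordinates in the k-dimensional concept space
docs <- Vk %*% Sk
rownames(docs) <- colnames(tdm)

# Cosine similarity between two vectors
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

sim_12 <- cosine_sim(docs["D1", ], docs["D2", ])  # two "animal" documents
sim_13 <- cosine_sim(docs["D1", ], docs["D3", ])  # "animal" vs. "food" document
c(sim_12, sim_13)
```

Documents that share a concept end up close together in the reduced space, even when they do not use exactly the same words.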

Since an R package implementing LSA is available, I've decided to put it to a very simple test in the two areas mentioned above:

1. LSA-based search

> q <- fold_in(query("sushi hamster", rownames(myNewMatrix)),myLSAspace)

> qd <- 0
> for (i in 1:ncol(myNewMatrix)) {
+ qd[i] <- cosine(as.vector(q),as.vector(myNewMatrix[,i]))
+ }

> cor(q,myNewMatrix,method="spearman")
                      D1 D2         D3         D4
SUSHI HAMSTER -0.5882353  1 -0.7647059 -0.5882353

> qd
[1]  0.05031141  0.99187467 -0.22654816  0.22207399

2. Phrase similarity

> ComparePhrases <- function(p1, p2) {
+ q1 <- fold_in(query(p1, rownames(myNewMatrix)),myLSAspace)
+ q2 <- fold_in(query(p2, rownames(myNewMatrix)),myLSAspace)
+ cosine(as.vector(q1),as.vector(q2))
+ }

> ComparePhrases("cat dog","monster dog")
[1,] 0.995967

> ComparePhrases("sushi hamster","monster dog")
[1,] -0.1948665
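
For readers who want to see what folding a phrase into a latent space can look like without the package, here is a rough base R sketch. It uses one common formulation (projecting the phrase's term-count vector onto the top k left singular vectors) and is not necessarily what the package does internally. The toy matrix and the helpers `phrase_vector`, `fold_phrase` and `compare_phrases` are made up for this example.

```r
# Toy term-document matrix (rows = terms, columns = documents); made-up counts
terms <- c("cat", "dog", "monster", "sushi", "hamster")
tdm <- matrix(c(1, 0, 0, 0,
                1, 1, 0, 0,
                0, 1, 0, 0,
                0, 0, 1, 2,
                0, 0, 1, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(terms, paste0("D", 1:4)))

# Keep the k strongest latent dimensions
k <- 2
Uk <- svd(tdm)$u[, 1:k, drop = FALSE]

# Turn a phrase into a term-count vector over the known vocabulary
# (words outside the vocabulary are ignored)
phrase_vector <- function(phrase, vocab) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  as.numeric(table(factor(words, levels = vocab)))
}

# Project a phrase onto the latent dimensions: t(Uk) %*% q
fold_phrase <- function(phrase) {
  as.vector(crossprod(Uk, phrase_vector(phrase, terms)))
}

cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

compare_phrases <- function(p1, p2) {
  cosine_sim(fold_phrase(p1), fold_phrase(p2))
}

sim_close <- compare_phrases("cat dog", "monster dog")
sim_far   <- compare_phrases("sushi hamster", "monster dog")
c(sim_close, sim_far)
```

With this toy data the two animal phrases land on the same latent dimension, while the food phrase lands on the other one, so the first similarity is high and the second is close to zero.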

The complete source code is available here.

Wednesday, May 11, 2011

TextRank implementation in R

TextRank is a graph-based algorithm for keyword extraction and summarization. It builds on PageRank, the ranking algorithm developed by Larry Page and Sergey Brin at Google.

You can read the description of the algorithm and its evaluation in the paper "TextRank: Bringing Order into Texts" by Rada Mihalcea and Paul Tarau.
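
As a rough illustration of the idea, here is a minimal base R sketch of TextRank-style keyword ranking. It assumes the tokens have already been filtered (for example to nouns and adjectives), links words that co-occur within a small window, and then iterates the PageRank update on the resulting undirected, unweighted graph. The function name `TextRankSketch`, its defaults and the sample tokens are assumptions made for this example.

```r
# Rank the words of a token sequence with a TextRank-style procedure:
# 1) link words that co-occur within a window, 2) run PageRank on the graph.
TextRankSketch <- function(tokens, window = 2, d = 0.85, iter = 50) {
  stopifnot(window >= 2, length(tokens) >= 2)
  vocab <- unique(tokens)
  adj <- matrix(0, length(vocab), length(vocab),
                dimnames = list(vocab, vocab))
  n <- length(tokens)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):min(i + window - 1, n)) {
      a <- tokens[i]
      b <- tokens[j]
      if (a != b) {          # undirected edge, no self-loops in this sketch
        adj[a, b] <- 1
        adj[b, a] <- 1
      }
    }
  }
  deg <- rowSums(adj)
  score <- rep(1, length(vocab))
  for (it in seq_len(iter)) {
    # each node shares its score equally among its neighbours
    share <- ifelse(deg > 0, score / deg, 0)
    score <- (1 - d) + d * as.vector(adj %*% share)
  }
  names(score) <- vocab
  sort(score, decreasing = TRUE)
}

tokens <- c("systems", "linear", "constraints", "linear",
            "equations", "linear", "inequations")
ranks <- TextRankSketch(tokens)
print(ranks)
```

In this sample "linear" is connected to every other word, so it comes out on top.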

I have made a quick and dirty implementation of TextRank in R, for keyword extraction only.

My implementation differs from the algorithm presented in the above-mentioned paper in two ways:

  1. it calculates edge weights from the number of times two nodes are connected (the weights are not used when calculating ranks, though)
  2. it allows self-loops, where a node has an edge to itself (these are used in the calculation)
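
The two differences above can be pictured with a small base R sketch. This is not the code from the implementation. It is only an illustration, assuming an undirected co-occurrence graph built from a sliding window, where edge weights count co-occurrences and a word repeated inside the window gets an edge to itself. `BuildCooccurrence` and the sample tokens are hypothetical.

```r
# Build a weighted co-occurrence matrix from a token sequence.
# w[a, b] counts how often a and b appear within the same window;
# a repeated word inside the window produces a self-loop on the diagonal.
BuildCooccurrence <- function(tokens, window = 2) {
  stopifnot(window >= 2)
  vocab <- unique(tokens)
  w <- matrix(0, length(vocab), length(vocab),
              dimnames = list(vocab, vocab))
  n <- length(tokens)
  if (n < 2) return(w)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):min(i + window - 1, n)) {
      a <- tokens[i]
      b <- tokens[j]
      w[a, b] <- w[a, b] + 1
      if (a != b) w[b, a] <- w[b, a] + 1  # count a self-loop only once
    }
  }
  w
}

tokens <- c("big", "big", "dog", "big", "dog")
w <- BuildCooccurrence(tokens)
print(w)

# Unweighted adjacency (self-loops included) that a ranking step could use
adj <- (w > 0) * 1
print(adj)
```

Keeping the counts in `w` and deriving `adj` from them mirrors the situation described above: the weights are available, but the ranking only needs to know which edges exist.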

I have used three R packages to speed up the implementation:

  1. tm (text mining) for preprocessing text to be analyzed
  2. openNLP for part-of-speech tagging
  3. graph for constructing graphs

I'd like to emphasize once more that this is a really quick and dirty implementation.

Please find source code here.