Wednesday, May 11, 2011

TextRank implementation in R

TextRank is a graph-based algorithm for keyword extraction and summarization. It is based on PageRank, the algorithm developed by Larry Page of Google.

You can read the description of the algorithm and its evaluation in the paper "TextRank: Bringing Order into Texts" by Rada Mihalcea and Paul Tarau.
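
For reference, the unweighted score that the paper assigns to a vertex is the PageRank formula. The notation below is a restatement from the paper, not taken from my code: In(V_i) is the set of vertices pointing to V_i, Out(V_j) is the set of vertices V_j points to, and d is the damping factor (the paper uses 0.85). For keyword extraction the graph is undirected, so both sets are simply the neighbours of the vertex.

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} \, S(V_j)
```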

I have made a quick and dirty implementation of TextRank in R, for keyword extraction only.

My implementation differs from the algorithm presented in the above-mentioned paper in two ways:

  1. it calculates the weight of each edge as the number of times its two nodes are connected (the weights are not used in the calculation of ranks, though)
  2. it allows self-loops, where a node has an edge to itself (these are used in the calculation of ranks)
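
The two differences above can be sketched roughly as follows. This is a minimal illustration in Python, not my R code: the function names `build_graph` and `rank`, the window of two adjacent words, the damping factor of 0.85 and the fixed number of iterations are all assumptions made for the example. Edge counts are collected but ignored when ranking, and a word that directly follows itself gets an edge to itself, which does take part in ranking.

```python
from collections import defaultdict


def build_graph(words, window=2):
    """Build an undirected co-occurrence graph from a list of words.

    Returns (neighbors, weights): neighbors maps each word to the set of
    words it is connected to; weights maps a sorted word pair to the number
    of times that pair was connected.
    """
    neighbors = defaultdict(set)
    weights = defaultdict(int)
    for i, w in enumerate(words):
        # Connect w to the words that follow it inside the window.
        for j in range(i + 1, min(i + window, len(words))):
            u = words[j]
            weights[tuple(sorted((w, u)))] += 1  # counted, but unused in rank()
            neighbors[w].add(u)  # if u == w this is an edge from w to itself
            neighbors[u].add(w)
    return neighbors, weights


def rank(neighbors, d=0.85, iterations=50):
    """Unweighted PageRank-style scores on the undirected graph."""
    scores = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        # Each node receives an equal share of every neighbor's score.
        scores = {
            v: (1 - d) + d * sum(scores[u] / len(neighbors[u]) for u in neighbors[v])
            for v in neighbors
        }
    return scores


words = ["graph", "graph", "rank", "graph", "rank", "text"]
neighbors, weights = build_graph(words)
scores = rank(neighbors)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 3))
```

In this toy input "graph" follows itself once, so it ends up with an edge to itself, and the pair ("graph", "rank") gets a weight of 3 that has no effect on the scores.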

I have used three R packages to speed up the implementation:

  1. tm (text mining) for preprocessing text to be analyzed
  2. openNLP for part-of-speech tagging
  3. graph for constructing graphs

I'd like to emphasize once more that this is a really quick and dirty implementation.

Please find the source code here.