Saturday, May 18, 2013

Using Term Document Matrix

Problem Statement:

Let's say you have a number of documents. Each document has a number of words (we refer them as terms). The problem is to identify two documents which are most similar.

Given below is one of the solution approaches

Use of Term Document Matrix:

Each row of a Term document matrix (let's name it as D) is a document vector, with one column for every term in the entire corpus of documents. The matrix is sparse as not all document might contain a term. The value in each cell of the matrix is the term frequency.

example:
docid      term1      term2     term3     term4                                term n
d1            2             1            0              0                                    3
d2            0             2            3              4                                    1
d3            1             0            4              2                                    0
.
.

The transpose of the same Term Document Matrix (DT) will look as follows

docid      d1     d2        d3  .............. dn
term1       2      0          1
term2       1      2          0
term3       0      3          4
.
.

We can create a Similarity Matrix (S) by multiplying D with DT { e.g. S = D * DT}
The structure of Similarity matrix will be as follows

docid   d1       d2        d3 ............. dn
d1       x11     x12      x13             x1n
d2       x21     x22      x23             x2n
d3       x31     x32      x33             x3n
.
.
dn        xn1    xn2      xn3               xnn

where Xmn = SUM Product of all term frequencies of docids dm and dn
Intuitively higher the value of Xmn, the more similar are the documents with doc ids dm and dn.

Coming back to our original problem, find which to documents are most similar.
Simply look into the Similarity matrix and find our the row, col of the highest value in the matrix.