Problem Statement:
Let's say you have a number of documents. Each document has a number of words (we refer them as terms). The problem is to identify two documents which are most similar.
Given below is one of the solution approaches
Use of Term Document Matrix:
Each row of a Term document matrix (let's name it as D) is a document vector,
with one column for every term in the entire corpus of documents. The matrix is sparse as not all document might contain a term. The value in each cell of the matrix is the term frequency.
example:
docid term1 term2 term3 term4 term n
d1 2 1 0 0 3
d2 0 2 3 4 1
d3 1 0 4 2 0
.
.
The transpose of the same Term Document Matrix (DT) will look as follows
docid d1 d2 d3 .............. dn
term1 2 0 1
term2 1 2 0
term3 0 3 4
.
.
We can create a Similarity Matrix (S) by multiplying D with DT { e.g. S = D * DT}
The structure of Similarity matrix will be as follows
docid d1 d2 d3 ............. dn
d1 x11 x12 x13 x1n
d2 x21 x22 x23 x2n
d3 x31 x32 x33 x3n
.
.
dn xn1 xn2 xn3 xnn
where Xmn = SUM Product of all term frequencies of docids dm and dn
Intuitively higher the value of Xmn, the more similar are the documents with doc ids dm and dn.
Coming back to our original problem, find which to documents are most similar.
Simply look into the Similarity matrix and find our the row, col of the highest value in the matrix.
No comments:
Post a Comment