## Cosine Similarity between documents

N - number of unique words in the document collection - the vector dimensionality
ND - total number of documents
di: {w_1i, w_2i, ..., w_Ni},
dj: {w_1j, w_2j, ..., w_Nj}
w_ki = 0, if the k-th word does not occur in the i-th document, else
w_ki = f_ki*log(ND/nd_k) - weight of the k-th word in the i-th document, where
nd_k - number of documents that contain the k-th word,
f_ki - local frequency of the k-th word in the i-th document, a kind of local importance.
f_ki = n_ki - raw count;
f_ki = 1 + log(n_ki) if n_ki > 0, else f_ki = 0;
f_ki = n_ki/sum_k(n_ki) - normalized by document length;
f_ki = n_ki/max_k(n_ki) - normalized by the most frequent word.
sim(di,dj) = DI*DJ/(|DI|*|DJ|) =>
sim(di,dj) = sum(k=1,N, w_ki*w_kj)/
( sqrt(sum(k=1,N,w_ki^2))*sqrt(sum(k=1,N,w_kj^2)));
in sum(k=1,N, w_ki*w_kj) only the words that occur in both documents contribute.
0 <= sim() <= 1, since all weights are non-negative.
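A minimal Python sketch of this scheme (function names are mine; f_ki is the raw count and logs are base 10, matching the examples below):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build w_ki = n_ki * log10(ND / nd_k) vectors for tokenized documents.

    Raw counts are used for f_ki; any of the f_ki variants above can be
    swapped in at the marked line.
    """
    ND = len(docs)
    counts = [Counter(d) for d in docs]
    nd = Counter()                       # nd_k: number of documents containing word k
    for c in counts:
        nd.update(c.keys())
    vocab = sorted(nd)
    vectors = [
        [c[w] * math.log10(ND / nd[w]) for w in vocab]   # f_ki = n_ki
        for c in counts
    ]
    return vectors, vocab

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

On the two example documents below this reproduces the sim = 0 result of the plain TF-IDF weighting, since every shared word occurs in both documents and gets IDF 0.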

example:

ND=2
d1 = "to be or not to be"
d2 = "to be and not to be"
words = {and, be, not, or, to}, N=5 - dimensions
d1 = {X,be,not,or,to}
d2 = {and,be,not,X,to}
sim(Jaccard) = sum_k min(n_k1, n_k2) / (|d1|+|d2|) = 5/12 as bags of words (3/5 on unique words, the standard set Jaccard)
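Both Jaccard numbers can be checked with collections.Counter; the bag variant here divides the multiset intersection by the total token count (one common convention, a sketch rather than the only definition):

```python
from collections import Counter

def jaccard_bag(tokens_a, tokens_b):
    """Sum of per-word minimum counts over the total number of tokens."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    intersection = sum((ca & cb).values())       # & keeps the minimum counts
    return intersection / (sum(ca.values()) + sum(cb.values()))

def jaccard_set(tokens_a, tokens_b):
    """Classic Jaccard on unique words."""
    sa, sb = set(tokens_a), set(tokens_b)
    return len(sa & sb) / len(sa | sb)
```

For d1 and d2 above this gives 5/12 and 3/5 respectively.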
==================
w_ki=n_ki - counts
d1={0,2,1,1,2}, |d1|=sqrt(10)
d2={1,2,1,0,2}, |d2|=sqrt(10)
D1*D2 = (0*1+2*2+1*1+1*0+2*2) = 9
sim = 9/10 = 0.9
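The raw-count case can be double-checked directly from the vectors (plain Python, no dependencies):

```python
import math

# counts over the vocabulary {and, be, not, or, to}
d1 = [0, 2, 1, 1, 2]
d2 = [1, 2, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))     # 0 + 4 + 1 + 0 + 4 = 9
n1 = math.sqrt(sum(a * a for a in d1))       # sqrt(10)
n2 = math.sqrt(sum(b * b for b in d2))       # sqrt(10)
sim = dot / (n1 * n2)                        # 9/10
```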
=================
w_ki = n_ki*log(ND/nd_k) - TF-IDF weighting (base-10 logs)
d1={0,0,0,0.3,0}
d2={0.3,0,0,0,0}
sim = 0 !!!??? - every word shared by d1 and d2 occurs in all ND documents, so its IDF is log(2/2) = 0; only the non-shared words keep non-zero weight (the shared words behave like stop words).
variant:
w_ki = 1 + log(ND/nd_k) if the k-th word occurs in the i-th document, else 0 - shifted IDF with binary term presence (base-10 logs)
d1={0,1,1,1.3,1}, |d1|=sqrt(4.69)
d2={1.3,1,1,0,1}, |d2|=sqrt(4.69)
sim = 3/4.69 ≈ 0.64
==================
w_ki = n_ki*(ND/nd_k) - linear (log-free) IDF
d1={0,2,1,2,2}, |d1|=sqrt(13)
d2={2,2,1,0,2}, |d2|=sqrt(13)
D1*D2 = (0*2+2*2+1*1+2*0+2*2) = 9
sim = 9/13 ≈ 0.69

Notice: s1 = similarity({1,2,3}, {1,2,3}) = 1 and s2 = similarity({1,2,3,4,...,100}, {1,2,3,4,...,100}) = 1. But arguably s2 should be bigger than s1, so sim() needs a further normalization that takes the absolute size of the overlapped region into account.

sim' = sim * 1/(1 + log(NMAX/Noverlapped)), where
NMAX - maximum possible size of the overlapped region (i.e. the maximum document length),
Noverlapped - actual size of the overlapped region.
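A sketch of this correction in Python (base-10 log to match the examples; the maximum overlap size must be supplied by the caller, e.g. the maximum document length):

```python
import math

def adjusted_sim(sim, n_overlapped, n_max):
    """Penalize similarities that rest on a small overlapped region.

    n_max        - maximum possible size of the overlapped region
                   (e.g. the maximum document length)
    n_overlapped - actual size of the overlapped region
    """
    if n_overlapped == 0:
        return 0.0
    return sim * (1 / (1 + math.log10(n_max / n_overlapped)))
```

With a full overlap (n_overlapped == n_max) the factor is 1 and sim is unchanged; the smaller the overlap, the stronger the penalty.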