LSI Analysis of the zhwiki Dataset (Latent Semantic Indexing)
October 22, 2016
Take 300 articles from the zhwiki dataset as the corpus and run an LSI analysis, using tf-idf weighting via Python's gensim library.
# coding:utf-8
from gensim import corpora, models
import numpy as np
import thulac
import heapq

# Load the corpus: one article per line.
with open('news.txt', 'r') as f:
    news = f.readlines()

# Segment every article with THULAC (segmentation only, no POS tagging);
# thu1.cut returns (word, tag) pairs, so keep just the words.
thu1 = thulac.thulac(seg_only=True)
texts = [[w for w, _ in thu1.cut(n)] for n in news]

# Build the dictionary, the bag-of-words corpus and the tf-idf model.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]


def split_sen(document):
    # Split a document into sentence fragments on common punctuation.
    doc = document
    for i in ['。', ',', ',', ':', '?', '!', ';']:
        doc = ' '.join(doc.split(i))
    return doc.split()


def dedupe(items):
    # Yield the items in order, skipping duplicates.
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)


def has_num(word):
    # True if the word contains an ASCII digit.
    return any(c in word for c in '1234567890')


for test_n in [10, 21, 66, 71, 87]:
    test_news = news[test_n]
    # Split into sentence fragments, drop the first three, deduplicate.
    test_sentences = list(dedupe(split_sen(test_news)[3:]))
    sen_list = []
    for sentence in test_sentences:
        word_list = [w for w, _ in thu1.cut(sentence)]
        # Pair every word in the sentence with its tf-idf weight.
        tf_list = [[dictionary[wid], weight]
                   for wid, weight in tfidf[dictionary.doc2bow(word_list)]]
        # Score: sum of the 10 largest tf-idf weights (plus a small bonus
        # for sentences with fewer than 4 weighted words), normalized by a
        # logarithmic length penalty.
        temp_l = heapq.nlargest(10, tf_list, key=lambda x: x[1])
        gi = 0.1 if len(temp_l) < 4 else 0
        for g_n in temp_l:
            gi += g_n[1]
        g = gi / np.log(30 + len(word_list))
        sen_list.append((sentence, g))
    # Print the 3 best-scoring sentences in their original order.
    res = heapq.nlargest(3, sen_list, key=lambda s: s[1])
    print(test_n + 1)
    res_list = [[s[0], test_sentences.index(s[0])] for s in res]
    for s in sorted(res_list, key=lambda x: x[1]):
        print(s[0])
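In other words, each candidate sentence s is scored as g(s) = (b + sum of its 10 largest tf-idf weights) / ln(30 + |s|), where |s| is the number of segmented words in the sentence and the bonus b is 0.1 when fewer than 4 words received a tf-idf weight, 0 otherwise; the logarithmic denominator damps the advantage long sentences would otherwise have.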
-
How closely the rank-k approximation matches the term-doc matrix for different values of k (a sketch of the code that produces these dumps appears at the end of this series). With k=2:
[[[ 0.  0.16872732] [ 1.  0.28385104]]
 [[ 0.  0.17087512] [ 1.  0.25177949]]
 [[ 0.  0.09310504] [ 1.  0.06657734]]
 ...,
 [[ 0.  0.12855813] [ 1.  0.05259002]]
 [[ 0.  0.16851364] [ 1. -0.05191722]]
 [[ 0.  0.10435963] [ 1.  0.02791414]]]
With k=3:
[[[ 0.  0.16897327] [ 1.  0.28507598] [ 2.  0.13649461]]
 [[ 0.  0.17080689] [ 1.  0.24925751] [ 2.  0.08284236]]
 [[ 0.  0.09272927] [ 1.  0.06470531] [ 2.  0.00207098]]
 ...,
 [[ 0.  0.12813977] [ 1.  0.04945443] [ 2. -0.06896024]]
 [[ 0.  0.1677073 ] [ 1. -0.05385373] [ 2. -0.13225321]]
 [[ 0.  0.10473032] [ 1.  0.02769582] [ 2. -0.03988227]]]
With k=4:
[[[ 0.  0.16871857] [ 1. -0.28459085] [ 2. -0.13770419] [ 3. -0.069019  ]]
 [[ 0.  0.17117359] [ 1. -0.25219388] [ 2. -0.08207103] [ 3. -0.09496462]]
 [[ 0.  0.09306016] [ 1. -0.06247403] [ 2. -0.00422359] [ 3. -0.01881501]]
 ...,
 [[ 0.  0.12850835] [ 1. -0.04631836] [ 2.  0.06921176] [ 3. -0.1830543 ]]
 [[ 0.  0.16830373] [ 1.  0.05348783] [ 2.  0.13382081] [ 3. -0.1658659 ]]
 [[ 0.  0.10508035] [ 1. -0.02866824] [ 2.  0.03776732] [ 3. -0.14062948]]]
With k=10:
[[[ 0.00000000e+00  1.68929300e-01] [ 1.00000000e+00  2.81778194e-01] [ 2.00000000e+00  1.39412123e-01] ...,
  [ 7.00000000e+00  5.72861970e-02] [ 8.00000000e+00 -1.24237738e-01] [ 9.00000000e+00 -1.26407685e-01]]
 [[ 0.00000000e+00  1.70888003e-01] [ 1.00000000e+00  2.48411625e-01] [ 2.00000000e+00  8.09905073e-02] ...,
  [ 7.00000000e+00  8.28638508e-02] [ 8.00000000e+00 -1.15528742e-01] [ 9.00000000e+00  1.88724720e-02]]
 [[ 0.00000000e+00  9.24077340e-02] [ 1.00000000e+00  6.41785279e-02] [ 2.00000000e+00  1.33265938e-02] ...,
  [ 7.00000000e+00 -7.64332711e-03] [ 8.00000000e+00 -2.11026728e-02] [ 9.00000000e+00  5.42782920e-02]]
 ...,
 [[ 0.00000000e+00  1.28673796e-01] [ 1.00000000e+00  5.48490181e-02] [ 2.00000000e+00 -7.34370691e-02] ...,
  [ 7.00000000e+00 -1.09288792e-01] [ 8.00000000e+00  2.09079613e-03] [ 9.00000000e+00  2.57721482e-01]]
 [[ 0.00000000e+00  1.67892585e-01] [ 1.00000000e+00 -5.02225432e-02] [ 2.00000000e+00 -1.28673874e-01] ...,
  [ 7.00000000e+00 -1.16714887e-02] [ 8.00000000e+00 -5.75828930e-03] [ 9.00000000e+00  2.78914422e-02]]
 [[ 0.00000000e+00  1.04604000e-01] [ 1.00000000e+00  2.73290079e-02] [ 2.00000000e+00 -4.04171763e-02] ...,
  [ 7.00000000e+00 -1.52651544e-01] [ 8.00000000e+00 -2.51226745e-03] [ 9.00000000e+00  2.34189977e-01]]]
As the dumps show, the smaller k is, the stronger the approximation, but some information is lost along the way.
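The dumps above can be reproduced along the following lines. This is a minimal sketch that reuses dictionary and corpus_tfidf from the code above; LsiModel and the corpus transformation are standard gensim APIs, though the exact print formatting of the original output may differ.

# Sketch: fit an LSI model for several values of k and dump the
# transformed corpus.
for k in [2, 3, 4, 10]:
    lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=k)
    # Each document becomes a list of (topic_id, weight) pairs; stacking
    # them yields the (n_docs, k, 2) arrays shown above.
    print(np.array([[[tid, w] for tid, w in doc] for doc in lsi[corpus_tfidf]]))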
-
With k=2, all related terms and documents are projected onto the same two-dimensional latent semantic space (figure: red dots are documents, blue crosses are terms). Closely related words end up with fairly similar weights.
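A plot like that can be produced roughly as follows; this sketch assumes the k=2 model built as above and that every document receives both topic weights. Document coordinates come from the transformed corpus, term coordinates from the rows of the left singular vectors lsi2.projection.u.

import matplotlib.pyplot as plt

lsi2 = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
# Document coordinates: the two LSI weights of each document.
doc_xy = np.array([[w for _, w in doc] for doc in lsi2[corpus_tfidf]])
# Term coordinates: rows of U (n_terms x 2).
term_xy = lsi2.projection.u

plt.scatter(doc_xy[:, 0], doc_xy[:, 1], c='red', marker='o', label='documents')
plt.scatter(term_xy[:, 0], term_xy[:, 1], c='blue', marker='x', label='terms')
plt.legend()
plt.show()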
-
Construct the original term-term and doc-doc matrices, as well as the corresponding matrices with k=2; plot the terms and documents projected onto the two-dimensional latent semantic space, and compare the similarities between terms and between documents (one way to build these matrices is sketched at the end of this section). The original term-term matrix is:
[[ 0.00048803  0.00024793  0.00030132 ...,  0.          0.          0.        ]
 [ 0.00024793  0.00039648  0.00030132 ...,  0.          0.          0.        ]
 [ 0.00030132  0.00030132  0.0003662  ...,  0.          0.          0.        ]
 ...,
 [ 0.          0.          0.         ...,  0.01086935  0.01086935  0.01086935]
 [ 0.          0.          0.         ...,  0.01086935  0.01086935  0.01086935]
 [ 0.          0.          0.         ...,  0.01086935  0.01086935  0.01086935]]
The original doc-doc matrix is:
[[ 1.          0.09597075  0.01256651 ...,  0.02078486  0.01757974  0.06769401]
 [ 0.09597075  1.          0.03060802 ...,  0.02004322  0.01013162  0.01479339]
 [ 0.01256651  0.03060802  1.         ...,  0.00817854  0.0031423   0.00641839]
 ...,
 [ 0.02078486  0.02004322  0.00817854 ...,  1.          0.00933102  0.00819435]
 [ 0.01757974  0.01013162  0.0031423  ...,  0.00933102  1.          0.01914053]
 [ 0.06769401  0.01479339  0.00641839 ...,  0.00819435  0.01914053  0.98913065]]
With k=2, the term-term matrix is:
[[ 9.35276963e+00  2.69901289e-14]
 [ 2.69901289e-14  4.09090060e+00]]
and the doc-doc matrix is:
[[ 0.1065028   0.09890749  0.03303424 ...,  0.03655308  0.01427938  0.02576516]
 [ 0.09890749  0.09213056  0.03146902 ...,  0.03531263  0.01608087  0.02516947]
 [ 0.03303424  0.03146902  0.01250538 ...,  0.01524178  0.01248608  0.0115397 ]
 ...,
 [ 0.03655308  0.03531263  0.01524178 ...,  0.01929222  0.01882462  0.01497451]
 [ 0.01427938  0.01608087  0.01248608 ...,  0.01882462  0.03065     0.01610867]
 [ 0.02576516  0.02516947  0.0115397  ...,  0.01497451  0.01610867  0.01180562]]
-
With k=3, the term-term matrix is:
[[  9.35281176e+00  -4.56032781e-14   1.03085943e-14]
 [ -4.56032781e-14   4.09011943e+00  -6.34128167e-15]
 [  1.03085943e-14  -6.34128167e-15   3.78963029e+00]]
and the doc-doc matrix is:
[[ 0.12634687  0.10942283  0.03535683 ...,  0.02701973 -0.00381796  0.01941215]
 [ 0.10942283  0.09709541  0.03306415 ...,  0.02955603  0.0058315   0.02124305]
 [ 0.03535683  0.03306415  0.01319404 ...,  0.01503685  0.01140856  0.01127659]
 ...,
 [ 0.02701973  0.02955603  0.01503685 ...,  0.02345334  0.02733017  0.01751517]
 [-0.00381796  0.0058315   0.01140856 ...,  0.02733017  0.04731181  0.02149145]
 [ 0.01941215  0.02124305  0.01127659 ...,  0.01751517  0.02149145  0.01335409]]
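These matrices can be built roughly as follows; this is a sketch of the standard LSI construction, not necessarily the exact code used here. The original similarities come from the dense tf-idf term-document matrix X (gensim's tf-idf vectors are unit-normalized, which is why the diagonal of the original doc-doc matrix is close to 1), and the rank-k versions from a truncated SVD. Judging by the k x k term-term dumps above, whose off-diagonal entries are ~1e-14 noise, the term-term comparison was apparently done in the reduced space, where (U_k S_k)^T (U_k S_k) = S_k^2.

from gensim import matutils

# Dense tf-idf term-document matrix X (n_terms x n_docs).
X = matutils.corpus2dense(corpus_tfidf, num_terms=len(dictionary))

# Original similarity matrices.
term_term = X.dot(X.T)
doc_doc = X.T.dot(X)

# Truncated SVD: X ~ U_k S_k V_k^T.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk = U[:, :k] * s[:k]       # term coordinates scaled by singular values
Xk = Uk.dot(Vt[:k, :])      # rank-k reconstruction

# Rank-k similarities: an n_docs x n_docs doc-doc matrix, and the k x k
# reduced-space term-term product, which equals diag(s[:k]**2).
doc_doc_k = Xk.T.dot(Xk)
term_term_k = Uk.T.dot(Uk)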