• Engineering and Technology •
LI Guo, ZHANG Chunjie, ZHANG Zhiyuan
Abstract:
To overcome the limitations of traditional clustering algorithms on large-scale, high-dimensional text collections, a text clustering method based on a weighted LDA (latent Dirichlet allocation) model is presented. LDA yields two distributions hidden in the text: the topic distribution of each document and the word distribution of each topic. These are combined into a single text feature from which the final text similarity is computed. Experiments with the classic K-means algorithm on both English and Chinese corpora show that the method clusters better than a pure VSM (vector space model) combined with K-means.
Key words: LDA, vector space model, data mining, text clustering, K-means
CLC Number:
TP18
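The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the mixing weight `lam`, the helper names (`word_mixture`, `combined_feature`, `similarity`, `kmeans`), and the toy matrices `theta` and `phi` are all assumptions made for illustration. The sketch assumes LDA has already been fitted, takes the document-topic distribution and the topic-word distribution as given, concatenates a weighted topic view and word view of each document, and clusters the result with plain K-means.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def word_mixture(theta_d, phi):
    """Word distribution implied by a document: p(w|d) = sum_k theta_d[k] * phi[k][w]."""
    return [sum(theta_d[k] * phi[k][w] for k in range(len(phi)))
            for w in range(len(phi[0]))]

def combined_feature(theta_d, phi, lam=0.5):
    """Concatenate the weighted topic view and word view of one document.

    lam is a hypothetical mixing weight; the abstract does not specify one.
    """
    topic_part = [math.sqrt(lam) * x for x in normalize(theta_d)]
    word_part = [math.sqrt(1.0 - lam) * x
                 for x in normalize(word_mixture(theta_d, phi))]
    return topic_part + word_part

def similarity(f_a, f_b):
    """Dot product of two features = lam*cos(topic views) + (1-lam)*cos(word views)."""
    return sum(a * b for a, b in zip(f_a, f_b))

def kmeans(features, k, iterations=20):
    """Plain K-means (squared Euclidean distance), seeded with the first k points."""
    centroids = [list(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        for i, f in enumerate(features):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(f, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy LDA output (assumed, not from the paper): 4 documents, 2 topics, 4 words.
theta = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.1, 0.9]]
phi = [[0.5, 0.4, 0.05, 0.05], [0.05, 0.05, 0.4, 0.5]]

features = [combined_feature(t, phi, lam=0.5) for t in theta]
labels = kmeans(features, k=2)
print(labels)
```

Because every feature vector has unit length, the squared Euclidean distance used by K-means equals 2 - 2 × similarity, so clustering on these vectors is consistent with the combined similarity. In practice `theta` and `phi` would come from a fitted LDA model rather than being written by hand.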
LI Guo, ZHANG Chunjie, ZHANG Zhiyuan. Text clustering method based on weighted LDA model[J]. .
URL: https://www.cauc.edu.cn/jweb_cauc/EN/
https://www.cauc.edu.cn/jweb_cauc/EN/Y2016/V34/I2/46