word2vec Notes

    xiaoxiao2021-11-30  19

For using the original word2vec release, see this blog post: http://blog.csdn.net/jj12345jj198999/article/details/11069485

On Linux the steps are roughly: download the source, run make, then train with the following command:

    ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1

This writes the vectors to vectors.bin. Saving them to a file means they can be reused later instead of being recomputed on every run.

To start the interactive lookup of similar words:

    ./distance vectors.bin

To cluster the words:

    ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500

To sort the result by class:

    sort classes.txt -k 2 -n > classes.sorted.txt

The training set usually has to be preprocessed so that each article becomes a single line of text, with punctuation and similar content removed.

These notes use the Python interface to Google's word2vec: https://github.com/danielfrg/word2vec

(1) Installation:

    $ pip install word2vec

This fails with the following error:

    Downloading/unpacking word2vec
      Downloading word2vec-0.9.1.tar.gz (49kB): 49kB downloaded
      Running setup.py (path:/tmp/pip_build_zhangwj/word2vec/setup.py) egg_info for package word2vec
        Traceback (most recent call last):
          File "<string>", line 17, in <module>
          File "/tmp/pip_build_zhangwj/word2vec/setup.py", line 23, in <module>
            from Cython.Build import cythonize
        ImportError: No module named Cython.Build
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 17, in <module>
          File "/tmp/pip_build_zhangwj/word2vec/setup.py", line 23, in <module>
            from Cython.Build import cythonize
        ImportError: No module named Cython.Build

Cython has to be installed first:

    $ sudo pip install Cython

Without sudo, the installation fails with a permissions error. Then install word2vec, which needs sudo as well:

    $ sudo pip install word2vec

(2) Write test code on the local machine, following word2vec.ipynb under examples in the project:

    #encoding=utf-8
    from __future__ import print_function  # lets print() behave the same on Python 2 and 3
    import word2vec

    # Train on the corpus and write a binary model. Once the model file exists,
    # this call can be skipped on later runs and the model loaded directly.
    word2vec.word2vec("files/data_fenci.txt", "files/data_fenci.bin", size=100, verbose=True)

    model = word2vec.load("files/data_fenci.bin")
    indexs, metrics = model.cosine(u'感冒')  # find the words most similar to 感冒 ("common cold")
    print(indexs, metrics)
    # print(model.vocab[indexs])
    for ele, similarity in model.generate_response(indexs, metrics).tolist():
        print(ele, similarity)

(3) Code analysis
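As background for the analysis: the model.cosine call in the test code ranks vocabulary words by their cosine similarity to the query word, as its name suggests. The sketch below shows that idea with plain NumPy. It is not the library's implementation; the function name most_similar and the toy vocabulary and vectors are made up for illustration.

```python
import numpy as np

def most_similar(word, vocab, vectors, n=3):
    """Return the n words closest to `word` by cosine similarity."""
    # Normalise each row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[vocab.index(word)]
    scores = unit @ query
    # Sort by descending score and skip the query word itself.
    order = [i for i in np.argsort(-scores) if vocab[i] != word][:n]
    return [(vocab[i], float(scores[i])) for i in order]

# Toy 2-dimensional vectors, invented for this example only.
toy_vocab = ["cold", "flu", "fever", "car"]
toy_vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.7, 0.4], [-0.2, 1.0]])

for neighbour, score in most_similar("cold", toy_vocab, toy_vectors):
    print(neighbour, round(score, 3))
```

With real models the vectors have 100 or 200 dimensions (the -size / size setting used above), but the ranking step is the same.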
word2vec.word2vec is the function called to train the model (a plain function, not a class constructor). It is defined in scripts_interface.py as follows:

    def word2vec(train, output, size=100, window=5, sample='1e-3', hs=0,
                 negative=5, threads=12, iter_=5, min_count=5, alpha=0.025,
                 debug=2, binary=1, cbow=1, save_vocab=None, read_vocab=None,
                 verbose=False):
        """
        word2vec execution

        Parameters for training:
            train <file>         (training data)
                Use text data from <file> to train the model
            output <file>        (output)
                Use <file> to save the resulting word vectors / word clusters
            size <int>           (vector size)
                Set size of word vectors; default is 100
            window <int>         (window)
                Set max skip length between words; default is 5
            sample <float>
                Set threshold for occurrence of words. Those that appear with
                higher frequency in the training data will be randomly
                down-sampled; default is 0 (off), useful value is 1e-5
            hs <int>             (hierarchical softmax)
                Use Hierarchical Softmax; default is 1 (0 = not used)
            negative <int>       (negative sampling)
                Number of negative examples; default is 0, common values are
                5 - 10 (0 = not used)
            threads <int>
                Use <int> threads (default 1)
            min_count <int>
                This will discard words that appear less than <int> times;
                default is 5
            alpha <float>
                Set the starting learning rate; default is 0.025
            debug <int>
                Set the debug mode (default = 2 = more info during training)
            binary <int>
                Save the resulting vectors in binary mode; default is 0 (off)
            cbow <int>
                Use the continuous bag of words model; default is 1
                (skip-gram model)
            save_vocab <file>
                The vocabulary will be saved to <file>
            read_vocab <file>
                The vocabulary will be read from <file>, not constructed from
                the training data
            verbose
                Print output from training
        """

Several defaults quoted in the docstring do not match the signature. According to the signature, the Python call defaults to hs=0 (hierarchical softmax off), negative=5 (negative sampling on), sample='1e-3', threads=12 and binary=1, so the values in the signature are the ones that apply when an argument is omitted.

The file also contains word2clusters, word2phrase, doc2vec and other functions. Each of them joins the arguments passed to it into a command string and then calls run_cmd to execute it.

word2vec also provides an experimental doc2vec; see: https://github.com/zhangweijiqn/word2vec/blob/master/examples/doc2vec.ipynb

Word2Vec in TensorFlow:

Word vectors (word embeddings) generated with TensorFlow. Official site: https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html

The test code needs sklearn and matplotlib. To install sklearn:

    pip install -U scikit-learn
This fails because the scipy package is missing. Following the installation instructions on the scipy site (http://www.scipy.org/install.html):

    sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

For the error itself, see: http://stackoverflow.com/questions/11114225/installing-scipy-and-numpy-using-pip
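Returning to the Python wrapper from the code analysis: its functions turn their parameters into a command line and hand it to run_cmd. The sketch below illustrates that pattern only and is not the code in scripts_interface.py. The helper names build_word2vec_command and run_command are hypothetical, and it assumes an executable named word2vec is on the PATH and that each keyword maps to a flag of the same name, as with the -size, -window and -binary flags used earlier.

```python
import subprocess

def build_word2vec_command(train, output, **options):
    """Assemble an argument list such as ['word2vec', '-train', ...]."""
    command = ["word2vec", "-train", train, "-output", output]
    # Assumption: every keyword corresponds to a command-line flag of the same name.
    for name, value in options.items():
        command += ["-" + name, str(value)]
    return command

def run_command(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        # Raises CalledProcessError if the tool exits with a non-zero status.
        subprocess.check_call(command)

cmd = build_word2vec_command(
    "files/data_fenci.txt", "files/data_fenci.bin", size=100, window=5, binary=1
)
run_command(cmd)  # dry run: only prints the assembled command
```

Passing the arguments as a list instead of one joined string avoids shell quoting problems when file names contain spaces.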