GloVe Notes


Paper and project page: http://nlp.stanford.edu/projects/glove/

First I tried the Python implementation: https://github.com/maciejkula/glove-python

Install:

```
sudo pip install glove_python
```

Get the source:

```
git clone --recursive https://github.com/maciejkula/glove-python.git
```

Run the bundled example:

```
ipython -i -- examples/example.py -c my_corpus.txt -t 10
```

This fails because gensim is not installed. Install gensim (https://radimrehurek.com/gensim/install.html):

```
sudo pip install --upgrade gensim
```

Then smart_open (used for file opening) is missing as well:

```
sudo pip install smart_open
```

And so on...... (there were too many missing dependencies, so I gave up on the Python version and turned to the C implementation).

Next, the C implementation. GitHub: https://github.com/stanfordnlp/GloVe

Download the source and run the demo:

```
$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make
$ ./demo.sh
```

After it finishes, the following files are generated:

- cooccurrence.bin
- cooccurrence.shuf.bin
- vocab.txt — the vocabulary words and their corresponding indices
- vectors.bin — the word vectors in binary form
- vectors.txt — one vocabulary word per line with its word vector

vectors.bin looks similar to the binary vector file that word2vec produces, so can word2vec read vectors.bin directly? Testing it:

```python
model = word2vec.load("vectors.bin")
```

raises the following error:

```
ValueError: invalid literal for int() with base 10: '\x95\x87\xf2\x9aD\xa7\xf2?;\xbe6\x9dg\x05\xe8?\xf6\xf8\x93?\xfb*\xdb?%M}\xb9\xa1\x92\xd1?.\x85\xba\x19\xd7\xb4\xdb\xbf\x17\x16'
```

Looking further into the load method, it can also read txt files:

```python
def load(fname, kind='auto', *args, **kwargs):
    if kind == 'auto':
        if fname.endswith('.bin'):
            kind = 'bin'
        elif fname.endswith('.txt'):
            kind = 'txt'
        else:
            raise Exception('Could not identify kind')
    if kind == 'bin':
        return word2vec.WordVectors.from_binary(fname, *args, **kwargs)
    elif kind == 'txt':
        return word2vec.WordVectors.from_text(fname, *args, **kwargs)
    elif kind == 'mmap':
        return word2vec.WordVectors.from_mmap(fname, *args, **kwargs)
    else:
        raise Exception('Unknown kind')
```

and from_text contains this code:

```python
with open(fname, 'rb') as fin:
    header = fin.readline()
    vocab_size, vector_size = list(map(int, header.split()))
```

So the first line is expected to be a header holding vocab_size and vector_size, whereas in GloVe's vectors.txt the first line is already a word and its vector. Could we simply insert a "vocab_size vector_size" line at the top of vectors.txt and then load it directly? (Still to be tested......)

The approach actually used here follows this post: http://blog.csdn.net/adooadoo/article/details/38505497. Its test code at https://github.com/eclipse-du/glove_py_model_load/blob/master/glove_dist.py runs as-is; a slightly modified version is at https://github.com/zhangweijiqn/testPython/blob/master/src/Word2vec/test_glove.py
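Returning to the header question above: a minimal sketch of prepending the two counts (my own, untested here; the file names are placeholders):

```python
# Minimal sketch (untested in this note): prepend a "vocab_size vector_size"
# header to GloVe's vectors.txt so word2vec.load() can parse it as a text model.
def add_header(src="vectors.txt", dst="vectors_with_header.txt"):
    with open(src) as fin:
        lines = fin.readlines()
    vocab_size = len(lines)
    vector_size = len(lines[0].split()) - 1  # first field on each line is the word
    with open(dst, "w") as fout:
        fout.write("%d %d\n" % (vocab_size, vector_size))
        fout.writelines(lines)

if __name__ == "__main__":
    add_header()
    # afterwards: model = word2vec.load("vectors_with_header.txt")
```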
The test above used the files demo.sh generates from text8. What about training on your own corpus? After reading demo.sh, a simple modification lets you point it at a custom input file. The modified script:

```bash
#!/bin/bash
set -e
make

CORPUS=data_fenci.txt              # path to an already word-segmented corpus
VOCAB_FILE=vocab.txt               # output vocabulary
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
```

Running the script prints the actual commands executed; many parameters can be tuned, such as WINDOW_SIZE and VECTOR_SIZE:

```
$ make
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result

$ build/vocab_count -min-count 5 -verbose 2 < /home/zhangwj/Applications/Scrapy/baike/files/data_fenci.txt > vocab.txt
BUILDING VOCABULARY
Processed 19899975 tokens.
Counted 388318 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 107254.

$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 15 < /home/zhangwj/Applications/Scrapy/baike/files/data_fenci.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 107254 words.
Building lookup table...table contains 106188469 elements.
Processed 19899975 tokens.
Writing cooccurrences to disk.........3 files in total.
Merging cooccurrence files: processed 63589490 lines.

$ build/shuffle -memory 4.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 63589490 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 63589490 lines.

$ build/glove -save-file vectors -threads 8 -input-file cooccurrence.shuf.bin -x-max 10 -iter 15 -vector-size 50 -binary 2 -vocab-file vocab.txt -verbose 2
TRAINING MODEL
Read 63589490 lines.
Initializing parameters...done.
vector size: 50
vocab size: 107254
x_max: 10.000000
alpha: 0.750000
07/22/16 - 01:27.37PM, iter: 001, cost: 0.083864
```

Testing the generated vectors.txt with https://github.com/zhangweijiqn/testPython/blob/master/src/Word2vec/test_glove.py works: similar-word lookup succeeds.
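Under the hood, a similar-word lookup over vectors.txt is just cosine similarity on L2-normalized vectors. A self-contained sketch of the idea (function names and file layout are my own, not taken from the linked script):

```python
import numpy as np

def load_glove(path="vectors.txt"):
    """Parse GloVe's text output: one line per word, 'word v1 v2 ... vn'."""
    words, vecs = [], []
    with open(path) as fin:
        for line in fin:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append([float(x) for x in parts[1:]])
    mat = np.array(vecs)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # L2-normalize each row
    return words, mat

def most_similar(word, words, mat, topn=10):
    """Return the topn words with the highest cosine similarity to `word`."""
    idx = words.index(word)
    sims = mat.dot(mat[idx])               # cosine similarity: rows are unit-norm
    best = np.argsort(-sims)[1:topn + 1]   # skip rank 0, which is the word itself
    return [(words[i], float(sims[i])) for i in best]

words, mat = load_glove()
print(most_similar(words[0], words, mat))
```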
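As an aside (not part of the original test): recent gensim releases (4.x) can read GloVe's header-less text format directly, which makes the manual header edit unnecessary. A sketch, assuming a recent gensim:

```python
# Sketch, assuming gensim 4.x: no_header=True tells the loader that the
# text file lacks the "vocab_size vector_size" first line that word2vec writes.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("vectors.txt", binary=False, no_header=True)
query = model.index_to_key[0]            # pick some in-vocabulary word to query
print(model.most_similar(query, topn=10))
```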
