Dissecting CLucene (10): Index Merging in Detail (1)

xiaoxiao 2026-06-22

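The previous post mentioned index merging. SegmentMerger::merge() consists mainly of the following parts:

• Merging fields: mergeFields()
• Merging the term dictionary and postings lists: mergeTerms()
• Merging normalization factors: mergeNorms()
• Merging term vectors: mergeVectors()

Each part is described in detail below.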

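Merging fields: mergeFields()

This has two parts: merging the .fnm information, i.e. the field metadata, and merging the .fdt and .fdx information, i.e. the stored field data.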

Merging the .fnm information

• First, create a new field metadata object: fieldInfos = new FieldInfos();
• Then read the field metadata of each segment being merged through its reader, in turn, and add it to that object. The code is as follows:

    fieldInfos = _CLNEW FieldInfos(); // merge field names
    SegmentReader* reader = NULL;
    int32_t docCount = 0;
    // Iterate through all readers
    for (uint32_t i = 0; i < readers.size(); i++){
        reader = readers[i];
        TCHAR** tmp = NULL;
        // Indexed fields that store term vectors (isIndexed=true, storeTermVector=true)
        tmp = reader->getIndexedFieldNames(true);
        fieldInfos->add((const TCHAR**)tmp, true, true);
        // Indexed fields that do not store term vectors (isIndexed=true, storeTermVector=false)
        tmp = reader->getIndexedFieldNames(false);
        fieldInfos->add((const TCHAR**)tmp, true, false);
        // Fields that are not indexed (isIndexed=false, storeTermVector=false)
        tmp = reader->getFieldNames(false);
        fieldInfos->add((const TCHAR**)tmp, false, false);
    }
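To make the effect of the three add() calls concrete, here is a minimal sketch of how field metadata from several segments can be unioned. SimpleFieldInfo and SimpleFieldInfos are hypothetical stand-ins, not CLucene classes, and the rule that flags are OR-ed when a field name is seen again is an assumption made for illustration. The real FieldInfos also assigns field numbers in insertion order, which this sketch does not model.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a per-field metadata record.
struct SimpleFieldInfo {
    bool isIndexed = false;
    bool storeTermVector = false;
};

// Hypothetical stand-in for the merged field metadata: field name -> flags.
class SimpleFieldInfos {
public:
    // Adds every name with the given flags. If a name already exists,
    // the flags are OR-ed together (an assumption of this sketch).
    void add(const std::vector<std::string>& names,
             bool isIndexed, bool storeTermVector) {
        for (const std::string& name : names) {
            SimpleFieldInfo& fi = fields_[name];  // created on first sight
            fi.isIndexed = fi.isIndexed || isIndexed;
            fi.storeTermVector = fi.storeTermVector || storeTermVector;
        }
    }

    std::size_t size() const { return fields_.size(); }

    // Throws std::out_of_range if the field is unknown.
    const SimpleFieldInfo& get(const std::string& name) const {
        return fields_.at(name);
    }

private:
    std::map<std::string, SimpleFieldInfo> fields_;
};
```

Calling add() once per category (indexed with term vectors, indexed without, not indexed) for every segment yields one entry per distinct field name, regardless of how many segments contain that field.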

Merging the field data (.fdt, .fdx)

An IndexReader reads every document to be merged and adds it to a FieldsWriter. Simplified code follows:

    FieldsWriter* fieldsWriter = _CLNEW FieldsWriter(directory, segment, fieldInfos);
    try {
        IndexReader* reader = NULL;
        // Iterate through all readers
        for (uint32_t i = 0; i < readers.size(); i++) {
            // Get the i-th reader
            reader = (SegmentReader*)readers[i];
            int32_t maxDoc = reader->maxDoc();
            // Iterate through all the documents managed by the current reader
            for (int32_t j = 0; j < maxDoc; j++){
                // Skip the j-th document if it has been deleted
                if (!reader->isDeleted(j)){
                    // Get the document
                    Document* doc = reader->document(j);
                    // Add the document to the new FieldsWriter
                    fieldsWriter->addDocument( doc );
                    docCount++; // docCount is declared in the .fnm snippet above
                    // doc is not used anymore, so delete it
                    _CLDELETE(doc);
                }
            }
        }
    }_CLFINALLY(
        fieldsWriter->close();
        _CLDELETE( fieldsWriter );
    );

Merging normalization factors: mergeNorms()

Merging normalization factors is fairly simple and resembles merging the field data: for each indexed field, the norms are read through the reader of each segment being merged and then written to the newly created segment. Norms of deleted documents are skipped.

    void SegmentMerger::mergeNorms() {
        IndexReader* reader = NULL;
        OutputStream* output = NULL;
        // Iterate through all the FieldInfo instances
        for (int32_t i = 0; i < fieldInfos->size(); i++) {
            FieldInfo* fi = fieldInfos->fieldInfo(i);
            if (fi->isIndexed){
                // Create a new filename for the norm file
                const char* buf = Misc::segmentname(segment,".f", i);
                output = directory->createFile( buf );
                _CLDELETE_CaARRAY( buf );
                // Iterate through all SegmentReaders
                for (uint32_t j = 0; j < readers.size(); j++) {
                    // Get the j-th IndexReader
                    reader = readers[j];
                    // Get the norm bytes of this field in this segment (NULL if there are none)
                    uint8_t* input = reader->norms(fi->name);
                    // Total number of documents, including those marked deleted
                    int32_t maxDoc = reader->maxDoc();
                    // Iterate through all the documents
                    for(int32_t k = 0; k < maxDoc; k++) {
                        uint8_t norm = input != NULL ? input[k] : 0;
                        // Write the norm only if document k is not deleted
                        if (!reader->isDeleted(k)){
                            output->writeByte(norm);
                        }
                    }
                }
                if (output != NULL){
                    // Close the OutputStream and destroy it
                    output->close();
                    _CLDELETE(output);
                }
            }
        }
    }
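The norm-merging loop can be modeled without any CLucene types. The sketch below is only an in-memory illustration: SegmentNorms and mergeNormsForField are hypothetical names, the norms of one field are held in a vector instead of a file, and an empty vector plays the role of a NULL norms array.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory model of one segment, for a single field.
struct SegmentNorms {
    std::vector<std::uint8_t> norms;  // one byte per document; empty if the field has no norms
    std::vector<bool> deleted;        // deleted[k] is true if document k is marked deleted
};

// Concatenates the norms of all segments, skipping deleted documents.
// deleted.size() plays the role of maxDoc() for each segment.
std::vector<std::uint8_t> mergeNormsForField(const std::vector<SegmentNorms>& segments) {
    std::vector<std::uint8_t> merged;
    for (const SegmentNorms& seg : segments) {
        for (std::size_t k = 0; k < seg.deleted.size(); ++k) {
            if (seg.deleted[k]) {
                continue;  // deleted documents get no slot in the new segment
            }
            // A segment without norms for this field contributes 0 per document.
            merged.push_back(seg.norms.empty() ? 0 : seg.norms[k]);
        }
    }
    return merged;
}
```

Because deleted documents are dropped, the merged array has exactly one byte per surviving document, in the same order in which the stored fields were written.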

Merging term vectors: mergeVectors()

Merging term vectors is very similar to merging norms, so it is not described again here.

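Merging the term dictionary and postings lists

Everything above merges forward information, and the process is fairly clear. Merging the term dictionary and the postings lists is not so simple. In the dictionary, CLucene requires terms to be sorted in lexicographic order. In a postings list, the document number is an internal ID that must be sorted in ascending order, and within every seg the document numbers start from zero.

So merging the inverted information has two parts:

• Merging the dictionaries, which requires re-sorting the Terms across segs.
• For identical Terms, merging the lists of document numbers that contain the Term, which requires renumbering the documents.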

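The latter is relatively simple. Suppose the first seg holds N documents numbered 0 to N-1 and the second seg holds M documents numbered 0 to M-1. When the two are merged into one seg, the documents of the first seg keep the numbers 0 to N-1 and those of the second seg become N to N+M-1. In other words, an offset equal to the document count of the preceding segs is added.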
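The offset rule can be written down in a few lines. This is a minimal sketch with hypothetical helper names (computeDocBases, remapDocId), and it assumes that no segment contains deleted documents. With deletions, the surviving documents of each segment would first have to be compacted, which this sketch does not model.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns, for each segment, the offset (base) of its first document
// in the merged numbering: the sum of the document counts before it.
std::vector<std::int32_t> computeDocBases(const std::vector<std::int32_t>& docCounts) {
    std::vector<std::int32_t> bases;
    std::int32_t base = 0;
    for (std::int32_t count : docCounts) {
        bases.push_back(base);
        base += count;
    }
    return bases;
}

// Maps a segment-local document number to its number in the merged segment.
std::int32_t remapDocId(const std::vector<std::int32_t>& bases,
                        std::size_t segment, std::int32_t localDoc) {
    return bases[segment] + localDoc;
}
```

For three segments holding 3, 2 and 4 documents, the bases are 0, 3 and 5, so local document 0 of the second segment becomes document 3 in the merged segment.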

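Merging the dictionaries requires finding the Terms that the segs have in common. CLucene does this with an array of SegmentMergeInfo objects and a SegmentMergeQueue called queue. SegmentMergeQueue derives from PriorityQueue; it is a priority queue ordered lexicographically. A SegmentMergeInfo holds the dictionary and postings information of one seg being merged, and the key used to order it inside the SegmentMergeQueue is the current Term of the seg it represents.

An example makes the dictionary merge easier to follow before reading the code. Suppose five segs are to be merged, and the Terms inside each seg are already sorted lexicographically (Figure 1). First, all five segs are put into the priority queue, where they are ordered by the lexicographic order of their first Term.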

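1. Pop the seg with the first Term ("a") from the priority queue and put it into the match array.
2. Pop every other seg whose current Term is also "a" from the queue and put it into the match array as well. (Figure 2)
3. Merge the postings lists of Term "a" from these segs, and add the Term together with its merged postings list to the newly created seg.
4. Advance each seg in the match array to its next Term and put those that still have Terms back into the priority queue. (Figure 3)
5. Go back to step 1 and repeat until the queue is empty.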
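The procedure above is a k-way merge over sorted term lists. The following sketch reproduces it with std::priority_queue. SegmentCursor is a hypothetical stand-in for SegmentMergeInfo and the comparator for SegmentMergeQueue; breaking ties by segment index is an assumption of this sketch. Instead of merging real postings lists, it only records which segments contain each term.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for SegmentMergeInfo: a cursor over one segment's sorted terms.
struct SegmentCursor {
    const std::vector<std::string>* terms;  // sorted lexicographically
    std::size_t pos;                        // index of the current term
    std::size_t segment;                    // index of the owning segment
    const std::string& current() const { return (*terms)[pos]; }
};

// Makes std::priority_queue a min-heap on the current term;
// ties are broken by segment index (assumption of this sketch).
struct CursorGreater {
    bool operator()(const SegmentCursor& a, const SegmentCursor& b) const {
        if (a.current() != b.current()) return a.current() > b.current();
        return a.segment > b.segment;
    }
};

using MergedTerm = std::pair<std::string, std::vector<std::size_t>>;

// Returns every distinct term in sorted order, paired with the segments containing it.
std::vector<MergedTerm> mergeTermDictionaries(
        const std::vector<std::vector<std::string>>& segments) {
    std::priority_queue<SegmentCursor, std::vector<SegmentCursor>, CursorGreater> queue;
    for (std::size_t i = 0; i < segments.size(); ++i) {
        if (!segments[i].empty()) queue.push(SegmentCursor{&segments[i], 0, i});
    }

    std::vector<MergedTerm> merged;
    while (!queue.empty()) {
        // Pop the smallest term, then every other cursor positioned on the same term.
        std::vector<SegmentCursor> match;
        match.push_back(queue.top());
        queue.pop();
        const std::string term = match.front().current();
        while (!queue.empty() && queue.top().current() == term) {
            match.push_back(queue.top());
            queue.pop();
        }

        // A real merger would combine the postings lists here; we record the owners.
        std::vector<std::size_t> owners;
        for (const SegmentCursor& c : match) owners.push_back(c.segment);
        merged.emplace_back(term, owners);

        // Advance each matched cursor and re-queue it if it still has terms.
        for (SegmentCursor& c : match) {
            if (++c.pos < c.terms->size()) queue.push(c);
        }
    }
    return merged;
}
```

Because ties are broken by segment index, the owners of each term come out in ascending segment order, which is the order in which their renumbered document lists would have to be appended.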

When reposting, please cite the original address: https://ju.6miu.com/read-1310785.html
