[Building a Vertical Search Engine 05] Heritrix: queue-assignment-policy

    xiaoxiao, 2025-04-12

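    Heritrix uses Berkeley DB to build its link queues. When a link is placed into BdbMultipleWorkQueues, it is first given a key, and all links with the same key are grouped together into one queue. The strategy Heritrix uses to assign that key is its queue-assignment-policy.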

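    By default, Heritrix uses HostnameQueueAssignmentPolicy, a queue assignment policy that extends the abstract class QueueAssignmentPolicy. As the name suggests, it uses the host name of a link as the key. In other words, all URLs with the same host name are placed in the same queue.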

    HostnameQueueAssignmentPolicy has one problem: when crawling a single website, one queue grows very long while the other queues sit almost idle, so multi-threaded crawling gains little in efficiency.
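
    To see the imbalance concretely, the sketch below groups a handful of URLs by host, the way a host-based policy keys its queues. It is an illustration only, not Heritrix code: HostKeyDemo, hostKey and groupByHost are hypothetical names, java.net.URI stands in for Heritrix's UURI, and the example.com URLs are made up.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: groups URLs into queues keyed by host name. */
public class HostKeyDemo {

    /** Hypothetical stand-in for a host-based class key: the URI's host. */
    public static String hostKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "default..." : host;
    }

    /** Puts every URL into the queue named by its host key. */
    public static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            queues.computeIfAbsent(hostKey(url), k -> new ArrayList<>()).add(url);
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.example.com/",
                "http://www.example.com/a.html",
                "http://www.example.com/b.html",
                "http://www.example.com/c/d.html");
        Map<String, List<String>> queues = groupByHost(urls);
        // Every URL of a single-site crawl shares one key, hence one queue.
        System.out.println(queues.size() + " queue(s): " + queues.keySet());
    }
}
```

    With only one non-empty queue, extra crawler threads have nothing to pull from, which is the situation described above.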

    To solve this, we can write our own QueueAssignmentPolicy. Below, we use the ELFHash hashing algorithm to build a queue assignment policy named ELFHashQueueAssignmentPolicy, which also extends QueueAssignmentPolicy.

    /* ELFHashQueueAssignmentPolicy */
    package org.archive.crawler.frontier;

    import java.util.logging.Level;
    import java.util.logging.Logger;

    import org.apache.commons.httpclient.URIException;
    import org.archive.crawler.datamodel.CandidateURI;
    import org.archive.crawler.framework.CrawlController;
    import org.archive.net.UURI;
    import org.archive.net.UURIFactory;

    /**
     * QueueAssignmentPolicy based on the ELF hash of the URI string of the
     * given CrawlURI (DNS URIs are still keyed by host).
     *
     * @author gojomo
     */
    public class ELFHashQueueAssignmentPolicy extends QueueAssignmentPolicy {
        private static final Logger logger = Logger
            .getLogger(ELFHashQueueAssignmentPolicy.class.getName());

        /**
         * When neat host-based class-key fails us
         */
        private static String DEFAULT_CLASS_KEY = "default...";

        private static final String DNS = "dns";

        public String getClassKey(CrawlController controller, CandidateURI cauri) {
            String scheme = cauri.getUURI().getScheme();
            String candidate = null;
            try {
                if (scheme.equals(DNS)) {
                    if (cauri.getVia() != null) {
                        UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
                        candidate = viaUuri.getAuthorityMinusUserinfo();
                        // adopt scheme of triggering URI
                        scheme = viaUuri.getScheme();
                    } else {
                        candidate = cauri.getUURI().getReferencedHost();
                    }
                } else {
                    // Hash the full URI and spread URIs over 100 queue keys.
                    String uri = cauri.getUURI().toString();
                    long hash = ELFHash(uri);
                    //candidate = cauri.getUURI().getAuthorityMinusUserinfo();
                    candidate = Long.toString(hash % 100);
                }

                if (candidate == null || candidate.length() == 0) {
                    candidate = DEFAULT_CLASS_KEY;
                }
            } catch (URIException e) {
                logger.log(Level.INFO,
                    "unable to extract class key; using default", e);
                candidate = DEFAULT_CLASS_KEY;
            }
            if (scheme != null && scheme.equals(UURIFactory.HTTPS)) {
                // If https and no port specified, add default https port to
                // distinguish https from http server without a port.
                if (!candidate.matches(".+:[0-9]+")) {
                    candidate += UURIFactory.HTTPS_PORT;
                }
            }
            // Ensure classKeys are safe as filenames on NTFS
            return candidate.replace(':', '#');
        }

        /** ELF hash of the given string, masked to a non-negative value. */
        public static long ELFHash(String str) {
            long hash = 0;
            long x = 0;
            for (int i = 0; i < str.length(); i++) {
                hash = (hash << 4) + str.charAt(i);
                if ((x = hash & 0xF0000000L) != 0) {
                    hash ^= (x >> 24);
                    hash &= ~x;
                }
            }
            return (hash & 0x7FFFFFFF);
        }
    }
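
    To get a feel for how the hash spreads the URLs of a single site, the following standalone sketch copies the ELFHash method from the class above and counts how many queue keys a batch of URLs ends up using. ElfHashBuckets, bucketKey and countPerBucket are hypothetical names used only here, and the example.com URLs are invented.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: shows how ELF-hash keys spread URLs over buckets. */
public class ElfHashBuckets {

    /** Same ELF hash as in the policy class, copied so this compiles alone. */
    public static long elfHash(String str) {
        long hash = 0;
        long x;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    /** Queue key for a URL: the hash reduced to one of 100 buckets. */
    public static String bucketKey(String url) {
        return Long.toString(elfHash(url) % 100);
    }

    /** Counts how many made-up URLs of one host fall into each bucket. */
    public static Map<String, Integer> countPerBucket(int pages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < pages; i++) {
            String url = "http://www.example.com/page" + i + ".html";
            counts.merge(bucketKey(url), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countPerBucket(1000);
        System.out.println("queues used: " + counts.size() + " of 100");
    }
}
```

    Because the key is derived from the full URL rather than from the host, pages of the same site land in different queues, which is what lets several threads work on one site at the same time.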

    Next comes the configuration:

    Step 1: In the class org.archive.crawler.frontier.AbstractFrontier, find the constructor public AbstractFrontier(String name, String description), and inside it locate:

    String queueStr = System.getProperty(AbstractFrontier.class.getName() +
            "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
            HostnameQueueAssignmentPolicy.class.getName() + " " +
            IPQueueAssignmentPolicy.class.getName() + " " +
            BucketQueueAssignmentPolicy.class.getName() + " " +
            SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
            TopmostAssignedSurtQueueAssignmentPolicy.class.getName());

    Put our own ELFHashQueueAssignmentPolicy class at the front of the list (commenting out HostnameQueueAssignmentPolicy), so that it becomes:

    String queueStr = System.getProperty(AbstractFrontier.class.getName() +
            "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
            ELFHashQueueAssignmentPolicy.class.getName() + " " +
            //HostnameQueueAssignmentPolicy.class.getName() + " " +
            IPQueueAssignmentPolicy.class.getName() + " " +
            BucketQueueAssignmentPolicy.class.getName() + " " +
            SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
            TopmostAssignedSurtQueueAssignmentPolicy.class.getName());
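
    Note that the list passed to System.getProperty(key, default) is only a fallback: it is used when no system property with that key is set. The sketch below illustrates this with plain JDK calls. It is not Heritrix code: PolicyListDemo and its methods are hypothetical, and it assumes (this is not confirmed here) that the first name in the whitespace-separated list is the one taken as the default policy.

```java
/** Illustration only: fallback behaviour of System.getProperty(key, def). */
public class PolicyListDemo {

    // Property key as built by AbstractFrontier.class.getName() + "." + the
    // attribute name; spelled out here so the sketch runs without Heritrix.
    public static final String KEY =
            "org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy";

    /** Assumption: the first class name in the list acts as the default. */
    public static String firstPolicy(String queueStr) {
        return queueStr.trim().split("\\s+")[0];
    }

    /** Uses the system property if set, otherwise the hard-coded list. */
    public static String resolveDefault(String hardCodedList) {
        return firstPolicy(System.getProperty(KEY, hardCodedList));
    }

    public static void main(String[] args) {
        String hardCoded =
                "org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy "
                + "org.archive.crawler.frontier.IPQueueAssignmentPolicy";

        // No property set: the hard-coded list wins.
        System.clearProperty(KEY);
        System.out.println(resolveDefault(hardCoded));

        // Property set (for example from a properties file): it overrides
        // the hard-coded list entirely.
        System.setProperty(KEY,
                "org.archive.crawler.frontier.HostnameQueueAssignmentPolicy "
                + "org.archive.crawler.frontier.IPQueueAssignmentPolicy");
        System.out.println(resolveDefault(hardCoded));
    }
}
```

    If the same key is also defined in heritrix.properties and that file is exposed as system properties at startup, the value there would take precedence over the source change, which is presumably why the properties file is edited as well in Step 3.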

    Step 2: In org.archive.crawler.frontier.AdaptiveRevisitFrontier, find:

    protected final static String DEFAULT_QUEUE_ASSIGNMENT_POLICY = HostnameQueueAssignmentPolicy.class.getName();

    Change it to:

    protected final static String DEFAULT_QUEUE_ASSIGNMENT_POLICY = ELFHashQueueAssignmentPolicy.class.getName();

    Then, further down, find the constructor public AdaptiveRevisitFrontier(String name, String description), and change this section:

    String queueStr = System.getProperty(AbstractFrontier.class.getName() +
            "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
            HostnameQueueAssignmentPolicy.class.getName() + " " +
            IPQueueAssignmentPolicy.class.getName() + " " +
            BucketQueueAssignmentPolicy.class.getName() + " " +
            SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
            TopmostAssignedSurtQueueAssignmentPolicy.class.getName());

    to:

    String queueStr = System.getProperty(AbstractFrontier.class.getName() +
            "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
            ELFHashQueueAssignmentPolicy.class.getName() + " " +
            //HostnameQueueAssignmentPolicy.class.getName() + " " +
            IPQueueAssignmentPolicy.class.getName() + " " +
            BucketQueueAssignmentPolicy.class.getName() + " " +
            SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
            TopmostAssignedSurtQueueAssignmentPolicy.class.getName());

    Step 3: In the heritrix.properties file, find:

    org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
        org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
        org.archive.crawler.frontier.IPQueueAssignmentPolicy \
        org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
        org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy \
        org.archive.crawler.frontier.TopmostAssignedSurtQueueAssignmentPolicy

    Put our ELFHashQueueAssignmentPolicy class, by its fully qualified name, in place of the first entry, so that it becomes:

    org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
        org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy \
        org.archive.crawler.frontier.IPQueueAssignmentPolicy \
        org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
        org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy \
        org.archive.crawler.frontier.TopmostAssignedSurtQueueAssignmentPolicy
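
    The trailing backslashes are the line-continuation syntax of Java properties files, so the whole entry is read as one space-separated value. The sketch below checks this with java.util.Properties. PropertiesContinuationDemo and parsePolicies are hypothetical names, the list is shortened to two entries, and the class name is written as org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy on the assumption that the class keeps the package declared in its source above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustration only: how a continued properties entry is parsed. */
public class PropertiesContinuationDemo {

    public static final String KEY =
            "org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy";

    /** Loads the text as a properties file and splits the policy list. */
    public static String[] parsePolicies(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected for an in-memory reader.
            throw new UncheckedIOException(e);
        }
        return props.getProperty(KEY).trim().split("\\s+");
    }

    public static void main(String[] args) {
        // "\\\n" is a literal backslash followed by a newline.
        String text = KEY + " = \\\n"
                + "    org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy \\\n"
                + "    org.archive.crawler.frontier.IPQueueAssignmentPolicy\n";
        String[] policies = parsePolicies(text);
        System.out.println(policies.length + " entries, first = " + policies[0]);
    }
}
```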

    With these changes, Heritrix uses ELFHashQueueAssignmentPolicy by default to assign its link queues when crawling. In our tests, crawl efficiency did improve considerably.

    When reposting, please cite the original URL: https://ju.6miu.com/read-1297978.html