[Building a Vertical Search Engine 03] Heritrix: Extending FrontierScheduler to Crawl Only Specific Content

xiaoxiao2025-04-06 29

FrontierScheduler is a PostProcessor. Its job is to take the links that the Extractor processors have found in a page and add them to the Frontier, where they wait to be crawled.

FrontierScheduler:

/* FrontierScheduler */
package org.archive.crawler.postprocessor;

import java.util.logging.Level;
import java.util.logging.Logger;

import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.datamodel.FetchStatusCodes;
import org.archive.crawler.framework.Processor;

public class FrontierScheduler extends Processor implements FetchStatusCodes {

    private static final long serialVersionUID = -5178775477602250542L;

    private static Logger LOGGER =
        Logger.getLogger(FrontierScheduler.class.getName());

    /**
     * @param name Name of this filter.
     */
    public FrontierScheduler(String name) {
        // Pass the processor name and description to the parent constructor.
        super(name, "FrontierScheduler. 'Schedule' with the Frontier " +
            "any CandidateURIs carried by the passed CrawlURI. " +
            "Run a Scoper before this " +
            "processor so links that are not in-scope get bumped from the " +
            "list of links (And so those in scope get promoted from Link " +
            "to CandidateURI).");
    }

    protected void innerProcess(final CrawlURI curi) {
        if (LOGGER.isLoggable(Level.FINEST)) {
            LOGGER.finest(getName() + " processing " + curi);
        }

        // If this URI was deferred because a prerequisite must be fetched
        // first, schedule the prerequisite and stop here.
        // Handle any prerequisites when S_DEFERRED for prereqs
        if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
            handlePrerequisites(curi);
            return;
        }

        // Synchronize on this processor.
        synchronized (this) {
            // Loop over every candidate link produced while processing curi.
            for (CandidateURI cauri : curi.getOutCandidates()) {
                schedule(cauri);
            }
        }
    }

    protected void handlePrerequisites(CrawlURI curi) {
        schedule((CandidateURI) curi.getPrerequisiteUri());
    }

    /**
     * Schedule the given {@link CandidateURI CandidateURI} with the Frontier.
     * @param caUri The CandidateURI to be scheduled.
     */
    // Calls the Frontier's schedule() method, which puts the link on the
    // waiting queue. In this base class every link is queued.
    protected void schedule(CandidateURI caUri) {
        getController().getFrontier().schedule(caUri);
    }
}
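To make the control flow easier to follow without a Heritrix checkout, here is a minimal stand-alone sketch of the same logic. Every type in it (`SketchFrontier`, `SketchCrawlUri`, `SketchScheduler`) is a hypothetical stand-in invented for illustration, with plain strings in place of `CandidateURI`; it is not the Heritrix API, only a model of what `innerProcess()` does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Frontier: just a list of queued URIs.
class SketchFrontier {
    final List<String> queue = new ArrayList<>();

    void schedule(String uri) {
        queue.add(uri);
    }
}

// Hypothetical stand-in for CrawlURI: carries an optional prerequisite,
// a "deferred" flag, and the links extracted from the page.
class SketchCrawlUri {
    final String prerequisite;        // null when there is no prerequisite
    final boolean deferred;           // models getFetchStatus() == S_DEFERRED
    final List<String> outCandidates;

    SketchCrawlUri(String prerequisite, boolean deferred,
                   List<String> outCandidates) {
        this.prerequisite = prerequisite;
        this.deferred = deferred;
        this.outCandidates = outCandidates;
    }
}

// Models the scheduler: prerequisites first, otherwise all out-links.
class SketchScheduler {
    protected final SketchFrontier frontier;

    SketchScheduler(SketchFrontier frontier) {
        this.frontier = frontier;
    }

    void process(SketchCrawlUri curi) {
        // A deferred URI with a prerequisite schedules only the prerequisite.
        if (curi.prerequisite != null && curi.deferred) {
            schedule(curi.prerequisite);
            return;
        }
        synchronized (this) {
            for (String candidate : curi.outCandidates) {
                schedule(candidate);
            }
        }
    }

    // The single hook a subclass would override to filter links.
    protected void schedule(String uri) {
        frontier.schedule(uri);
    }
}
```

The point to notice is that every link, prerequisite or not, passes through the one `schedule()` method, which is why overriding that method alone is enough to control what reaches the queue.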

MyFrontierScheduler:

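package org.archive.crawler.postprocessor;

import org.archive.crawler.datamodel.CandidateURI;

// Extends FrontierScheduler.
public class MyFrontierScheduler extends FrontierScheduler {

    private static final long serialVersionUID = 1L;

    public MyFrontierScheduler(String name) {
        super(name); // call the parent constructor
    }

    /*
     * Four things to know about super():
     * 1. Constructing a subclass always involves calling a parent constructor.
     * 2. A subclass can call a parent constructor explicitly with super(...)
     *    inside its own constructor.
     * 3. If a subclass constructor does not call a parent constructor
     *    explicitly, the parent's no-argument constructor is called by default.
     * 4. If a subclass constructor does not call a parent constructor
     *    explicitly and the parent has no no-argument constructor,
     *    compilation fails.
     */

    // Override the parent's schedule() method so that only links meeting
    // a condition are added to the waiting queue.
    @Override
    protected void schedule(CandidateURI caUri) {
        // If the URL contains this string, print it and add it to the queue.
        if (caUri.toString().contains("2016.sina.com.cn")) {
            System.out.println(caUri); // print each link that gets queued
            getController().getFrontier().schedule(caUri);
        }
    }
}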
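One thing worth knowing about the `contains()` test: it looks at the whole URL string, so a link such as `http://example.com/?ref=2016.sina.com.cn` would also pass. If that is too loose for your crawl, a stricter check can compare the host instead. The sketch below is one possible way to do that with `java.net.URI`; the class name `HostFilterSketch` and the method `hostMatches` are made up for this example and are not part of Heritrix. It also assumes that prerequisite URIs of the form `dns:host` should still pass, so it reads the host from the scheme-specific part for those.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of Heritrix.
class HostFilterSketch {

    /**
     * Returns true when the URI's host is targetHost or one of its
     * subdomains. Unparseable URIs are rejected in this sketch.
     */
    static boolean hostMatches(String uri, String targetHost) {
        final URI parsed;
        try {
            parsed = new URI(uri);
        } catch (URISyntaxException e) {
            return false;
        }

        String host = parsed.getHost();
        // Assumption: "dns:host" URIs are opaque, so getHost() is null and
        // the host name sits in the scheme-specific part instead.
        if (host == null && "dns".equalsIgnoreCase(parsed.getScheme())) {
            host = parsed.getSchemeSpecificPart();
        }
        if (host == null) {
            return false;
        }

        String h = host.toLowerCase(Locale.ROOT);
        String t = targetHost.toLowerCase(Locale.ROOT);
        return h.equals(t) || h.endsWith("." + t);
    }
}
```

Inside an overridden `schedule()`, the condition would then become something like `HostFilterSketch.hostMatches(caUri.toString(), "2016.sina.com.cn")`. Whether the stricter rule is what you want depends on the crawl; the substring test is simpler and may be good enough for a single site.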

When reposting, please credit the original URL: https://ju.6miu.com/read-1297794.html
