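Error when running heritrix1.14.4: thread-10 org.archive.util...
Source: Baidu Wenku  Editor: 神马文学网  Time: 2024/10/06 17:41:23
Disabling robots.txt in Heritrix
Category: Java programming
Robots.txt is a file used to restrict web crawlers. When building a website, you can place a Robots.txt file in the site and declare in it the parts you do not want search engines to visit. However, it also causes the Heritrix crawler to spend too much time during a crawl checking whether the Robots.txt file exists. Fortunately, the protocol is only a supplementary convention, and a crawler is free not to follow it.
The method that fetches Robots.txt is defined in Heritrix's org.archive.crawler.prefetch.PreconditionEnforcer class. My choice is to report Robots.txt as absent whether or not it actually exists. The modified method is as follows: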
Java code:
    private boolean considerRobotsPreconditions(CrawlURI curi) {
        // To improve crawl efficiency, the Robots.txt protocol is ignored here
        return false;
    }
Error when running heritrix1.14.4: thread-10 org.archive.util.ArchiveUtils.
10:03:00.765 EVENT Started WebApplicationContext[/,Heritrix Console]
10:03:00.859 EVENT The scratchDir you specified: F:\project3.5\heritrix\target\jsp-compiled-development is unusable.
10:03:01.000 EVENT Started SocketListener on 127.0.0.1:8088
10:03:01.000 EVENT Started org.mortbay.jetty.Server@1f6ba0f
2010-07-10 10:03:01.250 严重 thread-10 org.archive.util.ArchiveUtils.
java.lang.NullPointerException
at java.io.Reader.
at java.io.InputStreamReader.
at org.archive.util.ArchiveUtils.
at org.archive.crawler.settings.CrawlSettingsSAXHandler$DateHandler.endElement(CrawlSettingsSAXHandler.java:385)
at org.archive.crawler.settings.CrawlSettingsSAXHandler.endElement(CrawlSettingsSAXHandler.java:248)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at org.archive.crawler.settings.XMLSettingsHandler.readSettingsObject(XMLSettingsHandler.java:298)
at org.archive.crawler.settings.XMLSettingsHandler.readSettingsObject(XMLSettingsHandler.java:339)
at org.archive.crawler.settings.SettingsHandler.initialize(SettingsHandler.java:130)
at org.archive.crawler.settings.XMLSettingsHandler.initialize(XMLSettingsHandler.java:124)
at org.archive.crawler.admin.CrawlJobHandler.loadProfile(CrawlJobHandler.java:385)
at org.archive.crawler.admin.CrawlJobHandler.loadProfiles(CrawlJobHandler.java:348)
at org.archive.crawler.admin.CrawlJobHandler.
at org.archive.crawler.admin.CrawlJobHandler.
at org.archive.crawler.Heritrix.
at org.archive.crawler.Heritrix.
at org.archive.crawler.Heritrix.doCmdLineArgs(Heritrix.java:718)
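at org.archive.crawler.Heritrix.main(Heritrix.java:556)
Despite the error, the login page is reachable and the UI starts normally. I had never used this tool before. The day before, I had run it successfully from the cmd command line, and today, after setting up a project in Eclipse, I hit a new problem. Every step brings a new obstacle. Yesterday's run did not log this error, but today, under Eclipse, a configuration file in the wrong location had also produced a NullPointerException. So my analysis was that some file was still missing.
After several hours of debugging, I found that the missing file is tlds-alpha-by-domain.txt. The release package does have this file in the corresponding location, org\archive\util. Once the file is added under that path, the error no longer appears. The file can be found in the source package under src\resources.
The cause is that the build did not copy the txt file into the bin directory. Configure your development tool so that txt files are included in the build output.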
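The diagnosis above can be checked quickly before starting Heritrix. The trace ends in java.io.Reader, which is what happens when an InputStreamReader is built on a null stream, and Class.getResourceAsStream returns null when a resource is not on the classpath. The following is only a minimal sketch: the class name ResourceCheck and the method resourcePresent are hypothetical helpers, not part of Heritrix, and it assumes the file is loaded as a classpath resource relative to a class.

```java
import java.io.IOException;
import java.io.InputStream;

public class ResourceCheck {

    /**
     * Returns true if the named resource can be opened relative to the
     * given anchor class, false if it is missing from the classpath.
     * (Hypothetical helper for illustration only.)
     */
    static boolean resourcePresent(Class<?> anchor, String name) {
        // getResourceAsStream returns null, rather than throwing,
        // when the resource cannot be found.
        try (InputStream in = anchor.getResourceAsStream(name)) {
            return in != null;
        } catch (IOException e) {
            // Only close() can throw here; treat that as "not usable".
            return false;
        }
    }

    public static void main(String[] args) {
        // In a Heritrix project, the anchor would be
        // org.archive.util.ArchiveUtils and the name would be
        // "tlds-alpha-by-domain.txt" (assumption based on the post).
        String name = args.length > 0 ? args[0] : "tlds-alpha-by-domain.txt";
        boolean found = resourcePresent(ResourceCheck.class, name);
        System.out.println(name + (found ? " is on the classpath" : " is MISSING from the classpath"));
    }
}
```

If the check reports the file as missing, the build output directory (bin in the Eclipse project described here) most likely lacks the resource next to the compiled classes.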