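Before installing Nutch, download and install Cygwin, which provides a Unix-like environment on Windows. Nutch can be installed and configured under Windows itself, but Nutch commands (such as crawl) have to be run inside Cygwin.
--> Install the JDK; I installed JDK 1.5
--> Install Tomcat; I installed Tomcat 5.5
Download Nutch 0.8 from http://www.nutch.org (as of April 2, version 0.9 is also available)
Unpacking and installation
Testing
In Cygwin, change to the nutch-0.8 directory and run bin/nutch.
If you see the following usage message, the installation succeeded: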
Usage: nutch COMMAND
where COMMAND is one of : ......
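Setting the site to crawl
In the nutch-0.8 directory, create a urls directory (you can give it another name). Inside urls, create a file; I named mine zju, with no extension.
Open the zju file you just created and enter the address of the site to crawl, for example: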
http://www.zju.edu.cn/
The trailing / must not be omitted.
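As a minimal sketch, the seed file can also be created straight from the Cygwin shell instead of a text editor. This assumes the current directory is nutch-0.8 and reuses the urls and zju names from above.

```shell
# Create the seed directory (no error if it already exists).
mkdir -p urls

# Write one URL per line; keep the trailing slash.
printf '%s\n' 'http://www.zju.edu.cn/' > urls/zju

# Show what was written.
cat urls/zju
```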
Edit the crawl-urlfilter.txt file in the conf directory, which sets the crawler's URL filter rules. The default entry:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
Change it to:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*zju.edu.cn/
This accepts every URL matching http://([a-z0-9]*\.)*zju.edu.cn/, that is, all pages under zju.edu.cn and its subdomains.
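To get a feel for which URLs the rule lets through, the pattern can be tried out with grep. This is only an approximation: Nutch evaluates the rule with Java regular expressions, while the sketch below uses grep -E, and the leading + (Nutch's "accept" marker) is dropped because it is not part of the regex. The check_url helper and the example.com URL are made up for illustration.

```shell
# The filter rule without Nutch's leading '+'.
pattern='^http://([a-z0-9]*\.)*zju.edu.cn/'

# Report whether a URL matches the pattern.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept $1"
  else
    echo "no match $1"
  fi
}

check_url 'http://www.zju.edu.cn/'
check_url 'http://zju.edu.cn/index.html'
check_url 'http://www.example.com/'
```

The first two URLs match (with and without a subdomain); the third does not.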
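Edit the nutch-site.xml file in the conf directory. This file tells the sites being crawled who the crawler is; Nutch will not run unless it is set.
By default the file looks like this:
Below is an example after modification: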
http.agent.name
  ... please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
http.agent.description
  ... the User-Agent header. It appears in parenthesis after the agent name.
http.agent.url
  ... appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
http.agent.email
  ... header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
The file above gives the crawler's name, its description, the website it comes from, a contact email, and so on.
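For reference, here is a rough sketch of what such a nutch-site.xml might contain. The property names are the ones listed above; every value is a placeholder I made up, and the <configuration> root element is what I would expect for Nutch 0.8, so compare it with the file shipped in your conf directory. The sketch writes to a scratch file named nutch-site.example.xml (a hypothetical name) rather than overwriting conf/nutch-site.xml.

```shell
# Write an example file; copy the <property> blocks into
# conf/nutch-site.xml by hand. All <value> contents are placeholders.
cat > nutch-site.example.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mycrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>test crawler for a local search demo</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>info at example dot com</value>
  </property>
</configuration>
EOF

# List the property names that were written.
grep '<name>' nutch-site.example.xml
```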
Crawling
The command to crawl and build the index:
bin/nutch crawl urls -dir sunleap -depth 4 -threads 5 -topN 1000 >&logs/log1.log
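where:
urls is the directory whose files list the addresses of the sites to crawl
-dir specifies the directory where the crawled data is stored
-depth specifies the crawl depth
-threads specifies the number of fetcher threads
-topN limits the crawl to the top N pages (in Nutch, the N top-scoring pages at each depth level), which is very useful when crawling large sites
>&logs/log1.log specifies where the log is written; omit this redirection if you want to watch progress on the console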
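If you want both a log file and live output on the console, one option is to pipe through tee instead of redirecting. This is a sketch under the assumption that it is run from the nutch-0.8 directory; the crawl arguments are the same as in the command above, and the logs directory is created first because the redirection fails if it does not exist.

```shell
# Make sure the log directory exists before writing into it.
mkdir -p logs

if [ -x bin/nutch ]; then
  # 2>&1 merges stderr into stdout; tee prints it and saves a copy.
  bin/nutch crawl urls -dir sunleap -depth 4 -threads 5 -topN 1000 2>&1 | tee logs/log1.log
else
  echo "bin/nutch not found; run this from the nutch-0.8 directory" >&2
fi
```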
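Wait a few minutes for the crawl and indexing to finish.
Configuring Tomcat
Under Tomcat's webapps/ directory, create a nutch directory and unpack the contents of nutch-0.8.war into it.
Edit webapps/nutch/WEB-INF/classes/nutch-site.xml:
Replace
with
Set the value to the actual path where your crawled data (zju) is stored. Note that there is no trailing /; I added one at first and it did not seem to work.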
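As a hedged sketch of this step: the property that tells the Nutch web application where the crawl data lives is searcher.dir, so the edited part of the webapp's nutch-site.xml would look roughly like the fragment below. /path/to/crawl-dir is a placeholder for your own crawl directory (no trailing /), and searcher-dir.example.xml is a hypothetical scratch file name.

```shell
# Write an example fragment; merge the <property> into the
# webapp's nutch-site.xml by hand.
# /path/to/crawl-dir is a placeholder, written without a trailing slash.
cat > searcher-dir.example.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/path/to/crawl-dir</value>
  </property>
</configuration>
EOF

cat searcher-dir.example.xml
```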
To support Chinese, Tomcat's configuration has to be changed. Open the server.xml file under tomcat\conf and change the Connector element to the following form:
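A plausible form of the element, as a sketch only: I am assuming the two added attributes are URIEncoding and useBodyEncodingForURI, which Tomcat 5.5 supports for decoding request parameters as UTF-8. The other attributes are typical Tomcat 5.5 defaults and are also an assumption; keep whatever your own server.xml already has. The fragment is written to a hypothetical scratch file, connector.example.xml, for comparison.

```shell
# Write an example <Connector>; only the last attribute line is the
# change for Chinese queries, the rest are assumed defaults.
cat > connector.example.xml <<'EOF'
<Connector port="8080" maxHttpHeaderSize="8192"
           maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
           enableLookups="false" redirectPort="8443" acceptCount="100"
           connectionTimeout="20000" disableUploadTimeout="true"
           URIEncoding="UTF-8" useBodyEncodingForURI="true" />
EOF

cat connector.example.xml
```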
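Note that the two attributes on the last line are newly added.
Searching through Tomcat
Restart Tomcat and open http://127.0.0.1:8080/nutch in a browser.
The Nutch search page appears. Type a keyword into the search box and search, and you will see your results.
Here is my Tomcat log from a search: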
2007-04-26 15:15:51,671 INFO Configuration - parsing jar:file:/D:/Tomcat%205.5/webapps/nutch/WEB-INF/lib/hadoop-0.4.0-patched.jar!/hadoop-default.xml
2007-04-26 15:15:52,000 INFO Configuration - parsing file:/D:/Tomcat%205.5/webapps/nutch/WEB-INF/classes/nutch-default.xml
2007-04-26 15:15:52,078 INFO Configuration - parsing file:/D:/Tomcat%205.5/webapps/nutch/WEB-INF/classes/nutch-site.xml
2007-04-26 15:15:52,312 INFO Configuration - parsing file:/D:/Tomcat%205.5/webapps/nutch/WEB-INF/classes/hadoop-site.xml
2007-04-26 15:15:52,359 INFO PluginRepository - Plugins: looking in: D:\Tomcat 5.5\webapps\nutch\WEB-INF\classes\plugins
2007-04-26 15:15:52,890 INFO PluginRepository - Plugin Auto-activation mode: [true]
2007-04-26 15:15:53,015 INFO PluginRepository - Registered Plugins:
2007-04-26 15:15:53,015 INFO PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-04-26 15:15:53,015 INFO PluginRepository - Site Query Filter (query-site)
2007-04-26 15:15:53,015 INFO PluginRepository - Html Parse Plug-in (parse-html)
2007-04-26 15:15:53,015 INFO PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-04-26 15:15:53,015 INFO PluginRepository - Basic Indexing Filter (index-basic)
2007-04-26 15:15:53,015 INFO PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-04-26 15:15:53,015 INFO PluginRepository - Text Parse Plug-in (parse-text)
2007-04-26 15:15:53,015 INFO PluginRepository - JavaScript Parser (parse-js)
2007-04-26 15:15:53,015 INFO PluginRepository - Regex URL Filter (urlfilter-regex)
2007-04-26 15:15:53,015 INFO PluginRepository - Basic Query Filter (query-basic)
2007-04-26 15:15:53,015 INFO PluginRepository - HTTP Framework (lib-http)
2007-04-26 15:15:53,015 INFO PluginRepository - URL Query Filter (query-url)
2007-04-26 15:15:53,015 INFO PluginRepository - Http Protocol Plug-in (protocol-http)
2007-04-26 15:15:53,015 INFO PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-04-26 15:15:53,015 INFO PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-04-26 15:15:53,015 INFO PluginRepository - Registered Extension-Points:
2007-04-26 15:15:53,015 INFO PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-04-26 15:15:53,015 INFO PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-04-26 15:15:53,015 INFO PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-04-26 15:15:53,015 INFO PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-04-26 15:15:53,015 INFO PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-04-26 15:15:53,015 INFO PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-04-26 15:15:53,015 INFO PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-04-26 15:15:53,015 INFO PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-04-26 15:15:53,015 INFO PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-04-26 15:15:53,015 INFO PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-04-26 15:15:53,015 INFO PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-04-26 15:15:53,093 INFO NutchBean - creating new bean
2007-04-26 15:15:53,265 INFO NutchBean - opening merged index in /zju/index
2007-04-26 15:15:54,281 INFO Configuration - found resource common-terms.utf8 at file:/D:/Tomcat%205.5/webapps/nutch/WEB-INF/classes/common-terms.utf8
2007-04-26 15:15:54,453 INFO NutchBean - opening segments in /zju/segments
2007-04-26 15:15:54,703 INFO SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2007-04-26 15:15:54,718 INFO NutchBean - opening linkdb in /zju/linkdb
2007-04-26 15:15:54,812 INFO NutchBean - query request from 127.0.0.1
2007-04-26 15:15:54,968 INFO NutchBean - query: 浙江
2007-04-26 15:15:54,968 INFO NutchBean - lang: zh
2007-04-26 15:15:55,171 INFO NutchBean - searching for 20 raw hits
2007-04-26 15:15:56,281 INFO NutchBean - total hits: 993
1 comment:
A question: what is the ontology package in Nutch for? Can ontology reasoning be done on top of Nutch?