First, you need to get a copy of the Nutch code. You can download a release from http://lucene.apache.org/nutch/release/. Unpack the release and change to its top-level directory. Alternatively, check out the latest source code from Subversion and build it with Ant.
Try the following command:
bin/nutch
This will display the documentation for the Nutch command script.
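This part involves the following steps:
1. Run Cygwin
After installing Cygwin, start it and run these commands:
cd d:/nutch
cd nutch-0.9
Cygwin then shows the current directory as:
/cygdrive/d/nutch/nutch-0.9
In this directory, run bin/nutch. If everything is set up correctly, it prints the prompt Usage: nutch COMMAND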
Intranet: Configuration
To configure things for intranet crawling you must:
- Create a directory with a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls/nutch containing the url of just the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain:
http://lucene.apache.org/nutch/
- Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*apache.org/
This will include any url in the domain apache.org.
- Edit the file conf/nutch-site.xml, insert at minimum the following properties into it, and fill in proper values for them:
<property>
<name>http.agent.name</name>
<value></value>
<description>Our HTTP 'User-Agent' request header.</description>
</property>
<property>
<name>http.robots.agents</name>
<value>*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence.</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header.</description>
</property>
For the first step, create a folder named urls under d:\nutch\nutch-0.9, and in that folder create a text file named nutch.txt containing: http://lucene.apache.org/nutch/
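As a minimal sketch, the seed file from the first step can also be created from the Cygwin shell. It assumes the current directory is the Nutch top-level directory (/cygdrive/d/nutch/nutch-0.9 in this walkthrough); the folder and file names are the ones used above.

```shell
#!/bin/sh
# Create the seed folder (no error if it already exists).
mkdir -p urls

# Write the single root URL into the seed file, one URL per line.
printf '%s\n' 'http://lucene.apache.org/nutch/' > urls/nutch.txt

# Show the result so it can be checked by eye.
cat urls/nutch.txt
```

Further root URLs could be appended to the same file with `>>`, one per line.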
For the second step, open conf/crawl-urlfilter.txt, find MY.DOMAIN.NAME, and change that line to:
+^http://([a-z0-9]*\.)*apache.org/
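Before starting a crawl, it can help to check which URLs such a filter line would let through. The sketch below is only an approximation: Nutch evaluates the pattern with its own Java-based regex filter, while this uses grep -E as a stand-in, which should behave the same for a pattern this simple. The helper name check_url and the test URLs are hypothetical.

```shell
#!/bin/sh
# Pattern from the crawl-urlfilter.txt line, without the leading "+".
# (In the filter file, "+" means "accept URLs matching this pattern".)
pattern='^http://([a-z0-9]*\.)*apache.org/'

# Hypothetical helper: print "accept" or "reject" for a single URL.
check_url() {
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        echo "accept"
    else
        echo "reject"
    fi
}

check_url 'http://lucene.apache.org/nutch/'   # subdomain of apache.org
check_url 'http://apache.org/'                # bare domain
check_url 'http://www.example.com/'           # different domain
```

The dot in apache.org is not escaped in the pattern, so it matches any single character at that position. Writing apache\.org would make the filter stricter.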
For the third step, this experiment uses nutch-default.xml. Modify the following properties:
http.agent.name
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
For example:
<property>
<name>http.agent.name</name>
<value>NutchCVS</value>
<description>Our HTTP 'User-Agent' request header.</description>
</property>
<property>
<name>http.robots.agents</name>
<value>*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence.</description>
</property>
<property>
<name>http.agent.description</name>
<value>Nutch</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://lucene.apache.org/nutch/</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>nutch-agent@lucene.apache.org</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header.</description>
</property>
Save the file after making the changes.
Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:
- -dir dir names the directory to put the crawl in.
- -threads threads determines the number of threads that will fetch in parallel.
- -depth depth indicates the link depth from the root page that should be crawled.
- -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
For example, a typical call might be:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
Once crawling has completed, one can skip to the Searching section below.
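Here you only need to run the following command:
bin/nutch crawl urls -dir crawled -depth 3 -topN 50 >& crawl.log
When it finishes, a crawled folder and a crawl.log log file will have been created.
The log file will show errors thrown for PDF files, because PDF files are not indexed by default. To index PDF files correctly as well, find the plugin.includes property in nutch-default.xml and add pdf to it, giving parse-(text|html|js|pdf).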
The crawled folder contains the segments, linkdb, indexes, index, and crawldb folders.
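A quick way to confirm the crawl output is to check for those folders from the shell. This is a sketch under assumptions: the helper name missing_crawl_dirs is hypothetical, and the folder names are the ones a Nutch 0.9 crawl is expected to produce, so adjust the list if your version differs.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every expected subfolder that is
# missing from the crawl output directory given as the first argument.
missing_crawl_dirs() {
    for name in crawldb linkdb segments indexes index; do
        [ -d "$1/$name" ] || echo "$name"
    done
}

# Check the output directory used in this walkthrough.
missing=$(missing_crawl_dirs crawled)
if [ -z "$missing" ]; then
    echo "crawled looks complete"
else
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing from crawled:" $missing
fi
```

If folders are reported missing, crawl.log is the first place to look for the cause.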
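At this point the index data is ready.
The following describes how to run it in Tomcat.
Copy nutch-0.9.war into Tomcat's webapps directory and rename it to nutch.war;
Go to the conf\Catalina\localhost directory and create a file named nutch.xml with the following content: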
<Context path="/nutch" debug="0" privileged="true">
</Context>
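Start Tomcat;
Go to the unpacked webapps\nutch\WEB-INF\classes directory and set the searcher.dir property in nutch-default.xml to D:\nutch\nutch-0.9\crawled;
Open a browser and go to http://localhost:8080/nutch;
You can now search: enter apache and the relevant results are returned.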