
First Time Usage

When using HarvestMan you have two ways of downloading pages/files: you can either pass command-line options or use a config.xml file.

Command Line Options

  • To get a list of commands, type:

harvestman --help
  • Here are some of the available options:

  -h, --help            show this help message and exit
  -v, --version         Print version information and exit
  -m, --simulate        Simulates crawling with the given configuration,
                        without performing any actual downloads (same as "-g
  -C CFGFILE, --configfile=CFGFILE
                        Read all options from the configuration file CFGFILE
  -P PROJFILE, --projectfile=PROJFILE
                        Load the project file PROJFILE
  -F URLFILE, --urlfile=URLFILE
                        Read a list of start URLs from file URLFILE and crawl
  -b BASEDIR, --basedir=BASEDIR
                        Set the (optional) base directory to BASEDIR
  -p PROJECT, --project=PROJECT
                        Set the (optional) project name to PROJECT
  -V LEVEL, --verbosity=LEVEL
                        Set the verbosity level to LEVEL. Ranges from 0-5,
                        default is 2
  -f LEVEL, --fetchlevel=LEVEL
                        Set the fetch-level of this project to LEVEL. Ranges
                        from 0-4, default is 0
  -l LOCALISE, --localise=LOCALISE
                        Localize urls after download (yes/no, default is yes)
                        Set the number of retry attempts for failed urls to
                        Enable and set proxy to PROXYSERVER (host:port)
  -U USERNAME, --proxyuser=USERNAME
                        Set username for proxy server to USERNAME
  -W PASSWORD, --proxypass=PASSWORD
                        Set password for proxy server to PASSWORD
                        Limit number of simultaneous network connections to
  -c CACHE, --cache=CACHE
                        Enable/disable caching of downloaded files. If
                        enabled(default), files will not be saved unless their
                        timestamp is newer than the cache timestamp
  -d DEPTH, --depth=DEPTH
                        Set the limit on the depth of urls to DEPTH
                        Enable worker threads and set the number of worker
                        threads to NUMWORKERS
                        Limit the number of tracker threads to NUMTHREADS
  -M NUMFILES, --maxfiles=NUMFILES
                        Limit the number of files downloaded to NUMFILES
  -t PERIOD, --timelimit=PERIOD
                        Run the program for the specified time period PERIOD
                        (in seconds)
  -s, --subdomain       If set, treats subdomains in the same parent domain
                        (like my.foo.com & his.foo.com) as the same
  -R ROBOTS, --robots=ROBOTS
                        Enable/disable Robot Exclusion Protocol and checking
                        of META ROBOTS tags.
  -u FILTER, --urlfilter=FILTER
                        Use regular expression FILTER for filtering urls
  -g PLUGINS, --plugins=PLUGINS
                        Load the set of plugins PLUGINS (Specified as
  -o <name=value>, --option=<name=value>
                        Pass a configuration param using <name=value> syntax
  --ui                  Start HarvestMan in Web UI mode
  --selftest            Run a self test
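The `-u/--urlfilter` option above takes a regular expression. As a rough illustration of how such a filter can be applied (a generic sketch, not HarvestMan's actual filtering code — the pattern is made up, and whether a match means "include" or "exclude" is decided by HarvestMan; here matches are treated as excluded):

```python
import re

# Hypothetical filter: skip URLs ending in common image extensions.
url_filter = re.compile(r".*\.(jpg|jpeg|png|gif)$", re.IGNORECASE)

urls = [
    "http://example.com/index.html",
    "http://example.com/logo.png",
    "http://example.com/docs/manual.pdf",
]

# Keep only URLs that do NOT match the filter pattern.
kept = [u for u in urls if not url_filter.match(u)]
print(kept)
```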

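The `-R/--robots` option toggles checking of the Robot Exclusion Protocol. Python's standard library can parse robots.txt rules; the sketch below shows the kind of check a compliant crawler performs before each download (the rules are invented for illustration and fed to the parser offline, instead of being fetched over the network):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler consults the parser before downloading each URL.
print(parser.can_fetch("*", "http://example.com/index.html"))      # allowed
print(parser.can_fetch("*", "http://example.com/private/x.html"))  # disallowed
```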
Config.xml Options

  • To create config.xml, you can run the bundled configuration program.
  • Go to the folder where you checked out the code.

cd harvestman/tools/

  • Start genconfig.py

python genconfig.py
  • Then point your browser to http://localhost:5940

  • Fill in all the information.
  • When done, save the XML file to the folder and run the following command:

harvestman -C config.xml
  • Here is how the web interface looks: http://lucasmanual.com/out/harvestman.png


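For orientation, a config.xml produced by genconfig.py is an ordinary XML file. The fragment below is only an illustrative sketch — the element and attribute names are guesses, not HarvestMan's actual schema; always generate the real file with genconfig.py as described above:

```xml
<!-- Illustrative sketch only; the real element names come from genconfig.py -->
<HarvestMan>
  <config>
    <project name="example">
      <url>http://example.com/</url>
      <basedir>./crawls</basedir>
    </project>
    <crawler fetchlevel="0" depth="10"/>
  </config>
</HarvestMan>
```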

What is HarvestMan

HarvestMan is an open source, multi-threaded, modular, extensible web crawler program/framework in pure Python.

HarvestMan can be used to download files from websites according to a number of user-specified rules. The latest version of HarvestMan supports more than 60 customization options. HarvestMan is a console (command-line) application.

HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License.
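The description above mentions that HarvestMan is multi-threaded. As a generic sketch of the worker-queue pattern such a crawler is built on (this is not HarvestMan's code; the fetch step is a stub rather than a real network download):

```python
import queue
import threading

url_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def fetch(url):
    # Stub standing in for a real HTTP download.
    return f"<html>content of {url}</html>"

def worker():
    while True:
        url = url_queue.get()
        if url is None:          # sentinel: shut this worker down
            url_queue.task_done()
            break
        page = fetch(url)
        with results_lock:       # results list is shared across threads
            results.append((url, page))
        url_queue.task_done()

NUM_WORKERS = 4
threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for i in range(10):
    url_queue.put(f"http://example.com/page{i}")
for _ in threads:
    url_queue.put(None)          # one sentinel per worker

url_queue.join()                 # wait until every queued item is processed
for t in threads:
    t.join()
print(len(results))              # all 10 pages processed
```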


How to download and install HarvestMan

Check out code from svn

  • First you need to check out the latest version of HarvestMan from the repository.
  • You will need the Subversion program to do that. Once Subversion is installed, run this command:

svn checkout http://harvestman-crawler.googlecode.com/svn/trunk/ harvestman-crawler

Install harvestman

  • Go into the HarvestMan folder and run the setup file.

cd harvestman-crawler/HarvestMan/
python setup.py install

For more information see: http://code.google.com/p/harvestman-crawler/w/list

MyWiki: harvestman (last edited 2009-09-06 02:49:56 by localhost)