First Time Usage

HarvestMan gives you two ways to set up a download of pages or files: pass command-line options, or supply a config.xml file.

== Command Line Options ==

harvestman --help

  -h, --help            show this help message and exit
  -v, --version         Print version information and exit
  -m, --simulate        Simulates crawling with the given configuration,
                        without performing any actual downloads (same as "-g
  -C CFGFILE, --configfile=CFGFILE
                        Read all options from the configuration file CFGFILE
  -P PROJFILE, --projectfile=PROJFILE
                        Load the project file PROJFILE
  -F URLFILE, --urlfile=URLFILE
                        Read a list of start URLs from file URLFILE and crawl
  -b BASEDIR, --basedir=BASEDIR
                        Set the (optional) base directory to BASEDIR
  -p PROJECT, --project=PROJECT
                        Set the (optional) project name to PROJECT
  -V LEVEL, --verbosity=LEVEL
                        Set the verbosity level to LEVEL. Ranges from 0-5,
                        default is 2
  -f LEVEL, --fetchlevel=LEVEL
                        Set the fetch-level of this project to LEVEL. Ranges
                        from 0-4, default is 0
  -l LOCALISE, --localise=LOCALISE
                        Localize urls after download (yes/no, default is yes)
                        Set the number of retry attempts for failed urls to
                        Enable and set proxy to PROXYSERVER (host:port)
  -U USERNAME, --proxyuser=USERNAME
                        Set username for proxy server to USERNAME
  -W PASSWORD, --proxypass=PASSWORD
                        Set password for proxy server to PASSWORD
                        Limit number of simultaneous network connections to
  -c CACHE, --cache=CACHE
                        Enable/disable caching of downloaded files. If
                        enabled(default), files will not be saved unless their
                        timestamp is newer than the cache timestamp
  -d DEPTH, --depth=DEPTH
                        Set the limit on the depth of urls to DEPTH
                        Enable worker threads and set the number of worker
                        threads to NUMWORKERS
                        Limit the number of tracker threads to NUMTHREADS
  -M NUMFILES, --maxfiles=NUMFILES
                        Limit the number of files downloaded to NUMFILES
  -t PERIOD, --timelimit=PERIOD
                        Run the program for the specified time period PERIOD
                        (in seconds)
  -s, --subdomain       If set, treats subdomains in the same parent domain
                        (like & as the same
  -R ROBOTS, --robots=ROBOTS
                        Enable/disable Robot Exclusion Protocol and checking
                        of META ROBOTS tags.
  -u FILTER, --urlfilter=FILTER
                        Use regular expression FILTER for filtering urls
  -g PLUGINS, --plugins=PLUGINS
                        Load the set of plugins PLUGINS (Specified as
  -o <name=value>, --option=<name=value>
                        Pass a configuration param using <name=value> syntax
  --ui                  Start HarvestMan in Web UI mode
  --selftest            Run a self test
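
As a rough illustration of how the options above combine, the following Python sketch assembles a harvestman argument list from a few of the listed flags (-b, -p, -d, -M). The helper name build_harvestman_args is hypothetical, and passing the start URL as a trailing positional argument is an assumption that the help text above does not show. The sketch only builds the list; it does not run anything.

```python
def build_harvestman_args(url, basedir=None, project=None, depth=None, maxfiles=None):
    """Assemble an argument list for the harvestman command (hypothetical helper)."""
    args = ["harvestman"]
    # Each flag below is taken from the --help listing above.
    if basedir is not None:
        args += ["-b", basedir]
    if project is not None:
        args += ["-p", project]
    if depth is not None:
        args += ["-d", str(depth)]
    if maxfiles is not None:
        args += ["-M", str(maxfiles)]
    # Assumption: the start URL is given as the final positional argument.
    args.append(url)
    return args


if __name__ == "__main__":
    # Print the command line that would be run for a depth-limited crawl.
    print(" ".join(build_harvestman_args("http://www.example.com/", project="example", depth=2)))
```

The resulting list could be passed to subprocess.run once harvestman is installed and on the PATH.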

== Config.xml Options ==

cd harvestman/tools/


harvestman -C config.xml
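
Before handing a configuration file to -C, it can help to confirm that the file is at least well-formed XML. The sketch below does only that, using Python's standard library. It does not check the file against HarvestMan's own configuration format, and the helper name config_args is hypothetical.

```python
import xml.etree.ElementTree as ET


def config_args(cfgfile="config.xml"):
    """Return the harvestman argument list for cfgfile (hypothetical helper).

    Raises xml.etree.ElementTree.ParseError if the file is not well-formed
    XML, and OSError if it cannot be opened.
    """
    # Parsing is used purely as a well-formedness check; the tree is discarded.
    ET.parse(cfgfile)
    return ["harvestman", "-C", cfgfile]
```

A malformed file then fails early with a parse error instead of partway through a crawl.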

#summary Description of "HarvestMan"

What is HarvestMan

HarvestMan is an open source, multi-threaded, modular, extensible web crawler program/framework in pure Python.

HarvestMan can be used to download files from websites according to a number of user-specified rules. The latest version of HarvestMan supports more than 60 customization options. HarvestMan is a console (command-line) application.

HarvestMan is the only open source, multi-threaded web crawler program written in the Python language. HarvestMan is released under the GNU General Public License.


#summary How to download and install Harvestman-crawler

Check out code from svn

svn checkout harvestman-crawler

Install harvestman

cd harvestman-crawler/HarvestMan/
python setup.py install