OpenWebSpider v0.6 / v0.7 Handbook

Requirements

  1. OS: Win32 or UNIX/Linux
  2. MySQL 4.1+ (5.0+ suggested) with the developer library (mysql-dev or mysql-devel under Linux, depending on your distribution)
  3. Apache Httpd with PHP support

Getting Started

First, you need to get a copy of OpenWebSpider.
You can download a release from http://www.openwebspider.org/openwebspider_download.php.
Unpack the package and compile it (with "make" under Linux or with MS Visual C++ under Windows);
the package also contains a ready-to-use Windows binary in the Win32_bin/ directory.

    If you compile OpenWebSpider under Windows you must make sure you have set the correct path to your mysql.h in openwebspider-<version>.c
    line: #include "E:\\Programmi\\MySQL\\MySQL Server 5.0\\include\\mysql.h"

Now you can try to execute OpenWebSpider with (under linux):

    $ ./openwebspider

(under windows)

    > openwebspider.exe



How to configure MySQL Server?

After you have installed MySQL you must configure it!
The default configuration should work fine, but some interesting parameters can be modified or added in the configuration file:

    # Specify the minimum length of the word to be included in a FULLTEXT index
    ft_min_word_len=3

    # Specify the maximum length of the word to be included in a FULLTEXT index
    ft_max_word_len=20

    # Use stopwords from this file instead of built-in list
    ft_stopword_file=[path]/stop-words.txt

    # Size of the Key Buffer, used to cache index blocks for MyISAM tables
    key_buffer_size=200M

    # Size of the buffer used for doing full table scans of MyISAM tables
    # Allocated per thread, if a full scan is needed
    read_buffer_size=100M
    read_rnd_buffer_size=100M

    # This buffer is allocated when MySQL needs to rebuild the index in
    # REPAIR, OPTIMIZE, ALTER TABLE statements as well as in LOAD DATA INFILE
    # into an empty table. It is allocated per thread so be careful with
    # large settings
    sort_buffer_size=300M

These values are indicative and should be adjusted depending on your machine!
Now that you have installed and configured your MySQL Server you must create
all the tables needed by the web spider. This is an easy task using ows_macro.
With the following command you create all the needed databases (with default names) and tables:

    $ ./ows_macro -s localhost -u root -p password -c

How to configure the crawler? openwebspider.conf

The first thing you must do before beginning to use the crawler is to set some variables in the openwebspider.conf file.
Here you can set the address of your MySQL server, the login credentials and the names of the databases needed by the spider.
An example configuration file:

    Note: make sure that mysqlserver2 and mysqlserver3 are the same, because this release (0.6) of OpenWebSpider needs those two databases on the same server!

    #server1 is the server with the DB ‘DB1’
    #db1 is the database of the hosts
    db1=hosts
    mysqlserver1=localhost
    userdb1=db_user1
    passdb1=password1

    #server2 is the server with the DB ‘DB2’
    #db2 is the database of the indexed pages
    db2=spiderdb
    mysqlserver2=192.168.0.9
    userdb2=db_user3
    passdb2=password3

    #server3 is the server with the DB ‘DB3’
    #db3 is the database of the temporary tables (used to speed-up the indexing phase)
    db3=temptables
    mysqlserver3=192.168.0.9
    userdb3=db_user3
    passdb3=password3

    #password for the OWS Server
    ows_server_password=server_password

    #EOF#

openwebspider arguments

-I [Search Query]

    (Perform a fulltext search)
    the argument -I followed by a search query performs a fulltext query on the database (spiderdb)
    examples:
    $ ./openwebspider -I shen139
    $ ./openwebspider -I "download openwebspider"

-i [URL]

    (Index a web site)
    the argument -i tells the spider to start by crawling the specified URL
    examples:
    $ ./openwebspider -i http://www.openwebspider.org/
    $ ./openwebspider -i www.openwebspider.org/openwebspider_download.php

-t [number]

    (number of threads)
    -t followed by a number specifies the number of threads to create to crawl the URL specified by the argument -i
    the default value is 20 threads

    Note: a high number of threads can cause problems to the crawled web servers and can take a lot of memory and CPU time
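
    for example, to crawl a web site with only 5 threads you could run something like this (the thread count here is just illustrative):
    $ ./openwebspider -i www.example.com -t 5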

-s

    (Single host mode)
    the argument -s tells the spider to index only the URL pointed to by -i and then exit
    this argument should be used when you want to index only your personal web site

    for example "$ ./openwebspider -i www.example.com -s" tells the spider to index the whole www.example.com and then exit

    if you omit this argument the spider will index the specified URL and then, recursively, all the web sites linked by it

-m [number]

    (Maximum level of depth in the tree of the pages)
    the argument -m followed by a number specifies the maximum level of depth, in the tree of the pages, down to which pages are indexed

    for example if you want to index the home page of www.example.com and all the pages directly linked by the home page you must type:
    $ ./openwebspider -i www.example.com -s -m 2
    the default value is 0, which indexes all pages ignoring this setting, so you can simply omit this argument if you don't want this limitation!

-l [number]

    (number of pages)
    this argument is used to limit the maximum number of pages indexed per site

    Note: the number of pages actually stored in the database will probably be slightly higher, depending on the number of threads used;
    this is because the spider sends a signal to all the threads that are indexing pages when it reaches the limit, but the threads must finish their work before closing themselves

    for example if you want to index only 20 pages from your web site you could do:
    $ ./openwebspider -i www.my-web-site.org -s -l 20

-e

    (Ignore External Hosts)
    this argument tells the spider not to add the hosts it finds in the current session to the table with the list of hosts to index

    for example if you want to index your personal web site you probably need this argument, because you don't want to index the web sites linked by yours
    on the other hand, if you want to index your web site and all the sites linked by it, you could do:
    $ ./openwebspider -i www.my-web-site.org -s
    $ ./openwebspider -i nothing -e

    the first command indexes only your web site and adds to the table of hosts all the web sites linked by yours;
    the second command starts from an invalid URL and then indexes all the web sites stored in the table of hosts without adding new ones

-r [0 or 1 or 2]

    (Relationships)
    -r followed by a number (between 0 and 2) tells the spider whether to store the relationships between pages and/or hosts

    0 – the spider will not store any relationship; this is usually used when you don't care about the relationships (for example when you index your personal web site)

    1 – the spider will store the relationships between hosts (who links whom and who is linked from); useful if you want to keep track of the relationships between hosts (for example to write some kind of statistic); you can use this with -n to store only the relationships without indexing the pages

    2 – the spider will store all the relationships between pages (who links whom and who is linked from); useful if you want to keep track of the relationships between hosts and pages (for example to write some kind of statistic, or if you need an easy way to build a site map of your site); you can use this with -n to store only the relationships without indexing the pages, and with -s to store relationships only for your web site
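
    for example, to store only the page-to-page relationships of your web site (without indexing the pages) you could run something like:
    $ ./openwebspider -i www.my-web-site.org -s -n -r 2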

-T

    (Testing mode)
    this argument sets the testing mode.
    in testing mode the spider crawls web sites as specified by the other arguments but doesn't store/index anything
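
    for example, a dry run over a single web site could look like:
    $ ./openwebspider -i www.example.com -s -T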

-n

    (Do not index pages)
    this argument is very similar to -T; it differs only in that, while it doesn't index anything, it still stores the hosts and the relationships between pages and/or hosts (if specified by -r)
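
    for example, to collect the hosts linked by a web site without indexing any of its pages you could run something like:
    $ ./openwebspider -i www.example.com -s -n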

-u

    (Update mode)
    this argument can be used to index only new pages.
    when openwebspider starts to crawl a site it normally deletes all the pages indexed in the past!
    with -u you tell the spider to preserve all the stored pages and to add to the index only the pages that aren't indexed yet!

    Note: you can use this argument to recover from a crash or to fully index a web site that was previously indexed with some limitations
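
    for example, to resume indexing a web site without deleting what is already stored you could run something like:
    $ ./openwebspider -i www.my-web-site.org -s -u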

-d [number (of milliseconds)]

    (Crawl Delay)
    with -d you can specify the number of milliseconds to wait between the download of a page and the next one (crawl delay)

    for example if you have a slow web server you can set a delay of 1000 milliseconds (1 second) per page with:
    $ ./openwebspider -i www.example.com -d 1000

    Note: openwebspider supports the "crawl-delay" directive in robots.txt, which has priority over -d; if you set a crawl delay of 2000 milliseconds with -d and the spider, while parsing the robots.txt of the current web site, finds a crawl-delay of 1 second, the effective delay will be 1 second

    take care that the crawl delay in robots.txt is expressed in seconds (and not in milliseconds as the parameter of the argument -d)

-x

    (Cache)
    this argument is needed if you want to save the (unmodified) source of the pages and documents that the spider indexes
    this is very useful if you want to keep a copy of the pages (and documents) indexed by the web spider, to provide them as a cache or to process them with external software
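
    for example, to index your web site keeping an exact copy of every page you could run something like:
    $ ./openwebspider -i www.my-web-site.org -s -x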

-z

    (Compressed Cache)
    this argument works like -x, with the difference that the data is compressed with zlib

    Note: make sure that your mysql server is compiled with zlib support, because the spider uses the mysql SQL function COMPRESS() to compress the data (more info here: http://dev.mysql.com/doc/refman/5.0/en/encryption-functions.html )

    you can retrieve the pages stored and compressed with COMPRESS() by using the mysql SQL function UNCOMPRESS()
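
    for example, you could enable the compressed cache with:
    $ ./openwebspider -i www.my-web-site.org -s -z

    and later read a cached page back with a query like the following (the table and column names here are only hypothetical; adjust them to the schema created by ows_macro):
    SELECT UNCOMPRESS(cached_page) FROM pages WHERE page_id = 1;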

-f [module]

    (Import loadable functions from an external module)
    with -f it is possible to extend the capability of the web spider and to control which web pages are indexed (and even how)!
    Common uses of this feature are:

    1. Filter which pages to index and which not
    2. Censor some (defined) words in the text to be indexed
    3. Make the web spider capable of indexing unsupported file types
       (this functionality is still under development and may require a small
       modification of the source code of the web spider)

    More detailed documentation about how to write an external module is available here
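
    for example, assuming a hypothetical module file name (the exact module format is described in the external-module documentation), loading a module could look like:
    $ ./openwebspider -i www.my-web-site.org -s -f ./my_module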

-S [TCP PORT]

    (OpenWebSpider Server)
    makes the spider listen on the specified TCP port waiting for commands
    with this feature it is possible to send commands (via a common web browser like Firefox) to a running web spider.
    For example if we have a running web spider listening on port 8080 (-S 8080) we can see what it is doing,
    we can stop it, make it switch to a defined URL, ... by simply connecting to the spider via our favourite browser

    The web server inside the spider has a session-like authentication mechanism!
    To enable password protection you must add (or edit) the field "ows_server_password" in
    the configuration file of the web spider (default: openwebspider.conf)
    For example:

    ows_server_password=my_password

    The authentication mechanism lets you connect to this server from different IPs; the (session) timeout for each IP is 10 minutes!!!
    The maximum number of concurrent logins is defined by OWSSERVERMAXLOGINS in server.h and is set to 10 by default.
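
    for example, you could start an indexing session with the embedded server enabled:
    $ ./openwebspider -i www.example.com -S 8080
    and then check on the spider by pointing your browser at the machine it is running on (e.g. http://localhost:8080/ if it runs locally; host and port here are only illustrative)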

-F

    (Free Indexing mode)
    This is another new feature in OpenWebSpider v0.6! Free indexing mode means that
    the web spider will index pages as it encounters them!

    Note: this feature is still under heavy development and it's not safe to use;
    in addition the web spider's memory usage can grow indefinitely, so you are encouraged
    to use some kind of limitation (-m, -l or even -f)

    For example if we have a web site with the following structure:

              home_page
            /      |    \
         link1   link2  http://www.example.com/linkX
    

    the web spider will index (in order): home_page, link1, link2 and http://www.example.com/linkX
    whereas without this argument the web spider would index only: home_page, link1 and link2
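
    for example, a reasonably safe way to try free indexing mode is to combine it with a page limit (the limit value here is just illustrative):
    $ ./openwebspider -i www.example.com -F -l 100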
