OpenWebSpider# v0.1 has 2 new fields in the table hostlist_extras:

  • include_pages_regex
  • exclude_pages_regex

These fields let you limit the indexing of a website to certain pages, or keep certain pages out of the index.

Rules:

  1. The robots.txt file has priority over everything. If your include regex matches pages that are excluded by robots.txt, those pages will be neither downloaded nor indexed.
  2. exclude_pages_regex has priority over include_pages_regex.
    For example, if the include regex says you want to index all pages beginning with “/download” but the exclude regex says you don’t want to index pages ending with “.php”, the page “/download.php” won’t be indexed.
  3. include_pages_regex and exclude_pages_regex are optional.
    You can specify pages to include or pages to exclude or both.
    The crawler ignores a regex that is left blank.
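
The three rules can be sketched as a small decision function. This is only an illustration in Python, not OpenWebSpider's own code: the function name should_index, its parameters, and the allowed_by_robots flag (standing in for the robots.txt check) are all assumptions, and it assumes the regex may match anywhere in the page path unless you anchor it.

```python
import re


def should_index(path, include_regex="", exclude_regex="", allowed_by_robots=True):
    """Hypothetical helper that mirrors the three rules above."""
    # Rule 1: robots.txt has priority over everything.
    if not allowed_by_robots:
        return False
    # Rule 2: the exclude regex has priority over the include regex.
    # Rule 3: a blank regex is ignored (empty string is falsy).
    if exclude_regex and re.search(exclude_regex, path):
        return False
    if include_regex and not re.search(include_regex, path):
        return False
    return True


# The "/download" example from rule 2: included by prefix, excluded by suffix.
print(should_index("/download.php", r"^/download", r"\.php$"))
print(should_index("/download.zip", r"^/download", r"\.php$"))
```

Under these assumptions the first call returns False, because the exclude regex wins, and the second returns True.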

When you write your regular expressions, remember that the page paths crawled by OpenWebSpider always begin with a slash: “/”



Example:

  • Index only pages beginning with “/index.php” or with “/download” but not pages containing the word: “test”
    1. include_pages_regex = (^/index.php)|(^/download)
    2. exclude_pages_regex = test

    Indexed Pages:

    • /index.php
    • /index.php?arg=xyz
    • /download.php

    Not indexed:

    • /
    • /index.html
    • /index.php?this_is_a=test
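
If you want to try a pair of patterns before putting them in hostlist_extras, a quick check like the one below may help. It is a rough sketch using Python's re module, which can differ in details from the regex engine the crawler uses. The is_indexed helper is hypothetical, and it assumes the patterns are searched anywhere in the path.

```python
import re

# Patterns copied from the example above.
INCLUDE = r"(^/index.php)|(^/download)"
EXCLUDE = r"test"


def is_indexed(path):
    """Hypothetical check: exclude wins, then the include pattern must match."""
    if re.search(EXCLUDE, path):
        return False
    return re.search(INCLUDE, path) is not None


pages = [
    "/index.php",
    "/index.php?arg=xyz",
    "/download.php",
    "/",
    "/index.html",
    "/index.php?this_is_a=test",
]

results = {page: is_indexed(page) for page in pages}
for page, indexed in results.items():
    print(page, "indexed" if indexed else "not indexed")
```

With these six paths, the first three come out as indexed and the last three as not indexed, matching the lists above.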