OpenWebSpider# v0.1: hostlist_extras regex

OpenWebSpider# v0.1 adds two new fields to the hostlist_extras table:

  • include_pages_regex
  • exclude_pages_regex

These fields let you control which pages of a website get indexed: you can include only certain pages, exclude certain pages, or both.

Rules:

  1. The robots.txt file has priority over everything. If your regex includes pages that robots.txt excludes, those pages will be neither downloaded nor indexed.
  2. exclude_pages_regex has priority over include_pages_regex.
    For example, if the include regex says to index all pages beginning with “/download” but the exclude regex says not to index pages ending with “.php”, the page “/download.php” won’t be indexed.
  3. include_pages_regex and exclude_pages_regex are optional.
    You can specify pages to include or pages to exclude or both.
    The crawler will IGNORE the regex if blank.
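
The three rules above can be sketched in a few lines of Python. This is only an illustration of the precedence order, not OpenWebSpider's actual code: the function name should_index and the allowed_by_robots flag are hypothetical, and it assumes the regex may match anywhere in the page path (search semantics) and that matching is case-sensitive.

```python
import re

def should_index(path, include_regex="", exclude_regex="", allowed_by_robots=True):
    """Hypothetical helper that mirrors the precedence rules described above."""
    # Rule 1: robots.txt wins over everything.
    if not allowed_by_robots:
        return False
    # Rule 2: a non-blank exclude regex wins over the include regex.
    if exclude_regex and re.search(exclude_regex, path):
        return False
    # Rule 3: a blank include regex is ignored; otherwise the path must match it.
    if include_regex and not re.search(include_regex, path):
        return False
    return True

# The "/download.php" case from rule 2: included by prefix, excluded by suffix.
print(should_index("/download.php", r"^/download", r"\.php$"))
```

With both fields left blank, every page that robots.txt allows is indexed, which matches rule 3.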

When writing your regular expressions, remember that the page paths crawled by OpenWebSpider always begin with a slash: “/”.



Example:

  • Index only pages beginning with “/index.php” or “/download”, but not pages containing the word “test”:
    1. include_pages_regex = (^/index.php)|(^/download)
    2. exclude_pages_regex = test

    Indexed Pages:

    • /index.php
    • /index.php?arg=xyz
    • /download.php

    Not indexed:

    • /
    • /index.html
    • /index.php?this_is_a=test
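
To try the example patterns before storing them in hostlist_extras, a quick check with Python's re module might look like the sketch below. It assumes the crawler's regex engine treats these simple patterns the same way Python does and that the pattern may match anywhere in the path; the is_indexed helper is hypothetical.

```python
import re

# Patterns copied from the example above.
include_pages_regex = r"(^/index.php)|(^/download)"
exclude_pages_regex = r"test"

pages = [
    "/index.php",
    "/index.php?arg=xyz",
    "/download.php",
    "/",
    "/index.html",
    "/index.php?this_is_a=test",
]

def is_indexed(page):
    # A page must match the include regex and must not match the exclude regex.
    included = re.search(include_pages_regex, page) is not None
    excluded = re.search(exclude_pages_regex, page) is not None
    return included and not excluded

indexed = [p for p in pages if is_indexed(p)]
not_indexed = [p for p in pages if not is_indexed(p)]

print("Indexed:", indexed)
print("Not indexed:", not_indexed)
```

Note that the unescaped dot in “index.php” matches any character, so a path such as “/indexXphp” would also be included; writing “index\.php” makes the pattern stricter.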

2 Responses

  1. bernd says:

    Hello,

    is this possible for spidering only .de hosts?

    include_pages_regex = (^.de/*)

  2. Shen139 says:

    No,
    the regexes in that table are used only to include or exclude pages of a domain from indexing.

    If you are indexing http://www.example.com and want only .HTML pages to be indexed, you could use that table.

    I’m planning to add this feature in the future.

    Stefano
