OpenWebSpider# v0.1.3

OpenWebSpider# v0.1.3 has been released.

CHANGELOG:

  • New feature: CRAWLER NAME and CRAWLER VERSION are now used in the User-Agent string of HTTP requests
  • New feature: new configuration file field: sql_hostlist_where
  • New feature: new command-line argument: --keep-dup
  • BUG: fixed the regex used to extract URLs from <BASE>
  • BUG: fixed a bug in the function that extracts URLs
  • BUG: fixed a bug in page.cs::normalizePage()
  • BUG: fixed minor bugs
  • BUG: fixed a bug in robots.txt’s parser
  • BUG: fixed a bug in page-rels handler

Source code and binary are available in the package: Download

Documentation of OpenWebSpider# v0.1

2 Responses

  1. bernd says:

    Hi,

    How does this work?
    “New feature: New configuration file field: sql_hostlist_where”

    thanks

  2. Shen139 says:

    Hi,
    As you can see in openwebspider.conf:
    --[
    # used to influence the query that chooses the domains to index
    # SELECT id FROM hostlist WHERE status=0 AND ( <$sql_hostlist_where here$> ) ORDER by priority DESC
    # sql_hostlist_where is "1=1" by default
    # Examples:
    # index only ".com" domains
    # sql_hostlist_where = hostname LIKE '%.com'
    # index only domains on port 8080
    # sql_hostlist_where = port = 8080
    # index only a portion of the database; from ID 600 to 2000
    # sql_hostlist_where = ID >= 600 AND ID <= 2000
    ]--

    This is useful if you run many crawlers and want each crawler to index only a fixed number of domains!

    Stefano
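
To make the reply above concrete, here is a minimal Python sketch of how the sql_hostlist_where value could be dropped into the query quoted from the configuration file, and how an ID range might be split across several crawlers. Only the query template and the example clauses come from the configuration comments; the function names build_hostlist_query and id_range_clauses are hypothetical, and this is an illustration, not the crawler's actual implementation.

```python
# Query template as quoted in the configuration file comments.
QUERY_TEMPLATE = (
    "SELECT id FROM hostlist WHERE status=0 AND ( {where} ) "
    "ORDER by priority DESC"
)


def build_hostlist_query(sql_hostlist_where="1=1"):
    """Hypothetical helper: substitute the configured clause into the template.

    The default "1=1" matches the documented default, which selects every
    host with status=0.
    """
    return QUERY_TEMPLATE.format(where=sql_hostlist_where)


def id_range_clauses(first_id, last_id, crawlers):
    """Hypothetical helper: split [first_id, last_id] into contiguous,
    non-overlapping ID ranges, one sql_hostlist_where clause per crawler."""
    total = last_id - first_id + 1
    base, extra = divmod(total, crawlers)
    clauses = []
    start = first_id
    for i in range(crawlers):
        # The first `extra` crawlers take one additional ID each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more crawlers than IDs: nothing left to assign
        end = start + size - 1
        clauses.append(f"ID >= {start} AND ID <= {end}")
        start = end + 1
    return clauses


if __name__ == "__main__":
    # One config value per crawler instance, covering IDs 600 to 2000.
    for clause in id_range_clauses(600, 2000, 2):
        print("sql_hostlist_where =", clause)
        print(build_hostlist_query(clause))
```

Each crawler instance would then get its own configuration file carrying one of the printed sql_hostlist_where lines, so no two crawlers pick the same hosts.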