Released OpenWebSpider v0.1.3
CHANGELOG:
- New feature: CRAWLER NAME and CRAWLER VERSION are now used in the User-Agent string of HTTP requests
- New feature: New configuration file field: sql_hostlist_where
- New feature: new command-line argument: --keep-dup
- BUG: fixed the regex used to extract URLs from <BASE>
- BUG: fixed a bug in the function that extracts URLs
- BUG: fixed a bug in page.cs::normalizePage()
- BUG: fixed minor bugs
- BUG: fixed a bug in robots.txt’s parser
- BUG: fixed a bug in page-rels handler
Source code and binary are available in the package: Download
Documentation of OpenWebSpider# v0.1
bernd responded on 06 Nov 2008 at 2:53 pm #
Hi,
How does this work?
“New feature: New configuration file field: sql_hostlist_where”
thanks
Shen139 responded on 11 Dec 2008 at 3:16 pm #
Hi,
As you can see in openwebspider.conf:
–[
# used to influence the query that chooses the domains to index
# SELECT id FROM hostlist WHERE status=0 AND ( <$sql_hostlist_where here$> ) ORDER by priority DESC
# sql_hostlist_where is "1=1" by default
# Examples:
# index only ".com" domains
# sql_hostlist_where = hostname LIKE '%.com'
# index only domains on port 8080
# sql_hostlist_where = port = 8080
# index only a portion of the database; from ID 600 to 2000
# sql_hostlist_where = ID >= 600 AND ID <= 2000
]–
This is useful if you run many crawlers and want each crawler to index only a fixed subset of the domains.
Stefano
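To illustrate the multi-crawler case described above, here is a minimal Python sketch of how one might generate a separate sql_hostlist_where value for each crawler. Only the query template and the "ID >= ... AND ID <= ..." form come from openwebspider.conf; the function names build_hostlist_query and split_id_ranges are hypothetical, and the sketch assumes the hostlist IDs are roughly contiguous. It does not show how OpenWebSpider itself builds the query internally.

```python
# Query template copied from the comment in openwebspider.conf.
QUERY_TEMPLATE = (
    "SELECT id FROM hostlist WHERE status=0 AND ( {where} ) "
    "ORDER by priority DESC"
)


def build_hostlist_query(where="1=1"):
    """Splice a sql_hostlist_where value into the template (hypothetical helper).

    "1=1" is the documented default, which selects every pending host.
    """
    return QUERY_TEMPLATE.format(where=where)


def split_id_ranges(first_id, last_id, crawlers):
    """Split [first_id, last_id] into contiguous, non-overlapping ID ranges.

    Returns one sql_hostlist_where clause per crawler (hypothetical helper).
    """
    total = last_id - first_id + 1
    if crawlers < 1 or crawlers > total:
        raise ValueError("need between 1 and %d crawlers" % total)
    base, extra = divmod(total, crawlers)
    clauses = []
    start = first_id
    for i in range(crawlers):
        # The first `extra` crawlers take one additional ID each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        clauses.append(f"ID >= {start} AND ID <= {end}")
        start = end + 1
    return clauses


if __name__ == "__main__":
    # Three crawlers sharing IDs 600..2000, as in the config example.
    for clause in split_id_ranges(600, 2000, 3):
        print("sql_hostlist_where =", clause)
        print(build_hostlist_query(clause))
```

Each printed sql_hostlist_where line would go into the configuration file of one crawler instance, so that no two crawlers pick up the same host.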