Command Line Arguments:
- --index, -i [URL]
Indexes the website specified with [URL]
- --add-hostlist
Doesn’t index the [URL] specified with “--index”; it simply adds the hostname to the list of hosts (hostlist), prints its ID, and exits. This command is useful if you want to use the hostlist_extras table [Read more about: hostlist_extras regex] and you need the ID of a host, whether or not it is already present in the hosts table.
- --threads, -t [1-100]
Sets the number of threads
- -s
Single Mode: On (Default: Off). With Single Mode on, the crawler indexes the website specified with “--index” and exits.
- --cache
Saves a copy of each indexed page (Default: doesn’t save cache)
- --cache-compressed
Saves a compressed copy of each indexed page (Default: doesn’t save cache)
- --rels, -r [1,2]
Saves relationships between pages (Default: doesn’t save rels). This is useful for generating a map of which pages link to which, and which pages are linked from where.
1: saves only hostnames (Example: www.example.com links www.test.com; www.example.com links www.domain.net)
2: saves hostnames and pages (Example: www.example.com/index.html links: www.example.com/download.html, www.example.com/test.html and www.test.com/docs.php; …)
Let’s see an example of what you can do with this feature:
http://lab.openwebspider.org/8like.php
- --add-external, -e
Adds external hosts (Default: doesn’t add external hosts).
If not specified, all external hosts found in crawled pages are ignored by the crawler and won’t be indexed in the future.
- --conf-file [filename]
Sets a configuration file (Default: openwebspider.conf)
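The two --rels modes described above can be pictured with a small sketch. This is only an illustration of the difference between the modes, not OpenWebSpider's actual code or storage format: the function names and the (source, target) pair representation are hypothetical, and dropping same-host links in mode 1 is an assumption based on the example given for that mode.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the hostname part of a URL such as http://www.example.com/a.html."""
    return urlsplit(url).hostname


def relationships(links, mode):
    """Reduce (source_url, target_url) pairs to what a given mode would keep.

    mode 1: hostname -> hostname (same-host links dropped; an assumption)
    mode 2: full source URL -> full target URL (hostnames and pages)
    """
    if mode not in (1, 2):
        raise ValueError("mode must be 1 or 2")
    seen = set()
    for src, dst in links:
        if mode == 1:
            pair = (host_of(src), host_of(dst))
            if pair[0] == pair[1]:
                continue  # assumption: a host linking to itself is not recorded
        else:
            pair = (src, dst)
        seen.add(pair)
    return sorted(seen)


# Hypothetical crawl result, modelled on the examples in the option description.
links = [
    ("http://www.example.com/index.html", "http://www.example.com/download.html"),
    ("http://www.example.com/index.html", "http://www.example.com/test.html"),
    ("http://www.example.com/index.html", "http://www.test.com/docs.php"),
    ("http://www.example.com/about.html", "http://www.domain.net/"),
]

for mode in (1, 2):
    print(mode, relationships(links, mode))
```

Mode 1 collapses the four page-level links into two host-level records, while mode 2 keeps all four.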
Limits:
- --max-depth, -m [0-1000]
Sets the maximum depth level of the pages to index (Default: -1, index all pages)
Depth Level = 0: index only the home page
Depth Level = 1: index the home page and all pages directly linked from it
…
- --max-pages, -l [1-1000000]
Sets the maximum number of pages to index (per domain)
- --max-seconds, -c [1-100000]
Sets the maximum number of seconds (per domain)
- --max-kb, -k [1-100000]
Sets the maximum number of KB to download (per domain)
- --errors [1-1000]
Sets the maximum number of HTTP error codes (per domain)
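The depth levels described for --max-depth can be read as a breadth-first walk from the home page that stops expanding links once the limit is reached. The sketch below illustrates that reading on a made-up in-memory site; the function name, the `site` dictionary, and the use of breadth-first search are assumptions for illustration, not a description of how OpenWebSpider is implemented.

```python
from collections import deque


def pages_within_depth(links, home, max_depth=-1):
    """Return {page: depth} for pages reachable from `home`.

    max_depth = -1 means no limit (index all pages);
    max_depth = 0 keeps only the home page;
    max_depth = 1 adds the pages linked directly from the home page.
    """
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        if max_depth >= 0 and depth[page] >= max_depth:
            continue  # at the limit: keep the page, do not follow its links
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical site: page -> pages it links to.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/c": ["/d"],
}

print(pages_within_depth(site, "/", 0))
print(pages_within_depth(site, "/", 1))
print(pages_within_depth(site, "/"))
```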
Help:
- --help, -h
New Features in OpenWebSpiderCS v0.1.1
- --crawl-delay [seconds]
Seconds to wait between downloading one page and the next (Default: 0 seconds)
- --req-timeout [seconds]
HTTP request timeout, in seconds (Default: 60 seconds)
- --stress-test [value]
Downloads the same page (specified with --index) [value] times and exits.
Useful for stress-testing your web server
New Features in OpenWebSpiderCS v0.1.2
- --images
Indexes images
- --req-timeout [seconds]
HTTP request timeout, in seconds (Default: 60 seconds)
- --stress-test [value]
NOW improved: OpenWebSpider# doesn’t require a configuration file or a MySQL server, and it doesn’t check robots.txt
- --no-index
Doesn’t index crawled pages.
Useful to index images or to create a map of a website (using --rels)
New Features in OpenWebSpiderCS v0.1.3
- --keep-dup
Doesn’t delete duplicated pages
New Features in OpenWebSpiderCS v0.1.4
- --pdf
Indexes PDFs
- --mp3
Indexes MP3s
fatticat responded on 14 May 2009 at 2:21 am #
Thanks for your great web spider.
I just used Rails to make a simple tool like your PHP example.
I found a problem: I cannot show the title text in the UTF-8 charset.
I googled around for the answer,…
encoding = utf8
…
but it is not working.
Can you help me extract the proper content in Rails?
fatticat responded on 14 May 2009 at 4:25 am #
my rails show title : ã€äºŒæ‰‹è½¦èµ„讯】_优å¡äºŒæ‰‹è½¦ç½‘ 个人登录 | ç»é”€å•†ç™»å½• │ i dont know why
your php example is good
MySQL: SHOW VARIABLES LIKE 'char%'
"Variable_name","Value"
"character_set_client","utf8"
"character_set_connection","utf8"
"character_set_database","utf8"
"character_set_filesystem","binary"
"character_set_results","utf8"
"character_set_server","latin1"
"character_set_system","utf8"
"character_sets_dir","/usr/share/mysql/charsets/"
environment.rb:
$KCODE = 'utf8'
require 'jcode'
application.rb:
before_filter :configure_charsets
# Force a UTF-8 charset on HTML responses
def set_charset
  content_type = headers["Content-Type"] || 'text/html'
  if /^text\/html/.match(content_type)
    headers["Content-Type"] = "text/html; charset=utf-8"
  end
  headers["Keep-Alive"] = "timeout=60, max=256"
end
def configure_charsets
  set_charset
end
database.yml:
development:
  adapter: mysql
  database: se_spiderdb
  encoding: utf8
  host: localhost
  port: 3306
  username: root
Shen139 responded on 18 May 2009 at 10:02 am #
I can’t help you because I don’t use Rails.
I suggest downloading the latest version of OWS and rebuilding the database from scratch
(every new version of OWS improves encodings, and something changes in the DB structure).
PHP side:
Add this code at the beginning of the script:
header("Content-type: text/html; charset=UTF-8");
Add this code after the MySQL connect (mysql_connect):
mysql_set_charset('utf8', $link);
mysql_query("SET NAMES 'UTF8';", $link);
mysql_query("SET CHARACTER SET UTF8;", $link);
Add this line to the HEAD of the page:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
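
The garbled title quoted earlier in this thread is consistent with UTF-8 bytes being read as a single-byte Western charset somewhere between the database and the browser. The sketch below reproduces that effect for three of the characters in the title, assuming the wrong charset is Windows-1252 (which is what MySQL calls latin1). It is a guess at the cause, not a diagnosis of that particular setup, and the function names are hypothetical.

```python
def garble(text):
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled):
    """Undo garble(): recover the original bytes and decode them as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


title = "二手车"
broken = garble(title)
print(broken)          # äºŒæ‰‹è½¦
print(repair(broken))  # 二手车
```

The round trip only works while every byte maps to a Windows-1252 character. Bytes with no mapping get lost along the way, so setting the connection charset correctly is more reliable than repairing text afterwards.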