OpenWebSpider# v0.1 Command Line Arguments/Usage

Command Line Arguments:

  • --index, -i [URL]
    Indexes the website specified by [URL].
  • --add-hostlist
    Doesn’t index the [URL] specified with “--index”; it only adds the hostname to the list of hosts (hostlist), prints its ID, and exits. This is useful if you want to use the hostlist_extras table [Read more about: hostlist_extras regex] and need the ID of a host, whether or not it is already present in the hosts table.
  • --threads, -t [1-100]
    Sets the number of threads
  • -s
    Single Mode: On (Default: Off). With Single Mode on, indexes the website specified with “--index” and exits.
  • --cache
    Saves a copy of each indexed page (Default: Doesn’t save cache)
  • --cache-compressed
    Saves a compressed copy of each indexed page (Default: Doesn’t save cache)
  • --rels, -r [1,2]
    Saves relationships between pages (Default: Doesn’t save rels). This is useful for generating a map of which pages link to which, and which pages are linked from where.
    1: saves only hostnames (Example: www.example.com links www.test.com; www.example.com links www.domain.net)
    2: saves hostnames and pages (Example: www.example.com/index.html links: www.example.com/download.html, www.example.com/test.html and www.test.com/docs.php; …)
    An example of what you can do with this feature:
    http://lab.openwebspider.org/8like.php
  • --add-external, -e
    Adds external hosts (Default: Doesn’t add external hosts).
    If not specified, all external hosts found in crawled pages are ignored by the crawler and won’t be indexed in the future.
  • --conf-file [filename]
    Sets a configuration file (Default: openwebspider.conf)
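
The difference between the two “--rels” levels described above can be illustrated with a short Python sketch. This is not OpenWebSpider code: the helper names (host_of, to_host_relations) are hypothetical, and it assumes that level 1 keeps one entry per pair of distinct hosts and drops links that stay inside a single host, which the page does not confirm.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the hostname of a URL; scheme-less input such as
    'www.example.com/index.html' is accepted as well."""
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url).hostname or ""


def to_host_relations(page_links):
    """Collapse page-level (source, target) links (level 2 style) into
    unique host-level pairs (level 1 style).

    Assumption: links within the same host are not kept at level 1.
    """
    relations = []
    for source, target in page_links:
        pair = (host_of(source), host_of(target))
        if pair[0] != pair[1] and pair not in relations:
            relations.append(pair)
    return relations


# Page-level links taken from the level 2 example in the text.
page_links = [
    ("www.example.com/index.html", "www.example.com/download.html"),
    ("www.example.com/index.html", "www.example.com/test.html"),
    ("www.example.com/index.html", "www.test.com/docs.php"),
]

host_links = to_host_relations(page_links)
print(host_links)
```

Level 2 keeps every page-to-page link, so it can always be reduced to the level 1 view afterwards; the reverse is not possible.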



Limits:

  • --max-depth, -m [0-1000]
    Sets the maximum depth level of the pages to index. (Default: -1 (Index all pages))
    Depth Level = 0 : Index only the home page
    Depth Level = 1 : Index the home page and all pages directly linked from the home page
  • --max-pages, -l [1-1000000]
    Sets the maximum number of pages to index (per domain)
  • --max-seconds, -c [1-100000]
    Sets the maximum number of seconds (per domain)
  • --max-kb, -k [1-100000]
    Sets the maximum number of KB to download (per domain)
  • --errors [1-1000]
    Sets the maximum number of HTTP errors (per domain)
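
The depth levels listed above can be illustrated with a breadth-first walk over a small in-memory link graph. This is a sketch of the idea only, not the crawler's actual implementation; the function name pages_within_depth and the sample site are made up for the example.

```python
from collections import deque


def pages_within_depth(links, start, max_depth=-1):
    """Return {page: depth} for every page reachable from start.

    max_depth=-1 means no limit, mirroring the documented default.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if max_depth != -1 and depth[page] >= max_depth:
            continue  # pages at the limit are kept, but their links are not followed
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical site: the home page links to two pages, one of which links deeper.
site = {
    "/": ["/about", "/docs"],
    "/docs": ["/docs/install"],
    "/docs/install": ["/"],
}

print(sorted(pages_within_depth(site, "/", 0)))
print(sorted(pages_within_depth(site, "/", 1)))
print(sorted(pages_within_depth(site, "/")))
```

With depth 0 only the home page is returned, with depth 1 the home page plus the pages it links to directly, and with the default every reachable page.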

Help:

  • --help, -h

New Features in OpenWebSpiderCS v0.1.1

  • --crawl-delay [seconds]
    Seconds between the download of a page and the next one (Default: 0 seconds)
  • --req-timeout [seconds]
    HTTP request timeout (in seconds) (Default: 60 seconds)
  • --stress-test [value]
    Downloads the same page (specified with --index) the given number of times and exits
    Useful for running stress tests against your web server

New Features in OpenWebSpiderCS v0.1.2

  • --images
    Indexes images
  • --req-timeout [seconds]
    HTTP request timeout (in seconds) (Default: 60 seconds)
  • --stress-test [value]
    NOW improved: OpenWebSpider# doesn’t require a configuration file or a MySQL server, and it doesn’t check robots.txt
  • --no-index
    Doesn’t index crawled pages.
    Useful to index images or to create a map of a website (using --rels)

New Features in OpenWebSpiderCS v0.1.3

  • --keep-dup
    Doesn’t delete duplicated pages

New Features in OpenWebSpiderCS v0.1.4

  • --pdf
    Indexes PDFs
  • --mp3
    Indexes MP3s
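
When the crawler is launched from a script, it can help to check values against the documented ranges before running it. The following Python sketch builds an argument list from a few of the options above, written with plain ASCII hyphens. The function build_args is hypothetical, and it assumes each value is passed as a separate token after its flag; check the output of --help for the exact syntax of your version.

```python
def build_args(url, threads=None, rels=None, max_depth=None,
               max_pages=None, cache=False, add_external=False):
    """Build a list of command line arguments, enforcing the documented ranges."""
    args = ["--index", url]
    if threads is not None:
        if not 1 <= threads <= 100:
            raise ValueError("--threads must be between 1 and 100")
        args += ["--threads", str(threads)]
    if rels is not None:
        if rels not in (1, 2):
            raise ValueError("--rels must be 1 or 2")
        args += ["--rels", str(rels)]
    if max_depth is not None:
        if not 0 <= max_depth <= 1000:
            raise ValueError("--max-depth must be between 0 and 1000")
        args += ["--max-depth", str(max_depth)]
    if max_pages is not None:
        if not 1 <= max_pages <= 1000000:
            raise ValueError("--max-pages must be between 1 and 1000000")
        args += ["--max-pages", str(max_pages)]
    if cache:
        args.append("--cache")
    if add_external:
        args.append("--add-external")
    return args


# "openwebspider" is a placeholder; use the executable name of your installation.
command = ["openwebspider"] + build_args(
    "http://www.example.com/", threads=10, rels=2, max_depth=1
)
print(" ".join(command))
```

The resulting list can be handed to a process launcher such as subprocess.run; building it as a list rather than a single string avoids shell quoting problems with URLs.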

4 Responses

  1. fatticat says:

    Thanks for your great web spider.
    I just used Rails to make a simple tool like your PHP example.
    I found a problem: I cannot show the title text in the UTF-8 charset.
    I googled around for the answer,…
    encoding = utf8

    but it is not working.

    So can you help me extract the proper content into Rails?

  2. fatticat says:

    my rails show title : 【二手车资讯】_优卡二手车网 个人登录 | 经销商登录 │ i dont know why
    your php example is good

    MySQL: SHOW VARIABLES LIKE 'char%'
    "Variable_name","Value"
    "character_set_client","utf8"
    "character_set_connection","utf8"
    "character_set_database","utf8"
    "character_set_filesystem","binary"
    "character_set_results","utf8"
    "character_set_server","latin1"
    "character_set_system","utf8"
    "character_sets_dir","/usr/share/mysql/charsets/"

    environment.rb:
    $KCODE = 'utf8'
    require 'jcode'

    application.rb:
    before_filter :configure_charsets

    def set_charset
      content_type = headers["Content-Type"] || 'text/html'
      if /^text\/html/.match(content_type)
        headers["Content-Type"] = "text/html; charset=utf-8"
      end
      headers["Keep-Alive"] = "timeout=60, max=256"
    end

    def configure_charsets
      set_charset
    end

    database.yml:
    development:
      adapter: mysql
      database: se_spiderdb
      encoding: utf8
      host: localhost
      port: 3306
      username: root
  3. Shen139 says:

    I can’t help you because I don’t use Rails.
    I suggest you download the latest version of OWS and rebuild the database from scratch
    (every new version of OWS improves encoding handling, and something changes in the DB structure).

    PHP side:
    Add this code at the beginning of the script:
    header("Content-type: text/html; charset=UTF-8");

    Add this code after the MySQL connect (mysql_connect):

    mysql_set_charset('utf8', $link);

    mysql_query("SET NAMES 'UTF8';", $link);
    mysql_query("SET CHARACTER SET UTF8;", $link);

    Add this line to the HEAD of the page:
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

