OpenWebSpider# v0.1 Database Structure

OpenWebSpider# v0.1 still uses MySQL Server as the backend to store the index.
The structure of the two databases used by this version of the crawler differs slightly from that of OpenWebSpider v0.7.

The structure of the database has been changed in OpenWebSpider# v0.1.4. Please upgrade your database.

[Read more on how to create the databases and tables here: OpenWebSpider# v0.1: Getting Started -> Create Database and Tables]

Database (1): ows_hosts:
This database contains 4 tables (one new table added with OpenWebSpider# v0.1.4):

  1. hostlist: the list of all websites, both those already indexed and those still to be indexed

    • id [auto_increment]
    • hostname
    • port
    • status [0: to index; 1: indexed; 2: indexing]
    • lastvisit
    • indexed_pages
    • time_sec
    • bytes_downloaded
    • error_pages
    • priority [websites with higher priority will be indexed first]
  2. rels: the table that stores the relationships between websites and pages

    • host_id [ID of the website that contains the link]
    • page
    • linkedhost_id [ID of the website that is linked to]
    • linkedpage
    • textlink [the anchor text used in the link]
  3. hostlist_extras: the table where custom per-website limits are stored

    • host_id [the ID of the website to limit]
    • max_pages
    • max_level [max level of depth]
    • max_seconds
    • max_bytes
    • max_HTTP_errors
    • include_pages_regex [Read more about hostlist_extras regex]
    • exclude_pages_regex
  4. crawler_act: this table is used to send commands to running crawlers

    • crawler_id [the identifier for a crawler]
    • act [the action (act = 0: Normal Use, Play; act = 1: Exit; act = 2: Pause)]
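
The status and act codes listed above, together with the priority rule of hostlist, can be sketched in Python as follows. This is only an illustration and not the crawler's own code: the names HostStatus, CrawlerAction, Host and next_host_to_index are hypothetical, only a subset of the hostlist columns is modeled, and the tie-breaking rule (lowest id first) is an assumption, since the page only says that websites with higher priority are indexed first.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional


class HostStatus(IntEnum):
    """Values of hostlist.status as listed above."""
    TO_INDEX = 0
    INDEXED = 1
    INDEXING = 2


class CrawlerAction(IntEnum):
    """Values of crawler_act.act as listed above."""
    PLAY = 0
    EXIT = 1
    PAUSE = 2


@dataclass
class Host:
    """A subset of the hostlist columns, enough for this sketch."""
    id: int
    hostname: str
    port: int
    status: HostStatus
    priority: int


def next_host_to_index(hosts: Iterable[Host]) -> Optional[Host]:
    """Return the pending host with the highest priority, or None.

    Assumption: ties are broken by the lowest id; the page does not say.
    """
    pending = [h for h in hosts if h.status == HostStatus.TO_INDEX]
    if not pending:
        return None
    return max(pending, key=lambda h: (h.priority, -h.id))


# Made-up rows for illustration only.
hosts = [
    Host(1, "example.com", 80, HostStatus.INDEXED, 5),
    Host(2, "example.org", 80, HostStatus.TO_INDEX, 1),
    Host(3, "example.net", 8080, HostStatus.TO_INDEX, 3),
]
chosen = next_host_to_index(hosts)
print(chosen.hostname if chosen else "nothing to index")  # example.net
```

Host 1 has the highest priority but is already indexed, so the sketch picks host 3, the pending website with the highest priority.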

Database (2): ows_index:
(3 new tables added with OpenWebSpider# v0.1.4)

  1. pages: the index itself; it contains all the indexed pages

    • id [auto_increment]
    • host_id
    • hostname
    • page
    • title [page <TITLE>]
    • anchor_text [the first anchor text used to link this page]
    • text [the text extracted from the HTML]
    • cache [the HTML of the page]
    • html_md5 [the MD5 of the page (used to store only unique pages)]
    • level [depth level]
    • rank
    • date
    • time
  2. images: this is a new table fully supported in OpenWebSpider# v0.1.4 that stores images found while crawling web pages

    • id [auto_increment]
    • src_host_id
    • src_page [src_host_id + src_page together give the URL of the page where the image was found]
    • image_host_id
    • image [image_host_id + image together give the URL of the image]
    • alt_text
    • title_text
  3. mp3: the table with the MP3 info (this is a new table introduced with OpenWebSpider# v0.1.4)

    • id [auto_increment]
    • host_id
    • filename [host_id + filename are the URL of the MP3]
    • mp3_size [size in bytes]
    • mp3_artist
    • mp3_title
    • mp3_album
    • mp3_genre
    • mp3_duration [in seconds]
  4. pdf: the table with the text of the PDFs, which is also indexed in the pages table (this is a new table introduced with OpenWebSpider# v0.1.4)

    • id [auto_increment]
    • host_id
    • filename [host_id + filename are the URL of the PDF]
    • pdf_size [size in bytes]
    • pdf_text
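
Two ideas from the tables above lend themselves to a short Python sketch: html_md5 is used to store only unique pages, and a host_id plus a page or filename together identify a URL. The code below is a sketch under stated assumptions, not the crawler's implementation. The function names are hypothetical, the HTML is assumed to be hashed as UTF-8 text, the scheme is assumed to be plain http (the tables do not store one), and the page or filename value is assumed to start with "/".

```python
import hashlib
from typing import Dict, Set, Tuple


def html_md5(html: str) -> str:
    """Hex MD5 digest of the page HTML (UTF-8 encoding is an assumption)."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def is_new_page(html: str, seen_digests: Set[str]) -> bool:
    """Record the digest and report whether this HTML was unseen so far."""
    digest = html_md5(html)
    if digest in seen_digests:
        return False
    seen_digests.add(digest)
    return True


def build_url(hosts: Dict[int, Tuple[str, int]], host_id: int, path: str) -> str:
    """Join a (hostname, port) entry with a page or filename into a URL.

    Assumptions: plain http, and `path` begins with "/".
    """
    hostname, port = hosts[host_id]
    netloc = hostname if port == 80 else f"{hostname}:{port}"
    return f"http://{netloc}{path}"


# Made-up data for illustration only.
seen: Set[str] = set()
print(is_new_page("<html>a</html>", seen))  # True: first time this HTML is seen
print(is_new_page("<html>a</html>", seen))  # False: same MD5, duplicate page

known_hosts = {7: ("example.com", 8080)}
print(build_url(known_hosts, 7, "/music/track.mp3"))
```

In a real database the set of seen digests would more likely be replaced by a lookup on the html_md5 column; the in-memory set simply keeps the sketch self-contained.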
