OpenWebSpider# v0.1 still uses MySQL Server as the backend to store the index.
The structure of the two databases used by this version of the crawler differs slightly from that of OpenWebSpider v0.7.
The database structure changed in OpenWebSpider# v0.1.4, so please upgrade your database.
[Read more on how to create the database and tables here: OpenWebSpider# v0.1: Getting Started -> Create Database and Tables]
Database (1): ows_hosts:
This database contains 4 tables (one new table added with OpenWebSpider# v0.1.4):
- hostlist: the list of all websites, both those already indexed and those still to be indexed
Fields:
- id [auto_increment]
- hostname
- port
- status [0: to index; 1: indexed; 2: indexing]
- lastvisit
- indexed_pages
- time_sec
- bytes_downloaded
- error_pages
- priority [websites with higher priority will be indexed first]
- rels: the table that stores the relationships between websites and pages
Fields:
- host_id [ID of the website that contains the link]
- page
- linkedhost_id [ID of the website that is linked to]
- linkedpage
- textlink [the anchor text used in the link]
- hostlist_extras: the table where custom per-website limits are stored
Fields:
- host_id [the ID of the website to limit]
- max_pages
- max_level [max level of depth]
- max_seconds
- max_bytes
- max_HTTP_errors
- include_pages_regex [Read more about hostlist_extras regex]
- exclude_pages_regex
- crawler_act: this table is used to send commands to running crawlers
Fields:
- crawler_id [the identifier for a crawler]
- act [the action (act = 0: Normal Use, Play; act = 1: Exit; act = 2: Pause)]
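As a rough illustration of how the status, priority and act fields above might be used, here is a minimal Python sketch. It uses an in-memory SQLite database only so that it runs without a MySQL server; the column types, the helper names (make_hosts_db, next_host, send_command) and the idea of one crawler_act row per crawler are assumptions for this example, not the crawler's actual code.

```python
import sqlite3

# Codes taken from the field descriptions above.
TO_INDEX, INDEXED, INDEXING = 0, 1, 2   # hostlist.status
PLAY, EXIT, PAUSE = 0, 1, 2             # crawler_act.act

def make_hosts_db():
    """Build a throwaway stand-in for ows_hosts (column types are assumptions)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE hostlist (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            hostname TEXT, port INTEGER, status INTEGER, priority INTEGER
        );
        CREATE TABLE crawler_act (crawler_id INTEGER, act INTEGER);
    """)
    return db

def next_host(db):
    """Return (id, hostname, port) of the highest-priority host still to index."""
    return db.execute(
        "SELECT id, hostname, port FROM hostlist "
        "WHERE status = ? ORDER BY priority DESC, id ASC LIMIT 1",
        (TO_INDEX,),
    ).fetchone()

def send_command(db, crawler_id, act):
    """Set the pending action for one crawler (assumes its row already exists)."""
    db.execute(
        "UPDATE crawler_act SET act = ? WHERE crawler_id = ?", (act, crawler_id)
    )

db = make_hosts_db()
db.executemany(
    "INSERT INTO hostlist (hostname, port, status, priority) VALUES (?, ?, ?, ?)",
    [("a.example", 80, TO_INDEX, 1),
     ("b.example", 80, TO_INDEX, 5),
     ("c.example", 80, INDEXED, 9)],   # already indexed, so never selected
)
db.execute("INSERT INTO crawler_act (crawler_id, act) VALUES (1, ?)", (PLAY,))

print(next_host(db))        # (2, 'b.example', 80)
send_command(db, 1, PAUSE)  # ask crawler 1 to pause
```

Against a real MySQL server the same SELECT and UPDATE statements should carry over, with the placeholder style adjusted to the MySQL driver in use.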
Database (2): ows_index:
(3 new tables added with OpenWebSpider# v0.1.4)
Tables:
- pages: this is the index itself; it contains all the indexed pages
Fields:
- id [auto_increment]
- host_id
- hostname
- page
- title [page <TITLE>]
- anchor_text [the first anchor text used to link this page]
- text [the text extracted from the HTML]
- cache [the HTML of the page]
- html_md5 [the MD5 of the page (used to store only unique pages)]
- level [depth level]
- rank
- date
- time
- images: this is a new table fully supported in OpenWebSpider# v0.1.4 that stores images found while crawling web pages
Fields:
- id [auto_increment]
- src_host_id
- src_page [src_host_id + src_page are the URL where the image is found]
- image_host_id
- image [image_host_id + image are the URL of the image]
- alt_text
- title_text
- mp3: the table with the MP3 info (this is a new table introduced with OpenWebSpider# v0.1.4)
Fields:
- id [auto_increment]
- host_id
- filename [host_id + filename are the URL of the MP3]
- mp3_size [size in bytes]
- mp3_artist
- mp3_title
- mp3_album
- mp3_genre
- mp3_duration [in seconds]
- pdf: the table with the text of the PDFs; this text is also indexed in the table pages (this is a new table introduced with OpenWebSpider# v0.1.4)
Fields:
- id [auto_increment]
- host_id
- filename [host_id + filename are the URL of the PDF]
- pdf_size [size in bytes]
- pdf_text
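The html_md5 field in the pages table is described as a way to store only unique pages. A minimal Python sketch of that idea follows, again using an in-memory SQLite table as a stand-in for ows_index. The helper names (make_index_db, store_if_unique), the reduced column set, the choice to hash the UTF-8 encoded HTML, and the choice to check uniqueness across all hosts are assumptions made for illustration; the crawler may do any of these differently.

```python
import hashlib
import sqlite3

def make_index_db():
    """Build a throwaway stand-in for the pages table (only a few columns)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE pages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " host_id INTEGER, page TEXT, cache TEXT, html_md5 TEXT)"
    )
    return db

def store_if_unique(db, host_id, page, html):
    """Insert the page unless a page with the same HTML digest is already stored."""
    digest = hashlib.md5(html.encode("utf-8")).hexdigest()
    seen = db.execute(
        "SELECT 1 FROM pages WHERE html_md5 = ? LIMIT 1", (digest,)
    ).fetchone()
    if seen:
        return False  # duplicate content, skip it
    db.execute(
        "INSERT INTO pages (host_id, page, cache, html_md5) VALUES (?, ?, ?, ?)",
        (host_id, page, html, digest),
    )
    return True

index_db = make_index_db()
print(store_if_unique(index_db, 1, "/", "<html>hello</html>"))            # True
print(store_if_unique(index_db, 1, "/index.html", "<html>hello</html>"))  # False
```

The second call returns False because both URLs serve byte-identical HTML, so only the first one ends up in the table.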