Archive for the 'OpenWebSpider' Category

OpenWebSpider# v0.1.4

Released OpenWebSpider# v0.1.4, now with MP3 and PDF support! New tables have been added; please refer to this page to learn more: Database Structure

This is the complete CHANGELOG:

  • MySQL Connector/NET upgraded to 5.2.5.0
  • Improved encoding support
  • New feature: Support for META “robots” (NOINDEX, NOFOLLOW)
  • New feature: New configuration file field: crawler_id
  • New field “crawler_id” in table “hostlist”
  • New table: crawler_act
  • New feature: Remote actions over running crawlers [Status, Play, Pause, Kill]
  • New file support: PDFs [Using PDFBox and IKVM]
  • New table: pdf
  • New file support: MP3s [Using UltraID3Lib]
  • New table: mp3
  • New feature: new command-line argument: --pdf
  • New feature: new command-line argument: --mp3
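
To illustrate what honoring META “robots” (NOINDEX, NOFOLLOW) involves, here is a minimal sketch in Python. It is not taken from the OpenWebSpider# source (which is C#); the class and function names are hypothetical, and it assumes that directives are comma-separated and case-insensitive.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration; not part of OpenWebSpider#.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() != "robots":
            return
        # Assumption: directives are comma-separated and case-insensitive.
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html):
    """Return (may_index, may_follow) for one HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


page = '<html><head><meta name="ROBOTS" content="NOINDEX, follow"></head></html>'
print(robots_meta_policy(page))  # (False, True)
```

A crawler following this policy would skip storing the page when may_index is False and would not queue the page's links when may_follow is False.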

Go to the DOWNLOAD page

OpenWebSpider# explained with 4 videos: Compile, Configure and RUN!

OpenWebSpider Shen139 07 May 2009 5 Comments

OpenWebSpider# v0.1.3

Released OpenWebSpider# v0.1.3

CHANGELOG:

  • New feature: CRAWLER NAME and CRAWLER VERSION used in the User-Agent string in HTTP Requests
  • New feature: New configuration file field: sql_hostlist_where
  • New feature: new command-line argument: --keep-dup
  • BUG: fixed the regex used to extract URLs from <BASE>
  • BUG: fixed a bug in the function that extracts URLs
  • BUG: fixed a bug in page.cs::normalizePage()
  • BUG: fixed minor bugs
  • BUG: fixed a bug in robots.txt’s parser
  • BUG: fixed a bug in page-rels handler
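
As a rough illustration of putting the crawler name and version into the User-Agent header, here is a minimal Python sketch. It is not the OpenWebSpider# code: the "name/version" layout, the function names, and the "ExampleSpider" value are all assumptions made for the example.

```python
import urllib.request


def build_user_agent(crawler_name, crawler_version):
    """Compose a User-Agent value.

    Assumption: a "name/version" product token; the exact string that
    OpenWebSpider# sends is not documented here.
    """
    return f"{crawler_name}/{crawler_version}"


def build_request(url, crawler_name, crawler_version):
    """Create a request object that carries the crawler's User-Agent."""
    user_agent = build_user_agent(crawler_name, crawler_version)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Building the request does not contact the network.
request = build_request("http://example.com/", "ExampleSpider", "0.1.3")
print(request.get_header("User-agent"))  # ExampleSpider/0.1.3
```

Identifying the crawler this way lets site owners recognize its requests in their logs.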

Source code and binary are available in the package: Download

Documentation of OpenWebSpider# v0.1

News & OpenWebSpider & Release Shen139 05 Nov 2008 2 Comments

OpenWebSpider# v0.1.2

Released OpenWebSpider# v0.1.2

CHANGELOG:

  • BUG: fixed the regex used to extract URLs from (I)FRAME
  • New feature: OpenWebSpider# can index images (new table: images)
  • New feature: new command-line argument: --images
  • Improved stress-test facility: in stress-test mode, OpenWebSpider# no longer requires a configuration file or a MySQL server, and it does not check robots.txt
  • Timeout in execution of SQL queries set to 120 seconds (2 minutes)
  • New feature: new configuration file fields: CRAWLER NAME and CRAWLER VERSION
  • New feature: CRAWLER NAME is used when checking robots.txt
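
To show what using the crawler name with robots.txt can look like, here is a minimal Python sketch built on the standard library's robots.txt parser. It is only an illustration, not the OpenWebSpider# parser; the function name, the "ExampleSpider" crawler name, and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, crawler_name, url):
    """Check whether crawler_name may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(crawler_name, url)


# Hypothetical rules: one group for a named crawler, one for everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleSpider
Disallow: /private/

User-agent: *
Disallow:
"""

print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleSpider", "http://example.com/private/page.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "OtherBot", "http://example.com/private/page.html"))
```

With these sample rules the first check is refused and the second is permitted, because only the group naming the crawler disallows /private/.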

Source code and binary are available in the package: Download

Documentation of OpenWebSpider# v0.1

OpenWebSpider Shen139 09 Sep 2008 24 Comments

OpenWebSpider# v0.1.1

Released OpenWebSpider# v0.1.1

CHANGELOG:

  • New feature: new command-line argument: --req-timeout
  • New feature: new command-line argument: --stress-test
  • BUG: fixed a bug in http.cs::getURL()

[New features here: OpenWebSpider# v0.1 Command Line Arguments/Usage]

[ Read more about: Why C#? Why .NET Framework? ]

Source code and binary are available in the package: Download

Documentation of OpenWebSpider# v0.1

News & OpenWebSpider & Release Shen139 21 Aug 2008 22 Comments

OpenWebSpider# v0.1

Released the first public version of OpenWebSpider entirely written in C#
[ Read more about: Why C#? Why .NET Framework? ]

Source code and binary are available in the package: Download

Documentation of OpenWebSpider# v0.1

News & OpenWebSpider & Release Shen139 29 Jul 2008 19 Comments