19 Responses

  1. SHL says:

    Just installed the # v0.1 version, and it looks very promising!

    I’m using Sphinx ( http://www.sphinxsearch.com/ ) for indexing and had a hard time complementing it with a fast, easily configured, multi-threaded crawler with a MySQL backend. Then I found this, which really seems to fill that need.

    Keep up the good work!

    Best Regards
    // Samuel

  2. amando says:

    I have a CentOS 5 server with Mono installed, and the spider does not seem to work on it.

    The spider gets stuck at this:

    [root@sphinx bin]# mono OpenWebSpiderCS.exe --index http://www.geniecomm.com
    OpenWebSpider#(v0.1a) [ http://www.openwebspider.org/ ]
    Developed by Shen139 aka Stefano Alimonti

    + Parsing Command Line Arguments…
    + Parsing Configuration File [ openwebspider.conf ]
    – Database[1]: ows_hosts
    – Server[1]: localhost
    – Server[1] port: 3306
    – Username[1]: root
    – Password[1]: ******
    – Database[2]: ows_index
    – Server[2]: localhost
    – Server[2] port: 3306
    – Username[2]: root
    – Password[2]: ******

    + Connecting to MySQL Server [localhost]; DB [ows_hosts]
    + Connecting to MySQL Server [localhost]; DB [ows_index]

    I have left it running for hours and nothing happens; there are no errors. What could be the reason?

    Regards & ciao!
    Amando

  3. Shen139 says:

    Are you sure you are running the latest version of Mono?

    There are 2 known possible errors when trying to connect to MySQL servers:
    1) The MySQL server is not running, and you’ll get an error like this:
    –[
    + Connecting to MySQL Server [127.0.0.1]; DB [ows_hosts]
    Unable to connect to any of the specified MySQL hosts.
    + Connecting to MySQL Server [127.0.0.1]; DB [ows_index]
    Unable to connect to any of the specified MySQL hosts.
    – Error(2) while trying to connect to one or more mysql server.
    ]–

    2) You don’t have MySql.Data.dll in the same path where OpenWebSpider is running, and you should get something like this:
    –[
    + Connecting to MySQL Server [127.0.0.1]; DB [ows_hosts]

    ** (OpenWebSpiderCS.exe:18373): WARNING **: The following assembly referenced from /home/linux/Projects/OpenWebSpiderCS/bin/OpenWebSpiderCS.exe could not be loaded:
    Assembly: MySql.Data (assemblyref_index=2)
    Version: 5.1.6.0
    Public Key: c5687fc88969c44d
    The assembly was not found in the Global Assembly Cache, a path listed in the MONO_PATH environment variable, or in the location of the executing assembly (/home/linux/Projects/OpenWebSpiderCS/bin/).

    ** (OpenWebSpiderCS.exe:18373): WARNING **: Could not load file or assembly ‘MySql.Data, Version=5.1.6.0, Culture=neutral, PublicKeyToken=c5687fc88969c44d’ or one of its dependencies.

    ** (OpenWebSpiderCS.exe:18373): WARNING **: Missing method .ctor in assembly /home/linux/Projects/OpenWebSpiderCS/bin/OpenWebSpiderCS.exe, type MySql.Data.MySqlClient.MySqlConnection

    Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly ‘MySql.Data, Version=5.1.6.0, Culture=neutral, PublicKeyToken=c5687fc88969c44d’ or one of its dependencies.
    File name: ‘MySql.Data, Version=5.1.6.0, Culture=neutral, PublicKeyToken=c5687fc88969c44d’
    at OpenWebSpiderCS.ows.mysqlConnect () [0x00000]
    at OpenWebSpiderCS.mainClass.Main (System.String[] args) [0x00000]
    ]–

    Here is another possible case: when you use an invalid MySQL login, you get:
    –[
    + Connecting to MySQL Server [127.0.0.1]; DB [ows_hosts]
    Access denied for user ‘root’@’localhost’ (using password: YES)
    + Connecting to MySQL Server [127.0.0.1]; DB [ows_index]
    Access denied for user ‘root’@’localhost’ (using password: YES)
    – Error(2) while trying to connect to one or more mysql server.
    ]–

    When openwebspider successfully connects to mysql it writes:
    –[
    + Connecting to MySQL Server [127.0.0.1]; DB [ows_hosts]
    + Connecting to MySQL Server [127.0.0.1]; DB [ows_index]
    – Connected to both MySQL Servers
    ]–
    So it seems that on your box the spider doesn’t connect to the MySQL server at all… but it’s very strange that you don’t get any error or warning.
    Do you have a firewall or something else that blocks the connection from the spider to your MySQL server? (Maybe the spider is waiting for a reply from the server.)
    If not, I can only suggest that you upgrade to the latest version of Mono!
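
    To tell a silently dropped connection (for example, a firewall) apart from a server that is simply not running, a quick TCP check with a timeout can help. The following is only a sketch, not part of OpenWebSpider: the function name probe_tcp and the 5-second default are assumptions made for illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection raises if the handshake fails or exceeds the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening on that port (e.g. the MySQL server is down).
        return "refused"
    except socket.timeout:
        # No answer at all: packets may be dropped somewhere on the way.
        return "timeout"
    except OSError:
        # Any other network-level failure (unreachable host, DNS error, ...).
        return "error"
```

    Calling probe_tcp("localhost", 3306) against the settings shown in the log above would be one way to use it: "timeout" would point towards a firewall, "refused" towards a stopped server, and "open" towards a problem above the TCP level.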

  4. amando says:

    Yes, updating mono made it work.

    I am still getting one error when indexing:

    + Deleting duplicated pages…
    Unable to execute SQL Query: DELETE FROM pages WHERE host_id = 23 AND id NOT IN ( SELECT id FROM view_unique_pages WHERE host_id = 23 )
    Mysql1 Connected: True
    Mysql1 Connected: True
    Error [executeSQLQuery()]: There is no ‘root’@’%’ registered

    Is this a common issue, or something I need to fix?

    Regards!
    Amando

  5. Shen139 says:

    It seems to be a MySQL bug.
    I’ve found this article that could help you fix the problem:
    http://bugs.mysql.com/bug.php?id=16589
    My suggestion: upgrade to the latest version of MySQL ( http://dev.mysql.com/downloads/ )
    and create the tables and views again!

  6. joshua.chi says:

    Help!
    I am new to C# and .NET. I changed something in the OpenWebSpider source files and want to compile them into an .exe file again. How can I do this? Should I use Mono?
    Please give me some suggestions. Many thanks!

  7. Shen139 says:

    If you are trying to compile owsCS under Linux, the best way is to use MonoDevelop!
    Run MonoDevelop -> Open a Solution / File -> select the file “OpenWebSpiderCS.mds”

    Now you can edit the source code and compile it!
    The executable built with MonoDevelop works on both Windows and Linux!

  8. joshua.chi says:

    Hi Shen,
    Thanks for your reply!
    I have downloaded OpenWebSpider v0.7, and it seems it is not a C# project but a C project. Can I build this version using MonoDevelop?

    BTW, I am a Windows XP user. Which tool can I use to build the source files?
    Many thanks!

  9. webie says:

    Hi, the new version in C# is looking good and feels very stable. But I just want to see if anybody else is having problems on CentOS 5 and Windows: the spider runs for a couple of minutes, and then all I get is

    ‘ HTTP Status Code: 0 -] Error: Response received from server was null ‘
    I left it for over 20 minutes with the same response, stopped the crawl and tried again on CentOS and Windows, and it is still the same every time I run the crawler.

    The error seems to happen when OWS hits an xml, zip or gz file; after this it seems to stop indexing and returns the HTTP Status Code: 0 -] Error: Response received from server was null error.

    Kind Regards

    Darren

  10. Shen139 says:

    Hi,
    Yes, OWS v0.7 is a C project.
    I suggest you compile it with Microsoft Visual C++ (the Express edition is free to use and to download from the MS website).

    “openwebspider.vcproj” is the project file for OWS v0.7.

    MonoDevelop is an IDE for building .NET projects under Linux!

  11. Shen139 says:

    OpenWebSpiderCS v0.1 has a hard-coded timeout of 60 seconds for each request.
    If an HTTP request takes longer than that, the crawler will ignore that page.
    (I’ve added to my TODO list a new argument for specifying that timeout via the command line! Example: --timeout 180)

    Another problem could be that the web server is under load and can’t serve pages within 60 seconds!
    I suggest you:
    – use a crawl delay (--crawl-delay): the number of seconds to wait between downloading one page and the next
    – use a lower number of threads

    When OWS hits files with an unhandled content-type, it doesn’t download them at all, so I think that your case is only a coincidence.
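
    For readers who want to see how a per-request timeout, a crawl delay and a content-type filter fit together, here is a rough sketch in Python. It is not the OpenWebSpiderCS code: the names is_indexable and crawl, and the list of indexable types, are assumptions made for illustration. Note also that the timeout passed to urlopen applies to each blocking socket operation, not to the request as a whole.

```python
import time
import urllib.request

REQUEST_TIMEOUT = 60  # seconds; mirrors the 60-second limit mentioned above
INDEXABLE_TYPES = ("text/html", "text/plain")  # assumed set, not taken from OWS


def is_indexable(content_type: str) -> bool:
    """Return True if the Content-Type header names a type we choose to index."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in INDEXABLE_TYPES


def crawl(urls, crawl_delay=1.0, timeout=REQUEST_TIMEOUT):
    """Fetch each URL in turn, skipping failed requests and unhandled types."""
    pages = {}
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                content_type = response.headers.get("Content-Type", "")
                # Only read the body when the type is one we want to index.
                if is_indexable(content_type):
                    pages[url] = response.read()
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            print(f"skipping {url}: {exc}")
        # Wait between one download and the next to go easy on the server.
        time.sleep(crawl_delay)
    return pages
```

    With this filter, a response such as application/x-iso9660-image is skipped without reading its body, while text/html; charset=UTF-8 is kept.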

  12. webie says:

    Hi Shen,

    Going back to my post: when OWS sees a zip file, etc., it gives me ‘ HTTP Status Code: 0 -] Error: Response received from server was null ‘, then stops crawling, and every link returns the same error.

    I have reduced the threads to only 6 and also set a crawl delay, but it still will not index after hitting a zip file, ISO file, etc.

    These are my arguments:

    --index http://www.fedora.org --threads 6 --add-external --crawl-delay 1

    Maybe you can point me in the right direction on how to find what may be causing this error.

    Regards

    Darren

  13. Shen139 says:

    Hi,
    it’s impossible that the error is on http://www.fedora.org, because that site has a robots.txt that disallows everything, so you have to tell me the exact website that makes OWS stop working. (Please use -s (single host mode) to perform tests.)
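
    As an aside, the effect of a robots.txt that disallows everything can be checked with Python’s standard urllib.robotparser module. This is only a sketch: the robots.txt text and the example.com URL below are made-up samples, not the real file of any site mentioned here, and the helper name allowed is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt that blocks every crawler from every path (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

    With the sample above, allowed(ROBOTS_TXT, "AnyBot", "http://www.example.com/") is False, so a crawler that honours robots.txt never requests a single page from such a site.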

    I’ve tested the crawler (a version with small modifications compared to the public one) on “download.opensuse.org” (where there are a lot of unhandled content-types (ISO, torrent, …) )

    command line arguments:
    --index http://download.opensuse.org/ --threads 6 --add-external -s

    –[

    T[0] Downloading… [ http://download.opensuse.org:80/distribution/11.0/iso/dvd/openSUSE-11.0-DVD-i386.iso ] [Depth Level: 1]
    T[0] Downloaded 0 Kb (0 bytes) in 4609 ms
    T[0] HTTP Status Code: 200 -][- Content-Type: application/x-iso9660-image
    T[0] Indexing…NOT INDEXED [0 ms ]

    T[2] Downloading… [ http://download.opensuse.org:80/distribution/11.0/iso/cd/openSUSE-11.0-GNOME-LiveCD-i386.iso ] [Depth Level: 1]
    T[2] Downloaded 0 Kb (0 bytes) in 53015 ms
    T[2] HTTP Status Code: 200 -][- Content-Type: application/octet-stream
    T[2] Indexing…NOT INDEXED [0 ms ]

    T[2] Downloading… [ http://download.opensuse.org:80/distribution/11.0/iso/torrent/openSUSE-11.0-DVD-i386.torrent ] [Depth Level: 1]
    T[2] Downloaded 0 Kb (0 bytes) in 56609 ms
    T[2] HTTP Status Code: 200 -][- Content-Type: application/x-bittorrent
    T[2] Indexing…NOT INDEXED [0 ms ]

    T[2] Downloading… [ http://download.opensuse.org:80/distribution/11.0/iso/dvd/openSUSE-11.0-DVD-x86_64.iso.metalink ] [Depth Level: 2]
    T[2] Downloaded 0 Kb (0 bytes) in 953 ms
    T[2] HTTP Status Code: 200 -][- Content-Type: application/metalink+xml; charset=UTF-8
    T[2] Indexing…NOT INDEXED [0 ms ]

    ]–
    everything seems to work well!!!

    (few minutes later)

    oh s..t!
    I’ve tested the same website with OpenWebSpiderCS v0.1 and it doesn’t work well 🙂
    I’ll fix the code, add 2 new features and publish it!!!

    Thanks for the suggestion!
    😉

  14. webie says:

    Hi Shen,

    I thought it was me going mad. Many thanks, I look forward to the fix.

    regards

    darren

  15. webie says:

    Hi shen,

    I forgot to say that I had a new front end made for OWS which uses Sphinx. It makes a powerful alternative to the standard search, and it’s free. I was trying to give back, and I was wondering if you would like to take a look and maybe add it to the OWS site as an add-on for OWS users. The search demo for OWS & Sphinx is here: http://www.linuxhostuk.co.uk/searchdemo . Let me know your thoughts.

    Regards

    Darren

  16. Shen139 says:

    Hi,
    I really like Sphinx!
    It’s a hundred times faster than the MySQL Full-Text Index, and I recommend it to everyone who needs good performance over a big index!

    I think that http://www.linuxhostuk.co.uk/searchdemo really rocks 🙂 it’s fast and cool!
    I’m interested in your front end, and I’ll be very happy to publish it here!

    Please contact me @ shen139 (at) openwebspider [.] org

  17. webie says:

    Hi Shen,

    I have no problem if you want to add the front-end source code to your project.

    When I get time, I want to create some ready-made Smarty templates for the OWS front end. Putting all this together would make a great free search engine project.

    Regards

    Darren

    PS Big thanks to you, Shen, for creating OWS. Without your skills and the time you give up, we could not follow our online adventures!

  18. Shen139 says:

    Hi,
    call me Stefano! That’s my name 🙂

    You have my email; whenever you want, you can send me your work and I’ll publish it!

    😉

  19. neddy says:

    webie, with regard to Sphinx: I have been using OWS + Sphinx since about OWS v1.1, some time early last year, and it works great! If any of you guys need a hand configuring your Sphinx setups with OWS, flick me an email and I’ll be happy to help…

    If you want to see a working search that is indexed by Sphinx and spidered by OWS, check out http://www.hfvs.net/ . Also, the database that I use for sites to spider can be found at http://cyberdawn.w3dt.net/ ; it contains about 90 million TLDs.