OpenWebSpider# v0.1.1

Released OpenWebSpider v0.1.1

CHANGELOG:

  • New feature: new command-line argument: --req-timeout
  • New feature: new command-line argument: --stress-test
  • BUG: fixed a bug in http.cs::getURL()

[New features here: OpenWebSpider# v0.1 Command Line Arguments/Usage]

[ Read more about: Why C#? Why .NET Framework? ]

Source code and binary are available in the package: Download

Documentation of OpenWebSpider# v0.1

You may also like...

22 Responses

  1. webie says:

    Hi Shen,

    I ran the new version 1.1 using the commands from the previous post on a Windows machine. The spider ran well, but after about five minutes OWS hit some patch-file listings on SUSE; OWS went crazy and displayed an unreadable screen. I could not issue the stop command to OWS, and Ctrl+Alt+Del did not work either, so I had to reboot Windows. I managed to grab some screenshots before powering off. Please take a look at the screenshots at #bargainshack-dot-co-dot-uk/ows.

    Regards

    Darren

    Command used
    --index http://download.opensuse.org/ --threads 6 --add-external -s

  2. Shen139 says:

    Yes, I experienced the same error on that website!
    (I can’t run a test right now) but when I saw that error I noticed that the spider had tried to index pages with a strange content-type!
    OWS by default indexes only pages with content-type starting with “text” like:
    text/plain; text/html; text/xml, …

    I think that OWS tried to index something marked as text but containing binary data.
    (When I got that error I read an SQL error and then I heard many beeps! I think that’s because OWS printed binary data to the console… but I’m not sure….)

    I’ll check this later in the afternoon and I’ll reply here!!!
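The content-type check described here can be sketched as follows. This is a minimal illustration in Python (OWS itself is written in C#, so this is not the actual code), assuming the header value has already been read from the HTTP response:

```python
def is_indexable(content_type):
    """Return True when a Content-Type header denotes textual content.

    Only media types starting with "text" (text/plain, text/html,
    text/xml, ...) are accepted for indexing, as described above.
    """
    if content_type is None:
        return False
    # Strip parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type.startswith("text")
```

As the next comments show, this filter is not enough on its own: a server can send a binary file with a "text/plain" header, and the check above will happily let it through.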

  3. Shen139 says:

    The problem is exactly what I thought!!!

    The crawler correctly downloads this page: http://download.opensuse.org/update/11.0/driverupdate and tries to index it (because its content-type is “text/plain”), but it isn’t “text/plain” at all… if you download it and open it with a hex editor you will see that it’s a binary file….

    OK! It’s a binary file instead of the expected text file… but what’s the problem?
    The problem is that when OWS executes a SQL query and gets an error, it prints to the console the SQL query that caused the error. In this case the query contained binary data, including bell characters ( http://en.wikipedia.org/wiki/Bell_character ), which slow down printing to the console and therefore the execution of the crawler.

    In other words: OWS works correctly but is slowed down by the bell characters.

    The solution:
    I’m working to remove the error messages sent to the console and start using a log file!!!

  4. Shen139 says:

    I’ve fixed this kind of error by pre-processing all downloaded pages and removing all unwanted characters! (This slows down the crawler a little, but it is doubtless needed.)

    I’ll release the source-code and the binary tomorrow!
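The pre-processing step mentioned in this comment could look roughly like the sketch below, in Python rather than OWS’s actual C#: it strips control characters, including the bell character (0x07) that slowed the console down, while keeping ordinary whitespace.

```python
def strip_control_chars(page):
    """Remove unwanted control characters from a downloaded page.

    Keeps ordinary whitespace (tab, newline, carriage return) but drops
    every other character below 0x20, including the bell character
    (0x07) that slowed the console output down.
    """
    allowed = {"\t", "\n", "\r"}
    return "".join(c for c in page if c >= " " or c in allowed)
```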

  5. webie says:

    Hi Stefano,

    Many thanks. I also put up another screenshot, from a test Linux crawl where OWS crashed and output some strange stuff. I hope you don’t think I’m being a pain; I’m giving the C# OWS a really good test, trying to post back what happens, and dumping screenshots to help you. Anyway, OWS has now been running on the Linux box for two hours without any problems.

    I have added new screen shot to the directory called Screenshot-ows-crash-linux.png.

    Kind Regards

    Darren

  6. Shen139 says:

    Hi,
    I can’t publish a new release of OpenWebSpider because this version doesn’t have bugs!
    As I tried to explain in the previous comment, OWS downloaded a page that contained binary data and tried to index it, but the data was too long to index; it then printed to the console an error saying that the last SQL query failed, and that error contained many bell characters that slowed down the crawler and caused many other problems.

    I’ve published the updated source code in the public folder: http://www.openwebspider.org/public/source-code/ows/OpenWebSpiderCS_v0.1/

    I’ve added a line in the CHANGELOG: http://www.openwebspider.org/public/source-code/ows/OpenWebSpiderCS_v0.1/CHANGELOG.txt
    – New feature: pre-processing all pages by removing all unwanted characters

    and I’ve created a bin folder with the executable of the new version of the crawler!

    It should work! It removes all binary data from pages!

  7. webie says:

    Hi Stefano,

    Just wanted to bounce some ideas off you for the new OWS: is it possible to move some settings to the config file, like setting the crawler name and crawling per TLD (.uk, .com, etc.)?

    Also, could you add some code so that if no page title is present a default title is used (e.g. “No Title”)? This setting could even go in the config file.

    Also, is it possible to run OWS on more than one server sharing the same database and config file? Does OWS lock the record it’s reading? If not, could we have this feature?

    Regards

    Darren

  8. Shen139 says:

    What’s the need to change the crawler name?

    Right now, if the page title is not present the indexer uses the anchor text!
    I think that you could print “No Title” when you show the results in your search engine: if ( $title == '' ) echo 'No Title';
    Do you understand?

    Yes, it’s possible to run many crawlers at the same time.
    The spider locks the first available URL, so other crawlers will index other URLs!

    Stefano
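The claim-the-first-available-URL scheme described in this comment can be sketched with a claim-by-update pattern, so that several crawlers sharing one database never pick the same row. This is an illustrative Python/SQLite sketch, not OWS’s actual SQL; the hostlist layout and the “status” column are assumptions made for the example.

```python
import sqlite3

def claim_next_host(conn, crawler_id):
    """Atomically claim the first unclaimed host for this crawler.

    Illustrative sketch only: the hostlist layout and the "status"
    column are assumptions, not OWS's real schema.
    """
    cur = conn.cursor()
    # A single UPDATE both picks and marks the row, so two crawlers
    # running concurrently cannot claim the same host.
    cur.execute(
        "UPDATE hostlist SET status = ? "
        "WHERE id = (SELECT id FROM hostlist "
        "            WHERE status = 'ready' ORDER BY id LIMIT 1)",
        (crawler_id,),
    )
    conn.commit()
    cur.execute(
        "SELECT id, hostname FROM hostlist WHERE status = ?", (crawler_id,)
    )
    return cur.fetchone()  # None when no host is left to claim
```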

  9. webie says:

    Hi Stefano,

    I’m getting lots of timeout errors on all MySQL servers. My MySQL settings are OK; this is what I am getting:

    Unable to execute SQL Query: DELETE FROM pages WHERE host_id = 48133 AND id NOT IN ( SELECT id FROM view_unique_pages WHERE host_id = 48133 )
    Mysql1 Connected: True
    Mysql1 Connected: True
    Error [executeSQLQuery()]: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

    Regards
    Darren

  10. Shen139 says:

    Hi,
    by default the timeout is set to 30 seconds,
    but maybe this value is too low if you are indexing thousands of pages!
    I’ll increase it to 120 seconds (2 minutes)

  11. webie says:

    Hi Stefano,

    Makes sense. I have 44 million records in the database and am doing tests on the DMOZ importer; the script has been modified and now imports the content RDF database directly into pages & hostlist.

    Do I need to set the number of indexed pages to a default of 1 in the hostlist, or does this value not matter?

    Also, OWS now runs for five minutes and then gets an error about adding external hosts, which I already did with -e. Maybe a bug, but I am not sure, as I had a lot of problems with Mono on CentOS 5.2. Do you have a guide on what I need for Mono? CentOS comes with version 1.2 and the latest is 1.9 (el4), but I have to force the install or I get lots of dependency errors.

    Regards

    Darren

  12. Shen139 says:

    No, the table hostlist uses an auto-number and you don’t need to change it!
    What you need to keep in mind is that in the table pages there is a field called “host_id” that is the ID of the hostname in the table hostlist!!!

    What error?
    Maybe you don’t have enough RAM? External hosts and rels are stored in memory… and if you index very huge sites you need a good amount of RAM!

    I don’t know how to setup latest version of mono under CentOS!
    I’m sure you will find everything with Google!

  13. webie says:

    Hi Stefano,

    I have 4 GB of RAM installed but have locked MySQL into memory with memlock. Does OWS use swap memory? Maybe I’ll take MySQL out of the lock and try again.

    Regards

    Darren

  14. Shen139 says:

    OWS uses its own memory, not MySQL’s!

    Please give me the error you get!

  15. cangieit says:

    Hi Stefano,
    nice project….. I was using an older release, 0.7; now I’ve downloaded 1.1, but while connecting to MySQL the connection goes into sleep. Any suggestions?

    Thanks, bye, and greetings to everyone……
    max

  16. Shen139 says:

    Hi,
    (how nice… a comment in Italian 🙂 )
    When problems like these occur, the cause usually hides in Mono.
    Are you sure you have the latest version?
    If you use the spider under Windows and have the .NET Framework version 2.0 or later, you’re fine!

    If the cause isn’t Mono, then maybe it’s MySQL!
    The connector OWS uses is the latest version (as of the spider’s release), and if you have a somewhat old version of MySQL there may be backwards-compatibility problems.

    Do you get any error, or does it just hang?

    Stefano

  17. cangieit says:

    Hi Stefano,
    a bit of non-English every now and then isn’t bad…. 😉
    Let me explain better, since I was very vague:

    CentOS 5, ( CentOS (2.6.18-92.1.10.el5))

    mono-core 1.2.4-2.el5.centos The Mono CIL runtime, suitable for running .NET code
    mono-data 1.2.4-2.el5.centos Database connectivity for Mono
    mono-devel 1.2.4-2.el5.centos Development tools for Mono
    mono-web 1.2.4-2.el5.centos ASP.NET, Remoting, and Web Services for Mono
    mono-winforms 1.2.4-2.el5.centos Windows Forms implementation for Mono

    mysql-server 5.0.45-7.el5

    This is what I have….. the strange thing is that it hangs and stays stuck; looking at MySQL, the connection stays in sleep and sits there….. forever, without giving any explicit error….

    Thanks, bye
    Max

  18. Shen139 says:

    At a glance, to me this is a Mono problem!
    Try upgrading to version 1.9.1!
    The thing is that Mono is a great project, but unfortunately it doesn’t track the development of Microsoft’s .NET Framework very closely… especially in the older versions. The latest versions are very stable and adhere closely to the standards!

    Many people have written to me about various problems with old versions of Mono… problems solved 100% with the latest available stable version!

    When I develop, I try to use packages updated to the very latest versions… but this creates a lot of problems for people who have old versions on which I have never tested the spider.

    Try it and let me know!

    Ciao

  19. hilkiah says:

    Hi Shen139,

    I’ve just stumbled upon OpenWebSpider and I’m hoping that it’ll serve as a replacement for Nutch (crawler/indexer functionality).

    My question is… how can I crawl multiple pages? Say I have a list of 1000 URLs to crawl; to put it another way (the Nutch way), how can I do a web crawl with OWS?

    Secondly, OWS only parses HTML/text files, so if someone wanted to assist the project by writing a parser for another file format (say PDF), would they HAVE to use C#?

    Thanks,
    Hilkiah

  20. cangieit says:

    Hi Stefano,
    PERFECT, you were right, the problem was in Mono….. I couldn’t resolve the dependency mess, so Centos5 -> Centos4 and it’s perfect. ;;;)))) Congratulations again.

    The only thing is that at the end of the scan, after the “Bye Bye”, I get “press a key to continue”. How can I remove it? Is it possible to avoid it?

    Thanks again, bye
    Max

  21. Shen139 says:

    Hi hilkiah,
    OWS uses a table (hostlist) where all the domains you want to index are stored,
    for example: http://www.example.com, http://www.test.com, ….
    You can simply add all the domains you want to index to that table and run
    openwebspider on the first of the list.
    To add hostnames you can use the following command:
    openwebspider.exe --index http://www.example.com --add-hostlist
    (this command won’t index http://www.example.com)

    The easy way to index a single website is:
    openwebspider.exe --index http://www.example.com -s

    To index a single page:
    openwebspider.exe --index http://www.example.com/page.html -m 0 -s

    Please refer to this page for a complete how-to:
    http://www.openwebspider.org/documentation/openwebspider-v01/openwebspider-v01-command-line-argumentsusage/

  22. Shen139 says:

    Hi Max,
    I’m glad you solved the problem!

    As for having to press a key at the end, I’ll admit I put it there more for debugging than anything else… I’m about to release version 0.1.2 (with interesting new features such as image indexing and more….) and I’ll definitely remove that annoyance for you!

    Thanks for the report!

    Stefano