Released OpenWebSpider v0.1.1
CHANGELOG:
- New feature: new command-line argument: --req-timeout
- New feature: new command-line argument: --stress-test
- BUG: fixed a bug in http.cs::getURL()
[New features here: OpenWebSpider# v0.1 Command Line Arguments/Usage]
[ Read more about: Why C#? Why .NET Framework? ]
Source code and binary are available in the package: Download
Documentation of OpenWebSpider# v0.1
webie responded on 21 Aug 2008 at 1:41 pm #
Hi Shen,
I ran the new version (0.1.1) on a Windows machine using the commands from the previous post. The spider ran well for about five minutes, until OWS hit a list of patch files on the SUSE site. OWS then went crazy and the screen filled with unreadable output. The stop command had no effect, and neither did Ctrl+Alt+Del, so I had to reboot Windows. I managed to grab some screenshots before powering off; please take a look at them at #bargainshack-dot-co-dot-uk/ows.
Regards
Darren
Command used
--index http://download.opensuse.org/ --threads 6 --add-external -s
Shen139 responded on 21 Aug 2008 at 3:48 pm #
Yes, I experienced the same error on that website!
(I can’t run a test right now.) When I saw that error, I noticed that the spider was trying to index pages with a strange content-type.
By default, OWS indexes only pages whose content-type starts with “text”, such as:
text/plain, text/html, text/xml, …
I think OWS tried to index something marked as text but containing binary data.
(When I got that error I saw an SQL error and then heard many beeps. I think that’s because OWS printed binary data to the console, but I’m not sure.)
I’ll check this later this afternoon and reply here!
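The content-type rule described above can be illustrated with a short Python sketch. This is not OWS's C# code: the function names are hypothetical, and the NUL-byte check is an extra heuristic assumed here for illustration, not something the thread says OWS does.

```python
def is_indexable(content_type):
    """Return True when the media type in a Content-Type header starts with 'text'."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text")


def looks_binary(body, sample_size=1024):
    """Assumed heuristic: a NUL byte near the start of the body suggests
    binary data that was served with a text content-type."""
    return b"\x00" in body[:sample_size]
```

A crawler using both checks would skip a page when `is_indexable(...)` is false or `looks_binary(...)` is true, which would catch a binary file served as text/plain.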
Shen139 responded on 21 Aug 2008 at 4:50 pm #
The problem is exactly what I thought!
The crawler correctly downloads this page: http://download.opensuse.org/update/11.0/driverupdate and tries to index it because its content-type is “text/plain”. But it isn’t text at all: if you download it and open it in a hex editor, you will see that it’s a binary file.
OK, it’s a binary file instead of the expected text file. So what’s the problem?
The problem is that when OWS executes an SQL query and gets an error, it prints the failing query to the console. Here the query contains binary data, including the “bell character” ( http://en.wikipedia.org/wiki/Bell_character ), which slows down printing to the console and therefore the execution of the crawler.
In other words: OWS works correctly but is slowed down by the bell characters.
The solution:
I’m working on removing the error messages sent to the console and writing them to a log file instead.
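As a rough illustration of the log-file idea, here is a Python sketch using the standard logging module. It is not the OWS implementation; the logger name, the function names and the 200-character truncation are assumptions made for the example.

```python
import logging


def make_error_logger(path):
    """Create a logger that writes errors to a file instead of the console."""
    logger = logging.getLogger("ows_sketch")  # hypothetical logger name
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep messages away from console handlers
    for old in list(logger.handlers):  # avoid duplicate handlers on re-creation
        logger.removeHandler(old)
        old.close()
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_sql_error(logger, query, error):
    """Log a failed query; repr() escapes control characters such as BEL (\\x07)."""
    logger.error("SQL error: %s | query: %s", error, repr(query[:200]))
```

Because the query is written with `repr()`, raw bell characters never reach a terminal even if the log file is later printed with a tool such as `cat`.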
Shen139 responded on 21 Aug 2008 at 5:31 pm #
I’ve fixed this kind of error by pre-processing every downloaded page and removing all unwanted characters. (This slows the crawler down a little, but it is definitely needed.)
I’ll release the source code and the binary tomorrow!
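The thread does not say exactly which characters the pre-processing step removes. A minimal Python sketch, assuming it targets ASCII control characters (including the bell character, 0x07) while keeping tab, line feed and carriage return, might look like this:

```python
import re

# Assumption: "unwanted" means C0 control characters other than tab (\x09),
# line feed (\x0a) and carriage return (\x0d), plus DEL (\x7f).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_unwanted(text):
    """Return text with the control characters above removed."""
    return _CONTROL_CHARS.sub("", text)
```

The extra pass over every page is the small slowdown mentioned above; a single precompiled regular expression keeps that cost low.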
webie responded on 21 Aug 2008 at 11:04 pm #
Hi Stefano,
Many thanks. I have also put up another screenshot, from a test crawl on Linux: OWS crashed and output some strange stuff. I hope you don’t think I am being a pain; I am giving the C# OWS a really good test, posting back what happens and dumping screenshots to help you. Anyway, OWS has now been running on the Linux box for two hours without any problems.
I have added the new screenshot to the directory; it is called Screenshot-ows-crash-linux.png.
Kind Regards
Darren
Shen139 responded on 22 Aug 2008 at 9:52 am #
Hi,
I can’t publish a new release of OpenWebSpider because this version doesn’t have bugs!
As I tried to explain in the previous comment, OWS downloaded a page that contained binary data and tried to index it, but the data was too long to index. OWS then printed the failing SQL query to the console, and that output contained many bell characters, which slowed down the crawler and caused many other problems.
I’ve published the updated source code in the public folder: http://www.openwebspider.org/public/source-code/ows/OpenWebSpiderCS_v0.1/
I’ve added a line in the CHANGELOG: http://www.openwebspider.org/public/source-code/ows/OpenWebSpiderCS_v0.1/CHANGELOG.txt
- New feature: preprocessing all pages by removing all un-wanted characters
and I’ve created a bin folder with the executable of the new version of the crawler!
It should work! It removes all binary data from pages!
webie responded on 22 Aug 2008 at 1:38 pm #
Hi Stefano,
I just wanted to bounce some ideas off you for the new OWS. Is it possible to move some settings to the config file, such as the crawler name and crawling per TLD (e.g. .uk, .com, bk, etc.)?
Also, could you add some code so that a default title such as “No Title” is used when no page title is present? This could even be a setting in the config file.
Also, is it possible to run OWS on more than one server sharing the same database and config file? Does OWS lock the record it is reading? If not, could we have this feature?
Regards
Darren
Shen139 responded on 23 Aug 2008 at 12:14 pm #
Why do you need to change the crawler name?
Right now, if the page title is missing, the indexer uses the anchor text instead.
I think you could print “No Title” when you show the results in your search engine: if ($title == '') echo "No Title";
Do you understand?
Yes, it’s possible to run many crawlers at the same time.
The spider locks the first available URL, so other crawlers will index other URLs.
Stefano
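How OWS locks a URL internally is not shown in this thread. One common way to let several crawlers share a queue is a conditional UPDATE that only succeeds for the first claimant. The Python sketch below uses SQLite and a hypothetical `url_queue` table; it is not OWS's MySQL schema.

```python
import sqlite3


def create_queue(conn, urls):
    """Create the hypothetical url_queue table and fill it with URLs."""
    conn.execute(
        "CREATE TABLE url_queue ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL, claimed_by TEXT)"
    )
    conn.executemany("INSERT INTO url_queue (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


def claim_next_url(conn, crawler_id):
    """Claim the first unclaimed URL for crawler_id; return it, or None if none are left."""
    while True:
        row = conn.execute(
            "SELECT id, url FROM url_queue WHERE claimed_by IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        url_id, url = row
        with conn:  # commit on success, roll back on error
            cur = conn.execute(
                "UPDATE url_queue SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
                (crawler_id, url_id),
            )
        if cur.rowcount == 1:  # this crawler won the claim
            return url
        # Another crawler claimed the row first: look for the next one.
```

The `AND claimed_by IS NULL` condition is what makes the claim safe: if two crawlers select the same row, only one UPDATE changes it, and the other crawler moves on to the next URL.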
webie responded on 31 Aug 2008 at 3:35 pm #
Hi Stefano,
I am getting lots of timeout errors on all my MySQL servers, even though the MySQL settings are OK. This is what I am getting:
Unable to execute SQL Query: DELETE FROM pages WHERE host_id = 48133 AND id NOT IN ( SELECT id FROM view_unique_pages WHERE host_id = 48133 )
Mysql1 Connected: True
Mysql1 Connected: True
Error [executeSQLQuery()]: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Regards
Darren
Shen139 responded on 01 Sep 2008 at 11:00 am #
Hi,
By default the timeout is set to 30 seconds,
but that may be too low if you are indexing thousands of pages.
I’ll increase it to 120 seconds (2 minutes).
webie responded on 01 Sep 2008 at 11:36 am #
Hi Stefano,
Makes sense. I have 44 million records in the database from testing the DMOZ importer. The script has been modified and now imports the content RDF dump directly into pages and hostlist.
Do I need to set the number of indexed pages to 1 by default in hostlist, or does this value not matter?
Also, OWS now runs for five minutes and then I get an error telling me to add external hosts, which I already did with -e. Maybe it is a bug, but I am not sure, as I have had a lot of problems with Mono on CentOS 5.2. Do you have a guide to what I need for Mono? CentOS comes with version 1.2 and the latest is 1.9 for el4, but I have to force the install or I get lots of dependency errors.
Regards
Darren
Shen139 responded on 01 Sep 2008 at 11:53 am #
No, the hostlist table uses an auto-number and you don’t need to change it.
What you need to keep in mind is that the pages table has a field called “host_id”, which is the ID of the hostname in the hostlist table.
Which error do you get?
Maybe you don’t have enough RAM? External hosts and rels are stored in memory, so if you index very large sites you need a good amount of RAM.
I don’t know how to set up the latest version of Mono on CentOS.
I’m sure you will find everything with Google!
webie responded on 01 Sep 2008 at 1:14 pm #
Hi Stefano,
I have 4 GB of RAM installed but have locked MySQL into memory with memlock. Does OWS use swap memory? Maybe I should take MySQL out of the lock and try again.
Regards
Darren
Shen139 responded on 01 Sep 2008 at 1:34 pm #
OWS uses its own memory, not MySQL’s!
Please send me the error you get!
cangieit responded on 02 Sep 2008 at 7:39 pm #
Hi Stefano,
Nice project. I was using an older release, .7, and have now downloaded 1.1, but while connecting to MySQL the connection goes to sleep. Any suggestions?
Bye, thanks, and greetings to everyone.
max
Shen139 responded on 03 Sep 2008 at 10:01 am #
Hi,
(How nice... a comment in Italian!)
When problems like this occur, the cause usually lies in Mono.
Are you sure you have the latest version?
If you run the spider on Windows with .NET Framework 2.0 or later, you are fine.
If Mono is not the cause, then maybe MySQL is.
The connector that OWS uses was the latest version when the spider was released, so if you have a somewhat old version of MySQL there may be backward-compatibility problems.
Do you get an error, or does it just hang?
Stefano
cangieit responded on 03 Sep 2008 at 10:43 am #
Hi Stefano,
A bit of something other than English now and then isn’t bad.
Let me explain better, since I was very vague:
Centos 5, ( CentOS (2.6.18-92.1.10.el5))
mono-core 1.2.4-2.el5.centos The Mono CIL runtime, suitable for running .NET code
mono-data 1.2.4-2.el5.centos Database connectivity for Mono
mono-devel 1.2.4-2.el5.centos Development tools for Mono
mono-web 1.2.4-2.el5.centos ASP.NET, Remoting, and Web Services for Mono
mono-winforms 1.2.4-2.el5.centos Windows Forms implementation for Mono
mysql-server 5.0.45-7.el5
That is what I have. The strange thing is that it hangs and stays stuck; looking in MySQL, the connection stays in sleep and remains there forever, without giving any explicit error.
Bye and thanks
Max
Shen139 responded on 03 Sep 2008 at 10:53 am #
My rough guess is that it is a Mono problem.
Try upgrading to version 1.9.1!
The thing is that Mono is a great big project, but unfortunately it has not tracked Microsoft’s development of the .NET Framework very closely, especially in earlier versions. The latest versions are very stable and adhere closely to the standards.
Many people have written to me about various problems with old versions of Mono; those problems were 100% solved by the latest stable version available.
When I develop, I try to use packages updated to the very latest versions, but this causes me a lot of trouble with people who have old versions on which I have never tested the spider.
Try it and let me know!
Bye
hilkiah responded on 03 Sep 2008 at 2:20 pm #
Hi Shen139,
I’ve just stumbled upon OpenWebSpider and I’m hoping that it’ll serve as a replacement for Nutch (crawler/indexer functionality).
My question is: how can I crawl multiple pages? Say I have a list of 1000 URLs to crawl. To put it another way (the Nutch way), how can I do a web crawl with OWS?
Secondly, OWS only parses html/text files, thus if someone wanted to assist the project by writing a parser for another file format (say pdf), would they HAVE to use C#?
Thanks,
Hilkiah
cangieit responded on 03 Sep 2008 at 4:01 pm #
Hi Stefano,
PERFECT, you were right, the problem was in Mono. I could not sort out the dependency mess, so I went from CentOS 5 to CentOS 4 and now it works perfectly. ;) Congratulations again.
The only thing is that at the end of the crawl, after the “Bye Bye”, I get “press any key to continue”. How can I remove it? Is it possible to avoid it?
Bye and thanks again
Max
Shen139 responded on 04 Sep 2008 at 11:01 am #
Hi hilkiah,
OWS uses a table (hostlist) that stores all the domains you want to index,
for example: http://www.example.com, http://www.test.com, ….
You can simply add all the domains you want to index to that table and run
openwebspider on the first one in the list.
To add hostnames you can use the following command:
openwebspider.exe --index http://www.example.com --add-hostlist
(this command won’t index http://www.example.com)
The easy way to index a single website is:
openwebspider.exe --index http://www.example.com -s
To index a single page:
openwebspider.exe --index http://www.example.com/page.html -m 0 -s
Please refer to this page for a complete how-to:
http://www.openwebspider.org/documentation/openwebspider-v01/openwebspider-v01-command-line-argumentsusage/
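For a long list of URLs, such as the 1000 mentioned in the question, the add-to-hostlist command above can be generated once per URL. The Python sketch below only builds the argument lists, using the `--index` and `--add-hostlist` flags quoted in this comment. The helper names are hypothetical, and the executable name may need adjusting (for example when running under Mono).

```python
import subprocess


def add_hostlist_commands(urls, exe="openwebspider.exe"):
    """Build one '--index <url> --add-hostlist' command per non-empty URL."""
    commands = []
    for url in urls:
        url = url.strip()
        if not url or url.startswith("#"):  # skip blank lines and comments
            continue
        commands.append([exe, "--index", url, "--add-hostlist"])
    return commands


def run_all(commands):
    """Run each command in turn; raises CalledProcessError if one fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

A typical use would read the URLs from a text file, pass the lines to `add_hostlist_commands`, and hand the result to `run_all`. After that, a normal crawl can be started on the first host in the list, as described above.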
Shen139 responded on 04 Sep 2008 at 11:04 am #
Hi Max,
I’m glad you solved the problem!
As for having to press Enter at the end, I won’t hide that I put it in mostly for debugging. I am about to release version 0.1.2 (with interesting new features such as image indexing and more), and I will certainly remove that nuisance for you.
Thanks for the report!
Stefano