OpenWebSpider# v0.1.2

Released OpenWebSpider v0.1.2

CHANGELOG:

  • BUG: fixed the regex used to extract URLs from (I)FRAME tags (see the sketch after this list)
  • New feature: OpenWebSpider# can index images (new table: images)
  • New feature: new command-line argument: --images
  • Improved stress-test facility: in stress-test mode, OpenWebSpider# no longer requires a configuration file or a MySQL server, and it skips the robots.txt check
  • SQL query execution timeout set to 120 seconds (2 minutes)
  • New feature: new configuration file fields: CRAWLER NAME and CRAWLER VERSION
  • New feature: the CRAWLER NAME is used when checking robots.txt exclusions
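
For context, here is a minimal sketch of the kind of pattern involved in extracting URLs from (I)FRAME tags; this is illustrative only, not the project's actual regex:

    using System;
    using System.Text.RegularExpressions;

    class FrameExtractor
    {
        // Illustrative pattern (an assumption, not OWS's real regex):
        // matches the src attribute of <frame> and <iframe> tags,
        // with or without quotes around the URL.
        static readonly Regex FrameSrc = new Regex(
            @"<i?frame[^>]*\bsrc\s*=\s*[""']?([^""'\s>]+)",
            RegexOptions.IgnoreCase);

        static void Main()
        {
            string html = "<iframe src=\"http://www.example.com/page.html\"></iframe>";
            foreach (Match m in FrameSrc.Matches(html))
                Console.WriteLine(m.Groups[1].Value);  // the extracted URL
        }
    }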

Source code and binary are available in the package: Download

Documentation of OpenWebSpider# v0.1

24 Responses

  1. webie says:

    Hi Stefano,

    I hope you are well. I'm running a crawl with OWS v0.1.2 and, for some reason, getting this error on the console: "Error: Object reference not set to an instance of an object".

    Thought I'd let you know.

    Regards

    Darren

  2. Shen139 says:

    Hi Darren,
    Please give me the command-line arguments you used, and when and where you got that error! It's a generic error message and without more info I can't fix it.

    Thanks

  3. webie says:

    Hi Stefano,

    The command used was:

    mono spider.exe -e -l 200 -t 6 -i http://www.acrylic-nails-store.co.uk

    Best Regards

    Darren

  4. Shen139 says:

    Hi,
    it was a bug in the regex that extracts URLs from the (I)FRAME tag.

    I've fixed it, along with another bug in a different function, and now OWS works fine.

    I'm not publishing a new version right now, but you can download a working build here:
    http://www.openwebspider.org/public/source-code/ows/OpenWebSpiderCS_v0.1/bin/

    Thanks
    Stefano

  5. Puck says:

    I could not compile the latest version. I am getting the error:

    “error CS1504: Source file ‘C:\Rekabu\OpenWebSpiderCS_v0.1.2\OpenWebSpiderCS\Properties\Settings.Designer.cs’ could not be opened (‘Unspecified error ‘)”

    The file is not present in that directory.

  6. Kedrin says:

    I was testing and I can't seem to get the crawler name or version to do anything. Is this feature functional yet?

  7. Shen139 says:

    Right now the crawler name and version are only used to check exclusions against robots.txt.

    But in future versions they may also be used as the User-Agent string in HTTP requests.
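
    As an illustration, a robots.txt exclusion check keyed on the configured crawler name might look roughly like the sketch below; the parsing here is deliberately simplified and is an assumption, not OWS's actual implementation:

        using System;

        static class RobotsChecker
        {
            // Simplified sketch (not OWS's real code): honors only
            // "User-agent" and "Disallow" lines, prefix-matching paths.
            public static bool IsDisallowed(string robotsTxt, string crawlerName, string path)
            {
                bool appliesToUs = false;
                foreach (string rawLine in robotsTxt.Split('\n'))
                {
                    string line = rawLine.Trim();
                    if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
                    {
                        string agent = line.Substring("User-agent:".Length).Trim();
                        appliesToUs = agent == "*" ||
                            agent.Equals(crawlerName, StringComparison.OrdinalIgnoreCase);
                    }
                    else if (appliesToUs &&
                             line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
                    {
                        string rule = line.Substring("Disallow:".Length).Trim();
                        if (rule.Length > 0 && path.StartsWith(rule))
                            return true;  // a Disallow rule for us covers this path
                    }
                }
                return false;
            }

            static void Main()
            {
                string robots = "User-agent: OpenWebSpider\nDisallow: /private/";
                Console.WriteLine(IsDisallowed(robots, "OpenWebSpider", "/private/page.html")); // True
                Console.WriteLine(IsDisallowed(robots, "OpenWebSpider", "/public/"));           // False
            }
        }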

  8. SHL says:

    Hi,

    Nice to see a new version (will try it asap).

    I have a general usage question also:
    Is it possible to constrain the crawler to two (or more) sites, rather than only one?

    My problem: I'm indexing a site that uses both .com and .org as its top-level domain (with internal links sometimes using .com, sometimes .org), so if I choose to index only the .com URL, the crawler will not index pages that happen to use '.org'. I haven't managed to cover the whole site in any way.

    If it's not possible in the current version, I think this (e.g. allowing a regex for the base URL) would be a very nice new feature.

    Thanks for sharing this software.
    Best Regards
    // Samuel Lampa

  9. Shen139 says:

    Why don’t you run OWS 2 times?
    the first time over domain.org and then over the .com!

    I’m planning to release a new version with a new feature that makes OWS index pages like google does.
    Now OWS indexes a domain per time, then it will index pages depending on its rank and other parameters for different websites and upgrading the index… it’s not easy to explain!
    But there are many modifications to apply to the database and the spider so I can’t say when I’ll release it.

  10. SHL says:

    Hi Shen,

    The problem is that there are always some links that get "hidden" because they use the "wrong" top-level domain, when starting out from one of the domains.

    The site mainly uses one domain, say .com. So I start crawling the .com site, but all links that happen to point to the .org version instead of the .com version are seen by the spider as external URLs. In this way some pages seem to form "isolated islands" of interconnected links that cannot be reached by starting a crawl from either domain. And indexing the .org domain won't do either, because then nearly 100% (maybe 100% on the first page) of the links will point to the .com version of the pages, and thus be seen as external URLs.

    I understand that this is quite a special situation. I just thought that if regex support for domain names is very easy to implement, it would be nice; but I understand if it's not easy, or you don't have time!

    Looking forward to the new version also. Keep up the good work!

  11. SHL says:

    Update:

    After some more testing, I see that the main problem is the alternating use of www (or not) in the URLs. The .com/.org issue seems to be a minor one related to that.

    That is, if I start crawling at http://www.example.com, then the crawler will skip links to http://example.com/somepage(!). So maybe a regex would be good anyway?

    Kind Regards
    Samuel

  12. Shen139 says:

    As you said in your first comment, the problem is "hidden" links!
    If http://www.example.com links to http://www.test.com/test_page,
    it's possible that when indexing http://www.test.com you won't index test_page, because there are no internal links that lead to that page.
    The only possible solution (for me, and it's on my TODO list) is to store every page the crawler finds and use a flag: Indexed or Not Indexed.

    This solution also solves another problem I have!
    Someone asked me if there is a way to index a list of URLs… with this new table, they could add all the URLs they need to that table, run OWS, and wait.

    This is not easy, though, because OWS's core needs to be heavily modified.
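
    In code form, that frontier idea boils down to tracking every discovered URL together with an indexed flag. Here is a minimal in-memory sketch; in OWS this would live in a MySQL table, and the names are assumptions for illustration:

        using System;
        using System.Collections.Generic;
        using System.Linq;

        class Frontier
        {
            // Maps each discovered URL to whether it has been indexed yet.
            private readonly Dictionary<string, bool> pages = new Dictionary<string, bool>();

            // Record any URL the crawler encounters (internal or external).
            public void Discover(string url)
            {
                if (!pages.ContainsKey(url))
                    pages[url] = false;  // flag: Not Indexed
            }

            public void MarkIndexed(string url)
            {
                pages[url] = true;  // flag: Indexed
            }

            // The next page still waiting to be indexed, or null when done.
            public string NextUnindexed()
            {
                return pages.Where(p => !p.Value).Select(p => p.Key).FirstOrDefault();
            }
        }

    Pre-seeding such a table is exactly what would make the "index a list of URLs" request work: add them all as Not Indexed, run OWS, and wait.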

  13. SHL says:

    Hi Shen, thanks for your answers!

    [cite=Shen]
    This solution also solves another problem I have!
    Someone asked me if there is a way to index a list of URLs… with this new table, they could add all the URLs they need to that table, run OWS, and wait.
    [/cite]

    Ok, I understand that this is a solution that solves many problems in one.

    Still, I think the problem I mentioned is most easily solved by using a regex to specify the allowed domain instead of a fixed text string (don't you think?). Then you could just use this regex:

    http://(www\.)?example\.(com|org)

    …and it would allow all the different variants of the domain, e.g. example.com, example.org, http://www.example.com or http://www.example.org.
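
    Applied to candidate URLs, such an allowed-domain check could look like the sketch below (just an illustration of the idea, not an existing OWS feature):

        using System;
        using System.Text.RegularExpressions;

        class AllowedDomainCheck
        {
            static void Main()
            {
                // The allowed-domain regex suggested above, anchored at the start.
                var allowed = new Regex(@"^http://(www\.)?example\.(com|org)",
                                        RegexOptions.IgnoreCase);

                string[] urls =
                {
                    "http://example.com/page",
                    "http://www.example.org/page",
                    "http://other.com/page",
                };

                foreach (string url in urls)
                    Console.WriteLine("{0} -> {1}", url,
                                      allowed.IsMatch(url) ? "crawl" : "skip");
            }
        }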

    Especially the issue of alternating use of www and not, I suppose, is quite a general problem out there, isn't it? I noticed that the PHP-based "Sphider" handles this by allowing both the www and non-www forms even if you specify the domain using www. (I of course prefer compiled code and multi-threading over a PHP solution, though =) ).

    Alternatively, if implementing regex support is too cumbersome, maybe just allowing domains both _with_ and _without_ "www" could be added? Or what do you think?

    Kind Regards
    Samuel Lampa

  14. Shen139 says:

    Hi,
    you are right, a regex would solve the problem, but OWS is structured so that it uses the host_id (the ID of the current domain) for many tasks, and so using a regex or any other trick would destroy the fundamentals of its core.

    http://www.domain.com is different from domain.com! Many web servers can serve different pages for a domain with and without www. Do you understand? So it's not simple to solve it your way!

    I’m planning a new version of OWS with this feature… OWS 0.7 had a command-line argument:
    –[
    -F

    (Free Indexing mode)

    1.
    This is a new feature in OpenWebSpider v0.6! Free indexing mode means that
    the web spider will index pages while it encounter them!
    For example if we have a web site with the following structure:

    home_page
    / | \
    link1 link2 http://www.example.com/linkX

    the web spider will index (in order): home_page, link1, link2 and http://www.example.com/linkX
    Whereas without this argument the web spider will index only: home_page, link1 and link2
    ]–
    but it wasn’t safe to use and created many problems!

  15. bernd says:

    Hello,

    this is a cool Spider 😉

    I've tested it, but there is an error message: Getting Urls…Error: Der Index war außerhalb des Arraybereiches. (In English: "The index was outside the bounds of the array.")

    I tested with this command line.
    --index one3p.de -t 10 -r 1 --crawl-delay 5 --req-timeout 30 -l 3000 -s

    Another URL works fine, but this one (http://one3p.de) does not. What's wrong?

    many thanks

  16. Shen139 says:

    Hi,
    Thanks (for the cool spider)!

    This is a known bug, fixed in OpenWebSpider v0.1.3.

    I'll release it as soon as possible (I hope within this week). It fixes many bugs.

    I've tested your command-line arguments with the new version and it works fine!

    Stefano

  17. bernd says:

    Thanks for the reply. I have one more question. The spider finds URLs on a host and inserts them into the database, but not every domain gets crawled by the spider. Why?

    I used --add-external.

    Example:
    spider --> example.de (finds external example1.de and example2.de); all internal links are spidered.

    In the database, in the table hostlist, I now find:

    id  hostname     port  status  date        indexed_pages
    23  example.de   80    1       2008-10-28  45
    24  example1.de  80    1       2008-10-28  0
    25  example2.de  80    1       2008-10-28  0

    After the crawl, the spider says that no more domains (or hosts?) were found, etc.

    Could you include a function so that the spider crawls an uncrawled domain from the database if it doesn't find a link to it on the currently spidered site?

    Sorry for my bad English.
    Bye bye

  18. Shen139 says:

    Using --add-external, OWS will add all hosts different from the current one to the table hostlist!
    If example.de contains only example1.de and example2.de, OWS will index only these websites, no others! OK?

    What uncrawled domain? Who tells OWS what to crawl? (The table hostlist.)
    If all the domains in hostlist are indexed, OWS won't have anything to do.
    You should set the status of those domains to 0,
    or run OWS against each domain: openwebspider --index example.de, then openwebspider --index example1.de and openwebspider --index example2.de.

    Do you understand?
    (Maybe I misunderstood)

  19. bernd says:

    Perhaps my English is too bad. I don't know.

    "If all the domains in hostlist are indexed, OWS won't have anything to do."

    In my hostlist there are 15 domains with status 1 but 0 indexed_pages. When will the spider crawl these? These domains were found during a crawl, but the spider only inserted them and didn't index them.

    "If example.de contains only example1.de and example2.de, OWS will index only these websites, no others! OK?"

    OK, but what if example1.de contains example1A.de?

    example.de
      example1.de
        example1A.de

    Will the spider crawl example.de, example1.de and example1A.de?

    Many thanks

  20. bernd says:

    What does "Der Wartezustand wurde aufgrund eines abgebrochenen Mutex beendet" mean? (In English: "The wait completed due to an abandoned mutex.")

    I get this error on some domains.

  21. Shen139 says:

    Hi bernd,
    if a domain has status = 1, it means that that domain has been indexed.
    OWS inserts new domains with status = 0; status = 1 -> indexed; status = 2 -> indexing!
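
    Written out as code, the status values described above amount to something like this (a sketch mirroring the documented values; the enum name is an assumption, not OWS's actual source):

        // Host status values in the hostlist table, as described above.
        enum HostStatus
        {
            NotIndexed = 0,  // newly inserted, waiting to be crawled
            Indexed    = 1,  // crawl finished
            Indexing   = 2   // crawl currently in progress
        }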

    If you have 0 indexed_pages for a domain, it means that OWS was not able to index that site; I would say there was an error somewhere.
    Maybe OWS had an error indexing the first domain and then wasn't able to index the rest of the domains.

    I don't speak German, and that error message ("Der Wartezustand wurde aufgrund eines abgebrochenen Mutex beendet") doesn't come from OWS itself but from your Framework.

    I can read "Mutex"… so I guess you had a problem with mutexes.

    I can only suggest that you wait for the next version (it has many bug fixes) or email me and I'll send you a copy.

    Stefano

  22. bernd says:

    Many thanks. I have sent you an email.

  23. SHL says:

    [cite=shen139]
    OWS is structured so that it uses the host_id (the ID of the current domain) for many tasks, and so using a regex or any other trick would destroy the fundamentals of its core.
    [/cite]

    Ok, then I understand why it’s not an easy solution.

    [cite=shen139]
    I’m planning a new version of OWS with this feature…
    [/cite]

    Looking forward to that!

    Thanks for your replies!

    Kind Regards
    // Samuel