OpenWebSpider

OpenWebSpider v0.6 External Modules




What is an external module?

    An external module is a compiled library of functions that can be loaded and used
    by another program (in this case, by OpenWebSpider).

    OpenWebSpider can load and use both Win32 (.dll) and Linux (.so) modules.

Getting started

    To develop your first external module, start by looking at an existing one.
    Let's analyze our first filter module:
      /* COMMON INCLUDE FILES */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      
      /* modHeader.h CONTAINS ALL THE DEFINITIONS AND THE STRUCTURES NEEDED */
      #include "modHeader.h"
      
      /* THIS IS THE FIRST FUNCTION CALLED WHEN THE WEB SPIDER TRIES TO LOAD THIS MODULE
         IN THIS FUNCTION YOU CAN PARSE A CONFIGURATION FILE AND REPORT ERRORS IF THEY OCCUR
         IF THIS FUNCTION RETURNS 1 THE WEB SPIDER LOADS THE MODULE
         IF IT RETURNS 0 THE WEB SPIDER PRINTS THE ERROR AND THEN EXITS
      */
      int modInitFilter (char* hostname, char* error)
      {
      /*on OK return 1*/
      	return 1;
      
      
      /* on ERROR return 0 - you can use the variable error to report what happened
      	strcpy(error,"test");
      	return 0;
      */
      
      }
      
      /* modFilter IS THE MAIN FUNCTION OF EVERY MODULE
         IT IS CALLED FOR EACH PAGE PARSED BY THE WEB SPIDER
         HERE WE CAN DECIDE WHETHER OR NOT TO INDEX A PAGE
         AND EVEN MODIFY THE TEXT AND/OR THE HTML OF THE CURRENT PAGE
         IF WE RETURN 0 THE PAGE WILL NOT BE INDEXED
      */
      int modFilter (struct functArg* arg)
      {
      /*
      this example function checks whether the text of the current page contains
      the string "shen139": if it does, it tells the web spider to index the page
      (return 1), otherwise the page is not indexed (return 0)
      */
      	if(arg && arg->text)
      		if(strstr(arg->text,"shen139") != NULL)
      			return 1;   /* a match at the very start of the text counts too */

      	return 0;
      }
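
    The comment on modInitFilter mentions parsing a configuration file. The following is
    a minimal sketch of that idea, not code taken from OpenWebSpider. The file name
    my_filter.conf, the helper load_keyword, the limit KEYWORD_MAX and the variable
    g_keyword are all hypothetical. The size of the error buffer is not documented here,
    so the sketch keeps its message short.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical names, not part of OpenWebSpider */
#define CONFIG_FILE  "my_filter.conf"
#define KEYWORD_MAX  64

/* Keyword read at load time; a full module could use it later in modFilter */
static char g_keyword[KEYWORD_MAX];

/* Reads the first line of `path` into `out` (at most outSize-1 characters)
   and strips the trailing newline.
   Returns 1 on success, 0 if the file is missing, empty or starts with a blank line. */
static int load_keyword(const char* path, char* out, size_t outSize)
{
	FILE* fp = fopen(path, "r");
	if (fp == NULL)
		return 0;

	if (fgets(out, (int)outSize, fp) == NULL) {
		fclose(fp);
		return 0;
	}
	fclose(fp);

	out[strcspn(out, "\r\n")] = '\0';
	return out[0] != '\0';
}

int modInitFilter (char* hostname, char* error)
{
	(void)hostname;   /* not needed in this sketch */

	if (!load_keyword(CONFIG_FILE, g_keyword, sizeof g_keyword)) {
		/* the size of `error` is an unknown here: keep the message short */
		strcpy(error, "cannot read config");
		return 0;     /* the web spider prints the error and exits */
	}

	return 1;         /* the module will be loaded */
}
```

    With a my_filter.conf whose first line is, for example, shen139, modInitFilter
    returns 1 and g_keyword holds that word; without the file it returns 0 and fills error.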

    And now let's take a look at modHeader.h
      /* STRUCTURES AND DEFINITIONS NEEDED */
      
      /* MAX LENGTH FOR A HOSTNAME */
      #define MAXHOSTSIZE         100
      
      /* MAX LENGTH FOR A PAGE */
      #define MAXPAGESIZE         255
      
      /* MAX LENGTH FOR THE DESCRIPTION OF A PAGE */
      #define MAXDESCRIPTIONSIZE  255
      
      
      typedef struct sHost
      {
      /* HOST-NAME */
      	char Host[MAXHOSTSIZE];
      	
      /* PAGE */
      	char Page[MAXPAGESIZE];
      	
      /* DESCRIPTION */
      	char Description[MAXDESCRIPTIONSIZE];
      	
      /* (TCP) PORT OF THE WEB SERVER*/
      	unsigned short int port;
      	
      /* PAGE TYPE
         TYPES ARE:
         1: HTML PAGE
         2: PLAIN-TEXT FILES (.TXT, .C, .H)
         3: UN-HANDLED FILES
         4: CUSTOM EXTENSIONS
      */
      	unsigned short int type;
      	
      /* IS THIS PAGE INDEXED?
         0: NO
         1: YES
         2: INDEXING
      */
      	unsigned short int viewed;
      	
      /* LEVEL OF THE CURRENT PAGE IN THE TREE OF THE PAGES OF THE CURRENT WEB-SITE
                 p0         Level : 1
               / | \
              p1 p2 p3            : 2
              |     | \
              p4    p5 p6         : 3
      */
      	unsigned short int level;
      
      }SHOST;
      
      /* THIS IS THE STRUCTURE THAT THE WEB SPIDER FILLS AND SENDS TO THE MODULES */
      typedef struct functArg
      {
      /* THIS STRUCTURE CONTAINS INFO ABOUT THE CURRENT PAGE (HOSTNAME, PAGE, PORT, ...) */
      	struct sHost* hostInfo;
      
      /* THE HTML OF THE CURRENT PAGE */
      	char* html;
      	
      /* THE LENGTH IN BYTES OF THE HTML */
      	unsigned int htmlLength;
      
      /* THE TEXT OF THE CURRENT PAGE */
      	char* text;
      	
      /* THE LENGTH OF THE TEXT */
      	unsigned int textLength;
      
      /* GENERAL INFO */
      
      /* NUMBER OF THE PAGES INDEXED IN THE CURRENT SESSION */
      	int PagesViewed;
      	
      /* BYTES DOWNLOADED IN THE CURRENT SESSION */
      	long int bytesDownloaded;
      
      /* POINTER TO THE CONNECTION TO THE DB1 */
      	void* mysqlDB1;
      	
      /* POINTER TO THE CONNECTION TO THE DB2 */
      	void* mysqlDB2;
      	
      /* POINTER TO THE CONNECTION TO THE DB3 */
      	void* mysqlDB3;
      
      
      }FUNCTION_ARGUMENT;
      
      
      /* USEFUL MACRO */
      #ifndef MIN
      	#define MIN(a,b)            (((a)<(b))?(a):(b))
      #endif
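
    The fields of sHost can drive the indexing decision as well. The sketch below is one
    possible modFilter, written under a few assumptions: the policy (HTML pages only, at
    most three levels deep) and the names TYPE_HTML and MAX_LEVEL are invented for this
    example, and it assumes that changes made to arg->text in place are what the web spider
    goes on to index. So that it compiles on its own, the block repeats the two structures
    from modHeader.h; a real module would simply #include "modHeader.h" instead.

```c
#include <stddef.h>
#include <ctype.h>

/* Stand-in for #include "modHeader.h": same fields, same order */
#define MAXHOSTSIZE         100
#define MAXPAGESIZE         255
#define MAXDESCRIPTIONSIZE  255

typedef struct sHost
{
	char Host[MAXHOSTSIZE];
	char Page[MAXPAGESIZE];
	char Description[MAXDESCRIPTIONSIZE];
	unsigned short int port;
	unsigned short int type;
	unsigned short int viewed;
	unsigned short int level;
}SHOST;

typedef struct functArg
{
	struct sHost* hostInfo;
	char* html;
	unsigned int htmlLength;
	char* text;
	unsigned int textLength;
	int PagesViewed;
	long int bytesDownloaded;
	void* mysqlDB1;
	void* mysqlDB2;
	void* mysqlDB3;
}FUNCTION_ARGUMENT;

/* Hypothetical policy values chosen for this example */
#define TYPE_HTML   1   /* "1: HTML PAGE" in the PAGE TYPE comment */
#define MAX_LEVEL   3

int modFilter (struct functArg* arg)
{
	unsigned int i;

	if (arg == NULL || arg->hostInfo == NULL || arg->text == NULL)
		return 0;

	/* skip everything that is not an HTML page */
	if (arg->hostInfo->type != TYPE_HTML)
		return 0;

	/* skip pages deeper than MAX_LEVEL in the tree of the web site */
	if (arg->hostInfo->level > MAX_LEVEL)
		return 0;

	/* modify the text in place: lower-case it, never reading past textLength */
	for (i = 0; i < arg->textLength; i++)
		arg->text[i] = (char)tolower((unsigned char)arg->text[i]);

	return 1;   /* index the page */
}
```

    The loop is bounded by textLength rather than by a terminating '\0', because the header
    does not say whether text is NUL-terminated.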
      
      	

How to compile an external module?

    Under Linux
      Compiling a module under Linux is quite easy. If we have a module called my_test_module1.c we can build it with:
        $ gcc -g -fPIC -c my_test_module1.c
        $ gcc -g -shared -Wl,-soname,my_test_module1.so.0 -o my_test_module1.so my_test_module1.o -lc
      The -fPIC option produces position-independent code, which shared libraries need on most 64-bit systems.
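
      Before handing a freshly built module to the web spider, it can be useful to check
      that the shared object loads and exports both entry points. The helper below is a
      small stand-alone sketch using the standard dlopen/dlsym interface; check_module is
      a hypothetical name, and this is not necessarily how OpenWebSpider itself loads
      modules. Depending on the C library, linking the harness may require -ldl.

```c
#include <stdio.h>
#include <dlfcn.h>

/* Opaque here: the check only looks for the symbols, it never calls them */
struct functArg;

typedef int (*initFn)(char* hostname, char* error);
typedef int (*filterFn)(struct functArg* arg);

/* Returns 1 if `path` can be loaded and exports modInitFilter and modFilter,
   0 otherwise (the reason is printed on stderr). */
int check_module(const char* path)
{
	void* handle;
	initFn init;
	filterFn filter;

	handle = dlopen(path, RTLD_NOW);
	if (handle == NULL) {
		fprintf(stderr, "dlopen: %s\n", dlerror());
		return 0;
	}

	/* POSIX-style conversion from void* to a function pointer */
	*(void**)(&init)   = dlsym(handle, "modInitFilter");
	*(void**)(&filter) = dlsym(handle, "modFilter");

	if (init == NULL || filter == NULL) {
		fprintf(stderr, "missing modInitFilter or modFilter in %s\n", path);
		dlclose(handle);
		return 0;
	}

	dlclose(handle);
	return 1;
}
```

      Called as check_module("./my_test_module1.so") from a small main, it returns 1 only
      when the library was found and both functions are exported.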

    Under Windows
      The easiest way to compile a module under Windows with MS Visual C++ 6.0 is to create a new
      "Win32 Dynamic-Link Library" project and then run the build command from the Build menu (or press F7).





