OpenWebSpider v0.6 External Modules
What is an external module?
-
An external module is a compiled library of functions that another program
(in this case OpenWebSpider) can load and call at run time.
OpenWebSpider can load and use both Win32 (.dll) and Linux (.so) modules.
Getting started
-
The easiest way to develop your first external module is to study an existing one.
Let's analyze our first filter module:
/* COMMON INCLUDE FILES */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* modHeader.h CONTAINS ALL THE DEFINITIONS AND THE STRUCTURES NEEDED */
#include "modHeader.h"
/* THIS IS THE FIRST FUNCTION CALLED WHEN THE WEB SPIDER TRIES TO LOAD THIS MODULE.
   IN THIS FUNCTION YOU CAN PARSE A CONFIGURATION FILE AND REPORT ANY ERRORS THAT OCCUR.
   IF THIS FUNCTION RETURNS 1 THE WEB SPIDER LOADS THE MODULE;
   IF IT RETURNS 0 THE WEB SPIDER PRINTS THE ERROR AND THEN EXITS.
*/
int modInitFilter (char* hostname, char* error)
{
    /* on OK return 1 */
    return 1;
    /* on ERROR return 0 - you can use the variable error to report what happened
    strcpy(error,"test");
    return 0;
    */
}
/* modFilter IS THE MAIN FUNCTION OF EVERY MODULE.
   IT IS CALLED FOR EACH PAGE PARSED BY THE WEB SPIDER.
   HERE WE CAN DECIDE WHETHER OR NOT TO INDEX A PAGE
   AND EVEN MODIFY THE TEXT AND/OR THE HTML OF THE CURRENT PAGE.
   IF WE RETURN 0 THE PAGE WILL NOT BE INDEXED.
*/
int modFilter (struct functArg* arg)
{
    /*
    this example function checks whether the text of the current page contains
    the string "shen139": if it does, the page is indexed (return 1),
    otherwise the web spider is told not to index the page (return 0)
    */
    if(arg && arg->text)
        /* strstr returns NULL when the string is not found */
        if(strstr(arg->text,"shen139") != NULL)
            return 1;
    return 0;
}
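The comments above mention two things the example module does not actually do: reporting an error from modInitFilter and modifying the page text from modFilter. The following is a minimal sketch of both, not code from OpenWebSpider. The rule "refuse an empty hostname" and the helper name maskWord are assumptions made for illustration, and the size of the error buffer is not documented here, so the message is kept short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rule for this sketch: refuse to load without a host name.
   The size of the error buffer is not documented, so the message is short. */
int modInitFilter(char* hostname, char* error)
{
    if (hostname == NULL || hostname[0] == '\0') {
        if (error != NULL)
            strcpy(error, "empty hostname");
        return 0;   /* the web spider prints the error and exits */
    }
    return 1;       /* the module will be loaded */
}

/* Hypothetical helper: overwrite every occurrence of word in text with '*'.
   The text keeps its length, so a stored length such as textLength stays valid.
   Returns the number of occurrences that were masked. */
int maskWord(char* text, const char* word)
{
    size_t len;
    int count = 0;
    char* hit;

    if (text == NULL || word == NULL || word[0] == '\0')
        return 0;
    len = strlen(word);
    hit = strstr(text, word);
    while (hit != NULL) {
        memset(hit, '*', len);          /* mask this occurrence in place */
        count++;
        hit = strstr(hit + len, word);  /* continue after the masked span */
    }
    return count;
}
```

Inside modFilter such a helper could be called as maskWord(arg->text, "secret") before returning 1, assuming the web spider indexes the buffer as the module leaves it.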
And now let's take a look at modHeader.h
/* STRUCTURES AND DEFINITIONS NEEDED */
/* MAX LENGTH FOR A HOSTNAME */
#define MAXHOSTSIZE 100
/* MAX LENGTH FOR A PAGE */
#define MAXPAGESIZE 255
/* MAX LENGTH FOR THE DESCRIPTION OF A PAGE */
#define MAXDESCRIPTIONSIZE 255
typedef struct sHost
{
    /* HOST-NAME */
    char Host[MAXHOSTSIZE];
    /* PAGE */
    char Page[MAXPAGESIZE];
    /* DESCRIPTION */
    char Description[MAXDESCRIPTIONSIZE];
    /* (TCP) PORT OF THE WEB SERVER */
    unsigned short int port;
    /* PAGE TYPE
       TYPES ARE:
       1: HTML PAGE
       2: PLAIN-TEXT FILES (.TXT, .C, .H)
       3: UN-HANDLED FILES
       4: CUSTOM EXTENSIONS
    */
    unsigned short int type;
    /* IS THIS PAGE INDEXED?
       0: NO
       1: YES
       2: INDEXING
    */
    unsigned short int viewed;
    /* LEVEL OF THE CURRENT PAGE IN THE TREE OF THE PAGES OF THE CURRENT WEB-SITE
              p0          Level : 1
            / |  \
          p1  p2  p3            : 2
          |   |    \
          p4  p5    p6          : 3
    */
    unsigned short int level;
}SHOST;
/* THIS IS THE STRUCTURE THAT THE WEB SPIDER FILLS AND SENDS TO THE MODULES */
typedef struct functArg
{
    /* THIS STRUCTURE CONTAINS INFO ABOUT THE CURRENT PAGE (HOSTNAME, PAGE, PORT, ...) */
    struct sHost* hostInfo;
    /* THE HTML OF THE CURRENT PAGE */
    char* html;
    /* THE LENGTH IN BYTES OF THE HTML */
    unsigned int htmlLength;
    /* THE TEXT OF THE CURRENT PAGE */
    char* text;
    /* THE LENGTH OF THE TEXT */
    unsigned int textLength;
    /* GENERAL INFO */
    /* NUMBER OF THE PAGES INDEXED IN THE CURRENT SESSION */
    int PagesViewed;
    /* BYTES DOWNLOADED IN THE CURRENT SESSION */
    long int bytesDownloaded;
    /* POINTER TO THE CONNECTION TO THE DB1 */
    void* mysqlDB1;
    /* POINTER TO THE CONNECTION TO THE DB2 */
    void* mysqlDB2;
    /* POINTER TO THE CONNECTION TO THE DB3 */
    void* mysqlDB3;
}FUNCTION_ARGUMENT;
/* USEFUL MACRO (FULLY PARENTHESIZED SO IT IS SAFE INSIDE LARGER EXPRESSIONS) */
#ifndef MIN
#define MIN(a,b) (((a)<(b))?(a):(b))
#endif
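The fields of struct sHost can drive a filter as well as the page text can. As a rough sketch, the modFilter below indexes only HTML pages (type 1) that are no deeper than a given level in the site tree. The names TYPE_HTML and MAX_LEVEL and the limit of 3 are assumptions made for this example. The structures are repeated here only so that the sketch compiles on its own; a real module would use #include "modHeader.h" instead.

```c
#include <stddef.h>

/* Copied from modHeader.h so that this sketch is self-contained. */
#define MAXHOSTSIZE 100
#define MAXPAGESIZE 255
#define MAXDESCRIPTIONSIZE 255

typedef struct sHost
{
    char Host[MAXHOSTSIZE];
    char Page[MAXPAGESIZE];
    char Description[MAXDESCRIPTIONSIZE];
    unsigned short int port;
    unsigned short int type;
    unsigned short int viewed;
    unsigned short int level;
}SHOST;

typedef struct functArg
{
    struct sHost* hostInfo;
    char* html;
    unsigned int htmlLength;
    char* text;
    unsigned int textLength;
    int PagesViewed;
    long int bytesDownloaded;
    void* mysqlDB1;
    void* mysqlDB2;
    void* mysqlDB3;
}FUNCTION_ARGUMENT;

#define TYPE_HTML 1   /* page type 1: HTML page (see the type list above) */
#define MAX_LEVEL 3   /* hypothetical depth limit chosen for this sketch */

int modFilter(struct functArg* arg)
{
    if (arg == NULL || arg->hostInfo == NULL)
        return 0;                           /* nothing to inspect: do not index */
    if (arg->hostInfo->type != TYPE_HTML)
        return 0;                           /* skip plain-text and other files */
    if (arg->hostInfo->level > MAX_LEVEL)
        return 0;                           /* skip pages deeper than the limit */
    return 1;                               /* index the page */
}
```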
How to compile an external module?
-
Under Linux
-
Compiling a module under Linux is quite easy. If we have a module called my_test_module1.c we can build it with:
-
$ gcc -g -fPIC -c my_test_module1.c
$ gcc -g -shared -Wl,-soname,my_test_module1.so.0 -o my_test_module1.so my_test_module1.o -lc
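Before handing the module to the web spider, it can be worth checking that the shared object loads and exports the two entry points. The small checker below is only a sketch built on the standard dlopen/dlsym interface; it is not how OpenWebSpider itself loads modules, and the function name checkModule is made up for this example. On older C libraries it may need to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the shared object at path can be loaded
   and exports both modInitFilter and modFilter, 0 otherwise. */
int checkModule(const char* path)
{
    void* handle;
    int ok;

    handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 0;
    }
    /* Look up the two entry points by name. */
    ok = dlsym(handle, "modInitFilter") != NULL
      && dlsym(handle, "modFilter") != NULL;
    if (!ok)
        fprintf(stderr, "%s: modInitFilter or modFilter is missing\n", path);
    dlclose(handle);
    return ok;
}
```

Calling checkModule("./my_test_module1.so") from a small test program should return 1 for a module built with the commands above.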
Under Windows
-
The easiest way to compile a module under Windows with MS Visual C++ 6.0 is to create a new
Win32 Dynamic-Link Library project and then choose the Build command from the Build menu (or press F7).
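One detail to keep in mind on Windows: a function is visible outside a DLL only if it is exported, for example with __declspec(dllexport) or through a .def file. The sketch below shows one common way to keep a single source file for both platforms. MOD_EXPORT is a name invented for this example, and whether OpenWebSpider expects anything beyond plain exported C functions (such as a particular calling convention) is not stated here.

```c
#include <stddef.h>

/* MOD_EXPORT is a hypothetical macro name chosen for this sketch. */
#if defined(_WIN32)
#define MOD_EXPORT __declspec(dllexport)  /* export the symbol from the DLL */
#else
#define MOD_EXPORT                        /* nothing special needed here */
#endif

struct functArg;   /* the full definition lives in modHeader.h */

MOD_EXPORT int modInitFilter(char* hostname, char* error)
{
    (void)hostname;   /* unused in this sketch */
    (void)error;
    return 1;         /* always report a successful initialization */
}

MOD_EXPORT int modFilter(struct functArg* arg)
{
    /* Index every page for which the web spider passed an argument. */
    return arg != NULL;
}
```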

