For a college project worth 10 Dutch 'study points', I would like to build an information retrieval system that combines the distributed approach of file-sharing utilities with the logical framework built into relational database management systems (rDBMSs).
By combining these two, users can do more than simply swap large quantities of static data (files) found by searching on filename. A DBMS makes it possible to attach a great deal of information to a piece of data: it can, for instance, be linked to other items in one or more 'equivalence' groups, or commentary and keywords can be added to increase the hit count. Textual data can be indexed by a spider for retrieval, and other 'intelligent' features could be added to create a system that is far more flexible than the current file-based distributed sharing systems.
Since the line between the internet and local data is becoming ever more fuzzy, the difference between finding data on a local system and finding it on the network is also disappearing. For about half a year I have been working on a document about building this 'bridge', available at http://atoms.htmlplanet.com.
On the user's local computer, information is stored as static packages (files) that are accessed in a hierarchical manner. The internet, on the other hand, is built out of websites, consisting of many files and even information from databases. Although some information on local computer systems is also kept in databases, most of it is not.
Eliminating the differences between database and filesystem access may be the key both to more useful sharing of information over digital networks such as the internet and to making a local computer system more flexible and maintainable.
In this project, I plan to research the usability of the distributed part of this idea, using SQL-based databases as the underlying datasources. The reason for using rDBMSs is their well-thought-out interface, based on a question-and-answer scheme. Instead of storing the data itself in binary large object fields inside the database, only links to files are used, which sacrifices part of the flexibility of the system. The local computer system is not altered, which saves considerable time.
The project I propose is not unique in this field. Although I know of no other project with exactly the same goals, there is quite interesting work going on both in using relational databases to index a hard drive and in distributed information sharing. In the first area the following projects may be of interest, although rDBMS manufacturers are also working on projects like these:
The second part of the project deals with developing a distributed, persistent 'information cloud' in which people can find the information they want by querying the system. Such applications are known as peer-to-peer networking; well-known projects in this area are Napster and gNutella.
Since it is wise to modularize a programming exercise of this size, I have divided the system into several separate parts. These are discussed individually below.
The core of the system is the datasource each user has. It will be built from standard DBMS technology, so that SQL queries can be used. For the Windows platform a Microsoft Access database would probably be preferable, although some work on choosing the right rDBMS still has to be done.
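To give a first impression of how other modules could talk to such a datasource, the fragment below connects to a local ODBC datasource from C++ and runs a plain SQL query (it would be linked against odbc32.lib). The DSN name 'LocalDatasource' and the 'shared_items' table with its columns are only assumptions for the sake of the example; the real database model is designed later in the project.

    #include <windows.h>
    #include <sql.h>
    #include <sqlext.h>
    #include <stdio.h>

    int main()
    {
        SQLHENV env;  SQLHDBC dbc;  SQLHSTMT stmt;

        // Set up an ODBC environment and connect to the local datasource.
        SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
        SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);
        SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);
        SQLConnect(dbc, (SQLCHAR*)"LocalDatasource", SQL_NTS,   // hypothetical DSN
                   (SQLCHAR*)"", SQL_NTS, (SQLCHAR*)"", SQL_NTS);

        // Run a plain SQL query against the assumed 'shared_items' table.
        SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);
        SQLExecDirect(stmt,
            (SQLCHAR*)"SELECT filename, keywords FROM shared_items "
                      "WHERE keywords LIKE '%database%'", SQL_NTS);

        // Print every row of the result set.
        SQLCHAR filename[256], keywords[256];
        SQLLEN len;
        while (SQL_SUCCEEDED(SQLFetch(stmt))) {
            SQLGetData(stmt, 1, SQL_C_CHAR, filename, sizeof(filename), &len);
            SQLGetData(stmt, 2, SQL_C_CHAR, keywords, sizeof(keywords), &len);
            printf("%s  (%s)\n", (char*)filename, (char*)keywords);
        }

        SQLFreeHandle(SQL_HANDLE_STMT, stmt);
        SQLDisconnect(dbc);
        SQLFreeHandle(SQL_HANDLE_DBC, dbc);
        SQLFreeHandle(SQL_HANDLE_ENV, env);
        return 0;
    }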
To fill the database with useful information, spiders (a term borrowed from internet search-engine technology) can crawl through local files. A different spider should be developed for each specific file type, which makes a plug-in interface for these tools necessary.
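As a sketch of the simplest possible spider, the fragment below walks one directory with the Win32 file-finding functions and turns every filename into an SQL INSERT statement. The 'shared_items' table and its columns are again assumed for the example only.

    #include <windows.h>
    #include <stdio.h>
    #include <string>

    // Minimal filename spider: index every file in one directory.
    void IndexDirectory(const std::string& dir)
    {
        WIN32_FIND_DATAA found;
        HANDLE h = FindFirstFileA((dir + "\\*").c_str(), &found);
        if (h == INVALID_HANDLE_VALUE)
            return;

        do {
            if (!(found.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
                // In the real system this statement would be handed to the
                // datasource module instead of being printed.
                printf("INSERT INTO shared_items (filename, path) "
                       "VALUES ('%s', '%s');\n",
                       found.cFileName, dir.c_str());
            }
        } while (FindNextFileA(h, &found));

        FindClose(h);
    }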
The networking should rely on the TCP/IP protocol, since this is the most common networking protocol today. Many of the data-swapping and information-synchronisation techniques I would like to include can already be found in file-sharing utilities such as Napster, iMesh and gNutella. Since the last of these is open source, I will probably reuse a good deal of gNutella code for the data-sharing layer.
On top of this simple file-sharing layer an interface will be added that allows detailed querying of the current 'datacloud', based on the SQL used in the core.
You can see the way these modules connect below:
The datasource module can be seen as the core of the system: it contains all the information the system knows about. The datasource is built from a relational DBMS and a port to this system that acts as an information server. The port queries the local database using standard SQL, acting on behalf of the local user, of remote users, and of the programmed synchronisation of information in the information cloud.
The database should at least consist of these parts:
Graphically, the datasource is internally and externally connected as shown below:
The Distributed Networking Module is the port program described above. It will be the port between local and remote information and should be invisible from the user's point of view. Running as a background process, it allows or denies computers that try to join the network by connecting to this system, and it regularly updates values such as the number of connected computers and the number of items of information online. On behalf of incoming requests from other programs it can query the database and return information based on the results, or bounce the request to another known system.
The use of SQL is especially important for this part of the program. Since SQL can be 'understood' by a computer program, the port can base decisions on incoming queries, and deny, allow or alter those queries before actually running them. The port program should also have security features, such as blocking requests from users with insufficient access privileges and perhaps a blacklist based on prior behaviour.
Furthermore, the port program has to create standardized output based on the results it receives from the database.
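To make this filtering idea a little more concrete, here is a rough sketch of what a gatekeeper for incoming queries could look like. The rules shown (only SELECT statements, and no access to an assumed 'private_items' table for untrusted ports) are merely illustrations, not a worked-out security policy.

    #include <string>
    #include <algorithm>
    #include <cctype>

    // Decide whether an incoming SQL query from a remote port may be run.
    // 'trusted' would be derived from the group/blacklist administration.
    bool AllowRemoteQuery(std::string query, bool trusted)
    {
        // Compare case-insensitively.
        std::transform(query.begin(), query.end(), query.begin(), ::toupper);

        // Remote ports may only read; INSERT, UPDATE, DELETE etc. are refused.
        if (query.compare(0, 6, "SELECT") != 0)
            return false;

        // Untrusted ports may not touch the (assumed) table of protected data.
        if (!trusted && query.find("PRIVATE_ITEMS") != std::string::npos)
            return false;

        return true;
    }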
Although not strictly needed, it might be a good idea to add a grouping feature based on unique numbering. In this way a port can be a member of specific groups of friends, such as ports in the same building or ports whose users share a special interest. Belonging to a group allows protected data to be shared within it, while otherwise only public data is shared.
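As an illustration, a port could answer a remote request with a query like the one built below, which exposes public rows plus the rows belonging to the groups the requesting port is a member of. The 'shared_items' table, its 'group_id' column and the convention that group 0 means public are assumptions for the example.

    #include <string>
    #include <vector>

    // Build the visibility query for a remote port: public data (group_id = 0,
    // an assumed convention) plus everything owned by groups it belongs to.
    std::string VisibilityQuery(const std::vector<int>& groups)
    {
        std::string clause = "group_id = 0";
        for (size_t i = 0; i < groups.size(); ++i)
            clause += " OR group_id = " + std::to_string(groups[i]);
        return "SELECT filename FROM shared_items WHERE " + clause;
    }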
At the networking level the program should use the TCP/IP protocol with an HTTP layer on top. Creating an efficient data-exchange language directly on top of TCP is, in my opinion, too much work, and the HTTP layer allows for standardized and reliable networking.
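Assuming queries are simply wrapped in HTTP POST requests (the exact message format, the '/query' path, the port number and the result layout still have to be designed and are made up here), an exchange between two ports could look roughly like this:

    POST /query HTTP/1.0
    Host: remote-port.example.org:8001
    Content-Type: text/plain
    Content-Length: 72

    SELECT filename, keywords FROM shared_items WHERE keywords LIKE '%jazz%'

with a reply along the lines of:

    HTTP/1.0 200 OK
    Content-Type: text/plain

    concert_recording.txt | jazz, bigband, live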
The last part of the port consists of access points for spiders. The separation of port and spiders allows for competition in spider design and for extending the versatility of the system without having to change the port itself. I will have to look through the WinAmp, Netscape or Internet Explorer documentation when designing such a plug-in system.
Below is a graphical outline of the port:
Although the spiders connect to the port through a plug-in system, the spiders themselves are part of the port. A separate spider platform is used because third-party developers might want to create spiders specific to their own tasks if this project ever reaches the point where people actually use it.
Because of this design choice, a separate standard for these plug-ins should exist. For now, I am thinking of dynamically loaded libraries whose exported indexing function is called by the port program on a user-defined schedule. The output they generate naturally depends on the data layout of the database, so I believe it is too early to discuss it here.
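A first sketch of what loading and calling such a spider library could look like on Windows is given below; the exported function name 'RunSpider' and its signature are placeholders for whatever standard is eventually chosen.

    #include <windows.h>
    #include <stdio.h>

    // Assumed plug-in contract: every spider DLL exports
    //   int RunSpider(const char* directory);
    typedef int (*SpiderFunc)(const char* directory);

    int RunSpiderPlugin(const char* dllName, const char* directory)
    {
        HMODULE lib = LoadLibraryA(dllName);
        if (lib == NULL) {
            printf("Could not load spider %s\n", dllName);
            return -1;
        }

        SpiderFunc run = (SpiderFunc)GetProcAddress(lib, "RunSpider");
        int result = -1;
        if (run != NULL)
            result = run(directory);     // let the spider index the directory

        FreeLibrary(lib);
        return result;
    }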
The last module deals with the human-computer interface and should be a separate program that calls on the port program to obtain information. Separating the UI from the core leaves room for many different implementations based on user needs. For this module I propose the following:
Two UI programs, illustrating the diversity of the system and serving two different groups of users. The first program, which can also be used for debugging the port program, will be a simple SQL front-end: the user types in queries and receives views based on those queries, plain and simple.
The second program can then be a user-friendly front-end, presenting the standard queries the system can answer, such as a search based on file type and name. The direct SQL is thus made invisible to the user. The specific implementation of this front-end is unclear at this point and requires further research (I believe a course called 'Interfaces' is given at LIACS this semester, which I intend to take if it covers this subject).
The platform on which I would like to develop this project is Windows NT or Windows 2000. The reason is that I have no knowledge of computer platforms other than Windows and Unix, and of these two I find the first both more developer-friendly and more user-friendly.
The development package I would prefer is Microsoft Visual C++, based on previous experience with this system.
For a project of this size some planning is useful. Below is the timeline I propose:
Start date : end date (DD/MM) - planned work for this period

08/01 : 19/01 - Writing the project proposal.
20/01 : 02/02 - Requirements analysis and more precise specification of the different modules based on this research.
03/02 : 09/02 - Creating and documenting the database model, with standardized table and view definitions.
10/02 : 02/03 - Creating a local port with the spider plug-in system and the datasource connection in place.
03/03 : 09/03 - Creating a very basic user interface to test the local port.
10/03 : 30/03 - Updating the local port and adding remote-access facilities.
31/03 : 12/04 - Creating spiders for text and web-page indexing and for simple filesystem indexing based on filename.
12/04 - First demonstration version should be finished.
13/04 : 11/05 - Documenting the system so far and listing its shortcomings; altering the system based on this documentation.
12/05 : 01/06 - Creating a second user interface, with SQL abstraction.
02/06 : 15/06 - Reserved for delays and final documentation updates.
15/06 - Demonstration of the final program.