For the second-level categories I created two plug-ins. The .mp3-specific plug-in reads in-file tags written in a de facto standard format, if they are present. The Google-querying plug-in tries to place files into a directory tree modeled on the Open Directory Project, based on results from earlier stages. I will discuss their results separately.
Quite a few mp3 files found on the internet contain metadata within the file itself. Because audio in the mp3 format is encoded as a stream, extra information can be added almost anywhere in the file. This metadata usually contains the original artist, title and album name, giving us plenty of input for further automatic indexing. The percentage of successfully categorized entries that contained such formatted metadata approached 100%. Indexing failed only when the artist field did not reference a single artist, or when that artist was not known to the web directory. All in all, we can conclude that writing plug-ins to read out format-specific metadata greatly increases the chances of success. This should not come as a surprise.
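The de facto standard alluded to here is presumably the ID3 tag family; as a minimal sketch (not the plug-in's actual code), an ID3v1 tag is simply the last 128 bytes of the file, beginning with the marker "TAG", and can be read without any library support:

```python
def read_id3v1(data: bytes):
    """Parse an ID3v1 tag from raw file bytes.

    The tag, if present, occupies the final 128 bytes of an .mp3 file
    and starts with the ASCII marker "TAG". Returns a dict or None.
    """
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return None

    def field(raw: bytes) -> str:
        # Fields are fixed-width, padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title":  field(tag[3:33]),
        "artist": field(tag[33:63]),
        "album":  field(tag[63:93]),
        "year":   field(tag[93:97]),
    }

# Demo with a fabricated file: some stand-in audio frames followed
# by a hand-built ID3v1 tag (3 + 30 + 30 + 30 + 4 + 31 = 128 bytes).
fake = b"\xff\xfb" * 50
fake += (b"TAG"
         + b"Some Title".ljust(30, b"\x00")
         + b"Some Artist".ljust(30, b"\x00")
         + b"Some Album".ljust(30, b"\x00")
         + b"1999"
         + b"\x00" * 31)
print(read_id3v1(fake)["artist"])  # → Some Artist
```

The artist field read this way is exactly the kind of high-quality input that lets the web directory lookup succeed nearly every time.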
The Google indexer creates keywords from the metadata gathered so far and simply queries the Google directory through its web interface. Whether this returns the correct subtree of the directory depends on two factors: the keywords available and the selection process we apply to them. The first factor cannot be optimized, since it depends on the input from the other modules; we can only make sure that the Google plug-in is called after all other (non-web) plug-ins. Selecting the right keywords, however, makes a large difference. Since most raw data is still taken from the filesystem, the file path can be seen as the primary input. When other input is available a spectacular increase in successful results can be noted, but this is rarely the case. Finding correct keywords therefore consists mostly of breaking the file path and name down into subsections and selecting those that are most probably of interest to us. The selection algorithm I used is based on some elementary heuristics, for instance that the first part of a filename describes the name of the author or artist. These hardcoded heuristics return very good results when the user has named his files as expected. Unfortunately, this is often not the case: many times the original artist of a set of files appears only in the directory name, and many files, business documents for instance, cannot be traced to a single artist at all.
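A hypothetical sketch of such hardcoded heuristics (the function name and separator choices are my assumptions, not the original code) might split the path on common separators and rank the first filename segment as the likely artist, with enclosing directory names as secondary candidates:

```python
import os
import re

def candidate_keywords(path: str) -> list[str]:
    """Break a file path into candidate keywords for a directory lookup.

    Elementary heuristics: the filename is split on separators such as
    " - " and "_", with its first segment assumed to name the artist;
    parent directory names follow, innermost first.
    """
    stem, _ext = os.path.splitext(os.path.basename(path))
    # Filename segments, e.g. "Queen - Bohemian Rhapsody" -> two parts.
    parts = [p.strip() for p in re.split(r"\s*-\s*|_+", stem) if p.strip()]
    # Directory names, innermost first (often the artist or album).
    dirs = [d for d in os.path.dirname(path).split(os.sep) if d]
    return parts + dirs[::-1]

print(candidate_keywords("music/Queen/Greatest Hits/Queen - Bohemian Rhapsody.mp3"))
```

The ordering encodes the priority used when only a few keywords can be sent to the directory; as noted above, it works well only for users who actually follow this naming convention.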
I can easily say that the test program categorized my audio and video collection very well. Then again, I programmed the system to read filenames in exactly this fashion; extrapolating these results to the general case is a lot harder. One way of doing so would be to ask for minimal user input, for instance letting the user specify which convention he uses when naming files and directories. Another solution would be to simply try multiple combinations of keywords against the Google directory. I tried this during testing, but removed it because of the very slow response time: each lookup cost about one second, so doing multiple lookups per file proved impractical. Google has recently released a web-services API for accessing the system, which could speed up the process and make this a viable solution. A third option would be to exploit the locality of files: if one file in a directory is indexed successfully, this information could be used for the others as well. Basically, all these small heuristics boil down to the same idea. To successfully extract information from a file's path and name, the programmer must somehow reverse engineer the user's naming scheme. Many schemes can be found that will work for some set of files, but only human interaction will consistently help us select the most useful data. I therefore suggest using a learning algorithm where users can give feedback on the acquired results. This way minimal intervention is needed, while adaptation to a personal naming scheme remains possible.
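The locality idea can be sketched as a simple post-processing pass (a minimal illustration under my own assumptions, not the thesis code): files whose lookup failed inherit the most common category among successfully indexed files in the same directory.

```python
import os
from collections import Counter

def propagate_by_locality(results):
    """Fill in failed lookups using directory locality.

    `results` maps file path -> category string, or None where the
    directory lookup failed. Failed entries inherit the most common
    category found among sibling files in the same directory.
    """
    # Tally successful categories per directory.
    per_dir = {}
    for path, category in results.items():
        if category is not None:
            per_dir.setdefault(os.path.dirname(path), Counter())[category] += 1

    filled = {}
    for path, category in results.items():
        if category is None:
            counts = per_dir.get(os.path.dirname(path))
            if counts:
                category = counts.most_common(1)[0][0]
        filled[path] = category
    return filled

demo = {
    "music/queen/track01.mp3": "Arts/Music/Rock",
    "music/queen/track02.mp3": None,  # lookup failed for this file
    "music/queen/track03.mp3": "Arts/Music/Rock",
}
print(propagate_by_locality(demo)["music/queen/track02.mp3"])  # → Arts/Music/Rock
```

This costs no extra web lookups, which matters given the one-second latency per query mentioned above; a learning variant could additionally let user corrections adjust the per-directory tallies.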