Math Web Search System Description
A more detailed description of MathWebSearch
MWS is a complete system capable of crawling, indexing and searching mathematical data. The standard structure of the system is shown below:
There are multiple nodes that can run crawlers, act as search servers, or both. The nodes running search servers have access to a filesystem that stores the indexed data. The data is loaded into memory only once (and reloaded if a change is detected), so access to the filesystem is rare. A node can also act as a meta search server, which only distributes the search query to other (meta) search servers and combines their results. The nodes running crawlers have access to at least the MySQL database server that contains the database to be updated. All (meta) search servers connect to a main node that runs a meta search server. This main node is also connected to the web server and the MySQL servers. This is where the search results are combined with data from the databases and stored in a sessions database. A daemon providing the described API also runs on this node. The web server communicates only with this daemon and sends data to the clients as needed. The job of this main node can easily be shared among multiple nodes.
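The fan-out and combine step of a meta search server can be illustrated with a minimal Python sketch. This is not the MWS implementation (the meta search server is written in Perl); the function `meta_search`, the stand-in backends `node_a`, `node_b`, `node_c`, and the choice to combine results by dropping duplicate URIs are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def meta_search(query, backends):
    """Send `query` to every backend and combine the hit lists.

    Each backend is any callable that takes a query string and returns a
    list of hit URIs, so a backend may itself be another meta search server.
    """
    if not backends:
        return []
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        # pool.map yields results in backend order, whichever finishes first.
        partial_results = list(pool.map(lambda backend: backend(query), backends))

    # Combine: keep the first occurrence of each URI (an assumed policy).
    combined, seen = [], set()
    for hits in partial_results:
        for uri in hits:
            if uri not in seen:
                seen.add(uri)
                combined.append(uri)
    return combined


# Hypothetical stand-ins for real search servers.
def node_a(query):
    return ["http://example.org/a#m1", "http://example.org/b#m2"]


def node_b(query):
    return ["http://example.org/b#m2", "http://example.org/c#m3"]


def node_c(query):
    return ["http://example.org/d#m4"]


def sub_meta(query):
    # A meta search server that is itself a backend of the main node.
    return meta_search(query, [node_b, node_c])


results = meta_search("plus(x,y)", [node_a, sub_meta])
```

Because a backend is just a callable, the main node does not need to know whether it is talking to a search server or to another meta search server, which mirrors the "(meta) search servers" wording above.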
The search routines and the indexing code are implemented in C++ and compiled separately as a library. The search routines are multi-threaded (pthread). The search server itself is a daemon that uses the library to answer queries. The meta search server and the crawlers are implemented in Perl. Communication between all components is done via TCP/IP.
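To make the daemon-over-TCP arrangement more concrete, here is a small Python sketch of a threaded search daemon and a client. The real wire protocol is not described here, so the one-line query format, the newline-separated reply, the in-memory `INDEX` dictionary, and the names `QueryHandler`, `SearchServer` and `ask` are assumptions chosen only for illustration.

```python
import socket
import socketserver

# Hypothetical in-memory index: term string -> list of references.
INDEX = {
    "plus(x,y)": ["http://example.org/doc1#element(/1/2)"],
}


class QueryHandler(socketserver.StreamRequestHandler):
    """Answer one query per connection (assumed protocol)."""

    def handle(self):
        # The client sends a single line containing the query term.
        query = self.rfile.readline().decode("utf-8").strip()
        hits = INDEX.get(query, [])
        # Reply with one hit per line; the connection is closed afterwards.
        self.wfile.write(("\n".join(hits) + "\n").encode("utf-8"))


class SearchServer(socketserver.ThreadingTCPServer):
    """Each connection is served in its own thread."""

    allow_reuse_address = True
    daemon_threads = True


def ask(host, port, query):
    """Send `query` to a search daemon and return the list of hits."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall((query + "\n").encode("utf-8"))
        data = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # server closed the connection: reply is complete
                break
            data += chunk
    return [line for line in data.decode("utf-8").splitlines() if line]
```

A server created with `SearchServer(("127.0.0.1", 0), QueryHandler)` and run with `serve_forever()` in a background thread can then be queried with `ask("127.0.0.1", port, "plus(x,y)")`, where `port` is taken from `server.server_address[1]`.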
The crawlers can browse through a page (or set of pages) starting from a seed (or set of seeds), or through OAI repositories, and extract data that can be indexed. Currently this data is OpenMath and MathML. The extracted XML is then converted to terms that the indexing code can parse. An XPointer reference is associated with each term. Currently, http://cnx.org and http://functions.wolfram.com are the only indexed websites.
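The extraction step can be sketched in Python as follows. The actual term syntax used by the indexing code is not given here, so the prefix notation `plus(x,1)`, the helper names `to_term` and `extract_terms`, and the use of the XPointer `element()` child-sequence scheme for the references are assumptions; the sketch also handles only a small subset of Content MathML.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def to_term(node):
    """Convert a (simplified) Content MathML node to a prefix term string."""
    name = local(node.tag)
    children = list(node)
    if name == "math":
        return ",".join(to_term(child) for child in children)
    if name == "apply" and children:
        head = to_term(children[0])
        args = ",".join(to_term(child) for child in children[1:])
        return f"{head}({args})"
    if name in ("ci", "cn", "csymbol"):
        return (node.text or "").strip()
    return name  # empty operator elements such as <plus/>


def extract_terms(xml_text):
    """Return (term, reference) pairs for every <math> element in a page."""
    root = ET.fromstring(xml_text)
    found = []

    def walk(node, path):
        if node.tag == f"{{{MATHML_NS}}}math":
            found.append((to_term(node), f"element({path})"))
            return
        # Child positions are 1-based and count element children only.
        for position, child in enumerate(node, start=1):
            walk(child, f"{path}/{position}")

    walk(root, "/1")
    return found


PAGE = """<html><body>
<p>Sum:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply><plus/><ci>x</ci><cn>1</cn></apply>
</math>
</body></html>"""

terms = extract_terms(PAGE)
```

For the sample page, `terms` is `[("plus(x,1)", "element(/1/1/2)")]`: the formula is the second element child of `body`, which is the first child of the root element. A crawler would store such pairs together with the page URL so that a hit can point back to the exact formula.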
The purpose of the admin server is to provide an interface for managing what code runs on which nodes. This component is not yet complete.