My Master's thesis (PDF) is entitled "Rapid statistical classification on the Medline database of biomedical literature". I was supervised by Prof. Cathal Seoighe.
My thesis studies techniques for fast, accurate classification of Medline records. It applies pre-processing techniques drawn from the information retrieval literature to the problem of machine learning and text classification, resulting in a Naive Bayes classifier that is thousands of times faster than the typical implementations in Weka and in the text classification literature. The main speedups come from using a compressed index of numerical features, numerical transformations of the Naive Bayes formula, and approximating the negative class with background frequencies.
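As a rough illustration of two of these ideas, and not the MScanner code itself, the sketch below assumes that features are integer IDs. It precomputes one log-odds score per feature, so that scoring a record reduces to a sum over the features it contains, and it uses feature counts over the whole collection in place of an explicit negative class. The function names, the smoothing parameter `alpha`, and the choice to ignore absent features are assumptions made for this example.

```python
import math
from collections import Counter


def feature_log_odds(pos_docs, background_counts, n_background, alpha=1.0):
    """Precompute a log-odds score for every feature ID.

    pos_docs: list of positive examples, each a list of integer feature IDs.
    background_counts: feature ID -> number of records in the whole
        collection containing it (stands in for the negative class).
    n_background: total number of records in the collection.
    alpha: add-alpha smoothing constant (assumed; plain Laplace when 1.0).
    """
    n_pos = len(pos_docs)
    # Count each feature at most once per positive document.
    pos_counts = Counter(f for doc in pos_docs for f in set(doc))
    scores = {}
    for feature, bg_count in background_counts.items():
        p_pos = (pos_counts.get(feature, 0) + alpha) / (n_pos + 2 * alpha)
        p_bg = (bg_count + alpha) / (n_background + 2 * alpha)
        scores[feature] = math.log(p_pos / p_bg)
    return scores


def score_document(doc_features, scores):
    """Score a record as the sum of log-odds of the features it contains."""
    return sum(scores.get(f, 0.0) for f in doc_features)


# Toy data: two positive examples and background counts over 100 records.
pos_docs = [[1, 2], [1, 3]]
background_counts = {1: 10, 2: 50, 3: 5, 4: 80}
scores = feature_log_odds(pos_docs, background_counts, n_background=100)

# Rank candidate records by descending score.
candidates = {"a": [1, 3], "b": [2, 4], "c": [4]}
ranking = sorted(candidates, key=lambda k: score_document(candidates[k], scores),
                 reverse=True)
```

Working in log space turns the product of probabilities in the Naive Bayes formula into a sum of precomputed numbers, which is what makes a single pass over millions of records cheap in a sketch like this.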
In the process, I also developed a novel improvement to the Laplace smoothing method of estimating feature frequencies for training a Naive Bayes classifier, which I call Split-Laplace Smoothing. This method unbiases the estimator from class skew, allowing one to use wildly unbalanced training data - such as a handful of positive examples and millions of negative examples - and still obtain good estimates.
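To show the kind of bias at stake, the sketch below applies plain add-one Laplace smoothing (not Split-Laplace Smoothing, whose formula is not reproduced here) to a heavily skewed training set. The function names and the toy counts are assumptions chosen for illustration.

```python
import math


def laplace_estimate(count, total, alpha=1.0):
    """Add-alpha estimate of the probability that a feature occurs in a document."""
    return (count + alpha) / (total + 2 * alpha)


def log_odds(pos_count, n_pos, neg_count, n_neg, alpha=1.0):
    """Log ratio of the smoothed positive and negative feature frequencies."""
    p_pos = laplace_estimate(pos_count, n_pos, alpha)
    p_neg = laplace_estimate(neg_count, n_neg, alpha)
    return math.log(p_pos / p_neg)


# A feature never seen in 5 positives, but seen in 1000 of 1,000,000 negatives.
skewed = log_odds(0, 5, 1000, 1_000_000)

# The same observations with equally sized classes.
balanced = log_odds(0, 1_000_000, 1000, 1_000_000)
```

With only five positive examples, the pseudocount dominates the positive estimate, so `skewed` comes out positive: a feature that never occurred in a positive example counts as evidence for the positive class. With equal class sizes the same feature gets a negative score, as one would expect.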
I built the MScanner web application around the algorithm and hosted it with collaborators at Stanford. The user submits a list of PubMed IDs for known abstracts of interest, and MScanner returns a page of other abstracts in Medline that are most relevant to the topics inferred from the input. MScanner takes about 90 seconds to process the 16 million articles in Medline (2008 figure) using a restricted feature set of MeSH terms, and 180 seconds when also using title and abstract terms.
The MScanner project is hosted on Google Code, and the source code is licensed under the GNU General Public License. The project pages contain information about the implementation and the web application.
MScanner: a classifier for retrieving Medline citations. BMC Bioinformatics 2008 9:108.
Poster presentation at the Intelligent Systems for Molecular Biology (ISMB) conference in Vienna, Austria (07/2007); oral presentation at the First South African Bioinformatics Workshop (01/2007).
Preliminary results were presented at the first South African Bioinformatics Workshop at Wits University in January 2007.
NRF (National Research Foundation) Scarce Skills Masters Scholarship
Please see the page about my M.Sc. Coursework.
Rapid statistical classification on the Medline database of biomedical literature by Graham L. Poulter is licensed under a Creative Commons Attribution 3.0 Unported License.