PubSearch Literature Curation Tool (www.arabidopsis.org)

A member of the Generic Model Organism Database consortium:

    http://gmod.sourceforge.net/


Description
---

PubSearch is a web-based literature curation tool that lets curators
search articles and annotate genes with keywords from them.  It has a
simple MySQL database backend and uses a set of Java Servlets and JSPs
for querying, modifying, and adding gene, gene-annotation, and
literature information.

The official web site for the software is at:

    http://pubsearch.org


Installation
---

We've moved the installation instructions into a separate document.
Please see doc/install_guide.pdf.


Directory structure
---

The pubsearch directory structure contains the following:

    bin:      support scripts.
    jsp:      Java Server Pages.
    data:     test data and sample bulk-loading files.
    var:      auxiliary data directory where PubSearch stores its log
              files and indices.
    WEB-INF:  Java class directory, libraries, and sources.
    lib:      compile-time libraries that shouldn't be in the web
              application.


Bulk-loading data and maintenance
---

We've located several maintenance scripts in the 'bin/' directory.
They are Perl wrappers around Java code, with some glue to set the
Java CLASSPATH correctly.

o bin/add_curator.pl: adds a new curator user to the system.

o bin/index_full_text.pl: preprocesses the full text of articles,
  genes, and terms so that full-text search works properly.  Can be
  run from a cron job.  This is an intensive process; expect it to
  take at least thirty minutes for large document collections.

o bin/generate_hits.pl: preprocesses articles and terms, and generates
  hit associations between those objects.

o bin/bulk_load_articles.pl: bulk loads all of the articles in an XML
  file.  We use a subset of the PubMed XML format:

      http://xml.coverpages.org/nlmXML.html

  bulk_load_articles.pl accepts an additional parameter,
  --add-pubsources, which automatically inserts proxy publication
  sources if necessary.
  A sample article XML file is provided in data/test
  (sample_pubmed_articles.xml).

o bin/fetch_articles.pl: retrieves a batch of articles from PubMed and
  Agricola.  Takes three arguments:

      [query_string]  The query string
      [from_date]     Date in the format yyyy/mm/dd
      [to_date]       Date in the format yyyy/mm/dd

o bin/bulk_load_terms.pl: bulk loads all of the terms in an XML file.
  We use the Term format defined by the Gene Ontology:

      http://www.geneontology.org/

  There are sample term files in data/test (four_go_terms.xml).

o bin/bulk_load_genes.pl: bulk loads all of the genes in an XML file.
  We currently use a homebrewed format, defined at:

      http://tesuque.stanford.edu/~iris/gene.dtd

  A sample input file is 'data/test/one_gene.xml'.
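
Example: preparing a bin/fetch_articles.pl call
---

Because bin/fetch_articles.pl expects both dates as yyyy/mm/dd, it can
help to check them before launching the script.  The following Python
sketch is not part of PubSearch.  The helper name
fetch_articles_command, the choice to invoke the script through
'perl', and the assumption that it is run from the top of the
pubsearch directory are all illustrative; only the three arguments and
the date format come from the description above.

```python
import datetime
import re
import shlex

# Dates must look like yyyy/mm/dd, with zero-padded month and day.
DATE_PATTERN = re.compile(r"\d{4}/\d{2}/\d{2}")


def fetch_articles_command(query, from_date, to_date,
                           script="bin/fetch_articles.pl"):
    """Return an argument list for fetch_articles.pl (hypothetical helper).

    Raises ValueError if either date is not a real calendar date
    written as yyyy/mm/dd.
    """
    for label, value in (("from_date", from_date), ("to_date", to_date)):
        if not DATE_PATTERN.fullmatch(value):
            raise ValueError(f"{label} must be yyyy/mm/dd, got {value!r}")
        try:
            # Rejects impossible dates such as 2004/02/31.
            datetime.datetime.strptime(value, "%Y/%m/%d")
        except ValueError:
            raise ValueError(f"{label} is not a valid date: {value!r}")
    # Assumption: the wrapper is started with 'perl' from the pubsearch
    # root, and takes query_string, from_date, to_date in that order.
    return ["perl", script, query, from_date, to_date]


if __name__ == "__main__":
    command = fetch_articles_command("arabidopsis",
                                     "2004/01/01", "2004/01/31")
    # Print a shell-safe command line instead of running it.
    print(shlex.join(command))
```

The sketch only prints the command line.  To run the script for real,
the returned list could be passed to subprocess.run on a machine where
PubSearch is installed.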