PubSearch Literature Curation Tool (www.arabidopsis.org)

A member of the Generic Model Organism Database consortium:

    http://gmod.sourceforge.net/


Description
---

PubSearch is a web-based literature curation tool that lets curators
search for genes and annotate them with keywords from articles. It has
a simple MySQL database backend and uses a set of Java Servlets and JSPs
for querying, modifying, and adding gene, gene-annotation, and
literature information.

The official web site for the software is at:

    http://pubsearch.org


Installation
---

We've moved the installation instructions into a separate document.
Please see doc/install_guide.pdf.


Directory structure
---

The pubsearch directory structure contains the following:

    bin: Support scripts.

    jsp: Java Server Pages.

    data: Test data and sample bulk-loading files.

    var: Auxiliary data directory where PubSearch stores its log files and
    indices.

    WEB-INF: Java class directory, libraries, and sources.

    lib: Compile-time libraries that shouldn't be in the web application.


Bulk-loading data and maintenance
---

We've located several maintenance scripts in the 'bin/' directory.
They are Perl wrappers around Java code, with some glue to set the
Java CLASSPATH correctly.

    o  bin/add_curator.pl: adds a new curator user to the system.

    o  bin/index_full_text.pl: preprocesses the full text of articles,
       genes, and terms so that full-text search works properly.  Can be
       run from a cron job.  This is an intensive process; expect it to
       take a good thirty minutes for large document collections.

    o  bin/generate_hits.pl: preprocesses articles and terms, and generates
       hit associations between those objects.

    o  bin/bulk_load_articles.pl: bulk loads all of the articles in an
       XML file.

          We use a subset of the PubMed XML format:

              http://xml.coverpages.org/nlmXML.html

          bulk_load_articles.pl accepts an additional parameter,
          --add-pubsources, which automatically inserts proxy publication
          sources if necessary.

          A sample article XML file is provided in data/test
          (sample_pubmed_articles.xml).


    o  bin/fetch_articles.pl: retrieves a batch of articles from PubMed
       and Agricola.  Takes three arguments:

          [query_string]  The query string
          [from_date]     Start date, in the format yyyy/mm/dd
          [to_date]       End date, in the format yyyy/mm/dd


    o  bin/bulk_load_terms.pl: bulk loads all of the terms in an XML file.

          We use the Term format defined by the Gene Ontology:

              http://www.geneontology.org/

          A sample term file is provided in data/test (four_go_terms.xml).


    o  bin/bulk_load_genes.pl: bulk loads all of the genes in an XML file.

          We currently use a home-brewed format, defined at:

              http://tesuque.stanford.edu/~iris/gene.dtd

          A sample input file is 'data/test/one_gene.xml'.
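

As a rough illustration of the fetch_articles.pl arguments listed above,
here is a small Python sketch that checks both dates against the
yyyy/mm/dd format and assembles the command line.  It is not part of
PubSearch.  The helper names (parse_date, build_fetch_command) are
hypothetical, and the sketch assumes that the three arguments are
positional in the order listed, that the script is run from the
pubsearch directory, and that a start date later than the end date is a
mistake worth catching before the script runs.

```python
import datetime
import shlex

DATE_FORMAT = "%Y/%m/%d"  # the yyyy/mm/dd format described above


def parse_date(label, value):
    """Parse a yyyy/mm/dd string, raising ValueError with a clear message."""
    try:
        return datetime.datetime.strptime(value, DATE_FORMAT).date()
    except ValueError:
        raise ValueError(
            f"{label} must look like yyyy/mm/dd, got {value!r}"
        ) from None


def build_fetch_command(query, from_date, to_date,
                        script="bin/fetch_articles.pl"):
    """Return the argument list for one fetch_articles.pl run.

    Assumption: the three arguments are positional, in the order
    query_string, from_date, to_date.
    """
    start = parse_date("from_date", from_date)
    end = parse_date("to_date", to_date)
    # Our own sanity check, not something the script is known to enforce.
    if start > end:
        raise ValueError("from_date must not be later than to_date")
    return [script, query,
            start.strftime(DATE_FORMAT), end.strftime(DATE_FORMAT)]


if __name__ == "__main__":
    # Print the command instead of running it; hand the list to
    # subprocess.run() from the pubsearch directory to actually fetch.
    command = build_fetch_command("arabidopsis", "2004/01/01", "2004/01/31")
    print(shlex.join(command))
```

Building the command as a list rather than a single string keeps a
multi-word query string intact as one argument when it is passed to
subprocess.run().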