Description of the pipeline scripts that export Pub data to and from NCGR.

Overview
---

At the moment, we have a data pipeline that consists of a set of scripts in the pubsearch/maint/pipeline/scripts directory. Each script writes to standard output and is organized by table. The pipeline is run by a single driver program, 'pubsearch/maint/pipeline/cronable_send_pipeline.py', which conditionally executes each of the scripts, collects and packages the output, and uploads the resulting files to NCGR by FTP.

After NCGR processes the dump files, it posts a set of acknowledgement files to its FTP site. We periodically poll that site using the scripts in pubsearch/maint/pipeline/tairtopub/. Once we get acknowledgement of which records have been successfully processed, we mark the transferred table rows by updating the 'date_last_synchronized' attribute.

Details
---

The pipeline scripts currently bypass the Java database API in the pub.db package and instead work directly on a table-by-table basis using Perl DBI. They all output text files; most emit tab-delimited flat text, though some export XML. We keep a description of the output format for each script at:

    http://tesuque.stanford.edu/wiki/moin.cgi/PipelineFromPubToTair

Because many of the tables in Pub were originally derived from NCGR, there is an unspoken assumption that the type of each column maps cleanly to the appropriate field in the TAIR data model at:

    http://arabidopsis.org/search/schemas.html

This isn't always the case, but it's fairly close most of the time.

Each exportable row in Pub has an attribute called 'date_last_synchronized', and the criterion for sending a row off can be generalized to the SQL condition:

    select * from [some table]
    where date_last_updated > date_last_synchronized

assuming that any action that updates a record also touches the date_last_updated timestamp.
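The export criterion and the post-acknowledgement bookkeeping can be sketched as below. This is a minimal illustration against an in-memory SQLite table; the table name, its non-timestamp columns, and the two helper functions are hypothetical, and the real scripts use Perl DBI against the Pub schema. Note that a row that has never been synchronized would have a NULL timestamp, which the plain `>` comparison would miss, so the sketch checks for that case explicitly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reference (          -- hypothetical table name
        id INTEGER PRIMARY KEY,
        title TEXT,
        date_last_updated TEXT,
        date_last_synchronized TEXT
    )
""")
conn.executemany("INSERT INTO reference VALUES (?, ?, ?, ?)", [
    (1, "already exported",  "2004-01-01", "2004-02-01"),  # synced after update
    (2, "edited since sync", "2004-03-01", "2004-02-01"),  # due for export
    (3, "never synced",      "2004-01-15", None),          # due for export
])

def rows_to_export(conn):
    """A row is due when it was updated after its last sync, or never synced."""
    return conn.execute("""
        SELECT id, title FROM reference
        WHERE date_last_synchronized IS NULL
           OR date_last_updated > date_last_synchronized
    """).fetchall()

def mark_synchronized(conn, ids, sync_date):
    """After an acknowledgement file arrives, stamp the exported rows."""
    conn.executemany(
        "UPDATE reference SET date_last_synchronized = ? WHERE id = ?",
        [(sync_date, i) for i in ids],
    )

print(rows_to_export(conn))  # rows 2 and 3 are due
```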
We maintain a global list of the SQL queries that the pipeline scripts use to access the database, in pubsearch/maint/pipeline/scripts/all_sqls.new.txt. The naming of the file is historical: there used to be an older file called 'all_sqls.txt', and when we moved from an older version of the pub schema to the new one, we created a new SQL library and called it 'all_sqls.new.txt'. We try to limit our database access to the queries in that file, and have written a small Perl library, 'sql_library.pm', to look any of them up.
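For illustration, here is a rough Python analogue of what sql_library.pm provides: looking up named queries from a flat library file. The file layout shown here ('name:' header lines, SQL text up to a blank line) is an assumption for the sketch's sake, and the actual format of all_sqls.new.txt may differ:

```python
import io

# Assumed library format: a 'name:' line introduces each query,
# followed by its SQL, with a blank line between entries.
LIBRARY_TEXT = """\
get_unsync_references:
SELECT * FROM reference
WHERE date_last_updated > date_last_synchronized

get_unsync_people:
SELECT * FROM person
WHERE date_last_updated > date_last_synchronized
"""

def load_sql_library(fh):
    """Parse 'name:' headers followed by SQL text into a dict of queries."""
    library, name, buf = {}, None, []
    for line in fh:
        line = line.rstrip("\n")
        if line.endswith(":") and not line.startswith(" "):
            if name:
                library[name] = "\n".join(buf).strip()
            name, buf = line[:-1], []
        elif line or buf:
            buf.append(line)
    if name:
        library[name] = "\n".join(buf).strip()
    return library

sqls = load_sql_library(io.StringIO(LIBRARY_TEXT))
print(sqls["get_unsync_references"])
```

Keeping every query in one file like this makes it easy to audit exactly which SQL the pipeline runs against the database.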