Description of the pipeline scripts that export Pub data to and from NCGR.

Overview
---

At the moment, we have a data pipeline that consists of a set of scripts in the pubsearch/maint/pipeline/scripts directory. Each script writes to standard output and is organized by table. The pipeline is run by a single driver program, 'pubsearch/maint/pipeline/cronable_send_pipeline.py', which conditionally executes each of the scripts, collects and packages the output, and uploads the resulting files to NCGR by FTP.

After NCGR processes the dump files, it posts a set of acknowledgement files to its FTP site. We periodically poll that site using the scripts in pubsearch/maint/pipeline/tairtopub/. Once we get acknowledgement of which records have been successfully processed, we mark the transferred table rows by updating the 'date_last_synchronized' attribute.

Details
---

The pipeline scripts currently bypass the Java database API in the pub.db package and instead work directly on a table-by-table basis using Perl DBI. They all output text files; most emit tab-delimited flat text, though some export XML. We keep a description of the output format for each script at:

    http://tesuque.stanford.edu/wiki/moin.cgi/PipelineFromPubToTair

Because many of the tables in Pub were originally derived from NCGR, there is an unspoken assumption that the type of each column maps cleanly to the appropriate field in the TAIR data model at:

    http://arabidopsis.org/search/schemas.html

This isn't always the case, but it's fairly close most of the time.

Each exportable row in Pub has an attribute called 'date_last_synchronized', and the criterion for sending a row off can be generalized to the SQL condition:

    select * from [some table]
    where date_last_updated > date_last_synchronized

assuming that any action that updates a record also touches the date_last_updated timestamp.
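The export criterion and the post-acknowledgement bookkeeping can be sketched as below. This is a minimal illustration against an in-memory SQLite table; the table name, its non-timestamp columns, and the two helper functions are hypothetical, and the real scripts use Perl DBI against the Pub schema. Note that a row that has never been synchronized would have a NULL timestamp, which the plain `>` comparison would miss, so the sketch checks for that case explicitly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reference (          -- hypothetical table name
        id INTEGER PRIMARY KEY,
        title TEXT,
        date_last_updated TEXT,
        date_last_synchronized TEXT
    )
""")
conn.executemany("INSERT INTO reference VALUES (?, ?, ?, ?)", [
    (1, "already exported",  "2004-01-01", "2004-02-01"),  # synced after update
    (2, "edited since sync", "2004-03-01", "2004-02-01"),  # due for export
    (3, "never synced",      "2004-01-15", None),          # due for export
])

def rows_to_export(conn):
    """A row is due when it was updated after its last sync, or never synced."""
    return conn.execute("""
        SELECT id, title FROM reference
        WHERE date_last_synchronized IS NULL
           OR date_last_updated > date_last_synchronized
    """).fetchall()

def mark_synchronized(conn, ids, sync_date):
    """After an acknowledgement file arrives, stamp the exported rows."""
    conn.executemany(
        "UPDATE reference SET date_last_synchronized = ? WHERE id = ?",
        [(sync_date, i) for i in ids],
    )

print(rows_to_export(conn))  # rows 2 and 3 are due
```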
We maintain a global list of the SQL queries that the pipeline scripts use to access the database, in pubsearch/maint/pipeline/scripts/all_sqls.new.txt. The naming of the file is historical: there used to be an older file called 'all_sqls.txt', and when we moved from an older version of the pub schema to the new one, we created a new SQL library and called it 'all_sqls.new.txt'. We try to limit our database access to the queries in that file, and have written a small Perl library, 'sql_library.pm', to look any of them up.
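For illustration, here is a rough Python analogue of what sql_library.pm provides: looking up named queries from a flat library file. The file layout shown here ('name:' header lines, SQL text up to a blank line) is an assumption for the sketch's sake, and the actual format of all_sqls.new.txt may differ:

```python
import io

# Assumed library format: a 'name:' line introduces each query,
# followed by its SQL, with a blank line between entries.
LIBRARY_TEXT = """\
get_unsync_references:
SELECT * FROM reference
WHERE date_last_updated > date_last_synchronized

get_unsync_people:
SELECT * FROM person
WHERE date_last_updated > date_last_synchronized
"""

def load_sql_library(fh):
    """Parse 'name:' headers followed by SQL text into a dict of queries."""
    library, name, buf = {}, None, []
    for line in fh:
        line = line.rstrip("\n")
        if line.endswith(":") and not line.startswith(" "):
            if name:
                library[name] = "\n".join(buf).strip()
            name, buf = line[:-1], []
        elif line or buf:
            buf.append(line)
    if name:
        library[name] = "\n".join(buf).strip()
    return library

sqls = load_sql_library(io.StringIO(LIBRARY_TEXT))
print(sqls["get_unsync_references"])
```

Keeping every query in one file like this makes it easy to audit exactly which SQL the pipeline runs against the database.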