Plant Transcription Factor Database
Pipeline to construct comprehensive protein dataset
Species with genome annotation
Since version 3.0, we no longer construct a separate protein dataset for species whose genome annotation is available. For these species, protein sequences from the genome annotation are used directly, after filtering out putative pseudogenes (those with a * within the protein sequence).
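The pseudogene filter described above can be sketched in a few lines of Python. This is only an illustration, not the script used to build the database: the function names `parse_fasta`, `is_putative_pseudogene`, and `filter_pseudogenes` are hypothetical, and the sketch assumes that a single trailing * marks the normal stop codon and should not count as a * "within" the sequence.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_putative_pseudogene(seq):
    """Return True if the protein has a '*' inside the sequence.

    Assumption (not stated in the source): one trailing '*' is treated as
    the regular stop codon and is ignored.
    """
    body = seq[:-1] if seq.endswith("*") else seq
    return "*" in body


def filter_pseudogenes(records):
    """Keep only the records that do not look like pseudogenes."""
    return [(name, seq) for name, seq in records
            if not is_putative_pseudogene(seq)]


example = """>gene1
MKTAYIAKQR*
>gene2
MKT*YIAKQR
>gene3
MSLLTEVETP
""".splitlines()

kept = filter_pseudogenes(parse_fasta(example))
print([name for name, _ in kept])
```

In this toy input, gene2 carries an internal * and is dropped, while gene1 (terminal * only) and gene3 are kept.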
Species without genome annotation
For species whose genome sequences were not available, EST-based data from PlantGDB and UniGene were used as the main sources to construct the protein dataset (see datasource). The following steps were used to obtain a non-redundant protein dataset:
  1. Identify the coding sequence (CDS) and the corresponding peptide sequence with ESTScan, requiring CDS length >= 150 and score >= 200.
  2. Filter out proteins whose 'x' content is greater than 0.05.
  3. Cluster the proteins with blastclust (identity >= 0.95 and coverage >= 0.9); the resulting protein set is called PUset.
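Step 2 is the only step above that does not depend on an external tool, so it is the easiest to illustrate. The following is a minimal sketch, assuming that 'x' content means the fraction of residues in a peptide that are X (counted case-insensitively) and that a peptide at exactly 0.05 is kept; the function names `x_fraction` and `filter_by_x_content` are hypothetical and are not part of ESTScan or blastclust.

```python
def x_fraction(seq):
    """Fraction of residues that are 'X' (unknown), case-insensitive.

    Assumption: 'x' content in the pipeline description is this ratio.
    An empty sequence is reported as 0.0 to avoid dividing by zero.
    """
    if not seq:
        return 0.0
    return seq.upper().count("X") / len(seq)


def filter_by_x_content(records, max_fraction=0.05):
    """Keep (name, sequence) pairs whose X fraction is not greater than
    max_fraction, mirroring the 'greater than 0.05' rule in step 2."""
    return [(name, seq) for name, seq in records
            if x_fraction(seq) <= max_fraction]


peptides = [
    ("pep1", "MKTAYIAKQRQISFVKSHFS"),   # no X residues
    ("pep2", "MKTAYIAKQXQISFVKSHFS"),   # 1 X in 20 residues
    ("pep3", "MKXAYIAKQXQISFVKSHFS"),   # 2 X in 20 residues
]

kept = filter_by_x_content(peptides)
print([name for name, _ in kept])
```

Here pep3 exceeds the threshold and is removed; pep1 and pep2 would then go on to the clustering step.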
Pipeline for species with genome sequence