Resume

Objective 
In-depth study as a Ph.D. student in Computer Science or a related area, drawing on my research and hands-on experience.

Research Interests
My main areas are HPC, Cloud Computing, System Architecture, Data Mining and Bioinformatics. In particular, my research interests focus on new theories and technologies for Data Center and HPC infrastructure, including Data-intensive and Datacenter-Scale Computing, Modeling Supercomputer and Scientific Applications, the MapReduce Framework, Distributed File Systems, High Performance Networks, Real-time Stream Data Mining, Data Center Virtualization, Cyberinfrastructure for e-Science and Next Generation Sequencing.

Education
Selected Projects 
  • Eukaryotic Automated Structural Annotation Workflow - pic
    Designed and implemented an automated genome structural annotation pipeline built on the Kepler workflow system and an HPC platform, including stages for transcript QA, repeat analysis, transcript mapping, gene prediction, combined consensus prediction and EST-based refinement.
    Tools: Kepler, SeqClean, RepeatMasker, PASA, Augustus, GeneMark, EVidenceModeler, Java, Perl

  • The Glycan Array QSAR Tool - website  
    Designed and implemented an online quantitative structure-activity relationship (QSAR) tool to analyze glycan array data based on the regression coefficients of PLS. The tool reports the sub-trees that determine the binding specificities of glycan-binding proteins.
    Tools: Drupal, Webform, MATLAB, PHP
  • Cotton Marker Database - website
    Designed and implemented two new features for the CMD website: Train View and Primer Redundancy Analysis; Applied Xen and KVM virtualization to the CMD system platform; Ported all computational and data-intensive jobs to the Clemson Palmetto HPC; Developed a suite of web-based bioinformatics and data mining tools so that biological scientists without a background in computer science can easily handle big sequence data. This project is funded by Cotton Incorporated, and the website has been accessed by users from 101 countries and 48 U.S. states.
    Tools: mpiBLAST, mpiFASTA, BioPerl, CMap, CAP3, MySQL, PostgreSQL, Perl, CPAN, SQL, Bash, HTML, JavaScript

  • Transposable Element De Novo Annotation - poster
    Deployed two bioinformatics pipelines (TEdenovo and TEannot) on an HPC platform based on Sun Grid Engine (SGE); Detected, annotated and analyzed transposable element (TE) repeats in genomic sequences. Collaborated with Dr. Véronique Decroocq (Project Coordinator, INRA, France).
    Tools: REPET, NCBI-BLAST, WU-BLAST, RECON, PILER, CENSOR, RepeatMasker, TRF, Mreps, hmmer3, SGE, Perl, Python

  • Improvement of SSRs Redundancy Identification by Machine Learning Approach - poster
    Improved the accuracy of SSR (microsatellite) redundancy detection and reduced the cost of expert intervention in polymorphism discovery using a Support Vector Machine (SVM) approach, for the Cotton Marker Database project (for Dr. Anna Blenda, PI, Cotton Marker Database) and the genetic map integration project (for Dr. Jean-Marc Lacape, Project Coordinator, UMR-DAP, France).
    Tools: LIBSVM, Weka, Perl
  • FHI Chinese Chestnut Assembly - website
    Deployed different assembler programs on the Clemson Palmetto HPC based on the MPI and MapReduce parallel models; Resolved performance and overflow problems caused by the big data; Optimized the I/O performance between the bioinformatics programs and the PVFS file system; Assembled the whole genome sequence of Chinese Chestnut using NGS data (20x 454 data, 47x Illumina data and limited Sanger sequences); Worked as an HPC specialist with Dr. Meg Staton (Bioinformaticist, Clemson University Genomics Institute, USA).
    Tools: Celera Assembler, ABySS, Velvet, SOAPdenovo, Newbler, Quake
  • VoIP Assessment Tool Based on Autonomous NAT Traversal (Course Project) - report, slides
    Designed and deployed an experimental platform for VoIP assessment to capture the different behaviors of the NAT traversal protocol; Evaluated and analyzed VoIP voice quality based on observed transport measurements.
    Tools: Iperf, tc, Netfilter, C, Bash

  • Hybrid Parallel Algorithms for Constructing bi-directed de Bruijn Graphs (Course Project) - report
    Designed and implemented a set of NGS sequence assembly algorithms based on the bi-directed de Bruijn graph using the MapReduce programming framework; Evaluated the parallel performance with different numbers of computing nodes and in two different network environments; Designed a solution to enable Hadoop to run on a traditional HPC platform.
    Tools: Hadoop streaming, Perl, Java, Bash, PBS Script 

  • Compiler Design (Course Project)
    Designed and implemented a compiler using C and Intel x86 assembly language, with support for integer and real numbers, procedures, recursion, matrices, integer arithmetic, parameter passing, variable scoping, etc.
    Tools: C 
  • Introduction to the HPC Toolkit OSCAR (Course Project) - slides
    Gave an introduction to designing, building and maintaining an HPC cluster using the OSCAR toolkit.
Work and Research Experiences
Other Certifications / Professional Training 
  • Microsoft Certified Systems Engineer (MCSE) Certification, 2002
  • Microsoft Certified Database Administrator (MCDBA) Certification, 2002
  • Cisco Certified Network Associate (CCNA) Training Courses, 2003
  • Oracle 9i DBA/OCP Training Courses, 2005
Skills
  • Operating Systems: Windows Server (12 years), Linux Server (11 years)
  • Languages: C, Java, Perl, PHP, SPARC Assembly (Learning: Go, Scala)
  • Parallelism Models: MapReduce, MPI and Multithreading
  • Platforms/APIs: Hadoop, Kepler, GWT, Drupal, GMOD, LabVIEW, MATLAB and Coloured Petri Nets
  • Development Tools: Vim, Eclipse, GNU GCC/GDB, Intel Cluster Toolkit 
Papers/Posters
Pengfei Xuan, David Camak, Feng Luo, Don Jones, and Anna Blenda. CMD: a cotton marker database resource for Gossypium genetics and genomics research. 2011. (paper in preparation)

Pengfei Xuan, Yuehua Zhang, Tzuen-rong Jeremy Tzeng, Xiu-Feng Wan, Feng Luo. A quantitative structure-activity relationship (QSAR) study on glycan array data to determine the specificities of glycan-binding proteins. Oxford Journals Glycobiology. 2011. (paper)

Pengfei Xuan, Justin Bartanus, David Camak, Feng Luo, Don C. Jones, Anna Blenda. Recent Updates of the Cotton Marker Database (CMD). Plant & Animal Genomes XX Conference. 2012. (poster)

Pengfei Xuan, Feng Luo, Albert Abbott, Don Jones, Anna Blenda. Improvement of SSR redundancy identification with machine learning approach using dataset from cotton marker database. Plant & Animal Genomes XIX Conference. 2011. (poster)

Justin Bartanus, Pengfei Xuan, Anna Blenda. Collection, Annotation and Public Database Display of the Agronomically Important Traits and QTLs in Cotton. ASB Tri-Beta. 2011. (poster)

David Camak, Pengfei Xuan, Anna Blenda. Analysis of the Phenotypic Traits in Cotton Linked to the Genetically Mapped Molecular Markers. ASB Tri-Beta. 2010. (poster)

V. Decroocq, Pengfei Xuan, T. Zhebentyayeva, S. Scalabrin, I. Verde, B. Sosinski, A. Abbott. Transposable element annotation and the development of insertion site-based polymorphism markers in Prunus species. 5th International Rosaceae Genomics Conference (RGC5). 2010. (poster)

Lingyun Zhu, Xia Yang, Yuanping Yi, Pengfei Xuan, Zhigang Shuai, Jean-Luc Brédas. Three-photon absorption in anthracene-porphyrin-anthracene triads: A quantum-chemical study. The Journal of Chemical Physics. 2004, 121:11060. (paper)

Shiwei Yin, Liping Chen, Pengfei Xuan, Ke-Qiu Chen, Zhigang Shuai. Field effect on the singlet and triplet exciton formation in organic/polymeric light-emitting diodes. The Journal of Physical Chemistry B. 2004, 108(28):9608-13. (paper)