Data Mining. A. B.

Data Mining. (Section A)

 

124A. Genomewide Analysis of Bkm Sequences (GATA repeats): Predominant Association with Sex Chromosomes and Potential Role in Higher Order Chromatin Organization and Function.

125A. Hierarchical Machine Learning for Characterising Protein Families.

126A. In silico Comparison of the Transcriptome Derived from Purified Normal Breast Cells and Breast Tumor Cell Lines Reveals Candidate Upregulated Genes in Breast Tumor Cells.

127A. Extraction and Dynamic View of Biomolecular Interactions in a Large Biomedical Text Database.

128A. Mining the literature for enzyme-disease associations.

129A. Search for Gene Regulatory cis-Elements in Arabidopsis thaliana.

130A. Semantic Similarity Measures Across the Gene Ontology: Relating Sequence to Annotation.

131A. Patterns, Pairings and Predictions of Catalytic DNA.

132A. GIMS: a Data Warehouse for Management and Analysis of Complex Biological Data.

133A. A Simple Statistical Test for Evaluating Differences between Database Retrieval Methods.

134A. Proteome Databases: An Information Source for Bacterial Immunology.

135A. RED: a Web-based System for the Analysis, Management, and Dissemination of Expressed Sequence Tags.

136A. Searching Microarray Time Series Data for Yeast Cell-Cycle Regulatory Genes.

137A. Application of Relational Database Tools for the Analysis of Large Proteomic Data Sets from Tandem Mass Spectrometry.

138A. Hierarchical Cluster Analysis and Classification of SAGE data.

139A. GEA: a Toolkit for Gene Expression Analysis.

140A. A Method for Detecting Protein-Protein Interaction Rules.

141A. G-language Genome Analysis Environment.

142A. Data Handling for Detailed Phenotypic Characterization of Novel Mouse Phenotypes.

143A. Willo and Wisp: Data Management Systems for Mouse Genome Mapping and Sequencing.

144A. Novel Opportunities and Challenges in the Human Proteome: A Bioinformatics Strategy to Identify Splice Variants of Druggable Gene Targets.

145A. DrugBank: An Integrated Database for Drug Discovery and Pharmacogenomics.

146A. The Genomics Unified Schema (GUS).

147A. Compensation for Nucleotide Bias in a Genome by Representation as a Discrete Channel with Noise.

148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?

149A. The CyberCell Database (CCDB).

150A. Functional Database System of Olfactory Receptors.

151A. A Standard Corpus for Evaluating Extraction of Molecular Interaction Pathway Information from Scientific Abstracts.

152A. A First Study of the Central Role of the Analyst in the Knowledge Discovery Process in Biology.

153A. Assessing the Compactness and Isolation of Individual Clusters Observed in Microarray Data.

154A. An Amino Acid Centered Database to Facilitate Protein Crystallisation.

155A. In silico reconstruction of metabolic network from unannotated raw genome sequences

156A. AFLP® Nucleotide Sequence Quality Assessment and Improvement Tool.

157A(i). Schema Mapping and Data Integration with Clio.

157A(ii). The GENIA Corpus: an Annotated Corpus in Molecular Biology Domain.

 

124A. Genomewide Analysis of Bkm Sequences (GATA repeats): Predominant Association with Sex Chromosomes and Potential Role in Higher Order Chromatin Organization and Function.

Subbaya Subramanian, R.K. Mishra and L. Singh. Centre for Cellular and Molecular Biology, W413 CCMB, Uppal Road, Hyderabad, Andhra Pradesh, 500007, India.

subree@gene.ccmbindia.org

 

Genomewide analysis revealed that GATA repeats are absent in prokaryotes and have gradually accumulated in higher organisms during the course of evolution. In humans, the Y chromosome has the highest GATA repeat density, concentrated predominantly in the Yq pericentric region. GATA repeats along the Y chromosome, and their close proximity to Matrix Associated Regions (GATA-MAR), may demarcate chromatin domains.


 

 

125A. Hierarchical Machine Learning for Characterising Protein Families.

Aik Choon Tan and David Gilbert. Bioinformatics Research Centre, Department of Computer Science, University of Glasgow, Glasgow, U.K.

actan@brc.dcs.gla.ac.uk

 

The aim of this research is to develop a novel approach for inducing comprehensive patterns from various data sources using knowledge discovery and hierarchical machine learning. We have applied this technique to characterise several protein families; our classifiers show higher accuracy and are more informative than conventional methods.


 

 

126A. In silico Comparison of the Transcriptome Derived from Purified Normal Breast Cells and Breast Tumor Cell Lines Reveals Candidate Upregulated Genes in Breast Tumor Cells.

Leerkes MR, Caballero OL, Mackay A, Torloni H, O'Hare MJ, Simpson AJ, and de Souza SJ. Ludwig Institute for Cancer Research, Rua Prof. Antonio Prudente, 109, 4 andar, Sao Paulo, SP, CEP 01509-010, Brazil.

leerkes@compbio.ludwig.org.br

 

We report here the combined use of ORESTES sequences generated in the FAPESP/LICR Human Cancer Genome Project and information available in the UniGene and SAGE databases to characterize the transcriptome of normal and breast tumor cells. We have identified 154 genes as candidates for overexpression in breast tumor cells.


 

 

127A. Extraction and Dynamic View of Biomolecular Interactions in a Large Biomedical Text Database.

Yoshihiro Ohta1 and Shigeo Ihara2. 1Hitachi Central Research Laboratory and 2Research Center for Advanced Science and Technology, University of Tokyo.

yoh@crl.hitachi.co.jp

 

We constructed a biomolecular interaction detection system that can cope with the recent massive increase in the molecular biology literature. We comprehensively considered every necessary element: large-scale dictionary construction, biomolecular name detection, interaction detection and an effective network-viewer user interface. With these elements, our system can extract over 550,000 interactions.


 

 

128A. Mining the literature for enzyme-disease associations.

Hofmann O. and Schomburg D. Department of Biochemistry, University of Cologne, Germany.

o.hofmann@smail.uni-koeln.de

 

A network of enzyme and disease correlations was built by automatically extracting relevant information from the abstracts of biomedical literature. The concept-based data and implemented visualization techniques allow easy navigation by researchers to explore knowledge available in literature databases and develop new theories.


 

 

129A. Search for Gene Regulatory cis-Elements in Arabidopsis thaliana.

Judith Lucia Gomez, Ingo Dreyer and Bernd Mueller-Roeber. University of Potsdam, Institute for Biochemistry and Biology, Dept. Molecular Biology, Karl-Liebknechtstrasse 24/25, Haus 20, 14476 Golm, Germany.

jgomez@rz.uni-potsdam.de

 

The regulation of gene expression in plants is thought to result from the binding of different sets of transcription factors to promoter cis-elements. We tested HMM-based methods to search the model plant Arabidopsis thaliana for target genes harbouring putative transcription factor binding sites in their promoter regions.


 

 

130A. Semantic Similarity Measures Across the Gene Ontology: Relating Sequence to Annotation.

P.W. Lord, R.D. Stevens, A. Brass and C.A. Goble. Dept. of Computer Science, Manchester University.

p.lord@russet.org.uk

 

The Gene Ontology (GO) represents knowledge of a gene product's function, process and location in a computationally amenable form. We present metrics for measuring the similarity between GO terms, and therefore the semantic similarity of gene products annotated with them. We validate these metrics by comparing them with measures of sequence similarity, and show several uses for the measure.


 

 

131A. Patterns, Pairings and Predictions of Catalytic DNA.

Gopinath Ganji 1,2, Yingfu Li 3, T. Chiang 2 and A. Jamie Cuticchia 1. 1 Department of Medical Biophysics, University of Toronto, 610 University Avenue, Toronto, Ontario, CANADA M5G 2M9, 2Center for Computational Biology, Hospital for Sick Children, 555 University Avenue, Toronto, Ontario, CANADA M5G 1X8, 3Department of Biochemistry and Department of Chemistry, McMaster University, 1200 Main St. W., Hamilton, Ontario, CANADA L8N 3Z5.

gopi.ganji@utoronto.ca

 

We hypothesize that catalytic nucleic acids containing characteristic structural/functional sequence features can be probabilistically modeled and experimentally verified. By employing pattern discovery algorithms, structure prediction tools and machine learning methods, we have attempted to characterize various classes of SELEX-generated 'DNA kinases' (self-phosphorylating DNA) that recruit specific divalent metal cations and NTPs.


 

 

132A. GIMS: a Data Warehouse for Management and Analysis of Complex Biological Data.

Michael Cornell, Paul Kirby, Cornelia Hedeler and Norman W Paton. Dept of Computer Science, Kilburn Building, University of Manchester, M13 9PL.

mcornell@cs.man.ac.uk

 

GIMS is an object database that integrates genome sequence data with functional data (transcriptome, metabolome, metabolic pathway, proteome and protein-protein interactions) in a single data warehouse. GIMS can be browsed or analysed using canned queries. GIMS can be queried remotely using a Java application that can be downloaded from www.cs.man.ac.uk/~norm/gims.


 

 

133A. A Simple Statistical Test for Evaluating Differences between Database Retrieval Methods.

John L Spouge and Eva Czabarka. National Center for Biotechnology Information, National Institutes of Health, Bethesda MD USA.

spouge@ncbi.nlm.nih.gov

 

One key problem in designing intelligent systems for molecular biology is to determine which of two database retrieval methods is better. We give a simple statistical test based on z-scores to calculate the significance of differences in ROC[n] scores and apply the method to assess putative improvements to PSI-BLAST.


 

 

134A. Proteome Databases: An Information Source for Bacterial Immunology.

Klaus-Peter Pleissner1, Till Eifert2, Frank Schmidt1, Stefan H.E. Kaufmann1 and Peter R. Jungblut1. 1Max Planck Institute for Infection Biology, 2Algorithmus GmbH.

pleissner@mpiib-berlin.mpg.de

 

A collection of proteome databases which comprises 2-D gel proteins, Isotope Coded Affinity Tag (ICAT) and functional classification databases for Mycobacterium tuberculosis and Helicobacter pylori is presented. Information about genes, proteins and metabolic pathways serves as an information source for bacterial immunology. http://www.mpiib-berlin.mpg.de/2D-PAGE.


 

 

135A. RED: a Web-based System for the Analysis, Management, and Dissemination of Expressed Sequence Tags.

Everitt R.#, Minnema S.E.#, Koster C.S., Olson R.A., Wride M.A. and Rancourt D.E. Department of Biochemistry and Molecular Biology, University of Calgary, Alberta, Canada.

#These authors contributed equally to this work.

seminnem@ucalgary.ca, reveritt@ucalgary.ca

 

The Rancourt EST Database (RED) is a web-based system for the analysis, management, and dissemination of expressed sequence tags (ESTs). RED represents a flexible template DNA sequence database that can be easily manipulated to suit the needs of other labs undertaking mid-size sequencing projects. Source code for RED and the associated tools is available from reveritt@ucalgary.ca. RED is publicly accessible via www.ucalgary.ca/~rancourt.


 

 

136A. Searching Microarray Time Series Data for Yeast Cell-Cycle Regulatory Genes.

Holger Hoos, Andrew Kwon and Raymond Ng. Department of Computer Science, University of British Columbia.

tjkwon@cs.ubc.ca

 

We propose a new method for analyzing microarray time series data. We apply the method to yeast cell-cycle time series data to find potential regulatory pairs. The results indicate that our algorithm is able to find true positive pairs different from those found by the correlation and edge detection methods of Filkov et al.


 

 

137A. Application of Relational Database Tools for the Analysis of Large Proteomic Data Sets from Tandem Mass Spectrometry.

Ioannis K. Moutsatsos, Yongchang Qiu, Rod Hewick, Joseph Wooters, Steve Howes, Gary Van Domselaar and Patrick Cody. Wyeth Research Inc., 35 Cambridgepark Drive, Cambridge, MA 02140, USA.

gvandomselaar@Wyeth.com

 

TurboSEQUEST is a search engine used for protein prediction from MS/MS spectra of protein digests. We have developed a custom application, SequestOnOracle, that extends TurboSEQUEST with the data management and analysis tools of a relational database. SequestOnOracle’s unique capabilities derive from its ability to summarize and compare the protein and peptide content from multiple TurboSEQUEST searches.


 

 

138A. Hierarchical Cluster Analysis and Classification of SAGE data.

Raymond T. Ng, Jorg Sander, Monica C. Sleumer and Man Saint Yuen. University of British Columbia.

myuen@cs.ubc.ca

 

Under the assumption that cells that look morphologically similar may nonetheless behave very differently at a molecular level, we present a method for clustering and classifying SAGE libraries to detect the similarities and differences between various tissue types and neoplastic states.


 

 

139A. GEA: a Toolkit for Gene Expression Analysis.

Jessica M. Phan, Raymond Ng and Steve Jones. University of British Columbia.

myuen@cs.ubc.ca

 

We demonstrate the Gene Expression Analyzer (GEA) toolkit, intended particularly for high-dimensional data such as SAGE. GEA provides a graphical interface with operations for clustering, comparing and contrasting gene expression in different SAGE clusters. GEA will eventually be linked to various bioinformatics databases for integrated genomic analysis.


 

 

140A. A Method for Detecting Protein-Protein Interaction Rules.

Takuya Oyama1,4, Kagehiko Kitano1,4, Kenji Satou 2,4 and Takashi Ito3,4. 1INTEC Web and Genome Informatics Corporation, 2School of Knowledge Science, Japan Advanced Institute of Science and Technology, 3Cancer Research Institute, Kanazawa University and 4Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Corporation (JST).

oyama@isl.intec.co.jp

 

We studied a data mining method that can discover rules about protein-protein interactions from accumulated protein-protein interaction data. The method reveals relations between the features of mutually interacting proteins, such as "a protein having feature F1 interacts with a protein having feature F2".


 

 

141A. G-language Genome Analysis Environment.

Kazuharu Arakawa1,2, Koya Mori1,3 and Masaru Tomita1,2. 1Institute for Advanced Biosciences, Keio University, 2Department of Environmental Information and 3Graduate School of Media and Governance.

gaou@g-language.org

 

G-language Genome Analysis Environment (G-language GAE) is a generic software package aimed at higher efficiency in bioinformatics analysis. G-language GAE provides a set of Perl libraries as an interface for software development, and a graphical user interface for easy manipulation. It is distributed freely under the GPL at http://www.g-language.org/.


 

 

142A. Data Handling for Detailed Phenotypic Characterization of Novel Mouse Phenotypes.

E. C. J. Green1, J. Airey1, R. Cox1, Y. Hashim1, T. Hough1, Z. Lalanne1, K. E. Logan1, P. Nolan1, L. Visor1, A-M. Mallon1, P. Jones1, R. Selley1, A. Blake1, S. Greenaway1, H. J. Kirkbride1, J. Hunter2 and S. D. M. Brown1. 1Mouse Genome Center and Mammalian Genetics Unit, MRC, Harwell, Oxfordshire, OX11 0RD, UK and 2GlaxoSmithKline, New Frontiers Science Park, Harlow, CM19 5AW, UK.

e.green@har.mrc.ac.uk

 

A system is described for the management of data produced from the characterization of novel phenotypes, observed from a large scale ENU mutagenesis programme. A diversity of data is being produced from sources such as microarray technology, in situ hybridization studies, animal husbandry, candidate gene identification, DHPLC and sequencing.


 

 

143A. Willo and Wisp: Data Management Systems for Mouse Genome Mapping and Sequencing.

M. Simon, S. Greenaway, A-M. Mallon, R. Selley, P. Jones, Z. Tymowska-Lalanne, S. Breeds, S. Smythe, H. Kirkbride, S. Webb, A. Blake, J. Weekes, E. Green, E. Mollison, P. Denny, P. Nolan, M. Goldsworthy, M. Strivens and S.D.M. Brown. Medical Research Council, Harwell, Oxon, OX11 0RD, England.

m.simon@har.mrc.ac.uk

 

A vital element of high-throughput genetics is to capture the data generated from experimental procedures and to integrate and disseminate these results. Two data management systems have been developed to capture this data at the point of generation - Wisp and Willo. These capture data specifically generated from sequencing and genotyping.


 

 

144A. Novel Opportunities and Challenges in the Human Proteome: A Bioinformatics Strategy to Identify Splice Variants of Druggable Gene Targets.

Chandra Ramanathan1, Shuba Gopal2, Bob Bruccoleri1, John Feder1, Gabe Mintier1 and Terry Gaasterland2. 1Bristol-Myers Squibb and 2The Rockefeller University.

Chandra.Ramanathan@bms.com

 

Identification, verification and biological characterization of splice variants are challenging tasks but essential to understanding the observed biological complexity in humans. A systematic bioinformatics method is being developed to mine human genomic and EST data in order to identify splice variant forms of druggable gene targets and to correlate these variants with disease/tissue expression information available in various proprietary databases.


 

 

145A. DrugBank: An Integrated Database for Drug Discovery and Pharmacogenomics.

Kavoos Basmenji, Zhan Chang, Bahram Habibi-Nazhad and David Wishart. Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, AB, T6G 2N8.

zchang@ualberta.ca

 

DrugBank is a web-enabled database developed to facilitate drug discovery and drug analysis. It combines drug information with drug target information to allow users the possibility of linking small molecule data with protein sequence/structure data. DrugBank can be accessed freely at http://redpoll.pharmacy.ualberta.ca/~zchang/cgi-bin/welcome.cgi.

 


 

 

146A. The Genomics Unified Schema (GUS).

V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G. Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A. Pizarro, E. Manduchi, S. McWeeney, J. Schug and C. Stoeckert. Center for Bioinformatics, University of Pennsylvania.

stevef@pcbi.upenn.edu

 

GUS is a comprehensive, strongly typed relational schema and object-based software platform for integration, analysis, curation and presentation of sequence-based genomics information. It has been used to model and/or mine human, mouse, Plasmodium and the pancreas, and is suitable for model organisms in general. It is freely available.


 

 

147A. Compensation for Nucleotide Bias in a Genome by Representation as a Discrete Channel with Noise.

Mark Schreiber1,2 and Chris Brown1. 1AgResearch NZ, PO Box 50034, Dunedin, New Zealand and 2Dept of Biochemistry, University of Otago, PO Box 56 Dunedin, New Zealand.

mark.schreiber@agresearch.co.nz

 

Calculation of the information content of motifs in genomes highly biased in nucleotide composition leads to overestimates of the amount of useful information in the motif. By treating a biased genome as a discrete channel with noise, in accordance with Shannon Information Theory, we were able to remove both ‘Distortion’ and ‘Noise’ from the motif and recover a more instructive biological ‘signal’.


 

 

148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?

Li Li, Brian Brunk, Christian J. Stoeckert Jr and David S. Roos. Department of Biology, University of Pennsylvania, Philadelphia, USA and Center for Bioinformatics, University of Pennsylvania, Philadelphia, USA.

lili4@sas.upenn.edu

 

To integrate eukaryotic sequence data with information on biological processes, we sought to identify orthologous groups by combining sequence similarity comparisons with graph clustering algorithms. Queries based on user-defined species distribution provide a snapshot of shared/diversified processes, facilitating (for example) the identification of targets for broad-spectrum antibiotics targeting apicomplexan parasites.


 

 

149A. The CyberCell Database (CCDB).

Bahram Habibi-Nazhad, Melania Ruaini, Kavoos Basmenji and David S. Wishart. Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton AB T6G 2N8, Canada.

bahram@redpoll.pharmacy.ualberta.ca

 

The CyberCell Database (CCDB) is a web-enabled, user-friendly database containing previously published and electronically archived information on nearly every aspect of E. coli molecular biology and enzymology. We have also constructed CC3D which contains E. coli structural proteomic data and CCMD which contains the chemical database of metabolites and other small molecules used to support metabolic analysis.


 

 

150A. Functional Database System of Olfactory Receptors.

Kazunori Miyazaki and Satoshi Itoh. Advanced Materials and Devices Laboratory, Corporate Research and Development Center, TOSHIBA CORPORATION.

kazun.miyazaki@toshiba.co.jp

 

We have developed a Java/XML-based functional database system of olfactory receptors (OR), built from databases that can be accessed via the Internet. The distinguishing feature of our system is that it analyzes the XML data for ORs using predictive tools on the Web and then semi-automatically accumulates the resulting annotations in the analyzed data.


 

 

151A. A Standard Corpus for Evaluating Extraction of Molecular Interaction Pathway Information from Scientific Abstracts.

Soon Heng Tan and See-Kiong Ng. Laboratories for Information Technology, Singapore.

soonheng@lit.org.sg

 

Vast amounts of molecular interaction pathway information can be extracted automatically from MEDLINE abstracts using natural language processing, but progress has been hindered by the lack of a standard corpus for evaluation. We describe a test corpus, created in our Pathweaver project, that is suitable for such evaluation.


 

 

152A. A First Study of the Central Role of the Analyst in the Knowledge Discovery Process in Biology.

Sandy Maumus1,2, Amedeo Napoli2, Rafik Taouil2 and Sophie Visvikis1. 1INSERM U525, Université Henri Poincaré (Nancy 1) – Faculté de Pharmacie, 30 rue Lionnois, 54000 Nancy, France and 2LORIA – UMR 7503, B.P. 239, 54506 Vandoeuvre-Lès-Nancy, France.

sandy.maumus@nancy.inserm.fr

 

Based on an application of symbolic data mining methods to a test database, we underline the role played by the analyst in the knowledge discovery process. Encouraged by positive results, we plan to apply these methods to a large database to investigate the relationships between gene polymorphisms and intermediate phenotypes of cardiovascular diseases.


 

 

153A. Assessing the Compactness and Isolation of Individual Clusters Observed in Microarray Data.

Per-Olof Fjallstrom. Affibody.

perfj@affibody.com

 

The “clusters” returned by standard clustering methods applied to microarray data are not necessarily biologically relevant. We present a method for assessing whether such clusters are unusually compact and isolated. The method has been successfully applied to several microarray data sets. It does not require estimates of the variance of experimental error.


 

 

154A. An Amino Acid Centered Database to Facilitate Protein Crystallisation.

K. MacLeod and E. Westwick. Astex Technology Ltd.

e.westwick@astex-technology.com

 

An amino acid centered relational database has been designed to store sequences of P450 proteins that have been engineered in order to optimise crystallisation behavior. Amino acids are stored as individual entities, allowing the physical and chemical properties of the residues to be correlated with experimental outcome, using SQL queries.


 

 

155A. In silico reconstruction of metabolic network from unannotated raw genome sequences

Jibin Sun and An-Ping Zeng. Microbial Systems and Genome Analysis, GBF.

AZE@GBF.de

 

A method is proposed to reconstruct a metabolic network in silico directly from unannotated genome sequences. A comparison of data from different sequencing stages (3.9-fold vs. 7.9-fold coverage) for one organism revealed that 3.9-fold coverage of the genome is sufficient (with 99.3% identity) for reconstructing the metabolic network.


 

 

156A. AFLP® Nucleotide Sequence Quality Assessment and Improvement Tool.

Antoine Janssen1, Jan van Oeveren1, Pieter Vos1, Gert Vriend2, Roland Siezen2, Rene van Schaik3 and Jack Leunissen2. 1Keygene N.V., Wageningen, The Netherlands and 2Center for Molecular and Biomolecular Informatics, University of Nijmegen, Nijmegen, The Netherlands and 3Organon, Oss, The Netherlands.

antoine.janssen@keygene.com

 

The Keygene/CMBI AFLP® quality assessment and improvement tool is a web based application that automates quality assessment and visualization of (cDNA-)AFLP® data. It improves proprietary data by use of public data. The analysis includes coverage / redundancy calculation, internal contig building, full length discovery and potential SNP discovery. http://www.cmbi.nl/kg_bin/dataset_annotator.pl.


 

 

157A(i). Schema Mapping and Data Integration with Clio.

Barbara Eckman, Mauricio Hernández, Howard Ho, Felix Naumann and Lucian Popa. IBM.

felix@us.ibm.com

 

Bioinformatics data sources typically have large, complex structures, reflecting the richness of the scientific concepts they model. Clio is an information integration tool that helps users define mappings between disparate schemas, thus providing an integrated view of all related data sources and enabling data transformations between the sources.


 

 

157A(ii). The GENIA Corpus: an Annotated Corpus in Molecular Biology Domain.

Tomoko Ohta1, Yuka Tateisi2, Jin-Dong Kim2 and Jun-ichi Tsujii1,2. 1Univ. of Tokyo and 2CREST, JST.

okap@is.s.u-tokyo.ac.jp

 

We are developing the necessary resources, including a domain ontology and an annotated corpus built from MEDLINE abstracts. We have already annotated 2,500 abstracts with 31 different semantic classes. Part-of-speech annotation of the same set of abstracts annotated for named entities is under way using the Penn Treebank tag set. In this poster, we report on the current status of our corpus.


 

Data Mining. (Section B)

 

124B. Analyzing Brain and Breast Cancer SAGE Libraries.

125B. Homophila: A Database of Human Disease Gene Cognates in Drosophila.

126B. BioMiner: An Integrated Framework for Data Mining in Functional Genomics.

127B. Motif Informatics: Integration of Sequence Information into Gene Expression Data Mining.

128B. Integration of Exon Predictions Using Multilayer Perceptron and Mixture of Experts Neural Networks.

129B. An Evaluation of Dimensionality Reduction Methods for Bio-Medical Spectra.

130B. Building a Database of Medium Resolution Electron Density Properties of Chemical Functions Applicable to all Biopharmacological Domains.

131B. Identification of ORFs from Organelle Genomes: A Data Mining Approach.

132B. Automatically Extracting Keyphrases for Clusters of Genes.

133B. A Novel Bayesian Clustering Approach for Predicted Regulatory Binding Sites.

134B. Software Development for High-Throughput DNA Sequencing.

135B. Ubiquitin and Ubiquitin-Like Pathway Proteins.

136B. Galaxy: a System for Flexible Data Tracking and High Throughput Analysis Pipeline.

137B. Search for Structure in Long Introns.

138B. Correlation between Intron Length and Its Base Composition.

139B. Method for the Best Model Selection of Paired Motifs in Promoter Regions of Genes.

140B. Intragenomic Reiterations Detection Using Hidden Markov Models.

141B. SVM-Decide: Towards a Decision Support System for Molecular Genetic Data.

142B. Easy Click: An SQL-Generating Application to Assist Researchers.

143B. An Adaptive Meta-Clustering Approach for Bioinformatics Applications.

144B. Discriminant Analysis of Multi-center Microarray Data.

145B. GeneBeans: a Bioinformatics Workflow and Data Management System.

146B. BAG: A Graph Theoretic Sequence Clustering Algorithm.

147B. Metabolic Cartography.

148B. Mining Three-Dimensional Chemical Structure Data.

149B. FAST: Functional Annotation of Sequence through Text.

150B. Assessing the Reliability of Self Organizing Maps.

151B. Finding Biological Themes in Microarray-derived Gene Lists with EASE: the Expression Analysis Systematic Explorer.

152B. Improving Literature-Based Discovery Support by Background Knowledge Inclusion for better Disease Candidate Gene Identification.

153B. Detecting Salient Changes in Gene Profiles.

154B. DiscoveryLink: IBM’s Data Integration Solution for the Life Sciences.

155B. Relationships between Alternative Splicing and Protein Structure.

156B. Understanding Scrambled Genes in Ciliates - Reverse Engineering a Biological Computer.

157B(i). BNS: A DNS-Inspired Biomolecule Naming Service.

157B(ii). OLAPOP-Online Analytical Processing of Proteins.

 

124B. Analyzing Brain and Breast Cancer SAGE Libraries.

Byron Kuo, Timothy Chan and Raymond Ng. Department of Computer Science, University of British Columbia.

bkuo@cs.ubc.ca

 

The intent of the experiment is to attempt to characterize and find any similarities between seemingly different cancers (breast and brain) at the sub-cellular level. Based on publicly available SAGE libraries of cancerous and normal breast and brain tissues, we obtained a list of candidate cancer-related genes by applying the two-sample t-test and then analyzed their similarities at the gene expression level.


 

 

125B. Homophila: A Database of Human Disease Gene Cognates in Drosophila.

Samson Chien, Lawrence T. Reiter, Ethan Bier and Michael Gribskov. University of California, San Diego / San Diego Supercomputer Center.

schien@sdsc.edu

 

Homophila is a database of human disease genes associated with their counterparts in Drosophila. Homophila provides a comprehensive linkage between OMIM and Flybase in order to stimulate functional genomic studies in Drosophila that address questions concerning human genetic diseases. Homophila is available at http://homophila.sdsc.edu


 

 

126B. BioMiner: An Integrated Framework for Data Mining in Functional Genomics.

Fazel Famili1, Roy Walker2, Alan Barton1, Qing-Yan Liu2, Ziying Liu1, Julio Valdes1, Youlian Pan1, Brandon Smith2, Junjun Ouyang1, Melanie Lehman2, Lynn Wei1 and Weiling Xu. 1Institute for Information Technology, 2Institute for Biological Sciences, National Research Council of Canada, Montreal Road, Ottawa, Ontario, K1A 0R6, Canada.

fazel.famili@nrc.ca

 

This poster explains the role of integrated data mining systems in functional genomics. We will describe all stages of data preprocessing, the type of data and some of the useful knowledge that may be discovered from functional genomics data. The BioMiner architecture and its main functionalities along with some advantages of integrated architectures are explained.


 

 

127B. Motif Informatics: Integration of Sequence Information into Gene Expression Data Mining.

Youlian Pan1, Roy Walker2, A (Fazel) Famili1 and Qing-Yan Liu2. 1Institute for Information Technology and 2Institute for Biological Sciences, National Research Council Canada, 1200 Montreal Road, Ottawa, Ontario, K1A 0R6, Canada.

youlian.pan@nrc.ca

 

This paper rationalises the necessity of integrating information, such as patterns of transcription factor binding sites and the transcription factors themselves, into gene expression data mining processes. The paper also demonstrates the advantage of incorporating symbolic sequence data with numerical gene expression data analysis using an application in the BioMiner project.

Long Abstract

 

 

128B. Integration of Exon Predictions Using Multilayer Perceptron and Mixture of Experts Neural Networks.

Youlian Pan1,2,3, Christoph W. Sensen4, Malcolm Heywood2 and Michael A. Shepherd2. 1Faculty of Computer Science, Dalhousie University, 6050 University Avenue Halifax, NS, Canada B3H 1W5; 2Canadian Bioinformatics Resources, NRC, 1411 Oxford St. Halifax, NS, Canada B3H 3Z1; 3Institute for Information Technology, NRC, 1200 Montreal Rd, Bldg. M-50, Ottawa, Ont. Canada K1A 0R6 and 4Department of Biochemistry and Molecular Biology, University of Calgary, 3330 Hospital Drive N.W., HSC 1150, Calgary, Alberta, Canada, T2N 4N1.

youlian.pan@nrc.ca

 

This paper investigates the potential of improving exon predictions by integrating GrailExp, GenScan, and MZEF using Multilayer Perceptron and Mixture of Experts neural networks. For human exon prediction, this integration system has significantly better recovery, by 25%, than any individual prediction engine alone. This system is available at http://www.cbr.nrc.ca/pany/integ.html.

Long Abstract

 

 

129B. An Evaluation of Dimensionality Reduction Methods for Bio-Medical Spectra.

Christopher Bowman and Richard Baumgartner. Institute for Biodiagnostics, National Research Council Canada, Winnipeg, Manitoba, Canada.

Christopher.Bowman@nrc.ca

 

We compare linear and nonlinear techniques for identifying the intrinsic dimensionality of a data set, including local and global principal component analysis and a novel implementation of the Whitney reduction network. The performance of these techniques is evaluated using independent training and validation sets drawn from magnetic resonance and mass spectroscopy.

Long Abstract

 

 

130B. Building a Database of Medium Resolution Electron Density Properties of Chemical Functions Applicable to all Biopharmacological Domains.

John Binamé1, Laurence Leherte1, Janice I. Glasgow2, Suzanne Fortier3 and Daniel P. Vercauteren2. 1Laboratoire de Physico-Chimie Informatique, Facultés Universitaires Notre-Dame de la Paix, Namur, Belgium, 2School of Computing, Queen's University, Kingston, ON, Canada, 3Department of Chemistry, Queen's University, Kingston, ON, Canada.

daniel.vercauteren@fundp.ac.be

 

This work concerns the building of a database of topological properties of electron density functions of organic molecules at medium resolution to develop an automated way to reduce molecules to few relevant points. These points are further used in similarity search and pharmacophore proposition procedures applicable to all pharmacological domains.

Long Abstract

 

 

131B. Identification of ORFs from Organelle Genomes: A Data Mining Approach.

Sivakumar Kannan, Genevieve Boucher and Gertraud Burger. Canadian Institute for Advanced Research, Program in Evolutionary Biology, Département de Biochimie, Université de Montréal, Montréal, Québec H3C 3J7, Canada.

siva@bch.umontreal.ca

 

Genomes of mitochondria and chloroplasts from diverse organisms carry on average 5 to 20 ORFs without assigned functions. In order to understand the biological role of these ORFs, we have developed a comprehensive analysis procedure using data mining methods. The approach and the predicted data will be presented.

Long Abstract

 

 

132B. Automatically Extracting Keyphrases for Clusters of Genes.

Rich Maclin1 and Mark Craven2. 1Computer Science Department, University of Minnesota, Duluth and 2Biostatistics and Medical Informatics Department, University of Wisconsin, Madison.

rmaclin@d.umn.edu

 

We present a tool for annotating high-throughput experiments by automatically extracting keyphrases to characterize clusters of genes. Our method autonomously associates genes with PubMed abstracts, extracts keyphrases that are statistically associated with gene clusters, and attempts to organize both genes and keyphrases into informative subclusters.

Long Abstract

 

 

133B. A Novel Bayesian Clustering Approach for Predicted Regulatory Binding Sites.

Zhaohui S. Qin1, Lee Ann McCue2, William Thompson2, Linda Mayerhofer2, Charles E. Lawrence2,3, and Jun S. Liu1. 1Department of Statistics, Harvard University, Cambridge, MA 02138, 2The Wadsworth Center, New York State Department of Health, Albany, NY 12201, 3Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180.

qin@stat.harvard.edu

The availability of complete genome sequences has made possible the computational identification of hundreds of binding sites via cross-species comparisons. We describe a novel Bayesian motif clustering algorithm that predicts the number of clusters among these sites, and identifies the sites belonging to each cluster.

Long Abstract

 

 

134B. Software Development for High-Throughput DNA Sequencing.

Yaron Butterfield, Ran Guin, Ursula Skalska, Duane Smailus, Angelique Schnerch, Kevin Teague, Jacquie Schein, Marco Marra, Steven Jones and the Genome Sciences Centre (http://www.bcgsc.bc.ca), British Columbia Cancer Research Centre, Vancouver, BC, Canada, V5Z 4E6.

ybutterf@bcgsc.bc.ca

 

We have established a bioinformatics pipeline to handle the large amount of DNA sequence data generated in our laboratory. We have created a laboratory information system in which data are stored in a central relational database; in conjunction with Perl software, it allows for efficient, high-throughput sequencing and processing.

Long Abstract

 

 

135B. Ubiquitin and Ubiquitin-Like Pathway Proteins.

Yanmei Lu, Nan Lin, Betty Huang, Donald G. Payan and Kunbin Qu. Rigel Pharmaceuticals, Inc., 240 E. Grand Ave, South San Francisco, CA 94080.

ylu@rigel.com

 

The ubiquitin pathway is involved in many important cellular processes. The NR database was mined for ubiquitin pathway related proteins using Gibbs Sampling and HMM. We identified around 900 proteins in the ubiquitin reaction cascade. Our results demonstrate diverse protein domain structure compositions and functions in ubiquitin-domain containing proteins and ubiquitin ligases.

Long Abstract

 

 

136B. Galaxy: a System for Flexible Data Tracking and High Throughput Analysis Pipeline.

Nan Lin, Davidson Wan, Jiao He, Yanmei Lu, Ying Huang, Donald G. Payan and Kunbin Qu. Rigel, Inc.

nlin@rigel.com

 

Galaxy is an enterprise solution system built at Rigel to construct and organize flexible “high throughput informatics pipelines” for data tracking, analysis and integration from diverse public sources with the overlay of the internal experiment data. It integrates functional platforms used by biologists with programming applications that are continuously updated.

Long Abstract

 

 

137B. Search for Structure in Long Introns.

Hideo Bannai1, Satoru Miyano1, Kenta Nakai1, Sascha Ott1 and Yoshinori Tamada2. 1University of Tokyo and 2Tokai University.

tamada@ims.u-tokyo.ac.jp

 

We use a program predicting short introns to analyse the processing of long introns (some thousand bases or longer). The focus is whether long introns contain a structure of short introns, such that the long introns can be processed by a series of splicing reactions rather than by one single reaction.

Long Abstract

 

 

138B. Correlation between Intron Length and Its Base Composition.

Hideo Bannai, Yoshinori Tamada, Sascha Ott, Kim Sunyong, Kenta Nakai and Satoru Miyano. Human Genome Center, Institute of Medical Science, University of Tokyo, 4-8-1 Minato-ku, Tokyo, 108-8639 Japan.

bannai@ims.u-tokyo.ac.jp

 

We analyzed the base compositions of introns available from Ensembl for H. sapiens, M. musculus, D. melanogaster and D. rerio, and have discovered a notable correlation between intron length and its base composition for H. sapiens and M. musculus. The tendency was not observed in D. melanogaster and D. rerio.

Long Abstract

 

 

139B. Method for the Best Model Selection of Paired Motifs in Promoter Regions of Genes.

Daisuke Shinozaki and Osamu Maruyama. Faculty of Mathematics, Kyushu University, Japan.

om@math.kyushu-u.ac.jp

 

We propose a method for the best model selection of paired motifs in promoter regions of a given set of genes. We apply our method to yeast data, such as sets of co-regulated genes, and report the experimental results.

Long Abstract

 

 

140B. Intragenomic Reiterations Detection Using Hidden Markov Models.

Sébastien Hergalant, Bertrand Aigle, Bernard Decaris, Pierre Leblond and Jean-François Mari. LORIA (équipe Orpailleur, BP 239, 54506 Vandoeuvre-lès-Nancy, France) and Laboratoire de Génétique et Microbiologie (UMR UHP-INRA 1128, IFR 110, 54506 Vandoeuvre-lès-Nancy, France).

hergalan@loria.fr

 

We present a genomic data mining method in which the user describes a signal worked out by a second-order HMM. This signal, representing the probability of classifying a nucleotide residue or a group of residues in a particular state, allows the localization of repetitions in a complete bacterial genomic sequence.

Long Abstract

 

 

141B. SVM-Decide: Towards a Decision Support System for Molecular Genetic Data.

Jasmin Müller, Falk Schubert and Roland Eils. Intelligent Bioinformatics Systems, German Cancer Research Center.

j.mueller@dkfz-heidelberg.de

 

SVM-Decide is an approach to adapt knowledge from gene expression and genomic profiles for clinical decision support systems. To this end, we combine a support vector machine classifier with an explanation component for the physician. Furthermore, an explicit competence model enables our system to classify only cases within its competence area.

Long Abstract

 

 

142B. Easy Click: An SQL-Generating Application to Assist Researchers.

Eric J. Grant and Dale L. Preston. Radiation Effects Research Foundation.

egrant@rerf.or.jp

 

Retrieving data from a relational database can be complex. ‘Easy Click’, a desktop PC application, dynamically generates SQL statements via a point-and-click interface shielding researchers from writing SQL, guaranteeing easy and consistent access to research data. An initialization file supplies Easy Click with variable definitions giving a completely customizable application.

Long Abstract

 

 

143B. An Adaptive Meta-Clustering Approach for Bioinformatics Applications.

Y. Zeng, J. Garcia-Frias, J. Tang and G. Gao. Department of Electrical and Computer Engineering, University of Delaware.

zeng@eecis.udel.edu

 

Because of the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. This poster proposes a meta-clustering approach, which can extract the information from results of different clustering techniques adaptively and provides a better interpretation of the data patterns.

Long Abstract

 

 

144B. Discriminant Analysis of Multi-center Microarray Data.

Taesung Park1, Sung-Gon Yi1, Hosik Choi1, Seung-Yeoun Lee2, Kee-Ho Lee3, Jung Kyoon Choi4, Sangsoo Kim4, Yeom Young Il4, Choi Jong Young5 and Daeghon Kim6. 1Department of Statistics, Seoul National University, Seoul, Korea, 2Department of Applied Mathematics, Sejong University, Seoul, Korea, 3Laboratory of Molecular Oncology, Korea Cancer Center Hospital, 4Korea Research Institute of Bioscience and Biotechnology, Taejon, Korea, 5The Catholic University of Korea, Seoul, Korea and 6Chonbuk National University, Jeonju, Chonbuk, Korea.

pinebud2@snu.ac.kr

 

For the case in which the same type of microarrays is collected from different clinical centers, we propose new discrimination methods that account for the variability caused by the different centers. The proposed methods are illustrated using microarray data for liver cancer patients from three different clinical centers in Korea.

Long Abstract

 

 

145B. GeneBeans: a Bioinformatics Workflow and Data Management System.

Jeffrey L. Brown, Thomas C. Hudson and Kenisha V. Johnson. University of North Carolina at Wilmington.

hudsont@uncwil.edu

 

GeneBeans uses Enterprise Java Beans to provide biologists a graphical dataflow interface for constructing queries and analyses of gene index databases without the use of a textual query language. The tool is intended to make bioinformatics more generally practicable.

Long Abstract

 

 

146B. BAG: A Graph Theoretic Sequence Clustering Algorithm.

Sun Kim. School of Informatics Center for Genomics and Bioinformatics Indiana University, Bloomington.

sunkim@bio.informatics.indiana.edu

 

As more sequences become available at an exponential rate, sequence analysis on a large number of sequences becomes increasingly important. Sequence clustering algorithms are computational tools for that purpose. In this paper, we present our clustering algorithm BAG, which uses two graph properties: biconnected components and articulation points.
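To make the two graph properties concrete, the sketch below finds articulation points with the standard depth-first search (Tarjan) formulation. It is not the BAG algorithm itself: the graph, in which nodes stand for sequences and edges for pairwise similarity above some threshold, is a hypothetical example, and how BAG builds and splits its graph is not given in this abstract.

```python
def articulation_points(graph):
    """Return the articulation points of a simple undirected graph.

    graph maps each node to an iterable of its neighbours.
    """
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Non-root u separates v's subtree from the rest of the graph.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The DFS root is an articulation point if it has several DFS children.
        if parent is None and children > 1:
            points.add(u)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points


# Hypothetical similarity graph: two triangles of sequences sharing node "c".
similarity = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d", "e"],
    "d": ["c", "e"],
    "e": ["c", "d"],
}
cut_nodes = articulation_points(similarity)
```

Removing an articulation point disconnects the graph, so such a node joins two biconnected components. The recursive search is adequate for small graphs; large sequence sets would need an iterative version to stay within Python's recursion limit.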

Long Abstract

 

 

147B. Metabolic Cartography.

Daniel McShan, Shilpa Rao and Imran Shah. University of Colorado, Health Science Center, 4200 E 9th Ave, C-245, Denver, CO 80120, USA.

 

Daniel.McShan@uchsc.edu

 

We present a novel metabolic cartography approach for representing the metabolic space based on the biochemical properties of molecules. Summaries and visualizations of this space are presented, offering a quantitative and qualitative overview of the metabolome. We are using metabolic cartography in our research on pathway inference methods.

Long Abstract

 

 

148B. Mining Three-Dimensional Chemical Structure Data.

Sean McIlwain1, Arno F. Spatola2, David Vogel2, Sylvie Blondelle3 and David Page4.

1Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53706, U.S.A., 2Institute for Molecular Diversity and Drug Design, Department of Chemistry, University of Louisville, Louisville, KY 40292, U.S.A., 3Torrey Pines Research Institute for Molecular Studies, La Jolla, CA 92037, U.S.A. and 4Department of Biostatistics and Medical Informatics and Department of Computer Sciences, University of Wisconsin, WI 53706, U.S.A.

mcilwain@hotmail.com, spatola@louisville.edu, spatola@louisville.edu, sblondelle@tpims.org and page@biostat.wisc.edu

 

We apply inductive logic programming to the task of predicting anti-microbial activity, specifically, the ability of certain molecules to inhibit growth of Pseudomonas aeruginosa. This is done by taking into account the three-dimensional structure and biological activities from a database of tested molecules.

Long Abstract

 

 

149B. FAST: Functional Annotation of Sequence through Text.

Michael Elkaim and Chris Ponting. MRC Functional Genetics Unit, University of Oxford, Department of Human Anatomy and Genetics, South Parks Road, Oxford, OX1 3QX, United Kingdom.

michael.elkaim@anat.ox.ac.uk

 

The manual annotation of protein domains is an arduous task. We present FAST, a program that automatically annotates protein domains by extracting functional information, including key word stems and key quotes, from literature that is relevant to protein sequences containing these domains.

Long Abstract

 

 

150B. Assessing the Reliability of Self Organizing Maps.

Fajar Restuhadi1, Andrew Hayes2, Simon J. Hubbard1 and Stephen G. Oliver2. 1Dept. Biomolecular Sciences, UMIST, PO BOX 88, Manchester M60 1QD and 2School of Biological Sciences, Univ. of Manchester, Manchester M13 9PT.

adi@bms.umist.ac.uk

 

Self Organizing Map (SOM) approaches were used to analyse our unique transcriptome data from high-throughput northern hybridisations. The objective function associated with the SOM algorithm, for a constant neighbourhood size and a finite data set, is the intra-class sum of squares (SSIntra) extended to neighbouring classes. At convergence, the SOM algorithm thus exactly minimizes the SSIntra function. We applied the bootstrap method to estimate the variability of SSIntra. If the SOM is computed several times according to the bootstrap principle, we can calculate the mean and standard deviation of SSIntra (the distortion). The variability of SSIntra can then be used to assess the stability of the quantization error in the SOM.
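The bootstrap procedure described above can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the SOM is a tiny one-dimensional chain of units trained on scalar data, and the function names, learning-rate schedule, and default parameters are all hypothetical.

```python
import random
from statistics import mean, stdev


def train_som(data, n_units, radius, epochs, rng):
    """Train a 1-D SOM on scalar data with a constant neighbourhood radius."""
    units = rng.sample(data, n_units)  # initialise the codebook from the data
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            lr = 0.5 * (1 - step / total_steps)  # linearly decaying learning rate
            step += 1
            winner = min(range(n_units), key=lambda i: (x - units[i]) ** 2)
            lo, hi = max(0, winner - radius), min(n_units, winner + radius + 1)
            for i in range(lo, hi):
                units[i] += lr * (x - units[i])
    return units


def ss_intra(data, units, radius):
    """Sum of squared distances from each point to its winner and the winner's neighbours."""
    total = 0.0
    for x in data:
        winner = min(range(len(units)), key=lambda i: (x - units[i]) ** 2)
        lo, hi = max(0, winner - radius), min(len(units), winner + radius + 1)
        total += sum((x - units[i]) ** 2 for i in range(lo, hi))
    return total


def bootstrap_ss_intra(data, n_boot=20, n_units=4, radius=1, epochs=5, seed=0):
    """Refit the SOM on bootstrap resamples; return mean and standard deviation of SSIntra."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]  # sample with replacement
        units = train_som(resample, n_units, radius, epochs, rng)
        scores.append(ss_intra(resample, units, radius))
    return mean(scores), stdev(scores)


# Hypothetical expression values drawn from two groups.
demo_rng = random.Random(1)
demo = [demo_rng.gauss(0, 1) for _ in range(30)] + [demo_rng.gauss(5, 1) for _ in range(30)]
ss_mean, ss_sd = bootstrap_ss_intra(demo)
```

A small standard deviation relative to the mean would suggest that the quantization error is stable across resamples. Here SSIntra is scored on each resample; scoring on the original data instead is an equally plausible reading of the abstract.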

Long Abstract

 

 

151B. Finding Biological Themes in Microarray-derived Gene Lists with EASE: the Expression Analysis Systematic Explorer.

Douglas A. Hosack and Richard A. Lempicki. Laboratory of Immunopathogenesis and Bioinformatics, SAIC Frederick.

Doug!@nih.gov

 

EASE is a software package that finds biological themes over-represented in any list of genes derived from microarray experiments or other high-throughput screening methods. It enables researchers utilizing these technologies to quickly find the interesting biological stories in their results.

Long Abstract

 

 

152B. Improving Literature-Based Discovery Support by Background Knowledge Inclusion for Better Disease Candidate Gene Identification.

Dimitar Hristovski1 and Borut Peterlin2. 1National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894 USA and 2Department of Human Genetics, Clinical Center Ljubljana Zaloska, 1000 Ljubljana, Slovenia.

dimitar.hristovski@mf.uni-lj.si and borut.peterlin@guest.arnes.si

 

We describe an interactive literature based discovery support system extended with background knowledge about disease/gene chromosomal or expression location. The goal of the system is to discover new, potentially meaningful relations between biomedical concepts (e.g. a gene candidate for a disease). The system is available at http://www.mf.uni-lj.si/bitola/.

Long Abstract

 

 

153B. Detecting Salient Changes in Gene Profiles.

Tanveer Syeda-Mahmood. IBM Almaden Research Center, 650 Harry Road, San Jose CA 95120.

stf@almaden.ibm.com

 

The functional state of an organism is determined largely by the pattern of expression of its genes. Salient changes in variation in expression of genes can give clues about important events, such as the onset of a disease. In this paper, we address, for the first time, the problem of salient change detection in gene profiles. In particular, we look for salient inflection points in the time-series of the gene profiles. Such inflection points are places where there is a significant change in the curvature of the gene profile, as revealed by a zero-crossing of the second derivative that is preserved over multiple scales of smoothing. By automatically selecting an optimal scale of description, we derive salient change points in genomic signals. The utility of salient change detection is demonstrated in the automatic identification of regulatory phase for genes active in the mitotic cell cycle of budding yeast.
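The idea of a zero-crossing that persists across smoothing scales can be sketched as follows. This is a simplified illustration, not the method of the paper: it assumes Gaussian smoothing with clamped ends, a discrete second difference in place of the second derivative, and a fixed list of scales with an index tolerance instead of the automatic scale selection described above. All function names and parameter values are hypothetical.

```python
from math import exp


def gaussian_smooth(y, sigma):
    """Smooth a sequence with a truncated Gaussian kernel, clamping at the ends."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    norm = sum(kernel)
    n = len(y)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # repeat the edge value past the ends
            acc += w * y[j]
        out.append(acc / norm)
    return out


def inflection_indices(y):
    """Indices i where the second difference changes sign between i and i + 1."""
    d2 = [y[i - 1] - 2 * y[i] + y[i + 1] for i in range(1, len(y) - 1)]
    return [i + 1 for i in range(len(d2) - 1) if d2[i] * d2[i + 1] < 0]


def salient_inflections(y, sigmas=(1.0, 2.0, 3.0), tol=2):
    """Keep finest-scale inflections that reappear within tol samples at every coarser scale."""
    per_scale = [inflection_indices(gaussian_smooth(y, s)) for s in sigmas]
    finest, coarser = per_scale[0], per_scale[1:]
    return [i for i in finest
            if all(any(abs(i - j) <= tol for j in other) for other in coarser)]


# Hypothetical expression profile: a sigmoidal rise centred between samples 20 and 21.
profile = [1 / (1 + exp(-(t - 20.5) / 3)) for t in range(41)]
change_points = salient_inflections(profile)
```

Inflections caused by noise tend to disappear as the smoothing scale grows, whereas a genuine change in curvature survives, which is why persistence across scales serves as the saliency criterion.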

Long Abstract

 

 

154B. DiscoveryLink: IBM’s Data Integration Solution for the Life Sciences.

Barbara Eckman, Laura Haas, Prasad Kodali, Eileen Lin, Julia Rice and Peter Schwarz. IBM Life Sciences.

baeckman@us.ibm.com

 

Integration of a widely diverse set of databases and applications is needed to carry out post-genomic bioinformatics research. We describe DiscoveryLink, IBM’s federated database offering, and illustrate how it is being used to provide integrated access to life sciences data, irrespective of where it is stored and its format.

Long Abstract

 

 

155B. Relationships between Alternative Splicing and Protein Structure.

Richard E. Green and Steven E. Brenner. Univ. California, Berkeley.

ed@compbio.berkeley.edu

 

Alternative splicing may have an enormous impact on the protein coding diversity encoded by the human genome. We set out to investigate one aspect of this impact, the effect of alternative splicing on the domain organization of affected proteins. Surprisingly, there seems to be little correlation between domain organization and alternative splicing.

Long Abstract

 

 

156B. Understanding Scrambled Genes in Ciliates - Reverse Engineering a Biological Computer.

Andre Cavalcanti and Laura F. Landweber. Princeton University.

 

Scrambled genes are surprisingly common in spirotrichous ciliates. During cell development, these microorganisms must reorder the permuted pieces of such genes, tackling an intrinsically computational problem. We are developing tools for the identification and analysis of scrambled genes with the final goal of understanding the rules driving this biological process.

Long Abstract

 

 

157B(i). BNS: A DNS-Inspired Biomolecule Naming Service.

Robert Kincaid. Life Science Technologies Laboratory, Agilent Technologies.

robert_kincaid@agilent.com

 

BNS is a prototype biomolecule naming service using the Lightweight Directory Access Protocol (LDAP) to provide high-performance access to data derived from LocusLink. Gene and protein names and accessions can be resolved into various equivalents with very low latency. This enables a number of novel processes involving rapid conversions between accession schemes.

Long Abstract

 

 

157B(ii). OLAPOP-Online Analytical Processing of Proteins.

Deendayal Dinakarpandian and Vijay Kumar. School of Interdisciplinary Computing and Engineering, University of Missouri-Kansas City, Kansas City, MO 64110, USA.

dinakard@umkc.edu

 

OLAPOP refers to the online analytical processing of proteins, beyond flat-file retrieval and sequence analysis. This is based on a data warehouse approach to proteins with an emphasis on the provision of analytic facilities that allow for the study of protein properties in multiple dimensions.

Long Abstract