Data Mining. (Section A)
Subbaya Subramanian, R.K. Mishra and L. Singh. Centre for Cellular and Molecular Biology, W413 CCMB, Uppal Road, Hyderabad, Andhra Pradesh, 500007, India.
Genomewide analysis of GATA repeats revealed that GATA repeats are absent in prokaryotes and have been gradually accumulated in higher organisms during the course of evolution. In humans, the Y chromosome has the highest GATA repeat density, which is predominantly present in the Yq pericentric region. GATA repeats along the Y-chromosome and their close proximity to Matrix Associated Regions (GATA-MAR) may be demarking chromatin domains.
Aik Choon Tan and David Gilbert. Bioinformatics Research Centre, Department of Computer Science, University of Glasgow, Glasgow, U.K.
The aim of this research is to construct a novel approach for inducing comprehensive patterns from various data sources using a knowledge discovery and hierarchical machine learning approach. We have applied this technique to characterise several protein families, and our classifiers show higher accuracy and are more informative than conventional methods.
Leerkes MR, Caballero OL, Mackay A, Torloni H, O'Hare MJ, Simpson AJ, and de Souza SJ. Ludwig Institute for Cancer Research, Rua Prof. Antonio Prudente, 109, 4 andar, Sao Paulo, SP, CEP 01509-010, Brazil.
We report here the combined use of ORESTES sequences generated in the FAPESP/LICR Human Cancer Genome Project and information available in the UniGene and SAGE databases to characterize the transcriptome of normal and breast tumor cells. We have identified 154 genes as candidates for overexpression in breast tumor cells.
Yoshihiro Ohta1 and Shigeo Ihara2. 1Hitachi Central Research Laboratory and 2Research Center for Advanced Science and Technology, University of Tokyo.
We constructed a biomolecular interaction detection system that is practical for handling the recent massive increase in molecular biology literature. We considered every necessary element: large-scale dictionary construction, biomolecular name detection, interaction detection and an effective network-viewer user interface. With these elements, our system can extract over 550,000 interactions.
Hofmann O. and Schomburg D. Department of Biochemistry, University of Cologne, Germany.
A network of enzyme and disease correlations was built by automatically extracting relevant information from the abstracts of biomedical literature. The concept-based data and implemented visualization techniques allow easy navigation by researchers to explore knowledge available in literature databases and develop new theories.
Judith Lucia Gomez, Ingo Dreyer and Bernd Mueller-Roeber. University of Potsdam, Institute for Biochemistry and Biology, Dept. Molecular Biology, Karl-Liebknechtstrasse 24/25, Haus 20, 14476 Golm, Germany.
The regulation of gene expression in plants is thought to result from the binding of different sets of transcription factors to promoter cis-elements. We tested HMM based methods to search for target genes in the model plant Arabidopsis thaliana, harbouring putative binding sites for transcription factors in their promoter regions.
P.W. Lord, R.D. Stevens, A. Brass and C.A. Goble. Dept. of Computer Science, Manchester University.
The Gene Ontology (GO) represents knowledge of a gene product's function, process and location in a computationally amenable form. We present metrics for measuring the similarity between GO terms, and therefore semantic similarity of gene products annotated with them. We validate these metrics by comparing them with measures of sequence similarity, and show several uses for the measure.
Gopinath Ganji 1,2, Yingfu Li 3, T. Chiang 2 and A. Jamie Cuticchia 1. 1 Department of Medical Biophysics, University of Toronto, 610 University Avenue, Toronto, Ontario, CANADA M5G 2M9, 2Center for Computational Biology, Hospital for Sick Children, 555 University Avenue, Toronto, Ontario, CANADA M5G 1X8, 3Department of Biochemistry and Department of Chemistry, McMaster University, 1200 Main St. W., Hamilton, Ontario, CANADA L8N 3Z5.
We hypothesize catalytic nucleic acids containing characteristic structural/functional sequence features can be probabilistically modeled and experimentally verified. By employing pattern discovery algorithms, structure prediction tools and machine learning methods, we have attempted to characterize various classes of SELEX-generated 'DNA kinases' (self-phosphorylating DNA) that recruit specific divalent metal cations and NTPs.
Michael Cornell, Paul Kirby, Cornelia Hedeler and Norman W Paton. Dept of Computer Science, Kilburn Building, University Of Manchester, M13 9PL.
GIMS is an object database that integrates genome sequence data with functional data (transcriptome, metabolome, metabolic pathway, proteome and protein-protein interactions) in a single data warehouse. GIMS can be browsed or analysed using canned queries. GIMS can be queried remotely using a Java application that can be downloaded from www.cs.man.ac.uk/~norm/gims.
John L Spouge and Eva Czabarka. National Center for Biotechnology Information, National Institutes of Health, Bethesda MD USA.
One key problem in designing intelligent systems for molecular biology is to determine which of two database retrieval methods is better. We give a simple statistical test based on z-scores to calculate the significance of differences in ROC[n] scores and apply the method to assess putative improvements to PSI-BLAST.
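The abstract does not give the test statistic, so the following is only a minimal sketch of a generic z-score comparison of two scores. It assumes that each ROC[n] score comes with a variance estimate and that the two estimates are independent (when both methods are run on the same queries this may not hold); the function names and the numbers are illustrative assumptions, not values from the poster.

```python
import math

def z_score_difference(score_a, var_a, score_b, var_b):
    """z-score for the difference of two score estimates.

    Assumption: the two estimates are independent, so their variances add.
    """
    return (score_a - score_b) / math.sqrt(var_a + var_b)

def two_sided_p_value(z):
    """Two-sided p-value of z under a standard normal null distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Made-up scores and variances for two hypothetical retrieval methods.
z = z_score_difference(0.80, 0.0004, 0.75, 0.0005)
p = two_sided_p_value(z)
print(f"z = {z:.3f}, p = {p:.4f}")
```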
Klaus-Peter Pleissner1, Till Eifert2, Frank Schmidt1, Stefan H.E. Kaufmann1 and Peter R. Jungblut1. 1Max Planck Institute for Infection Biology, 2Algorithmus GmbH.
A collection of proteome databases which comprises 2-D gel proteins, Isotope Coded Affinity Tag (ICAT) and functional classification databases for Mycobacterium tuberculosis and Helicobacter pylori is presented. Information about genes, proteins and metabolic pathways serves as an information source for bacterial immunology. http://www.mpiib-berlin.mpg.de/2D-PAGE.
Everitt R.#, Minnema S.E.#, Koster C.S., Olson R.A., Wride M.A. and Rancourt D.E. Department of Biochemistry and Molecular Biology, University of Calgary, Alberta, Canada.
#These authors contributed equally to this work.
The Rancourt EST Database (RED) is a web-based system for the analysis, management, and dissemination of expressed sequence tags (ESTs). RED represents a flexible template DNA sequence database that can be easily manipulated to suit the needs of other labs undertaking mid-size sequencing projects. Source code for RED and the associated tools is available from email@example.com. RED is publicly accessible via www.ucalgary.ca/~rancourt.
Holger Hoos, Andrew Kwon and Raymond Ng. Department of Computer Science, University of British Columbia.
We propose a new method for analyzing microarray time series data. We apply the method to yeast cell-cycle time series data to find potential regulatory pairs. The results indicate that our algorithm finds true positive pairs different from those found by the correlation and edge detection methods of Filkov et al.
Ioannis K. Moutsatsos, Yongchang Qiu, Rod Hewick, Joseph Wooters, Steve Howes, Gary Van Domselaar and Patrick Cody. Wyeth Research Inc., 35 Cambridgepark Drive, Cambridge, MA 02140, USA.
TurboSEQUEST is a search engine used for protein prediction from MS/MS spectra of protein digests. We have developed a custom application, SequestOnOracle, that extends TurboSEQUEST with the data management and analysis tools of a relational database. SequestOnOracle’s unique capabilities derive from its ability to summarize and compare the protein and peptide content from multiple TurboSEQUEST searches.
Raymond T. Ng, Jorg Sander, Monica C. Sleumer and Man Saint Yuen. University of British Columbia.
Under the assumption that cells that look morphologically similar may behave very differently at a molecular level, we present a method for clustering and classifying SAGE libraries to detect the similarities and differences between various tissue types and neoplastic states.
Jessica M. Phan, Raymond Ng and Steve Jones. University of British Columbia.
We demonstrate the toolkit for Gene Expression Analyzer (GEA) used particularly with high dimensional data such as SAGE. GEA provides a graphical interface with operations for clustering, comparing and contrasting gene expressions in different SAGE clusters. GEA would eventually be linked to various bioinformatics databases for integrated genomic analysis.
Takuya Oyama1,4, Kagehiko Kitano1,4, Kenji Satou 2,4 and Takashi Ito3,4. 1INTEC Web and Genome Informatics Corporation, 2School of Knowledge Science, Japan Advanced Institute of Science and Technology, 3Cancer Research Institute, Kanazawa University and 4Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Corporation (JST).
We studied a data mining method that discovers rules about protein-protein interactions from accumulated protein-protein interaction data. The method reveals relations between the features of mutually interacting proteins, for example that a protein having feature F1 interacts with a protein having feature F2.
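A rule of the form "feature F1 interacts with feature F2" can be scored with the usual support and confidence measures from association rule mining. The sketch below is one possible reading, not the authors' algorithm: the function name, the thresholds, the protein identifiers and the feature labels are all hypothetical.

```python
from collections import Counter

def mine_feature_rules(interactions, features, min_support=2, min_confidence=0.6):
    """Return rules (f1, f2, support, confidence) over interacting protein pairs.

    support    = number of oriented interactions where the left protein has f1
                 and the right protein has f2
    confidence = support / number of oriented interactions whose left protein has f1
    Interactions are treated as undirected, so both orientations are counted.
    """
    pair_counts = Counter()
    left_counts = Counter()
    for a, b in interactions:
        for x, y in ((a, b), (b, a)):
            for f1 in features.get(x, ()):
                left_counts[f1] += 1
                for f2 in features.get(y, ()):
                    pair_counts[(f1, f2)] += 1
    rules = []
    for (f1, f2), support in pair_counts.items():
        confidence = support / left_counts[f1]
        if support >= min_support and confidence >= min_confidence:
            rules.append((f1, f2, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

# Hypothetical toy data: protein -> set of features, plus observed interactions.
features = {
    "P1": {"kinase"}, "P2": {"SH2"}, "P3": {"kinase"},
    "P4": {"SH2"}, "P5": {"zinc_finger"},
}
interactions = [("P1", "P2"), ("P3", "P4"), ("P1", "P5")]
rules = mine_feature_rules(interactions, features)
for f1, f2, support, confidence in rules:
    print(f"{f1} -> {f2}: support={support}, confidence={confidence:.2f}")
```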
Kazuharu Arakawa1,2, Koya Mori1,3 and Masaru Tomita1,2. 1Institute for Advanced Biosciences, Keio University, 2Department of Environmental Information and 3Graduate School of Media and Governance.
G-language Genome Analysis Environment (G-language GAE) is a generic software package aimed at higher efficiency in bioinformatics analysis. G-language GAE provides a set of Perl libraries as an interface for software development, and a graphical user interface for easy manipulation. It is distributed freely under the GPL at http://www.g-language.org/.
E. C. J. Green1, J. Airey1, R. Cox1, Y. Hashim1, T. Hough1, Z. Lalanne1, K. E. Logan1, P. Nolan1, L. Visor1, A-M. Mallon1, P. Jones1, R. Selley1, A. Blake1, S. Greenaway1, H. J. Kirkbride1, J. Hunter2 and S. D. M. Brown1. 1Mouse Genome Center and Mammalian Genetics Unit, MRC, Harwell, Oxfordshire, OX11 0RD, UK and 2GlaxoSmithKline, New Frontiers Science Park, Harlow, CM19 5AW, UK.
A system is described for the management of data produced from the characterization of novel phenotypes, observed from a large scale ENU mutagenesis programme. A diversity of data is being produced from sources such as microarray technology, in situ hybridization studies, animal husbandry, candidate gene identification, DHPLC and sequencing.
M. Simon, S. Greenaway, A-M. Mallon, R. Selley, P. Jones, Z. Tymowska-Lalanne, S. Breeds, S. Smythe, H. Kirkbride, S. Webb, A. Blake, J. Weekes, E. Green, E. Mollison, P. Denny, P. Nolan, M. Goldsworthy, M. Strivens and S.D.M. Brown. Medical Research Council, Harwell, Oxon, OX11 0RD, England.
A vital element of high-throughput genetics is to capture the data generated from experimental procedures and to integrate and disseminate these results. Two data management systems have been developed to capture this data at the point of generation - Wisp and Willo. These capture data specifically generated from sequencing and genotyping.
Chandra Ramanathan1, Shuba Gopal2, Bob Bruccoleri1, John Feder1, Gabe Mintier1 and Terry Gaasterland2. 1Bristol-Myers Squibb and 2The Rockefeller University.
Identification, verification and biological characterization of splice variants are challenging tasks, but they are essential to understanding the observed biological complexity in humans. A systematic bioinformatics method is being developed to mine human genomic and EST data to identify splice variant forms of druggable gene targets and to correlate these variants with disease/tissue expression information available in various proprietary databases.
Kavoos Basmenji, Zhan Chang, Bahram Habibi-Nazhad and David Wishart. Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, AB, T6G 2N8.
DrugBank is a web-enabled database developed to facilitate drug discovery and drug analysis. It combines drug information with drug target information to allow users the possibility of linking small molecule data with protein sequence/structure data. DrugBank can be accessed freely at http://redpoll.pharmacy.ualberta.ca/~zchang/cgi-bin/welcome.cgi.
V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G. Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A. Pizarro, E. Manduchi, S. McWeeney, J. Schug and C. Stoeckert. Center for Bioinformatics, University of Pennsylvania.
GUS is a comprehensive strongly typed relational schema and object-based software platform for integration, analysis, curation and presentation of sequence based genomics information. It has been used to model and/or mine human, mouse, plasmodium and the pancreas, and is suitable for model organisms in general. It is freely available.
Mark Schreiber1,2 and Chris Brown1. 1AgResearch NZ, PO Box 50034, Dunedin, New Zealand and 2Dept of Biochemistry, University of Otago, PO Box 56 Dunedin, New Zealand.
Calculation of the information content of motifs in genomes highly biased in nucleotide composition leads to overestimates of the amount of useful information in the motif. By treating a biased genome as a discrete channel with noise, in accordance with Shannon Information Theory, we were able to remove both ‘Distortion’ and ‘Noise’ from the motif and recover a more instructive biological ‘signal’.
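The overestimate described above can be illustrated with a standard correction, which is an assumption here and is not the authors' channel model with distortion and noise: the information of a motif column is measured as relative entropy against the genome's background frequencies rather than against a uniform background. The column and background frequencies below are made up.

```python
import math

def column_information(freqs, background):
    """Relative entropy (in bits) of one motif column against a background."""
    return sum(
        p * math.log2(p / background[base])
        for base, p in freqs.items()
        if p > 0
    )

# Hypothetical motif column observed in a hypothetical AT-rich genome.
column = {"A": 0.8, "T": 0.1, "C": 0.05, "G": 0.05}
uniform = {base: 0.25 for base in "ACGT"}
at_rich = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}

ic_uniform = column_information(column, uniform)  # ignores composition bias
ic_biased = column_information(column, at_rich)   # accounts for the bias
print(f"uniform background: {ic_uniform:.3f} bits")
print(f"AT-rich background: {ic_biased:.3f} bits")
```

With a uniform background, much of the apparent information in this column simply reflects the genome's preference for A, which is why the background-aware value is lower.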
Li Li, Brian Brunk, Christian J. Stoeckert Jr and David S. Roos. Department of Biology, University of Pennsylvania, Philadelphia, USA and Center for Bioinformatics, University of Pennsylvania, Philadelphia, USA.
To integrate eukaryotic sequence data with information on biological processes, we sought to identify orthologous groups by combining sequence similarity comparisons with graph clustering algorithms. Queries based on user-defined species distribution provide a snapshot of shared/diversified processes, facilitating (for example) the identification of targets for broad-spectrum antibiotics targeting apicomplexan parasites.
Bahram Habibi-Nazhad, Melania Ruaini, Kavoos Basmenji and David S. Wishart. Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton AB T6G 2N8, Canada.
The CyberCell Database (CCDB) is a web-enabled, user-friendly database containing previously published and electronically archived information on nearly every aspect of E. coli molecular biology and enzymology. We have also constructed CC3D which contains E. coli structural proteomic data and CCMD which contains the chemical database of metabolites and other small molecules used to support metabolic analysis.
Kazunori Miyazaki and Satoshi Itoh. Advanced Materials and Devices Laboratory, Corporate Research and Development Center, TOSHIBA CORPORATION.
We have developed a Java/XML-based functional database system for olfactory receptors (OR), built from databases that can be accessed via the Internet. The system analyzes the XML data for each OR using predictive tools on the Web and then semi-automatically adds the resulting annotations to the analyzed records.
Soon Heng Tan and See-Kiong Ng. Laboratories for Information Technology, Singapore.
Vast amounts of molecular interaction pathway information can be extracted automatically from MEDLINE's abstracts using natural language processing, but progress has been hindered by a lack of a standard corpus for evaluation. We describe a test corpus we have created from our Pathweaver project that is suitable for such evaluation.
Sandy Maumus1,2, Amedeo Napoli2, Rafik Taouil2 and Sophie Visvikis1. 1INSERM U525, Université Henri Poincaré (Nancy 1) – Faculté de Pharmacie, 30 rue Lionnois, 54000 Nancy, France and 2LORIA – UMR 7503, B.P. 239, 54506 Vandoeuvre-Lès-Nancy, France.
Based on an application of symbolic data mining methods to a test database, we underline the role played by the analyst in the knowledge discovery process. Encouraged by positive results, we plan to apply these methods to a large database to investigate the relationships between gene polymorphisms and intermediate phenotypes of cardiovascular diseases.
Per-Olof Fjallstrom. Affibody.
The “clusters” returned by standard clustering methods applied to microarray data are not necessarily biologically relevant. We present a method for assessing whether such clusters are unusually compact and isolated. The method has been successfully applied to several microarray data sets. It does not require estimates of the variance of experimental error.
K. MacLeod and E. Westwick. Astex Technology Ltd.
An amino acid centered relational database has been designed to store sequences of P450 proteins that have been engineered in order to optimise crystallisation behavior. Amino acids are stored as individual entities, allowing the physical and chemical properties of the residues to be correlated with experimental outcome, using SQL queries.
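The idea of storing residues as individual rows and correlating their properties with experimental outcome can be sketched with an in-memory SQLite database. The table names, column names and construct records below are hypothetical and are not the authors' schema; the hydropathy values for A, K and L are taken from the Kyte-Doolittle scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE construct (
    construct_id INTEGER PRIMARY KEY,
    name         TEXT,
    crystallised INTEGER            -- 1 = crystals obtained, 0 = none
);
CREATE TABLE residue (
    construct_id   INTEGER REFERENCES construct(construct_id),
    position       INTEGER,
    amino_acid     TEXT,
    hydrophobicity REAL             -- one stored property per residue
);
""")

# Made-up engineered constructs and their (very short) sequences.
conn.executemany(
    "INSERT INTO construct VALUES (?, ?, ?)",
    [(1, "variant_a", 1), (2, "variant_b", 0)],
)
conn.executemany(
    "INSERT INTO residue VALUES (?, ?, ?, ?)",
    [(1, 1, "A", 1.8), (1, 2, "K", -3.9),
     (2, 1, "A", 1.8), (2, 2, "L", 3.8)],
)

# Correlate a residue property with the experimental outcome.
rows = conn.execute("""
    SELECT c.crystallised, AVG(r.hydrophobicity)
    FROM residue r JOIN construct c USING (construct_id)
    GROUP BY c.crystallised
    ORDER BY c.crystallised
""").fetchall()
for crystallised, mean_hydrophobicity in rows:
    print(crystallised, round(mean_hydrophobicity, 2))
```

Keeping one row per residue is what makes this kind of aggregate query possible without parsing sequence strings in application code.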
Jibin Sun and An-Ping Zeng. Microbial Systems and Genome Analysis, GBF.
A method is proposed to reconstruct metabolic networks in silico directly from unannotated genome sequences. A comparison of data from different sequencing stages (3.9 vs. 7.9 times coverage) for one organism revealed that 3.9 times coverage of the genome is sufficient (with 99.3% identity) for reconstructing the metabolic network.
Antoine Janssen1, Jan van Oeveren1, Pieter Vos1, Gert Vriend2, Roland Siezen2, Rene van Schaik3 and Jack Leunissen2. 1Keygene N.V., Wageningen, The Netherlands and 2Center for Molecular and Biomolecular Informatics, University of Nijmegen, Nijmegen, The Netherlands and 3Organon, Oss, The Netherlands.
The Keygene/CMBI AFLP® quality assessment and improvement tool is a web based application that automates quality assessment and visualization of (cDNA-)AFLP® data. It improves proprietary data by use of public data. The analysis includes coverage / redundancy calculation, internal contig building, full length discovery and potential SNP discovery. http://www.cmbi.nl/kg_bin/dataset_annotator.pl.
Barbara Eckman, Mauricio Hernández, Howard Ho, Felix Naumann and Lucian Popa. IBM.
Bioinformatics data sources typically have large, complex structures, reflecting the richness of the scientific concepts they model. Clio is an information integration tool that helps users define mappings between disparate schemas, thus providing an integrated view of all related data sources and enabling data transformations between the sources.
Tomoko Ohta1, Yuka Tateisi2, Jin-Dong Kim2 and Jun-ichi Tsujii1,2. 1Univ. of Tokyo and 2CREST, JST.
We are developing the necessary resources, including a domain ontology and an annotated corpus built from MEDLINE abstracts. We have already annotated 2,500 abstracts with 31 different semantic classes. Part-of-speech annotation of the same set of abstracts annotated for named entities is under way using the Penn Treebank tag set. In this poster, we report on the current status of our corpus.
Data Mining. (Section B)
Byron Kuo, Timothy Chan and Raymond Ng. Department of Computer Science, University of British Columbia.
The intent of the experiment is to attempt to characterize and find any similarities between seemingly different cancers (breast and brain) at the sub-cellular level. Based on publicly available SAGE libraries of cancerous and normal breast and brain tissues, we obtained a list of candidate cancer-related genes by applying the two-sample t-test and then analyzed their similarities at the gene expression level.
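The gene selection step can be sketched as follows. The abstract does not say which variant of the two-sample t-test was used, so this sketch assumes the pooled-variance (Student) form and ranks genes by the absolute t statistic; the gene names and tag counts are made up, and the function names are illustrative.

```python
import math
from statistics import mean, variance

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return t, df

def rank_genes(tumor, normal):
    """Order genes by decreasing |t| between tumor and normal libraries."""
    scores = {gene: two_sample_t(tumor[gene], normal[gene])[0] for gene in tumor}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)

# Hypothetical normalised tag counts per library.
tumor = {"geneA": [10, 12, 14], "geneB": [5, 6, 5]}
normal = {"geneA": [2, 4, 6], "geneB": [5, 5, 6]}

t_a, df_a = two_sample_t(tumor["geneA"], normal["geneA"])
ranking = rank_genes(tumor, normal)
print(f"geneA: t = {t_a:.3f} with {df_a} degrees of freedom")
print("ranking:", ranking)
```

A p-value would additionally require the t distribution with the returned degrees of freedom, which is left out here to keep the sketch dependency-free.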
Samson Chien, Lawrence T. Reiter, Ethan Bier and Michael Gribskov. University of California, San Diego / San Diego Supercomputer Center.
Homophila is a database of human disease genes associated with their counterparts in Drosophila. Homophila provides a comprehensive linkage between OMIM and Flybase in order to stimulate functional genomic studies in Drosophila that address questions concerning human genetic diseases. Homophila is available at http://homophila.sdsc.edu.
Fazel Famili1, Roy Walker2, Alan Barton1, Qing-Yan Liu2, Ziying Liu1, Julio Valdes1, Youlian Pan1, Brandon Smith2, Junjun Ouyang1, Melanie Lehman2, Lynn Wei1 and Weiling Xu. 1Institute for Information Technology, 2Institute for Biological Sciences, National Research Council of Canada, Montreal Road, Ottawa, Ontario, K1A 0R6, Canada.
This poster explains the role of integrated data mining systems in functional genomics. We will describe all stages of data preprocessing, the type of data and some of the useful knowledge that may be discovered from functional genomics data. The BioMiner architecture and its main functionalities along with some advantages of integrated architectures are explained.
Youlian Pan1, Roy Walker2, A. (Fazel) Famili1 and Qing-Yan Liu2. 1Institute for Information Technology and 2Institute for Biological Sciences, National Research Council Canada, 1200 Montreal Road, Ottawa, Ontario, K1A 0R6, Canada.
This paper argues for the necessity of integrating information, such as patterns of transcription factor binding sites and the transcription factors themselves, into gene expression data mining processes. The paper also demonstrates the advantage of combining symbolic sequence data with numerical gene expression data analysis, using an application in the BioMiner project.
Youlian Pan1,2,3, Christoph W. Sensen4, Malcolm Heywood2 and Michael A. Shepherd2. 1Faculty of Computer Science, Dalhousie University, 6050 University Avenue Halifax, NS, Canada B3H 1W5; 2Canadian Bioinformatics Resources, NRC, 1411 Oxford St. Halifax, NS, Canada B3H 3Z1; 3Institute for Information Technology, NRC, 1200 Montreal Rd, Bldg. M-50, Ottawa, Ont. Canada K1A 0R6 and 4Department of Biochemistry and Molecular Biology, University of Calgary, 3330 Hospital Drive N.W., HSC 1150, Calgary, Alberta, Canada, T2N 4N1.
This paper investigates the potential of improving exon predictions by integrating GrailExp, GenScan, and MZEF using Multilayer Perceptron and Mixture of Experts neural networks. For human exon prediction, this integration system has significantly better recovery, by 25%, than any individual prediction engine alone. This system is available at http://www.cbr.nrc.ca/pany/integ.html.
Christopher Bowman and Richard Baumgartner. Institute for Biodiagnostics, National Research Council Canada, Winnipeg, Manitoba, Canada.
We compare linear and nonlinear techniques for identifying the intrinsic dimensionality of a data set, including local and global principal component analysis and a novel implementation of the Whitney reduction network. The performance of these techniques is evaluated using independent training and validation sets drawn from magnetic resonance and mass spectroscopy.
John Binamé1, Laurence Leherte1, Janice I. Glasgow2, Suzanne Fortier3 and Daniel P. Vercauteren2. 1Laboratoire de Physico-Chimie Informatique, Facultés Universitaires Notre-Dame de la Paix, Namur, Belgium, 2School of Computing, Queen's University, Kingston, ON, Canada, 3Department of Chemistry, Queen's University, Kingston, ON, Canada.
This work concerns the building of a database of topological properties of electron density functions of organic molecules at medium resolution to develop an automated way to reduce molecules to few relevant points. These points are further used in similarity search and pharmacophore proposition procedures applicable to all pharmacological domains.
Sivakumar Kannan, Genevieve Boucher and Gertraud Burger. Canadian Institute for Advanced Research, Program in Evolutionary Biology, Département de Biochimie, Université de Montréal, Montréal, Québec H3C 3J7, Canada.
Genomes of mitochondria and chloroplasts from diverse organisms carry on average 5 to 20 ORFs without assigned functions. In order to understand the biological role of these ORFs, we have developed a comprehensive analysis procedure using data mining methods. The approach and the predicted data will be presented.
Rich Maclin1 and Mark Craven2. 1Computer Science Department, University of Minnesota, Duluth and 2Biostatistics and Medical Informatics Department, University of Wisconsin, Madison.
We present a tool for annotating high-throughput experiments by automatically extracting keyphrases to characterize clusters of genes. Our method autonomously associates genes with PubMed abstracts, extracts keyphrases that are statistically associated with gene clusters, and attempts to organize both genes and keyphrases into informative subclusters.
Zhaohui S. Qin1, Lee Ann McCue2, William Thompson2, Linda Mayerhofer2, Charles E. Lawrence2,3, and Jun S. Liu1. 1Department of Statistics, Harvard University, Cambridge, MA 02138, 2The Wadsworth Center, New York State Department of Health, Albany, NY 12201, 3Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180.
The availability of complete genome sequences has made possible the computational identification of hundreds of binding sites via cross-species comparisons. We describe a novel Bayesian motif clustering algorithm that predicts the number of clusters among these sites, and identifies the sites belonging to each cluster.
Yaron Butterfield, Ran Guin, Ursula Skalska, Duane Smailus, Angelique Schnerch, Kevin Teague, Jacquie Schein, Marco Marra, Steven Jones and the Genome Sciences Centre. (http://www.bcgsc.bc.ca), British Columbia Cancer Research Centre, Vancouver, BC, Canada, V5Z 4E6.
We have established a bioinformatics pipeline to handle the large amount of DNA sequence data generated in our laboratory. We have created a laboratory information system in which data are stored in a central relational database; in conjunction with Perl software, it allows efficient, high-throughput sequencing and processing.
Yanmei Lu, Nan Lin, Betty Huang, Donald G. Payan and Kunbin Qu. Rigel Pharmaceuticals, Inc., 240 E. Grand Ave, South San Francisco, CA 94080.
The ubiquitin pathway is involved in many important cellular processes. The NR database was mined for ubiquitin pathway related proteins using Gibbs Sampling and HMM. We identified around 900 proteins in the ubiquitin reaction cascade. Our results demonstrate diverse protein domain structure compositions and functions in ubiquitin-domain containing proteins and ubiquitin ligases.
Nan Lin, Davidson Wan, Jiao He, Yanmei Lu, Ying Huang, Donald G. Payan and Kunbin Qu. Rigel, Inc.
Galaxy is an enterprise solution system built at Rigel to construct and organize flexible “high throughput informatics pipelines” for data tracking, analysis and integration from diverse public sources with the overlay of the internal experiment data. It integrates functional platforms used by biologists with programming applications that are continuously updated.
Hideo Bannai1, Satoru Miyano1, Kenta Nakai1, Sascha Ott1 and Yoshinori Tamada2. 1University of Tokyo and 2Tokai University.
We use a program predicting short introns to analyse the processing of long introns (some thousand bases or longer). The focus is whether long introns contain a structure of short introns, such that the long introns can be processed by a series of splicing reactions rather than by one single reaction.
Hideo Bannai, Yoshinori Tamada, Sascha Ott, Kim Sunyong, Kenta Nakai and Satoru Miyano. Human Genome Center, Institute of Medical Science, University of Tokyo, 4-8-1 Minato-ku, Tokyo, 108-8639 Japan.
We analyzed the base compositions of introns available from Ensembl for H. sapiens, M. musculus, D. melanogaster and D. rerio, and have discovered a notable correlation between intron length and its base composition for H. sapiens and M. musculus. The tendency was not observed in D. melanogaster and D. rerio.
Daisuke Shinozaki and Osamu Maruyama. Faculty of Mathematics, Kyushu University, Japan.
We propose a method for selecting the best model of paired motifs in the promoter regions of a given set of genes. We apply our method to yeast data, such as sets of co-regulated genes, and report the experimental results.
Sébastien Hergalant, Bertrand Aigle, Bernard Decaris, Pierre Leblond and Jean-François Mari. LORIA (équipe Orpailleur, BP 239, 54506 Vandoeuvre-lès-Nancy, France) and Laboratoire de Génétique et Microbiologie (UMR UHP-INRA 1128, IFR 110, 54506 Vandoeuvre-lès-Nancy, France).
We present a genomic data mining method in which the user describes a signal computed by a second-order HMM. This signal, which represents the probability of classifying a nucleotide residue or a group of residues in a particular state, allows the localization of repetitions in a complete bacterial genomic sequence.
Jasmin Müller, Falk Schubert and Roland Eils. Intelligent Bioinformatics Systems, German Cancer Research Center.
SVM-Decide is an approach to adapt knowledge from gene expression and genomic profiles for clinical decision support systems. Therefore we combine a support vector machine classifier with an explanation component for the physician. Furthermore an explicit competence model enables our system to classify only cases within its competence area.
Eric J. Grant and Dale L. Preston. Radiation Effects Research Foundation.
Retrieving data from a relational database can be complex. ‘Easy Click’, a desktop PC application, dynamically generates SQL statements via a point-and-click interface, shielding researchers from writing SQL and guaranteeing easy and consistent access to research data. An initialization file supplies Easy Click with variable definitions, making the application completely customizable.
Y. Zeng, J. Garcia-Frias, J. Tang and G. Gao. Department of Electrical and Computer Engineering, University of Delaware.
Because of the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. This poster proposes a meta-clustering approach, which can extract the information from results of different clustering techniques adaptively and provides a better interpretation of the data patterns.
Taesung Park1, Sung-Gon Yi1, Hosik Choi1, Seung-Yeoun Lee2, Kee-Ho Lee3, Jung Kyoon Choi4, Sangsoo Kim4, Yeom Young Il4, Choi Jong Young5 and Daeghon Kim6. 1Department of Statistics, Seoul National University, Seoul, Korea, 2Department of Applied Mathematics, Sejong University, Seoul, Korea, 3Laboratory of Molecular Oncology, Korea Cancer Center Hospital, 4Korea Research Institute of Bioscience and Biotechnology, Taejon, Korea, 5The Catholic University of Korea, Seoul, Korea and 6Chonbuk National University, Jeonju, Chonbuk, Korea.
For the case in which the same type of microarrays is collected from different clinical centers, we propose new discrimination methods that account for the variability caused by the different centers. The proposed methods are illustrated using microarray data for liver cancer patients from three different clinical centers in Korea.
Jeffrey L. Brown, Thomas C. Hudson and Kenisha V. Johnson. University of North Carolina at Wilmington.
GeneBeans uses Enterprise Java Beans to provide biologists a graphical dataflow interface for constructing queries and analyses of gene index databases without the use of a textual query language. The tool is intended to make bioinformatics more generally practicable.
Sun Kim. School of Informatics, Center for Genomics and Bioinformatics, Indiana University, Bloomington.
As more sequences become available at an exponential rate, sequence analysis on a large number of sequences becomes increasingly important. Sequence clustering algorithms are computational tools for that purpose. In this paper, we present our clustering algorithm BAG, which uses two graph properties: biconnected components and articulation points.
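The abstract does not describe how BAG uses these two graph properties, so the sketch below only shows how articulation points can be found, using the standard depth-first search with low-link values. It assumes a simple undirected graph in which nodes stand for sequences and edges for pairwise similarity hits; the example graph and node names are made up.

```python
def articulation_points(graph):
    """Return the articulation points of a simple undirected graph.

    `graph` maps each node to a list of its neighbours. The search is
    recursive, so very large graphs may need a higher recursion limit.
    """
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # v's subtree cannot climb above u, so u separates it.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # A DFS root is an articulation point only with 2+ DFS children.
        if parent is None and children > 1:
            points.add(u)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points

# Two hypothetical sequence families (triangles) joined only through s3.
similarity_graph = {
    "s1": ["s2", "s3"],
    "s2": ["s1", "s3"],
    "s3": ["s1", "s2", "s4", "s5"],
    "s4": ["s3", "s5"],
    "s5": ["s3", "s4"],
}
cut_nodes = articulation_points(similarity_graph)
print(cut_nodes)
```

In this toy graph the two triangles are the biconnected components, and the single node they share is the articulation point.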
Daniel McShan, Shilpa Rao and Imran Shah. University of Colorado, Health Science Center, 4200 E 9th Ave, C-245, Denver, CO 80120, USA.
We present a novel metabolic cartography approach for representing the metabolic space based on the biochemical properties of molecules. Summaries and visualizations of this space are presented, offering a quantitative and qualitative overview of the metabolome. We are using metabolic cartography in our research on pathway inference methods.
Sean McIlwain1, Arno F. Spatola2, David Vogel2, Sylvie Blondelle3 and David Page4.
1Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53706, U.S.A., 2Institute for Molecular Diversity and Drug Design, Department of Chemistry, University of Louisville, Louisville, KY 40292, U.S.A., 3Torrey Pines Research Institute for Molecular Studies, La Jolla, CA 92037, U.S.A. and 4Department of Biostatistics and Medical Informatics and Department of Computer Sciences, University of Wisconsin, WI 53706, U.S.A.
We apply inductive logic programming to the task of predicting anti-microbial activity, specifically, the ability of certain molecules to inhibit growth of Pseudomonas aeruginosa. This is done by taking into account the three-dimensional structure and biological activities from a database of tested molecules.
Michael Elkaim and Chris Ponting. MRC Functional Genetics Unit, University of Oxford, Department of Human Anatomy and Genetics, South Parks Road, Oxford, OX1 3QX, United Kingdom.
The manual annotation of protein domains is an arduous task. We present FAST, a program that automatically annotates protein domains by extracting functional information, including key word stems and key quotes, from literature that is relevant to protein sequences containing these domains.
Fajar Restuhadi1, Andrew Hayes2, Simon J. Hubbard1 and Stephen G. Oliver2. 1Dept. Biomolecular Sciences, UMIST, PO BOX 88, Manchester M60 1QD and 2School of Biological Sciences, Univ. of Manchester, Manchester M13 9PT.
Self-Organizing Map (SOM) approaches were used to analyse our unique transcriptome data from high-throughput northern hybridisations. For a constant neighbourhood size and a finite data set, the objective function associated with the SOM algorithm is the intra-class sum of squares (SSIntra) extended to neighbouring classes. At convergence, the SOM algorithm thus exactly minimizes the SSIntra function. We applied the bootstrap method to estimate the variability of SSIntra: if the SOM is computed several times on bootstrap resamples, the mean and standard deviation of SSIntra (the distortion) can be calculated. This variability can then be used to assess the stability of the quantization error in the SOM.
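The bootstrap procedure described above might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a one-dimensional map, a fixed neighbourhood radius, a linearly decaying learning rate, and SSIntra evaluated on each resample. All function names, parameter defaults and the synthetic profiles are hypothetical.

```python
import random
import statistics


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def neighbourhood(winner, radius, n_units):
    """Indices of the winning unit and its neighbours on a 1-D map."""
    return range(max(0, winner - radius), min(n_units, winner + radius + 1))


def train_som(data, n_units, radius, epochs, rng):
    """Online training of a 1-D SOM with a constant neighbourhood radius."""
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    n_steps = epochs * len(data)
    for step in range(n_steps):
        x = rng.choice(data)
        rate = 0.5 * (1 - step / n_steps)          # assumed learning-rate schedule
        winner = min(range(n_units), key=lambda j: sq_dist(x, weights[j]))
        for j in neighbourhood(winner, radius, n_units):
            weights[j] = [w + rate * (xi - w) for w, xi in zip(weights[j], x)]
    return weights


def ss_intra(data, weights, radius):
    """Sum of squared distances from each point to its winner and the winner's neighbours."""
    n_units = len(weights)
    total = 0.0
    for x in data:
        winner = min(range(n_units), key=lambda j: sq_dist(x, weights[j]))
        total += sum(sq_dist(x, weights[j])
                     for j in neighbourhood(winner, radius, n_units))
    return total


def bootstrap_ss_intra(data, n_boot=20, n_units=4, radius=1, epochs=10, seed=0):
    """Mean and standard deviation of SSIntra over bootstrap resamples."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]   # resample with replacement
        weights = train_som(sample, n_units, radius, epochs, rng)
        values.append(ss_intra(sample, weights, radius))
    return statistics.mean(values), statistics.stdev(values)


# Synthetic three-point "expression profiles" around three levels (illustrative only).
gen = random.Random(1)
profiles = [[gen.gauss(c, 0.3) for _ in range(3)]
            for c in (0, 0, 0, 2, 2, 2, 4, 4, 4, 4)]
mean_ss, sd_ss = bootstrap_ss_intra(profiles)
```

A small standard deviation relative to the mean would suggest that the quantization error is stable across resamples. Whether to score each map on its own resample or on the full data set is a design choice the abstract does not settle.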
Douglas A. Hosack and Richard A. Lempicki. Laboratory of Immunopathogenesis and Bioinformatics, SAIC Frederick.
EASE is a software package that finds biological themes over-represented in any list of genes derived from microarray experiments or other high-throughput screening methods. It enables researchers utilizing these technologies to quickly find the interesting biological stories in their results.
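The abstract does not state which statistic EASE uses to score over-representation. As a hedged illustration of the general idea, the Python sketch below computes a standard one-sided hypergeometric tail probability: the chance of seeing at least the observed number of genes from a category in a list drawn at random from the population. The function name `enrichment_p` and the example counts are assumptions made for illustration.

```python
from math import comb


def enrichment_p(list_hits, list_size, pop_hits, pop_size):
    """P(X >= list_hits) when list_size genes are drawn without replacement
    from pop_size genes, pop_hits of which belong to the category."""
    upper = min(list_size, pop_hits)
    total = comb(pop_size, list_size)
    tail = sum(comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
               for k in range(list_hits, upper + 1))
    return tail / total


# Hypothetical example: 8 of 50 listed genes fall in a category that
# covers 100 of the 5000 genes on the array (about 1 expected by chance).
p_value = enrichment_p(8, 50, 100, 5000)
```

Categories would typically be ranked by this probability, with a correction for the number of categories tested.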
Dimitar Hristovski1 and Borut Peterlin2. 1National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894 USA and 2Department of Human Genetics, Clinical Center Ljubljana Zaloska, 1000 Ljubljana, Slovenia.
We describe an interactive literature-based discovery support system extended with background knowledge about the chromosomal location or expression location of diseases and genes. The goal of the system is to discover new, potentially meaningful relations between biomedical concepts (e.g. a gene candidate for a disease). The system is available at http://www.mf.uni-lj.si/bitola/.
Tanveer Syeda-Mahmood. IBM Almaden Research Center, 650 Harry Road, San Jose CA 95120.
The functional state of an organism is determined largely by the pattern of expression of its genes. Salient changes in gene expression can give clues about important events, such as the onset of a disease. In this paper, we address, for the first time, the problem of salient change detection in gene profiles. In particular, we look for salient inflection points in the time series of the gene profiles. Such inflection points are places where there is a significant change in the curvature of the gene profile, as revealed by a zero-crossing of the second derivative that is preserved over multiple scales of smoothing. By automatically selecting an optimal scale of description, we derive salient change points in genomic signals. The utility of salient change detection is demonstrated in the automatic identification of the regulatory phase for genes active in the mitotic cell cycle of budding yeast.
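The idea of a second-derivative zero-crossing that survives several scales of smoothing can be sketched in Python as below. This is a simplified illustration, not the author's method: it assumes Gaussian smoothing with edge clamping, a discrete second difference, and a fixed list of scales supplied by the caller, and it omits the automatic selection of an optimal scale. All names, the tolerance and the synthetic profile are hypothetical.

```python
import math


def gaussian_smooth(signal, sigma):
    """Smooth a sequence with a truncated Gaussian kernel, clamping at the edges."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    norm = sum(kernel)
    n = len(signal)
    return [sum(w * signal[min(max(i + k, 0), n - 1)]
                for k, w in zip(offsets, kernel)) / norm
            for i in range(n)]


def inflection_points(signal, sigma, eps=1e-9):
    """Indices i where the second difference changes sign between i and i + 1."""
    s = gaussian_smooth(signal, sigma)
    # d2[k] is the second difference centred on signal index k + 1
    d2 = [s[i - 1] - 2 * s[i] + s[i + 1] for i in range(1, len(s) - 1)]
    return [k + 1 for k in range(len(d2) - 1)
            if d2[k] * d2[k + 1] < 0 and abs(d2[k] - d2[k + 1]) > eps]


def salient_inflections(signal, sigmas, tolerance=1):
    """Zero-crossings at the finest scale that persist at every coarser scale."""
    per_scale = [inflection_points(signal, s) for s in sigmas]
    return [p for p in per_scale[0]
            if all(any(abs(p - q) <= tolerance for q in pts)
                   for pts in per_scale[1:])]


# Synthetic expression profile: a sigmoidal rise with its inflection
# midway between time points 10 and 11.
profile = [1 / (1 + math.exp(-(t - 10.5))) for t in range(22)]
change_points = salient_inflections(profile, sigmas=[1.0, 2.0, 3.0])
```

For this profile the single reported change point is index 10, the sample just before the curvature switches from convex to concave. The `tolerance` parameter allows for the small drift in crossing position that smoothing can introduce on real, noisier profiles.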
Barbara Eckman, Laura Haas, Prasad Kodali, Eileen Lin, Julia Rice and Peter Schwarz. IBM Life Sciences.
Integration of a widely diverse set of databases and applications is needed to carry out post-genomic bioinformatics research. We describe DiscoveryLink, IBM’s federated database offering, and illustrate how it is being used to provide integrated access to life sciences data, irrespective of where it is stored and its format.
Richard E. Green and Steven E. Brenner. Univ. California, Berkeley.
Alternative splicing may have an enormous impact on the protein coding diversity encoded by the human genome. We set out to investigate one aspect of this impact, the effect of alternative splicing on the domain organization of affected proteins. Surprisingly, there seems to be little correlation between domain organization and alternative splicing.
Andre Cavalcanti and Laura F. Landweber. Princeton University.
Scrambled genes are surprisingly common in spirotrichous ciliates. During cell development, these microorganisms must reorder the permuted pieces of such genes, tackling an intrinsically computational problem. We are developing tools for the identification and analysis of scrambled genes with the final goal of understanding the rules driving this biological process.
Robert Kincaid. Life Science Technologies Laboratory, Agilent Technologies.
BNS is a prototype biomolecule naming service that uses the Lightweight Directory Access Protocol (LDAP) to provide high-performance access to data derived from LocusLink. Gene and protein names and accessions can be resolved into various equivalents with very low latency. This enables a number of novel processes involving rapid conversions between accession schemes.
Deendayal Dinakarpandian and Vijay Kumar. School of Interdisciplinary Computing and Engineering, University of Missouri-Kansas City, Kansas City, MO 64110, USA.
OLAPOP refers to the online analytical processing of proteins, beyond flat-file retrieval and sequence analysis. This is based on a data warehouse approach to proteins with an emphasis on the provision of analytic facilities that allow for the study of protein properties in multiple dimensions.