Bioinformatics for Glycobiology and Glycomics

An Introduction

John Wiley & Sons

Copyright © 2009 John Wiley & Sons, Ltd
All rights reserved.

ISBN: 978-0-470-01667-1


Chapter One

Glycobiology, Glycomics and (Bio)Informatics

Claus-Wilhelm von der Lieth

Formerly at the Central Spectroscopic Unit, Deutsches Krebsforschungszentrum (German Cancer Research Center), 69120 Heidelberg, Germany

1.1 The Role of Carbohydrates in Life Sciences Research

Despite their nearly complete neglect in databases and 'traditional' bioinformatics projects, carbohydrates are the most abundant and structurally diverse biopolymers formed in nature. Historically, the chemistry, biochemistry, and biology of carbohydrates were very prominent areas of research over a long period of time during the beginning and the middle of the last century. However, during the initial phase of the development of molecular biology, focusing on DNA, RNA, and proteins, studies of carbohydrates lagged far behind. Among the main reasons for this were the inherent structural complexity of carbohydrates, the difficulty in easily determining their structure, the fact that their biosynthesis cannot be directly predicted from the DNA template, and that no methods are available to amplify complex carbohydrate sequences. The more recent development of a variety of new and highly sensitive analytical tools for exploring the structures of oligosaccharides and for producing larger amounts of pure complex carbohydrates has opened up a new frontier in molecular biology. The term glycobiology, which was introduced in the late 1980s, reflects the coming together of the traditional disciplines of carbohydrate chemistry and biochemistry, with modern understanding of the cellular and molecular biology of complex carbohydrates, which are often named glycans in this context. The more recently introduced term "glycomics" describes an integrated systems approach to study structure-function relationships of complex carbohydrates - the glycome - produced by an organism such as human or mouse. The glycome can be described as the glycan complement of the cell or tissue as expressed by a genome at a certain time and location. It includes all types of glycoconjugates: glycoproteins, proteoglycans, glycolipids, peptidoglycans, lipopolysaccharides, and so on. 
The aim of glycomics projects is to create a cell-by-cell catalog of glycosyltransferase (GT) expression and detected glycan structures using high-throughput techniques such as DNA glycogene chips, glycan microarray screening and mass spectrometric (MS) glycan profiling, combined with efficient bioinformatics tools.

Until recently, the role of complex carbohydrates as carriers and/or mediators of biological information was a widely neglected and unexplored area in science. However, with the awareness that the human genome encodes a significantly smaller number of genes than was estimated from the genomes of lower organisms such as yeast, it became obvious that each gene can be used in a variety of different ways depending on how it is regulated. Consequently, the study of post-translational protein modifications, which can alter the functions of proteins, came increasingly into scientific focus. Since then, with glycosylation being the most complex and most frequently occurring co- and post-translational modification, glycobiology research has attracted increasing attention.

About 70% of all sequences deposited in the SWISS-PROT protein sequence databank include the potential N-glycosylation consensus sequence Asn-X-Ser/Thr (where X can be any amino acid except proline) and thus may be glycoproteins. However, it is well known that not all potential sites are actually glycosylated. Based on an analysis of well-annotated and characterized glycoproteins in SWISS-PROT, it was concluded that more than half of all proteins are glycosylated. However, this number should be regarded as a very crude estimation since this study was hampered by the paucity of reliable, experimentally determined, and carefully assigned glycosylation sites.
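
The consensus sequence described above can be searched for with a simple pattern match. The following is a minimal sketch, assuming a protein sequence given in one-letter code; the function name and the example peptide are hypothetical and chosen only for illustration. A match marks a potential site only, since not all sequons are actually glycosylated.

```python
import re

# N-glycosylation sequon: Asn (N), any residue except Pro (P), then Ser (S) or Thr (T).
# The zero-width lookahead lets overlapping sequons (e.g. "NNTS") all be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return the 1-based positions of Asn residues in potential N-glycosylation sequons."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Hypothetical peptide, not taken from a real protein.
peptide = "MKNASLLNPTQNNTSA"
print(find_sequons(peptide))  # [3, 12, 13]; "NPT" at position 8 is skipped because X is Pro
```

Such a scan yields an upper bound on the number of N-glycosylation sites; experimental evidence is needed to decide which of the reported positions are occupied.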

The glycans are exposed on the surface of biomolecules and cells. They form flexible, branched structures that can extend 30 Å or further into the solvent. With a molecular weight of up to 3 kDa each, the oligosaccharide groups of mammalian glycoproteins frequently make up a sizable proportion of the mass of a glycoprotein and can cover a large fraction of its surface. The carbohydrate moiety of "proteins" may amount to a few percent of the molecular weight, but can be as much as 90% in some cases. O-Linked mucin-type glycoproteins are usually large (more than 200 kDa) with attached O-glycan chains at a high density. As many as one in three amino acids may be glycosylated and 50-80% of the total mass is due to carbohydrates. An analysis of the available three-dimensional structures of glycoproteins contained in the PDB revealed that the glycan and the protein parts of glycoproteins behave like semi-independent moieties. This behavior has several important biological consequences:

N-Glycans can be modified without appreciable effects on the protein. Every N-linked glycan is subject to extensive modifications. This allows cells to fine-tune the biophysical and biological properties of glycoproteins and to generate the microheterogeneity that is so characteristic of glycoproteins.

The semi-independent nature of glycans also allows cell types and cells in different stages of differentiation and transformation to imprint on their glycoprotein pool their own specific biochemical characteristics, and thus give their exposed surface a "corporate identity."

This "corporate identity" exposed on their surface makes cells recognizable to other cells in a multicellular environment. It allows self-recognition and provides a central theme in development, differentiation, physiology, and disease.

1.2 Glycogenes, Glycoenzymes and Glycan Biosynthesis

The biosynthesis of carbohydrates attached to proteins or to lipids - called glycoconjugates - is fundamentally different to the expression of proteins. Whereas the enzymes required for the translation of the genetic information into a polypeptide chain in the ribosome are always the same for all proteins and amino acids, the subsequent glycosylation is a non-template-driven process where dozens of different enzymes are involved in the synthesis of the sugar chains attached to proteins or lipids. Depending on which of these enzymes are expressed in the cell that synthesizes a glycoprotein, various different glycan chains can be attached to the protein or lipid. Glycoproteins generally exist as populations of glycosylated variants - called glycoforms - of a single polypeptide. Although the same glycosylation machinery is available to all proteins in a given cell, most glycoproteins emerge with a characteristic glycosylation pattern and heterogeneous populations of glycans at each glycosylation site.

Glucose and fructose are the major carbon and energy sources for organisms as diverse as yeast and human beings (see, e.g., the Monosaccharide Metabolism chapter). Organisms can derive the other monosaccharides needed for glycoconjugate synthesis from these major suppliers. It is important to appreciate that not all of the biosynthetic pathways are equally active in all types of cells.

The biosynthesis of oligosaccharides is primarily determined by sequentially acting enzymes, the glycosyltransferases (GTs), which assemble monosaccharides into linear and branched sugar chains. For this purpose, the monosaccharides must be either imported into the cell or derived from other sugars within the cell. However, a common factor is that all glycoconjugate syntheses require activated sugar nucleotide donors. It has long been known that a nucleoside triphosphate such as uridine triphosphate (UTP) reacts with a glycosyl-1-P to form a high-energy donor sugar nucleotide that can participate in glycoconjugate synthesis. Once the sugar nucleotides have been synthesized in the cytosol (or, in the case of CMP-Neu5Ac, in the cell nucleus), they have to be translocated, since most glycosylation occurs in the endoplasmic reticulum (ER) and Golgi apparatus. As the negative charge of the sugar nucleotides prevents them from simply diffusing across membranes into these compartments, eukaryotic cells have devised sugar nucleotide transporters, which require no energy input, to deliver sugar nucleotides into the lumen of these organelles.

1.2.1 Biosynthetic Pathways

In eukaryotes, more than 10 biosynthetic pathways that link glycans to proteins and lipids are known. The KEGG PATHWAY resource - a collection of pathway maps representing current biochemical knowledge of the molecular interaction and reaction networks - has encoded 18 pathways for the biosynthesis of complex carbohydrates and their metabolism (see Figure 1.1), and 20 pathways for metabolism where carbohydrates are involved. More than 200 enzymes are involved in the biosynthesis of carbohydrate structures found on proteins and lipids. More than 30 different enzymes may participate directly in the synthesis of a single glycan. One of the best-characterized pathways is the biosynthesis of complex oligosaccharides that are subsequently attached to a protein through the side-chain nitrogen atom of the amino acid asparagine (Asn) to give glycoproteins (described in Section 8.1 in Chapter 8). Glycosylation of proteins occurs in all eukaryotes and in many archaea but only exceptionally in bacteria.

O-Linked glycosylation, where carbohydrates are attached to serine (Ser) and threonine (Thr), takes place post-translationally in the Golgi apparatus. The monosaccharides are added one by one in a stepwise series of reactions (Figure 1.2). This is in contrast to the N-linked glycosylation pathway where a preformed oligosaccharide is transferred en bloc to Asn. A second important difference is that there are no known consensus sequence motifs that define an O-linked glycosylation site analogous to the Asn-X-Ser/Thr motif for N-linked glycosylation.

1.2.2 The Role of Bioinformatics in Identifying Glyco-related Genes

The enzymes required for the biosynthesis of complex carbohydrates can be classified into those needed for the conversion of monosaccharide building blocks to activated sugar nucleotides and their transport within the cell, and those which are used to build (glycosyltransferases) and remodel (glycosidases) glycoconjugates. Many, but not all, of the latter enzymes are found within the ER-Golgi pathway for export of newly synthesized glycoconjugates.

The first mammalian GT gene was reported in 1986. Progress in identifying new GT genes at that time was slow because they had to be cloned by identifying the partial amino acid sequence of the purified enzyme, which was the limiting step. Thereafter, from the beginning of the 1990s when methods of expression cloning and PCR cloning with degenerate primers were employed, several novel GT genes were detected each year. It became obvious that GTs can be classified into several subfamilies which contain well-conserved sequence motifs. Based on this knowledge, the increasing availability of gene sequences, and the development of appropriate bioinformatics searching algorithms, the in silico identification of GT genes could be successfully applied. During the middle of the 1990s, the number of newly reported GT genes began to increase significantly, reaching a peak in 1999 (Figure 1.3). This was due to the substantial increase in sequenced genes and the ease of finding new GT genes by homology searching using well-known BLASTN searches. The number of newly identified GT genes began to decrease gradually after 1999 to only five by August 2006, suggesting that mammalian GT gene cloning is approaching completion. During the past two decades, more than 180 human glycogenes have been cloned and their substrate specificities analyzed using biochemical approaches. The current status of knowledge compiled for these human GT genes and their links with orthologous genes in other species is summarized in the GlycoGene database.

As demonstrated for the identification of GT genes, the application of classical bioinformatics tools and also the use of genomic databases had and will continue to have a significant impact on the rapid development of glycobiology research. The same is true when searching for all lectins with similar binding affinity for a specific carbohydrate, which was also significantly accelerated through systematic analysis of gene sequences for the corresponding sequence motifs.

However, the use of (bio)informatics in glycobiology research has to be divided between those applications where an explicit description of the glycan structure is required, and those where the proteins to which carbohydrates are attached, the enzymes which build and modify carbohydrates, or the lectins which recognize a certain sugar epitope, are analyzed. The latter type of applications can be performed using well-known bioinformatics tools such as sequence alignment techniques and attempts to understand the evolutionary relationships through phylogenetic analysis. Where an encoding of the carbohydrate structure is required, however, for example when looking at carbohydrate specificity of a lectin or classification of the glycome of an organism, classical bioinformatics approaches cannot be directly applied.
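
To illustrate why tools built for linear sequences do not carry over directly, the following minimal sketch encodes a glycan as a tree in which every residue carries its anomer and its linkage to the parent residue. The class, its field names, and the helper functions are hypothetical and do not correspond to any established glycan encoding format; the example structure is the common pentasaccharide core of N-glycans.

```python
from dataclasses import dataclass, field


@dataclass
class Residue:
    """One monosaccharide in a glycan tree (illustrative encoding only)."""
    name: str            # monosaccharide, e.g. "Man" or "GlcNAc"
    anomer: str = ""     # "a" or "b"; empty when not specified
    linkage: str = ""    # e.g. "1-4": own carbon 1 linked to carbon 4 of the parent
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child residue and return it so that chains can be built stepwise."""
        self.children.append(child)
        return child


def count_residues(residue):
    """Total number of residues in the subtree rooted at this residue."""
    return 1 + sum(count_residues(child) for child in residue.children)


def count_branch_points(residue):
    """Number of residues that carry more than one child residue."""
    here = 1 if len(residue.children) > 1 else 0
    return here + sum(count_branch_points(child) for child in residue.children)


# N-glycan core: Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc
root = Residue("GlcNAc", "b")                      # reducing end, attached to Asn
second = root.add(Residue("GlcNAc", "b", "1-4"))
branching_man = second.add(Residue("Man", "b", "1-4"))
branching_man.add(Residue("Man", "a", "1-3"))
branching_man.add(Residue("Man", "a", "1-6"))

print(count_residues(root))       # 5
print(count_branch_points(root))  # 1
```

Unlike a string of nucleotides or amino acids, such a structure cannot be compared position by position, which is one reason why sequence alignment methods are not directly applicable to glycans.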

1.3 Intrinsic Problems of Glycobiology Research

Glycobiologists have to deal with several intrinsic problems that make their research difficult and time-consuming, as well as ambitious.

1.3.1 Carbohydrates Have to Be Analyzed at Physiological Concentrations

The first major challenge is to develop highly sensitive analytical methods. Since the biosynthesis of complex carbohydrates requires a variety of enzymes, which have to act in a defined and consecutive way, there are currently no methods available to amplify glycans readily in the sense that DNA is amplified using polymerase chain reaction (PCR) techniques. Consequently, highly sensitive analytical methods have to be applied, which are able to detect the small amounts of material found in cells. The chapters on experimental methods will discuss the central analytical methods - mass spectrometry, HPLC and NMR - which are used in different areas of glycobiology to identify glycan structures.

1.3.2 Complexity of Glycan Structures

The second major challenge lies in the complexity of glycan structures: each pair of monosaccharide residues can be linked in several ways, and one residue can be connected to three or four others (giving branched structures). The information content which can be potentially encoded by glycans in a given sequence is therefore high. The four nucleotides in DNA can be combined to give 256 four-unit structures, and the 20 amino acids in proteins yield 160 000 four-unit configurations. However, the number of naturally occurring residues is much larger for glycans which have the potential to assemble into more than 15 million four-unit arrangements.
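
The figures quoted for DNA and proteins follow from simple counting: a linear chain of a given length drawn from an alphabet of a given size has (alphabet size) to the power of (length) possible sequences. The short sketch below reproduces these two numbers; the function name is hypothetical. The estimate for glycans is not recomputed here, because it depends on assumptions about residue types, anomers, linkage positions, and branching that are not stated in the text.

```python
def linear_sequences(alphabet_size, length):
    """Number of distinct linear sequences of the given length over the alphabet."""
    return alphabet_size ** length


print(linear_sequences(4, 4))   # 256 four-unit DNA sequences
print(linear_sequences(20, 4))  # 160000 four-unit peptide sequences
```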

Although oligosaccharides potentially carry this high capacity to store biological information, only a small part thereof is actually used in nature. A recent analysis of the KEGG glycan database containing 4107 unique glycan entries, which consist of nine frequently occurring monosaccharides (glucose, galactose, mannose, N-acetylglucosamine, N-acetylgalactosamine, fucose, xylose, glucuronic acid, and sialic acid), showed that only 302 (54%) of the 558 (nine monosaccharides, two anomers, 31 substitution possibilities) theoretically possible disaccharides appear in the database. Furthermore, while an enormous number of reaction pattern combinations are theoretically possible, only 2178 of these combinations actually appear in the database. These numbers suggest that the structural diversity of glycans is indeed large, but that the combination of reaction patterns which actually exist in a given cellular environment is limited by the availability of the glyco-related enzymes which build and modify the glycan structures.
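
The disaccharide figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are chosen for illustration.

```python
monosaccharides = 9
anomers = 2
substitution_possibilities = 31

# Theoretically possible disaccharides as counted in the analysis cited above.
theoretical = monosaccharides * anomers * substitution_possibilities
observed = 302

print(theoretical)                          # 558
print(round(100 * observed / theoretical))  # 54 (percent of possible disaccharides observed)
```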
