
Research

Analysis of High Throughput Protein Expression in Escherichia coli*
Yair Benita, Michael J. Wise§, Martin C. Lok, Ian Humphery-Smith, and Ronald S. Oosting**
The ability to efficiently produce hundreds of proteins in parallel is the most basic requirement of many aspects of proteomics. Overcoming the technical and financial barriers associated with high throughput protein production is essential for the development of an experimental platform to query and browse the protein content of a cell (e.g. protein and antibody arrays). Proteins are inherently different one from another in their physicochemical properties; therefore, no single protocol can be expected to successfully express most of the proteins. Instead of optimizing a protocol to express a specific protein, we used sequence analysis tools to estimate the probability of a specific protein to be expressed successfully using a given protocol, thereby avoiding a priori proteins with a low success probability. A set of 547 proteins, to be used for antibody production and selection, was expressed in Escherichia coli using a high throughput protein production pipeline. Protein properties derived from sequence alone were correlated to successful expression, and general guidelines are given to increase the efficiency of similar pipelines. A second set of 68 proteins was expressed to investigate the link between successful protein expression and inclusion body formation. More proteins were expressed in inclusion bodies; however, the formation of inclusion bodies was not a requirement for successful expression. Molecular & Cellular Proteomics 5:1567–1580, 2006.

From the Departments of Psychopharmacology and Pharmaceutics, Utrecht Institute of Pharmaceutical Sciences (UIPS), Utrecht University, Sorbonnelaan 16, 3584 CA Utrecht, The Netherlands, §The University of Western Australia, 35 Stirling Highway, Crawley, Western Australia 6009, Australia, and Biosystems Informatics Institute, Newcastle upon Tyne, NE1 4EP Newcastle, United Kingdom

Received, April 17, 2006, and in revised form, June 1, 2006. Published, MCP Papers in Press, July 4, 2006, DOI 10.1074/mcp.M600140-MCP200

© 2006 by The American Society for Biochemistry and Molecular Biology, Inc. This paper is available on line at http://www.mcponline.org

Molecular & Cellular Proteomics 5.9

Downloaded from www.mcponline.org by on May 1, 2007

The completion of the human genome project and the biotechnical advances in the field of genomics have radically transformed biological and medical research. We now have the ability to monitor the mRNA expression of thousands of genes simultaneously in cells and tissues. However, it is the proteins encoded by these genes that carry out most biological functions. The proteome is much more daunting in size and complexity than the genome, and to understand how cells work we must study which proteins are present, how they interact with each other, and what they do. The difficulty of studying proteins is that they are each distinctively different from the other and are usually present in tissue in very low amounts. In the absence of a PCR equivalent, it has been suggested to call upon affinity ligands, such as monoclonal antibodies, for detection and identification of proteins (1).

Regardless of the specific affinity ligand used, purified proteins must first be acquired in large quantities for generation and/or selection of specific affinity ligands. Thus, there is a need to define expression and purification conditions that are amenable to hundreds or even thousands of proteins in parallel. However, because proteins differ significantly in their physicochemical properties, the success rate of high throughput protein production is often too low, increasing the financial and technical constraints on such projects.

Several groups have previously attempted high throughput expression of proteins or protein fragments. High throughput is defined as the ability to automate protein production, often using a 96-well format. Braun et al. (2) expressed 336 randomly selected human cDNAs in Escherichia coli and purified successfully 60% under denaturing conditions using His6 constructs and 50% under non-denaturing conditions using GST constructs. Luan et al. (3) expressed 10,176 Caenorhabditis elegans proteins using a robotic pipeline and observed an overall expression of 50% (15% in soluble form). Agaton et al. (4) reported a success rate of 76% for the expression of 142 human proteins in E. coli. Other groups reported success rates in the range of 60–80% (5–7).

The three-dimensional structure of a protein can often provide functional clues, primarily by detecting structural homology with a protein of known function (8, 9). Structural proteomics attempts to determine protein structure on a genome-wide scale. It not only requires high throughput expression of target proteins but also that the proteins be produced in a form that is soluble, correctly folded, and suitable for x-ray crystallography or NMR studies. Previous attempts to produce proteins on a large scale for structural studies resulted in success rates of 10% (10, 11). This low success rate motivated studies that attempted to link the primary sequence of a protein to its propensity to be soluble upon overexpression in E. coli (10–13).

On the other hand, protein production for affinity ligands does not necessarily require the heterologous protein to be soluble. Agaton et al. (4) reported a success rate of 56% for eliciting affinity-purified antibodies against proteins that were expressed in E. coli and purified under denaturing conditions. In this respect protein production for affinity ligands is significantly less demanding than production for structural studies.

To better cope with the financial constraints of high throughput protein production, it would be beneficial to identify a priori proteins that are likely to fail expression in a pipeline designed for affinity ligand target generation. Although prediction of protein solubility upon overexpression has drawn scientific attention, prediction of successful expression has been largely disregarded. Prediction of protein expression is bound to be more complicated because expression can fail in any of several different steps from plasmid construct stability to the final purified protein. Many of those steps, such as mRNA decay, are not necessarily related to the primary protein sequence or to the physicochemical properties of the amino acids. Solubility, on the other hand, is more likely to be dependent on the amino acid composition of the protein.

In this study we present results on the expression of 547 recombinant proteins, produced as targets for affinity ligand generation, and investigate the link between their DNA and protein sequences and successful expression. Finally we investigate the relationship between solubility and expression level on a set of 68 human proteins.
EXPERIMENTAL PROCEDURES

Selection of Genes—We randomly selected 615 human ORFs, 547 for high throughput expression and 68 for inclusion body analysis, from disease-related genes available in publicly accessible clone libraries in late 2001 and retrieved their DNA coding sequence from GenBank™ (www.ncbi.nlm.nih.gov). Coding sequences were compared with the human genome (GenBank™ build 25) using BLAST, and the exons were extracted and set in-frame. For ORFs containing multiple exons, the first was discarded to reduce the likelihood of a signal peptide, and from the remaining exons the longest was chosen. Primer selection criteria, genomic template, and PCR protocols for our protein production pipeline have been described previously (14).

Protein Expression and Purification—The plasmid construct used, named HZS, contained a His6 tag, a ZZ domain, a Gateway-compatible insert, and a Strep-tag. The ZZ domain is the tandem repeat dimer of the modified immunoglobulin binding domain of protein A of Staphylococcus aureus (15). The Strep-tag (16) was constructed using custom oligos. Plasmid construction, gene cloning, and bacterial transformation and induction have been described previously by our group (17). As expression host the E. coli BL21 codon-plus RP strain (Stratagene) was used. These cells contain extra copies of the argU and proL tRNA to enable expression of genes restricted by either AGG/AGA or CCC codons.

Protein purification for the high throughput protein pipeline was done under denaturing conditions. The bacteria were grown in 24 deep well plates. Each well contained 5 ml of LB medium supplemented with 50 µg/ml ampicillin and chloramphenicol. At the end of the 4-h isopropyl β-D-thiogalactopyranoside induction period, bacterial plates were centrifuged at 3500 rpm for 15 min. The supernatants were removed, and the bacterial pellets were resuspended each in 1 ml of lysis buffer containing 8 M urea (lysis buffer: 100 mM NaH2PO4, 20 mM Tris, 10% glycerol, 0.1% Tween 20, pH 8.0, 20 mM β-mercaptoethanol plus one tablet of Complete protease inhibitor (Roche Applied Science)). The content of each well was sonicated two times for 15 s with 10 s in between. Then the plates were centrifuged for 20 min at 3500 rpm. The next steps in the protein purification protocol were done using the Biorobot 8000 (Qiagen). Aliquots of 800 µl of the supernatants were transferred to a 96-well filter plate (Qiagen) containing 200 µl of Ni-NTA¹ Superflow that was washed once with 500 µl of lysis buffer containing 8 M urea before applying the supernatant. Then vacuum of 900 millibars was applied for 3 min. The resin was successively washed with 4, 2, 1, and 0 M solutions of urea in lysis buffer. After each wash step vacuum was applied for 1.5 min at 900 millibars, and the flow-through was discarded. Finally 1 ml of elution buffer (50 mM NaH2PO4, 300 mM NaCl, 250 mM imidazole, pH 8.0) was added to each well. After 10 min, vacuum was applied for 2 min at 700 millibars, and the eluate was collected in a deep well 96-well block.

Protein purification for inclusion body analysis was performed separately for the soluble and insoluble fractions of the bacterial lysate. The bacterial pellet from 10 ml of induced bacteria was resuspended in 600 µl of B-PER (Bacterial Protein Extraction Reagent, Pierce) containing one tablet of Complete protease inhibitor (Roche Applied Science)/25 ml, vortexed for 1 min at 3000 rpm, and centrifuged for 10 min in a standard tabletop microcentrifuge at 13,000 rpm and 4 °C. The supernatant was removed and placed on a custom made column containing 100 µl of Ni-NTA Superflow. Columns were washed twice with 500 µl of wash buffer (50 mM NaH2PO4, 300 mM NaCl, and 20 mM imidazole, pH 8.0) and eluted with 500 µl of elution buffer. The remaining pellet of the lysed bacteria containing the insoluble fraction was resuspended by sonication (2 × 5 s) in 1 ml of 8 M urea and centrifuged for 10 min at 13,000 rpm.
The supernatant was then placed on columns containing 100 µl of Ni-NTA Superflow and washed with 4, 2, 1, and 0 M solutions of urea in BR buffer (0.1% Tween-20, 10% glycerol, 100 mM NaH2PO4, 20 mM Tris, pH 8). Proteins were eluted with 500 µl of elution buffer. All proteins were visualized on Criterion Tris-HCl 12.5% precast polyacrylamide gels (Bio-Rad) with Coomassie staining.

Sequence Analysis—A sequence analysis module was written in Python (Python Software Foundation) for this study and is being distributed as part of the SeqUtils module of Biopython (18). An aromaticity score was calculated according to Lobry and Gautier (19), and a protein instability index was calculated according to Guruprasad et al. (20). Isoelectric point, charge, and amino acid content (aliphatic, aromatic, polar, non-polar, charged, basic, acidic, small, and tiny) were calculated using pepstats (21) from the EMBOSS package (22). Average and maximum protein flexibility were calculated according to Vihinen et al. (23). Protein disorder was calculated using FoldIndex (24), and from the output the longest disorder segment and the total number of residues in disorder segments were extracted. DNA sequence complexity was calculated using both nSEG (25) and G1 (26). Protein secondary structure was assessed using garnier (27) from the EMBOSS package, and the fractions of α-helix, β-sheet, coil, and turn were calculated from the output. The garnier method is considered to have a low reliability in predicting the secondary structure of a protein; however, here we are not interested in the accuracy position by position but in the overall α-helix, β-sheet, coil, and turn propensities of the protein based on the tendency of individual amino acids to be present in one structure or another. Secondary structure of mRNA was predicted using mfold (28), and the most stable structure was selected (lowest ΔG). Protein low complexity was calculated using 0j.py (29).
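As an illustration of the simplest attribute in this list, the aromaticity score reduces to the relative frequency of Phe, Trp, and Tyr in the sequence. The following is a minimal sketch of that idea only, not the module distributed with Biopython; the function names, the aliphatic residue set, and the example fragment mentioned afterwards are assumptions made for illustration.

```python
from collections import Counter

# Aromatic set follows the Phe + Trp + Tyr definition of the aromaticity score.
AROMATIC = frozenset("FWY")
# Assumption for illustration only: a simple aliphatic class (Ala, Ile, Leu, Val).
ALIPHATIC = frozenset("AILV")


def residue_fraction(protein, residues):
    """Return the fraction of positions in `protein` occupied by `residues`."""
    seq = protein.upper()
    if not seq:
        raise ValueError("protein sequence is empty")
    counts = Counter(seq)
    return sum(counts[aa] for aa in residues) / len(seq)


def aromaticity(protein):
    """Relative frequency of Phe + Trp + Tyr in the sequence."""
    return residue_fraction(protein, AROMATIC)
```

For a hypothetical fragment such as `"MKFWYAILVG"`, `aromaticity` counts three aromatic residues out of ten, and `residue_fraction(fragment, ALIPHATIC)` gives the share of the assumed aliphatic class in the same way.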
Local GC content was calculated as previously described (14) (see also Fig. 1). The GRAVY was calculated according to Kyte and Doolittle (30). The GRAVY calculates an average value for the entire protein, which in many cases may be misleading. For example, on average a protein could be hydrophilic but still have a large internal hydrophobic region. Therefore, we further calculated local hydrophilic and hydrophobic regions along the protein using the Kyte and Doolittle hydrophobicity plot generated with a sliding window of 11 amino acids. The area under the curve (AUC) and the area above/below 0 were calculated using the trapezoid method as shown in Fig. 1 for GC content. The area above 0 was labeled “sum hydrophobic AUC,” and the area below 0 was labeled “sum hydrophilic AUC.” The single largest local hydrophobic or hydrophilic regions were located and labeled “max hydrophobic AUC” and “max hydrophilic AUC,” respectively. The hydrophobic AUC and hydrophilic AUC were also normalized by dividing the area by the total number of amino acids in the entire sequence. In addition, the hydrophobic to hydrophilic ratio was calculated, i.e. the ratio of the sum of all hydrophobic regions and the sum of all hydrophilic regions.

Codon usage was calculated according to Sharp and Li (31). A set of 121 highly expressed E. coli proteins were selected from Swiss-2D PAGE (www.expasy.org/ch2d).

¹ The abbreviations used are: Ni-NTA, nickel-nitrilotriacetic acid; aa, amino acids; AUC, area under the curve; CAI, codon adaptation index; POPPs, Protein or Oligonucleotide Probability Profiles; ANOVA, analysis of variance; 1D, one-dimensional.

FIG. 1. GC content of an exon that failed to be expressed. The GC content plot was generated using a sliding window approach of 21 bp. The area above 65% (light gray) and below 35% (dark gray) was calculated to locate high and low GC content regions. The fraction is the percentage of sequence that is above or below the threshold, i.e. 21% of the sequence has a GC content above 65%. The calculated areas of several large regions are shown.
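The sliding-window profiles and threshold areas used for GC content (Fig. 1) and for the Kyte and Doolittle hydrophobicity plot can be sketched as follows. This is an illustrative reading of the description, not the code used in the study: the function names are hypothetical, and the area of each region is approximated by clipping the profile at the threshold and applying the trapezoid rule between adjacent window positions, without interpolating the exact crossing points.

```python
# Kyte and Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_means(values, window):
    """Mean of each full window, moving one position at a time."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def gc_profile(dna, window=21):
    """GC content (percent) in each sliding window of `window` bases."""
    flags = [100.0 if base in "GC" else 0.0 for base in dna.upper()]
    return window_means(flags, window)


def hydropathy_profile(protein, window=11):
    """Mean Kyte and Doolittle hydropathy in each sliding window."""
    return window_means([KYTE_DOOLITTLE[aa] for aa in protein.upper()], window)


def region_areas(profile, threshold, above=True):
    """Trapezoid area of each contiguous region beyond the threshold.

    With above=True the regions lie above the threshold (e.g. GC > 65%,
    hydropathy > 0); with above=False they lie below it.
    """
    sign = 1.0 if above else -1.0
    excess = [max(sign * (value - threshold), 0.0) for value in profile]
    areas, current = [], 0.0
    for left, right in zip(excess, excess[1:]):
        segment = (left + right) / 2.0
        if segment > 0.0:
            current += segment
        elif current > 0.0:
            areas.append(current)
            current = 0.0
    if current > 0.0:
        areas.append(current)
    return areas
```

Under these assumptions, the "sum hydrophobic AUC" would correspond to `sum(region_areas(profile, 0.0))` and the "max hydrophobic AUC" to `max(region_areas(profile, 0.0), default=0.0)`; passing `above=False` gives the hydrophilic counterparts, and thresholds of 65 and 35 on a GC profile give the high and low GC regions.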
All selected E. coli proteins were identified on a two-dimensional protein gel and were present in large amounts with percent volume average above 0.2 as calculated using the software Melanie (Version 2, Swiss Institute of Bioinformatics, GeneBio, Geneva, Switzerland). This set of proteins is available upon request. The codon usage index was generated using a Python codon usage module (available through Biopython), and the CAI for each gene was calculated. Regionalized CAI values were calculated using a CAI plot that was generated using a sliding window of 4 codons. The area below a threshold and above the curve was calculated for several thresholds (Fig. 2). Both the sum of all areas and the single largest area were used for the analysis.

A modified version of the CAI, AAcai (amino acid codon adaptation index), was introduced by taking into account amino acid shortage due to overexpression of a protein with an amino acid content different from the average E. coli protein. This attribute is based on the observation that ribosomes translating a heterologous mRNA may stall at positions calling for a tRNA that is largely deacylated because of the heavier than normal drain of its amino acid into protein (32). In other words, the ribosome may stall even at an optimal codon if not enough amino acid is available to be loaded onto the tRNA. The average amino acid content of E. coli was calculated using the same set of 121 highly expressed E. coli proteins mentioned above. The amino acid content of each overexpressed sequence and the deviation from the average protein content were calculated. For each amino acid that was used more frequently than average the proportion of average usage to specific usage was calculated, and the index used to calculate the CAI value was adjusted accordingly. For instance, if a specific protein had 20% alanine and the average E. coli protein had 10% alanine, the most abundant alanine codon was rescaled from 1 to 0.5, and all other alanine codons were adjusted accordingly. The probability of finding a loaded alanine tRNA was reduced by 2-fold due to the 2-fold increase in usage of alanine in the heterologous protein. Once the index values were calibrated to the amino acid usage, the exact same methods that were described above for CAI were used.

Protein compositional bias was assessed using POPPs (Protein or Oligonucleotide Probability Profiles) (33), a suite of inter-related software tools that enable the user to discover statistically “unusual” peptides. POPPs were created for each protein sequence versus the E. coli and human proteomes, scaled to a sequence length of 100 amino acids. The E. coli proteome was created using E. coli K-12 genome annotations (GenBank™ accession number NC_000913). The human proteome was fetched from the International Protein Index (34). Redundant proteins with more than 99 and 98% similarity to another in the respective databases were removed using nrdb90 (35).

t tests and one-way ANOVA were performed using the stats.py and pstat.py modules (Version 0.6).²

Decision trees are predictive models that are often used in machine learning. Here we used decision trees to classify proteins into expression groups based on DNA and protein sequence attributes. Inner nodes in the tree represented decision variables, e.g. degree of aromaticity, and leaf nodes represented the predicted expression groups. Decision trees were previously applied to similar data, linking sequence attributes to successful expression for structural analysis (10, 11). Decision trees were generated using the rpart module of the R statistical package (36). The decision trees were pruned using a complexity value of 0.05. To minimize the effect of unequal group sizes, a subset of proteins equal in number to the smaller group was randomly selected from the larger group. The process was repeated three times, generating three decision trees.

FIG. 2. Plot of CAI value across a randomly selected human protein sequence using a sliding window of four codons moving one codon at a time. Regionalized low CAI areas were calculated using the trapezoid method for regions below the threshold and above the CAI plot (light gray regions). Both the sum of all regions and the single largest region were used in the sequence analysis.
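The regionalized CAI plotted in Fig. 2 and the AAcai rescaling described in the Sequence Analysis section can be sketched as below. This is a hedged illustration rather than the Biopython module used in the study: the CAI is taken as the geometric mean of per-codon relative adaptiveness values, following the usual formulation of Sharp and Li; the four-codon table `TOY_W`, the mapping `TOY_CODON_TO_AA`, and all function names are hypothetical; and every adaptiveness value is assumed to be greater than zero.

```python
import math

# Hypothetical relative adaptiveness values for illustration only (not a real
# E. coli index): two alanine codons and two lysine codons.
TOY_W = {"GCG": 1.0, "GCC": 0.5, "AAA": 1.0, "AAG": 0.25}
TOY_CODON_TO_AA = {"GCG": "A", "GCC": "A", "AAA": "K", "AAG": "K"}


def cai(codons, w):
    """Geometric mean of the relative adaptiveness of the given codons."""
    if not codons:
        raise ValueError("no codons given")
    return math.exp(sum(math.log(w[codon]) for codon in codons) / len(codons))


def windowed_cai(codons, w, window=4):
    """CAI in each window of `window` codons, moving one codon at a time."""
    return [cai(codons[i:i + window], w)
            for i in range(len(codons) - window + 1)]


def aacai_weights(w, codon_to_aa, protein_freq, reference_freq):
    """Rescale codon weights for amino acids used more often than average.

    Each codon of an over-used amino acid is multiplied by the ratio of the
    reference (average E. coli) usage to the usage in the expressed protein;
    codons of all other amino acids are left unchanged.
    """
    adjusted = {}
    for codon, value in w.items():
        amino_acid = codon_to_aa[codon]
        used = protein_freq.get(amino_acid, 0.0)
        reference = reference_freq[amino_acid]
        scale = reference / used if used > reference else 1.0
        adjusted[codon] = value * scale
    return adjusted
```

With the toy table, a protein containing 20% alanine against a reference of 10% has its best alanine codon rescaled from 1 to 0.5, matching the worked example in the text. The resulting weights can be passed to `windowed_cai` unchanged, and the low-CAI areas can then be obtained with the same trapezoid approach used for the GC and hydrophobicity plots.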
RESULTS

An initial set of 547 human exons, each representing a different gene, were transferred using Gateway high throughput cloning system into the HZS vector construct. The protein fragments were a relatively small part of the entire recombinant protein. The average length of the protein insert was 76 29 amino acids, corresponding to 8.6 3.3 kDa. The constant part of the protein (HisZZ on the amino-terminal end and Strep-tag on the carboxyl-terminal end) was in total 170 aa long with a molecular mass of 19.45 kDa. The final HZS vectors containing the inserts were confirmed to be correct by observing the expected fragments on agarose gel after re2

FIG. 3. Summary of protein expression for 547 human proteins as visualized on precast XT gels (Bio-Rad). A band that was observed to be less than 10% smaller or larger than expected was labeled correct. Gray bands were labeled “faint,” and black bands were labeled “strong.”

G. Strangman, unpublished software.

striction enzyme digestion. The proteins were classified into one of five groups: I, no visible bands; II, faint bands with correct size; III, faint bands with wrong size; IV, strong bands with correct size; and V, strong bands with wrong size (Fig. 3). Classification into faint/strong was performed visually on a scanned gel image. Gray bands were labeled faint, and black bands were labeled strong. In 77% of the proteins a band was visible on the gel, and overall in 58.5% of the proteins the

1570

Molecular & Cellular Proteomics 5.9

Analysis of High Throughput Protein Expression

TABLE I Top 10 over-represented (A) and under-represented (B) peptides in each expression group compared to the average E. coli protein, the average human protein, and to the average group IV protein Peptides were detected using POPPs with probability value of 0.005 and scaling protein length to 100 aa. Peptides are sorted from the most over- or under-represented peptide to the least.

Downloaded from www.mcponline.org by on May 1, 2007

expected size was observed. In all cases where a protein band was visible on the gel, the band was larger or equal to 17 kDa, the molecular mass of HisZZ, the constant amino terminus of the recombinant protein. Single amino acids and peptides of up to 3 aa that were significantly over- or under-represented in each group of proteins (p 0.005) compared with E. coli and human proteomes were analyzed (Table I). Almost every protein is unusual to some extent; the question is really to which extent. The set of human proteins expressed here contained over-represented

peptides that were rich in isoleucine, aspartic acid, phenylalanine, and glutamic acid compared with the average human protein (Table I). The over-represented peptides are hydrophilic and flexible. Peptides that were under-represented compared with the average human protein were rich in glycine, alanine, proline, and leucine. These peptides have a tendency to be located in coil or turn protein structures and are not charged and not polar. A more uniform bias was seen across all groups when compared with the average E. coli protein (Table I). There was a clear over-representation of

Molecular & Cellular Proteomics 5.9

1571

Analysis of High Throughput Protein Expression

TABLE II Mean values and standard errors of DNA and protein attributes of the five expression groups P-values on the right column were acquired using a one-way ANOVA test across all five groups. A significant ANOVA p-value (p 0.05) indicates that the mean across all groups is not equal. The mean values of groups I, II, III, and V were further compared to the mean value of group IV using a t-test, and significant differences (p 0.05) are indicated in bold.

Downloaded from www.mcponline.org by on May 1, 2007

peptides rich in serine, lysine, and glutamic acid. These peptides are polar and hydrophilic and tend to occur in coil or turn protein structures. Peptides that were under-represented

were rich in alanine, leucine, and glycine; are non-polar, not charged, and not flexible; and mostly occur in -helix protein structures. Interestingly the amino acid cysteine is highly

1572

Molecular & Cellular Proteomics 5.9

Analysis of High Throughput Protein Expression

TABLE II—continued

Downloaded from www.mcponline.org by on May 1, 2007

Molecular & Cellular Proteomics 5.9

1573

Analysis of High Throughput Protein Expression

Downloaded from www.mcponline.org by on May 1, 2007

FIG. 4. Decision tree classification of each expression group and group IV (correct strong bands): group I (no visible bands) and group IV (A), group II (faint correct bands) and group IV (B), group III (wrong faint bands) and group IV (C), and group V (wrong strong bands) and group IV (D). Group IV was the largest expression group in our data set containing 198 proteins. To avoid bias due to unequal

1574

Molecular & Cellular Proteomics 5.9

Analysis of High Throughput Protein Expression

over-represented in group IV compared with the average E. coli protein (Table I). This amino acid tends to form disulfide bonds with another cysteine, a modification that cannot occur in the E. coli cytosol. Therefore, over-representation of cysteine is usually associated with difficulties to express cysteine-containing proteins correctly because they cannot fold correctly. In our set of human proteins, no expression difficulties were observed for proteins rich in cysteine. Group IV, containing proteins with high expression levels and expected molecular weight, was considered the most optimal for our pipeline. All other groups were compared with it by using an independent sample t test with assumed equal variance for all sequence analysis methods followed by oneway ANOVA across all five groups (Table II). Groups III and V contained slightly longer DNA sequences than the other groups, corresponding to a difference of 1 kDa in average protein mass. mRNA folding stability appears to be similar for all groups. Codon usage was not significantly different throughout the five protein groups except for group I, which had a less optimal codon usage than group IV when calculating the area under a 0.2 threshold. DNA GC content was not significantly different throughout the five groups. Group I was the only group with a significantly higher aromaticity score compared with group IV. However, there was no significant difference in the number of aromatic amino acids across all groups. Group I had the lowest protein flexibility score (most rigid), and group III had the highest (most flexible), both significantly different from group IV. Hydrophobicity was evaluated using several methods. Using the GRAVY score, all groups had an average score below 0 (hydrophilic), and only group I was significantly different from group IV with a higher score (more hydrophobic). 
The analysis of the area under the hydrophobicity curve revealed that group I contained significantly more hydrophobic regions, nearly 2-fold compared with group IV, and that groups III and V contained significantly larger hydrophilic regions compared with group IV. Groups II and IV had a balanced ratio of hydrophobic to hydrophilic area, whereas in group I the ratio was 3.7. Group I and group V had the highest and lowest isoelectric point of all groups, respectively; and both were significantly different from group IV. Groups II and IV had a protein charge close to 0, whereas group I had the highest positive charge, and group III had the highest negative charge. The amino acid content as calculated by pepstats was significantly different from group IV only for group I. Group I had more aliphatic and non-polar amino acids and fewer polar, basic, and acidic amino acids. These observations are in agreement with the above difference in hydrophobicity and charge. The content of -helix, -sheets,

coil, and turn from the predicted secondary structure was not significantly different across the five groups except for a higher coil content in group III compared with group IV. Protein sequence complexity score was significantly higher only for group I, both for total protein complexity and the largest low complexity sequence in the protein. The score was nearly 2-fold higher than that for group IV. Protein disorder was evaluated using several parameters. Group I had the highest disorder score, significantly higher than group IV. When considering the longest disorder segment, only groups II and III had a significantly higher score than group IV. POPPs were used to identify peptides that were over- or under-represented in each group compared with group IV (Table I). The most obvious observations were (i) the low complexity of over-represented peptides in group I, (i) the high complexity of over-represented peptides in groups II and V, and (iii) most of the over-represented peptides in all groups were 3 aa long, whereas in the under-represented peptides only a few were 3 aa long. In general the POPPs analysis was in line with the above sequence analysis; for instance, overrepresented peptides in group III were rich in aspartic acid, glutamic acid, serine, and proline, supporting the observation that group III is mainly characterized by a strong hydrophilic content. Decision trees were used to extract sequence attributes that were the most useful for classification purposes (Fig. 4). In each decision tree one of the protein groups is classified compared with group IV. The largest local hydrophobic region and the isoelectric point were the most useful attributes for classification of proteins into group I or IV (Fig. 4A). The best tree classified correctly 73% of the proteins, 85% correctly in group IV and 62% correctly in group I (Fig. 4, A2). 
Decision trees for other groups were not consistent in the sequence attributes used, and the classification performance was lower than those of group I. Sequence attributes that appeared in more than one tree included isoelectric point for group II (Fig. 4B), coil propensity for group III (Fig. 4C), and protein flexibility and instability for group V (Fig. 4D). The major attributes that were different between groups were previously associated with inclusion body formation (12, 37, 38). Therefore, 68 human exons were expressed using the same procedure except for the protein purification part. Instead of purifying the protein under denaturing condition, proteins were purified separately from the soluble and insoluble phases (Fig. 5). The purified proteins were visualized on 1D gels and classified into one of five groups as described above. The majority of the proteins from the insoluble fractions were visualized as strong bands (groups IV and V),

Downloaded from www.mcponline.org by on May 1, 2007

group sizes a random selection of proteins was sampled from group IV equal in size to the other group. To increase confidence in the classification process, three trees were constructed, each time sampling randomly from group IV. The length of the branch is proportional to the classification error. Next to each leaf node the predicted group is indicated, and the proportion of the number of cases that were classified correctly to the total number of cases predicted for that leaf is shown. For instance, in the lowest leaf of decision tree D1, 29 proteins were classified into group IV, but only 23 of those were actually from group IV, and the remaining six were from group V.

Molecular & Cellular Proteomics 5.9

1575

Analysis of High Throughput Protein Expression

Downloaded from www.mcponline.org by on May 1, 2007

FIG. 5. Two 1D SDS gels from the soluble (blue) and insoluble (red) fractions. The gels were scanned separately, and their images were tinted and combined using a graphical image editing program. Each lane contains the exact same sample from both phases, except the right marker lane, which was switched to distinguish between the gels. The soluble protein with high molecular weight that is present in all samples is a native E. coli protein that binds to Ni-NTA and should be ignored.

FIG. 6. Classification of expressed proteins from the soluble (A) and insoluble fractions (B). Protein purification was performed separately from the soluble and insoluble fractions of the bacterial lysate. Proteins were visualized on 1D SDS gel and classified into one of five categories: no bands, faint correct/wrong bands, or strong correct/wrong bands.

FIG. 7. Mean A550 nm for 68 expressed human exons after overnight growth at 30 °C prior to dilution and induction. Protein classification into groups was done by considering the combination of the soluble and insoluble fractions. The error bars represent the S.E. value. Group I was significantly smaller (**, p 0.002) than groups III + V and II + IV. Group III + V was significantly smaller (*, p 0.01) than group II + IV.

whereas the majority of the proteins from the soluble fraction were visualized as faint bands (groups II and III) (Fig. 6). Forty-three percent of the proteins that were purified from the insoluble fraction were observed to be of expected size compared with 34% in the soluble fraction. Furthermore, the number of proteins with no visible band on gel was higher for the soluble fraction. Combining both fractions, no visible protein bands were observed on gel for seven proteins, and a band of expected molecular weight was observed for 37 proteins (54%). Of these, 16 were present in both the soluble and insoluble fractions, 13 were present only in the insoluble fraction, and eight were present only in the soluble fraction. Proteins that were expressed correctly in the soluble fraction were compared with proteins that were expressed correctly in the insoluble fraction. The only two parameters that were significantly different were GC content and β-sheet propensity. Proteins that were expressed correctly in the soluble fraction had a higher GC content (61 versus 51%, p 0.03) and a lower β-sheet propensity (0.11 versus 0.26, p 0.02). Bacterial growth for these proteins was monitored at A550. The A550 was measured after overnight growth at 30 °C prior to dilution and induction. The average A550 for all 68 proteins was 2.6 ± 0.6. Due to the relatively small number of samples, bacterial growth was compared across three different expression groups, namely, group I (no detectable protein), groups II + IV (expected sizes), and groups III + V (unexpected sizes). A protein was considered correct if the protein band was of the expected size in either the soluble or insoluble fraction. A protein was classified into group I if no visible band was observed on gel in both the soluble and insoluble fractions. Bacterial growth for group I was significantly lower than the growth observed for groups III + V, which was significantly lower than that for groups II + IV (Fig. 7).

DISCUSSION

The field of genomics has been revolutionized by the ability to use high throughput technology. DNA arrays are now affordable to most laboratories, and genomic information accumulates faster than it can be analyzed. The field of proteomics, despite many efforts and advances, still lacks effective high throughput technology. The work presented here demonstrates the technical difficulties in scaling up heterologous protein expression.

We expressed small protein fragments in E. coli for generating antibodies against the native protein. Those fragments were expressed as part of a recombinant protein with a large HisZZ fusion protein on the amino-terminal end and a smaller Strep-tag on the carboxyl-terminal end. Although the selected DNA insert was on average a third of the entire recombinant DNA sequence, significant differences were observed in expression level in E. coli. The success rate reported here of 60% is similar to previous reports where a pipeline approach was applied (2–7). All protein bands that were observed on gel were equal to or larger than the HisZZ domain size, suggesting that this domain is very stable and probably folds independently. We have shown elsewhere that this domain is also beneficial for eliciting antibodies (17).

Here we attempted to detect, based on sequence analysis, those proteins that are suitable for our specific pipeline protocol. We observed several significant differences between the five expression groups. The group that is most different from all others is group I, which contains the genes whose products failed to produce visible bands on the gel. The proteins produced by this group were the most hydrophobic and had the highest positive charge, highest isoelectric point, highest low complexity score, lowest flexibility, highest β-sheet propensity, and lowest protein disorder. Hydrophobicity and β-sheet propensity have been implicated previously in the formation of inclusion bodies (12, 38).
However, the low flexibility and low protein disorder stand in contrast to inclusion body formation. Inclusion bodies have been shown to be the result of an increased population of partially folded intermediates (37), and reduced flexibility and disorder are likely to reduce the formation of such intermediates (38). Furthermore, there is a conflict between the high positive charge observed in group I, which increases solubility in an aqueous environment, and the large hydrophobic regions, which decrease solubility. Therefore, it is possible that the combination of high hydrophobicity and charge together with low flexibility and low protein disorder generates a protein that is not likely to form inclusion bodies in the bacterial host. Such a protein is potentially toxic to the host because large nonfolded hydrophobic regions have a tendency to stick to other proteins in an aqueous environment. This possibility is supported by the lower growth rate observed for group I. It is also possible that only those cells that expressed the wrong ORF survived. Proteins of group I also had a significantly lower protein complexity and a higher aromaticity score. Low protein complexity is likely to cause amino acid shortage and trigger a stringent response resulting in increased protease activity and protein degradation (39–41). Ramírez and Bentley (40) showed experimentally that the addition of phenylalanine to bacteria overexpressing chloramphenicol acetyltransferase reduced the cellular stress and resulted in a proportional increase in production. The lower complexity and higher aromaticity score for proteins in group I suggest that proteins in this group posed a higher burden on the host; this is another form of toxicity and strengthens the previous statement. The decision trees distinguished group I from group IV by the single largest hydrophobic AUC region and the isoelectric point of the protein, supporting the above conclusion that these proteins were most likely expressed below detection level due to toxicity to the bacterial host.

The two groups of proteins with expected sizes (groups II and IV) were indistinguishable from one another using sequence analysis. Despite the relatively high number of proteins in each group, the decision trees generated to classify proteins into one of the two groups were inconsistent and performed poorly, emphasizing further the difficulty of separating these two groups. Low final quantities of purified protein can be explained by proteolysis, inefficient protein production, inefficient purification, or low bacterial growth. Because the ZZ domain was shown to be stable, it is unlikely that the His6 tag was unavailable for purification. It is also unlikely that bacterial growth was lower for group II because no evidence was found to support it in the other set of 68 proteins where bacterial growth was monitored. POPPs analysis detected over-represented peptides rich in serine, a hydrophilic amino acid. Serine-rich peptides were over-represented in all groups compared with the E. coli average protein; however, in group II the over-representation was even higher than in group IV.
The solubility experiment showed that soluble proteins were present in low amounts whereas the insoluble proteins accumulated to larger amounts. Therefore, the difference in quantity is most likely due to differences in solubility of the protein or in resistance to proteolysis.

Proteins that had a visible band on gel but of incorrect size (groups III and V) were either produced as full proteins that were later cleaved or degraded, or the translation simply stopped shortly after the HisZZ domain. Those proteins were harder to distinguish from group IV than proteins from group I. The most distinctive properties of proteins with incorrect size are their negative charge, compared with the neutral charge of proteins of expected size, and their higher coil propensity. Those proteins also have larger local hydrophilic regions. Decision trees created to classify proteins were inconsistent. However, several attributes that were repeated include percentage of amino acids occurring in coil structures, hydrophobicity, and flexibility. Given those properties it is reasonable to assume that the cause for production of proteins with wrong size was either cleavage or degradation of the proteins and not translation interference.

Protein expression can fail at many different steps, from the stability of the plasmid encoding the protein to the specific protein generated and its interactions with the E. coli proteins. Using a pipeline approach, we were not able to determine experimentally whether expression failure occurred at DNA, mRNA, or protein levels; only the end result was known. None of the DNA attributes that were derived from sequence analysis methods were significantly different between groups. Therefore, it is more likely that differences in expression were attributable to the properties of the produced protein, whereas the DNA and mRNA constructs were similarly stable for all groups.

A DNA or protein sequence is often characterized using a plot generated by a numerical value assigned to each base or amino acid (42), such as the GC content plot shown in Fig. 1. Although these plots may be useful when analyzing a few sequences, they are difficult to use when comparing hundreds of sequences because there is no easy way to convert them into a number preserving the information displayed. Here we used the area under the curve and above a threshold or the area above the curve and under a threshold depending on the context of the attribute. This method was more useful than using the average value of the plot and emphasized the differences between protein groups. Using the GRAVY method as described by Kyte and Doolittle (30), for instance, suggested that only the mean value of group I was significantly different from group IV and that all protein groups were hydrophilic. However, using the mean of the largest hydrophobic and hydrophilic areas in each protein and the mean of the sum of all hydrophobic and hydrophilic areas revealed the significantly larger regionalized hydrophilic content in groups III and V compared with group IV and the extent of local hydrophobic regions in group I.
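
The contrast between a whole-sequence average (GRAVY) and area-based attributes can be illustrated with a short Python sketch. The Kyte-Doolittle hydropathy values are the published scale (30); the window size of 9, the threshold of 0, the rectangle-rule area, and all function names are assumptions chosen for illustration and are not necessarily the settings used in this study.

```python
# Kyte-Doolittle hydropathy scale (reference 30).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(seq):
    """Grand average of hydropathy: mean scale value over the whole sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)

def window_profile(seq, window=9):
    """Sliding-window mean hydropathy (window size is an assumed value)."""
    values = [KD[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def regions_above(profile, threshold=0.0):
    """Area of each contiguous region where the profile exceeds the threshold
    (rectangle rule with unit spacing between points)."""
    areas, current = [], 0.0
    for v in profile:
        if v > threshold:
            current += v - threshold
        elif current > 0.0:
            areas.append(current)
            current = 0.0
    if current > 0.0:
        areas.append(current)
    return areas

def regions_below(profile, threshold=0.0):
    """Area of each contiguous region where the profile lies under the threshold."""
    return regions_above([-v for v in profile], -threshold)

# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
demo = "D" * 20 + "I" * 10 + "K" * 20
profile = window_profile(demo)
hydrophobic = regions_above(profile)
hydrophilic = regions_below(profile)

print("GRAVY:", gravy(demo))
print("largest hydrophobic area:", max(hydrophobic))
print("sum of hydrophilic areas:", sum(hydrophilic))
```

For this made-up sequence the GRAVY value is negative, so the average alone would label the protein hydrophilic, while the area above the threshold still reports the local hydrophobic stretch.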
Furthermore, attributes that use the area under the curve were used more frequently in the decision trees than attributes that use average values. Therefore, we recommend using these sequence analysis methods when comparing DNA or protein sequences.

Inclusion body formation was tested on 68 proteins. Clearly more pure protein product could be obtained from proteins that form inclusion bodies compared with soluble proteins. The majority of the proteins were present in both the soluble and insoluble fractions; however, some were present exclusively in one of the two. The exclusive expression in the soluble phase was probably due to the stability and solubility of the HisZZ domain, which is the largest part of the protein. Despite the relatively higher number of proteins with expected size in the insoluble phase, many proteins could clearly be expressed correctly in a soluble form. Therefore, inclusion body formation as such is not essential for correct expression. β-Sheet propensity was significantly lower for soluble proteins. Increased β-sheet formation was shown previously to increase the likelihood of inclusion body formation (12, 38). The biological significance of the higher percentage of GC content observed in soluble proteins is not clear.

The ability to predict successful expression was limited. Decision trees, which have been used previously for similar purposes (10, 11), were not consistent except when comparing groups I and IV. Although in some cases the same attributes were used in all trees, in many cases different attributes were used for classification in each tree. This demonstrates the difficulty in determining a small set of attributes that cause expression failure and emphasizes that failure can occur at different levels and due to different combinations of attributes. At this point we are unable to produce an exact algorithm for predicting successful expression with a reasonably good sensitivity and specificity. However, for our specific pipeline, efficiency is likely to be increased simply by avoiding proteins with (i) a strong positive or negative charge, (ii) a ratio of hydrophobic AUC to hydrophilic AUC different from 1 ± 0.5, (iii) an isoelectric point below 6.5 or above 7.5, (iv) high aliphatic or aromatic content, (v) protein complexity above 1, (vi) high β-sheet or coil content, and (vii) low flexibility. Other protocols will be developed to handle those proteins for which specific strategies need to be devised.

Many groups within academia and industry attempt to produce proteins in a high throughput manner with varying success rates of 40–80%. The success rate is often accepted as is with no further inquiry into the reasons for which protein expression failed for such a large group of proteins. We would like to encourage all groups using high throughput protein expression to investigate the link between DNA and protein sequences and successful expression, thereby leading to efficient and affordable protein expression platforms that are essential for proteomic research.
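
The avoidance guidelines (i) to (vii) given in the discussion could be applied as a simple pre-screen, sketched below in Python. Only three criteria carry numeric limits in the text (the AUC ratio, read here as the range 0.5 to 1.5, the isoelectric point window of 6.5 to 7.5, and complexity above 1). The remaining criteria are qualitative, so this sketch takes them as precomputed yes/no flags instead of inventing cutoffs. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CandidateProtein:
    strong_charge: bool               # (i) judged by the user, no cutoff in the text
    hydrophobic_auc: float            # (ii) numerator of the AUC ratio
    hydrophilic_auc: float            # (ii) denominator of the AUC ratio
    isoelectric_point: float          # (iii)
    high_aliphatic_or_aromatic: bool  # (iv) judged by the user
    complexity: float                 # (v)
    high_sheet_or_coil: bool          # (vi) judged by the user
    low_flexibility: bool             # (vii) judged by the user

def guideline_flags(p):
    """Return the guideline numerals that argue against expressing p."""
    flags = []
    if p.strong_charge:
        flags.append("i")
    if p.hydrophilic_auc > 0:
        ratio = p.hydrophobic_auc / p.hydrophilic_auc
    else:
        ratio = float("inf")  # no hydrophilic area: treat the ratio as out of range
    if not 0.5 <= ratio <= 1.5:
        flags.append("ii")
    if not 6.5 <= p.isoelectric_point <= 7.5:
        flags.append("iii")
    if p.high_aliphatic_or_aromatic:
        flags.append("iv")
    if p.complexity > 1.0:
        flags.append("v")
    if p.high_sheet_or_coil:
        flags.append("vi")
    if p.low_flexibility:
        flags.append("vii")
    return flags

# Hypothetical candidates with made-up attribute values.
ok = CandidateProtein(False, 10.0, 10.0, 7.0, False, 0.8, False, False)
risky = CandidateProtein(True, 30.0, 10.0, 9.2, False, 1.3, False, True)

print(guideline_flags(ok))
print(guideline_flags(risky))
```

Returning the list of triggered guidelines, and not a single pass/fail value, keeps the reason for each rejection visible, which matches the call above to look into why expression fails.
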
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
** To whom correspondence should be addressed: Dept. of Psychopharmacology, Sorbonnelaan 16, 3584 CA, Utrecht, The Netherlands. Tel.: 31-30-2533730; Fax: 31-30-2537900; E-mail: R.S.Oosting@Pharm.uu.nl.
REFERENCES
1. Humphery-Smith, I. (2004) A human proteome project with a beginning and an end. Proteomics 4, 2519–2521
2. Braun, P., Hu, Y., Shen, B., Halleck, A., Koundinya, M., Harlow, E., and LaBaer, J. (2002) Proteome-scale purification of human proteins from bacteria. Proc. Natl. Acad. Sci. U. S. A. 99, 2654–2659
3. Luan, C. H., Qiu, S., Finley, J. B., Carson, M., Gray, R. J., Huang, W., Johnson, D., Tsao, J., Reboul, J., Vaglio, P., Hill, D. E., Vidal, M., Delucas, L. J., and Luo, M. (2004) High-throughput expression of C. elegans proteins. Genome Res. 14, 2102–2110
4. Agaton, C., Galli, J., Höiden Guthenberg, I., Janzon, L., Hansson, M., Asplund, A., Brundell, E., Lindberg, S., Ruthberg, I., Wester, K., Wurtz, D., Höög, C., Lundeberg, J., Ståhl, S., Pontén, F., and Uhlén, M. (2003) Affinity proteomics for systematic protein profiling of chromosome 21 gene products in human tissues. Mol. Cell. Proteomics 2, 405–414
5. Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Gerstein, M., Arrowsmith, C. H., and Edwards, A. M. (2000) Structural proteomics: prospects for high throughput sample preparation. Prog. Biophys. Mol. Biol. 73, 339–345
6. Pizza, M., Scarlato, V., Masignani, V., Giuliani, M. M., Aricò, B., Comanducci, M., Jennings, G. T., Baldi, L., Bartolini, E., Capecchi, B., Galeotti, C. L., Luzzi, E., Manetti, R., Marchetti, E., Mora, M., Nuti, S., Ratti, G., Santini, L., Savino, S., Scarselli, M., Storni, E., Zuo, P., Broeker, M., Hundt, E., Knapp, B., Blair, E., Mason, T., Tettelin, H., Hood, D. W., Jeffries, A. C., Saunders, N. J., Granoff, D. M., Venter, J. C., Moxon, E. R., Grandi, G., and Rappuoli, R. (2000) Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science 287, 1816–1820
7. Dobrovetsky, E., Lu, M. L., Andorn-Broza, R., Khutoreskaya, G., Bray, J. E., Savchenko, A., Arrowsmith, C. H., Edwards, A. M., and Koth, C. M. (2005) High-throughput production of prokaryotic membrane proteins. J. Struct. Funct. Genomics 6, 33–50
8. Cort, J. R., Koonin, E. V., Bash, P. A., and Kennedy, M. A. (1999) A phylogenetic approach to target selection for structural genomics: solution structure of YciH. Nucleic Acids Res. 27, 4018–4027
9. Zarembinski, T. I., Hung, L. W., Mueller-Dieckmann, H. J., Kim, K. K., Yokota, H., Kim, R., and Kim, S. H. (1998) Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc. Natl. Acad. Sci. U. S. A. 95, 15189–15193
10. Bertone, P., Kluger, Y., Lan, N., Zheng, D., Christendat, D., Yee, A., Edwards, A. M., Arrowsmith, C. H., Montelione, G. T., and Gerstein, M. (2001) SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics. Nucleic Acids Res. 29, 2884–2898
11. Goh, C. S., Lan, N., Echols, N., Douglas, S. M., Milburn, D., Bertone, P., Xiao, R., Ma, L. C., Zheng, D., Wunderlich, Z., Acton, T., Montelione, G. T., and Gerstein, M. (2003) SPINE 2: a system for collaborative structural proteomics within a federated database framework. Nucleic Acids Res. 31, 2833–2838
12. Idicula-Thomas, S., and Balaji, P. V. (2005) Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli. Protein Sci. 14, 582–592
13. Shimada, K., Nagano, M., Kawai, M., and Koga, H. (2005) Influences of amino acid features of glutathione S-transferase fusion proteins on their solubility. Proteomics 5, 3859–3863
14. Benita, Y., Oosting, R. S., Lok, M. C., Wise, M. J., and Humphery-Smith, I. (2003) Regionalized GC content of template DNA as a predictor of PCR success. Nucleic Acids Res. 31, e99
15. Nilsson, B., Moks, T., Jansson, B., Abrahmsen, L., Elmblad, A., Holmgren, E., Henrichson, C., Jones, T. A., and Uhlén, M. (1987) A synthetic IgG-binding domain based on staphylococcal protein A. Protein Eng. 1, 107–113
16. Skerra, A., and Schmidt, T. G. (2000) Use of the Strep-Tag and streptavidin for detection and purification of recombinant proteins. Methods Enzymol. 326, 271–304
17. Zhao, Y., Benita, Y., Lok, M., Kuipers, B., van der Ley, P., Jiskoot, W., Hennink, W. E., Crommelin, D. J., and Oosting, R. S. (2005) Multi-antigen immunization using IgG binding domain ZZ as carrier. Vaccine 23, 5082–5090
18. Chapman, B., and Chang, J. (2000) Biopython: python tools for computational biology. ACM SIGBIO Newslett. 20, 15–19
19. Lobry, J. R., and Gautier, C. (1994) Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Res. 22, 3174–3180
20. Guruprasad, K., Reddy, B. V., and Pandit, M. W. (1990) Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. 4, 155–161
21. Harrison, R. G. (2000) Expression of soluble heterologous proteins via fusion with NusA protein. inNovations 11, 4–7
22. Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277
23. Vihinen, M., Torkkila, E., and Riikonen, P. (1994) Accuracy of protein flexibility predictions. Proteins 19, 141–149
24. Prilusky, J., Felder, C. E., Zeev-Ben-Mordehai, T., Rydberg, E. H., Man, O., Beckmann, J. S., Silman, I., and Sussman, J. L. (2005) FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 21, 3435–3438
25. Wootton, J. C., and Federhen, S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266, 554–571
26. Wan, H., and Wootton, J. C. (2000) A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput. Chem. 24, 71–94
27. Garnier, J., Osguthorpe, D. J., and Robson, B. (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120, 97–120
28. Zuker, M. (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31, 3406–3415
29. Wise, M. J. (2001) 0j.py: a software tool for low complexity proteins and protein domains. Bioinformatics 17, Suppl. 1, S288–S295
30. Kyte, J., and Doolittle, R. F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132
31. Sharp, P. M., and Li, W. H. (1987) The Codon Adaptation Index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295
32. Kurland, C., and Gallant, J. (1996) Errors of heterologous protein expression. Curr. Opin. Biotechnol. 7, 489–493
33. Wise, M. J. (2002) The POPPs: clustering and searching using peptide probability profiles. Bioinformatics 18, Suppl. 1, S38–S45
34. Kersey, P. J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., and Apweiler, R. (2004) The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985–1988
35. Holm, L., and Sander, C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 14, 423–429
36. R Development Core Team (2006) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria
37. Baneyx, F., and Mujacic, M. (2004) Recombinant protein folding and misfolding in Escherichia coli. Nat. Biotechnol. 22, 1399–1408
38. Ventura, S. (2005) Sequence determinants of protein aggregation: tools to increase protein solubility. Microb. Cell Fact. 4, 11
39. Harcum, S. W., and Bentley, W. E. (1999) Heat-shock and stringent responses have overlapping protease activity in Escherichia coli. Implications for heterologous protein yield. Appl. Biochem. Biotechnol. 80, 23–37
40. Ramírez, D. M., and Bentley, W. E. (1995) Fed-batch feeding and induction policies that improve foreign synthesis and stability by avoiding stress responses. Biotechnol. Bioeng. 47, 596–608
41. Ramírez, D. M., and Bentley, W. E. (1999) Characterization of stress and protein turnover from protein overexpression in fed-batch E. coli cultures. J. Biotechnol. 71, 39–58
42. Gasteiger, E., Hoogland, C., Gattiker, A., Duvaud, S., Wilkins, M. R., Appel, R. D., and Bairoch, A. (2005) in The Proteomics Protocols Handbook (Walker, J. M., ed.) pp. 571–607, Humana Press, Totowa, NJ
