Research Report

Complexity of WGBS Data Caused by Cellular Heterogeneity and Multiple Cytosine Modifications  

Zixu Wang , Weiming Zhao
College of Basic Medical, Harbin Medical University, Harbin, 150081, China
Cancer Genetics and Epigenetics, 2018, Vol. 6, No. 5   doi: 10.5376/cge.2018.06.0005
Received: 09 Oct., 2018    Accepted: 15 Nov., 2018    Published: 30 Nov., 2018
© 2018 BioPublisher Publishing Platform
This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article:

Wang Z.X., and Zhao W.M., 2018, Complexity of WGBS data caused by cellular heterogeneity and multiple cytosine modifications, Cancer Genetics and Epigenetics, 6(5): 33-39 (doi: 10.5376/cge.2018.06.0005)


DNA methylation is an important epigenetic modification that plays a key role in many biological processes, such as transcriptional regulation, gene imprinting, X chromosome inactivation, transposon silencing, and embryonic development. With the development of next-generation sequencing technology, large amounts of high-throughput methylation data are constantly emerging, and the processing and analysis of these data is an urgent problem. This review discusses the difficulties and challenges encountered in the analysis of WGBS methylation data at four levels: (i) Cytosines to Reads: technology based on bisulfite conversion; (ii) Reads to Methylation level: BS-seq sequence alignment; (iii) Methylation level to Region: characteristics of the methylome; (iv) Multiple methylomes: differential methylation. In particular, we discuss the effects of cellular heterogeneity and other cytosine modifications on WGBS methylation data at the site, region, and multiple-methylome levels.

DNA methylation; Epigenetic modification; WGBS; Differential methylation; Cell heterogeneity


In eukaryotes, DNA methylation refers to the addition of a methyl group to the fifth carbon atom of cytosine (i.e. 5-methylcytosine). In mammals, DNA methylation usually occurs on the dinucleotide sequence CpG (the cytosine is linked to the guanine by a phosphate bond) and is also referred to as CpG methylation. As an important epigenetic modification, DNA methylation has been shown to be involved in a variety of biological processes, such as silencing of transposable elements (Robinson and Smyth, 2008), regulation of gene expression (Feng et al., 2014), gene imprinting (Zhang et al., 2011), X chromosome inactivation (Varshney et al., 2016), and embryonic development and cell differentiation (Chen et al., 2015). Abnormal DNA methylation changes are found in many diseases such as cancer. For example, hypomethylation of proto-oncogenes and hypermethylation of tumor suppressor genes promote tumorigenesis (Pennisi, 2013).


Conventional molecular techniques do not distinguish between methylated and unmethylated cytosine, so DNA needs to be pre-treated before DNA methylation can be detected. DNA methylation detection techniques fall into three categories according to the pretreatment method: restriction enzyme digestion, affinity enrichment, and bisulfite conversion. Restriction enzyme digestion and affinity enrichment can generate only region-level methylation data, whereas technology based on bisulfite conversion can accurately detect methylation status at the single-base level.


Since next-generation sequencing became widely used, combining bisulfite conversion with high-throughput sequencing has made it possible to detect the methylation status of almost all cytosines in the genome. In recent years, with the reduction of sequencing cost, Whole Genome Bisulfite Sequencing (WGBS) (Wong et al., 2016) has been applied to profile methylation in embryonic development, adult tissues, disease, and other physiological and pathological conditions, for example in the Roadmap program for stem cell lines and ex vivo tissues, the BLUEPRINT program for blood cell types and related diseases, and the TCGA program for various cancer tissues. WGBS data derived from a large number of samples can be obtained from the data centers of these programs and from the GEO database. Compared with gene expression data, genome-wide methylation data is more computationally intensive to process and contains more complex information.


1 From Cytosine to Reads: Second Generation Sequencing Based on Bisulfite Conversion

Conventional sequencing methods cannot distinguish between 5mC and unmethylated cytosines. Therefore, the first step in detecting methylated cytosine in a DNA sequence is to distinguish 5mC from cytosine. Since the discovery that sodium bisulfite can convert ordinary cytosine into uracil (the intermediate product is 5,6-dihydrouracil-6-sulfonate), the DNA methylation detection technology based on bisulfite conversion is considered the gold standard for methylation detection, because it can accurately quantify the methylation status of a single cytosine.


The bisulfite sequencing developed by Frommer et al. in 1992 can accurately recognize methylated cytosine on the target sequence (Noble, 2009), but it usually detects methylation status at only dozens of CpG sites. The development of next-generation sequencing has made it possible to detect the methylation status of cytosines across the whole genome. In March 2008, Shawn J. Cokus et al. examined the genome-wide DNA methylation profile of Arabidopsis thaliana, and their detection technique was named BS-seq. In May of the same year, Ryan Lister et al. also published whole-genome DNA methylation sequencing of Arabidopsis thaliana, and their detection technology was named MethylC-seq (Dudley and Butte, 2009). Apart from slightly different library construction methods, the core of both technologies is bisulfite conversion combined with next-generation sequencing. Because they detect genome-wide cytosine methylation, they are collectively referred to as Whole Genome Bisulfite Sequencing (WGBS), as distinguished from RRBS (Reduced Representation Bisulfite Sequencing), which preferentially detects CpG-rich regions (Hackett et al., 2013).


There are four main steps in the detection of methylated cytosine by the WGBS technique: (i) Sonication is used to break DNA into short fragments. (ii) The fragments are treated with bisulfite: bisulfite converts C to U while 5mC remains unchanged. (iii) The fragments are amplified by PCR, during which U is read as T. (iv) The base sequence of the fragments is determined using a sequencer. The methylation status of each cytosine is inferred by comparing the reads produced by the sequencer with the reference genome.
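The conversion and comparison logic can be illustrated with a minimal in silico sketch. It assumes a single forward-strand fragment, complete conversion, and no sequencing errors; the function names and the toy sequence are hypothetical and are not taken from any WGBS tool.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on one strand.

    Unmethylated C is read as T (C -> U -> T); 5mC is protected and stays C.
    `methylated_positions` holds 0-based indices of methylated cytosines.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, read):
    """Infer the status of each reference cytosine from an aligned read."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = "methylated"
        elif read_base == "T":
            calls[i] = "unmethylated"
    return calls


reference = "ACGTCCGA"                      # toy reference fragment
read = bisulfite_convert(reference, [1])    # only the C at index 1 is 5mC
calls = call_methylation(reference, read)
```

Here the read becomes `ACGTTTGA`, so only the cytosine at index 1 is called methylated. Real pipelines must also handle the reverse strand, incomplete conversion, and mismatches.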


In mammals, 5mC is the most abundant, but not the only, cytosine modification. Recent studies have found that 5-methylcytosine (5mC) can be oxidized by Tet proteins to 5-hydroxymethylcytosine (5hmC), and that 5hmC can be further oxidized by Tet proteins into 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC). These three cytosine modifications have been shown to be involved in the active DNA demethylation process, and the base-excision repair process mediated by the TDG protein (thymine DNA glycosylase) can convert 5fC/5caC back to unmodified cytosine.


Conventional WGBS cannot identify these three other cytosine modifications: neither 5hmC nor 5mC is converted by bisulfite, so the two are indistinguishable, while both 5fC and 5caC are converted to uracil and are therefore indistinguishable from unmodified cytosine. With this in mind, researchers have designed a series of variants of bisulfite sequencing to detect the three cytosine modifications associated with DNA demethylation. For example, oxBS-seq (Lu et al., 2015) and TAB-seq are capable of detecting 5-hydroxymethylcytosine genome-wide. fCAB-seq (Blaschke et al., 2013) and redBS-seq (Hu et al., 2014) are capable of detecting 5-formylcytosine, and caCAB-seq (Kohli and Zhang, 2013) can detect 5-carboxylcytosine. In addition, MAB-seq (Klengel et al., 2013) can detect 5fC/5caC genome-wide (Figure 1).
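The consequence for conventional WGBS can be sketched numerically. The lookup table below encodes only the conversion behaviour stated in the text; the per-site fractions are made-up illustrative numbers, and `apparent_methylation` is a hypothetical helper name.

```python
# Base observed after standard bisulfite treatment and PCR:
# 5mC and 5hmC are protected (read as C); C, 5fC and 5caC are read as T.
BS_READOUT = {"C": "T", "5mC": "C", "5hmC": "C", "5fC": "T", "5caC": "T"}


def apparent_methylation(fractions):
    """Fraction of reads showing C at a site under conventional WGBS.

    `fractions` maps each cytosine state to the fraction of molecules
    carrying it; the fractions must sum to 1.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("state fractions must sum to 1")
    return sum(f for state, f in fractions.items() if BS_READOUT[state] == "C")


# Hypothetical site: illustrative numbers only, not measured data.
site = {"C": 0.20, "5mC": 0.70, "5hmC": 0.05, "5fC": 0.03, "5caC": 0.02}
level = apparent_methylation(site)
```

In this toy case the apparent level is 0.75 although only 70% of molecules carry 5mC: the 5hmC fraction is counted as methylated and the 5fC/5caC fraction as unmethylated.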



Figure 1 Sequencing method based on bisulfite conversion


On the other hand, the bisulfite conversion-based techniques described above measure the methylation status of a population of cells. Although cells in an organism share a single genome, their epigenomes differ from cell to cell. Recently, the development of single-cell sequencing technology has made it possible to observe dynamic changes in DNA methylation at the single-cell level. For example, in 2013, Guo et al. implemented RRBS at the single-cell level (Down et al., 2008). In 2014, Smallwood et al. developed the scBS-seq technique (Stelzer et al., 2015) to detect more than half of the methylome in a single cell.


Sequencing data for other cytosine modifications and single-cell methylation sequencing data are still very rare, whereas WGBS data is increasing owing to the reduction in sequencing costs and the maturity of the technology. Although the number of RRBS samples is also sufficient, the methylation sites detected by RRBS are poorly consistent between samples, so RRBS is not recommended for the analysis of multiple methylomes. Therefore, in this review, we focus on the strategies and challenges of WGBS data analysis.


2 From Reads to Methylation Levels: Alignment of Reads

Going from the raw reads generated by the sequencing platform to single-base methylation levels requires three steps: (i) quality control of the raw reads; (ii) alignment of the clean reads to the reference genome; (iii) calculation of the methylation level of each cytosine based on the reads covering it. For a given cytosine, the methylation level is the ratio of the number of methylated reads to the total number of reads covering it. The specific alignment procedures and the factors affecting the accuracy of methylation level calculations are reviewed by Felix Krueger et al. It is recommended to store genome-wide methylation level data in the variable-step wiggle format or its binary form, bigWig (as in the Roadmap and ENCODE programs), to facilitate subsequent visualization of methylation in web genome browsers (such as UCSC) or local genome browsers (e.g. IGV).
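Step (iii) and the suggested storage format can be sketched as follows. This is a minimal illustration, assuming per-cytosine read counts have already been extracted from the alignments; the function names, chromosome, positions, and counts are hypothetical.

```python
def methylation_levels(counts, min_total=1):
    """Compute methylated / total reads for each covered cytosine.

    `counts` maps a 1-based position to (methylated_reads, unmethylated_reads).
    Positions with fewer than `min_total` reads are skipped.
    """
    levels = {}
    for pos, (meth, unmeth) in counts.items():
        total = meth + unmeth
        if total >= min_total:
            levels[pos] = meth / total
    return levels


def to_variable_step_wig(chrom, levels):
    """Render levels as variable-step wiggle text (1-based positions)."""
    lines = [f"variableStep chrom={chrom}"]
    for pos in sorted(levels):
        lines.append(f"{pos} {levels[pos]:.3f}")
    return "\n".join(lines)


# Hypothetical read counts at three cytosines; the last one is uncovered.
counts = {10468: (8, 2), 10470: (3, 9), 10483: (0, 0)}
levels = methylation_levels(counts)
wig_text = to_variable_step_wig("chr1", levels)
```

The resulting text could then be converted to bigWig with standard genome browser utilities; that conversion is outside the scope of this sketch.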


Because DNA methylation is maintained by DNMT1, the methylation status of most CpG sites in the genome is symmetric on the positive and negative strands, so the reads from the two strands are usually combined to increase read depth. For diploid mammals, when a CpG site is methylated on one allele and unmethylated on the other, the methylation level is expected to be 0.5.
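Strand merging can be sketched as below, assuming counts are keyed by the position of the cytosine on each strand, so that the negative-strand cytosine of a CpG sits one base after the positive-strand cytosine. The function name and counts are hypothetical.

```python
def merge_cpg_strands(plus_counts, minus_counts):
    """Combine read counts from both strands of each CpG dinucleotide.

    `plus_counts` maps the position of the C on the positive strand to
    (methylated_reads, total_reads). `minus_counts` is keyed by the position
    of the C on the negative strand, which is the G position (C position + 1)
    on the positive strand. The merged result is keyed by the positive-strand
    C position.
    """
    merged = {pos: [meth, total] for pos, (meth, total) in plus_counts.items()}
    for pos, (meth, total) in minus_counts.items():
        entry = merged.setdefault(pos - 1, [0, 0])
        entry[0] += meth
        entry[1] += total
    return {pos: (meth, total) for pos, (meth, total) in merged.items()}


plus = {100: (4, 5)}
minus = {101: (5, 5), 201: (1, 4)}   # CpG at 200 is covered on one strand only
merged = merge_cpg_strands(plus, minus)
```

The CpG at position 100 ends up with 9 methylated reads out of 10, twice the depth available from either strand alone.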


Because bisulfite conversion cannot distinguish 5hmC from 5mC, or 5fC/5caC from unmodified cytosine, the measured methylation level is affected by the three cytosine modifications related to demethylation. Compared with 5mC, 5hmC and 5fC/5caC account for a very small proportion: in mouse embryonic stem cells, 5hmC is about 5% and 5fC/5caC is about 2% (Figure 2B). The peaks of the level distributions of 5hmC and 5fC/5caC are below 0.5 (Figure 2A), while the methylation level of 5mC is close to 1, indicating that cells of the same cell type are not demethylated at the same time: a given cytosine is present as 5mC or 5hmC in some cells, while in other cells it is present as 5fC, 5caC, or 5mC (Figure 2C). Furthermore, 5hmC is not recognized by DNMT1 and therefore cannot be maintained during DNA replication. Strand asymmetry of 5hmC and 5fC has been observed in mouse embryonic stem cells, suggesting that the strand differences previously observed at CpG sites may be due to the dynamic process of DNA demethylation at the time of detection.



Figure 2 The complexity in the methylation level metric


The methylation level in bulk WGBS is also affected by differences between cells. The cellular heterogeneity of DNA methylation comes mainly from two sources: (i) Epigenomic differences between cell types. For example, blood contains multiple types of cells, and at certain locations DNA methylation is cell type specific. (ii) The ongoing processes of DNA methylation and DNA demethylation. WGBS captures only a static snapshot of these dynamic processes.
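The effect of mixed cell types on a bulk measurement can be illustrated with a simple weighted average. This sketch assumes every cell contributes equally to the sequenced DNA; the cell types, proportions, and levels are hypothetical values chosen for illustration.

```python
def bulk_methylation(proportions, cell_type_levels):
    """Expected bulk methylation level at one CpG site.

    `proportions` maps each cell type to its fraction of the population
    (fractions must sum to 1); `cell_type_levels` maps each cell type to its
    methylation level at the site.
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("cell type proportions must sum to 1")
    return sum(proportions[ct] * cell_type_levels[ct] for ct in proportions)


# Hypothetical blood sample: the site is methylated only in neutrophils.
proportions = {"neutrophil": 0.6, "T cell": 0.3, "B cell": 0.1}
site_levels = {"neutrophil": 1.0, "T cell": 0.0, "B cell": 0.0}
bulk = bulk_methylation(proportions, site_levels)
```

Although every individual cell is either fully methylated or fully unmethylated at this site, the bulk level is 0.6, an intermediate value that reflects composition rather than the state of any single cell.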


Before subsequent analysis, CpG sites with low read coverage need to be removed, because a larger number of reads quantifies the overall methylation status of the cell population more accurately and is less affected by sequencing and alignment errors. Previous studies have shown that, except in brain tissue and embryonic stem cells, non-CpG methylation (CHH and CHG methylation, where H is any of the three bases other than G) is very low in mammals. In this review, the primary focus is on methylation analysis in the CpG context in mammalian whole genomes.
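A minimal sketch of coverage filtering follows. The depth threshold of 5 is an arbitrary assumption, not a value from this review, and the standard error uses a simple binomial sampling model to show why deeper coverage gives more stable estimates.

```python
import math


def filter_by_coverage(counts, min_depth=5):
    """Keep CpG sites covered by at least `min_depth` reads.

    `counts` maps position to (methylated_reads, total_reads).
    """
    return {pos: mt for pos, mt in counts.items() if mt[1] >= min_depth}


def binomial_standard_error(level, depth):
    """Standard error of a methylation level under binomial sampling."""
    return math.sqrt(level * (1.0 - level) / depth)


# Hypothetical counts: (methylated reads, total reads) per position.
counts = {100: (2, 4), 150: (10, 20), 220: (50, 100)}
kept = filter_by_coverage(counts)
se_shallow = binomial_standard_error(0.5, 4)
se_deep = binomial_standard_error(0.5, 100)
```

All three sites have an observed level of 0.5, but the standard error falls from 0.25 at depth 4 to 0.05 at depth 100, which is why the shallow site is discarded.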


3 From Methylation Site to Region: Methylation Pattern Region

Although DNA methylation in vertebrates occurs mainly at CpG sites, a dinucleotide is too short to be a biologically meaningful region. At the same time, the number of CpG sites in the mammalian genome is very large (the human genome contains about 28 million CpG sites and the mouse genome about 21 million), so it is not appropriate to use a single CpG site as the basic unit of methylation analysis. Methylation status is not randomly distributed across the genome, and CpG sites that are adjacent in linear position usually have similar methylation status. Therefore, researchers usually combine adjacent CpG sites into one methylation pattern region.


In the genome, the most prominent DNA methylation feature is the CpG island, a region with high GC content and high CpG density relative to the background genome. The vast majority of CpG islands are located in gene promoter regions and are usually unmethylated. Hypermethylation of a promoter CpG island inhibits the expression of adjacent genes. However, most tissue-specific DNA methylation occurs not on CpG islands but on the CpG island shores (Irizarry et al., 2009) adjacent to them.


Sequence-based CpG islands and CpG island shores do not accurately and comprehensively reflect the complex methylation patterns across the genome. Since the advent of whole-genome bisulfite sequencing, researchers have identified a series of methylated regions with distinct genomic and epigenetic features based on genome-wide single-base methylation profiles (Table 1). Early studies divided the genome into windows to identify methylation pattern regions, but such methods cannot accurately identify region boundaries, and their results depend heavily on the window size. Such methods should be avoided as much as possible.



Table 1 The list of methylation pattern regions


For the identification of methylation pattern regions in the methylome, a three-state hidden Markov model is a suitable method. The methylation state (low, intermediate, high) of each CpG site serves as the hidden state of the model, and the detected methylation level serves as the observation from which the true state of the CpG site is inferred. The three states correspond to hypomethylation, intermediate methylation, and hypermethylation, respectively; intermediate methylation is caused by various factors such as cell heterogeneity, demethylation-related modifications, and allelic differences. It is worth noting that, because methylation levels are bounded between 0 and 1, the beta distribution is more suitable than the Gaussian distribution for modeling the emission probabilities of the hidden Markov model.
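The idea can be sketched with a small Viterbi decoder that uses beta emission densities. All parameters below (the beta shape parameters, the self-transition probability, the uniform start distribution, and the clipping constant) are hypothetical choices for illustration; in practice they would be estimated from data, and published tools differ in their details.

```python
import math

STATES = ["low", "intermediate", "high"]
# Hypothetical beta(a, b) emission parameters for each hidden state.
BETA_PARAMS = {"low": (1.0, 8.0), "intermediate": (4.0, 4.0), "high": (8.0, 1.0)}
STAY = 0.9  # hypothetical probability that adjacent CpGs share a state


def log_beta_pdf(x, a, b):
    """Log density of beta(a, b); x is clipped away from exactly 0 and 1."""
    x = min(max(x, 1e-3), 1.0 - 1e-3)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


def log_transition(prev, cur):
    if prev == cur:
        return math.log(STAY)
    return math.log((1.0 - STAY) / (len(STATES) - 1))


def viterbi(levels):
    """Most likely state path for methylation levels of consecutive CpGs."""
    if not levels:
        return []
    score = {s: math.log(1.0 / len(STATES)) + log_beta_pdf(levels[0], *BETA_PARAMS[s])
             for s in STATES}
    backpointers = []
    for x in levels[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_beta_pdf(x, *BETA_PARAMS[cur]))
        backpointers.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


levels = [0.02, 0.05, 0.03, 0.50, 0.45, 0.55, 0.95, 0.90, 0.97]
path = viterbi(levels)
```

Runs of consecutive CpGs assigned the same state can then be reported as candidate methylation pattern regions. This sketch treats the observed level itself as the emission and ignores read depth and the distance between CpG sites.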


When the cell population profiled by WGBS contains more than one cell type, as in blood, cell type-specific methylation at a CpG site means that the methylation level no longer shows the traditional bimodal pattern of very high or very low values. Moreover, the methylation level depends on the proportion of each cell type. At the same time, the presence of other cytosine modifications such as 5hmC makes the methylation status of a region difficult to assess. Cellular heterogeneity and demethylation-related cytosine modifications primarily affect regions in which the methylation level is intermediate. Another class of regions with intermediate methylation levels is allele-specific methylation regions.


The effect of methylation on DNA sequence-protein interactions depends on the number of CpG sites and the type of cytosine modification they carry. Therefore, it is recommended that the identified methylation pattern regions be further classified according to CpG density, length, and similar features.


4 Multi-sample Methylation Spectrum

WGBS datasets usually have few samples and a very large number of sites, so the methods commonly used to identify differentially methylated sites in microarray methylation data cannot be directly applied to the identification of differential methylation in WGBS.


For methylome comparisons of paired samples, such as normal and cancer samples, conventional methods identify differentially methylated regions by segmenting the genome into small windows. Although such methods can recognize differential methylation genome-wide, they do so only at low resolution; a feasible alternative is to first identify differentially methylated sites and then merge them into differentially methylated regions.
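Merging differentially methylated sites into regions can be sketched as below for one pair of samples. This is a simplified illustration: it calls a site differential from the raw difference in methylation level instead of a statistical test, and the thresholds (`min_delta`, `max_gap`, `min_sites`) and the function name are hypothetical.

```python
def call_dmrs(sites, min_delta=0.3, max_gap=300, min_sites=3):
    """Merge adjacent differential CpG sites into regions.

    `sites` is a list of (position, level_in_sample_a, level_in_sample_b).
    Returns a list of (start, end, n_sites, mean_delta) tuples.
    """
    dmrs, run = [], []

    def close_run():
        # Report the current run if it is long enough, then reset it.
        if len(run) >= min_sites:
            mean_delta = sum(delta for _, delta in run) / len(run)
            dmrs.append((run[0][0], run[-1][0], len(run), mean_delta))
        run.clear()

    for pos, level_a, level_b in sorted(sites):
        delta = level_b - level_a
        if abs(delta) < min_delta:
            close_run()  # a non-differential site ends the current run
            continue
        too_far = bool(run) and pos - run[-1][0] > max_gap
        flipped = bool(run) and (delta > 0) != (run[-1][1] > 0)
        if too_far or flipped:
            close_run()
        run.append((pos, delta))
    close_run()
    return dmrs


# Hypothetical paired samples (e.g. normal vs. cancer) at six CpG sites.
sites = [(100, 0.90, 0.20), (150, 0.80, 0.10), (220, 0.85, 0.15),
         (400, 0.50, 0.50), (450, 0.10, 0.90), (2000, 0.10, 0.80)]
dmrs = call_dmrs(sites)
```

Unlike fixed windows, the region boundaries here fall on the first and last differential CpG of each run. Isolated differential sites, such as those at 450 and 2000, are not reported.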


It is worth noting that the identification of differentially methylated regions requires paired samples or two groups of samples and does not apply to comparisons among the methylomes of multiple samples. When comparing the methylomes of multiple samples, it is recommended to use predefined methylation regions as a baseline. The boundaries of genomic features such as promoters and exons depend on transcription start sites and splice sites, so these features are not suitable as reference regions for methylation analysis. By comparison, CpG islands identified from sequence features are a more suitable genomic feature, but the proportion of CpG sites contained in CpG islands is too small, and most CpG islands are unmethylated in most tissues and cell types. Therefore, CpG islands cannot fully reflect the methylome.


5 Conclusion

The development of next-generation sequencing technology has greatly promoted research on DNA methylation. However, the methods for analyzing and processing the various kinds of high-throughput methylation data are still imperfect. Next-generation sequencing based on bisulfite conversion can detect methylated cytosine and, combined with bioinformatics methods, yields the location and level of methylation across the whole genome. Using an appropriate algorithm, CpG sites with similar characteristics are joined into methylation pattern regions with regulatory effects, such as CpG islands. Comparing the methylomes of multiple samples can identify differentially methylated regions and thereby genomic elements with regulatory functions. It is worth noting that the rational use of DNA methylation data remains a challenge because of the complexity of WGBS data resulting from cellular heterogeneity and the discovery of novel cytosine modifications.


Authors’ contributions

WZX wrote and translated the manuscript. ZWM approved the final manuscript. WZX collected materials. Both authors read and approved the final manuscript.



This work was supported by the Heilongjiang scientific research project (grants 201810).



Blaschke K., Ebata K.T., Karimi M.M., et al., 2013, Vitamin C induces Tet-dependent DNA demethylation and a blastocyst-like state in ES cells, Nature, 500(7461): 222-226

PMid:23812591 PMCid:PMC3893718


Chen S., Sanjana N.E., Zheng K., et al., 2015, Genome-wide CRISPR screen in a mouse model of tumor growth and metastasis, Cell, 160(6): 1246-1260

PMid:25748654 PMCid:PMC4380877


Down T.A., Rakyan V.K., Turner D.J., et al., 2008, A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis, Nat Biotechnol., 26(7): 779-785

PMid:18612301 PMCid:PMC2644410


Dudley J.T., and Butte A.J., 2009, A quick guide for developing effective bioinformatics programming skills, PLoS Comput Biol., 5(12): e1000589

PMid:20041221 PMCid:PMC2791169


Feng H., Conneely K.N., and Wu H., 2014, A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data, Nucleic Acids Res., 42(8): e69

PMid:24561809 PMCid:PMC4005660


Hackett J.A., Sengupta R., Zylicz J.J., Murakami K., Lee C., Down T.A., and Surani M.A., 2013, Germline DNA demethylation dynamics and imprint erasure through 5-hydroxymethylcytosine, Science, 339(6118): 448-452

PMid:23223451 PMCid:PMC3847602


Hu X., Zhang L., Mao S.Q., et al., 2014, Tet and TDG mediate DNA demethylation essential for mesenchymal-to-epithelial transition in somatic cell reprogramming, Cell Stem Cell, 14(4): 512-522



Irizarry R.A., Ladd-Acosta C., Wen B., et al., 2009, The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores, Nat Genet., 41(2): 178-186

PMid:19151715 PMCid:PMC2729128


Klengel T., Mehta D., Anacker C., et al., 2013, Allele-specific FKBP5 DNA demethylation mediates gene-childhood trauma interactions, Nat Neurosci., 16(1): 33-41

PMid:23201972 PMCid:PMC4136922


Kohli R.M., and Zhang Y., 2013, TET enzymes, TDG and the dynamics of DNA demethylation, Nature, 502(7472): 472-479

PMid:24153300 PMCid:PMC4046508


Lu X., Han D., Zhao B.S., Song C.X., Zhang L.S., Doré L.C., and He C., 2015, Base-resolution maps of 5-formylcytosine and 5-carboxylcytosine reveal genome-wide DNA demethylation dynamics, Cell Res., 25(3): 386-389

PMid:25591929 PMCid:PMC4349244


Noble W.S., 2009, How does multiple testing correction work? Nat Biotechnol., 27(12): 1135-1137

PMid:20010596 PMCid:PMC2907892


Pennisi E., 2013, The CRISPR craze, Science, 341(6148): 833-836



Robinson M.D., and Smyth G.K., 2008, Small-sample estimation of negative binomial dispersion, with applications to SAGE data, Biostatistics, 9(2): 321-332



Stelzer Y., Shivalila C.S., Soldner F., Markoulaki S., and Jaenisch R., 2015, Tracing dynamic changes of DNA methylation at single-cell resolution, Cell, 163(1): 218-229

PMid:26406378 PMCid:PMC4583717


Varshney G.K., Zhang S., Pei W., et al., 2016, CRISPRz: a database of zebrafish validated sgRNAs, Nucleic Acids Res., 44(D1): D822-D826

PMid:26438539 PMCid:PMC4702947


Wong N.C., Pope B.J., Candiloro I.L., et al., 2016, MethPat: a tool for the analysis and visualisation of complex methylation patterns obtained by massively parallel sequencing, BMC Bioinformatics, 17(1): 98

PMid:26911705 PMCid:PMC4765133


Zhang Y., Liu H., Lv J., et al., 2011, QDMR: a quantitative method for identification of differentially methylated regions by entropy, Nucleic Acids Res., 39(9): e58

PMid:21306990 PMCid:PMC3089487
