CTCFBSDB 2.0: A database for CTCF binding sites and genome organization

Table of Contents

  1. Background
  2. Sources of binding sites
  3. Database access and content
  4. In silico CTCFBS prediction tool
  5. Recent updates
  6. Download
  7. References
  8. Contact us

1. Background

CCCTC-binding factor (CTCF) is a versatile transcription regulator that is evolutionarily conserved from fruit fly to human. CTCF binds to different DNA sequences through combinatorial use of its 11 zinc fingers and shows distinct functions (transcription activation/repression and chromatin insulation) depending on the biological context [1,2]. Insulators, which function in enhancer blocking and domain bordering, are critical regulatory elements for gene expression control [3,4]. They represent a class of diverged DNA sequences capable of shielding genes against inappropriate cis-regulatory signals from their genomic neighborhood. Recent studies have also linked insulators to epigenetic phenomena such as imprinting [5,6] and X-chromosome inactivation [7]. In eukaryotic genomes, maintenance of distinct chromatin domains is critical for transcription control, and CTCF has been identified as playing a crucial role in the global organization of chromatin architecture [2]. Evidence for this CTCF function has been strengthened by Hi-C experiments showing that interacting genomic regions commonly contain CTCF binding sites and that the boundaries of genomic topological domains are enriched for CTCF binding sites [8,9,10]. To analyze this important type of DNA regulatory element, we created a CTCF binding site database (CTCFBSDB), a comprehensive collection of experimentally determined and computationally predicted CTCF binding sites (CTCFBS) from the literature. The database is designed to facilitate studies of insulators and their roles in demarcating functional genomic domains.

2. Sources of binding sites

Currently, the database contains almost 15 million experimentally determined CTCF binding sequences across several species. CTCF binding sequences were collected from 12 published papers containing CTCF binding sites identified using ChIP-Seq or similar methods, data from the ENCODE project, and a set of approximately 100 manually curated binding sites identified by low-throughput experiments. Each record in the database is identified by a prefix containing information about the data source appended to a number, creating a unique identifier for each binding sequence. For binding site datasets from ENCODE, the cell type, or the cell type and experimental treatment, is added to the end of the identifier prefix. The following table summarizes the data sources:

PubMed ID / ENCODE project      Genome     Cell type                 Binding site ID prefix
15256511                        mm8        Fetal liver (E16)         INSUL_OHL
17382889                        hg18       IMR90                     INSUL_REN
17512414                        hg18       CD4+ T                    INSUL_ZHAO
18684996                        hg18       CD4+ T                    INSULH_JOTHI
19056695                        hg18       CD4+ T, HeLa, Jurkat      INSULH_CUDDAPAH
21602820                        mm9        ES, MEF                   INSULM_MARTIN
21602820                        hg18       CD4+ T, HeLa, Jurkat      INSULH_MARTIN
21602820                        galgal3    RBC 5 days, RBC 10 days   INSULC_MARTIN
18555785                        mm8        ES                        INSULM_CHEN
22153082                        hg18       HeLa                      INSULH_RHEE
21685913                        mm8        ES                        INSULM_HANDOKO
20526341                        hg18       ES                        INSULH_KUNARSO
22244452                        hg19       Liver                     SCHMIDT_hg19_
22244452                        mm9        Liver                     SCHMIDT_mm9_
22244452                        rn4        Liver                     SCHMIDT_rn4_
22244452                        mmul1      Liver                     SCHMIDT_mmul1_
22244452                        canFam2    Liver                     SCHMIDT_canFam2_
22244452                        monDom5    Liver                     SCHMIDT_monDom5_
Broad Histone                   hg18       9 cell types/treatments   ENCODE_Broad_hg18_
Open Chromatin                  hg18       14 cell types/treatments  ENCODE_OC_hg18_
UW Histone                      hg18       23 cell types/treatments  ENCODE_UW_hg18_
Broad Histone                   hg19       15 cell types/treatments  ENCODE_Broad_hg19_
HAIB TFBS                       hg19       8 cell types/treatments   ENCODE_HAIB_hg19_
SYDH TFBS                       hg19       2 cell types/treatments   ENCODE_SYDH_hg19_
UTA TFBS                        hg19       18 cell types/treatments  ENCODE_UTA_hg19_
UW CTCF binding                 hg19       56 cell types/treatments  ENCODE_UW_hg19_
Caltech TFBS                    mm9        1 cell type/treatment     ENCODE_CalTech_mm9_
LICR TFBS                       mm9        10 cell types/treatments  ENCODE_Licr_mm9_
PSU TFBS                        mm9        4 cell types/treatments   ENCODE_PSU_mm9_
Stan/Yale TFBS                  mm9        3 cell types/treatments   ENCODE_SYDH_mm9_
17382889, 17442748              hg18, mm8  Predicted binding sites   INSUL_PRE
Manually curated binding sites  Various    Various                   MAN

3. Database access and content

Browse
The contents of the database can be browsed in three main ways.
(1) "Experimentally Identified CTCFBS" allows for browsing CTCF binding sites for each species. For species with CTCF binding sites identified under many conditions, such as human and mouse, records are displayed for one specific cell type and chromosome at a time.
(2) "Browse Topological Domains" allows for browsing CTCF binding sites within the topological domains determined through recent Hi-C experiments for the mm9 and hg18 genomes [10]. The boundaries of the topological domains are enriched for CTCF binding sites, indicating that CTCF binding at these locations is likely to play a role in higher-order genomic organization.
(3) "Predicted CTCFBS" allows for browsing predicted CTCF binding sites.

Search
The contents of the database can also be accessed by searching for either CTCF binding sites within a particular genomic region or by searching for the unique identifier of the binding sequence. Genomic region searches can be limited to provide the records for either particular cell types or a specific dataset.

Content
Each CTCF binding site in the database is annotated with data collected from multiple resources that is displayed on the main record page of that binding site. This page is organized into the following sections:
(1) Description: A table containing general information about the binding site and its potential functional significance.

Column label: Description
ID: Unique identifier for the binding sequence in the database
Species and Build: The species and genomic build in which the binding site was identified
Location: The genomic location of the binding sequence
ENCODE: Whether or not the binding site was identified in an ENCODE dataset
Source: The PubMed ID or ENCODE accession number of the source containing the binding site
Cell type and Experiment type: Experimental conditions in which the binding site was identified
Occupancy: A numeric value (i.e., read count for ChIP-Seq experiments or signal strength for ENCODE data) indicating the extent to which the binding site was occupied in the experiment. If the data source did not report any occupancy data, this field is left empty.
Occupancy%: As the occupancy values reported for different experiments had different scales, we calculated the percentile of the occupancy value for each binding site within its dataset to allow for comparisons of occupancy across experiments. A value of 99 indicates that the binding site had one of the top 1% highest occupancies within the dataset.
M1M2 Class: See Description of M1 and M2 classes below.
ENCODE PEAK location: The location of the peak of the binding site, if available, for ENCODE datasets
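
The Occupancy% idea can be illustrated with a minimal Python sketch. The exact percentile definition used by the database is not stated, so this example assumes the percentile is the share of sites in the same dataset with a strictly lower occupancy; the function name and the sample values are hypothetical.

```python
from bisect import bisect_left


def occupancy_percentiles(occupancies):
    """Return, for each value, the percentage of values in the same
    dataset that are strictly lower (assumed percentile definition)."""
    ranked = sorted(occupancies)
    n = len(ranked)
    return [100.0 * bisect_left(ranked, value) / n for value in occupancies]


# Two hypothetical datasets reported on very different scales
chip_seq_reads = [5, 12, 40, 7, 300]
encode_signal = [0.2, 1.5, 0.9, 0.1, 4.0]

print(occupancy_percentiles(chip_seq_reads))  # [0.0, 40.0, 60.0, 20.0, 80.0]
print(occupancy_percentiles(encode_signal))   # [20.0, 60.0, 40.0, 0.0, 80.0]
```

Because each value is ranked only against its own dataset, a read count of 300 and a signal strength of 4.0 both map to the same percentile, which is what makes occupancies comparable across experiments.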

Description of M1 and M2 classes
Schmidt et al. [11] recently investigated CTCF binding sites across six mammalian species and found two main modes of CTCF-DNA interaction. For the majority of CTCF-DNA binding events, the N-terminal zinc fingers interact with the 14 bp long M1 motif. In a subset of binding events, the C-terminal fingers additionally interact with a shorter M2 motif, creating a 34 bp long M1+M2 motif. In the most common arrangement of sites containing M1+M2 motifs, the half-site distance between M1 and M2 was 21 or 22 bp. To classify the CTCF binding sites by type of binding event, the sequence of each binding site was scanned for matches to the M1 and M2 CTCF binding motifs described by Schmidt et al. [11] and provided here, using the nmscan module of NestedMICA with a cutoff of -15. Each site was then classified as None, M1, M1M2, or M1M2_21_22 based on the following criteria:
(a) None: The binding sequence does not match the M1 motif.
(b) M1: The binding sequence contains a sequence that matches the M1 motif.
(c) M1M2: The binding sequence contains sequences matching the M1 and M2 motifs that were separated by a half-site distance of 12 bp to 42 bp.
(d) M1M2_21_22: The binding sequence contains sequences matching the M1 and M2 motifs that were separated by a half-site distance of 21 or 22 bp (i.e., the distance between the M1 end and M2 start was 8 or 9 bp).
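
The criteria above can be sketched in Python as follows. This is an illustration only, not the database's code: it assumes motif hits have already been produced by a scanner and are given as half-open (start, end) intervals, it ignores strand, and it checks the most specific class first. The offset of 13 bp between gap and half-site distance is inferred from criterion (d), where a half-site distance of 21 bp corresponds to an 8 bp gap.

```python
# Inferred from criterion (d): half-site distance 21 <-> gap of 8 bp
HALF_SITE_OFFSET = 13


def classify_site(m1_hits, m2_hits):
    """Classify a binding sequence from M1 and M2 motif hits.

    Hits are (start, end) half-open intervals within the binding
    sequence (an assumed input format).
    """
    if not m1_hits:
        return "None"
    distances = []
    for m1_start, m1_end in m1_hits:
        for m2_start, m2_end in m2_hits:
            # Gap between the nearest edges of the two hits, in bp
            gap = max(m2_start - m1_end, m1_start - m2_end)
            distances.append(gap + HALF_SITE_OFFSET)
    if any(d in (21, 22) for d in distances):
        return "M1M2_21_22"
    if any(12 <= d <= 42 for d in distances):
        return "M1M2"
    return "M1"


# Hypothetical hits: a 14 bp M1 match followed by an M2 match 8 bp later
print(classify_site([(10, 24)], [(32, 44)]))    # M1M2_21_22
print(classify_site([(10, 24)], [(50, 62)]))    # M1M2
print(classify_site([(10, 24)], []))            # M1
print(classify_site([], [(32, 44)]))            # None
```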

(2) Topological Domains
For CTCF binding sites identified in the hg18 and mm9 genomes, this section describes the topological domain, as found in recent Hi-C experiments, that contains the CTCF binding site [10]. The boundaries of the topological domains were found to be enriched for CTCF binding sites. Therefore, CTCF binding sites located a short distance from a domain boundary may be more likely to play a role in higher-order genomic organization. Topological domains were not defined at the ends of chromosomes, so binding sites at the chromosome ends may not display topological domain data.
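
As a rough illustration of the distance-to-boundary value, the following Python sketch locates the domain containing a site and returns the distance to the nearer boundary. How the database measures this distance is not stated, so measuring from the site midpoint is an assumption, and the function name and coordinates are hypothetical.

```python
def boundary_distance(site_start, site_end, domains):
    """Return the distance (bp) from the site midpoint to the nearest
    boundary of the domain containing it, or None if the site lies
    outside every domain (e.g., near a chromosome end)."""
    midpoint = (site_start + site_end) // 2
    for domain_start, domain_end in domains:
        if domain_start <= midpoint < domain_end:
            return min(midpoint - domain_start, domain_end - midpoint)
    return None


# Hypothetical domains on one chromosome, as (start, end) pairs
domains = [(1_000_000, 2_000_000), (2_040_000, 3_000_000)]

print(boundary_distance(1_050_000, 1_050_200, domains))  # 50100
print(boundary_distance(2_010_000, 2_010_100, domains))  # None
```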

(3) Flanking Gene expression
The flanking gene expression track compares the expression status of genes flanking the CTCF binding site. Red indicates overexpression in the tissue, and green indicates underexpression. The expression data were obtained from the GNF Gene Expression Atlas 2 [12], which contains genome-wide gene expression profiles of 61 mouse tissues and 79 human tissues. The raw data were base 2 log-transformed and normalized to have a mean of zero and a standard deviation of one. The images were generated using the BioHeatmap JavaScript library.
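
The log transformation and normalization can be sketched in Python as below. The text does not say whether normalization was applied per gene or per array, nor which standard deviation estimator was used; this sketch assumes one gene's values across tissues and the population standard deviation. The function name and values are hypothetical.

```python
import math
from statistics import mean, pstdev


def normalize_expression(raw_values):
    """Base 2 log-transform positive raw values, then rescale them to
    zero mean and unit (population) standard deviation."""
    logged = [math.log2(value) for value in raw_values]
    mu = mean(logged)
    sigma = pstdev(logged)
    if sigma == 0:
        # All tissues identical: no over- or underexpression to show
        return [0.0 for _ in logged]
    return [(x - mu) / sigma for x in logged]


# Hypothetical raw intensities for one gene in four tissues
z_scores = normalize_expression([100, 200, 400, 800])
print([round(z, 3) for z in z_scores])
```

Under this convention, positive values would be drawn toward red (overexpression) and negative values toward green (underexpression).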
Additionally, for human CTCF binding sites, we display the expression of flanking genes determined using RNA-Seq experiments in 10 tissues. The RNA-Seq expression data were obtained from the RNA-Seq Atlas and are displayed using Google Charts as the number of reads per kilobase of exon per million mapped reads (RPKM).
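
For readers unfamiliar with the unit, RPKM follows directly from its name: the read count is divided by the exon length in kilobases and by the number of mapped reads in millions. A small Python sketch with hypothetical numbers:

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp of exon, 20 million mapped reads
print(rpkm(500, 2_000, 20_000_000))  # 12.5
```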

(4) Overlapping CTCF binding sequences
A list of CTCF binding sequences found in other studies or cell types that overlap the CTCF binding sequence. This list can be used to investigate if the binding event is conserved across tissues or is specific to a single cell type.
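
The overlap test behind such a list can be sketched in Python as follows. This is an illustration under stated assumptions, not the database's implementation: sites are assumed to be (id, chromosome, start, end) tuples with half-open coordinates, and the identifiers shown are placeholders rather than real database IDs.

```python
def overlapping_sites(query, sites):
    """Return the IDs of sites on the query's chromosome whose
    half-open interval shares at least one base with the query."""
    q_chrom, q_start, q_end = query
    return [
        site_id
        for site_id, chrom, start, end in sites
        if chrom == q_chrom and start < q_end and q_start < end
    ]


# Placeholder records from three hypothetical datasets
sites = [
    ("site_A", "chr1", 1_000, 1_200),
    ("site_B", "chr1", 1_150, 1_400),
    ("site_C", "chr2", 1_000, 1_200),
    ("site_D", "chr1", 5_000, 5_200),
]

print(overlapping_sites(("chr1", 1_100, 1_180), sites))  # ['site_A', 'site_B']
```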

(5) Genome Browser
The genomic context of the CTCF binding site. Custom tracks displaying topological domains and overlapping CTCF sites, if selected, are displayed in addition to tracks showing the sequence of the site, UCSC genes, and SNPs that may disrupt binding to the site. As methylation has been shown to disrupt CTCF binding, the Genome Browser viewer also displays the ENC DNA Methyl track for the hg19 genome and the HAIB Methyl-Seq and HAIB Methyl27 tracks for the hg18 genome, which contain methylation data generated by the ENCODE project.

4. In silico CTCFBS prediction tool

CTCF uses different combinations of its zinc fingers to recognize divergent DNA sequences. Recent studies have identified core motifs for CTCFBS sequences, and these motifs are represented by position weight matrices (PWMs). Altogether, six PWMs derived to accommodate the divergence of CTCFBS sequences have been identified and included in the web tool [11,13,14]. The EMBL_M1 and EMBL_M2 motifs were identified by Schmidt et al. [11] and are further described here; the Ren_20 motif was identified by Kim et al. [13]; and the LM2, LM7, and LM23 motifs were identified by Xie et al. [14]. The specific PWMs used in the CTCFBS prediction tool are available here.

We offer users a simple web tool to search for CTCFBS core motifs in a query sequence. The tool uses the STORM program [15] with each of the six PWMs to report the single best hit in the query sequence. The PWM score corresponds to the log-odds of the observed sequence being generated by the motif versus being generated by the background. A large positive score therefore suggests a good match, while a negative score indicates that the best match in the query sequence was worse than would be expected in a random sequence of the same length. Usually, a short sequence with a PWM score >3.0 is a suggestive match. Note that the prediction tool returns only the single best match for each PWM, so the query sequence may include additional high-scoring sequences that match a CTCF binding motif.
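
To make the log-odds score concrete, here is a minimal Python sketch of best-hit scanning. It does not reproduce STORM: the motif is a toy 4-column matrix rather than one of the six CTCF PWMs, the background is assumed uniform, base 2 logarithms are an assumption, and only the forward strand is scanned. All names are hypothetical.

```python
import math

# Assumed uniform background nucleotide frequencies
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Toy probability matrix favoring "CCCT"; NOT a real CTCF PWM
TOY_MOTIF = [
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]


def log_odds(window, motif, background=BACKGROUND):
    """Sum over positions of log2(P(base | motif) / P(base | background))."""
    return sum(
        math.log2(column[base] / background[base])
        for column, base in zip(motif, window)
    )


def best_hit(sequence, motif):
    """Return (start, matched_sequence, score) for the best-scoring window.

    Expects an uppercase A/C/G/T sequence at least as long as the motif.
    """
    width = len(motif)
    scored = [
        (log_odds(sequence[i:i + width], motif), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, start = max(scored)
    return start, sequence[start:start + width], score


start, site, score = best_hit("AAGCCCTAA", TOY_MOTIF)
print(start, site, round(score, 2))

# A sequence with no good match still yields a best hit, with a negative score
print(best_hit("AAAAAA", TOY_MOTIF))
```

As in the web tool, only one hit per matrix is returned, so a second strong match elsewhere in the sequence would go unreported.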

Bug fix (February 12, 2013): We have corrected a bug in the CTCF binding site prediction tool that may have caused it to report incorrect locations for sequences matching CTCF binding motifs or to miss high-scoring sequence matches altogether.

5. Recent updates

We recently completed a major update to CTCFBSDB that both substantially expands the database content and provides new features, including topological domains, visualization of overlapping binding sites, binding site occupancy, and classification of binding sites according to matches with the M1 and M2 motifs. The following table compares the original contents of the database with the contents of CTCFBSDB 2.0 as of 6/28/2012.

Content                                                                  CTCFBSDB 1.0   CTCFBSDB 2.0
Experimentally determined CTCF binding sites                             34,420         14,735,367
Experimental CTCF binding sites with occupancy data                      0              14,216,654
Sources of CTCF binding sites (excluding ENCODE data)                    4              12
ENCODE CTCF binding site datasets                                        0              163
Genomes with >100 experimental CTCF binding sites                        2              9
Human CTCF binding sites                                                 61,852         13,760,124
Mouse CTCF binding sites                                                 6,552          821,858
Human topological domains                                                NA             7,947
Mouse topological domains                                                NA             8,937
Position weight matrices used to scan sequences for CTCF binding motifs  4              6

6. Download

The CTCF binding sites in the database can be downloaded here.

7. References

1. Ohlsson, R., Renkawitz, R. and Lobanenkov, V. (2001) CTCF is a uniquely versatile transcription regulator linked to epigenetics and disease. Trends Genet, 17, 520-527.
2. Phillips, J.E. and Corces, V.G. (2009) CTCF: master weaver of the genome. Cell, 137, 1194-1211.
3. Bell, A.C., West, A.G. and Felsenfeld, G. (2001) Insulators and boundaries: versatile regulatory elements in the eukaryotic genome. Science, 291, 447-450.
4. West, A.G., Gaszner, M. and Felsenfeld, G. (2002) Insulators: many functions, many mechanisms. Genes Dev, 16, 271-288.
5. Hark, A.T., et al. (2000) CTCF mediates methylation-sensitive enhancer-blocking activity at the H19/Igf2 locus. Nature, 405, 486-489.
6. Bell, A.C. and Felsenfeld, G. (2000) Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature, 405, 482-485.
7. Chao, W., et al. (2002) CTCF, a candidate trans-acting factor for X-inactivation choice. Science, 295, 345-347.
8. Lieberman-Aiden, E., et al. (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326, 289-293.
9. Botta, M., et al. (2010) Intra- and inter-chromosomal interactions correlate with CTCF binding genome wide. Mol Syst Biol, 6, 426.
10. Dixon, J.R., et al. (2012) Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485, 376-380.
11. Schmidt, D., et al. (2012) Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell, 148, 335-348.
12. Su, A.I., et al. (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A, 101, 6062-6067.
13. Kim, T.H., et al. (2007) Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell, 128, 1231-1245.
14. Xie, X., et al. (2007) Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of CTCF insulator sites. Proc Natl Acad Sci U S A, 104, 7145-7150.
15. Schones, D.E., et al. (2007) Statistical significance of cis-regulatory modules. BMC Bioinformatics, 8, 19.

8. Contact us

Please send questions and comments to Dr. Yan Cui at the University of Tennessee Health Science Center.