Introduction

Gliomas are the most common primary brain tumor and result in over 15,000 deaths a year in the United States alone1. These tumors are often initially discovered as slow-growing low-grade gliomas (Grade II gliomas)2,3. However, inevitably these tumors undergo a transformation into faster growing more aggressive higher-grade tumors (Grade III and Grade IV gliomas). Following diagnosis, standard clinical management includes maximal safe resection and adjuvant treatment (chemotherapy and radiation)1,3,4,5. This often leads to a period of clinical stability where there is no visible tumor on serial MRI scans. However, the tumors inevitably recur. While primary grade IV gliomas (i.e. glioblastomas) can have periods of stability following resection and adjuvant therapy, once they recur, recurrent glioblastomas are typically treatment resistant and clinical stability is rare6,7. Whether this is due to a natural progression of the tumor or as a response to typically used DNA-damaging therapies (radiation and chemotherapy) is being investigated8.

The transformation from a normal cell to a low-grade glioma and eventually a high-grade glioma involves the dysregulation of many gene pathways and cellular processes9,10. Much attention has been paid to deletions (e.g. PTEN11,12, p5313, 1p/19q14,15) and amplifications (EGFR16, MYC17) of specific gene regions that are frequently encountered in these tumors. However, there is a similar and equally important pathology in the epigenetic dysregulation of these gene pathways that has received less attention. A thorough exploration of the epigenetic landscape of these tumors may lead to the discovery of additional important biomarkers.

The discovery of the IDH1 mutation (frequently found in low-grade gliomas and secondary high-grade gliomas) has brought additional focus and attention to the problem of epigenetic dysregulation in gliomas. The presumed mechanism of the IDH1 mutation is to methylate cytosine nucleotides in DNA and induce a more closed inaccessible DNA chromatin with subsequent gene down-regulation18,19,20 however atac-seq data on IDH1 mutant glioma samples is limited.

Atac-seq is a relatively new sequencing technology that identifies open accessible DNA regions (e.g. “peaks”) with very little cellular input enabling the analysis of surgically resected samples. Its low-cost and relative ease has led to wide-spread use across multiple cell types. However, when atac-seq data is used in isolation without additional lines of supporting evidence, it has been difficult to determine how best to interpret the biological significance of these open regions (measured as “peaks” from aligned sequencing reads). Data sets involving multiple modalities (atac-seq and RNA-seq) on the same samples would be useful in supporting or refuting these assumptions. To address these short-comings this study combines a large set of atac-seq and RNA-seq data from surgically removed glioma specimens to create an atac-seq/RNA-seq matrix to identify highly correlated peaks and genes. By correlating these data sets with demographic and survival data we have identified open DNA regions that have prognostic and thereby likely biological significance.

Methods

Generation of clinical database

The University of Cincinnati maintains an IRB-approved biorepository of surgically resected specimens with associated clinical and demographic information including medical record number, diagnosis, name, gender, age, date of collection and survival. A collection of 38 specimens were chosen from the bio-repository with a goal to obtain a heterogenous mixture of low-grade and high-grade specimens as well as those with long and short survival. Specimens were between 2 and 9 years old prior to selection to allow for adequate follow-up for survival analysis. All specimens denoted as “Grade II,” “Grade III,” or “Grade IV primary” are taken from the patient’s initial surgery and have not been exposed to any chemotherapy or radiation treatment. All grades were assigned based on the WHO guidelines of that time. Since then, grading guidelines have changed to include more molecular markers (e.g. IDH1, p53, 1p/19q)21. Following the initial surgery all patients underwent standard therapy including temozolomide and radiation. Survival was calculated as time between surgical resection and death. The presence of the IDH1 mutation was determined via either immunostaining or by direct sequencing.

Protocol for generating Atac-seq libraries

Nuclei were isolated from patient tumor tissue as described by Habib et al.22, with slight modifications. Briefly, frozen tissue samples (< 0.5 cm) were incubated on ice in 1 mL of Nuclei EZ Lysis Buffer (Sigma, #Nuc-101). The tissue was then homogenized using a glass tissue grinder. 1 mL of additional EZ Lysis Buffer was added, the tissue was incubated on ice for 5 min, and then filtered through a 70 μm cell strainer. Nuclei were collected by centrifugation and resuspended in 1 mL ATAC-resuspension buffer (10 mM Tris–HCl (pH 7.4), 10 mM NaCl, 3 mM MgCl2, and 0.1% Tween-20) and filtered through a 40 μm cell strainer. 1 mL of ATAC-resuspension buffer was then added and the nuclei were filtered through a 5 μm cell strainer. ATAC-seq library preparation was performed as previously described23.

RNA-seq library creation

Frozen patient tumor tissue (< 0.5 cm) was placed into pre-chilled RNA-ice later (Invitrogen, #AM7030) for at least 24 h at − 20 °C. The tissue was then homogenized with Lysis/Binding Buffer (Invitrogen, #00671513) in lysing matrix D tubes (MP, #116913050-CF) for 40 s at 6 m/s, using the MP FastPrep-24 5G Instrument (MP, #116005500). RNA (≥ 200 nucleotides) was then purified using the Quick-RNA MiniPrep Plus Kit (Zymo Research, #R1057).

Processing of ATAC-seq data

ATAC-seq reads in FASTQ format were first subjected to quality control to assess the need for trimming of adapter sequences or bad quality segments. The programs used in these steps were FastQC v0.11.7, Trim Galore! v0.4.2 and cutadapt v1.9.1. The trimmed reads were aligned to the reference human genome version GRCh38/hg38 with the program HISAT2 v2.0.5. Aligned reads were stripped of duplicate reads with the program sambamba v0.6.8. Peaks were called using the program MACS v2.1.2 using the broad peaks mode.

To obtain the consensus set of unique peaks, called peaks from all samples are merged at 50% overlap using BEDtools v2.27.0. The consensus peaks, originally in BED format were converted to a Gene Transfer Format (GTF) to enable fast counting of reads under the peaks with the program featureCounts v1.6.2. Each feature in the GTF file has the value “peak” on the second column. Peaks located on chromosomes X, Y and mitochondrial DNA are excluded from further analysis. Raw read counts are normalized with respect to library size and transformed to log2 scale using rlog() function in R package DESeq2 v1.26.0.

Random forest regression model

Log-transformed data were filtered to remove low variance peaks using the VarianceThreshold function from scikit-learn v0.24.1 using a threshold of 0.45. The resulting 8439 features were used to build a random forest regression model using RandomForestRegressor from the ensemble module of scikit-learn. The options used was n_estimators: 1000, max_samples = 0.9, oob_score = True. Permutation importance was used to rank the peaks. The out-of-bag R2 value for the regression was 0.31.

Elastic-net regularized generalized linear models

We selected the top 20 peaks with highest mean feature importance value from 8439 peaks. To assess the survival predictive power of clinical metadata and top 20 ATAC-seq peaks, we trained elastic-net regularized generalized linear models using ‘glmnet’ package in R. We tested the performance of three different models that are built using following features (1) top 20 ATAC-seq peaks (2) clinical metadata only; (3) clinical metadata + top 20 ATAC-seq peaks. Leave one out cross validation imputation was applied and alpha parameter was explored between 0 and 1 with incrementing step size of 0.1. Out of 36 samples, IDH mutation information of 4 samples was not available. These 4 samples were excluded from models which include clinical metadata as features. Prediction accuracy of models was assessed by computing R2 value.

Processing of RNA-seq data

The quality control check on RNA-seq reads was performed with FastQC v0.11.7. Adapter sequences and bad quality segments were trimmed using Trim Galore! v0.4.2 and cutadapt v1.9.1. The trimmed reads were aligned to the reference human genome version GRCh38/hg38 with the program STAR v2.6.1e. Duplicate aligned reads were removed using the program sambamba v0.6.8. Gene-level expression was assessed by counting features for each gene, as defined in the NCBI’s RefSeq database. Read counting was done with the program featureCounts v1.6.2 from the Rsubread package. Raw counts were normalized with respect to library size and transformed to log2 scale using rlog() function in R package DESeq2 v1.26.0.

ATAC-seq and RNA-seq correlation analysis

Sixteen samples have atac-seq chromatin accessibility as well as RNA-seq gene expression data which is used for correlation analysis. To compare the chromatin accessibility and gene expression, we identified the peak-gene pairs which are located within same topologically associating domain (TAD). Peaks having at least 90% overlap with TAD are selected for further analysis. Out of 8439 peaks, 8419 peaks show ≥ 90% overlap with TAD region and 8416 peaks have at least 1 gene within the overlapping TAD. Pearson correlation coefficient was calculated for each peak-gene pair.

Analysis of differential chromatin accessibility

The significant changes of chromatin accessibility in each subgroup (Grade II, Grade III, Grade IV Primary and Grade IV Recurrent) were assessed with the R package DESeq2 v1.26.0. The 8439 most important peaks from the random forest regression model are selected for the differential analysis. Samples from a specific subgroup (e.g. Grade II) are compared against remaining samples from the dataset. Differential peaks with log2FoldChange ≥ 0.3 and FDR ≤ 0.05 are selected for motif enrichment analysis using HOMER v4.10. Remaining peaks from initial set of 8439 peaks are used as background data set in HOMER analysis. Only known human motifs from CIS-BP 2.0 are considered.

Radiation treatment

Three previously characterized patient-derived cultures (HK29624, HK35724, TS60325) underwent a radiation dose of 9 Gy in three fractions in a XenX (X-Strahl, Suwanee, GA) pre-clinical cabinet irradiator, with output calibrated using NIST-traceable instruments. The instrument parameters were: 220 kVp, 0.67 mm Cu HVL, dose rate ~ 6 cGy/s, delivered using a 100 mm × 100 mm collimator with 2 cm backscatter below the cell plates.

Sequencing database

The human database is publicly available from “basespace” login.illumina.com. All experimental protocols were approved by the University of Cincinnati IRB committee. This study was used de-identified patient specimens and was designated “Not human research” 19-02-25-02 (5/3/2019). Consent is not applicable.

Human database

The creation of the human database followed all methods in accordance with relevant guidelines and regulations. This study was used de-identified patient specimens and was designated “Not human research” 19-02-25-02 (5/3/2019). Consent is not applicable.

Consent to participate

This study does not involve human subjects.

Results

Atac-seq/RNA-seq correlation matrix creates chromatin maps with candidate target genes

This study includes thirty-eight glioma specimens spanning all grades (Grade II-7 specimens, Grade III-9 specimens, Grade IV primary-9 specimens, Grade IV recurrent-13 specimens). Seventeen specimens were determined to have an IDH1 mutation, 17 specimens were determined to be IDH1 wildtype. In four specimens, the IDH1 mutant status could not be determined. The collection includes three paired specimens in which a single patient underwent two surgeries (MG 18/MG19, MG20/MG21, MG27/28). Sixteen samples were of sufficient quality to allow RNA-sequencing analysis. The demographic and clinical information is shown in Table 1. Restricting analysis to those peaks with a univariate correlation of |r2| ≥ 0.6 revealed that the majority of peaks were positively correlated with survival (Fig. 1A) and positively correlated with gene expression (Fig. 1B). Gene ontology enrichment analysis of the identified correlated genes (peak-survival correlation |r2| ≥ 0.6 and peak-gene correlation |r2| ≥ 0.6) identified “Nervous System Development” as the most enriched gene module (Fig. 1C). Clustering using Uniform Manifold Approximation and Projection (UMAP) identified two clusters that segregated by IDH1 mutant status (Fig. 1D,E). While it is widely presumed that the IDH1 mutation induces DNA methylation and a closed conformation18,19,20, there was a non-significant trend (p = 0.1) towards IDH1 mutant samples having more peaks (Fig. 1F).

Table 1 Demographic and clinical information for 38 glioma specimens spanning all grades (Grade II-7 specimens, Grade III-9 specimens, Grade IV prary-9 specimens, Grade IV recurrent-13 specimens).
Figure 1
figure 1

(A) Univariate correlations between peaks and survival are plotted above for |r2| > 0.5, |r2| > 0.6, and |r2| > 0.7. The vast majority of the correlations between peaks and survival are positively correlated. (B) Univariate correlations between peaks and genes are plotted above for |r2| > 0.5, |r2| > 0.6, and |r2| > 0.7. The vast majority of the correlations between peaks and genes are positively correlated. (C) Peak-gene correlations of |r2| > 0.6 were subjected to gene ontology analysis. Significantly enriched gene modules are shown above. (D,E) UMAP clustering of all 38 samples coded by grade (D) and IDH1 status (E). (F) Total peaks (Transcript per million).

Prognostic model predicts identifies atac-seq peaks associated with grade/survival

Restricting the focus to only those genes with the most prognostic value, a random forest model identified 8439 peaks that were predictive of survival. The majority of these peaks mapped to intergenic regions. Using the atac-seq/RNA-seq correlation matrix, 29,679 genes were correlated with the identified 8439 peaks (gene-peak correlation |r2| ≥ 0.6). Contrary to the hypothesis that peaks target the “nearest gene”, only 1863 (2.5%) of these correlations were between a peak and its “nearest gene”. Linear regression models identified six peaks that predicted survival with a high degree of accuracy (R2 = 0.6845). The six most closely correlated target gene regions are shown in Fig. 2A. Four of the annotated genes have reported roles in cancer. GUCY1B326,27 and IMPAD128,29 are reported oncogenes while GRIA230 and LZTS131 are reported tumor suppressors. This peak linear regression model was compared against a model using established predictive clinical variables (age, gender, grade, pathology and IDH1 status) which yielded a higher degree of accuracy (R2 = 0.7271 Fig. 2B). Combining the atac-seq peaks and clinical variables created the most accurate model (R2 = 0.8164 Fig. 2C) indicating that four of the atac-seq peaks have predictive value independent of traditionally used clinical variables.

Figure 2
figure 2

(A) The six most predictive atac-seq peaks were used to create a linear regression model to predict survival. The constants associated with this linear equation are shown in column two and the most correlated gene in column 3. (B) Six typically used clinical and demographic variables were used to create a linear regression model to predict survival. (C) Combining atac-seq peaks and clinical variables into a linear regression model yielded the most accurate prediction model.

Recurrent grade IV gliomas are associated with fewer open chromatin sites

Each grade was compared to all others to identity any differentially represented atac-seq peaks (Fig. 3). These grade-specific peaks then underwent HOMER motif analysis to determine which likely transcription factors drove gene expression in each grade. Finally, peak-associated genes were identified and underwent gene ontology analysis to determine candidate target gene pathways for each grade. Grade II (2326 peaks, 3204 genes), grade III (1708 peaks, 2508 genes), and grade IV (primary) (3058 peaks, 3819 genes) samples had similar and heavily overlapping sets of transcription factor motifs and enriched gene modules (Fig. 3). In contrast, grade IV recurrent (14 peaks, 37 genes) had relatively few peaks and no significantly enriched gene modules.

Figure 3
figure 3

Each of the 8349 peaks was assigned to one of the four grades (Grade II, Grade III, Grade IV primary, Grade IV recurrent) based on representation in that group. These peaks underwent HOMER analysis to determine enriched transcription factor motifs and gene ontology based on the associated genes. Grade IV recurrent samples had very few peaks and no significantly enriched gene modules.

To determine if this finding was specific to the identified 8439 peaks or part of a larger trend, the total peak count was calculated for each of the grades. Similar to the previous finding, grade IV (recurrent) samples had significantly fewer total peaks than grade II samples (p = 0.02) (Fig. 4A,B). This collection contains three “paired” samples each of which were grade IV that underwent resection followed by standard adjuvant treatment (temozolomide and radiation) and later a second resection. In each case, the later sample (“recurrent” sample) had fewer peaks (Fig. 4C). This invites two possibilities. The first possibility is the decrease in peak count is a selection process where only cells with fewer peaks grow back. The second possibility is that the decrease in peak count is a consequence of the clinical treatment (e.g. surgery, temozolomide and radiation). In support of the latter induction hypothesis, MG 30/31 was a patient who had two distinct tumors that were present on initial diagnosis. The first (MG 30) was resected and the second tumor (MG31) underwent chemotherapy and radiation followed by resection. To provide further evidence for the latter hypothesis, three previously characterized patient-derived glioma cultures (HK29624, HK35724, TS60325) underwent atac-seq analysis before and after 9 Gray of radiation in three daily fractions. All three cell lines showed a decrease in peak count following radiation (Fig. 4D).

Figure 4
figure 4

(A) The total peaks (transcripts per million > 10) were calculated for each sample and tabulated by grade. Grade II samples had significantly more peaks than Grade IV recurrent samples (ANOVA). (B) The total peak counts for three paired samples are shown. (C) Samples were divided by IDH1 status and by primary vs. recurrent status. (D) Three patient derived glioblastoma cell lines were subjected to atac-seq analysis before and after 9 grey radiation in three fractions.

Discussion

Glioblastoma is a currently incurable disease and despite decades of research and hundreds of clinical trials, the prognosis remains grim. One frustration encountered by several trials6 including those investigating the roles of radiation4 is that improvements in progression-free survival do not always translate into improvements in overall survival. One possible explanation for this phenomenon is that while certain therapies may initially impede tumor growth those cells that survive may take on additional attributes that make them more destructive and refractory to treatment. The data from this study provides an intriguing new model for this transition. The majority of atac-seq peaks identified in this study were correlated with better survival and tended to positively correlate with genes of nervous system differentiation. There was a non-significant trend for total peak count to decrease as the grade increased with the most significant decrease seen in the recurrent grade IV samples. It may be the case, that the closing of DNA regions responsible for driving neural differentiation induces more malignant behavior and that this process can be accelerated by radiation. If this were true, it may be beneficial to pair radiation therapy with other compounds that could potentially minimize this unwanted side effect.

With advancements in the availability and cost of sequencing, traditional histological classifications have been replaced with more sophisticated genetic markers. Despite the wealth of glioma expression data, attempts to identify patterns of gene expression that predict survival have been underwhelming32,33. In contrast, epigenetic data such as DNA methylation (e.g. G-CIMP) seems to have a high degree of prognostic value18,32,34. Supporting this observation, this study created an accurate prognostic survival model with only six atac-seq peaks, four of which remained significant when controlling for grade, age and IDH1 status.

The discovery of the IDH135 and H3K27M36,37 mutations has revealed the importance of epigenetic dysfunction in the early stages of cancer development. The presumed mechanism of the IDH1 mutation is to favor DNA methylation by inhibiting DNA demethylase enzymes (e.g. Tet family)18,19. This is hypothesized to induce decreased DNA accessibility with subsequent gene down-regulation20. In the current study, UMAP clustering divided samples by IDH1 status although several IDH1 mutant samples clustered with the IDH1 wildtype group. Running counter to the hypothesis that the IDH1 mutation leads to hyper-methylation and decreased DNA accessibility, there was a trend for IDH1 mutant samples to have more total peaks. This implies that the relationship between DNA methylation and DNA accessibility may be more complicated than previously assumed.

The relative ease and low input requirement of atac-seq technology has led to wide-spread use on a variety of tissue types38,39,40. However, while the data is easy to obtain it is challenging to interpret without additional sources of supporting evidence. In the absence of this supporting evidence several assumptions have become common in the literature namely that a given peak positively regulates the nearest gene. In the current study, the authors assembled a sufficiently large dataset to create a correlation matrix to make some rational assumptions about the relationships of peaks and target genes. Using relatively strict cut-offs, the data showed that most peaks were correlated with one or at most a few genes. The majority of peaks correlated positively with target gene expression. However, there was no correlation between peaks and the nearest genes. On the contrary, peaks and correlated genes were frequently separated by long distances and including many intervening genes. It should be noted that while this study provides some evidence of specific intergenic regions regulating specific target genes, the data is correlational and indirect and needs to be strengthened by alternative methods such as genetic manipulation (CRISPR) or chromosome conformation capture-based methods.

In conclusion, this study uses an innovative peak:gene correlation matrix to create an epigenetic gene regulatory map of gliomas and then uses this matrix to both identify specific gene regulatory regions of interest as well as introduce a possible mechanism of epigenetic malignant progression.