A new computational method uses convolutional neural networks for cis-regulatory sequence analysis to analyze and cluster scATAC-seq data.

Convolutional neural networks (CNNs) provide a powerful lens through which to decipher the genomic regulatory code. CNNs have been successfully trained on regulatory DNA sequences to predict transcription factor (TF) binding1, chromatin accessibility2,3,4, enhancer activity5 and gene expression6. In such networks, the convolutional filters in the first layer are scanned across the input sequence and are activated by sequence motifs that are relevant for the final predictions. The size of the filters is usually chosen to reflect the width of a TF motif, so that many of the learned features represent TF motifs. Next, additional convolutional, recurrent or transformer layers provide a means to learn how TF motifs are combined and organized. These layers are followed by dense layers that exploit the learned representation for classification or regression.

After training, the CNN is mostly used to discover important cis-regulatory features. These include the presence of activator motifs and the absence of repressor motifs, but also more advanced aspects of enhancer and promoter architecture, such as motif combinations, motif orientation, the distance between motifs and nucleosome binding preferences. Compared to conventional techniques for sequence analysis with position weight matrices or de novo motif discovery, CNNs that are combined with network explainability methods7 provide a promising route to overcoming the high false-positive rate in TF binding site predictions (the so-called 'futility theorem'8). Sequence-based CNNs are particularly promising for learning regulatory codes across many cell types, for example by applying them to atlases of single-cell chromatin accessibility data2,3.
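As a rough illustration of the first-layer convolution described above, the sketch below one-hot encodes a DNA sequence, slides a single motif-width filter along it, applies a ReLU and takes the maximum activation. It is a toy under stated assumptions, not the architecture of any cited model: the filter weights are set by hand around a hypothetical 6-bp motif (a real CNN learns them during training), and the names `one_hot`, `scan_filter`, `global_max_pool` and `TOY_FILTER` are invented for this example.

```python
# Illustrative sketch only: a hand-written filter stands in for a learned one.
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of four channels (A, C, G, T).

    Bases outside ACGT (for example N) become all-zero rows.
    """
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def scan_filter(encoded, weights, bias=0.0):
    """Slide one convolutional filter along the sequence, then apply ReLU."""
    width = len(weights)
    activations = []
    for start in range(len(encoded) - width + 1):
        score = bias
        for offset in range(width):
            for channel in range(4):
                score += weights[offset][channel] * encoded[start + offset][channel]
        activations.append(max(0.0, score))  # ReLU keeps only positive evidence
    return activations


def global_max_pool(activations):
    """Summarize the strongest match of the filter anywhere in the sequence."""
    return max(activations) if activations else 0.0


# Hypothetical filter: +1 for the motif base at each position, -1 otherwise.
# Its width (6) mirrors the idea of matching filter size to motif width.
MOTIF = "TGACTC"
TOY_FILTER = [[1.0 if letter == base else -1.0 for letter in ALPHABET]
              for base in MOTIF]

sequence = "AAGCTGACTCATTG"
activations = scan_filter(one_hot(sequence), TOY_FILTER, bias=-3.0)
best_position = activations.index(max(activations))
pooled = global_max_pool(activations)
```

Here the negative bias acts as a threshold, so only near-perfect matches to the motif survive the ReLU. In a trained network, many such filters run in parallel and their activations feed the deeper layers that model how motifs are combined.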
Indeed, single-cell ATAC-seq (scATAC-seq) provides chromatin accessibility profiles for every cell type in a heterogeneous tissue, or for every cell state in a dynamic biological process9,10. CNNs can then be trained on all cell type-specific accessible regions at once. However, when using scATAC-seq data, this approach depends on a high-quality clustering of cells by cell type or cell state, which can be a challenge given the high noise and sparsity in these data9. Yuan and Kelley address this problem in this issue by exploiting sequence-based CNNs to learn a high-confidence single-cell representation11. Before describing this method, I will briefly summarize conventional approaches for learning a single-cell representation from scATAC-seq data.

