Web servers: Puffin | Puffin-D | Orca | Sei | Multiplexer

Puffin: Deep-learning-inspired explainable sequence model of transcription initiation

Puffin is an interpretation-focused sequence model for transcription initiation in the human genome which is also applicable to other mammalian species. Puffin can predict basepair-resolution transcription initiation signals using only genomic sequence as input, and more importantly, analyze the sequence basis of any transcription start site (TSS) at motif and basepair levels.

Puffin-D is a prediction-focused deep learning sequence model for basepair-resolution transcription initiation signals. Puffin-D takes a 100 kb input sequence and accurately predicts the transcription initiation signal strength.

For most use cases, we recommend the interactive Puffin web server at puffin.zhoulab.io. The prediction-focused deep learning model Puffin-D is available at puffind.zhoulab.io.

DDSM: Dirichlet Diffusion Score Model for biological sequence generation

Dirichlet Diffusion Score Model (DDSM) is a continuous-time diffusion framework designed specifically for modeling discrete data such as biological sequences. We introduce a diffusion process defined on the probability simplex whose stationary distribution is the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We demonstrated applications to Sudoku solving and promoter sequence design.
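As a rough illustration of why the probability simplex is a natural space for discrete sequences, the sketch below one-hot encodes a DNA string (each position becomes a vertex of the simplex) and draws Dirichlet samples (points inside the same simplex). This is not the DDSM implementation: the helper names `one_hot` and `sample_dirichlet_noise` and the concentration value `alpha=1.0` are assumptions made for this example only.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; each row is a simplex vertex."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    return np.eye(len(ALPHABET))[idx]

def sample_dirichlet_noise(length, alpha=1.0, seed=0):
    """Draw one Dirichlet sample per position (hypothetical helper, alpha is arbitrary)."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.full(len(ALPHABET), alpha), size=length)

x0 = one_hot("ACGT")                    # discrete data: vertices of the simplex
noise = sample_dirichlet_noise(len(x0)) # continuous points inside the simplex

# Both representations are non-negative and sum to 1 at every position,
# so discrete data and Dirichlet samples live in the same space.
print(x0.sum(axis=1), noise.sum(axis=1))
```

In this picture, a diffusion process would move points from the vertices toward Dirichlet-distributed interior points, and a generative model would learn to reverse that movement.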

Orca: predict genome 3D interactions from kilobase to whole-chromosome scale from sequence

Orca is a deep learning sequence modeling framework for multiscale genome interaction prediction from kilobase to beyond whole-chromosome scale. Orca can predict the genome structural impact of any genomic variant, including very large structural variants, and supports designing virtual genetic screens to probe the sequence basis of genome 3D organization.

Orca-leukemia update (2023): a version of Orca for leukemia-related cell lines is now available in the webserver, and a GitHub repository for structural variant regulatory impact scoring for leukemia subtypes is available below.

Sei: A sequence-based global map of regulatory activity for deciphering human genetics

Sei is a framework for systematically predicting sequence regulatory activities and applying sequence information to human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone mark, and chromatin accessibility profiles across a wide range of cell types). Sei has been used for interpreting tissue-specific regulatory signals from GWAS and predicting individual pathogenic variant effects.

Quasildr: An analytical framework for interpretable and generalizable single-cell omics data analysis

GraphDR is a method for single-cell data visualization and representation; StructDR is a method for generalized trajectory inference. The unique features of these methods are 1. linear interpretability, which is similar to PCA and eases comparisons across datasets, 2. statistical inference of confidence sets for the trajectories, and 3. unified inference of cluster (0-dimensional), trajectory (1-dimensional), and surface (2-dimensional) structures. GraphDR is also scalable to more than 10 million cells.

ExPecto: tissue-specific gene expression effect prediction for human mutations

ExPecto is a framework for ab initio sequence-based prediction of the gene expression effects and disease risks of mutations. Precomputed variant effects can be browsed and downloaded here. You can access deep learning sequence model-based variant chromatin effect prediction from the webserver here (thanks to the Flatiron Institute Genomics team!).

Selene: a PyTorch-based deep learning library for sequence data

Selene is a Python library and command line interface for developing deep neural networks for biological sequence data. Our aim for Selene is to accelerate development and application of deep learning sequence models in biology. The development of Selene is led by Kathy Chen.

ASD Browser

To analyze the impact of regulatory mutations on disease, we developed deep learning sequence models that predict molecular-level effects at the chromatin level and the RNA-binding protein level, together with Disease Impact Scores that summarize these molecular-level effects. Precomputed ASD mutation effects can be browsed and downloaded here.

DeepSEA: Deep learning-based algorithmic framework for predicting chromatin effects

DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single-nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factor binding, DNase I sensitivity, and histone marks in multiple cell types. It can be further utilized to predict the chromatin effects of sequence variants and prioritize regulatory variants.
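The general idea of scoring a sequence variant with a sequence model can be sketched as comparing the model's prediction for the reference sequence with its prediction for the same sequence carrying the alternate allele. The sketch below is only an illustration under stated assumptions: `toy_model` is a stand-in with random weights and a single output, not the DeepSEA network, and `variant_effect` is a hypothetical helper name, not part of any DeepSEA API.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    return np.eye(len(ALPHABET))[idx]

def toy_model(x, weights):
    """Stand-in for a trained model: sigmoid of a linear score (assumption, not DeepSEA)."""
    return 1.0 / (1.0 + np.exp(-np.sum(x * weights)))

def variant_effect(ref_seq, pos, alt_base, weights):
    """Predicted probability for the alternate sequence minus that for the reference."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return toy_model(one_hot(alt_seq), weights) - toy_model(one_hot(ref_seq), weights)

rng = np.random.default_rng(0)
ref = "ACGTACGT"
weights = rng.normal(size=(len(ref), len(ALPHABET)))  # random, untrained weights

# Score a single-nucleotide substitution T -> A at 0-based position 3.
effect = variant_effect(ref, 3, "A", weights)
print(f"predicted effect: {effect:+.4f}")
```

A real model would produce one such difference per chromatin profile, and variants could then be ranked by the magnitude of their predicted effects.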

FIND: Drosophila embryonic development tissue-stage-specific gene expression predictions

We developed a machine learning method to provide genome-wide, quantitative spatiotemporal gene expression predictions. The method, structured in silico nanodissection, is a lineage-aware probabilistic graphical model that predicts gene expression in >200 tissue-developmental stages. FIND (Fly in silico nanodissection) is a webserver for exploring the gene expression predictions.

Multiplexer

Accelerating Systematic Prediction of Variant Effects and Sequence Interpretation with Multiplexer Models. Multiplexer is a framework for deep learning sequence model-based prediction and interpretation.