Ofir Yaish and Yaron Orenstein
The CRISPR/Cas9 system has revolutionized the field of genome editing in recent years. Despite its great success in editing the intended target site, it also produces many off-target edits. The discovery and prediction of these off-targets have been the focus of both experimental and computational researchers. Until recently, however, progress was held back by small and low-quality datasets.
Fortunately, last year, a dataset of unprecedented scale and quality was published as part of the CHANGE-seq study [1]. The dataset included more than 200,000 measurements of off-targets in vitro across 110 guide RNAs, with about half also tested in vivo. These new data opened the door to many new opportunities, including the systematic evaluation of CRISPR off-target prediction models and problem formulations.
In a recent preprint [2], we derived several fundamental conclusions about the off-target prediction problem. First, we discovered that applying a log transformation to the measured read counts leads to improved performance both in detecting off-target sites and in predicting their editing efficiency. Second, we found that including potential off-targets that were not edited in the experiment (i.e., inactive sites) improves off-target site detection, and even improves off-target editing efficiency prediction. Third, we compared models that differ in their input features (sequence and number of mismatches between the target site and off-target site), and showed that the number of mismatches captures almost all of the information regarding off-target editing efficiency. Notably, our analysis showed that models trained on in vitro data predict off-target activity in vivo better than the in vitro measurements themselves do when taken directly as the prediction. Finally, we are the first to apply transfer learning (training on a large dataset, then fine-tuning on a smaller, more relevant dataset) to the off-target prediction problem, in order to improve the performance of in vitro trained models on in vivo data.
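The first two points can be illustrated with a minimal sketch. It is not the code used in the study: the read counts are made up, and the pseudo-count of 1 inside the logarithm (`log1p`) is an assumption chosen so that inactive sites with zero reads map to a label of zero.

```python
import math


def log_transform_reads(read_counts):
    """Map raw read counts to log(1 + count).

    The +1 pseudo-count is an assumption of this sketch; it keeps
    inactive sites (0 reads) at a label of exactly 0.
    """
    return [math.log1p(count) for count in read_counts]


# Hypothetical read counts: three active off-target sites followed by
# two inactive sites (potential off-targets with no measured reads).
reads = [1500, 12, 3, 0, 0]

# Regression labels: log-transformed editing efficiency.
labels_regression = log_transform_reads(reads)

# Classification labels: active (1) versus inactive (0) site.
labels_classification = [1 if count > 0 else 0 for count in reads]
```

The transformation compresses the wide dynamic range of the read counts, so a handful of highly edited sites no longer dominates the regression target.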
We believe that our findings and conclusions will be instrumental in the future development of off-target predictors. They show the power of machine learning to extrapolate beyond the training data and to generalize to unseen test data, even better than the experiments themselves, which always carry some noise. In addition, they demonstrate the benefit of transfer learning when both a large, less-correlated dataset and a smaller, more-correlated dataset are available. We expect this approach to improve prediction performance in many other genomic applications. All in all, we expect our evaluation to advance the off-target field by leading the way to improved computational methods for off-target site prediction.
[1] C. R. Lazzarotto et al., “CHANGE-seq reveals genetic and epigenetic effects on CRISPR–Cas9 genome-wide activity,” Nat. Biotechnol., 2020, doi: 10.1038/s41587-020-0555-7.
[2] O. Yaish, M. Asif, and Y. Orenstein, “A systematic evaluation of data processing and problem formulation of CRISPR off-target site prediction,” bioRxiv, 2021.
The computational model used throughout our study. The sgRNA and off-target site are paired, and each position is encoded as a 4-by-4 binary matrix, with a 1 at the entry corresponding to the nucleotide pair. The concatenation of the flattened matrices, together with the distance (i.e., the number of mismatches between the sgRNA and the site), is given as input to an XGBoost model.
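The encoding in this caption can be sketched as follows. This is an illustrative reconstruction rather than the study's code: the function name `encode_pair`, the nucleotide order `ACGT`, and the row/column convention (sgRNA nucleotide as row, site nucleotide as column) are assumptions, and the sketch handles only equal-length sequences over A, C, G and T, with no bulges or ambiguous bases.

```python
NUCLEOTIDES = "ACGT"  # assumed ordering of rows and columns


def encode_pair(sgrna, site):
    """Encode an (sgRNA, off-target site) pair as a flat feature vector.

    Each aligned position becomes a flattened 4-by-4 one-hot matrix
    (row = sgRNA nucleotide, column = site nucleotide). The number of
    mismatches is appended as the final feature.
    """
    if len(sgrna) != len(site):
        raise ValueError("sgRNA and site must have the same length")

    features = []
    mismatches = 0
    for guide_nt, site_nt in zip(sgrna.upper(), site.upper()):
        matrix = [0] * 16
        row = NUCLEOTIDES.index(guide_nt)  # raises ValueError on non-ACGT
        col = NUCLEOTIDES.index(site_nt)
        matrix[row * 4 + col] = 1
        features.extend(matrix)
        if guide_nt != site_nt:
            mismatches += 1

    features.append(mismatches)
    return features


# Illustrative 23-nt sequences (20-nt protospacer plus PAM) differing at
# two positions; they are examples, not entries from the dataset.
sgrna = "GAGTCCGAGCAGAAGAAGAAGGG"
site = "GGGTCTGAGCAGAAGAAGAAGGG"
vector = encode_pair(sgrna, site)
```

For a 23-nt pair the vector has 23 × 16 + 1 = 369 entries. A vector of this form is what would be passed, one row per pair, to a gradient-boosted tree model such as XGBoost.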