# DeepMP **Repository Path**: W-seventeen/DeepMP ## Basic Information - **Project Name**: DeepMP - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: release - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2021-07-29 - **Last Updated**: 2021-07-29 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # DeepMP DeepMP is a convolutional neural network (CNN)-based model that takes information from Nanopore signals and basecalling errors to detect whether a read is methylated or not. DeepMP's architecture consists of two different modules. First, the sequence module involves 6 1D convolutional layers with 256 1x4 filters. On the other hand, the error module comprises 3 1D convolutional layers and 3 locally connected layers both with 128 1x3 filters. Outputs are finally concatenated and inputted into a fully connected layer with 512 units. Furthermore, DeepMP introduces a threshold-free position modification calling model sensitive to sites methylated at low frequency across cells. These novelties allow DeepMP to accurately detect methylated sites at read and position levels. alt text # Contents - [Installation](#Installation) - [Usage](#Usage) - [Example data](#Example-data) # Installation ## Clone repository First download DeepMP from the github repository: git clone https://github.com/pepebonet/DeepMP.git ## Install dependencies We highly recommend to use a virtual environment for the installation and employment of DeepMP: `Create environment and install DeepMP:` conda create --name deepmp_2021 python=3.8 conda activate deepmp_2021 pip install -e . `Install additional dependencies:` pip install ont-tombo pip install biopython conda install -c bioconda samtools conda install -c bioconda minimap2 conda install -c anaconda openjdk conda install -c anaconda bottleneck # Usage This section highlights the main functionalities of DeepMP and the commands to run them. For a detailed insight into the whole process of predicting modifications go to the [Example data](#Example-data) section. ### Feature extraction: Features for the model need to be extracted. We present 3 different options: `Option 1:` Extract combined features ``` DeepMP combined-extraction -fr path/to/fast5/files/ -re path/to/error/folder/ -rp path/to/reference/file/ -dn path/to/dict_read_names -m CG -o CpG_methylated_combined.tsv -ml 1 -cpu 56 ``` `Option 2:` Extract sequence features ``` DeepMP sequence-feature-extraction path/to/fast5/files/ -rp path/to/reference/file/ -m CG -o CpG_methylated.tsv -ml 1 -cpu 56 ``` `Option 3:` Extract error features ``` DeepMP single-read-error-extraction -ef path/to/error/folder/ -m CG -o output/error_features/ -l 1 -cpu 56 ``` ### Preprocess: Extracted features are processed to get the information into h5 format which is the desired input for training, validation and testing. ``` DeepMP preprocess -f path/to/features.tsv -ft combined -o output/folder/ -cpu 56 ``` ### Train models Preprocessing is needed before training. Use `--model_type` flag to specify model for data preparation, choose between `seq` and `err`. ``` DeepMP preprocess path/to/csv_file --model_type seq ``` Train sequence model from binary files. ``` DeepMP train-nns -m seq -tf path/to/train/data -vf path/to/validation/data -md save/model/to/directory -ld save/log/to/directory ``` Train errors model from binary files. ``` DeepMP train-nns -m err -tf path/to/train/data -vf path/to/validation/data -md save/model/to/directory -ld save/log/to/directory ``` Train joint model from binary files. ``` DeepMP train-nns -m joint -tf path/to/train/data -vf path/to/validation/data -md save/model/to/directory -ld save/log/to/directory ``` - Use `-cp` to specify the checkpoint file while training model from checkpoints. ### Call modifications Finally modifications are called for a given test data. Call modifications can be employed to be run in parallel or in a single CPU. Running call-modifications in parallel (faster and memory efficient): ``` DeepMP call-modifications -m joint -tf test/folder/preprocess/output/ -md model/directory -o output/ -pos -cpu 56 ``` Running call-modifications in single CPU (deprecated): ``` DeepMP call-modifications -m joint -tf path/to/test/data/test_file.h5 -md model/directory -o output/ -pos ``` - Specify model type with flag `-m`, choose from `seq, err, joint`(required). - Add `-ef` for sequence model with both seq and error features. - Add `-pos` for test on positions. ### Fast call modifications from fast5 files (under test) DeepMP can call modifications from re-squiggled fast5 files in one step: ``` DeepMP fast-call-joint -f path/to/fast5s/ -ref path/to/reference/genome -md path/to/trained_model -j path/to/sam2tsv.jar ``` Note that this function is currently under test, stepwise process is recommanded. Please see the following example for details. # Example data Step by step process to detect modifications employing DeepMP on a sample (10 reads) of the E. coli dataset. Copy paste of all the commnads starting from the DeepMP folder will output the results on the read and position predictions. E. coli reads are located in: ``` docs/reads/ ``` ## Data Preparation First, extract the fastqs from the reads (output paths need to be generated and updated for the following commands): ``` python deepmp/miscellaneous/parse_fast5.py docs/reads/treated/ -ff5 True -o docs/output_example/error_features/treated/ -cpu 56 -bg Basecall_1D_001 python deepmp/miscellaneous/parse_fast5.py docs/reads/untreated/ -ff5 True -o docs/output_example/error_features/untreated/ -cpu 56 -bg Basecall_1D_001 ``` Next, for both fastqs extracted, the reads are mapped to the reference genome. For that to be done, first load the following packages in your environment if you do not have them already. ``` conda install -c bioconda samtools conda install -c bioconda minimap2 conda install -c anaconda openjdk ``` Then, map reads to the genome with the reference genome available in /docs/ref/ : ``` cd docs/output_example/error_features/treated/ minimap2 -ax map-ont ../../../ref/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa Basecall_1D_001_BaseCalled_template.fastq | samtools view -hSb | samtools sort -@ 56 -o sample.bam samtools index sample.bam ``` Repeat for the untreated folder: ``` cd ../untreated/ minimap2 -ax map-ont ../../../ref/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa Basecall_1D_001_BaseCalled_template.fastq | samtools view -hSb | samtools sort -@ 56 -o sample.bam samtools index sample.bam ``` The following step is to call the variants in both treated and untreated folders: ``` samtools view -h -F 3844 sample.bam | java -jar ../../../jvarkit/sam2tsv.jar -r ../../../ref/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa > sample.tsv cd ../treated/ samtools view -h -F 3844 sample.bam | java -jar ../../../jvarkit/sam2tsv.jar -r ../../../ref/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa > sample.tsv ``` To allow downstream parallelisation of the feature extraction, the generated sample.tsv is split into the different reads in a tmp folder. ``` mkdir tmp awk 'NR==1{ h=$0 }NR>1{ print (!a[$2]++? h ORS $0 : $0) > "tmp/"$1".txt" }' sample.tsv ``` Then go back to the untreated and repeat: ``` cd ../untreated/ mkdir tmp awk 'NR==1{ h=$0 }NR>1{ print (!a[$2]++? h ORS $0 : $0) > "tmp/"$1".txt" }' sample.tsv ``` As in some scenarios the readnames of the fastqs do not match the fast5 readnames, a dictionary to parse each read pair may be needed. To do so, we employ the sequencing summary output after running Guppy. That is the case of E. coli: ``` pip install biopython python ../../../../deepmp/miscellaneous/get_dict_guppy.py -ss ../../../../docs/reads/sequencing_summary_untreated.txt -o dict_reads.pkl cd ../treated/ python ../../../../deepmp/miscellaneous/get_dict_guppy.py -ss ../../../../docs/reads/sequencing_summary_treated.txt -o dict_reads.pkl ``` ## Running DeepMP First, go back to the DeepMP folder to run the commands: ``` cd ../../../../ ``` With the data ready, we can now extract the combined features from the reads: ``` DeepMP combine-extraction -fr docs/reads/treated/ -re docs/output_example/error_features/treated/tmp/ -rp docs/ref/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa -ml 1 -cpu 56 -m CG -dn docs/output_example/error_features/treated/dict_reads.pkl -o docs/output_example/treated_features.tsv DeepMP combine-extraction -fr docs/reads/untreated/ -re docs/output_example/error_features/untreated/tmp/ -rp docs/ref/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa -ml 0 -cpu 56 -m CG -dn docs/output_example/error_features/untreated/dict_reads.pkl -o docs/output_example/untreated_features.tsv ``` Once the features are extracted one can concat the resulting features of the treated and untreated samples into a single file to then perform the preprocess step: ``` cd docs/output_example/ cat untreated_features.tsv treated_features.tsv > features.tsv DeepMP preprocess -f features.tsv -ft combined -o . -cpu 56 ``` With the test folder available we can now use one of the trained models to get the predictions from the model at read and position level: ``` DeepMP call-modifications -m joint -md ../../trained_models/K12ER2925_joint_202101/ -tf test/ -pos -cpu 56 ``` The resulting files of the analysis should be the read and position calling predictions from DeepMP: ``` read_predictions_joint_DeepMP.tsv position_calling_joint_DeepMP.tsv ```