From 59a5978aab96a0d14e3ecf46c21e789173201228 Mon Sep 17 00:00:00 2001
From: Lucas Hendren
Date: Tue, 20 Feb 2024 22:39:20 -0800
Subject: [PATCH 1/6] Providing preprocessing code for chromatin profiling and adding in additional instructions/examples

---
 README.md        | 76 +++++++++++++++++++++++++++++++++++++++++++++++-
 create_coords.py | 16 ++++++++++
 2 files changed, 91 insertions(+), 1 deletion(-)
 create mode 100644 create_coords.py

diff --git a/README.md b/README.md
index 2770324..be32fd9 100644
--- a/README.md
+++ b/README.md
@@ -334,7 +334,81 @@ python -m evals/instruction_tuned_genomics
 
 ### Chromatin Profile
 
-You'll need to see the [DeepSea](https://www.nature.com/articles/nmeth.3547) and [repo](https://github.com/FunctionLab/sei-framework) for info how to download and preprocess.
+You'll need to see the [DeepSea](https://www.nature.com/articles/nmeth.3547) and [repo](https://github.com/FunctionLab/sei-framework) for info how to download and preprocess. 
+
+For a more detailed example look below
+
+1. Git clone or Download the following [buildling-deepsea repo](https://github.com/jakublipinski/build-deepsea-training-dataset).
+
+2. Git clone or Download the [sei framework library](https://github.com/FunctionLab/sei-framework.git)
+
+3. Step into the sei framework and perform the setup
+
+```
+sh ./download_data.sh.
+```
+
+This should download data into the resources folder.
+
+4. Next step into the build-deepsea-training-dataset. Follow the instructions in the repo to build the dataset in debugging mode, which will include the instructions below. The hg19 path will need to be specifically given a path to the given fa file. You should include the hg19 file you downloaded in step 3 in resources, specifically the FA hg19 file from the sei framework. You can modify other parameters as well for further customization.
+
+    A. 
+    ```
+    git clone git@github.com:jakublipinski/build-deepsea-training-dataset.git
+    cd build-deepsea-training-dataset/data
+    xargs -L 1 curl -C - -O -L < deepsea_data.urls
+    find ./ -name \*.gz -exec gunzip {} \;
+    cd ..
+    ```
+
+    B.
+    ```
+    mkdir out
+
+    python build.py \
+    --metadata_file data/deepsea_metadata.tsv \
+    --pos data/allTFs.pos.bed \
+    --beds_folder data/ \
+    --hg19 [path to FA file] \
+    --train_size 2200000 \
+    --valid_size 4000 \
+    --train_filename out/train.mat \
+    --valid_filename out/valid.mat \
+    --test_filename out/test.mat \
+    --train_data_filename out/train_data.npy \
+    --train_labels_filename out/train_labels.npy \
+    --valid_data_filename out/valid_data.npy \
+    --valid_labels_filename out/valid_labels.npy \
+    --test_data_filename out/test_data.npy \
+    --test_labels_filename out/test_labels.npy \
+    --save_debug_info True
+    ```
+
+5. Run the following Python Code to create your coordinate target files, youll need to pass in the path to your directory for, please make a note of their location for step 8
+
+```
+python ./create_coords.py
+```
+
+
+6. Now step into the Sei Framework and follow the steps in chromatin profile prediction and specifically run the following command
+
+    A.
+    ```
+    sh 1_sequence_prediction.sh --cuda
+    ```
+
+    The input-file will be the bed or fasta input file you download in step 3 which should be in the resources directory within the sei framework. For the genome, this example is geared towards hg19. You can do hg38 as well but you will need to make changes to earlier steps. Output directory is your choice
+
+6. This should create a folder called chromatin-profiles-hdf5 in your output directory along with several other files.
+
+7. Take your coordinate files from step 5 and copy them to the chromatin-profiles-hdf5 folder
+
+8. Now go back to hyena dna and, assuming you have already setup hyena dna, perform the following
+
+    A. python -m train wandb=null experiment=hg19/chromatin_profile dataset.ref_genome_path=/path/to/fasta/hg19.ml.fa dataset.data_path=/path/to/chromatin_profile dataset.ref_genome_version=hg19
+
+    B. For paths to chromatin files, those will be the paths to the chromatin-profiles-hdf5. For the hg19 fa files, those can be in resources or found in your output directory with the chromatin profile. For genome version and experiment, this is for the hg19 experiment
 
 
 example chromatin profile run:
diff --git a/create_coords.py b/create_coords.py
new file mode 100644
index 0000000..da5507f
--- /dev/null
+++ b/create_coords.py
@@ -0,0 +1,16 @@
+import Pandas as pd
+
+def create_coord_target_files(file, name):
+    target_cols=pd.read_csv('data/deepsea_metadata.tsv', sep='\t')['File accession'].tolist() # metadata from build-deepsea-training-dataset repo
+    colnames=target_cols+['Chr_No','Start','End']
+    df = pd.read_csv(file, usecols=colnames, header=0)
+    df.drop_duplicates(inplace=True)
+    df.reset_index(drop=True, inplace=True)
+    df.rename(columns={k:f'y_{k}' for k in target_cols}, inplace=True)
+    df.to_csv(f'{name}_coords_targets.csv')
+
+
+path_to_deepsea_data_repo = sys.argv[1]
+create_coord_target_files('debug_valid.tsv', 'val')
+create_coord_target_files('debug_test.tsv', 'test')
+create_coord_target_files('debug_train.tsv', 'train')
\ No newline at end of file

From 280c17b551a5e481b6a9d7525869910cdcd86310 Mon Sep 17 00:00:00 2001
From: Lucas Hendren
Date: Wed, 21 Feb 2024 19:32:09 -0800
Subject: [PATCH 2/6] Fixing Formatting

---
 README.md | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index be32fd9..bd7132c 100644
--- a/README.md
+++ b/README.md
@@ -352,7 +352,7 @@ This should download data into the resources folder.
 
 4. Next step into the build-deepsea-training-dataset. Follow the instructions in the repo to build the dataset in debugging mode, which will include the instructions below. The hg19 path will need to be specifically given a path to the given fa file. You should include the hg19 file you downloaded in step 3 in resources, specifically the FA hg19 file from the sei framework. You can modify other parameters as well for further customization.
 
-    A. 
+
     ```
     git clone git@github.com:jakublipinski/build-deepsea-training-dataset.git
     cd build-deepsea-training-dataset/data
@@ -360,8 +360,7 @@ This should download data into the resources folder.
     find ./ -name \*.gz -exec gunzip {} \;
     cd ..
     ```
-
-    B.
+
     ```
     mkdir out
 
@@ -393,9 +392,8 @@ python ./create_coords.py
 
 6. Now step into the Sei Framework and follow the steps in chromatin profile prediction and specifically run the following command
 
-    A.
     ```
-    sh 1_sequence_prediction.sh --cuda
+    sh 1_sequence_prediction.sh --cuda
     ```
 
     The input-file will be the bed or fasta input file you download in step 3 which should be in the resources directory within the sei framework. For the genome, this example is geared towards hg19. You can do hg38 as well but you will need to make changes to earlier steps. Output directory is your choice

From 24024ccf15ce7cd3c3ec438c422c55608bb54778 Mon Sep 17 00:00:00 2001
From: Lucas Hendren
Date: Wed, 21 Feb 2024 19:34:18 -0800
Subject: [PATCH 3/6] Fixing Formatting with code

---
 README.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/README.md b/README.md
index bd7132c..8f51751 100644
--- a/README.md
+++ b/README.md
@@ -392,9 +392,7 @@ python ./create_coords.py
 
 6. Now step into the Sei Framework and follow the steps in chromatin profile prediction and specifically run the following command
 
-    ```
-    sh 1_sequence_prediction.sh --cuda
-    ```
+    ```sh 1_sequence_prediction.sh --cuda```
 
     The input-file will be the bed or fasta input file you download in step 3 which should be in the resources directory within the sei framework. For the genome, this example is geared towards hg19. You can do hg38 as well but you will need to make changes to earlier steps. Output directory is your choice
 

From f3c08aea24deee2b0e4e0c2a15266a40f1149810 Mon Sep 17 00:00:00 2001
From: Lucas Hendren
Date: Wed, 21 Feb 2024 19:35:31 -0800
Subject: [PATCH 4/6] Restructuring readme

---
 README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 8f51751..d5a9cc4 100644
--- a/README.md
+++ b/README.md
@@ -340,9 +340,9 @@ For a more detailed example look below
 
 1. Git clone or Download the following [buildling-deepsea repo](https://github.com/jakublipinski/build-deepsea-training-dataset).
 
-2. Git clone or Download the [sei framework library](https://github.com/FunctionLab/sei-framework.git)
+2. Git clone or Download the [sei framework library](https://github.com/FunctionLab/sei-framework.git).
 
-3. Step into the sei framework and perform the setup
+3. Step into the sei framework and perform the setup.
 
 ```
 sh ./download_data.sh.
@@ -383,28 +383,28 @@ This should download data into the resources folder.
     --save_debug_info True
     ```
 
-5. Run the following Python Code to create your coordinate target files, youll need to pass in the path to your directory for, please make a note of their location for step 8
+5. Run the following Python Code to create your coordinate target files, youll need to pass in the path to your directory for, please make a note of their location for step 8.
 
 ```
 python ./create_coords.py
 ```
 
 
-6. Now step into the Sei Framework and follow the steps in chromatin profile prediction and specifically run the following command
+6. Now step into the Sei Framework and follow the steps in chromatin profile prediction and specifically run the following command below. The input-file will be the bed or fasta input file you download in step 3 which should be in the resources directory within the sei framework. For the genome, this example is geared towards hg19. You can do hg38 as well but you will need to make changes to earlier steps. Output directory is your choice. 
 
     ```sh 1_sequence_prediction.sh --cuda```
 
-    The input-file will be the bed or fasta input file you download in step 3 which should be in the resources directory within the sei framework. For the genome, this example is geared towards hg19. You can do hg38 as well but you will need to make changes to earlier steps. Output directory is your choice
+
 
 6. This should create a folder called chromatin-profiles-hdf5 in your output directory along with several other files.
 
-7. Take your coordinate files from step 5 and copy them to the chromatin-profiles-hdf5 folder
+7. Take your coordinate files from step 5 and copy them to the chromatin-profiles-hdf5 folder.
 
-8. Now go back to hyena dna and, assuming you have already setup hyena dna, perform the following
+8. Now go back to hyena dna and, assuming you have already setup hyena dna, perform the following.
 
     A. python -m train wandb=null experiment=hg19/chromatin_profile dataset.ref_genome_path=/path/to/fasta/hg19.ml.fa dataset.data_path=/path/to/chromatin_profile dataset.ref_genome_version=hg19
 
-    B. For paths to chromatin files, those will be the paths to the chromatin-profiles-hdf5. For the hg19 fa files, those can be in resources or found in your output directory with the chromatin profile. For genome version and experiment, this is for the hg19 experiment
+    B. For paths to chromatin files, those will be the paths to the chromatin-profiles-hdf5. For the hg19 fa files, those can be in resources or found in your output directory with the chromatin profile. For genome version and experiment, this is for the hg19 experiment. 
 
 
 example chromatin profile run:

From 10cc73489206d198d93c90ef332410df71417623 Mon Sep 17 00:00:00 2001
From: Lucas Hendren
Date: Wed, 21 Feb 2024 19:37:45 -0800
Subject: [PATCH 5/6] Indentation fix

---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d5a9cc4..d5dd593 100644
--- a/README.md
+++ b/README.md
@@ -392,7 +392,9 @@ python ./create_coords.py
 
 6. Now step into the Sei Framework and follow the steps in chromatin profile prediction and specifically run the following command below. The input-file will be the bed or fasta input file you download in step 3 which should be in the resources directory within the sei framework. For the genome, this example is geared towards hg19. You can do hg38 as well but you will need to make changes to earlier steps. Output directory is your choice. 
 
-    ```sh 1_sequence_prediction.sh --cuda```
+    ```
+    sh 1_sequence_prediction.sh --cuda
+    ```
 
 
 

From 3c137d6334519280abe7de977f0c426fd5495d4c Mon Sep 17 00:00:00 2001
From: Lucas Hendren
Date: Thu, 22 Feb 2024 01:15:45 -0800
Subject: [PATCH 6/6] Fixing formatting and instructions in readme

---
 README.md | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index d5dd593..d296bc6 100644
--- a/README.md
+++ b/README.md
@@ -404,12 +404,6 @@ python ./create_coords.py
 
 8. Now go back to hyena dna and, assuming you have already setup hyena dna, perform the following.
 
-    A. python -m train wandb=null experiment=hg19/chromatin_profile dataset.ref_genome_path=/path/to/fasta/hg19.ml.fa dataset.data_path=/path/to/chromatin_profile dataset.ref_genome_version=hg19
-
-    B. For paths to chromatin files, those will be the paths to the chromatin-profiles-hdf5. For the hg19 fa files, those can be in resources or found in your output directory with the chromatin profile. For genome version and experiment, this is for the hg19 experiment. 
-
-
-example chromatin profile run:
 ```
 python -m train wandb=null experiment=hg38/chromatin_profile dataset.ref_genome_path=/path/to/fasta/hg38.ml.fa dataset.data_path=/path/to/chromatin_profile dataset.ref_genome_version=hg38
 ```
@@ -417,6 +411,8 @@ python -m train wandb=null experiment=hg38/chromatin_profile dataset.ref_genome_
 - `dataset.ref_genome_path` # path to a human ref genome file (the input sequences)
 - `dataset.ref_genome_version` # the version of the ref genome (hg38 or hg19, we use hg38)
 - `dataset.data_path` # path to the labels of the dataset
+
+ For paths to chromatin files, those will be the paths to the chromatin-profiles-hdf5. For the hg19 fa files, those can be in resources or found in your output directory with the chromatin profile. For genome version and experiment, this is for the hg19 experiment.
 
 ### Species Classification
 
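
As written in patch 1/6, `create_coords.py` would likely fail on startup: it imports `Pandas` (the module name is lowercase `pandas`) and reads `sys.argv` without importing `sys`. It also never uses `path_to_deepsea_data_repo`. Below is a minimal sketch of the same logic with those points addressed. Several things here are assumptions rather than part of the patch: the `metadata_file`, `out_dir` and `sep` parameters, the `main` helper, and the choice to resolve both `data/deepsea_metadata.tsv` and the `debug_*.tsv` files against the repo directory given on the command line. The default `sep=","` mirrors the patch's plain `pd.read_csv(file, ...)` call; check it against the debug files that `build.py` actually writes.

```python
import sys
from pathlib import Path

import pandas as pd  # lowercase module name; `import Pandas` raises ModuleNotFoundError

COORD_COLS = ["Chr_No", "Start", "End"]


def create_coord_target_files(debug_file, name, metadata_file, out_dir=".", sep=","):
    """Write <name>_coords_targets.csv from one debug file saved by build.py.

    `metadata_file`, `out_dir` and `sep` are illustrative parameters that the
    original script does not have.
    """
    # Target columns are the values of the 'File accession' metadata column.
    target_cols = pd.read_csv(metadata_file, sep="\t")["File accession"].tolist()
    df = pd.read_csv(debug_file, sep=sep, usecols=target_cols + COORD_COLS, header=0)
    df = df.drop_duplicates().reset_index(drop=True)
    # Prefix every target column with 'y_', as the patch does.
    df = df.rename(columns={k: f"y_{k}" for k in target_cols})
    out_path = Path(out_dir) / f"{name}_coords_targets.csv"
    df.to_csv(out_path)  # keeps the index column, matching the patch
    return out_path


def main(repo_dir):
    """Assumed layout: metadata under <repo>/data, debug files in <repo>."""
    repo_dir = Path(repo_dir)
    metadata_file = repo_dir / "data" / "deepsea_metadata.tsv"
    for split, name in [("valid", "val"), ("test", "test"), ("train", "train")]:
        create_coord_target_files(repo_dir / f"debug_{split}.tsv", name, metadata_file)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: python create_coords.py /path/to/build-deepsea-training-dataset")
```

With this layout the script would be invoked as `python ./create_coords.py /path/to/build-deepsea-training-dataset`, and the three `*_coords_targets.csv` files land in the current directory, ready to be copied into `chromatin-profiles-hdf5` as step 7 describes.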