Merge pull request #2 from manulera/manu_todo_1

some comments and todos
JeffXiePL · Aug 21, 2024 · e940b61 · e940b61
2 parents f719456 + 3aacabd
commit e940b61
Show file tree

Hide file tree

Showing 2 changed files with 120 additions and 12 deletions.
diff --git a/docs/notebooks/Dseq_Features.ipynb b/docs/notebooks/Dseq_Features.ipynb
@@ -12,9 +12,15 @@
     "\n",
     "pydna offers many ways to easily view, add, extract, and write features into a Genbank file via the `Dseqrecord` class. Before working with features, you need to import the Genbank files (or other biological formats including FASTA, snapgene, EMBL) into python. Please refer to the Importing_Seqs page for a quick how-to tutorial.\n",
     "\n",
-    "After importing the file, we can checkout the list of features in a sequence using the following code:  \n",
+    "After importing the file, we can check out the list of features in a sequence using the following code:  \n",
     "\n",
-    "Note that all the following code in this page assumes that a Dseqrecord object has already been loaded/parsed, and that `Dseqrecord` has been imported from `pydna.dseqrecord`"
+    "Note that all the following code in this page assumes that a Dseqrecord object has already been loaded/parsed, and that `Dseqrecord` has been imported from `pydna.dseqrecord`\n",
+    "\n",
+    "manu_todo:\n",
+    "\n",
+    "* Use actual examples, not your_sequence... You can simply say that you are using an example file.\n",
+    "* Provide the code to produce the list of features that you show in the cell that starts with \"For the sample record...\". That will be quite useful for people to quickly see what they have. If it's long or you don't want to repeat it over and over, you can just create a function.\n",
+    "* Use descriptive variable names. E.g. \"file\" is not a file, it's a dseqrecord so you could call it \"dseqrecord\" or \"seq_record\" or something like that, or even my_sequence, example_sequence."
    ]
   },
   {
@@ -74,14 +80,30 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "manu_todo:\n",
+    "\n",
+    "* I am not sure what you want to do here. I can't run this without knowing which file you used as input, but cds_feature would be the entire CDS feature, not the first part of it's location. When you use the concrete example, I may be able to help you better.\n",
+    "\n",
     "If you have a CDS with parts (i.e CDS separated by introns or other genetic elements), you can show each part of the CDS as such. The `[0]` represents the first part of the CDS, which can be modified to show the second `[1]`, third `[2]`, etc. "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "NameError",
+     "evalue": "name 'file' is not defined",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
+      "Cell \u001b[0;32mIn[1], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m cds_feature \u001b[38;5;241m=\u001b[39m [f \u001b[38;5;28;01mfor\u001b[39;00m f \u001b[38;5;129;01min\u001b[39;00m \u001b[43mfile\u001b[49m\u001b[38;5;241m.\u001b[39mfeatures \u001b[38;5;28;01mif\u001b[39;00m f\u001b[38;5;241m.\u001b[39mtype \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCDS\u001b[39m\u001b[38;5;124m\"\u001b[39m][\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m      2\u001b[0m \u001b[38;5;28mprint\u001b[39m(cds_feature\u001b[38;5;241m.\u001b[39mlocation)\n",
+      "\u001b[0;31mNameError\u001b[0m: name 'file' is not defined"
+     ]
+    }
+   ],
    "source": [
     "cds_feature = [f for f in file.features if f.type == \"CDS\"][0]\n",
     "print(cds_feature.location)"
@@ -100,14 +122,31 @@
    "source": [
     "## Adding Features and Qualifiers\n",
     "\n",
+    "manu_todo:\n",
+    "\n",
+    "* I would not recommend in general adding features using `add_feature`. I prefer to use the second method you showed for the compound locations for all cases. Sorry if this was not clear. You can use your same examples with this syntax instead using the `SimpleLocation`. You already show what to do with qualifiers, type, etc.\n",
+    "* The method is convenient for origin-spannign features as you show there, but I feel like it's better not to show it in the tutorial ( I actually never use it).\n",
+    "\n",
     "Adding a new feature to describe a region of interest, for instance for a region that you would like to perform a PCR, to a file is easy. pydna provides the `add_feature` method to add a misc type feature to a file, given the starting base and the ending base. Note that the base numbering follow biological convention, rather than python numbering methods. For instance, the following code adds a new feature from the 2nd to the 4th nucleotide."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "NameError",
+     "evalue": "name 'file' is not defined",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
+      "Cell \u001b[0;32mIn[2], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mfile\u001b[49m\u001b[38;5;241m.\u001b[39madd_feature(\u001b[38;5;241m2\u001b[39m,\u001b[38;5;241m4\u001b[39m)\n\u001b[1;32m      3\u001b[0m \u001b[38;5;66;03m#Confirms that the new feature has been added\u001b[39;00m\n\u001b[1;32m      4\u001b[0m file\u001b[38;5;241m.\u001b[39mfeatures\n",
+      "\u001b[0;31mNameError\u001b[0m: name 'file' is not defined"
+     ]
+    }
+   ],
    "source": [
     "file.add_feature(2,4)\n",
     "\n",
@@ -303,7 +342,10 @@
    "source": [
     "This method is convenient for checking-out a brief overview of each feature, without reading through an entire sequence record.  \n",
     "\n",
-    "To look for specific features using their qualifiers, we can search through the features list for a specific qualifier. For instance, if I want to file my feature with the gene name of example_gene, I can use the following code:"
+    "To look for specific features using their qualifiers, we can search through the features list for a specific qualifier. For instance, if I want to file my feature with the gene name of example_gene, I can use the following code:\n",
+    "\n",
+    "manu_todo:\n",
+    "* Here it will be more helpful with the concrete example."
    ]
   },
   {
@@ -329,6 +371,10 @@
    "source": [
     "### Removing Features\n",
     "\n",
+    "manu_todo:\n",
+    "\n",
+    "* I would not mention this as a limitation, it's normal practice to filter lists like this, so people are used to do this kind of thing.\n",
+    "\n",
     "Unfortunately, pydna does not provide a built-in method to remove features from a features list. However, we can search for the feature that we would like to remove using the types or qualififers to edit a feature list. \n",
     "\n",
     "For instance, we can modify the features list to exclude all CDS:"
@@ -376,7 +422,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.4"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,

diff --git a/docs/notebooks/Importing_Seqs.ipynb b/docs/notebooks/Importing_Seqs.ipynb
@@ -9,14 +9,33 @@
     "\n",
     "pydna can be used to work with FASTA, Genbank, EMBL, and snapgene files (.fasta, .gb, .embl, .dna). Specifically, pydna provides ways to read these file types, and store them as a record (I.e `Dseqrecord` object) that one can view and work with. \n",
     "\n",
-    "To import files into pydna is simple. pydna provides the `parse` method to view a list of sequence files, when given the path to the file from your computer. The following code shows an example of how to use the `parse` function to import a hypothetical downloaded FASTA file, but other types of files can also be imported in the same way."
+    "To import files into pydna is simple. pydna provides the `parse` method to read all sequences in a file into a list. As an input, it can take the path to a file from your computer, or a string with the file content. The following code shows an example of how to use the `parse` function to import a hypothetical downloaded FASTA file, but other types of files can also be imported in the same way.\n",
+    "\n",
+    "manu_todo:\n",
+    "* Use an example in which you read from a string (e.g. write a string with a FASTA file and read it).\n",
+    "* Make two sections here: read from file and read from string. Read from sting is an important use-case if people are for example reading sequences that they may get from the genbank API.\n",
+    "* Do not use pseudopaths (/your/absolute...) in the examples. Instead, simply put an actual file. (in other words the second example is enough)\n",
+    "* Don't use absolute paths, instead of `/Users/PeilunXie/Desktop/pydna/docs/notebooks/sequence.gb` use `./sequence.gb` or `../etc` if it's in a directory above the current one. You can read about relative paths or ask chatgpt.\n",
+    "* I added an extra info section for some more control of the parsing"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 4,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "IndexError",
+     "evalue": "list index out of range",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mIndexError\u001b[0m                                Traceback (most recent call last)",
+      "Cell \u001b[0;32mIn[4], line 5\u001b[0m\n\u001b[1;32m      3\u001b[0m file_path \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m/your/absolute/path/to/the/file/file.fasta\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m      4\u001b[0m files \u001b[38;5;241m=\u001b[39m parse(file_path)\n\u001b[0;32m----> 5\u001b[0m \u001b[43mfiles\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[38;5;241m.\u001b[39mformat(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfasta\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
+      "\u001b[0;31mIndexError\u001b[0m: list index out of range"
+     ]
+    }
+   ],
    "source": [
     "from pydna.parsers import parse\n",
     "\n",
@@ -234,6 +253,49 @@
    "source": [
     "Now, you can work with the sequence record using pydna, using the `Dseqrecord` class. `Dseqrecord` provides way to select regions of interest on the sequence, adding new features to the record, removing features, and creating new `Dseqrecord` objects to store and export your changes. Please refer to the `Dseq_Features` notebook for more information."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Extra info\n",
+    "\n",
+    "Note that pydna's `parse` guesses whether the argument passed is a file path or a string, and also guesses the file type based on the content, so it can give unexpected behaviour if your files are not well formatted. To have more control over the parsing of sequences, you can use biopython's `parse` from `Bio.SeqIO`, and then instantiate a Dseqrecord from the biopython's `SeqRecord`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Dseqrecord(-5028)"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from Bio.SeqIO import parse as seqio_parse\n",
+    "from pydna.dseqrecord import Dseqrecord\n",
+    "\n",
+    "file_path = './sequence.gb'\n",
+    "\n",
+    "# Extract the first Seqrecord of the SeqIO.parse iterator\n",
+    "seq_record = next(seqio_parse(file_path, 'genbank'))\n",
+    "\n",
+    "# This is how circularity is stored in biopython's seqrecord\n",
+    "is_circular = 'topology' in seq_record.annotations.keys() and seq_record.annotations['topology'] == 'circular'\n",
+    "\n",
+    "# Convert into Dseqrecord\n",
+    "dseq_record = Dseqrecord(seq_record, circular=is_circular)\n",
+    "\n",
+    "dseq_record"
+   ]
   }
  ],
  "metadata": {
@@ -252,7 +314,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.4"
+   "version": "3.12.3"
   }
  },
  "nbformat": 4,