diff --git a/docs/notebooks/Dseq_Features.ipynb b/docs/notebooks/Dseq_Features.ipynb index d01241d3..e25158d0 100644 --- a/docs/notebooks/Dseq_Features.ipynb +++ b/docs/notebooks/Dseq_Features.ipynb @@ -6,13 +6,13 @@ "source": [ "# Working with Features using the Dseqrecord class\n", "\n", - "> Visit the full library documentation [here](https://bjornfjohansson.github.io/pydna/)\n", + "> Before working with features, check how to import sequences from files in the [Importing_Seqs notebook](./Importing_Seqs.ipynb).\n", + ">\n", + "> For full library documentation, visit [here](https://bjornfjohansson.github.io/pydna/).\n", "\n", - "Features are important components of a .gb file, describing the key biological properties of the sequence. In Genbank, features \"include genes, gene products, as well as regions of biological significance reported in the sequence.\" (See [here](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) for a description of a Genbank file and associated terminologies/annotations) Examples include coding sequences (CDS), introns, promoters, etc.\n", + "Some sequence file formats (like Genbank) include features, describing key biological properties of sequence regions. In Genbank, features \"include genes, gene products, as well as regions of biological significance reported in the sequence.\" (See [here](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) for a description of a Genbank file and associated terminologies/annotations) Examples include coding sequences (CDS), introns, promoters, etc.\n", "\n", - "pydna offers many ways to easily view, add, extract, and write features into a Genbank file via the `Dseqrecord` class. Before working with features, you need to import the Genbank files (or other biological formats including FASTA, snapgene, EMBL) into python. Please refer to the Importing_Seqs page for a quick how-to tutorial.\n", - "\n", - "After importing the file, we can check out the list of features in a sequence using the following code. An example `Dseqrecord` object has been loaded using the GenBank sample sequence [here](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/)." + "pydna offers many ways to easily view, add, extract, and write features into a Genbank file via the `Dseqrecord` class. After reading a file into a `Dseqrecord` object, we can check out the list of features in the record using the following code. This example uses the sample record [U49845](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/)." ] }, { @@ -94,12 +94,12 @@ "from pydna.parsers import parse\n", "\n", "#Import your file into python. \n", - "file_path = \"./sequence.gb\"\n", + "file_path = \"./U49845.gb\"\n", "records = parse(file_path)\n", - "my_record = records[0]\n", + "sample_record = records[0]\n", "\n", "# List all features\n", - "for feature in my_record.features:\n", + "for feature in sample_record.features:\n", " print(feature)" ] }, @@ -116,7 +116,12 @@ "source": [ "## Adding Features and Qualifiers\n", "\n", - "Adding a new feature to describe a region of interest, for instance for a region that you would like to perform a PCR, to a file involves the parent module `Bio.SeqFeature`. `Bio.SeqFeature` provides the `FeatureLocation` method to add a feature to a file, given the starting base, ending base, and the feature type. Note that the base numbering follow biological convention, rather than python numbering methods. For instance, the following code adds a new feature from the 2nd to the 5th nucleotide, of the mRNA type." + "To add new feature to describe a region of interest to a record, for instance a region that you would like to perform a PCR, you need to create a `SeqFeature` (sequence feature). The minimal information required is:\n", + "* A `FeatureLocation`: position of the feature in the sequence.\n", + "* The `type` of feature you want to add.\n", + "\n", + "\n", + "🚨🚨 **VERY IMPORTANT** 🚨🚨. Note that `FeatureLocation`s are like python ranges (zero-based open intervals), whereas in GenBank files, locations are one-based closed intervals. For instance, the following code adds a new feature from the 2nd to the 5th nucleotide (`FeatureLocation(3, 15)`), of the `gene` type, but in the GenBank file will be represented as `4..15`." ] }, { @@ -125,54 +130,58 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "SeqFeature(SimpleLocation(ExactPosition(2), ExactPosition(5)), type='mRNA')" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "type: gene\n", + "location: [3:15]\n", + "qualifiers:\n", + "\n", + "LOCUS name 19 bp DNA linear UNK 01-JAN-1980\n", + "DEFINITION description.\n", + "ACCESSION id\n", + "VERSION id\n", + "KEYWORDS .\n", + "SOURCE .\n", + " ORGANISM .\n", + " .\n", + "FEATURES Location/Qualifiers\n", + " gene 4..15\n", + "ORIGIN\n", + " 1 aaaatgcgta cgtgaacgt\n", + "//\n" + ] } ], "source": [ "from Bio.SeqFeature import FeatureLocation, SeqFeature\n", "\n", - "# Define the locations of the CDS\n", - "location = FeatureLocation(2, 5)\n", + "# Create a dummy record\n", + "dummy_record = Dseqrecord(\"aaaATGCGTACGTGAacgt\")\n", + "\n", + "# Define the locations of a CDS\n", + "location = FeatureLocation(3, 15)\n", "\n", "# Create a SeqFeature with the type mRNA\n", - "my_feature = SeqFeature(location=location, type=\"mRNA\")\n", + "my_feature = SeqFeature(location=location, type=\"gene\")\n", "\n", - "# Add my_feature to my_record with .append\n", - "my_record.features.append(my_feature)\n", + "# Add my_feature to dummy_record with .append\n", + "dummy_record.features.append(my_feature)\n", "\n", "# Confirm that my_feature has been added\n", - "my_record.features[-1]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that a new feature is always added to the last position of the features list. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`pydna` and `Bio.SeqFeature` suppports all the conventional feature types through the `type` parameters. A non-exhaustive list include gene, CDS, promoter, exon, intron, 5' UTR, 3' UTR, terminator, enhancer, and RBS. You can also define custom features, which could be useful for synthetic biology applications. For instance, you might want to have Bio_brick or spacer features to describe a synthetic standardised plasmid construct. \n", - " \n", - "It is important to note that while `pydna` and `Bio.SeqFeature` does not restrict the feature types you can use, sticking to standard types helps maintain compatibility with other bioinformatics tools and databases. I recommend referring to the official [GenBank_Feature_Table](https://www.insdc.org/submitting-standards/feature-table/#2), if in doubt." + "print(dummy_record.features[-1])\n", + "\n", + "# Print the feature in GenBank format (see how the location is `4..15`)\n", + "print(dummy_record.format(\"genbank\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "To make a note of what this sequence is, we can add a qualifier to the new feature by accessing the `qualifiers` dictionary. This dictionary can be accessed writing your notes as a dictionary key value pair, under the `qualifiers` parameter of the `SeqFeature` class object. \n", - "For instance, if I would like to note a new feature of type 'domain', between 24-56 bases as my region of interest, I can instantiate the `SeqFeature` class object as such." + "To give further information about a feature, we can add a qualifier using the `qualifiers` property of `SeqFeature`, which contains a dictionary of qualifiers. For instance, if I would like to note a new feature of type 'domain', between 3-9 bases as my region of interest, I can instantiate the `SeqFeature` class object as such.\n", + "\n", + "> Note that a new feature is always added to the last position of the features list." ] }, { @@ -181,41 +190,97 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "SeqFeature(SimpleLocation(ExactPosition(24), ExactPosition(56)), type='domain', qualifiers=...)" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + ">> Feature was added:\n", + "type: domain\n", + "location: [3:9]\n", + "qualifiers:\n", + " Key: Note, Value: ['Region of interest']\n", + "\n", + "\n", + ">> GenBank format:\n", + "LOCUS name 19 bp DNA linear UNK 01-JAN-1980\n", + "DEFINITION description.\n", + "ACCESSION id\n", + "VERSION id\n", + "KEYWORDS .\n", + "SOURCE .\n", + " ORGANISM .\n", + " .\n", + "FEATURES Location/Qualifiers\n", + " gene 4..15\n", + " domain 4..9\n", + " /Note=\"Region of interest\"\n", + "ORIGIN\n", + " 1 aaaatgcgta cgtgaacgt\n", + "//\n" + ] } ], "source": [ - "location = FeatureLocation(24, 56)\n", + "location = FeatureLocation(3, 9)\n", "\n", "# Create a SeqFeature with a qualifier\n", - "my_feature2 = SeqFeature(location=location, type=\"domain\", qualifiers={\"Note\":\"Region of interest\"})\n", + "my_feature2 = SeqFeature(location=location, type=\"domain\", qualifiers={\"Note\": [\"Region of interest\"]})\n", "\n", "# Add my_feature to my_record with .append\n", - "my_record.features.append(my_feature2)\n", + "dummy_record.features.append(my_feature2)\n", "\n", "# Confirm that my_feature has been added\n", - "my_record.features[-1]" + "print('>> Feature was added:')\n", + "print(dummy_record.features[-1])\n", + "print()\n", + "\n", + "# Print the feature in GenBank format\n", + "print('>> GenBank format:')\n", + "print(dummy_record.format(\"genbank\"))\n", + "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "On the feature list of the .gb file, the new feature is reflected as so:\n", - "\n", - "type: CDS \n", - "location: [24:56] \n", - "qualifiers: \n", - " Key: Note, Value: Region of interest \n", - "\n", - "Note that `Bio.SeqFeatures` does not automatically assume a sequence strand. If you would like to refer to a sequence on the positive or minus strand, you can add a parameter in `FeatureLocation` specifying `strand=+1` or `strand=-1`. " + "**🤔 Best practices for qualifiers:**\n", + "\n", + "The values in the `qualifiers` dictionary should be lists. The reason for this is that in a GenBank file, a single feature can have multiple values for a single qualifier. Below is a real world of the ase1 CDS example from the _S. pombe_ genome in EMBL format:\n", + "\n", + "```\n", + "FT CDS join(1878362..1878785,1878833..1880604)\n", + "FT /colour=2\n", + "FT /primary_name=\"ase1\"\n", + "FT /product=\"antiparallel microtubule cross-linking factor\n", + "FT Ase1\"\n", + "FT /systematic_id=\"SPAPB1A10.09\"\n", + "FT /controlled_curation=\"term=species distribution, conserved\n", + "FT in eukaryotes; date=20081110\"\n", + "FT /controlled_curation=\"term=species distribution, conserved\n", + "FT in metazoa; date=20081110\"\n", + "FT /controlled_curation=\"term=species distribution, conserved\n", + "FT in vertebrates; date=20081110\"\n", + "FT /controlled_curation=\"term=species distribution,\n", + "FT predominantly single copy (one to one); date=20081110\"\n", + "FT /controlled_curation=\"term=species distribution, conserved\n", + "FT in fungi; date=20081110\"\n", + "FT /controlled_curation=\"term=species distribution, conserved\n", + "FT in eukaryotes only; date=20081110\"\n", + "```\n", + "\n", + "Note how there are several `controlled_curation` qualifiers, therefore it makes sense to store them as a list.\n", + "\n", + "By default, you can add any type of object in the qualifiers dictionary, and most things will work if you add a string. However, you risk overwriting the existing value for a qualifier, so best practice is:\n", + "1. Check if the qualifier already exists using `if \"qualifier_name\" in feature.qualifiers`\n", + "2. If it exists, append to the existing list of values using `feature.qualifiers[\"qualifier_name\"].append(\"new_value\")`\n", + "3. If it does not exist, add it to the qualifiers dictionary using `feature.qualifiers[\"qualifier_name\"] = [\"new_value\"]`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that `Bio.SeqFeatures` does not automatically assume a sequence strand for the feature. If you would like to refer to a feature on the positive or minus strand, you can add a parameter in `FeatureLocation` specifying `strand=+1` or `strand=-1`. " ] }, { @@ -224,25 +289,46 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "SeqFeature(SimpleLocation(ExactPosition(134), ExactPosition(520), strand=-1), type='CDS', qualifiers=...)" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "type: domain\n", + "location: [15:19](-)\n", + "qualifiers:\n", + " Key: gene, Value: ['example_domain']\n", + "\n", + "LOCUS name 19 bp DNA linear UNK 01-JAN-1980\n", + "DEFINITION description.\n", + "ACCESSION id\n", + "VERSION id\n", + "KEYWORDS .\n", + "SOURCE .\n", + " ORGANISM .\n", + " .\n", + "FEATURES Location/Qualifiers\n", + " gene 4..15\n", + " domain 4..9\n", + " /Note=\"Region of interest\"\n", + " domain complement(16..19)\n", + " /gene=\"example_domain\"\n", + "ORIGIN\n", + " 1 aaaatgcgta cgtgaacgt\n", + "//\n" + ] } ], "source": [ "#Create a location specifying the minus strand\n", - "location = FeatureLocation(134, 520, strand=-1)\n", + "location = FeatureLocation(15, 19, strand=-1)\n", + "\n", + "my_feature3 = SeqFeature(location=location, type=\"domain\", qualifiers={\"gene\":[\"example_domain\"]})\n", "\n", - "my_feature3 = SeqFeature(location=location, type=\"CDS\", qualifiers={\"gene\":\"example_CDS\"})\n", + "dummy_record.features.append(my_feature3)\n", "\n", - "my_record.features.append(my_feature3)\n", + "print(dummy_record.features[-1])\n", "\n", - "my_record.features[-1]" + "print(dummy_record.format(\"genbank\"))\n", + "\n" ] }, { @@ -251,9 +337,9 @@ "source": [ "### Adding a Feature with Parts\n", "\n", - "To add a feature with parts, like a CDS with introns, we need to apply another class and methods from `Bio.SeqFeature`. Specifically, you can achieve this by creating a `CompoundLocation` object, and then adding it to a `SeqFeature` object as normal. \n", + "To add a feature with parts, like a CDS with introns, we need to use a `CompoundLocation` object when creating a `SeqFeature`.\n", "\n", - "The example code belows adds a CDS with two parts, between 5-15bp and 20-30bp, named \"example gene\" in the qualifiers, to my features list. " + "The example code below adds a CDS with two parts, between 3-9bp and 12-15bp, to my features list. In a real-world scenario this would represent a CDS with an intron that skips the `ACG` codon: ATGCGT~~ACG~~TGA" ] }, { @@ -262,46 +348,102 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "SeqFeature(CompoundLocation([SimpleLocation(ExactPosition(5), ExactPosition(15)), SimpleLocation(ExactPosition(20), ExactPosition(30))], 'join'), type='CDS', qualifiers=...)" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "type: CDS\n", + "location: join{[3:9], [12:15]}\n", + "qualifiers:\n", + " Key: gene, Value: ['example_gene']\n", + "\n", + "LOCUS name 19 bp DNA linear UNK 01-JAN-1980\n", + "DEFINITION description.\n", + "ACCESSION id\n", + "VERSION id\n", + "KEYWORDS .\n", + "SOURCE .\n", + " ORGANISM .\n", + " .\n", + "FEATURES Location/Qualifiers\n", + " gene 4..15\n", + " domain 4..9\n", + " /Note=\"Region of interest\"\n", + " domain complement(16..19)\n", + " /gene=\"example_domain\"\n", + " CDS join(4..9,13..15)\n", + " /gene=\"example_gene\"\n", + "ORIGIN\n", + " 1 aaaatgcgta cgtgaacgt\n", + "//\n" + ] } ], "source": [ - "from pydna.dseqrecord import Dseqrecord\n", - "from Bio.SeqFeature import SeqFeature, FeatureLocation, CompoundLocation\n", + "from Bio.SeqFeature import CompoundLocation\n", "\n", "# Define the locations of the CDS\n", - "locations = [FeatureLocation(5, 15), FeatureLocation(20, 30)]\n", + "locations = [FeatureLocation(3, 9), FeatureLocation(12, 15)]\n", "\n", "# Create a compound location from these parts\n", "compound_location = CompoundLocation(locations)\n", "\n", "# Create a SeqFeature with this compound location, including type and qualifiers. \n", - "cds_feature = SeqFeature(location=compound_location, type=\"CDS\", qualifiers={\"gene\": \"example_gene\"})\n", + "cds_feature = SeqFeature(location=compound_location, type=\"CDS\", qualifiers={\"gene\": [\"example_gene\"]})\n", "\n", "# Add the feature to the Dseqrecord\n", - "my_record.features.append(cds_feature)\n", + "dummy_record.features.append(cds_feature)\n", + "\n", + "print(dummy_record.features[-1])\n", + "\n", + "print(dummy_record.format(\"genbank\"))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can even extract a protein record as follows (see how the protein sequence is `MR`, skipping the intron):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ID: id\n", + "Name: name\n", + "Description: description\n", + "Number of features: 0\n", + "/molecule_type=DNA\n", + "ProteinSeq('MR')\n" + ] + } + ], + "source": [ + "sub_record = dummy_record.features[-1].extract(dummy_record)\n", "\n", - "my_record.features[-1]" + "print(sub_record.translate())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - " Our added feature looks like this on the features list, as appropriate: \n", + "### Standard Feature Types and Qualifiers\n", "\n", - "type: CDS \n", - "location: join{[5:15], [20:30]} . \n", - "qualifiers: \n", - " Key: gene, Value: example_gene \n", + "`pydna` and `Bio.SeqFeature` suppports all the conventional feature types through the `type` parameters. A non-exhaustive list include gene, CDS, promoter, exon, intron, 5' UTR, 3' UTR, terminator, enhancer, and RBS. You can also define custom features, which could be useful for synthetic biology applications. For instance, you might want to have Bio_brick or spacer features to describe a synthetic standardised plasmid construct. \n", "\n", + "It is important to note that while `pydna` and `Bio.SeqFeature` does not restrict the feature types you can use, sticking to standard types helps maintain compatibility with other bioinformatics tools and databases. Please refer to the official [GenBank_Feature_Table](https://www.insdc.org/submitting-standards/feature-table/#2), that lists the standard feature types and their associated qualifiers." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "Further documentation for `SeqFeature`, `CompoundLocation`, and `FeatureLocation` can be found in the `SeqFeature` module [here](https://biopython.org/docs/1.75/api/Bio.SeqFeature.html). " ] }, @@ -311,15 +453,17 @@ "source": [ "### Handling Origin Spanning Features\n", "\n", - "An origin spanning feature is a special type of features that crosses over a circular sequence's origin. In pydna, such a feature is represented as a feature with parts, joining the part of the sequence before the origin and after the origin. They can be added using `CompoundLocation` as normal. \n", + "An origin spanning feature is a special type of feature that crosses over a circular sequence's origin. In pydna, such a feature is represented as a feature with parts, joining the part of the sequence immediately before the origin and immediately after the origin. They can be added using `CompoundLocation` as normal. \n", "\n", "An origin spanning feature, between base 19 to base 6, in a 25bp long circular sequence, is represented like so: \n", - " \n", + "\n", + "```\n", "type: gene \n", "location: join{[19:25](+), [0:6](+)} \n", "qualifiers: gene, Value: example_gene \n", - " \n", - "The code uses the `add_feature` method as normal." + "```\n", + "\n", + "This feature will be displayed as a single feature in SnapGene viewer and Benchling, since they support this convention." ] }, { @@ -328,22 +472,54 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "SeqFeature(CompoundLocation([SimpleLocation(ExactPosition(4023), ExactPosition(4037)), SimpleLocation(ExactPosition(0), ExactPosition(6))], 'join'), type='misc', qualifiers=...)" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + ">> Feature:\n", + "type: misc\n", + "location: join{[19:25], [0:6]}\n", + "qualifiers:\n", + " Key: gene, Value: ['example origin spanning gene']\n", + "\n", + ">> Feature sequence:\n", + "ATGCGTACGTGA\n", + "\n", + ">> GenBank format:\n", + "LOCUS name 25 bp DNA circular UNK 01-JAN-1980\n", + "DEFINITION description.\n", + "ACCESSION id\n", + "VERSION id\n", + "KEYWORDS .\n", + "SOURCE .\n", + " ORGANISM .\n", + " .\n", + "FEATURES Location/Qualifiers\n", + " misc join(20..25,1..6)\n", + " /gene=\"example origin spanning gene\"\n", + "ORIGIN\n", + " 1 acgtgaaaaa aaaaaaaaaa tgcgt\n", + "//\n" + ] } ], "source": [ - "location = [FeatureLocation(4023, 4037), FeatureLocation(0, 6)]\n", + "circular_record = Dseqrecord('ACGTGAaaaaaaaaaaaaaATGCGT', circular=True)\n", + "\n", + "location = [FeatureLocation(19,25), FeatureLocation(0, 6)]\n", "ori_feat_location = CompoundLocation(location)\n", - "ori_feature = SeqFeature(location=ori_feat_location, type=\"misc\", qualifiers={\"gene\":\"example origin spanning gene\"})\n", - "my_record.features.append(ori_feature)\n", - "my_record.features[-1]" + "ori_feature = SeqFeature(location=ori_feat_location, type=\"misc\", qualifiers={\"gene\": [\"example origin spanning gene\"]})\n", + "circular_record.features.append(ori_feature)\n", + "\n", + "print('>> Feature:')\n", + "print(circular_record.features[-1])\n", + "\n", + "# Note how the feature sequence is extracted properly across the origin.\n", + "print('>> Feature sequence:')\n", + "print(circular_record.features[-1].extract(circular_record).seq)\n", + "print()\n", + "\n", + "print('>> GenBank format:')\n", + "print(circular_record.format(\"genbank\"))" ] }, { @@ -376,26 +552,21 @@ "| 6 | nd | <-- | <3299 | >4037 | 738 | gene | yes |\n", "| 7 | nd | <-- | <3299 | >4037 | 738 | mRNA | yes |\n", "| 8 | nd | <-- | 3299 | 4037 | 738 | CDS | yes |\n", - "| 9 | nd | --- | 2 | 5 | 3 | mRNA | no |\n", - "| 10 | nd | --- | 24 | 56 | 32 | domain | no |\n", - "| 11 | nd | <-- | 134 | 520 | 386 | CDS | no |\n", - "| 12 | nd | --- | 5 | 30 | 20 | CDS | no |\n", - "| 13 | nd | --- | 0 | 4037 | 20 | misc | no |\n", "+-----+------------------+-----+-------+-------+------+--------+------+\n" ] } ], "source": [ - "print(my_record.list_features())" + "print(sample_record.list_features())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This method is convenient for checking-out a brief overview of each feature, without reading through an entire sequence record. \n", + "This method is convenient for checking-out a brief overview of each feature, without reading through an entire sequence record.\n", "\n", - "Alternatively, we can look for specific features using their qualifiers. For instance, if I want to find my feature with the gene name of example_gene (This was the example feature for `CompoundLocation`), I can use the following code:" + "Alternatively, we can look for specific features using their qualifiers. For instance:" ] }, { @@ -407,13 +578,70 @@ "name": "stdout", "output_type": "stream", "text": [ - "[SeqFeature(CompoundLocation([SimpleLocation(ExactPosition(5), ExactPosition(15)), SimpleLocation(ExactPosition(20), ExactPosition(30))], 'join'), type='CDS', qualifiers=...)]\n" + "Getting all CDS features:\n", + "type: CDS\n", + "location: [<0:206](+)\n", + "qualifiers:\n", + " Key: codon_start, Value: ['3']\n", + " Key: product, Value: ['TCP1-beta']\n", + " Key: protein_id, Value: ['AAA98665.1']\n", + " Key: translation, Value: ['SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM']\n", + "\n", + "type: CDS\n", + "location: [686:3158](+)\n", + "qualifiers:\n", + " Key: codon_start, Value: ['1']\n", + " Key: gene, Value: ['AXL2']\n", + " Key: note, Value: ['plasma membrane glycoprotein']\n", + " Key: product, Value: ['Axl2p']\n", + " Key: protein_id, Value: ['AAA98666.1']\n", + " Key: translation, Value: ['MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFTFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIACGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLNNPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQSQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDSYGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTKHRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRLVDFSNKSNVNVGQVKDIHGRIPEML']\n", + "\n", + "type: CDS\n", + "location: [3299:4037](-)\n", + "qualifiers:\n", + " Key: codon_start, Value: ['1']\n", + " Key: gene, Value: ['REV7']\n", + " Key: product, Value: ['Rev7p']\n", + " Key: protein_id, Value: ['AAA98667.1']\n", + " Key: translation, Value: ['MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESIFGSLF']\n", + "\n" ] } ], "source": [ - "gene = [f for f in my_record.features if \"gene\" in f.qualifiers and f.qualifiers[\"gene\"] == \"example_gene\"]\n", - "print(gene)" + "# Filter based on feature type\n", + "print('Getting all CDS features:')\n", + "cds_features = [f for f in sample_record.features if f.type == \"CDS\"]\n", + "for feature in cds_features:\n", + " print(feature)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "type: gene\n", + "location: [<3299:>4037](-)\n", + "qualifiers:\n", + " Key: gene, Value: ['REV7']\n", + "\n" + ] + } + ], + "source": [ + "# Find a particular feature by its qualifier (e.g. gene name)\n", + "rev7_cds_feature = next(f for f in sample_record.features if \n", + " f.type == \"gene\" and\n", + " \"gene\" in f.qualifiers and \"REV7\" in f.qualifiers[\"gene\"]\n", + " )\n", + "\n", + "print(rev7_cds_feature)\n" ] }, { @@ -475,29 +703,15 @@ "qualifiers:\n", " Key: gene, Value: ['REV7']\n", " Key: product, Value: ['Rev7p']\n", - "\n", - "type: mRNA\n", - "location: [2:5]\n", - "qualifiers:\n", - "\n", - "type: domain\n", - "location: [24:56]\n", - "qualifiers:\n", - " Key: Note, Value: Region of interest\n", - "\n", - "type: misc\n", - "location: join{[4023:4037], [0:6]}\n", - "qualifiers:\n", - " Key: gene, Value: example origin spanning gene\n", "\n" ] } ], "source": [ "#Remove all CDS type features from my feature list\n", - "my_record.features = [f for f in my_record.features if not (f.type == \"CDS\")]\n", + "sample_record.features = [f for f in sample_record.features if not (f.type == \"CDS\")]\n", "\n", - "for feature in my_record.features:\n", + "for feature in sample_record.features:\n", " print(feature)" ] }, @@ -540,42 +754,22 @@ "qualifiers:\n", " Key: gene, Value: ['AXL2']\n", " Key: product, Value: ['Axl2p']\n", - "\n", - "type: gene\n", - "location: [<3299:>4037](-)\n", - "qualifiers:\n", - " Key: gene, Value: ['REV7']\n", - "\n", - "type: mRNA\n", - "location: [<3299:>4037](-)\n", - "qualifiers:\n", - " Key: gene, Value: ['REV7']\n", - " Key: product, Value: ['Rev7p']\n", - "\n", - "type: mRNA\n", - "location: [2:5]\n", - "qualifiers:\n", - "\n", - "type: domain\n", - "location: [24:56]\n", - "qualifiers:\n", - " Key: Note, Value: Region of interest\n", - "\n", - "type: misc\n", - "location: join{[4023:4037], [0:6]}\n", - "qualifiers:\n", - " Key: gene, Value: example origin spanning gene\n", "\n" ] } ], "source": [ - "#Exclude example_gene from my feature list\n", - "my_record.features = [f for f in my_record.features if not (f.qualifiers == {\"gene\": \"example_gene\"})]\n", + "#Exclude REV7 from my feature list\n", + "sample_record.features = [f for f in sample_record.features if not ('gene' in f.qualifiers and 'REV7' in f.qualifiers['gene'])]\n", "\n", - "for feature in my_record.features:\n", + "for feature in sample_record.features:\n", " print(feature)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] } ], "metadata": { @@ -594,7 +788,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.12.5" } }, "nbformat": 4, diff --git a/docs/notebooks/Example_Gibson.ipynb b/docs/notebooks/Example_Gibson.ipynb index 03538540..8c514bf6 100644 --- a/docs/notebooks/Example_Gibson.ipynb +++ b/docs/notebooks/Example_Gibson.ipynb @@ -259,8 +259,8 @@ "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[26], line 6\u001b[0m\n\u001b[1;32m 4\u001b[0m pcr_product_F2 \u001b[38;5;241m=\u001b[39m pcr(F2_For, F2_Rev, gene_docs[\u001b[38;5;241m0\u001b[39m], limit\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m40\u001b[39m)\n\u001b[1;32m 5\u001b[0m pcr_product_F3 \u001b[38;5;241m=\u001b[39m pcr(F3_For, F3_Rev, gene_docs[\u001b[38;5;241m0\u001b[39m], limit\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m40\u001b[39m)\n\u001b[0;32m----> 6\u001b[0m pcr_product_BAC \u001b[38;5;241m=\u001b[39m \u001b[43mpcr\u001b[49m\u001b[43m(\u001b[49m\u001b[43mBACF3_Rev\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mBACF1_For\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mpCC1BAC_docs\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlimit\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m69\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 8\u001b[0m \u001b[38;5;66;03m# Printing out the PCR results\u001b[39;00m\n\u001b[1;32m 10\u001b[0m \u001b[38;5;28mprint\u001b[39m(pcr_product_F1\u001b[38;5;241m.\u001b[39mformat(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mgb\u001b[39m\u001b[38;5;124m\"\u001b[39m))\n", - "File \u001b[0;32m~/Desktop/pydna/src/pydna/amplify.py:523\u001b[0m, in \u001b[0;36mpcr\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 521\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m anneal_primers\u001b[38;5;241m.\u001b[39mproducts[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m 522\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(anneal_primers\u001b[38;5;241m.\u001b[39mproducts) \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m0\u001b[39m:\n\u001b[0;32m--> 523\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNo PCR product! \u001b[39m\u001b[38;5;132;01m{\u001b[39;00manneal_primers\u001b[38;5;241m.\u001b[39mreport()\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 524\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPCR not specific! \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mformat\u001b[39m(anneal_primers\u001b[38;5;241m.\u001b[39mreport())\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n", + "Cell \u001b[0;32mIn[5], line 6\u001b[0m\n\u001b[1;32m 4\u001b[0m pcr_product_F2 \u001b[38;5;241m=\u001b[39m pcr(F2_For, F2_Rev, gene_docs[\u001b[38;5;241m0\u001b[39m], limit\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m40\u001b[39m)\n\u001b[1;32m 5\u001b[0m pcr_product_F3 \u001b[38;5;241m=\u001b[39m pcr(F3_For, F3_Rev, gene_docs[\u001b[38;5;241m0\u001b[39m], limit\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m40\u001b[39m)\n\u001b[0;32m----> 6\u001b[0m pcr_product_BAC \u001b[38;5;241m=\u001b[39m \u001b[43mpcr\u001b[49m\u001b[43m(\u001b[49m\u001b[43mBACF1_For\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mBACF3_Rev\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mpCC1BAC_docs\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlimit\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m69\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 8\u001b[0m \u001b[38;5;66;03m# Printing out the PCR results\u001b[39;00m\n\u001b[1;32m 10\u001b[0m \u001b[38;5;28mprint\u001b[39m(pcr_product_F1\u001b[38;5;241m.\u001b[39mformat(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mgb\u001b[39m\u001b[38;5;124m\"\u001b[39m))\n", + "File \u001b[0;32m~/Documents/OpenSource/summer_internship_2024/pydna/src/pydna/amplify.py:523\u001b[0m, in \u001b[0;36mpcr\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 521\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m anneal_primers\u001b[38;5;241m.\u001b[39mproducts[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m 522\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(anneal_primers\u001b[38;5;241m.\u001b[39mproducts) \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m0\u001b[39m:\n\u001b[0;32m--> 523\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNo PCR product! \u001b[39m\u001b[38;5;132;01m{\u001b[39;00manneal_primers\u001b[38;5;241m.\u001b[39mreport()\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 524\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPCR not specific! \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mformat\u001b[39m(anneal_primers\u001b[38;5;241m.\u001b[39mreport())\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n", "\u001b[0;31mValueError\u001b[0m: No PCR product! Template EU140750 8128 bp circular limit=69:\nNo forward primers anneal...\nNo reverse primers anneal..." ] } @@ -325,7 +325,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.12.5" } }, "nbformat": 4, diff --git a/docs/notebooks/Example_Restriction.ipynb b/docs/notebooks/Example_Restriction.ipynb index 6d1f3996..800f64fb 100644 --- a/docs/notebooks/Example_Restriction.ipynb +++ b/docs/notebooks/Example_Restriction.ipynb @@ -7,7 +7,11 @@ "# Example of a Plasmid Restriction/Ligation Cloning\n", "> Visit the full library documentation [here](https://bjornfjohansson.github.io/pydna/)\n", "\n", - "This example showcases a workflow of modelling molecular cloning with restriction enzymes, PCR, and ligases, to clone gene fragments into plasmids. This example constructs a synthetic plasmid by cloning the ase1 gene, which encodes a microtubule associated protein responsible for mitotic spindle assembly, into the pFA6a-kanMX6 cloning vector. The ase1 gene fragment is first cloned from a portion of the Saccharomyces genome through PCR. The pFA6a-kanMX6 cloning vector is then cleaved with AscI and SalI. The ase1 gene fragment is also cleaved with SalI and AscI, and are lastly ligated with the linearized pFA6a-kanMX6 vector. \n", + "This example showcases a workflow of modelling molecular cloning with restriction enzymes, PCR, and ligases, to clone gene fragments into plasmids. This example constructs a synthetic plasmid by cloning the ase1 gene, which encodes a microtubule associated protein responsible for mitotic spindle assembly, into the pFA6a-kanMX6 cloning vector:\n", + "\n", + "1. The ase1 gene fragment is first cloned from a portion of the _S. pombe_ genome through PCR:\n", + "2. The pFA6a-kanMX6 cloning vector is then cleaved with AscI and SalI. The ase1 gene fragment is also cleaved with SalI and AscI\n", + "3. The fragment is ligated with the linearized pFA6a-kanMX6 vector.\n", "\n", "Source files can be found alongside this notebook, if you would like to follow along. Annotations are made alongside the code to describe key steps." ] @@ -537,16 +541,15 @@ ], "source": [ "# Parsing the files\n", - "\n", "pFA6akanMX6_path = \"./pFA6a-kanMX6.gb\"\n", "ase1_path = \"./CU329670.gb\"\n", - "pFA6_docs = parse(pFA6akanMX6_path)\n", - "ase1_docs = parse(ase1_path)\n", + "vector = parse(pFA6akanMX6_path)[0]\n", + "pombe_chromosome_I = parse(ase1_path)[0]\n", "\n", "# Printing the parsed files\n", "\n", - "print(pFA6_docs[0].format(\"gb\"))\n", - "print(ase1_docs[0].format(\"gb\"))" + "print(vector.format(\"gb\"))\n", + "print(pombe_chromosome_I.format(\"gb\"))" ] }, { @@ -566,8 +569,14 @@ "source": [ "# Generating primers for the ase1 insert fragment. \n", "\n", - "fwd_primer_ase1 = Dseqrecord(\"ACCATGTCGAC\") + ase1_docs[0][1000:1020] # Adding a SalI cut site\n", - "rvs_primer_ase1_3_start = ase1_docs[0][3516:3546] + Dseqrecord(\"GGCGCGCCAT\") # Adding a AscI cut site\n", + "#todo-manu: \n", + "# 1. find the feature containing the CDS filtering the list of features\n", + "# 2. Use the coordinates of the CDS to design primers using pydna design\n", + "# 3. Append the cut site to the primers\n", + "# 4. Do the PCR\n", + "\n", + "fwd_primer_ase1 = Dseqrecord(\"ACCATGTCGAC\") + pombe_chromosome_I[1000:1020] # Adding a SalI cut site\n", + "rvs_primer_ase1_3_start = pombe_chromosome_I[3516:3546] + Dseqrecord(\"GGCGCGCCAT\") # Adding a AscI cut site\n", "rvs_primer_ase1 = rvs_primer_ase1_3_start.reverse_complement()\n", "\n", "# Printing out the primers\n", @@ -593,6 +602,10 @@ "source": [ "# Checking that the primer Tm are matching\n", "\n", + "#todo-manu: \n", + "# 1. Check the Tm of the primers using the part that aligns with the genome only (here you are calculating the Tm including the restriction sites, which\n", + "# will not anneal to the template during the PCR)\n", + "\n", "print(tm_default(fwd_primer_ase1.seq)) # Modify the primer sequence above retroactively, if Tm not matching.\n", "print(tm_default(rvs_primer_ase1.seq))" ] @@ -709,7 +722,7 @@ "source": [ "# Performing a PCR to check that the primers are specific. An error message is returned if otherwise.\n", "\n", - "pcr_product = pcr(fwd_primer_ase1, rvs_primer_ase1, ase1_docs[0])\n", + "pcr_product = pcr(fwd_primer_ase1, rvs_primer_ase1, pombe_chromosome_I)\n", "\n", "# Printing out the PCR results\n", "\n", @@ -733,7 +746,7 @@ "source": [ "# Cleaving the cloning vector with restriction enzymes\n", "\n", - "plamsid_digests = pFA6_docs[0].cut(SalI, AscI)\n", + "plamsid_digests = vector.cut(SalI, AscI)\n", "\n", "# Cleaving the gene fragment with restriction enzymes\n", "\n", @@ -1009,7 +1022,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.12.5" } }, "nbformat": 4, diff --git a/docs/notebooks/Gibson.ipynb b/docs/notebooks/Gibson.ipynb index cc502b10..d0bc9a8a 100644 --- a/docs/notebooks/Gibson.ipynb +++ b/docs/notebooks/Gibson.ipynb @@ -9,11 +9,12 @@ "\n", "Gibson Assembly is a powerful method to assemble multiple DNA fragments into a single, continuous sequence in a seamless, one-step reaction. Developed by Daniel Gibson and colleagues in 2009, this method has been widely applied to work in molecular cloning, biotechnology, and synthetic biology. \n", "\n", - "`pydna` provides the `Assembly` class to simulate the assembly of DNA sequences. This page provides a guide to performing Gibson Assembly with pre-existing DNA fragments, followed by primer design for generating these fragments via the `pcr` method, if needed.\n", + "`pydna` provides the `Assembly` class to simulate the assembly of DNA sequences. Below is an example fpr performing Gibson Assembly with pre-existing DNA fragments, followed by primer design for generating these fragments via the `pcr` method, if needed.\n", "\n", - "The `Assembly` class simulates Gibson assembly by searching for homologous sequence pairings. The `Assembly` class needs a list of DNA fragments, given in the datatype of `Dseqrecord` objects, and a minimum length of DNA (`limit` parameter) for which a sequence homology is considered.\n", - "\n", - "The example below shows how to create an `Assembly` object using multiple DNA fragments. `Assembly` takes `Dseqrecord` objects as input that you can directly instantiate (first example), or parse from files (second example). Note that the sequences are always inputted from a 5'-3' direction." + "The `Assembly` takes the following arguments:\n", + " * `frags`: list of DNA fragments as `Dseqrecord` objects\n", + " * `limit`: the minimum sequence homology required.\n", + " * `algorithm`: the function used to find homology regions between DNA fragments. For Gibson Assembly, we use the `terminal_overlap` function, which finds homology regions only at the terminal regions. By default, the `Assembly` class uses the `common_sub_strings` function to find homology regions, which finds homology anywhere, as it could happen in a homologous recombination event.\n" ] }, { @@ -29,24 +30,25 @@ "fragments..: 33bp 34bp 35bp\n", "limit(bp)..: 14\n", "G.nodes....: 6\n", - "algorithm..: common_sub_strings\n" + "algorithm..: terminal_overlap\n" ] } ], "source": [ "from pydna.dseqrecord import Dseqrecord\n", "from pydna.assembly import Assembly\n", + "from pydna.common_sub_strings import terminal_overlap\n", "\n", "#Creating example Dseqrecord sequences\n", "fragment1 = Dseqrecord(\"acgatgctatactgCCCCCtgtgctgtgctcta\")\n", "fragment2 = Dseqrecord(\"tgtgctgtgctctaTTTTTtattctggctgtatc\")\n", "fragment3 = Dseqrecord(\"tattctggctgtatcGGGGGtacgatgctatactg\")\n", "\n", - "#Cerating a list of sequences to assemble\n", + "#Creating a list of sequences to assemble\n", "fragments = [fragment1, fragment2, fragment3]\n", "\n", "#Performing Gibson assembly, with a minimum shared homology of 14bp\n", - "assembly = Assembly(fragments, limit=14)\n", + "assembly = Assembly(fragments, limit=14, algorithm=terminal_overlap)\n", "\n", "#Displaying the assembled product\n", "print(assembly)" @@ -56,9 +58,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The printed output shows the length of each assembled DNA sequenece, the minimum length required for sequence homology search, the number of nodes (number of overlapping regions), and the algorithm used for sequence homology search. Please refer to the full `Assembly` module documentation for more information on the algorithm applied. \n", + "The printed output shows the length of each fragment provided to the assembly, the minimum length required for sequence homology search, the number of nodes (number of overlapping regions), and the algorithm used for sequence homology search. Please refer to the full `Assembly` module documentation for more information on the algorithm applied.\n", "\n", - "To make a circular sequence from an `Assembly`, pydna provides the `assemble_circular` method. The assembled sequence can be printed as normal, as `Dseqrecord` objects. Note that the `assemble_circular` method returns a list, where the first element (index 0) represents the Watson strand as the top strand, and the second element (index 1) represents the Crick strand as the top strand. " + "To make a circular sequence from an `Assembly`, pydna provides the `assemble_circular` method. The assembled sequence can be printed as normal, as `Dseqrecord` objects. Note that the `assemble_circular` method returns a list, where the two elements are reverse complement of each other." ] }, { @@ -81,6 +83,7 @@ "Dseq(o59)\n", "acga..GGGt\n", "tgct..CCCa\n", + "\n", "Dseqrecord\n", "circular: True\n", "size: 59\n", @@ -103,6 +106,7 @@ "\n", "#Printing the sequence records\n", "print(assembly_circ[0])\n", + "print()\n", "print(assembly_circ[1])\n" ] }, @@ -130,7 +134,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.12.5" } }, "nbformat": 4, diff --git a/docs/notebooks/Importing_Seqs.ipynb b/docs/notebooks/Importing_Seqs.ipynb index 44749182..153f3738 100644 --- a/docs/notebooks/Importing_Seqs.ipynb +++ b/docs/notebooks/Importing_Seqs.ipynb @@ -7,11 +7,16 @@ "# Importing and viewing sequence files in pydna\n", "> Visit the full library documentation [here](https://bjornfjohansson.github.io/pydna/)\n", "\n", - "pydna can be used to work with FASTA, Genbank, EMBL, and snapgene files (.fasta, .gb, .embl, .dna). Specifically, pydna provides ways to read these file types, and store them as a record (I.e `Dseqrecord` object) that one can view and work with. Alternatively, pydna can also work with sequences directly passed into python code as a string object, again storing them as a `Dseqrecord` object.\n", + "pydna can be used to work with FASTA, Genbank, EMBL, and snapgene files (.fasta, .gb, .embl, .dna). You can read these files into a `Dseqrecord` that one can view and work with. You can also instantiate `Dseqrecord` objects with strings.\n", "\n", "## Importing Sequence Files\n", "\n", - "To import files into pydna is simple. pydna provides the `parse` method to read all DNA sequences in a file into a list. As an input, `parse` can take the path to a file from your computer, or a python string with the file content. The following code shows an example of how to use the `parse` function to import a downloaded FASTA file, but other types of files can also be imported in the same way." + "To import files into pydna is simple. pydna provides the `parse` method to read all DNA sequences in a file into a list. As an input, `parse` can take:\n", + "\n", + "* The path to a file from your computer\n", + "* A python string with the file content.\n", + "\n", + "The following code shows an example of how to use the `parse` function to import a FASTA file." ] }, { @@ -34,8 +39,8 @@ "source": [ "from pydna.parsers import parse\n", "\n", - "#Import your file into python. \n", - "file_path = \"./sequence.fasta\"\n", + "#Import your file into python using its path\n", + "file_path = \"./U49845.fasta\"\n", "files = parse(file_path)\n", "\n", "#Show your FASTA file in python\n", @@ -46,16 +51,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that I used a relative path. Place your sequence file in the same directory as your code to copy the exact code above. \n", - " \n", - "The last line of code uses the `format` method to view the imported file in your Python Interpreter (e.g interactive window on Visual Studio Code). Note that `parse` returns a `list` object, hence requiring `[0]` to take the first element of the list. When you have a FASTA file that contains multiple sequences, you can index the list accordingly (e.g `[0]`, `[1]`, ...)" + "Note that `parse` returns a `list` object, hence requiring `[0]` to take the first element of the list. When you have a FASTA file that contains multiple sequences, you can index the list accordingly (e.g `[0]`, `[1]`, ...)\n", + "\n", + "The last line of code uses the `format` method to generate a string representation of the sequence as a FASTA file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Another example, using a complete GenBank file, is shown below. The GenBank file is downloaded [here](https://www.ncbi.nlm.nih.gov/nucleotide/U49845). I've done this on my Mac, and dragged/dropped the sequence.gb file in a series of subfolders on my Desktop. Replace my file path with yours to access the file." + "Another example, using a GenBank file ([U49845](https://www.ncbi.nlm.nih.gov/nucleotide/U49845)), is shown below." ] }, { @@ -239,10 +244,10 @@ "source": [ "from pydna.parsers import parse\n", "\n", - "file_path = \"./sequence.gb\"\n", + "file_path = \"./U49845.gb\"\n", "files = parse(file_path)\n", "\n", - "#Show your GenBank file in pyton\n", + "# Convert the Dseqrecord object into a formatted string in GenBank format\n", "files[0].format(\"gb\")\n" ] }, @@ -339,7 +344,7 @@ "from Bio.SeqIO import parse as seqio_parse\n", "from pydna.dseqrecord import Dseqrecord\n", "\n", - "file_path = './sequence.gb'\n", + "file_path = './U49845.gb'\n", "\n", "# Extract the first Seqrecord of the SeqIO.parse iterator\n", "seq_record = next(seqio_parse(file_path, 'genbank'))\n", @@ -370,7 +375,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.12.5" } }, "nbformat": 4, diff --git a/docs/notebooks/Restrict_Ligate_Cloning.ipynb b/docs/notebooks/Restrict_Ligate_Cloning.ipynb index cf9ad460..be8d8f90 100644 --- a/docs/notebooks/Restrict_Ligate_Cloning.ipynb +++ b/docs/notebooks/Restrict_Ligate_Cloning.ipynb @@ -60,7 +60,7 @@ "source": [ "The circular `Dseqrecord` is cut into a linear `Dseqrecord` object, since there is only one EcoRI recognition site. `Dseqrecord` also shows the 5' sticky end after cutting.\n", "\n", - "The sequence can also be cut with multiple restriction enzymes, into multiple linear DNA sequences. We can simply import all the restriction enzymes, and use the cut method as normal. Note that the code follows from the previous sample code; the code below assume all the other required classes and modules have been imported." + "The sequence can also be cut with multiple restriction enzymes, into multiple linear DNA sequences. We can simply import all the restriction enzymes, and use the cut method as normal." ] }, { @@ -72,6 +72,7 @@ "name": "stdout", "output_type": "stream", "text": [ + "\n", "Dseqrecord\n", "circular: False\n", "size: 51\n", @@ -83,6 +84,8 @@ "Dseq(-51)\n", "ATCT..TGTG \n", "TAGA..ACACTTAA\n", + "\n", + "\n", "Dseqrecord\n", "circular: False\n", "size: 214\n", @@ -94,6 +97,8 @@ "Dseq(-214)\n", "AATTCTTC..TGAT\n", " GAAG..ACTA\n", + "\n", + "\n", "Dseqrecord\n", "circular: False\n", "size: 73\n", @@ -104,7 +109,8 @@ "/molecule_type=DNA\n", "Dseq(-73)\n", "ATCT..AGAT\n", - "TAGA..TCTA\n" + "TAGA..TCTA\n", + "\n" ] } ], @@ -116,7 +122,9 @@ "\n", "# Display the resulting fragments\n", "for frag in multi_cut_records:\n", - " print(frag)" + " print()\n", + " print(frag)\n", + " print()" ] }, { @@ -156,8 +164,8 @@ } ], "source": [ - "ligated_records = multi_cut_records[0] + multi_cut_records[1]\n", - "print(ligated_records)" + "ligated_product = multi_cut_records[0] + multi_cut_records[1]\n", + "print(ligated_product)" ] }, { @@ -176,7 +184,9 @@ "source": [ "## Circularizing fragments\n", "\n", - "To circularize a cut DNA sequence into a plasmid is easy. `pydna` provides the `looped` method to circularise a `Dseqrecord`, returning a new sequence object. In other words, the `.looped` method does not act in place, and a new variable should be created to store the new circularised sequence, as shown in the following example." + "To circularize a cut DNA sequence use the `looped` method, which returns a new sequence object.\n", + "\n", + "🚨🚨 **VERY IMPORTANT** 🚨🚨 `.looped()` method does not act in place, so a new variable should be created to store the new circularised sequence, as shown in the following example." ] }, { @@ -188,6 +198,9 @@ "name": "stdout", "output_type": "stream", "text": [ + "is ligated_product circular? False\n", + "is circular_record circular? True\n", + "\n", "Dseqrecord\n", "circular: True\n", "size: 261\n", @@ -203,7 +216,11 @@ } ], "source": [ - "circular_record = ligated_records.looped()\n", + "circular_record = ligated_product.looped()\n", + "\n", + "print('is ligated_product circular?', ligated_product.circular)\n", + "print('is circular_record circular?', circular_record.circular)\n", + "print()\n", "\n", "print(circular_record)" ] @@ -214,7 +231,7 @@ "source": [ "## Extra Notes: What happens to features when cutting/ligating?\n", "\n", - "A feature is removed from a `Dseqrecord` if the features is truncated by the cut. For instance, the example_gene feature is removed from the record after cutting `record` with PstI, who has recognition site within example_gene. within the cutand if the feature is completely within the cut, it is retained. " + "A feature is removed from a `Dseqrecord` if the features is truncated by the cut. For instance, the example_gene feature is removed from the record after cutting `record` with PstI, which has recognition site within example_gene. within the cutand if the feature is completely within the cut, it is retained. " ] }, { @@ -272,7 +289,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.12.5" } }, "nbformat": 4, diff --git a/docs/notebooks/sequence.fasta b/docs/notebooks/U49845.fasta similarity index 100% rename from docs/notebooks/sequence.fasta rename to docs/notebooks/U49845.fasta diff --git a/docs/notebooks/sequence.gb b/docs/notebooks/U49845.gb similarity index 100% rename from docs/notebooks/sequence.gb rename to docs/notebooks/U49845.gb diff --git a/dummy.md b/dummy.md new file mode 100644 index 00000000..e69de29b