diff --git a/docs/src/man/structure.md b/docs/src/man/structure.md index 18466bb66..657a32895 100644 --- a/docs/src/man/structure.md +++ b/docs/src/man/structure.md @@ -15,18 +15,19 @@ end The `Bio.Structure` module provides functionality to manipulate macromolecular structures, and in particular to read and write [Protein Data Bank](http://www.rcsb.org/pdb/home/home.do) (PDB) files. It is designed to be used for standard structural analysis tasks, as well as acting as a platform on which others can build to create more specific tools. It compares favourably in terms of performance to other PDB parsers - see some [benchmarks](https://github.com/jgreener64/pdb-benchmarks). -## Parsing PDB files +## Basics To download a PDB file: ```julia +# Stored in the current working directory by default downloadpdb("1EN2") ``` To parse a PDB file into a Structure-Model-Chain-Residue-Atom framework: ```julia -julia> struc = read(filepath_1EN2, PDB) +julia> struc = read("/path/to/pdb/file.pdb", PDB) Bio.Structure.ProteinStructure Name - 1EN2.pdb Number of models - 1 @@ -40,6 +41,8 @@ Number of hydrogens - 0 Number of disordered atoms - 27 ``` +**Note** : Refer to [Downloading PDB files](#downloading-pdb-files) and [Reading PDB files](#reading-pdb-files) sections for more options. + The elements of `struc` can be accessed as follows: | Command | Returns | Return type | @@ -194,21 +197,6 @@ RCGSQGGGSTCPGLRCCSIWGWCGDSEPYCGRTCENKCWSGERSDHRCGAAVGNPPCGQDRCCSVHGWCGGGNDYCSGGN ``` -## Writing PDB files - -PDB format files can be written: - -```julia -writepdb("1EN2_out.pdb", struc) -``` - -Any element type can be given as input to `writepdb`. 
Atom selectors can also be given as additional arguments: - -```julia -writepdb("1EN2_out.pdb", struc, backboneselector) -``` - - ## Spatial calculations Various functions are provided to calculate spatial quantities for proteins: @@ -244,6 +232,178 @@ julia> rad2deg(psiangle(struc['A'][50], struc['A'][51])) ``` +## Downloading PDB files + +To download a PDB file to a specified directory: + +```julia +downloadpdb("1EN2", pdb_dir="path/to/pdb/directory/") +``` + +To download multiple PDB files to a specified directory: + +```julia +downloadpdb(["1EN2","1ALW","1AKE"], pdb_dir="path/to/pdb/directory/") +``` + +To download a PDB file in PDB, XML, MMCIF or MMTF format: + +```julia +# PDB file format +downloadpdb("1ALW", pdb_dir="path/to/pdb/directory/", file_format=PDB) +# XML file format +downloadpdb("1ALW", pdb_dir="path/to/pdb/directory/", file_format=PDBXML) +# MMCIF file format +downloadpdb("1ALW", pdb_dir="path/to/pdb/directory/", file_format=MMCIF) +# MMTF file format +downloadpdb("1ALW", pdb_dir="path/to/pdb/directory/", file_format=MMTF) +``` + +Various options can be set through optional keyword arguments when downloading PDB files as follows: + +| Keyword Argument | Description | +| :----------------------------- | :-------------------------------------------------------------------------------------------------------------------- | +| `pdb_dir::AbstractString=pwd()`| The directory to which the PDB file is downloaded | +| `file_format::Type=PDB` | The format of the PDB file. 
Options are PDB, PDBXML, MMCIF or MMTF | +| `obsolete::Bool=false` | If set `true`, the PDB file is downloaded into the auto-generated "obsolete" directory inside the specified `pdb_dir` | +| `overwrite::Bool=false` | If set `true`, overwrites the PDB file if it already exists in `pdb_dir`; by default the download is skipped if the file exists | +| `ba_number::Integer=0` | If set > 0, downloads the respective biological assembly; by default downloads the PDB file | + + +## Reading PDB files + +- To parse an existing PDB file into a Structure-Model-Chain-Residue-Atom framework: + +```julia +julia> struc = read("/path/to/pdb/file.pdb", PDB) +Bio.Structure.ProteinStructure +Name - 1EN2.pdb +Number of models - 1 +Chain(s) - A +Number of residues - 85 +Number of point mutations - 5 +Number of other molecules - 5 +Number of water molecules - 76 +Number of atoms - 614 +Number of hydrogens - 0 +Number of disordered atoms - 27 +``` + +Various options can be set through optional keyword arguments when parsing a PDB file as follows: + +| Keyword Argument | Description | +| :------------------------------------------- | :------------------------------------------------------------------------------ | +| `structure_name::AbstractString="$pdbid.pdb"`| The name of the PDB Structure read. Defaults to "< PDBID >.pdb" | +| `remove_disorder::Bool=false` | If set `true`, disordered atoms won't be parsed | +| `read_std_atoms::Bool=true` | If set `false`, standard ATOM records won't be parsed | +| `read_het_atoms::Bool=true` | If set `false`, HETATM records won't be parsed | + +- To parse a PDB file by specifying the PDB ID and PDB directory into a Structure-Model-Chain-Residue-Atom framework (file name must be in upper case, e.g. "1EN2.pdb"): + +The function `readpdb` provides a uniform way to read PDB files from a local PDB directory by PDB ID; it does not download them. For example: + +```julia +struc = readpdb("1EN2", pdb_dir="/path/to/pdb/directory") +``` + +`readpdb` takes the same keyword arguments as `read` above, plus `pdb_dir` and `ba_number`. 
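The file name that `readpdb` looks for follows from the PDB ID and `ba_number`. The sketch below illustrates that convention as it appears in this patch. `expectedpdbpath` is a hypothetical helper, not part of `Bio.Structure`, and the ID check is borrowed from `downloadpdb` (`readpdb` itself does not validate the ID).

```julia
# Hypothetical helper (not part of Bio.Structure): builds the path that
# `readpdb` would try to read, following the naming convention in this patch.
function expectedpdbpath(pdbid::AbstractString;
                         pdb_dir::AbstractString=pwd(),
                         ba_number::Integer=0)
    # File names are stored in upper case, e.g. "1EN2.pdb"
    pdbid = uppercase(pdbid)
    # A valid PDB ID is exactly four alphanumeric characters
    if match(r"^[a-zA-Z0-9]{4}$", pdbid) === nothing
        throw(ArgumentError("Not a valid PDB ID: \"$pdbid\""))
    end
    if ba_number == 0
        # Plain PDB entry
        return joinpath(pdb_dir, "$pdbid.pdb")
    else
        # Biological assembly, e.g. "1AKE_ba1.pdb"
        return joinpath(pdb_dir, "$(pdbid)_ba$ba_number.pdb")
    end
end

println(expectedpdbpath("1en2", pdb_dir="pdbs"))
println(expectedpdbpath("1ake", pdb_dir="pdbs", ba_number=1))
```

If the file at that path is missing, it would first need to be fetched with `downloadpdb`, or both steps can be combined with `retrievepdb`.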
+ +- To download and parse a PDB file into a Structure-Model-Chain-Residue-Atom framework in a single line: + +```julia +julia> struc = retrievepdb("1ALW", pdb_dir="path/to/pdb/directory") +INFO: Downloading PDB : 1ALW +INFO: Parsing the PDB file... +Bio.Structure.ProteinStructure +Name - 1ALW.pdb +Number of models - 1 +Chain(s) - AB +Number of residues - 346 +Number of point mutations - 0 +Number of other molecules - 10 +Number of water molecules - 104 +Number of atoms - 2790 +Number of hydrogens - 0 +Number of disordered atoms - 0 +``` + +Various options can be set through optional keyword arguments when downloading and parsing a PDB file as follows: + +| Keyword Argument | Description | +| :--------------------------------------------| :--------------------------------------------------------------------------------------------------------------- | +| `pdb_dir::AbstractString=pwd()` | The directory to which the PDB file is downloaded and from which it is read | +| `obsolete::Bool=false` | If set `true`, the PDB file is downloaded into the auto-generated "obsolete" directory inside the specified `pdb_dir`| +| `overwrite::Bool=false` | If set `true`, overwrites the PDB file if it already exists in `pdb_dir`; by default the download is skipped if the file exists | +| `ba_number::Integer=0` | If set > 0, reads the respective biological assembly; by default reads the PDB file | +| `structure_name::AbstractString="$pdbid.pdb"`| The name of the PDB Structure read. Defaults to "< PDBID >.pdb" | +| `remove_disorder::Bool=false` | If set `true`, disordered atoms won't be parsed | +| `read_std_atoms::Bool=true` | If set `false`, standard ATOM records won't be parsed | +| `read_het_atoms::Bool=true` | If set `false`, HETATM records won't be parsed | + + +## Writing PDB files + +PDB format files can be written: + +```julia +writepdb("1EN2_out.pdb", struc) +``` + +Any element type can be given as input to `writepdb`. 
Atom selectors can also be given as additional arguments: + +```julia +writepdb("1EN2_out.pdb", struc, backboneselector) +``` + + +## RCSB PDB Utility Functions + +- To download the entire RCSB PDB database in your preferred file format: + +```julia +downloadentirepdb(pdb_dir="path/to/pdb/directory/", file_format=MMTF, overwrite=false) +``` + +The keyword arguments are described below: + +| Keyword Argument | Description | +| :------------------------------- | :------------------------------------------------------------------------------------------------------- | +| `pdb_dir::AbstractString=pwd()` | The directory to which the PDB files are downloaded | +| `file_format::Type=PDB` | The format of the PDB file. Options are PDB, PDBXML, MMCIF or MMTF | +| `overwrite::Bool=false` | If set `true`, overwrites the PDB file if it already exists in `pdb_dir`; by default the download is skipped if the file exists | + +- To update your local PDB directory based on the weekly status list of new, modified and obsolete PDB files from the RCSB Server: + +```julia +updatelocalpdb(pdb_dir="path/to/pdb/directory/", file_format=MMTF) +``` + +The `file_format` specifies the format of the PDB files present in the local PDB directory. Obsolete PDB files are stored in the auto-generated `obsolete` directory inside the specified local PDB directory. + +- To download all obsolete PDB files from the RCSB Server: + +```julia +downloadallobsoletepdb(;obsolete_dir="/path/to/obsolete/directory/", file_format=MMCIF, overwrite=false) +``` + +The `file_format` specifies the format in which the PDB files are downloaded. Options are PDB, PDBXML, MMCIF or MMTF. + +If `overwrite=true`, the existing PDB files in the obsolete directory will be overwritten by the newly downloaded ones. 
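The effect of `overwrite` can be summarised as a single check on the target path. The following is a minimal sketch of that decision, assuming it matches the `isfile(pdbpath) && !overwrite` test used by `downloadpdb` in this patch. `needsdownload` is a hypothetical name introduced only for illustration.

```julia
# Hypothetical helper (not part of Bio.Structure): decides whether a file
# would be fetched, mirroring the skip-or-overwrite behaviour described above.
function needsdownload(pdbpath::AbstractString; overwrite::Bool=false)
    # Download when the file is missing, or when the caller asks to overwrite
    return overwrite || !isfile(pdbpath)
end

# A temporary file stands in for a previously downloaded PDB file
path, io = mktemp()
close(io)
println(needsdownload(path))                  # file exists, overwrite not requested
println(needsdownload(path, overwrite=true))  # file exists, overwrite requested
rm(path)
println(needsdownload(path))                  # file no longer exists
```

Because existing files are skipped by default, an interrupted bulk download can be resumed by calling the same function again.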
+ +- To maintain a local copy of the entire RCSB PDB Database: + +Run the `downloadentirepdb` function once to download all PDB files, then set up a cron job or similar to run the `updatelocalpdb` function once a week to keep the local PDB directory up to date with the RCSB Server. + +There are a few more functions that may help. + +| Function | Returns | Return type | +| :----------------- | :------------------------------------------------------------------------------ | :------------------------------------------------------- | +| `pdbentrylist` | List of all PDB entries from RCSB Server | `Array{String,1}` | +| `pdbstatuslist` | List of PDB entries from specified RCSB weekly status list URL | `Array{String,1}` | +| `pdbrecentchanges` | Added, modified and obsolete PDB lists from the recent RCSB weekly status files | `Tuple{Array{String,1},Array{String,1},Array{String,1}}` | +| `pdbobsoletelist` | List of all obsolete PDB entries in the RCSB server | `Array{String,1}` | + + ## Examples A few further examples of `Bio.Structure` usage are given below. diff --git a/src/structure/pdb.jl b/src/structure/pdb.jl index 03e7bd583..93d0b22fe 100644 --- a/src/structure/pdb.jl +++ b/src/structure/pdb.jl @@ -1,15 +1,34 @@ export PDB, + PDBXML, + MMCIF, + MMTF, PDBParseError, + pdbextension, + pdbentrylist, + pdbstatuslist, + pdbrecentchanges, + pdbobsoletelist, downloadpdb, + downloadentirepdb, + updatelocalpdb, + downloadallobsoletepdb, + retrievepdb, + readpdb, spaceatomname, pdbline, writepdb +using Libz -"Protein Data Bank (PDB) file format." +"Protein Data Bank (PDB) file formats." immutable PDB <: Bio.IO.FileFormat end +immutable PDBXML <: Bio.IO.FileFormat end +immutable MMCIF <: Bio.IO.FileFormat end +immutable MMTF <: Bio.IO.FileFormat end +# A Dict mapping each file format type to its file extension +const pdbextension = Dict{Type,String}( PDB => ".pdb", PDBXML => ".xml", MMCIF => ".cif", MMTF => ".mmtf") "Error arising from parsing a Protein Data Bank (PDB) file." 
type PDBParseError <: Exception @@ -30,25 +49,383 @@ end """ -Download a Protein Data Bank (PDB) file or biological assembly from the RCSB -PDB. By default downloads the PDB file; if `ba_number` is set the biological -assembly with that number will be downloaded. + pdbentrylist() + +Fetch the list of all PDB entries from the RCSB server. +""" +function pdbentrylist() + pdbidlist = String[] + info("Fetching list of all PDB Entries from RCSB Server...") + tempfilepath = tempname() + try + download("ftp://ftp.wwpdb.org/pub/pdb/derived_data/index/entries.idx",tempfilepath) + open(tempfilepath) do input + # Skip the first two lines as they contain headers + linecount = 1 + for line in eachline(input) + if linecount > 2 + # The first 4 characters in the line are the PDB ID + pdbid = uppercase(line[1:4]) + # Check PDB ID is 4 characters long and only consists of alphanumeric characters + if !ismatch(r"^[a-zA-Z0-9]{4}$", pdbid) + throw(ArgumentError("Not a valid PDB ID: \"$pdbid\"")) + end + push!(pdbidlist,pdbid) + end + linecount +=1 + end + end + finally + rm(tempfilepath, force=true) + end + return pdbidlist +end + + +""" + pdbstatuslist(url::AbstractString) + +Fetch the list of PDB entries from an RCSB weekly status file by specifying its URL. 
+""" +function pdbstatuslist(url::AbstractString) + statuslist = String[] + filename = split(url,"/")[end] + info("Fetching weekly status file $filename from RCSB Server...") + tempfilepath = tempname() + try + download(url, tempfilepath) + open(tempfilepath) do input + for line in eachline(input) + # The first 4 characters in the line are the PDB ID + pdbid = uppercase(line[1:4]) + # Check PDB ID is 4 characters long and only consists of alphanumeric characters + if !ismatch(r"^[a-zA-Z0-9]{4}$", pdbid) + throw(ArgumentError("Not a valid PDB ID: \"$pdbid\"")) + end + push!(statuslist,pdbid) + end + end + finally + rm(tempfilepath, force=true) + end + return statuslist +end + + +""" + pdbrecentchanges() + +Fetch three lists containing the added, modified and obsolete PDB entries from the recent +RCSB weekly status files. +""" +function pdbrecentchanges() + addedlist = pdbstatuslist("ftp://ftp.wwpdb.org/pub/pdb/data/status/latest/added.pdb") + modifiedlist = pdbstatuslist("ftp://ftp.wwpdb.org/pub/pdb/data/status/latest/modified.pdb") + obsoletelist = pdbstatuslist("ftp://ftp.wwpdb.org/pub/pdb/data/status/latest/obsolete.pdb") + return addedlist, modifiedlist, obsoletelist +end + + +""" + pdbobsoletelist() + +Fetch the list of all obsolete PDB entries in the RCSB server. 
+""" +function pdbobsoletelist() + obsoletelist = String[] + info("Fetching list of all obsolete PDB Entries from RCSB Server...") + tempfilepath = tempname() + try + download("ftp://ftp.wwpdb.org/pub/pdb/data/status/obsolete.dat", tempfilepath) + open(tempfilepath) do input + for line in eachline(input) + # Check if it is an obsolete PDB entry and not a header; startswith avoids indexing past the end of short lines + if startswith(line, "OBSLTE") + # The 21st to 24th characters of an obsolete PDB entry hold the PDB ID + pdbid = uppercase(line[21:24]) + # Check PDB ID is 4 characters long and only consists of alphanumeric characters + if !ismatch(r"^[a-zA-Z0-9]{4}$", pdbid) + throw(ArgumentError("Not a valid PDB ID: \"$pdbid\"")) + end + push!(obsoletelist,pdbid) + end + end + end + finally + rm(tempfilepath, force=true) + end + return obsoletelist +end + + +""" + downloadpdb(pdbid::AbstractString; ) + +Download a PDB or biological assembly file from the RCSB server. + +# Arguments +- `pdbid::AbstractString`: the PDB to be downloaded. +- `pdb_dir::AbstractString=pwd()`: the directory to which the PDB file is downloaded; +defaults to current working directory. +- `file_format::Type=PDB`: the format of the PDB file. Options ; +defaults to PDB format. +- `obsolete::Bool=false`: if set `true`, the PDB file is downloaded in the auto-generated +"obsolete" directory inside the specified `pdb_dir`. +- `overwrite::Bool=false`: if set `true`, overwrites the PDB file if it exists in `pdb_dir`; +by default skips downloading the PDB file if it exists in `pdb_dir`. +- `ba_number::Integer=0`: if set > 0 downloads the respective biological assembly; +by default downloads the PDB file. 
""" -function downloadpdb(pdbid::AbstractString, - out_filepath::AbstractString="$pdbid.pdb"; - ba_number::Integer=0) +function downloadpdb(pdbid::AbstractString; pdb_dir::AbstractString=pwd(), file_format::Type=PDB, obsolete::Bool=false, overwrite::Bool=false, ba_number::Integer=0) + pdbid = uppercase(pdbid) # Check PDB ID is 4 characters long and only consits of alphanumeric characters - if length(pdbid) != 4 || ismatch(r"[^a-zA-Z0-9]", pdbid) + if !ismatch(r"^[a-zA-Z0-9]{4}$", pdbid) throw(ArgumentError("Not a valid PDB ID: \"$pdbid\"")) end - if ba_number == 0 - download("http://www.rcsb.org/pdb/files/$pdbid.pdb", out_filepath) + # Check if the PDB file format is valid + if !haskey(pdbextension, file_format) + throw(ArgumentError("Invalid PDB file format!")) + end + # Check if the PDB file is marked as obsolete + if obsolete + # Set the download path to obsolete directory inside the "pdb_dir" + pdb_dir = joinpath(pdb_dir,"obsolete") + end + # Check and create the directory if it does not exist in the filesystem + if !isdir(pdb_dir) + info("Creating directory : $pdb_dir") + mkpath(pdb_dir) + end + # Standard file name format for PDB and biological assembly + if ba_number==0 + pdbpath = joinpath(pdb_dir,"$pdbid$(pdbextension[file_format])") else - # Will download error page if ba_number is too high - download("http://www.rcsb.org/pdb/files/$pdbid.pdb$ba_number", out_filepath) + pdbpath = joinpath(pdb_dir,"$(pdbid)_ba$ba_number$(pdbextension[file_format])") + end + # Download the PDB file only if it does not exist in the "pdb_dir" or when "overwrite" is true + if isfile(pdbpath) && !overwrite + info("PDB Exists : $pdbid") + else + # Temporary location to download compressed PDB file. 
+ archivefilepath = tempname() + try + # Download the compressed PDB file to the temporary location + info("Downloading PDB : $pdbid") + if ba_number == 0 + if file_format == PDB || file_format == PDBXML || file_format == MMCIF + download("http://files.rcsb.org/download/$pdbid$(pdbextension[file_format]).gz", archivefilepath) + else + # MMTF is downloaded in uncompressed form, thus directly stored in pdbpath + download("http://mmtf.rcsb.org/v1.0/full/$pdbid", pdbpath) + end + else + if file_format == PDB + download("http://files.rcsb.org/download/$pdbid$(pdbextension[file_format])$ba_number.gz",archivefilepath) + elseif file_format == MMCIF + download("http://files.rcsb.org/download/$pdbid-assembly$ba_number$(pdbextension[file_format]).gz", archivefilepath) + else + throw(ArgumentError("Biological Assembly is available only in PDB and MMCIF formats!")) + end + end + # Verify if the compressed PDB file is downloaded properly and extract it. For MMTF no extraction is needed + if isfile(archivefilepath) && filesize(archivefilepath) > 0 && file_format != MMTF + input = open(archivefilepath) |> ZlibInflateInputStream + open(pdbpath,"w") do output + for line in eachline(input) + println(output, chomp(line)) + end + end + close(input) + end + # Verify if the PDB file is downloaded and extracted without any error + if !isfile(pdbpath) || filesize(pdbpath)==0 + throw(ErrorException("Error downloading PDB : $pdbid")) + end + finally + # Remove the temporary compressed PDB file to free up space + rm(archivefilepath, force=true) + end + end +end + + +""" + downloadpdb(pdbidlist::Array{String,1}; ) + +Download PDB or biological assembly files from the RCSB server. + +# Arguments +- `pdbidlist::Array{String,1}`: the list of PDB files to be downloaded. +- `pdb_dir::AbstractString=pwd()`: the directory to which the PDB files are downloaded; +defaults to current working directory. +- `file_format::Type=PDB`: the format of the PDB file. Options ; +defaults to PDB format. 
+- `obsolete::Bool=false`: if set `true`, the PDB file is downloaded in the auto-generated +"obsolete" directory inside the specified `pdb_dir`. +- `overwrite::Bool=false`: if set `true`, overwrites the PDB file if it exists in `pdb_dir`; +by default skips downloading the PDB file if it exists in `pdb_dir`. +- `ba_number::Integer=0`: if set > 0 downloads the respective biological assembly; +by default downloads the PDB file. +""" +function downloadpdb(pdbidlist::Array{String,1}; kwargs...) + failedlist = String[] + for pdbid in pdbidlist + try + downloadpdb(pdbid; kwargs...) + catch + warn("Error downloading PDB : $pdbid") + push!(failedlist,pdbid) + end + end + if length(failedlist) > 0 + warn(length(failedlist)," PDB files failed to download : ", failedlist) + end +end + + +""" + downloadentirepdb(; ) + +Download all PDB files available in the RCSB server. + +# Arguments +- `pdb_dir::AbstractString=pwd()`: the directory to which the PDB files are downloaded; +defaults to current working directory. +- `file_format::Type=PDB`: the format of the PDB file. Options ; +defaults to PDB format. +- `overwrite::Bool=false`: if set `true`, overwrites the PDB file if it exists in `pdb_dir`; +by default skips downloading the PDB file if it exists in `pdb_dir`. +""" +function downloadentirepdb(;pdb_dir::AbstractString=pwd(), file_format::Type=PDB, overwrite::Bool=false) + # Get the list of all PDB entries from the RCSB Server using pdbentrylist() and download them + pdblist = pdbentrylist() + info("About to download $(length(pdblist)) PDB files. Make sure to have enough disk space and time!") + info("You can stop it anytime and call the function again to resume downloading") + downloadpdb(pdblist, pdb_dir=pdb_dir, overwrite=overwrite, file_format=file_format) +end + + +""" + updatelocalpdb(;pdb_dir::AbstractString=pwd(), file_format::Type=PDB) + +Update your local copy of the PDB files. 
It gets the recent weekly lists of new, modified +and obsolete PDB entries and automatically updates the PDB files in the given `file_format` +inside the local `pdb_dir` directory. +""" +function updatelocalpdb(;pdb_dir::AbstractString=pwd(), file_format::Type=PDB) + addedlist, modifiedlist, obsoletelist = pdbrecentchanges() + # download the newly added and modified pdb files + downloadpdb(vcat(addedlist,modifiedlist), pdb_dir=pdb_dir, overwrite=true, file_format=file_format) + # set the obsolete directory to be inside pdb_dir + obsolete_dir=joinpath(pdb_dir,"obsolete") + for pdbid in obsoletelist + oldfile = joinpath(pdb_dir,"$pdbid$(pdbextension[file_format])") + newfile = joinpath(obsolete_dir, "$pdbid$(pdbextension[file_format])") + # if obsolete pdb is in the "pdb_dir", move it to "obsolete" directory inside "pdb_dir" + if isfile(oldfile) + if !isdir(obsolete_dir) + mkpath(obsolete_dir) + end + mv(oldfile,newfile) + # if obsolete pdb is already in the obsolete directory, inform the user and skip + elseif isfile(newfile) + info("PDB $pdbid is already moved to the obsolete directory") + # if obsolete pdb not available in both pdb_dir and obsolete, inform the user and skip + else + info("Obsolete PDB $pdbid is missing") + end + end +end + + +""" + downloadallobsoletepdb(; ) + +Download all obsolete PDB files from RCSB server. + +# Arguments +- `obsolete_dir::AbstractString=pwd()`: the directory where the PDB files are downloaded; +defaults to current working directory. +- `file_format::Type=PDB`: the format of the PDB file. Options ; +defaults to PDB format. +- `overwrite::Bool=false`: if set `true`, overwrites the PDB file if exists in +`obsolete_dir`; by default skips downloading PDB file if it exists in `obsolete_dir`. 
+""" +function downloadallobsoletepdb(;obsolete_dir::AbstractString=pwd(), file_format::Type=PDB, overwrite::Bool=false) + # Get all obsolete PDB entries from the RCSB PDB Server using pdbobsoletelist() and download them + obsoletelist = pdbobsoletelist() + downloadpdb(obsoletelist, pdb_dir=obsolete_dir, file_format=file_format, overwrite=overwrite) +end + + +""" + retrievepdb(pdbid::AbstractString; ) + +Download and parse (read) the PDB file or biological assembly from the RCSB PDB server. + +# Arguments +- `pdbid::AbstractString`: the PDB to be downloaded and read. +- `pdb_dir::AbstractString=pwd()`: the directory to which the PDB file is downloaded; +defaults to current working directory. +- `obsolete::Bool=false`: if set `true`, the PDB file is downloaded in the auto-generated +"obsolete" directory inside the specified `pdb_dir`. +- `overwrite::Bool=false`: if set `true`, overwrites the PDB file if it exists in `pdb_dir`; +by default skips downloading the PDB file if it exists in `pdb_dir`. +- `ba_number::Integer=0`: if set > 0 downloads the respective biological assembly; +by default downloads the PDB file. +- `structure_name::AbstractString="\$pdbid.pdb"`: used for representing the PDB structure +when parsing the file; defaults to ".pdb". +- `remove_disorder::Bool=false`: if set `true`, disordered atoms won't be parsed. +- `read_std_atoms::Bool=true`: if set `false`, standard ATOM records won't be parsed. +- `read_het_atoms::Bool=true`: if set `false`, HETATM records won't be parsed. +""" +function retrievepdb(pdbid::AbstractString; + pdb_dir::AbstractString=pwd(), + obsolete::Bool=false, + overwrite::Bool=false, + ba_number::Integer=0, + structure_name::AbstractString="$(uppercase(pdbid)).pdb", + kwargs...) 
+ downloadpdb(pdbid, pdb_dir=pdb_dir, obsolete=obsolete, overwrite=overwrite, ba_number=ba_number) + info("Parsing the PDB file...") + if obsolete + # if obsolete is set true, the PDB file is present in the obsolete directory inside "pdb_dir" + pdb_dir = joinpath(pdb_dir,"obsolete") end + readpdb(pdbid; pdb_dir=pdb_dir, ba_number=ba_number, structure_name=structure_name, kwargs...) end +""" + readpdb(pdbid::AbstractString; ) + +Read a PDB file from a local directory. + +# Arguments +- `pdbid::AbstractString`: the PDB to be read. +- `pdb_dir::AbstractString=pwd()`: the directory from which the PDB file is read; +defaults to current working directory. +- `ba_number::Integer=0`: if set > 0 reads the respective biological assembly; +by default reads the PDB file. +- `structure_name::AbstractString="\$pdbid.pdb"`: used for representing the PDB structure +when parsing the file; defaults to ".pdb". +- `remove_disorder::Bool=false`: if set `true`, disordered atoms won't be parsed. +- `read_std_atoms::Bool=true`: if set `false`, standard ATOM records won't be parsed. +- `read_het_atoms::Bool=true`: if set `false`, HETATM records won't be parsed. +""" +function readpdb(pdbid::AbstractString; + pdb_dir::AbstractString=pwd(), + ba_number::Integer=0, + structure_name::AbstractString="$pdbid.pdb", + kwargs...) + pdbid = uppercase(pdbid) + # Standard file name format for PDB and biological assembly + if ba_number==0 + pdbpath = joinpath(pdb_dir,"$pdbid.pdb") + else + pdbpath = joinpath(pdb_dir,"$(pdbid)_ba$ba_number.pdb") + end + read(pdbpath, PDB; structure_name=structure_name, kwargs...) 
+end function Base.read(input::IO, ::Type{PDB}; diff --git a/test/structure/runtests.jl b/test/structure/runtests.jl index f22e54444..161d88e97 100644 --- a/test/structure/runtests.jl +++ b/test/structure/runtests.jl @@ -30,6 +30,89 @@ function pdbfilepath(filename::AbstractString) return joinpath(dirname(@__FILE__), "..", "BioFmtSpecimens", "PDB", filename) end +@testset "PDB Handling" begin + @test length(pdbentrylist()) > 100000 + @test length(pdbstatuslist("ftp://ftp.wwpdb.org/pub/pdb/data/status/latest/added.pdb")) > 0 + #Invalid URL + @test_throws ErrorException pdbstatuslist("ftp://ftp.wwpdb.org/pub/pdb/data/status/latest/dummy.pdb") + addedlist, modifiedlist, obsoletelist = pdbrecentchanges() + @test length(addedlist) > 0 && length(modifiedlist) > 0 && length(obsoletelist) > 0 + @test length(pdbobsoletelist()) > 3600 + + pdb_dir = joinpath(tempdir(),"PDB") + # Invalid PDB ID format + @test_throws ArgumentError downloadpdb("1a df") + # Valid PDB ID format but PDB does not exist + @test_throws ErrorException downloadpdb("no1e", pdb_dir=pdb_dir) + # Invalid PDB file_format. 
+ @test_throws ArgumentError downloadpdb("1alw", pdb_dir=pdb_dir, file_format=String) + # Biological assembly not available in PDBXML and MMTF + @test_throws ArgumentError downloadpdb("1alw", pdb_dir=pdb_dir, file_format=PDBXML, ba_number=1) + # Invalid ba_number for PDB "1alw" + @test_throws ErrorException downloadpdb("1alw",pdb_dir=pdb_dir, file_format=MMCIF,ba_number=10) + + # PDB format + downloadpdb("1alw", pdb_dir=pdb_dir, file_format=PDB) + pdbpath = joinpath(pdb_dir,"1ALW$(pdbextension[PDB])") + @test isfile(pdbpath) && filesize(pdbpath) > 0 + # PDBXML format + downloadpdb("1alw", pdb_dir=pdb_dir, file_format=PDBXML) + pdbpath = joinpath(pdb_dir,"1ALW$(pdbextension[PDBXML])") + @test isfile(pdbpath) && filesize(pdbpath) > 0 + # MMCIF format + downloadpdb("1alw", pdb_dir=pdb_dir, file_format=MMCIF) + pdbpath = joinpath(pdb_dir,"1ALW$(pdbextension[MMCIF])") + @test isfile(pdbpath) && filesize(pdbpath) > 0 + # MMTF format + downloadpdb("1alw", pdb_dir=pdb_dir, file_format=MMTF) + pdbpath = joinpath(pdb_dir,"1ALW$(pdbextension[MMTF])") + @test isfile(pdbpath) && filesize(pdbpath) > 0 + # Obsolete PDB + downloadpdb("116l", pdb_dir=pdb_dir, file_format=PDB, obsolete=true) + pdbpath = joinpath(pdb_dir,"obsolete","116L$(pdbextension[PDB])") + @test isfile(pdbpath) && filesize(pdbpath) > 0 + # Biological Assembly - PDB format + downloadpdb("1alw", pdb_dir=pdb_dir, file_format=PDB, ba_number=1) + pdbpath = joinpath(pdb_dir,"1ALW_ba1$(pdbextension[PDB])") + @test isfile(pdbpath) && filesize(pdbpath) > 0 + # Biological Assembly - MMCIF format + downloadpdb("5a9z", pdb_dir=pdb_dir, file_format=MMCIF, ba_number=1) + pdbpath = joinpath(pdb_dir,"5A9Z_ba1$(pdbextension[MMCIF])") + @test isfile(pdbpath) && filesize(pdbpath) > 0 + # Download multiple PDB files + pdbidlist = ["1ent","1en2"] + downloadpdb(pdbidlist, pdb_dir=pdb_dir, file_format=PDB) + for pdbid in pdbidlist + pdbpath = joinpath(pdb_dir,"$(uppercase(pdbid))$(pdbextension[PDB])") + @test isfile(pdbpath) && 
filesize(pdbpath) > 0 + end + + # Test Retrieving and reading options + struc = retrievepdb("1AKE", pdb_dir=pdb_dir, structure_name="New name") + @test structurename(struc) == "New name" + @test countatoms(struc) == 3804 + + struc = retrievepdb("1AKE", pdb_dir=pdb_dir, obsolete=true, read_het_atoms=false) + @test countatoms(struc) == 3312 + @test serial(collectatoms(struc)[2000]) == 2006 + @test sum(map(ishetero, collectatoms(struc))) == 0 + + struc = retrievepdb("1AKE", pdb_dir=pdb_dir, ba_number=1, read_het_atoms=false, read_std_atoms=false) + @test countatoms(struc) == 0 + @test countresidues(struc) == 0 + @test countchains(struc) == 0 + @test countmodels(struc) == 0 + + struc = readpdb("1AKE", pdb_dir=pdb_dir, read_std_atoms=false) + @test countatoms(struc) == 492 + @test serial(collectatoms(struc)[400]) == 3726 + @test sum(map(ishetero, collectatoms(struc))) == 492 + + struc = readpdb("1AKE", pdb_dir=pdb_dir, ba_number=1, remove_disorder=true) + @test countatoms(struc) == 1954 + @test sum(map(isdisorderedatom, collectatoms(struc))) == 0 + @test tempfactor(struc['A'][167]["NE"]) == 23.32 +end @testset "Model" begin # Test constructors and indexing