RNA graph construction and KNN representation #109
Comments
Hey, thanks for this Ryan! Looks exciting!

So, I think we should keep RNA secondary structure and 3D structure separate for now. The secondary structure is functional as a standalone piece of functionality (though it would be really nice to hook it up to Nussinov or bpRNA, the largest database I know of).

With respect to 3D graphs, I had a quick look at this. I think it's actually quite straightforward, as most of the components are implemented for protein structure graphs. Essentially, we can use the low-level API in graphein as building blocks and make a function more or less identical to the

We need some

Then, we simply add a new function

```python
import pandas as pd

# Assumed import path for graphein's existing dataframe filtering helper.
from graphein.protein.utils import filter_dataframe

# Atom names that occur in standard RNA nucleotides (sugar, phosphate and base atoms).
RNA_ATOMS = [
    "C1'",
    "C2",
    "C2'",
    "C3'",
    "C4",
    "C4'",
    "C5",
    "C5'",
    "C6",
    "C8",
    "N1",
    "N2",
    "N3",
    "N4",
    "N6",
    "N7",
    "N9",
    "O2",
    "O2'",
    "O3'",
    "O4",
    "O4'",
    "O5'",
    "O6",
    "OP1",
    "OP2",
    "P",
]


def subset_structure_to_rna(
    df: pd.DataFrame,
) -> pd.DataFrame:
    """
    Return a subset of atomic dataframe that contains only certain atom names relevant for RNA structures.

    :param df: Structure dataframe to subset
    :type df: pd.DataFrame
    :returns: Subsetted structure dataframe containing only RNA atoms
    :rtype: pd.DataFrame
    """
    # boolean=True keeps the rows whose atom_name is in RNA_ATOMS.
    return filter_dataframe(
        df, by_column="atom_name", list_of_values=RNA_ATOMS, boolean=True
    )
```

but more flexible (not keeping the

The only other line that breaks is this one and we easily fix it by removing the

What I'm unfamiliar with is how we coarsen the RNA graphs. E.g. all-atom is what I've described above. For proteins it's obviously very normal to consider the alpha carbon trace as representative of a residue-level graph. I'm not sure what the standard for RNA is. In any case, we can leave this open to users with the
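To make the coarsening question concrete, here is a minimal sketch of filtering an atomic dataframe by atom name and then keeping one representative atom per nucleotide. The helpers `keep_atoms` and `coarsen_to_atom` are hypothetical stand-ins (not graphein functions), and using `C4'` as the representative atom is only an assumption for illustration, not an established graphein default. The column names follow the BioPandas convention (`atom_name`, `residue_number`, `x_coord`).

```python
import pandas as pd

# Illustrative backbone subset; the full atom list is RNA_ATOMS in the comment above.
BACKBONE_ATOMS = ["P", "C4'"]


def keep_atoms(df: pd.DataFrame, atom_names: list) -> pd.DataFrame:
    """Hypothetical stand-in for filter_dataframe: keep rows whose atom_name is listed."""
    return df[df["atom_name"].isin(atom_names)].reset_index(drop=True)


def coarsen_to_atom(df: pd.DataFrame, atom_name: str = "C4'") -> pd.DataFrame:
    """Keep a single representative atom per nucleotide (assumed choice: C4')."""
    return keep_atoms(df, [atom_name])


# Toy dataframe: residue 2 carries a protein-style CA atom and is dropped by both filters.
atoms = pd.DataFrame(
    {
        "atom_name": ["P", "C4'", "N1", "CA", "P", "C4'"],
        "residue_number": [1, 1, 1, 2, 3, 3],
        "x_coord": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

backbone = keep_atoms(atoms, BACKBONE_ATOMS)  # P and C4' rows only
trace = coarsen_to_atom(atoms, "C4'")         # one row per nucleotide
```

Leaving the representative atom as a parameter keeps the choice open to users, in the same way the alpha carbon is a convention rather than a requirement for proteins.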
Came across this today: https://www.biorxiv.org/content/10.1101/2022.03.14.484334v1. Might be of interest to you @rg314
Just to follow up on this: we found that the nussinov.py algorithm isn't great at predicting the dot-bracket notation. I suggest that we create a container running https://github.com/rg314/centroid-rna-package and ping it to get the centroid secondary structure. What do you think @a-r-j?
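Whichever tool ends up producing the dot-bracket string, turning it into base-pair edges is the same step. A minimal sketch, assuming plain `(`, `)` and `.` characters only (no pseudoknot brackets); the function name `dot_bracket_to_pairs` is hypothetical and not part of graphein:

```python
def dot_bracket_to_pairs(structure: str) -> list:
    """Convert a dot-bracket string into a sorted list of (i, j) base-pair indices."""
    stack = []  # positions of currently unmatched "("
    pairs = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unsupported character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


pairs = dot_bracket_to_pairs("((..))")
```

Each returned pair can be added as a base-pairing edge, with backbone edges added separately between consecutive positions.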
Implemented in
I’ve just started looking at RNA graph construction. Ideally, I’d like to generate a KNN representation of the RNA. This is currently implemented for proteins by the `graphein.protein.edges.distance.add_k_nn_edges` function. In short, the edges for the KNN method are added by:
a. To compute the distance matrix, we need to know the x, y, z position of each base pair (BP) of the RNA.
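Once coordinates are available, the KNN step itself is small. The following is a NumPy-only sketch of the idea (pairwise distance matrix, then the k closest points per row); it is not graphein's `add_k_nn_edges` implementation, and the function name `knn_edges` is hypothetical.

```python
import numpy as np


def knn_edges(coords: np.ndarray, k: int) -> list:
    """Return directed (i, j) edges from each point to its k nearest neighbours."""
    n = coords.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    # Pairwise Euclidean distance matrix of shape (n, n).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A point must not be selected as its own neighbour.
    np.fill_diagonal(dist, np.inf)
    # Column indices of the k smallest distances in each row.
    neighbours = np.argsort(dist, axis=1)[:, :k]
    return [(i, int(j)) for i in range(n) for j in neighbours[i]]


# Four points on a line, standing in for one coordinate per nucleotide.
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [7.0, 0.0, 0.0]])
edges = knn_edges(coords, k=1)
```

The edges are directed because the nearest-neighbour relation is not symmetric: in the toy data, point 2 points to point 1, but point 1 points to point 0.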
At the moment, the x, y, z coordinates for protein structures are obtained from a PDB file. This is not currently built for RNA structures. For an RNA sequence, we must use the sequence and/or dot-bracket notation to get the 3D structural information.
If the dot-bracket notation is not provided, it can be calculated using the Nussinov algorithm (a dynamic programming approach; see https://github.com/cgoliver/Nussinov/blob/master/nussinov.py for a Python implementation). See the implementation at https://github.com/rg314/graphein/blob/rna-model/graphein/rna/nussinov.py
Note that the Nussinov algorithm does not guarantee that the dot-bracket notation is correct. There are several other ways of computing this.
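For reference, a compact sketch of the Nussinov recursion with traceback. This is an independent illustration, not the code in either linked repository; the allowed pairs (Watson-Crick plus G-U wobble) and the minimum hairpin loop length of 3 are assumptions. The algorithm only maximises the number of nested base pairs, which is why its output need not match the true structure.

```python
# Assumed pairing rules: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 3) -> str:
    """Return a dot-bracket string that maximises the number of nested base pairs."""
    n = len(seq)
    # dp[i][j] = maximum number of pairs in seq[i..j]; short spans stay 0.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback: recover one optimal structure from the filled table.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)


hairpin = nussinov("GGGAAAUCC")
```

When several structures share the maximum pair count, the traceback returns whichever it encounters first, which is one reason thermodynamic or centroid-based predictors are usually preferred.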
The PDB database contains some RNA structures (~5233). PandasPdb can be used to read a PDB file directly. I suggest adapting the current protein config so that RNA structures can be read in from a PDB file. @a-r-j what do you think? I have started to implement this; please see https://github.com/rg314/graphein/blob/35bd2297d28bf09bcf0fb98c10c3866d4be6cb83/graphein/rna/graphs.py#L209 (note that reading in the df is currently failing).
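A rough sketch of what reading an RNA structure with PandasPdb could look like, assuming BioPandas is installed and that the nucleotides appear as `ATOM` records with residue names `A`, `C`, `G`, `U` (modified nucleotides are often stored as `HETATM` records and would be missed). The helper names are hypothetical, and this is not the code in the linked branch.

```python
import pandas as pd

# Standard RNA residue names as they appear in PDB ATOM records (assumption).
RNA_RESIDUES = ["A", "C", "G", "U"]


def read_atom_records(pdb_path: str) -> pd.DataFrame:
    """Read the ATOM records of a local PDB file with BioPandas."""
    from biopandas.pdb import PandasPdb  # third-party; imported lazily

    return PandasPdb().read_pdb(pdb_path).df["ATOM"]


def select_rna_atoms(atom_df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that belong to standard RNA nucleotides."""
    mask = atom_df["residue_name"].isin(RNA_RESIDUES)
    return atom_df[mask].reset_index(drop=True)


# Hand-built stand-in for a parsed ATOM dataframe, so the filter runs without a file.
example = pd.DataFrame(
    {
        "residue_name": ["G", "ALA", "U", "DT"],
        "atom_name": ["P", "CA", "P", "P"],
        "chain_id": ["A", "B", "A", "C"],
    }
)
rna_only = select_rna_atoms(example)
```

With a real file, the two steps would be chained as `select_rna_atoms(read_atom_records(path))`, where `path` points to a local PDB file.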
Then we can look at alternative sources for reading in the structure.
For example, it appears that the Xiao lab (http://biophy.hust.edu.cn/new/) has a RESTful API to return RNA structure. However, I have not investigated this in detail, or whether it returns the correct 3D data. This could somewhat mimic the behaviour of `graphein.protein.utils.download_alphafold_structure`. Does anyone have an idea of other databases that could be used?
I’m also open to creating a server that can be contacted with a RESTful API to predict RNA structure. However, we would need to figure out the best implementation for structure prediction (and make sure it doesn’t take too long 😉).