A wrapper around Hugging Face to load data for eFold. You can:
- pull datasets from the Rouskinlab Hugging Face page
- create datasets from local files
pip install rouskinhf
- get an access token from the Rouskinlab Hugging Face page
- add this token to your environment
export HUGGINGFACE_TOKEN="hf_yourtokenhere"
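If you want to confirm from Python that the token is visible before pulling data, a minimal sketch could look like the following. The helper name `get_hf_token` is hypothetical and not part of rouskinhf; the only thing taken from the instructions above is the `HUGGINGFACE_TOKEN` variable name.

```python
import os


def get_hf_token(env=None):
    """Return the value of HUGGINGFACE_TOKEN, or raise if it is missing.

    Hypothetical helper for illustration; `env` defaults to os.environ
    and can be replaced by a plain dict when testing.
    """
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACE_TOKEN", "")
    if not token:
        raise RuntimeError(
            "HUGGINGFACE_TOKEN is not set; export it in your shell first."
        )
    return token
```

Calling `get_hf_token()` in a fresh shell is a quick way to catch a missing `export` before a download fails.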
You'll need to install D. Mathews's RNAstructure Fold (also available on the Rouskinlab GitHub).
Check your RNAstructure Fold installation in a terminal:
Fold --version
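The same check can be approximated from Python by looking for the executable on your `PATH`. This is only a sketch: `find_fold` is a hypothetical helper, and it assumes the binary is named `Fold` as in the command above. It does not verify the version.

```python
import shutil


def find_fold(executable="Fold"):
    """Return the full path to the Fold executable, or None if not on PATH."""
    return shutil.which(executable)


fold_path = find_fold()
if fold_path is None:
    print("Fold was not found on PATH; check your RNAstructure installation.")
else:
    print(f"Fold found at {fold_path}")
```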
import rouskinhf
rouskinhf.get_dataset(
    name='bpRNA-1m',  # the name of a dataset from huggingface/rouskinlab
    force_download=False,  # use a local copy of the data if it exists
)
import rouskinhf
rouskinhf.convert(
    format='ct',  # can be ct, seismic, bpseq, fasta or json (rouskinhf output data structure)
    file_or_folder='path/to/my/ct/folder',
    predict_structure=False,  # add a structure predicted by RNAstructure
    filter=True,  # remove duplicates, sequences with unsupported characters, and low-AUROC entries
    min_AUROC=0.8,
)
Note: Sequences with bases other than A, C, G, T, U, N, a, c, g, t, u, n are not supported. These sequences will be filtered out.
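To pre-check your own sequences against this rule, a small sketch like the one below may help. It only illustrates the allowed alphabet listed in the note; `has_supported_bases` is a hypothetical helper, not the filter that rouskinhf runs internally.

```python
# Alphabet taken from the note above: A, C, G, T, U, N in both cases.
ALLOWED_BASES = frozenset("ACGTUNacgtun")


def has_supported_bases(sequence):
    """Return True if every character of `sequence` is an allowed base."""
    return all(base in ALLOWED_BASES for base in sequence)


sequences = {"ok": "CACGCUAUG", "bad": "CACGXUAUG"}
# Keep only the entries that would pass the character check.
kept = {name: seq for name, seq in sequences.items() if has_supported_bases(seq)}
print(sorted(kept))
```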
# rouskinhf_output_file.json
{
"reference_name": {
"sequence": "CACGCUAUG",
"structure": [(0,8), (1,7)], # base pair representation
# whatever other info you need
}
}
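As an illustration of the base pair representation, the sketch below turns a list of pairs into dot-bracket notation. It assumes the pairs are 0-indexed, which matches the example above (pair `(0, 8)` on a 9-base sequence), and it does not handle pseudoknots. `pairs_to_dot_bracket` is a hypothetical helper, not a rouskinhf function.

```python
def pairs_to_dot_bracket(length, pairs):
    """Convert 0-indexed base pairs into a dot-bracket string.

    Assumes nested (non-pseudoknotted) pairs; unpaired positions become '.'.
    """
    chars = ["."] * length
    for i, j in pairs:
        opening, closing = min(i, j), max(i, j)
        chars[opening] = "("
        chars[closing] = ")"
    return "".join(chars)


# Record shaped like the example output above.
record = {"sequence": "CACGCUAUG", "structure": [(0, 8), (1, 7)]}
dot_bracket = pairs_to_dot_bracket(len(record["sequence"]), record["structure"])
print(dot_bracket)  # ((.....))
```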