Read R datasets from Python.
The package rdata offers a lightweight way to import R datasets/objects stored in the ".rda" and ".rds" formats into Python. Its main advantages are:
- It is a pure Python implementation, with no dependencies on the R language or related libraries. Thus, it can be used anywhere where Python is supported, including the web using Pyodide.
- It attempts to support all R objects that can be meaningfully translated. As opposed to other solutions, you are not limited to importing dataframes or data with a particular structure.
- It allows users to easily customize the conversion of R classes to Python ones. Does your data use custom R classes? Worry no longer, as it is possible to define custom conversions to the Python classes of your choosing.
- It has a permissive license (MIT). As opposed to other packages that depend on R libraries and thus need to adhere to the GPL license, you can use rdata as a dependency in MIT, BSD, or even closed-source projects.
rdata is on PyPI and can be installed using pip:
pip install rdata
It is also available for conda using the conda-forge channel:
conda install -c conda-forge rdata
The current version from the develop branch can be installed with:
pip install git+https://github.com/vnmabus/rdata.git@develop
The documentation of rdata is in ReadTheDocs.
Examples of use are available in ReadTheDocs.
The usual way to read an R dataset is the following:
import rdata
converted = rdata.read_rda(rdata.TESTDATA_PATH / "test_vector.rda")
converted
which results in
{'test_vector': array([1., 2., 3.])}
Under the hood, this is equivalent to the following code:
import rdata
parsed = rdata.parser.parse_file(rdata.TESTDATA_PATH / "test_vector.rda")
converted = rdata.conversion.convert(parsed)
converted
This consists of two steps:
- First, the file is parsed using the function rdata.parser.parse_file. This provides a literal description of the file contents as a hierarchy of Python objects representing the basic R objects. This step is unambiguous and always the same.
- Then, each object must be converted to an appropriate Python object. In this step there are several possible choices about which Python type is the most appropriate conversion for a given R object. Thus, we provide a default rdata.conversion.convert routine, which tries to select Python objects that preserve as much information from the original R object as possible. For custom R classes, it is also possible to specify conversion routines to Python objects.
The basic convert routine only constructs a SimpleConverter object and calls its convert method. All arguments of convert are directly passed to the SimpleConverter initialization method.
It is possible, although not trivial, to make a custom Converter object to change the way in which the basic R objects are transformed to Python objects. However, a more common situation is that one does not want to change how basic R objects are converted, but instead wants to provide conversions for specific R classes. This can be done by passing a dictionary to the SimpleConverter initialization method, with the names of R classes as keys and, as values, callables that convert an R object of that class to a Python object. By default, the dictionary used is DEFAULT_CLASS_MAP, which can convert commonly used R classes such as data.frame and factor.
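The dictionary-based dispatch described above can be pictured with a small, self-contained sketch. This is a conceptual toy and not rdata's actual implementation: ToyConverter, toy_convert, TOY_CLASS_MAP and levels_to_strings are hypothetical names, and a parsed R object is assumed here to be just a plain value plus a dictionary of attributes.

```python
from typing import Any, Callable, Dict, Mapping

# Hypothetical constructor signature: (value, attributes) -> Python object.
Constructor = Callable[[Any, Mapping[str, Any]], Any]


def levels_to_strings(obj, attrs):
    """Map 1-based factor codes to their level labels."""
    return [attrs["levels"][i - 1] for i in obj]


# Hypothetical default map, playing the role of DEFAULT_CLASS_MAP.
TOY_CLASS_MAP: Dict[str, Constructor] = {"factor": levels_to_strings}


class ToyConverter:
    """Toy converter that dispatches on the R class names of an object."""

    def __init__(self, constructor_dict=TOY_CLASS_MAP):
        self.constructor_dict = constructor_dict

    def convert(self, obj, attrs):
        # Try each R class name in order; the first one with a
        # registered constructor decides the resulting Python object.
        for class_name in attrs.get("class", []):
            constructor = self.constructor_dict.get(class_name)
            if constructor is not None:
                return constructor(obj, attrs)
        # No registered class: keep the basic value unchanged.
        return obj


def toy_convert(obj, attrs, **kwargs):
    """Build a ToyConverter from the keyword arguments and run it."""
    return ToyConverter(**kwargs).convert(obj, attrs)


attrs = {"class": ["factor"], "levels": ["a", "b"]}
default_result = toy_convert([1, 2, 2], attrs)

# Override only the "factor" entry, keeping the rest of the map.
custom_map = {**TOY_CLASS_MAP, "factor": lambda obj, attrs: len(obj)}
custom_result = toy_convert([1, 2, 2], attrs, constructor_dict=custom_map)
```

In this toy, overriding one key of the map changes how that class is converted while every other entry keeps its default behaviour, which is the same pattern used when extending DEFAULT_CLASS_MAP.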
As an example, here is how we would implement a conversion routine for the factor class to bytes objects, instead of the default conversion to Pandas Categorical objects:
import rdata

def factor_constructor(obj, attrs):
    # R factor codes are 1-based indices into attrs['levels'].
    values = [bytes(attrs['levels'][i - 1], 'utf8')
              if i >= 0 else None for i in obj]
    return values

# Start from the default class map and override only the "factor" entry.
new_dict = {
    **rdata.conversion.DEFAULT_CLASS_MAP,
    "factor": factor_constructor,
}

converted = rdata.read_rda(
    rdata.TESTDATA_PATH / "test_dataframe.rda",
    constructor_dict=new_dict,
)
converted
which has the following result:
{'test_dataframe':   class  value
1  b'a'      1
2  b'b'      2
3  b'b'      3}
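To see what the custom constructor does in isolation, it can be called directly on hand-made inputs, without reading any file. The sketch below repeats factor_constructor from the example above. The sample codes and levels are made up for illustration, and the negative code is only an assumed stand-in for a missing value, chosen to exercise the None branch.

```python
def factor_constructor(obj, attrs):
    # Codes are 1-based indices into attrs['levels'];
    # negative codes are mapped to None.
    values = [bytes(attrs['levels'][i - 1], 'utf8')
              if i >= 0 else None for i in obj]
    return values


# Hypothetical inputs: integer codes plus the attribute dictionary
# that carries the factor levels.
codes = [1, 2, 2, -1]
attrs = {"levels": ["a", "b"]}

result = factor_constructor(codes, attrs)
print(result)  # [b'a', b'b', b'b', None]
```

Testing a constructor this way, before plugging it into the class map, makes it easier to tell whether an unexpected result comes from the constructor or from the file being read.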
Additional examples illustrating the functionalities of this package can be found in the ReadTheDocs documentation.