Data Processing (with Pandas)
An excellent paper on cleaning up (tidying) data: https://vita.had.co.nz/papers/tidy-data.pdf
In the context of pandas, review the functions melt, stack/unstack, and pivot.
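As a rough illustration of how these four functions relate, the sketch below reshapes a small made-up table from wide to long form and back. The column names and counts are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical wide table: one row per sample, one column per mutation category.
wide = pd.DataFrame({
    "sample": ["S1", "S2"],
    "C>A": [3, 5],
    "C>T": [7, 2],
})

# melt: wide -> long, giving one row per (sample, category) pair.
long_df = wide.melt(id_vars="sample", var_name="category", value_name="count")

# pivot: long -> wide again; requires each (index, column) pair to be unique.
back = long_df.pivot(index="sample", columns="category", values="count")

# stack: move the column labels into an inner index level (result is a Series).
stacked = back.stack()

# unstack: move the inner index level back out into columns.
unstacked = stacked.unstack()
```

In this sketch melt and pivot undo each other, as do stack and unstack; the difference is that melt and pivot work with ordinary columns while stack and unstack work with index levels.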
When using pd.read_csv, specify data types whenever possible to avoid bugs. Create a dictionary mapping column names to data types, then pass it to pandas through the dtype parameter. An added benefit is that you can read only the columns of interest by passing the dictionary's keys to the usecols parameter.
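import pandas as pd

# Map each column of interest to its expected data type.
dtypes = {
    "PATIENT": str,
    "SAMPLE": str,
    "CANCER_TYPE": str,
    "CHR": str,
    "POS_START": int,
    "POS_END": int,
    "REF": str,
    "VAR": str,
}
# Read only the columns named in dtypes, each with its declared type.
# args.input_file is the path to a tab-separated input file.
df = pd.read_csv(args.input_file, sep='\t', usecols=dtypes.keys(), dtype=dtypes)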
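The effect of combining dtype and usecols can be checked on a small in-memory file. The sketch below is only an illustration: the rows and the extra NOTES column are made up. It shows why a column such as CHR is worth declaring as str, since a file that happens to contain only numeric chromosomes would otherwise be inferred as integers.

```python
import io
import pandas as pd

# Hypothetical tab-separated input with one extra column we do not need.
tsv = io.StringIO(
    "PATIENT\tCHR\tPOS_START\tNOTES\n"
    "P001\t1\t12345\tignored\n"
    "P002\tX\t67890\tignored\n"
)

dtypes = {"PATIENT": str, "CHR": str, "POS_START": int}

# usecols restricts parsing to the dictionary's keys;
# dtype fixes the type of each of those columns.
df = pd.read_csv(tsv, sep="\t", usecols=list(dtypes), dtype=dtypes)
```

Here CHR stays a string column ("1" and "X"), and NOTES is never loaded.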
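To derive a new column from an existing one, use pandas .apply on the "source" column:
df["destination"] = df["source"].apply(my_conversion_function)
If the conversion needs values from several columns, apply the function to each row of the DataFrame instead:
df["destination"] = df.apply(my_conversion_function, axis='columns')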
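The two forms of .apply pass different things to the conversion function. A minimal sketch, using hypothetical functions and column names: the Series form hands over one cell value at a time, while the DataFrame form with axis='columns' hands over each row as a Series.

```python
import pandas as pd

df = pd.DataFrame({"ref": ["C", "T"], "var": ["A", "G"]})

def to_lower(value):
    # Hypothetical element-wise conversion: receives a single cell value.
    return value.lower()

def substitution(row):
    # Hypothetical row-wise conversion: receives a whole row as a Series.
    return f"{row['ref']}>{row['var']}"

# Series.apply: the function is called once per value in the column.
df["ref_lower"] = df["ref"].apply(to_lower)

# DataFrame.apply with axis="columns": the function is called once per row.
df["sub"] = df.apply(substitution, axis="columns")
```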
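Set the column order explicitly by indexing the DataFrame with a list of column names. This is particularly important for enforcing the ordering of mutation signature categories, but it is useful in many other situations too.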
categories = ["A[C>A]A", "A[C>A]C", ...]
df = df[categories]
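A small sketch of this reordering, with made-up counts and a hypothetical extra column. Note that indexing with a list both orders and selects: columns not in the list are dropped, and a name missing from the DataFrame raises a KeyError.

```python
import pandas as pd

# Hypothetical counts whose columns arrive in an arbitrary order.
df = pd.DataFrame({"A[C>A]C": [2], "A[C>A]A": [1], "extra": [0]})

# The desired, fixed column order.
categories = ["A[C>A]A", "A[C>A]C"]

# Keep only the listed columns, in exactly this order.
ordered = df[categories]

# Asking for a column that does not exist fails loudly.
try:
    df[["A[C>A]G"]]
    missing_raises = False
except KeyError:
    missing_raises = True
```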
If a column of your dataset needs to be a categorical variable with relatively few values, use Python's Enum class:
from enum import Enum

class VITAL_STATUS_VALS(Enum):
    ALIVE = 'Alive'
    DEAD = 'Dead'
Then use the class like this:
# Access a value
VITAL_STATUS_VALS.ALIVE.value
# Get a list of all values
[e.value for e in VITAL_STATUS_VALS]
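One possible way to connect such an enum to a pandas column is to use its values as the allowed categories of a pd.Categorical. This is only a sketch under that assumption; the "vital_status" column and the "Unknown" entry are hypothetical.

```python
from enum import Enum
import pandas as pd

class VITAL_STATUS_VALS(Enum):
    ALIVE = 'Alive'
    DEAD = 'Dead'

# The enum's values define the allowed categories.
allowed = [e.value for e in VITAL_STATUS_VALS]

df = pd.DataFrame({"vital_status": ["Alive", "Dead", "Unknown"]})

# Values outside the allowed categories become missing (NaN).
df["vital_status"] = pd.Categorical(df["vital_status"], categories=allowed)

# Looking up a member by value raises ValueError for anything not in the enum.
status = VITAL_STATUS_VALS("Dead")
```

Checking df["vital_status"].isna() afterwards is a quick way to find entries that did not match any enum value.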