Data Processing (with Pandas)

Mark Keller edited this page Feb 21, 2019 · 2 revisions

Tidy data

Awesome article about cleaning up data

In the context of pandas, review the functions melt, stack/unstack, and pivot.
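
As a rough illustration of how these functions relate, here is a minimal sketch on a made-up table. The column names `sample`, `C>A`, and `C>T` are assumptions chosen for the example and do not come from any particular dataset.

```python
import pandas as pd

# Hypothetical wide table: one row per sample, one column per category.
wide = pd.DataFrame({
    "sample": ["S1", "S2"],
    "C>A": [3, 5],
    "C>T": [7, 2],
})

# melt: wide -> long ("tidy"), one row per (sample, category) observation.
tidy = wide.melt(id_vars="sample", var_name="category", value_name="count")

# pivot: long -> wide again, undoing the melt above.
wide_again = tidy.pivot(index="sample", columns="category", values="count")

# stack/unstack move a level between the column index and the row index.
stacked = wide.set_index("sample").stack()   # Series indexed by (sample, column)
unstacked = stacked.unstack()                # back to one column per category
```

The long form produced by melt is usually the easier one to filter, group, and plot; pivot and unstack are handy when a matrix-like layout is needed again.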

Reading in tabular data

When reading data with pd.read_csv, specify data types whenever possible so that pandas does not have to infer them, which avoids a class of subtle bugs. Create a dictionary mapping column names to data types, then pass it to pandas using the dtype parameter. As an added benefit, the keys of this dictionary can be passed to the usecols parameter to read in only the columns of interest.

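import pandas as pd

# PATIENT, SAMPLE, etc. are assumed to be column-name constants defined elsewhere.
dtypes = {
    PATIENT: str,
    SAMPLE: str,
    CANCER_TYPE: str,
    CHR: str,
    POS_START: int,
    POS_END: int,
    REF: str,
    VAR: str
}
# Read only the columns listed in dtypes, using the declared types.
df = pd.read_csv(args.input_file, sep='\t', usecols=list(dtypes.keys()), dtype=dtypes)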
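
To try the pattern without a real input file, the following sketch reads a small in-memory table. The column names `Patient`, `Chr`, `Pos`, and `Notes` and the two data rows are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row TSV standing in for a real input file.
tsv = io.StringIO(
    "Patient\tChr\tPos\tNotes\n"
    "P1\t1\t12345\tfoo\n"
    "P2\tX\t67890\tbar\n"
)

# Map only the columns of interest to their types.
dtypes = {"Patient": str, "Chr": str, "Pos": int}

df = pd.read_csv(tsv, sep="\t", usecols=list(dtypes), dtype=dtypes)

# "Notes" is never loaded, and "Chr" stays a string even for
# numeric-looking values such as "1".
print(df.dtypes)
```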

Deriving a new column from an existing one

Use pandas .apply on the "source" column:

df["destination"] = df["source"].apply(my_conversion_function)
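
A minimal sketch with a concrete function follows; `strip_chr_prefix` and the sample values are hypothetical and only stand in for whatever conversion is needed.

```python
import pandas as pd

df = pd.DataFrame({"source": ["chr1", "chrX", "chr22"]})

def strip_chr_prefix(value):
    """Hypothetical conversion: 'chr1' -> '1'."""
    return value[3:] if value.startswith("chr") else value

# The function is called once per element of the "source" column.
df["destination"] = df["source"].apply(strip_chr_prefix)
```

For simple string or arithmetic transformations, the vectorized pandas operations (for example the `.str` accessor) are usually faster than `.apply`.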

Deriving a new column from multiple existing ones

df["destination"] = df.apply(my_conversion_function, axis='columns')
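
Here the conversion function receives a whole row. The sketch below assumes two hypothetical columns named `POS_START` and `POS_END` and an invented helper `variant_length`.

```python
import pandas as pd

df = pd.DataFrame({
    "POS_START": [100, 250],
    "POS_END": [100, 253],
})

def variant_length(row):
    """Hypothetical conversion: inclusive length of the interval."""
    return row["POS_END"] - row["POS_START"] + 1

# axis='columns' passes each row to the function as a Series
# whose index is the column names.
df["LENGTH"] = df.apply(variant_length, axis="columns")
```

Row-wise apply is flexible but slow on large tables; when the logic is plain arithmetic, `df["POS_END"] - df["POS_START"] + 1` gives the same result without a Python-level loop.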

Reordering columns by indexing with an array

This is particularly important for enforcing a consistent ordering of mutation signature categories, but it is useful whenever downstream code depends on column order.

categories = ["A[C>A]A", "A[C>A]C", ...]
df = df[categories]
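
The sketch below shows the reordering on a tiny made-up table, along with `reindex` as one possible alternative when some categories may be absent. The counts and the extra category `A[C>A]G` are assumptions for the example.

```python
import pandas as pd

# Hypothetical counts whose columns arrived in arbitrary order.
df = pd.DataFrame({
    "A[C>A]C": [4, 1],
    "A[C>A]A": [2, 0],
})

categories = ["A[C>A]A", "A[C>A]C"]

# Indexing with a list returns the columns in exactly that order
# and raises a KeyError if any listed column is missing.
df = df[categories]

# reindex is a more forgiving variant: a missing category becomes
# a new column filled with fill_value instead of raising.
padded = df.reindex(columns=categories + ["A[C>A]G"], fill_value=0)
```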

Using Enums

If a column of your dataset needs to be a categorical variable with relatively few values, use Python's Enum class:

from enum import Enum

class VITAL_STATUS_VALS(Enum):
    ALIVE = 'Alive'
    DEAD = 'Dead'

Then use the class like this:

# Access a value
VITAL_STATUS_VALS.ALIVE.value
# Get a list of all values
[e.value for e in VITAL_STATUS_VALS]
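
One way the enum might be used for validation is sketched below. The helper `is_valid_status` is hypothetical, and the class is repeated so the block runs on its own.

```python
from enum import Enum

class VITAL_STATUS_VALS(Enum):
    ALIVE = 'Alive'
    DEAD = 'Dead'

# Look up a member from a raw value; an unknown value raises ValueError.
status = VITAL_STATUS_VALS('Alive')

# Collect the allowed raw values once for fast membership checks.
allowed = {e.value for e in VITAL_STATUS_VALS}

def is_valid_status(raw):
    """Hypothetical helper: True if raw is one of the enum's values."""
    return raw in allowed
```

Checking a column against `allowed` before further processing catches misspelled or unexpected categories early.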