Even faster did #209

Merged on Nov 13, 2024 (50 commits)

marcelortizv (Contributor)

This PR introduces a new argument in did called faster_mode, which improves the overall performance of did by optimizing both pre_process_did and compute.att_gt.

The main objective of these enhancements is to implement faster data management when setting up the 2x2 DiD cohort data before running DRDID estimation. This new approach avoids filtering data inside a g-t loop, which can be computationally expensive for large datasets. Instead, faster_mode introduces a new setup that:

  1. Orders the dataset by tname, gname, and idname.
  2. Identifies the position of observations relevant to the current g-t estimation using information on the number of observations in each period, group, and period-group combination.
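The two steps above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the package's R code: the toy rows, the `block` lookup, and the `rows_for` helper are all hypothetical names introduced here. The point is only that, once the data is sorted, cumulative counts per period-group cell give the row positions of each block, so no filtering is needed inside the g-t loop.

```python
from itertools import accumulate

INF = float("inf")  # stands in for "never treated"

# Hypothetical toy data: (id, period, group), deliberately unsorted.
rows = [
    (4, 2, 3), (1, 1, 2), (3, 1, INF), (2, 1, 3),
    (1, 2, 2), (3, 2, INF), (2, 2, 3), (4, 1, 3),
]

# Step 1: order by period, then group, then id.
rows_sorted = sorted(rows, key=lambda r: (r[1], r[2], r[0]))

# Step 2: count rows in each (period, group) cell, visiting cells in sorted order.
counts = {}
for _id, period, group in rows_sorted:
    counts[(period, group)] = counts.get((period, group), 0) + 1

# Cumulative counts turn into [start, end) positions for every cell.
# dict keeps insertion order, which here is the sorted order.
ends = list(accumulate(counts.values()))
starts = [0] + ends[:-1]
block = {cell: (s, e) for cell, s, e in zip(counts, starts, ends)}


def rows_for(period, group):
    """Return the rows of one period-group cell by position, without filtering."""
    start, end = block[(period, group)]
    return rows_sorted[start:end]
```

With this layout, `rows_for(2, 3)` is a plain slice of the sorted data rather than a scan over all rows.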

This process allows us to construct a vector of indicators, did_cohort_index, containing values of 1, 0, or NA to mark treated, untreated, and non-participating units for each g-t cell, respectively. The construction of did_cohort_index considers the structure of the data (panel vs RCS), the base period (universal vs varying), and the control group (never treated vs not-yet-treated).

Below is an example illustrating how did_cohort_index can be manually constructed, considering 3 groups, 3 periods, a universal base period, and both types of control groups:

                     never-treated controls         not-yet-treated controls
(g,t)                (2,2)  (2,3)  (3,1)  (3,3)     (2,2)  (2,3)  (3,1)  (3,3)
pret                 1      1      2      2         1      1      2      2

row   T    G
1     1    2         1      1      NA     NA        1      1      NA     NA
2     1    2         1      1      NA     NA        1      1      NA     NA
3     1    3         NA     NA     1      NA        0      NA     1      NA
4     1    3         NA     NA     1      NA        0      NA     1      NA
5     1    Inf       0      0      0      NA        0      0      0      NA
6     1    Inf       0      0      0      NA        0      0      0      NA
7     2    2         1      NA     NA     NA        1      NA     NA     NA
8     2    2         1      NA     NA     NA        1      NA     NA     NA
9     2    3         NA     NA     1      1         0      NA     1      1
10    2    3         NA     NA     1      1         0      NA     1      1
11    2    3         NA     NA     1      1         0      NA     1      1
12    2    Inf       0      NA     0      0         0      NA     0      0
13    2    Inf       0      NA     0      0         0      NA     0      0
14    3    2         NA     1      NA     NA        NA     1      NA     NA
15    3    2         NA     1      NA     NA        NA     1      NA     NA
16    3    3         NA     NA     NA     1         NA     NA     NA     1
17    3    3         NA     NA     NA     1         NA     NA     NA     1
18    3    Inf       NA     0      NA     0         NA     0      NA     0
19    3    Inf       NA     0      NA     0         NA     0      NA     0
20    3    Inf       NA     0      NA     0         NA     0      NA     0
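The rule behind the table above can be written down compactly. The following Python sketch is an illustration inferred from the example itself, not the implementation in the PR: `cohort_index` is a hypothetical helper, `None` plays the role of NA, and the not-yet-treated condition `group > max(t, pret)` is an assumption that reproduces the table but ignores options such as anticipation.

```python
INF = float("inf")  # never-treated group

# Period T and group G for the 20 rows of the example, in sorted order.
T = [1] * 6 + [2] * 7 + [3] * 7
G = [2, 2, 3, 3, INF, INF,
     2, 2, 3, 3, 3, INF, INF,
     2, 2, 3, 3, INF, INF, INF]


def cohort_index(g, t, pret, control="nevertreated"):
    """Build one column of the indicator: 1 treated, 0 control, None otherwise."""
    out = []
    for period, group in zip(T, G):
        if period not in (pret, t):
            out.append(None)   # row is outside the two periods of this 2x2
        elif group == g:
            out.append(1)      # treated cohort for this (g, t) cell
        elif group == INF:
            out.append(0)      # never-treated units are always controls
        elif control == "notyettreated" and group > max(t, pret):
            out.append(0)      # assumed rule: still untreated in both periods
        else:
            out.append(None)   # other cohorts do not participate
    return out
```

For example, `cohort_index(2, 2, pret=1, control="notyettreated")` marks the group-3 rows in periods 1 and 2 as controls, while the never-treated variant leaves them as `None`.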

Changes:

  • Added pre_process_did2, which processes arguments passed to the main methods in did and performs checks to ensure the data is in the correct format, providing helpful error messages when necessary. This function is analogous to pre_process_did but utilizes faster implementations, orders the data, and computes metadata that is used to populate did_cohort_index.
  • Added get_did_tensors, a utility function used by pre_process_did2, which splits the data into a list of outcome tensors and a list of arguments. Tensors are objects with dimensions id_count x 1 x time_periods_count and are used for faster filtering in the computation of the DiD estimator. This is only applicable to panel data.
  • Added validate_args and did_standarization, functions that validate arguments passed to att_gt() and standardize the data format, respectively.
  • Added compute.att_gt2, which processes the (g,t) cell, sends it to estimation, and then handles all post-processing steps to recover the same outputs as compute.att_gt, ensuring that subsequent procedures remain unaffected.
  • Added unit tests to verify that running att_gt with faster_mode = TRUE and faster_mode = FALSE produces the same results.
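To give a feel for why per-period outcome slices help with panel data, here is a small Python sketch. It is only an analogy for the tensors returned by get_did_tensors: the toy values, the `tensor` list of per-period vectors, and the `outcome_change` helper are hypothetical, and the real objects are R arrays with dimensions id_count x 1 x time_periods_count.

```python
# Hypothetical balanced panel: 4 units, 3 periods, sorted by period then id.
ids = [1, 2, 3, 4]
periods = [1, 2, 3]
long_y = [
    10.0, 20.0, 30.0, 40.0,  # period 1
    11.0, 22.0, 33.0, 44.0,  # period 2
    12.0, 24.0, 36.0, 48.0,  # period 3
]

n = len(ids)
# One slice per period; position i in every slice refers to the same unit.
tensor = [long_y[k * n:(k + 1) * n] for k in range(len(periods))]


def outcome_change(pret, t):
    """Outcome in period t minus outcome in period pret, unit by unit."""
    before = tensor[periods.index(pret)]
    after = tensor[periods.index(t)]
    return [y1 - y0 for y0, y1 in zip(before, after)]
```

Because units line up by position across slices, the outcome change for a (g,t) cell is an element-wise difference of two slices, with no joins or filters on the long data.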

Evidence:

Panel data; unique ids = 10^order, where order in {2,...,6}, time periods = 10, DR estimation.
[2 images omitted]

RCS; unique ids = 10^order, where order in {3,...,6}, time periods = 8, DR estimation.
[2 images omitted]

🚨 This PR may affect workflows that use did under the hood. All tests pass, but a careful review is recommended. To avoid disrupting existing workflows, the changes take effect only when faster_mode = TRUE; the default is FALSE, which preserves the current procedures, already efficient for most datasets.

@pedrohcgs merged commit effec10 into bcallaway11:master on Nov 13, 2024
6 checks passed