Even faster did #209

marcelortizv · 2024-10-06T20:56:47Z

This PR introduces a new argument in did called faster_mode, which improves the overall performance of did by optimizing both pre_process_did and compute.att_gt.

The main objective of these enhancements is to implement faster data management when setting up the 2x2 DiD cohort data before running DRDID estimation. This new approach avoids filtering data inside a g-t loop, which can be computationally expensive for large datasets. Instead, faster_mode introduces a new setup that:

Orders the dataset by tname, gname, and idname.
Identifies the position of observations relevant to the current g-t estimation using information on the number of observations in each period, group, and period-group combination.
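
The two steps above can be sketched in base R. This is a minimal illustration on toy data, not the package's internal code: the column names id, period and G stand in for idname, tname and gname, and block_rows is a hypothetical helper.

```r
# Toy repeated cross-section: 20 rows, 3 periods, groups 2, 3 and never-treated (Inf).
# Column names (id, period, G) are placeholders for idname, tname and gname.
df <- data.frame(
  id     = 1:20,
  period = rep(1:3, times = c(6, 7, 7)),
  G      = c(2, 2, 3, 3, Inf, Inf,
             2, 2, 3, 3, 3, Inf, Inf,
             2, 2, 3, 3, Inf, Inf, Inf)
)
df <- df[rev(seq_len(nrow(df))), ]   # scramble the rows on purpose

# Step 1: order by period, then group, then id.
df <- df[order(df$period, df$G, df$id), ]
rownames(df) <- NULL

# Step 2: count rows per (period, group) block. After sorting, each block is
# contiguous, so cumulative counts give its first and last row.
blocks <- rle(paste(df$period, df$G))
last   <- cumsum(blocks$lengths)
first  <- last - blocks$lengths + 1L
names(first) <- names(last) <- blocks$values

# Hypothetical helper: rows of one (period, group) block, found from the
# counts alone rather than by filtering the data.
block_rows <- function(period, g) {
  key <- paste(period, g)
  seq.int(first[[key]], last[[key]])
}

block_rows(2, 3)    # rows of group 3 observed in period 2
block_rows(3, Inf)  # rows of never-treated units observed in period 3
```

Because the row positions come from counts computed once, no per-cell subsetting of the full dataset is needed inside the g-t loop.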

This process allows us to construct a vector of indicators, did_cohort_index, containing values of 1, 0, or NA to mark treated, untreated, and non-participating units for each g-t cell, respectively. The construction of did_cohort_index considers the structure of the data (panel vs. repeated cross-sections, RCS), the base period (universal vs. varying), and the control group (never-treated vs. not-yet-treated).
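
To show how such an index could be consumed, here is a small sketch of a 2x2 comparison for one g-t cell. It assumes an unconditional difference in means as a stand-in for the DRDID estimators that did actually calls; did_2x2 and the toy vectors are hypothetical.

```r
# Toy data for a single (g,t) cell with t = 2 and base period 1.
# idx plays the role of did_cohort_index: 1 = treated, 0 = control, NA = not used.
period <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3)
idx    <- c(1, 1, 0, 0, 1, 1, 0, 0, NA, NA)
y      <- c(1, 1, 1, 1, 4, 4, 2, 2, 9, 9)

# Hypothetical helper: plain difference-in-differences of cell means.
# This is a simplification; the package uses DRDID estimation instead.
did_2x2 <- function(y, period, idx, t, pret) {
  cell_mean <- function(d, p) mean(y[!is.na(idx) & idx == d & period == p])
  (cell_mean(1, t) - cell_mean(1, pret)) - (cell_mean(0, t) - cell_mean(0, pret))
}

att_22 <- did_2x2(y, period, idx, t = 2, pret = 1)
att_22
```

Rows with an NA index (period 3 here) never enter the computation, which is the role the NA value plays in the index.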

Below is an example illustrating how did_cohort_index can be manually constructed, considering 3 groups, 3 periods, a universal base period, and both types of control groups:

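Each index column is labeled (g,t) -> pret: the (g,t) cell followed by its pre-treatment (base) period. Columns prefixed NT use the never-treated control group; columns prefixed NYT use the not-yet-treated control group.

Row	T	G	NT (2,2) -> 1	NT (2,3) -> 1	NT (3,1) -> 2	NT (3,3) -> 2	NYT (2,2) -> 1	NYT (2,3) -> 1	NYT (3,1) -> 2	NYT (3,3) -> 2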
1	1	2	1	1	NA	NA	1	1	NA	NA
2	1	2	1	1	NA	NA	1	1	NA	NA
3	1	3	NA	NA	1	NA	0	NA	1	NA
4	1	3	NA	NA	1	NA	0	NA	1	NA
5	1	Inf	0	0	0	NA	0	0	0	NA
6	1	Inf	0	0	0	NA	0	0	0	NA
7	2	2	1	NA	NA	NA	1	NA	NA	NA
8	2	2	1	NA	NA	NA	1	NA	NA	NA
9	2	3	NA	NA	1	1	0	NA	1	1
10	2	3	NA	NA	1	1	0	NA	1	1
11	2	3	NA	NA	1	1	0	NA	1	1
12	2	Inf	0	NA	0	0	0	NA	0	0
13	2	Inf	0	NA	0	0	0	NA	0	0
14	3	2	NA	1	NA	NA	NA	1	NA	NA
15	3	2	NA	1	NA	NA	NA	1	NA	NA
16	3	3	NA	NA	NA	1	NA	NA	NA	1
17	3	3	NA	NA	NA	1	NA	NA	NA	1
18	3	Inf	NA	0	NA	0	NA	0	NA	0
19	3	Inf	NA	0	NA	0	NA	0	NA	0
20	3	Inf	NA	0	NA	0	NA	0	NA	0
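
The rule behind the table can be written as a short function. This is a sketch inferred from the table itself, assuming repeated cross-section data, a universal base period and no anticipation; cohort_index is a hypothetical name, not the function used in the package.

```r
# T and G columns of the 20-row example table.
period <- rep(1:3, times = c(6, 7, 7))
G <- c(2, 2, 3, 3, Inf, Inf,
       2, 2, 3, 3, 3, Inf, Inf,
       2, 2, 3, 3, Inf, Inf, Inf)

# Hypothetical helper: index for one (g,t) cell with base period pret.
cohort_index <- function(period, G, g, t, pret,
                         control = c("nevertreated", "notyettreated")) {
  control <- match.arg(control)
  idx <- rep(NA_integer_, length(G))
  in_cell <- period %in% c(t, pret)   # only rows observed in t or the base period
  is_control <- if (control == "nevertreated") {
    is.infinite(G)
  } else {
    G > max(t, pret)                  # not yet treated; Inf also qualifies
  }
  idx[in_cell & is_control] <- 0L
  idx[in_cell & G == g] <- 1L         # treated cohort takes precedence
  idx
}

nt  <- cohort_index(period, G, g = 2, t = 2, pret = 1, control = "nevertreated")
nyt <- cohort_index(period, G, g = 2, t = 2, pret = 1, control = "notyettreated")
cbind(period, G, nt, nyt)
```

Under these assumptions, nt and nyt match the first and the second (2,2) -> 1 column of the table: group 3 is unused with never-treated controls and serves as a control with not-yet-treated controls.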

Changes:

Added pre_process_did2, which processes arguments passed to the main methods in did and performs checks to ensure the data is in the correct format, providing helpful error messages when necessary. This function is analogous to pre_process_did but utilizes faster implementations, orders the data, and computes metadata that is used to populate did_cohort_index.
Added get_did_tensors, a utility function used by pre_process_did2, which splits the data into a list of outcome tensors and a list of arguments. Tensors are objects with dimensions id_count x 1 x time_periods_count and are used for faster filtering in the computation of the DiD estimator. This is only applicable to panel data.
Added validate_args and did_standarization, functions that validate arguments passed to att_gt() and standardize the data format, respectively.
Added compute.att_gt2, which processes the (g,t) cell, sends it to estimation, and then handles all post-processing steps to recover the same outputs as compute.att_gt, ensuring that subsequent procedures remain unaffected.
Added unit tests to verify that running att_gt with faster_mode = TRUE and faster_mode = FALSE produces the same results.
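
As an illustration of the outcome tensors mentioned for get_did_tensors, the sketch below builds an id_count x 1 x time_periods_count array from a balanced toy panel in base R. It is an assumption about the general idea, not the function's actual code; all names are hypothetical.

```r
# Balanced toy panel: 3 units observed in 4 periods, y = 10 * id + period.
panel <- expand.grid(id = 1:3, period = 1:4)
panel$y <- 10 * panel$id + panel$period
panel <- panel[rev(seq_len(nrow(panel))), ]   # scramble the rows on purpose

# Sort by period, then id, so that ids vary fastest. R fills arrays in
# column-major order, so each slice [, 1, t] then holds period t for all ids.
panel <- panel[order(panel$period, panel$id), ]
n_id <- length(unique(panel$id))
n_t  <- length(unique(panel$period))
y_tensor <- array(panel$y, dim = c(n_id, 1, n_t))

# Outcome change between a base period and period t is a slice difference,
# with no filtering of the long data.
delta_y <- y_tensor[, 1, 4] - y_tensor[, 1, 1]
delta_y
```

This layout only works for a balanced panel, where every unit appears once in every period.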

Evidence:

Panel data; unique ids = 10^order, where order in {2,...,6}, time periods = 10, DR estimation.

RCS; unique ids = 10^order, where order in {3,...,6}, time periods = 8, DR estimation.

🚨 This PR may affect workflows that use did under the hood. While all tests are passing, a careful review is recommended. To prevent disruptions in existing workflows, these changes are implemented under the argument faster_mode = TRUE, with the default set to FALSE. This default preserves the current procedures, which are already efficient for most datasets.
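
A minimal way to try the new argument and compare it with the default path is sketched below. It assumes the did package is installed in a version that includes faster_mode, and it uses the mpdta example dataset shipped with did; the guard skips the comparison otherwise.

```r
same_att <- NA   # stays NA if did (with faster_mode) is not available

if (requireNamespace("did", quietly = TRUE) &&
    "faster_mode" %in% names(formals(did::att_gt))) {
  data(mpdta, package = "did")

  # Same call twice, toggling only faster_mode. The bootstrap is switched off
  # so that no random draws are involved in the comparison.
  run <- function(fast) {
    did::att_gt(yname = "lemp", tname = "year", idname = "countyreal",
                gname = "first.treat", data = mpdta,
                bstrap = FALSE, cband = FALSE, faster_mode = fast)
  }
  fast <- run(TRUE)
  slow <- run(FALSE)

  # Compare the group-time average treatment effects from both paths.
  same_att <- isTRUE(all.equal(fast$att, slow$att))
}
same_att
```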

marcelortizv and others added 30 commits September 13, 2024 19:40

adding new preprocess function

6337fde

adding changes from master

63cb307

adding did_standarization function

955bf44

adding new preprocess and DIDparam, still WIP

88a2a7b

init new compute.att_gt function, still WIP

a222c49

adding run_att_gt_estimation

4a1012a

small fix in att.gt estimation

713e3aa

estimation for panel data

3964e89

just panel data

9f59bfa

adding new for-loop

5ed0bae

adding compute.att_gt for panel data

5d397cc

fast did estimation with balanced panel data

c5c5207

improving documentation for current procedure in panel data

1485868

fixing results for base_period = 'universal'

540d2a8

fixing output according to base_period

884dfaa

fixing control_group index

5d80a0f

adding estimation for rcs data

99bee3d

fixing logic for varying base_period and notyettreated in rcs data

38dca4c

adding documentation and unit tests for PR

b55a022

dropping rcpp exports not needed

ce64d14

fixing version of DRDID

6ed0a02

fixing unit test

bca02ca

changing min R version

627e75e

Update compute.att_gt2.R

10fbcec

Update DESCRIPTION

961f735

Fix some pre-processing

4d2bdfa

fixing bootstrap with faster_mode

fdb17e9

fixing aggregation under faster_mode

5ecc54c

solving not yet treated comparison group unit test

fe2670b

fixing some units treated in first period unit test

ea166ea

marcelortizv and others added 20 commits October 17, 2024 00:09

fixing small comparison group unit test

3faed16

fixing bug where g is a column in data

4837bf2

fixing anticipation unit test

eb8e647

fix malinformed unit test

f7b8f32

fixing unbalanced panel data

0509e77

fixing clustered std. errors

cce42f8

adding fix for unbalanced panel data influence functions

a742f37

enabling clustered std. error with unbalanced panel

cd98b2e

updating pre_process2

618ad7c

fixing check for clustervars

08d7255

Merge branch 'master' into even-faster-did

de4f690

fixing tests

de183ae

fixing last period for notyettreated groups

b5f6306

Update test_sim_data_2_groups.R

14cd551

Update DESCRIPTION

b31a003

fixing construction of cohort index for notyettreated

8cc0f10

Merge branch 'even-faster-did' of https://github.com/marcelortizv/did …

726e4eb

…into even-faster-did

version control

46e2b09

Merge branch 'even-faster-did' of https://github.com/marcelortizv/did …

ea1b507

…into pr/209

version control

d59c156

pedrohcgs merged commit effec10 into bcallaway11:master Nov 13, 2024
6 checks passed
