Inspired by the work by @apoorvalal and @s3alfisc (#574 and here), I suggest adding a new method that computes the first- and second-stage regressions of the 2SLS estimator using sufficient statistics. The trick is to predict the endogenous covariate from the discrete exogenous covariates and instruments in the first-stage regression, and then to estimate the second stage using this predicted value. Note that the predicted values of the endogenous covariate remain discrete, as they can be collapsed to match the number of unique values of the discrete control and instrument variables. A minimal sketch of the idea is shown right below, followed by the full code example using duckreg by @apoorvalal.
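To make the collapse concrete first, here is a small plain pandas/pyfixest sketch (my own illustration, not duckreg code; the small `n`, the seed, and the variable names are placeholders). It checks that the second stage run on the collapsed cells reproduces the full-data 2SLS point estimate:

```python
import numpy as np
import pandas as pd
import pyfixest as pf

rng = np.random.default_rng(0)
n = 100_000
z = rng.binomial(1, 0.5, n)
z2 = rng.binomial(1, 0.5, n)
d = 0.5 * z + 1.2 * z2 + rng.normal(size=n)
y = 1.0 + 2.3 * d + rng.normal(size=n)
small = pd.DataFrame({"y": y, "d": d, "z": z, "z2": z2})

# First stage by hand: regress d on the (discrete) instruments
pi = pf.feols("d ~ z + z2", small).coef()
small["d_hat"] = pi["Intercept"] + pi["z"] * small["z"] + pi["z2"] * small["z2"]

# d_hat takes at most 4 distinct values (one per (z, z2) cell), so the second
# stage can be run on cell-level means weighted by the cell counts
cells = (
    small.groupby(["z", "z2", "d_hat"], as_index=False)
    .agg(mean_y=("y", "mean"), count=("y", "size"))
)
beta_collapsed = pf.feols("mean_y ~ d_hat", cells, weights="count").coef()["d_hat"]
beta_2sls = pf.feols("y ~ 1 | d ~ z + z2", small).coef()["d"]
print(beta_collapsed, beta_2sls)  # should agree up to floating-point error
```

The full duckreg-based example follows.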
```python
import numpy as np
import pandas as pd
import duckdb
import pyfixest as pf
from duckreg.estimators import DuckRegression


# Generate sample data
def generate_sample_data(N=10_000_000, seed=12345):
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.5, size=(N, 1))
    z2 = rng.binomial(1, 0.5, size=(N, 1))
    d = 0.5 * z + 1.2 * z2 + rng.normal(size=(N, 1))
    fx = rng.choice(range(20), (N, 2), True)
    y = 1.0 + 2.3 * d + fx @ np.array([1, 2]).reshape(2, 1) + rng.normal(size=(N, 1))
    df = pd.DataFrame(
        np.concatenate([y, d, z, z2, fx], axis=1),
        columns=["y", "d", "z", "z2", "fx1", "fx2"],
    ).assign(rowid=range(N))
    return df


# Create and populate a DuckDB database
def create_duckdb_database(df, db_name="large_dataset.db", table="data"):
    conn = duckdb.connect(db_name)
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} AS SELECT * FROM df")
    conn.close()
    print(f"Data loaded into DuckDB database: {db_name}")


# Generate the data and save it to DuckDB
df = generate_sample_data()
db_name = "large_dataset.db"
create_duckdb_database(df, db_name)

# Preview the table
conn = duckdb.connect(db_name)
conn.execute("SELECT * FROM data LIMIT 5").fetchdf()

# Compress the data into the sufficient statistics of the first stage
q = """
SELECT z, z2, fx1, fx2,
       COUNT(*) AS count,
       SUM(d) AS sum_d,
       SUM(POW(d, 2)) AS sum_d_sq
FROM data
GROUP BY z, z2, fx1, fx2
"""
compressed_df = conn.execute(q).fetchdf()
conn.close()
compressed_df.eval("mean_d = sum_d / count", inplace=True)
```
```python
# Step 1: first-stage regression of d on z, z2, fx1, fx2
m1 = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="d ~ z + z2 + fx1 + fx2",
    cluster_col="",
    n_bootstraps=0,
    seed=42,
)
m1.fit()
m1.fit_vcov()
results = m1.summary()

restab = pd.DataFrame(
    np.c_[results["point_estimate"], results["standard_error"]],
    columns=["point_estimate", "standard_error"],
)
```
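To avoid the hard-coded positional indexing in the next step, the point estimates could also be zipped with their names. This assumes, as the snippet below does, that `summary()` returns them in the order of the formula terms (Intercept, z, z2, fx1, fx2):

```python
# Hypothetical convenience mapping; relies on the coefficient order above
coef = dict(zip(["Intercept", "z", "z2", "fx1", "fx2"], results["point_estimate"]))
```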
```python
# Step 2: predict the endogenous variable from the first stage
# Point estimates, in the order of the formula terms
constant = results["point_estimate"][0]
coef_z = results["point_estimate"][1]
coef_z2 = results["point_estimate"][2]
coef_fx1 = results["point_estimate"][3]
coef_fx2 = results["point_estimate"][4]

# Compute the predicted values
df["d_hat"] = (constant + coef_z * df["z"] + coef_z2 * df["z2"]
               + coef_fx1 * df["fx1"] + coef_fx2 * df["fx2"])

# Keep the columns needed for the second stage
result_df = df[["z", "z2", "fx1", "fx2", "d_hat", "y"]]

# Overwrite the DuckDB table with the predicted values included
db_name = "large_dataset.db"
create_duckdb_database(result_df, db_name)

# Preview the new table
conn = duckdb.connect(db_name)
conn.execute("SELECT * FROM data LIMIT 5").fetchdf()

# Compress the data into the sufficient statistics of the second stage
q = """
SELECT z, z2, fx1, fx2, d_hat,
       COUNT(*) AS count,
       SUM(y) AS sum_y,
       SUM(POW(y, 2)) AS sum_y_sq
FROM data
GROUP BY z, z2, fx1, fx2, d_hat
"""
compressed_df = conn.execute(q).fetchdf()
conn.close()
compressed_df.eval("mean_y = sum_y / count", inplace=True)

print(compressed_df.shape)
compressed_df.head()
```
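As a possible refinement (a sketch only, not part of the code above): the prediction and the second compression could be fused into a single DuckDB query against the original `data` table, so that `result_df` never has to be materialized in pandas and reloaded into the database. Something along these lines, reusing the first-stage coefficients:

```python
# Sketch only: run this against the original `data` table, i.e. before it is
# overwritten with result_df above. d_hat is a function of the grouping
# columns, so it can be computed directly inside the compression query.
conn = duckdb.connect("large_dataset.db")
q_fused = f"""
SELECT z, z2, fx1, fx2,
       {constant} + {coef_z} * z + {coef_z2} * z2
                  + {coef_fx1} * fx1 + {coef_fx2} * fx2 AS d_hat,
       COUNT(*) AS count,
       SUM(y) AS sum_y,
       SUM(POW(y, 2)) AS sum_y_sq
FROM data
GROUP BY z, z2, fx1, fx2
"""
compressed_df = conn.execute(q_fused).fetchdf()
conn.close()
```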
```python
# Step 3: second-stage regression using the predicted covariate
m2 = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="y ~ d_hat + fx1 + fx2",
    cluster_col="",
    n_bootstraps=0,
    seed=42,
)
m2.fit()
m2.fit_vcov()
results = m2.summary()

restab = pd.DataFrame(
    np.c_[results["point_estimate"], results["standard_error"]],
    columns=["point_estimate", "standard_error"],
)
restab
```
The point estimates match those from pyfixest:
```python
m_pf = pf.feols("y ~ 1 + fx1 + fx2 | d ~ z + z2", df, vcov="hetero")
m_pf.tidy()
```
The above code produces the following result:

| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
| Intercept | 1.000021 | 0.000903 | 1107.615507 | 0.0 | 0.998251 | 1.001790 |
| d | 2.299646 | 0.000487 | 4725.378807 | 0.0 | 2.298692 | 2.300599 |
| fx1 | 1.000025 | 0.000055 | 18231.844575 | 0.0 | 0.999918 | 1.000133 |
| fx2 | 1.999973 | 0.000055 | 36461.661487 | 0.0 | 1.999865 | 2.000080 |
This process requires two extra tasks: the data compression runs twice, and the predicted values have to be generated. I'm not familiar with duckdb, but setting aside the time taken to create the DuckDB database, this certainly beats a naive implementation of IV regression.
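I have not benchmarked this carefully, but a rough harness along these lines could be used to compare the two routes (it simply re-runs the two compressed fits and excludes the prediction step and the database re-creation, so it understates the total cost of the two-step approach):

```python
import time

# Compressed two-step: re-run the two fits (prediction / reload time excluded)
t0 = time.perf_counter()
m1.fit()
m2.fit()
t_compressed = time.perf_counter() - t0

# Naive IV regression on the full data with pyfixest
t0 = time.perf_counter()
pf.feols("y ~ 1 + fx1 + fx2 | d ~ z + z2", df, vcov="hetero")
t_naive = time.perf_counter() - t0

print(f"compressed two-step: {t_compressed:.2f}s, full-data IV: {t_naive:.2f}s")
```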
For @apoorvalal and @s3alfisc: do you think this is worth adding to the codebase? If yes, I would like to work out how to compute the different types of vcov for the 2SLS estimator based on Wong et al. (2021), and how to optimize the data compression parts. Also, I would open the PR after #574 is merged.