Skip to content

rizbicki/FlexCoDE

Repository files navigation

FlexCoDE

Implements FlexCoDE, Flexible Conditional Density Estimator, in R.

FlexCode is a general-purpose methodology for converting any conditional mean point estimator of $z$ to a conditional {\em density} estimator $f(z \vert x)$, where $x$ represents the covariates. The key idea is to expand the unknown function $f(z \vert x)$ in an orthonormal basis ${\phi_i(z)}_{i}$:

$$f(z|x)=\sum_{i}\beta_{i }(x)\phi_i(z). $$

By the orthogonality property, the expansion coefficients are just conditional means

$$\beta_{i }(x) = \mathbb{E}\left[\phi_i(z)|x\right] \equiv \int f(z|x) \phi_i(z) dz.$$

These coefficients can easily be estimated from data by regression.

More on FlexCoDE: Izbicki, R.; Lee, A.B. Converting High-Dimensional Regression to High-Dimensional Conditional Density Estimation. Electronic Journal of Statistics, 2017

To install the package, run

# install.packages("devtools")
devtools::install_github("rizbicki/FlexCoDE")

A simple example:

# generate data
n=1000
d=10
data=matrix(NA,n,d+1)
data[,1:d]=matrix(rnorm(n*d),n,d)
data[,d+1]=data[,1]+rnorm(n,0,0.1)

# determine sample sizes
nTrain=round(0.7*n)
nValidation=round(0.25*n)
nTest=n-nTrain-nValidation

# split data
randomIndex=sample(1:n)
xTrain=data[randomIndex[1:nTrain],1:d]
xValidation=data[randomIndex[(nTrain+1):(nTrain+nValidation)],1:d]
xTest=data[randomIndex[(nTrain+nValidation+1):n],1:d]
zTrain=data[randomIndex[1:nTrain],d+1]
zValidation=data[randomIndex[(nTrain+1):(nTrain+nValidation)],d+1]
zTest=data[randomIndex[(nTrain+nValidation+1):n],d+1]

# Fit nearest neighbors FlexCoDE
fit=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
            nIMax = 20,regressionFunction = regressionFunction.NN)
fit$estimatedRisk
print(fit)
plot(fit,xTest,zTest)

# Plot estimated curves evaluates on new test points, and
# compare with true conditional density
predictedValues=predict(fit,xTest,B=500)
plot(predictedValues$z,predictedValues$CDE[1,])
lines(predictedValues$z,dnorm(predictedValues$z,xTest[1,1],0.1),col=2)

# Fit sparse additive FlexCoDE
fit=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
            nIMax = 30,regressionFunction = regressionFunction.SpAM)
fit$estimatedRisk
print(fit)
plot(fit,xTest,zTest)

# Plot estimated curves evaluates on new test points, and
# compare with true conditional density
predictedValues=predict(fit,xTest,B=500)
plot(predictedValues$z,predictedValues$CDE[1,])
lines(predictedValues$z,dnorm(predictedValues$z,xTest[1,1],0.1),col=2)


# Fit sparse additive FlexCoDE using 4 cores (i.e., using parallel computing)
fit=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
            nIMax = 30,regressionFunction = regressionFunction.SpAM,
            regressionFunction.extra=list(nCores=4))
fit$estimatedRisk
print(fit)
plot(fit,xTest,zTest)

FlexZBoost

FlexZBoost is a particular realization of FlexCode, where we use XGBoost for the regression part, as these techniques scale well for massive data. There is an additional tuning parameter in FlexZBoost: an exponent $\alpha$ that we use to sharpen the computed density estimates $\widehat{f}(z|x)$, according to $\widetilde{f}(z|x) \propto \widehat{f}(z|x)^\alpha$.

A simple example of fitting flexZBoost using the data generated above:

fit=FlexZBoost(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
            nIMax = 30)
fit$bestAlpha
fit$estimatedRisk
plot(fit,xTest,zTest)

An example to Buzzard data

If you are a member of the LSST DESC, you will find information about and pointers to Data Challenge 1 (DC1) data at https://confluence.slac.stanford.edu/pages/viewpage.action?spaceKey=LSSTDESC&title=DC1+resources

library(dplyr)
data.spec=read.table("buzzard_spec_witherrors_mass.txt",header=T)

data.spec.redshift=data.spec$redshift
# computes additional covariates:
data.spec.cov=data.spec %>% mutate(ug = u-g, gr = g-r, ri = r-i, iz = i-z, zy = z-y,
                                   ug.err = sqrt(u.err^2+g.err^2),
                                   gr.err = sqrt(g.err^2+r.err^2),
                                   ri.err = sqrt(r.err^2+i.err^2),
                                   iz.err = sqrt(i.err^2+z.err^2),
                                   zy.err = sqrt(z.err^2+y.err^2))
data.spec.cov=data.spec.cov %>% select(-redshift)

data.photo=read.table("buzzard_phot_witherrors_mass.txt",header=T)
# computes additional covariates:
data.photo.cov=data.photo %>% mutate(ug = u-g, gr = g-r, ri = r-i, iz = i-z, zy = z-y,
                                     ug.err = sqrt(u.err^2+g.err^2),
                                     gr.err = sqrt(g.err^2+r.err^2),
                                     ri.err = sqrt(r.err^2+i.err^2),
                                     iz.err = sqrt(i.err^2+z.err^2),
                                     zy.err = sqrt(z.err^2+y.err^2))
data.photo.redshift=read.table("data/buzzard_truth.txt",header=T)[,1]

# determine sample size for training and for validation:
n=nrow(data.spec.cov)
nTrain=round(0.8*n)
# split data
randomIndex=sample(1:n)
xTrain=data.spec.cov[randomIndex[1:nTrain],]
xValidation=data.spec.cov[-randomIndex[1:nTrain],]
zTrain=data.spec.redshift[randomIndex[1:nTrain]]
zValidation=data.spec.redshift[-randomIndex[1:nTrain]]

fit=FlexZBoost(xTrain=xTrain,zTrain=zTrain,
               xValidation=xValidation,zValidation=zValidation,
               xTest = data.photo.cov,
               zTest= data.photo.redshift,
               nIMax = 40,regressionFunction.extra = list(nCores=5,
                                                                 ninter=2000))
fit$estimatedRisk
plot(fit,data.photo.cov,
     data.photo.redshift)