1 Introduction
Unlike standard supervised learning, which models correlations, causal inference seeks to predict the effect of counterfactual interventions not seen in the data. For example, when estimating the effect of adherence to a prescription of β-blockers on the prevention of heart disease, supervised learning may overestimate the true effect because good adherence is also strongly correlated with health consciousness and therefore with good heart health [9]. Figure LABEL:fig:toy shows a simple example of this type and demonstrates how a standard neural network (in blue) fails to correctly estimate the true treatment response curve (in orange) in a toy example. The issue is that standard supervised learning assumes that the residual of the response from the prediction of interest is independent of the features.
One approach to account for this is to adjust for all confounding factors that cause the dependence, such as via matching [27, 26] or regression, potentially using neural networks [19, 28]. However, this requires that we actually observe all confounders, so that treatment is as-if random after conditioning on observables. In the β-blocker example, this would mean we need to perfectly measure all latent factors that determine both an individual’s adherence decision and their general healthfulness, which is often not possible in practice.
Instrumental variables (IVs) provide an alternative approach to causal-effect identification. If we can find a latent experiment in another variable (the instrument) that influences the treatment (i.e., is relevant) and does not directly affect the outcome (i.e., satisfies exclusion), then we can use this to infer causal effects [2]. In the β-blocker example [9], the authors used copay cost as an IV. Because they enable analyzing natural experiments under mild assumptions, IVs have been one of the most widely used tools for empirical research in a variety of fields [1]. An important direction of research for IV analysis is to develop methods that can effectively handle complex causal relationships and complex variables like images, which necessitate more flexible models like neural networks [17].
In this paper, we tackle this through a new method called DeepGMM that builds upon the optimally-weighted Generalized Method of Moments (GMM) [13], a widely popular method in econometrics that uses the moment conditions implied by the IV model to efficiently estimate causal parameters. Leveraging a new variational reformulation of the efficient GMM with optimal weights, we develop a flexible framework, DeepGMM, for doing IV estimation with neural networks. In contrast to existing approaches, DeepGMM is suited for high-dimensional treatments and instruments, as well as for complex causal and interaction effects. DeepGMM is given by the solution to a smooth game between a prediction function and a critic function. We prove that approximate equilibria provide consistent estimates of the true causal parameters. We find these equilibria using optimistic gradient descent algorithms for smooth game play [11], and give practical guidance on how to choose the parameters of our algorithm and do model validation. In our empirical evaluation, we demonstrate that DeepGMM’s performance is on par with or superior to a large number of existing approaches in standard benchmarks and continues to work in high-dimensional settings where other methods fail.
2 Setup and Notation
We assume that our data is generated by

(1) $Y = g_0(X) + \epsilon,$

where the residual $\epsilon$ has zero mean and finite variance, $\mathbb{E}[\epsilon] = 0$, $\mathrm{Var}(\epsilon) < \infty$. However, differently from standard supervised learning, we allow the residual $\epsilon$ and $X$ to be correlated, i.e., $\mathbb{E}[\epsilon \mid X] \neq 0$: $X$ can be endogenous, and therefore $g_0(X) \neq \mathbb{E}[Y \mid X]$. We also assume that we have access to an instrument $Z$ satisfying

(2) $\mathbb{E}[\epsilon \mid Z] = 0.$
Moreover, $Z$ should be relevant, i.e., $\mathbb{P}(X \mid Z)$ should not be constant in $Z$. Our goal is to identify the causal response function $g_0$ from a parametrized family of functions $\mathcal{G} = \{g(\cdot; \theta) : \theta \in \Theta\}$. Examples are linear functions $g(x; \theta) = \theta^\top x$, neural networks $g(x; \theta)$ where $\theta$ represents the weights, and nonparametric classes with infinite-dimensional $\Theta$. For convenience, let $\theta_0 \in \Theta$ be such that $g_0 = g(\cdot; \theta_0)$. Throughout, we measure the performance of an estimated response function $\hat{g}$ by its MSE against the true $g_0$.
Note that if we additionally have some exogenous context variables $W$, the standard way to model this using Eq. 1 is to include them both in $X$ and in $Z$ as $X = (T, W)$ and $Z = (Z', W)$, where $T$ is the endogenous variable and $Z'$ is an IV for it. In the β-blocker example, if we were interested in the heterogeneity of the effect of adherence over demographics, $X$ would include both adherence and demographics, whereas $Z$ would include both copayment and demographics.
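To make the endogeneity problem concrete, the following small simulation (the data-generating process and all constants are illustrative, not taken from the paper) shows how least squares is biased while the simple IV ratio estimator is not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical DGP: the structural effect of X on Y is 1.0, but a hidden
# confounder e raises both X and Y (all constants here are illustrative).
e = rng.normal(size=n)                      # unobserved confounder
Z = rng.normal(size=n)                      # instrument: moves X, never Y directly
X = Z + e + 0.1 * rng.normal(size=n)
Y = 1.0 * X + e + 0.1 * rng.normal(size=n)

# Least squares of Y on X is biased upward because Cov(X, residual) > 0 ...
theta_ols = (X @ Y) / (X @ X)
# ... while the IV ratio Cov(Z, Y)/Cov(Z, X) is consistent since E[residual | Z] = 0.
theta_iv = (Z @ Y) / (Z @ X)
print(theta_ols, theta_iv)  # roughly 1.5 vs. 1.0
```

Here the least-squares slope absorbs the confounder's contribution, whereas the instrument, being independent of the residual, isolates the structural coefficient.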
2.1 Existing methods for IV estimation
Two-stage methods.
One strategy for identifying $g_0$ is based on noting that Eq. 2 implies
(3) $\mathbb{E}[Y \mid Z] = \mathbb{E}[g_0(X) \mid Z].$
If we let $g_0(x) = \theta_0^\top x$, this becomes $\mathbb{E}[Y \mid Z] = \theta_0^\top \mathbb{E}[X \mid Z]$. The two-stage least squares (2SLS) method [2, §4.1.1] first fits $\widehat{\mathbb{E}}[X \mid Z]$ by least-squares regression of $X$ on $Z$ (with $Z$ possibly transformed) and then estimates $\theta_0$ as the coefficient in the regression of $Y$ on $\widehat{\mathbb{E}}[X \mid Z]$. This, however, fails when one does not know a sufficient basis for $g_0$. [23, 10] propose nonparametric methods for expanding such a basis, but such approaches are limited to low-dimensional settings. [17] instead propose DeepIV, which estimates the conditional density $\mathbb{P}(X \mid Z)$ by flexible neural-network-parametrized Gaussian mixtures. This may be limited in settings with high-dimensional $X$ and can suffer from the non-orthogonality of MLE under any misspecification, known as the “forbidden regression” issue [2, §4.6.1] (see Section 5 for discussion).
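A minimal numpy sketch of the 2SLS recipe just described, on a toy linear DGP (the instruments, coefficients, and noise levels are assumptions for illustration, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
e = rng.normal(size=n)                            # unobserved confounder
Z = rng.normal(size=(n, 2))                       # two instruments (toy choice)
X = Z[:, 0] + 0.5 * Z[:, 1] + e + 0.1 * rng.normal(size=n)
Y = 2.0 * X + e + 0.1 * rng.normal(size=n)        # assumed true coefficient: 2.0

# Stage 1: least-squares regression of X on Z gives fitted values ~ E[X | Z].
beta1, *_ = np.linalg.lstsq(Z, X, rcond=None)
X_hat = Z @ beta1

# Stage 2: regress Y on the fitted values to recover the causal coefficient.
theta_2sls = (X_hat @ Y) / (X_hat @ X_hat)
print(theta_2sls)  # ≈ 2.0
```

Because the first-stage fitted values depend only on $Z$, they are uncorrelated with the residual, so the second-stage slope is consistent for the structural coefficient.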
Moment methods.
The generalized method of moments (GMM) instead leverages the moment conditions satisfied by $g_0$. Given functions $f_1, \dots, f_m$ of $Z$, Eq. 2 implies $\mathbb{E}[f_j(Z)\,\epsilon] = 0$, giving us
(4) $\mathbb{E}\left[f_j(Z)\,(Y - g(X; \theta_0))\right] = 0, \qquad j = 1, \dots, m.$
A usual assumption when using GMM is that the moment conditions in Eq. 4 are sufficient to uniquely pin down (identify) $\theta_0$. To estimate $\theta_0$, GMM considers these moments’ empirical counterparts, $\psi_n(\theta) = \big(\tfrac{1}{n}\sum_{i=1}^n f_j(Z_i)(Y_i - g(X_i; \theta))\big)_{j=1}^m$, and seeks to make all of them small simultaneously, as measured by a vector norm such as the Euclidean norm $\|\cdot\|_2$:
(5) $\hat{\theta} \in \operatorname{argmin}_{\theta \in \Theta} \|\psi_n(\theta)\|.$
Other vector norms are possible.
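As a concrete illustration, the following sketch forms the empirical moment vector for a linear response model and minimizes its Euclidean norm over a parameter grid (the basis of moment functions and the DGP are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
e = rng.normal(size=n)
Z = rng.normal(size=n)
X = Z + e + 0.1 * rng.normal(size=n)
Y = 1.5 * X + e                                   # toy linear model, theta0 = 1.5

# Hand-picked moment functions f_j(Z) (an illustrative choice of basis).
F = np.stack([np.ones(n), Z, Z**2], axis=1)       # f_1 = 1, f_2 = z, f_3 = z^2

def psi_n(theta):
    """Empirical moments: psi_j = (1/n) sum_i f_j(Z_i) (Y_i - theta * X_i)."""
    return F.T @ (Y - theta * X) / n

# One-step GMM: minimize the Euclidean norm of the empirical moment vector.
grid = np.linspace(0.5, 2.5, 2001)
theta_gmm = grid[np.argmin([np.linalg.norm(psi_n(t)) for t in grid])]
print(theta_gmm)  # ≈ 1.5
```

A grid search suffices here only because $\theta$ is one-dimensional; with neural networks, this minimization is what the gradient-based game of Section 3 replaces.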
[22] propose using the $\ell_\infty$ norm and solving the optimization with no-regret learning, along with an intermittent jitter to the moment conditions, in a framework they call AGMM (see Section 5 for discussion). However, when there are many moments (many $f_j$), using any unweighted vector norm can lead to significant inefficiencies, as we may be wasting modeling resources on making less relevant or duplicate moment conditions small. The optimal combination of moment conditions, yielding minimal-variance estimates, is in fact given by weighting them by their inverse covariance, and it suffices to consistently estimate this covariance. In particular, a celebrated result [13] shows that (with finitely many moments), using the following norm in Eq. 5 will yield minimal asymptotic variance (efficiency) for any consistent estimate $\tilde{\theta}$ of $\theta_0$:
(6) $\|v\|^2 = v^\top \tilde{C}^{-1} v, \qquad \tilde{C}_{jk} = \frac{1}{n} \sum_{i=1}^n f_j(Z_i)\, f_k(Z_i)\, (Y_i - g(X_i; \tilde{\theta}))^2.$
Examples of this are the two-step, iterative, and continuously updating GMM estimators [16]. We generically refer to the GMM estimator given in Eq. 5 using the norm given in Eq. 6 as optimally-weighted GMM (OWGMM), $\hat{\theta}_{\mathrm{OWGMM}}$.
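The two-step variant of OWGMM can be sketched as follows: a pilot one-step GMM fit supplies $\tilde{\theta}$, whose residuals are then used to build the inverse-covariance weighting (toy linear model; the moment basis is an assumed choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
e = rng.normal(size=n)
Z = rng.normal(size=n)
X = Z + e
Y = 1.5 * X + e                                    # toy model, theta0 = 1.5

F = np.stack([np.ones(n), Z, Z**2, Z**3], axis=1)  # assumed moment basis

def psi_n(theta):
    return F.T @ (Y - theta * X) / n

def gmm(weight_inv):
    """Minimize psi_n(theta)^T W psi_n(theta) over a parameter grid."""
    grid = np.linspace(0.5, 2.5, 2001)
    return grid[np.argmin([psi_n(t) @ weight_inv @ psi_n(t) for t in grid])]

# Step 1: identity weighting gives a consistent pilot estimate theta~.
theta_pilot = gmm(np.eye(F.shape[1]))

# Step 2: re-weight by the inverse of C~_jk = (1/n) sum_i f_j f_k (Y_i - theta~ X_i)^2.
resid2 = (Y - theta_pilot * X) ** 2
C = (F * resid2[:, None]).T @ F / n
theta_ow = gmm(np.linalg.inv(C))
print(theta_pilot, theta_ow)  # both ≈ 1.5; the re-weighted estimate has lower asymptotic variance
```

Both steps are consistent here; the point of the second step is the efficiency gain, which matters most when some moments are noisy or redundant.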
Failure of (OW)GMM with Many Moment Conditions.
When $\mathcal{G}$ is a flexible model such as a high-capacity neural network, many (possibly infinitely many) moment conditions may be needed to identify $g_0$. However, GMM and OWGMM fail when we use too many moment conditions. On the one hand, one-step GMM (i.e., Eq. 5 with an unweighted norm) is saddled with the inefficiency of trying to impossibly control many equally-weighted moments: at the extreme, if we let $f_1, f_2, \dots$ be all functions of $Z$ with unit square integral, one-step GMM is simply equivalent to the non-causal least-squares regression of $Y$ on $X$. On the other hand, we also cannot hope to learn the optimal weighting: the matrix $\tilde{C}$ in Eq. 6 will necessarily be singular, and using its pseudoinverse would mean deleting all but $n$ moment conditions. Therefore, we cannot simply use infinitely many or even too many moment conditions in GMM or OWGMM.
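The singularity problem is easy to see numerically: with more moment functions than data points, the weighting matrix is a sum of $n$ rank-one terms and cannot be inverted (the random features below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 50, 200                          # more moment functions than data points
Z = rng.normal(size=n)
resid = rng.normal(size=n)              # stand-in for the residuals Y_i - g(X_i; theta~)

# Random-feature moment functions f_j evaluated at the Z_i (purely illustrative).
F = np.cos(np.outer(Z, rng.normal(size=m)))   # shape (n, m)

# C~ is a sum of n rank-one matrices, so its rank is at most n < m.
C = (F * (resid**2)[:, None]).T @ F / n
rank = np.linalg.matrix_rank(C)
print(rank, m)  # the rank is capped at 50, far below 200: C~ is singular
```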
3 Methodology
We next present our approach. We start by motivating it using a new reformulation of OWGMM.
3.1 Reformulating OWGMM
Let us start by reinterpreting OWGMM. Consider the vector space of real-valued functions of $Z$ under the usual operations. Note that, for any $\theta$, $f \mapsto \psi_n(\theta; f) = \frac{1}{n}\sum_{i=1}^n f(Z_i)(Y_i - g(X_i; \theta))$ is a linear operator on this space, and $\langle f, f' \rangle_{\tilde{\theta}} = \frac{1}{n}\sum_{i=1}^n f(Z_i)\, f'(Z_i)\, (Y_i - g(X_i; \tilde{\theta}))^2$ is a bilinear form on it. Now, given any subset $\mathcal{F}$ of this space, consider the following objective function:
(7) $\Psi_n(\theta; \mathcal{F}, \tilde{\theta}) = \sup_{f \in \mathcal{F}} \left( \psi_n(\theta; f) - \tfrac{1}{4} \langle f, f \rangle_{\tilde{\theta}} \right).$
Lemma 1.
Let $\|\cdot\|$ be the optimally-weighted norm as in Eq. 6 and let $\mathcal{F} = \operatorname{span}(f_1, \dots, f_m)$. Then $\Psi_n(\theta; \mathcal{F}, \tilde{\theta}) = \|\psi_n(\theta)\|^2$.
Corollary 1.
An equivalent formulation of OWGMM is
(8) $\hat{\theta}_{\mathrm{OWGMM}} \in \operatorname{argmin}_{\theta \in \Theta}\, \Psi_n(\theta; \operatorname{span}(f_1, \dots, f_m), \tilde{\theta}).$
In other words, Lemma 1 provides a variational formulation of the objective function of OWGMM, and Corollary 1 provides a saddle-point formulation of the OWGMM estimate.
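The identity in Lemma 1 can be checked numerically under the reconstruction above (the quarter-scaled penalty in Eq. 7 is an assumption of this reconstruction): gradient ascent over span coefficients recovers the optimally-weighted norm of a stand-in moment vector:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 4
psi = rng.normal(size=m)                  # stand-in empirical moment vector psi_n(theta)
A = rng.normal(size=(m, m))
C = A @ A.T + 0.1 * np.eye(m)             # stand-in positive-definite weighting matrix C~

def objective(beta):
    # Psi_n for f = sum_j beta_j f_j: linear moment term minus the quadratic penalty.
    return beta @ psi - 0.25 * beta @ C @ beta

# Maximize the concave quadratic over span coefficients by gradient ascent.
beta = np.zeros(m)
for _ in range(20_000):
    beta += 0.01 * (psi - 0.5 * C @ beta)

sup_value = objective(beta)
owgmm_value = psi @ np.linalg.inv(C) @ psi   # optimally-weighted norm ||psi||^2
print(sup_value, owgmm_value)                # the two values agree
```

Because the inner problem is a concave quadratic in the span coefficients, the supremum has the closed form $\psi^\top \tilde{C}^{-1} \psi$, which is exactly the OWGMM objective.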
3.2 DeepGMM
In this section, we outline the details of our DeepGMM framework. Given our reformulation above in Eq. 8, our approach is to simply replace the set $\operatorname{span}(f_1, \dots, f_m)$ with a more flexible set of functions. Namely, we let $\mathcal{F}$ be the class of all neural networks $f(\cdot; \tau)$ of a given architecture with varying weights $\tau$ (but not their span). Using a rich class of moment conditions allows us to learn a correspondingly rich $g$. We therefore similarly let $\mathcal{G}$ be the class of all neural networks $g(\cdot; \theta)$ of a given architecture with varying weights $\theta$.
Given these choices, we let $\hat{\theta}_{\mathrm{DeepGMM}}$ be the minimizer in $\theta$ of $\Psi_n(\theta; \mathcal{F}, \tilde{\theta})$ for any, potentially data-driven, choice of $\tilde{\theta}$. We discuss choosing $\tilde{\theta}$ in Section 4. Since this is no longer closed-form, we formulate our algorithm in terms of solving a smooth zero-sum game. Formally, our estimator is defined as:
(9) $\hat{\theta}_{\mathrm{DeepGMM}} \in \operatorname{argmin}_{\theta \in \Theta} \sup_{\tau} U_{\tilde{\theta}}(\theta, \tau), \quad U_{\tilde{\theta}}(\theta, \tau) = \frac{1}{n}\sum_{i=1}^n f(Z_i; \tau)\,(Y_i - g(X_i; \theta)) - \frac{1}{4n}\sum_{i=1}^n f(Z_i; \tau)^2\,(Y_i - g(X_i; \tilde{\theta}))^2.$
Since evaluation is linear, the game’s payoff function is convex-concave in the functions $g$ and $f$, although it may not be convex-concave in $\theta$ and $\tau$, as is usually the case when we parametrize functions using neural networks. Solving Eq. 9 can be done with any of a variety of smooth-game-playing algorithms; we discuss our choice in Section 4.
Notably, our approach has very few tuning parameters: only the models $\mathcal{F}$ and $\mathcal{G}$ (i.e., the neural network architectures) and whatever parameters the optimization method uses. In Section 4 we discuss how to select these.
Finally, we highlight that, unlike the case for OWGMM as in Lemma 1, our choice of $\mathcal{F}$ is not a linear subspace. Indeed, per Lemma 1, replacing $\mathcal{F}$ with a high- or infinite-dimensional linear subspace simply corresponds to GMM with many or infinitely many moments, which fails as discussed in Section 2.1 (in particular, the objective would generically be unbounded, unhelpfully). Similarly, enumerating many moment conditions as generated by, say, many neural networks and plugging these into GMM, whether one-step or optimally weighted, will fail for the same reasons. Instead, our approach is to leverage our variational reformulation in Lemma 1 and replace the class of functions with a rich (non-subspace) set in this new formulation, which is distinct from GMM and avoids these issues. In particular, as long as $\mathcal{F}$ has bounded complexity, even if its ambient dimension may be infinite, we can guarantee the consistency of our approach. Since the last layer of a network is a linear combination of the penultimate one, our choice of $\mathcal{F}$ can in some sense be thought of as a union, over neural network weights, of the subspaces spanned by the penultimate layer of nodes.
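To convey the flavor of the resulting game, here is a deliberately stripped-down sketch with one-parameter linear classes for both players (actual DeepGMM uses neural networks and gradient steps for both players; here the critic's best response happens to be closed-form, which is an illustrative simplification, and the DGP is a toy assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
e = rng.normal(size=n)                    # unobserved confounder (toy DGP, assumed)
Z = rng.normal(size=n)
X = Z + e
Y = 1.0 * X + e                           # true structural coefficient: 1.0

theta = 0.0                               # prediction player: g(x; theta) = theta * x
theta_tilde = 0.0                         # the "constant" theta~ used in the penalty

lr = 0.5
for step in range(500):
    resid2 = (Y - theta_tilde * X) ** 2
    c = np.mean(Z**2 * resid2)            # penalty coefficient for critic f(z) = beta * z
    psi = np.mean(Z * (Y - theta * X))    # empirical moment under the current theta
    beta = 2.0 * psi / c                  # critic's exact best response (closed form here)
    theta -= lr * (-beta * np.mean(Z * X))  # gradient step for the prediction player
    if step % 25 == 0:
        theta_tilde = theta               # periodically refresh theta~ (kept out of gradients)
print(theta)  # ≈ 1.0, the IV solution mean(Z*Y)/mean(Z*X)
```

Even in this toy form, the structure mirrors Eq. 9: the critic rewards moment violations, the quadratic penalty with $\tilde{\theta}$ supplies the optimal weighting, and $\tilde{\theta}$ is treated as a constant during differentiation.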
3.3 Consistency
Before discussing practical considerations in implementing DeepGMM, we first turn to the theoretical question of what consistency guarantees we can provide about our method if we were to approximately solve Eq. 9. We phrase our results for generic bounded-complexity function classes $\mathcal{F}$ and $\mathcal{G}$, not necessarily neural networks.
Our main result depends on the following assumptions, which we discuss after stating the result.
Assumption 1 (Identification).
$\theta_0$ is the unique $\theta \in \Theta$ satisfying $\mathbb{E}[f(Z)(Y - g(X; \theta))] = 0$ for all $f \in \mathcal{F}$.
Assumption 2 (Bounded complexity).
$\mathcal{F}$ and $\mathcal{G}$ have vanishing Rademacher complexities: $\mathfrak{R}_n(\mathcal{F}) \to 0$ and $\mathfrak{R}_n(\mathcal{G}) \to 0$.
Assumption 3 (Absolutely star shaped).
For every $f \in \mathcal{F}$ and $c \in [-1, 1]$, we have $cf \in \mathcal{F}$.
Assumption 4 (Continuity).
For any $x$ and $z$, $g(x; \theta)$ and $f(z; \tau)$ are continuous in $\theta$ and $\tau$, respectively.
Assumption 5 (Boundedness).
$Y$, $\sup_{f \in \mathcal{F}} |f(Z)|$, and $\sup_{g \in \mathcal{G}} |g(X)|$ are all bounded random variables.
Theorem 2.
Suppose Assumptions 1–5 hold. Let $\tilde{\theta}_n$ be any data-dependent sequence with a limit in probability. Let $\hat{\theta}_n$ be any approximate equilibrium in the game Eq. 9, i.e., $\sup_{\tau} U_{\tilde{\theta}_n}(\hat{\theta}_n, \tau) \leq \inf_{\theta \in \Theta} \sup_{\tau} U_{\tilde{\theta}_n}(\theta, \tau) + o_p(1)$. Then $\hat{\theta}_n \to \theta_0$ in probability.
Theorem 2 proves that approximately solving Eq. 9 (with eventually vanishing approximation error) guarantees the consistency of our method. We next discuss the assumptions we made.
Assumption 1 stipulates that the moment conditions given by $\mathcal{F}$ are sufficient to identify $\theta_0$. Note that, by linearity, the moment conditions given by $\mathcal{F}$ are the same as those given by $\operatorname{span}(\mathcal{F})$, so we are actually successfully controlling many or infinitely many moment conditions, perhaps making the assumption defensible. If we do not impose Assumption 1, the arguments in Theorem 2 easily extend to showing instead that, in the limit, we approach some identified parameter that satisfies all the moment conditions.
Assumption 2 provides that $\mathcal{F}$ and $\mathcal{G}$, although potentially infinite and even of infinite ambient dimension, have limited complexity. Rademacher complexity is one way to measure function-class complexity [4]. Given a bound (envelope) as in Assumption 5, this complexity can also be reduced to other combinatorial complexity measures such as VC or pseudo-dimension via chaining [25]. [5] studied such combinatorial complexity measures of neural networks.
Assumption 3 is needed to ensure that, for any $\theta$ with $\mathbb{E}[f(Z)(Y - g(X; \theta))] \neq 0$ for some $f \in \mathcal{F}$, there also exists an $f' \in \mathcal{F}$ that makes the game objective strictly positive. It trivially holds for neural networks by considering their last layer. Assumption 4 similarly holds trivially and helps ensure that the moment conditions cannot simultaneously and arbitrarily approach zero far from their true zero at $\theta_0$. Assumption 5 is a purely technical assumption that can likely be relaxed to requiring only nice (sub-Gaussian) tail behavior. Its latter two requirements can nonetheless be guaranteed by either bounding the weights (equivalently, using weight decay) or applying a bounded activation at the output. We do not find doing this necessary in practice.
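To illustrate Assumption 2, the following Monte Carlo estimates the empirical Rademacher complexity of a norm-bounded linear class and shows it vanishing as $n$ grows (the class is an illustrative stand-in, far simpler than a neural network):

```python
import numpy as np

rng = np.random.default_rng(7)

def empirical_rademacher(n, reps=400):
    """Monte Carlo estimate of the Rademacher complexity of {z -> w*z : |w| <= 1}."""
    z = rng.normal(size=n)
    sigma = rng.choice([-1.0, 1.0], size=(reps, n))     # Rademacher sign vectors
    # For this class, sup_{|w|<=1} (1/n) sum_i sigma_i w z_i = |(1/n) sum_i sigma_i z_i|.
    return np.abs(sigma @ z / n).mean()

r_small, r_large = empirical_rademacher(50), empirical_rademacher(2000)
print(r_small, r_large)  # the complexity shrinks roughly like 1/sqrt(n)
```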
4 Practical Considerations in Implementing DeepGMM
Solving the Smooth Zero-Sum Game.
In order to solve Eq. 9, we turn to the literature on solving smooth games, which has grown significantly with the recent surge of interest in generative adversarial networks (GANs). In our experiments we use the OAdam algorithm of [11]. For our game objective, we found this algorithm to be more stable than standard alternating descent steps using SGD or Adam.
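The benefit of optimism is visible even on the classic bilinear game $\min_x \max_y xy$: plain simultaneous gradient descent-ascent spirals away from the equilibrium, while an optimistic variant in the spirit of [11] converges (a toy illustration, not our actual training loop):

```python
import numpy as np

def play(optimistic, steps=2000, lr=0.1):
    """Minimize over x, maximize over y the bilinear payoff f(x, y) = x * y."""
    x, y = 1.0, 1.0
    gx_prev, gy_prev = 0.0, 0.0
    for _ in range(steps):
        gx, gy = y, x                      # partial gradients of x * y
        if optimistic:
            # Optimistic step: extrapolate using the previous gradient.
            x, y = x - lr * (2 * gx - gx_prev), y + lr * (2 * gy - gy_prev)
        else:
            # Plain simultaneous gradient descent-ascent.
            x, y = x - lr * gx, y + lr * gy
        gx_prev, gy_prev = gx, gy
    return np.hypot(x, y)                  # distance from the equilibrium (0, 0)

print(play(False), play(True))  # plain GDA spirals outward; the optimistic variant converges
```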
Using first-order iterative algorithms to solve Eq. 9 enables us to effectively handle very large datasets. In particular, we implement DeepGMM using PyTorch, which efficiently provides gradients for use in our descent algorithms [24]. As we see in Section 5, this allows us to handle very large datasets with high-dimensional features and instruments where other methods fail.

Choosing $\tilde{\theta}$.

In Eq. 9, we let $\tilde{\theta}$ be any potentially data-driven choice. Since the hope is that $\tilde{\theta} \approx \theta_0$, one possible choice is simply the solution of Eq. 9 for another choice of $\tilde{\theta}$; we can recurse this many times over. In practice, to simulate many such iterations, we continually update $\tilde{\theta}$ to be the previous $\theta$ iterate over the steps of our game-playing algorithm. Note that $\tilde{\theta}$ is nonetheless treated as “constant” and does not enter into the gradient with respect to $\theta$. Given this approach, we can interpret $\tilde{\theta}$ in the premise of Theorem 2 as the final $\tilde{\theta}$ at convergence, since Theorem 2 allows $\tilde{\theta}$ to be fully data-driven.
Hyperparameter Optimization.
The only parameters of our algorithm are the neural network architectures for $\mathcal{F}$ and $\mathcal{G}$ and the optimization algorithm’s parameters (e.g., learning rate). To tune these parameters, we suggest the following general approach. We form a validation surrogate for our variational objective in Eq. 7 by instead taking averages over a validation data set and by replacing $\mathcal{F}$ with the pool of all $f$ iterates encountered in the learning algorithm over all hyperparameter choices. We then choose the parameters that maximize this validation surrogate. We discuss this process in more detail in Section B.1.

Early Stopping.

We further suggest using the validation surrogate to facilitate early stopping of the learning algorithm. Specifically, we periodically evaluate the current iterate using the surrogate and return the best evaluated iterate.
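A minimal sketch of this early-stopping rule (all names and the toy surrogate are illustrative; whether lower or higher surrogate values count as "best" depends on the surrogate's sign convention, and lower-is-better is assumed here):

```python
def best_iterate_by_surrogate(iterates, surrogate, eval_every=2):
    """Periodically score iterates with a validation surrogate; keep the best one."""
    best_iterate, best_score = None, float("inf")
    for step, theta in enumerate(iterates):
        if step % eval_every == 0:
            score = surrogate(theta)
            if score < best_score:
                best_iterate, best_score = theta, score
    return best_iterate

# Toy usage: the iterates drift past the optimum at 3.0; the rule recovers the best one.
thetas = [0.5 * t for t in range(20)]                 # 0.0, 0.5, ..., 9.5
best = best_iterate_by_surrogate(thetas, lambda th: (th - 3.0) ** 2)
print(best)  # 3.0
```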
5 Experiments
In this section, we compare DeepGMM against a wide set of baselines for IV estimation. Our implementation of DeepGMM is publicly available at https://github.com/CausalML/DeepGMM.
We evaluate the various methods in two groups of scenarios: one where $X$ and $Z$ are both low-dimensional, and one where $X$, $Z$, or both are high-dimensional images. In the high-dimensional scenarios, we use a convolutional architecture in all methods that employ a neural network, to accommodate the images. We evaluate the performance of an estimated $\hat{g}$ by MSE against the true $g_0$.
More specifically, we use the following baselines:

DirectNN: Predicts $Y$ from $X$ using a neural network with a standard least-squares loss.

Vanilla2SLS: Standard two-stage least squares on the raw $X$ and $Z$.

Poly2SLS: Both $X$ and $Z$ are expanded via polynomial features, and then 2SLS is performed via ridge regressions at each stage. The regularization parameters as well as the polynomial degrees are picked via cross-validation at each stage.

AGMM [22]: Uses the publicly available implementation (https://github.com/vsyrgkanis/adversarial_gmm) of the Adversarial Generalized Method of Moments, which performs no-regret learning on the one-step GMM objective (Eq. 5) with the $\ell_\infty$ norm and an additional jitter step on the moment conditions after each epoch.

DeepIV [17]: We use the latest implementation that was released as part of the econML package (https://github.com/microsoft/EconML).
GMM+NN: OWGMM (Eq. 5 with the norm of Eq. 6) with a fixed, pre-specified set of moment conditions, used to train a neural network $g$. Note that GMM+NN relies on being provided the moment conditions. When $Z$ is low-dimensional, we follow AGMM [22] and expand $Z$ via RBF kernels around 10 centroids returned from a Gaussian mixture model applied to the $Z$ data. When $Z$ is high-dimensional, we use the moment conditions given by each of its components.

5.1 Low-dimensional scenarios
In this first group of scenarios, we study the case where both the instrument and the treatment are low-dimensional. Similar to [22], we generated data via the following process:

$Y = g_0(X) + e + \delta, \quad X = Z_1 + e + \gamma, \quad Z \sim \mathrm{Uniform}([-3, 3]^2), \quad e \sim \mathcal{N}(0, 1), \quad \gamma, \delta \sim \mathcal{N}(0, 0.1).$
In other words, only the first instrument $Z_1$ has an effect on $X$, and $e$ is the confounder breaking the independence of $X$ and the residual. We keep this data-generating process fixed but vary the true response function $g_0$ among four cases: abs, linear, sin, and step.
We sample separate train, validation, and test sets. To avoid numerical issues, we standardize the observed $Y$ values by removing the mean and scaling to unit variance. We plot the results in Fig. 1. The left column shows the sampled points plotted against $X$, with the true $g_0$ in orange. The other columns show in blue the $\hat{g}$ estimated using the various methods. Table 1 shows the corresponding MSE over the test set.
First, we note that in each case there is sufficient confounding that the DirectNN regression fails badly, so a method that can use the IV information to remove confounding is necessary.
Our next substantive observation is that our method performs competitively across scenarios, attaining the lowest MSE in each (except linear, where we are beaten just slightly, and only by methods that use a linear model). At the same time, other methods employing neural networks perform well in some scenarios and less well in others. We therefore conclude that, in the low-dimensional setting, our method is able to adapt to the scenario and compete with the best-tuned methods for each scenario.
Overall, we also found that GMM+NN performed well (but not as well as our method). In some sense, GMM+NN is a novel method; we are not aware of previous work using (OW)GMM to train a neural network. Whereas GMM+NN needs to be provided moment conditions, our method can be understood as improving further on this by learning the best moment condition over a large class using optimal weighting. Moreover, we found that GMM+NN outperformed AGMM, which uses the same moment conditions. Aside from the jitter step implemented in the AGMM code, AGMM is equivalent to one-step GMM (Eq. 5) with the $\ell_\infty$ vector norm in place of the standard $\ell_2$ norm. Its worse performance can perhaps be explained by this change and by its lack of optimal weighting.
In the experiments, the other NN-based method, DeepIV, was consistently outperformed by Poly2SLS across scenarios. This can potentially be attributed to the infamous “forbidden regression” issue. DeepIV relies on learning $\mathbb{P}(X \mid Z)$ using a neural-network-parametrized Gaussian mixture model and plugging this into Eq. 3. However, without exactly correct specification, only least squares is guaranteed to produce orthogonal estimates (with uncorrelated residuals) and to ensure that Eq. 3 can be used downstream to remove confounding in fitting $\hat{g}$; see [2, §4.6.1]. Other MLE estimates (such as Gaussian mixtures), unless perfectly specified, may introduce spurious biases. In addition, the second-stage regression in DeepIV is computationally heavy. In the next section, we also discuss its limitations with high-dimensional $X$.
abs  linear  sin  step  

DirectNN  
Vanilla2SLS  
Poly2SLS  
GMM+NN  
AGMM  
DeepIV  
Our Method 
Table 1: Low-dimensional scenarios: test MSE averaged across ten runs, with standard errors.
MNIST_{z}  MNIST_{x}  MNIST_{x,z}  
DirectNN  
Vanilla2SLS  
Ridge2SLS  
GMM+NN  
AGMM  –  –  – 
DeepIV  –  –  
Our Method

Table 2: High-dimensional (MNIST) scenarios: test MSE averaged across ten runs; “–” marks methods that failed to run.
5.2 Highdimensional scenarios
We now move on to scenarios based on the MNIST dataset [20] in order to test our method’s ability to deal with structured, high-dimensional $X$ and $Z$ variables. For this group of scenarios, we use the same data-generating process as in Section 5.1 and fix the response function to be abs, but map $Z$, $X$, or both to MNIST images. Let $\pi$ be a transformation function that maps its input to an integer between 0 and 9, and let $\phi(d)$ be a function that selects a random MNIST image from the digit class $d$. The images are $28 \times 28 = 784$-dimensional. The scenarios are then given as:

MNIST$_Z$: $X$ as before, $Z \leftarrow \phi(\pi(Z_1))$.

MNIST$_X$: $X \leftarrow \phi(\pi(X))$, $Z$ as before.

MNIST$_{X,Z}$: $X \leftarrow \phi(\pi(X))$, $Z \leftarrow \phi(\pi(Z_1))$.
We sampled 20000 points for the training, validation, and test sets and ran each method 10 times with different random seeds. We report the averaged MSEs in Table 2. We failed to run the AGMM code on any of these scenarios, as it crashed with overflow errors. Similarly, the DeepIV code produced NaN outcomes on any scenario with a high-dimensional $X$. Furthermore, because of the size of the examples, we were similarly unable to run Poly2SLS. Instead, we present Vanilla2SLS and Ridge2SLS, where the latter is Poly2SLS with the degree fixed to linear. Vanilla2SLS failed to produce reasonable numbers for high-dimensional $Z$ because the first-stage regression is ill-posed.
Again, we found that our method performed competitively across scenarios, achieving the lowest MSE in each. In the MNIST$_Z$ setting, our method had better MSE than DeepIV. In the MNIST$_X$ and MNIST$_{X,Z}$ scenarios, it handily outperformed all other methods. Even if DeepIV had run on these scenarios, it would be at a great disadvantage, since it models the conditional distribution over images using a Gaussian mixture. This could perhaps be improved using richer conditional density models like [8, 18], but the forbidden-regression issue remains nonetheless. Overall, these results highlight our method’s ability to adapt not only to each low-dimensional scenario but also to high-dimensional scenarios (whether the features, the instrument, or both are high-dimensional) where other methods break. Aside from its competitive performance, our algorithm was tractable and able to run on these large-scale examples where other algorithms broke computationally.
6 Conclusions
Other related literature and future work.
We believe that our approach can also benefit other applications where moment-based models and GMM are used [14, 15, 6]. Moreover, notice that while DeepGMM is related to GANs [12], the adversarial game that we play is structurally quite different. In some sense, the linear part of our payoff function is similar to that of the Wasserstein GAN [3]; therefore our optimization problem might benefit from approaches to approximating the sup player similar to those employed by WGANs. Another important direction is further investigation into the possible efficiency of DeepGMM or of efficient modifications thereof [7].
Conclusions.
We presented DeepGMM as a way to perform IV analysis with high-dimensional variables and complex relationships. The method is based on a new variational reformulation of GMM with optimal weights, with the aim of handling many moments, and is formulated as the solution to a smooth zero-sum game. Our empirical experiments showed that the method is able to adapt to a variety of scenarios, competing with the best-tuned method in low-dimensional settings and performing well in high-dimensional settings where even recent methods break.
References
 Angrist and Krueger [2001] J. D. Angrist and A. B. Krueger. Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4):69–85, 2001.
 Angrist and Pischke [2008] J. D. Angrist and J.-S. Pischke. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press, 2008.
 Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

 Bartlett and Mendelson [2002] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
 Bartlett et al. [2019] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.
 Berry et al. [1995] S. Berry, J. Levinsohn, and A. Pakes. Automobile prices in market equilibrium. Econometrica, pages 841–890, 1995.
 Chamberlain [1987] G. Chamberlain. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334, 1987.
 Chen et al. [2016] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
 Cole et al. [2006] J. A. Cole, H. Norman, L. B. Weatherby, and A. M. Walker. Drug copayment and adherence in chronic heart failure: effect on cost and outcomes. Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy, 26(8):1157–1164, 2006.
 Darolles et al. [2011] S. Darolles, Y. Fan, J.-P. Florens, and E. Renault. Nonparametric instrumental regression. Econometrica, 79(5):1541–1565, 2011.
 Daskalakis et al. [2017] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training gans with optimism. arXiv preprint arXiv:1711.00141, 2017.
 Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
 Hansen [1982] L. P. Hansen. Large sample properties of generalized method of moments estimators. Econometrica, pages 1029–1054, 1982.
 Hansen and Sargent [1980] L. P. Hansen and T. J. Sargent. Formulating and estimating dynamic linear rational expectations models. Journal of Economic Dynamics and control, 2:7–46, 1980.
 Hansen and Singleton [1982] L. P. Hansen and K. J. Singleton. Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica, pages 1269–1286, 1982.
 Hansen et al. [1996] L. P. Hansen, J. Heaton, and A. Yaron. Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics, 14(3):262–280, 1996.
 Hartford et al. [2017] J. Hartford, G. Lewis, K. Leyton-Brown, and M. Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1414–1423. JMLR.org, 2017.

 Isola et al. [2017] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
 Johansson et al. [2016] F. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020–3029, 2016.
 LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Ledoux and Talagrand [2013] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
 Lewis and Syrgkanis [2018] G. Lewis and V. Syrgkanis. Adversarial generalized method of moments. arXiv preprint arXiv:1803.07164, 2018.
 Newey and Powell [2003] W. K. Newey and J. L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.
 Paszke et al. [2017] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
 Pollard [1990] D. Pollard. Empirical processes: theory and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–86. JSTOR, 1990.
 Rosenbaum and Rubin [1983] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
 Rubin [1973] D. B. Rubin. Matching to remove bias in observational studies. Biometrics, pages 159–183, 1973.
 Shalit et al. [2017] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3076–3085, 2017.
Appendix A Omitted Proofs
Proof of Lemma 1.
First note that since the norm is $\|v\|^2 = v^\top \tilde{C}^{-1} v$, the associated dual norm is $\|u\|_*^2 = u^\top \tilde{C} u$. Next, define $\psi$ as shorthand for $\psi_n(\theta)$. It follows from the definition of the dual norm that $\|\psi\| = \sup_{\|u\|_* \leq 1} u^\top \psi$. Therefore we have:
The Lagrangian of this optimization problem is given by:
Taking the derivative of this with respect to shows us that when , this quantity is maximized by . In addition we clearly have strong duality for this problem by Slater’s condition whenever (since in this case is a feasible interior point). This therefore gives us the following dual formulation for :
Taking derivative with respect to we can see that this is minimized by setting . Given this and strong duality, we know it must be the case that . Rearranging terms and doing a change of variables gives us the identity:
Finally, we can note that any vector corresponds to some , such that , and according to this notation we have and . Therefore our required result follows directly from the previous identity. ∎
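Under the reconstruction of Eqs. 6 and 7 used above (the exact constants are assumptions where the extraction lost them), the core computation behind Lemma 1 is the maximization of a concave quadratic over the span coefficients $\beta$, with $f = \sum_j \beta_j f_j$:

```latex
\sup_{\beta \in \mathbb{R}^m} \left( \beta^\top \psi_n(\theta) - \tfrac{1}{4}\,\beta^\top \tilde{C} \beta \right)
  = \psi_n(\theta)^\top \tilde{C}^{-1} \psi_n(\theta)
  = \big\| \psi_n(\theta) \big\|^2 ,
\qquad \text{attained at } \beta^\star = 2\,\tilde{C}^{-1} \psi_n(\theta),
```

which agrees with the dual-norm and Lagrangian route taken in the proof whenever $\tilde{C}$ is invertible.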
Proof of Theorem 2.
Define , , and , where refers to the empirical measure (average over the data points) and . We will proceed by proving the following three conditions, and then proving our results in terms of these conditions:


for every we have

We will proceed by proving these conditions one by one. For the first, we can derive the inequality:
Next we will bound these two terms separately, which we will term and . For the first term, we can derive the following bound, where are iid Rademacher random variables, , and are shadow variables:
Note that in the final inequality we apply the inequality $\sup_a (A + B) \leq \sup_a A + \sup_a B$. Now, given Assumption 5, the functions that map $g$ and $f$ to the summands in each term are Lipschitz. Furthermore, for any function class $\mathcal{H}$ and Lipschitz function $\phi$, the Rademacher complexity of $\phi \circ \mathcal{H}$ is bounded by that of $\mathcal{H}$ up to a constant factor depending on the Lipschitz constant [21, Thm. 4.12]. Therefore we have
for some constant. Thus, given Assumption 2, it must be the case that this term vanishes in probability. Now consider recalculating the sum where we are allowed to edit the $i$-th data point $(Z_i, X_i, Y_i)$. Then, given Assumption 5, we can derive the following bounded-differences inequality:
for some constant . Therefore from McDiarmid’s Inequality we have . Putting this and the previous result for together we get .
Next, define . Recall that from the premise of the theorem we have . Then by Slutsky’s Theorem, the Continuous Mapping Theorem, and Assumption 4 we have . Given this we can bound as follows:
Now we know from Assumption 5 that is uniformly bounded, so it follows that for some constant . Next we can note, again based on our boundedness assumption, that is uniformly bounded. Therefore it follows from the Lebesgue Dominated Convergence Theorem that . Thus we know that both and converge, so we have proven the first of the three conditions, that converges in probability to zero.
For the second condition we will first prove that is the unique minimizer of . Clearly by Assumptions 3 and 1 we have that is the unique minimizer of , since it sets this quantity to zero, and by these assumptions any other value of must have at least one that can be played in response that makes this expectation strictly positive. Now we can see that also, since , and the inside of the supremum is clearly nonpositive but can be set to zero using the zero function for , which is allowed given Assumption 3. Furthermore, for any other , let be some function in such that . If we have then it follows immediately that . Otherwise, consider the function for arbitrary . Since by Assumption 3 this function is also contained in , it follows that:
This expression is a quadratic in that is clearly positive when is sufficiently small, so therefore it still follows that .
Given this, we will prove the second condition by contradiction. If this were false, then for some we would have that , where . This is because from Assumption 1 we know is the unique minimizer of . Given this there must exist some sequence in satisfying . Now by construction is closed, and the corresponding limit parameters must satisfy , since given Assumption 4 is clearly a continuous function of so we can swap function application and limit. However , so . This contradicts the fact that is the unique minimizer of , so we have proven the second condition.
Finally, for the third condition we will use the fact that by assumption satisfies the approximate equilibrium condition:
Now by definition . Therefore,
Thus we have
At this point we have proven all three conditions stated at the start of the proof. For the final part we can first note that from the first and third conditions it easily follows that , since