Estimating Individual Advertising Effect in E-Commerce

Abstract

Online advertising has been the major monetization approach for Internet companies. Advertisers invest budgets to bid for real-time impressions to gain direct and indirect returns. Existing work has concentrated on optimizing the direct returns brought by advertising traffic. However, indirect returns induced by advertising traffic, such as its influence on online organic traffic and offline word-of-mouth marketing, provide significant extra motivation to advertisers. Modeling and quantifying the causal effect of budget on the overall advertising return enables advertisers to spend their money more judiciously. In this paper, we model the overall return as the individual advertising effect in causal inference with multiple treatments, and we bound the expected estimation error by a learnable factual loss plus the distance between treatment-specific context distributions. Accordingly, a representation and hypothesis network is trained to minimize the loss bound. We apply the learned causal effect in the online bidding engine of an industry-level sponsored search system. Online experiments show that the causal-inference-based bidding outperforms the existing online bidding algorithm.

1 Introduction

The last two decades have seen the prosperity of e-commerce. Take Taobao, the biggest e-commerce marketplace in China [Edquid, 2016], as an example: its search service covers over 300 million consumers each day, generating 10 billion search queries and subsequent page views (PVs) daily and giving advertisers ample opportunities to promote their commodities online ¹.

In sponsored search advertising, advertisers bid for keywords associated with their commodities (ADs ²) and pay the platform when consumers land on their commodity or store homepage by clicking the advertisement (Pay-Per-Click, PPC). The payment equals the minimum bid price required to keep the advertising slot in the real-time competition [Wilkens et al., 2017]. The returns of advertising fall into two categories. The direct returns are the impressions, clicks and conversions that occur on the advertising PVs. Meanwhile, advertising also yields even more valuable indirect returns by reaching a wider online audience, which in turn influences further consumers through unobserved social interactions. Additionally, on e-commerce platforms like Taobao, there is a ranking index called “sales volume” for organic search traffic, which reflects the purchasing popularity of a commodity among its peers and has been an important shopping guideline for consumers. In this way advertisers can accumulate sales volume via advertising PVs to gain more exposure in the gigantic volume of organic search PVs.

The direct and indirect returns motivate advertisers to invest advertising budget to grow their online business. Despite the significance of indirect returns, to the best of our knowledge, existing work mostly focuses on optimizing direct returns [Zhang, 2016; Zhu et al., 2017]. One likely reason is that direct returns are readily observable in a closed-loop e-commerce platform like Taobao, whereas so many factors contribute to the overall returns that quantifying the indirect returns attributable to advertising appears intractable. Nonetheless, the ability to infer the overall advertising effect, including both direct and indirect returns, gives advertisers the opportunity to allocate their advertising budget more wisely over the product life cycle.

E-commerce advertisers are eager to know how much their advertising returns grow if they invest more budget via a specified advertising channel. Specifically, in PPC advertising, the cost is equivalent to the number of clicks that occur on advertising PVs, given a relatively stable payment per click. The overall advertising returns can be observed as the total number of clicks on the AD accumulated over all online channels. Therefore, we infer the individual advertising effect (IAE) by predicting the incremental number of all-channel clicks in a period under the intervention of advertising clicks ³.

² A commodity is also called an AD in Taobao, which represents both the commodity and the associated advertisement.

³ Numerical analysis also shows that the Pearson’s coefficient between the number of advertising clicks and all-channel clicks is approximately 25% larger than that between advertising impressions and all-channel clicks. We hide the detailed coefficient value due to commercial secrets.

The problem of inferring IAE resembles the estimation of individual treatment effect (ITE) in causal inference, or learning from observational data [Rubin, 2005]. In causal inference, we only have observational data containing past actions, their outcomes and possibly more context, but we do not know the mechanism that gave rise to each action. In the advertising scenario, the context corresponds to features representing the current status of the AD, while the action and outcome are the number of advertising clicks and the number of all-channel clicks (containing all direct and indirect returns), respectively. The key difference between IAE and ITE is that the actions of the latter are binary or categorical, whereas those of the former may be continuous and transitive. Furthermore, for any specified context, the data contains exactly one action (acquiring a specific number of advertising clicks). We can never know exactly what the advertising outcome would have been had a different action been applied in exactly the same context. Besides, the observed advertising outcomes can be influenced by many factors, including online sources such as recommendation and organic search inside or outside Taobao, and offline word-of-mouth marketing by the audience, which play a role similar to the confounding factors in classic causal inference. Since the effect of advertising accumulates over the whole time horizon, we assume that the context together with the action contains all the information necessary to determine the outcome, i.e., the “no hidden confounding” assumption holds in the analysis.

In this paper we model the causal effect between advertising cost and returns via a formal definition of IAE. Following [Shalit et al., 2017], we combine a representation network and a hypothesis network to predict the individual advertising effect. Unlike binary or categorical treatments, advertising treatments are continuous and transitive. Relying on this property, we derive a rigorous theoretical upper bound on the expected IAE estimation error in terms of a learnable factual regression loss and the distance between the context distributions under different treatments. The network is then trained to minimize the derived upper bound. Furthermore, we derive a time-varying factor called leverage rate (lvr) from IAE to reflect the AD-level potential to leverage the overall advertising returns. The learned lvr is used in the online bidding engine to achieve better overall advertising performance in Taobao sponsored search. The contributions of this paper can be summarized as follows:

1. We model the problem of predicting the overall advertising return in the framework of causal inference. In this framework, the formal definition of individual advertising effect is given.
2. We derive a general theoretical upper bound on the expected IAE estimation error in advertising scenarios with multiple continuous and transitive treatments. Subsequently, a representation and hypothesis network is learned to predict IAE.
3. IAE-induced lvr is integrated in the online bidding engine, which yields better overall advertising returns compared with the existing bidding engine in Taobao sponsored search.

2 Related Work

When only direct advertising returns are considered, estimating the individual advertising effect has been investigated in both ex ante and ex post ways. The key to ex ante estimation lies in three separate models that predict the winning rate of specific bids, the click-through rate and the conversion rate of the advertising PV [Zhang, 2016; Zhu et al., 2017]. Meanwhile, attribution modeling [Dalessandro et al., 2012; Diemert et al., 2017] corresponds to ex post estimation of the advertising effect, i.e., attributing a later conversion to a previous contact between the customer and the commodity or store. All these works ignore the indirect returns brought by the advertising PVs.

Causal inference has already been used in complex real-world ad-placement systems [Bottou et al., 2013]. For estimating the individual advertising effect of a given context (AD), a naive approach is direct least-squares regression of the advertising effect, either taking the number of advertising clicks as a feature or fitting each action separately. However, such an estimate may be biased because different contexts tend to favor specific actions in the dataset. To alleviate the bias, the propensity score [Austin, 2011], which characterizes the probability vector of choosing specific actions, is used to evaluate the similarity of two contexts. Counterfactual samples can then be constructed by comparing propensity scores via approaches such as nearest-neighbor matching [Lopez et al., 2017]. Besides the propensity score, methods such as random forests [Wager and Athey, 2017; Athey and Imbens, 2016] and expensive randomized controlled trials [Taddy et al., 2016; Peysakhovich and Lada, 2016] are also used to tackle causal inference with binary treatments.

Recently, deep representations have also been used to encode contexts. Atan et al. proposed an auto-encoder (encoder-decoder) network to represent the raw context so that the propensity score vectors of the mapped contexts are similar, thereby removing the selection bias [Atan et al., 2018]. Johansson et al. also designed a deep representation network to embed the original contexts, so that the distributions of the represented contexts under two different treatments are similar while the regression loss stays small [Johansson et al., 2016]. In a later version, a theoretical error bound on the expected ITE is given to yield a more rigorous estimation algorithm [Shalit et al., 2017]. Deep models have proved effective, but deriving a theoretical error bound for multiple treatments is non-trivial. In this paper we design a network structure similar to that of [Shalit et al., 2017] but derive our theoretical upper bound for the continuous action space of the advertising scenario.

Back to the bidding application, perhaps the most relevant work is lift-based bidding proposed by Xu et al. By predicting the ex ante and ex post click-through-rate of an advertising impression, the bid price is adjusted to be proportional to the lift [Xu et al., 2016]. However, we point out that the observed outcome might also ignore the abundant indirect returns.

3 Individual Advertising Effect Formalization

We adopt the Rubin-Neyman potential outcomes modeling framework [Rubin, 2005] in causal inference but tailor it for the e-commerce advertising scenario. Let $\mathcal{X} \subset \mathbb{R}^d$ be the set of contexts, $\mathcal{T} = \{T_1,...,T_n\}$ the n-action (also known as treatment or intervention) set, and $\mathcal{Y} \subset \mathbb{R}$ the set of values of the observed overall performance index. For each context $x \in \mathcal{X}$ , there is a treatment assignment $T \in \mathcal{T}$ and there are n potential outcomes, $Y_{T_1}, Y_{T_2}, ..., Y_{T_n} \in \mathcal{Y}$ . The samples we have can be denoted as $\{x_i, t_i, y_i\}_{i=1}^N$ , where $y_i = Y_{t_i}$ . We do not observe any of the other potential outcomes (i.e., $Y_T$ for $T \neq t_i$ ).

In e-commerce advertising, the context x can be features representing the status of an AD at the beginning of the day. Treatment T refers to the number of clicks acquired from the advertising PVs during the day, while the potential outcome y is observed as the overall whole-site clicks obtained by the same AD by the end of the day. The potential outcome can be influenced by many factors, including online channels such as recommendation and organic search inside or outside Taobao, and offline word-of-mouth marketing by extroverted audiences. Specifically, let $T_i = i - 1, i = 1, ..., n$ , where n-1 can be interpreted as the context-specific largest possible number of advertising clicks in a day. Note that we restrict the advertising effects y to those occurring within the same day and ignore persisting effects in the far future. This coincides with the advertising practice that advertisers are accustomed to adjusting the budget of an AD day by day. We can also alleviate the influence of persisting dependency by following the “strong ignorability” assumption in causal inference.
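To make the notation concrete, the following minimal sketch builds a toy dataset in which every AD has n potential outcomes but only the factual one is recorded. The outcome model (a baseline plus a per-click gain), the uniform random treatment assignment, and all names such as `simulate_observations` are illustrative assumptions, not the paper's data or method.

```python
import numpy as np

def simulate_observations(num_ads=5, n_treatments=4, dim=3, seed=0):
    """Toy data: each AD has n potential outcomes, but only one is observed."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(num_ads, dim))          # contexts x_i
    treatments = np.arange(n_treatments)         # T_i = i - 1, i.e. 0, 1, ..., n-1 clicks
    # Hypothetical outcome model: baseline clicks plus a per-AD gain per advertising click.
    baseline = 10.0 + np.abs(x[:, 0])
    gain = 1.0 + np.abs(x[:, 1])
    potential = baseline[:, None] + gain[:, None] * treatments[None, :]
    # Assumption: treatments are assigned uniformly at random (no selection bias).
    t_idx = rng.integers(0, n_treatments, size=num_ads)
    y = potential[np.arange(num_ads), t_idx]     # factual outcome y_i = Y_{t_i}
    return x, treatments[t_idx], y, potential

x, t, y, potential = simulate_observations()
```

Here `potential` holds all n outcomes per AD, which is only possible because the data is simulated; in logged advertising data only the triples `(x, t, y)` would be available.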

Assumption 1. (Strong Ignorability) $Y_{T_1}, Y_{T_2}, ..., Y_{T_n} \perp T | x$ , which means that, given a context x, the potential outcome is independent of the treatment assignment.

Strong ignorability assumption also ensures that there is a positive probability of choosing any action in each context x.

Definition 1. (Individual Advertising Effect, IAE) The IAE of context x from treatment $T_i$ to $T_j$ can be defined as:

$\alpha_{i,j}(x) = m_j(x) - m_i(x), \quad \forall i, j = 1, ..., n, \tag{1}$

where $m_i(x) = \mathbb{E}[Y_{T_i}|x], \forall i = 1, ..., n$ .

For n treatments, we obtain an antisymmetric matrix $\alpha(x) = [\alpha_{i,j}] \in \mathbb{R}^{n \times n}$ which collects the IAE in context x. Matrix $\alpha(x)$ has the following properties:

Antisymmetry: $\alpha_{i,j} = -\alpha_{j,i}$ ;
Monotonicity: $\alpha_{i,j} \leq \alpha_{i,k}, \forall j \leq k, i = 1, ..., n; \alpha_{i,j} \geq \alpha_{k,j}, \forall i \leq k, j = 1, ..., n;$
Zero-Diagonal: $\alpha_{i,i} = 0, \forall i = 1, ..., n;$
Transitivity: $\alpha_{i,j} + \alpha_{j,k} = \alpha_{i,k}, \forall i, j, k = 1, ..., n$ .
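The four properties can be checked numerically on a small example. The sketch below builds $\alpha$ from a hypothetical vector of expected outcomes $m_i(x)$; the values in `m` are made up. Antisymmetry, the zero diagonal and transitivity follow directly from Eqn. (1), while monotonicity additionally needs $m_i(x)$ to be nondecreasing in i, which the example values satisfy.

```python
import numpy as np

def iae_matrix(m):
    """alpha[i, j] = m[j] - m[i] for expected outcomes m[i] = E[Y_{T_i} | x]."""
    m = np.asarray(m, dtype=float)
    return m[None, :] - m[:, None]

# Hypothetical expected all-channel clicks for T = 0, 1, 2, 3 advertising clicks.
m = np.array([10.0, 12.5, 14.0, 15.0])
alpha = iae_matrix(m)

antisymmetric = np.allclose(alpha, -alpha.T)
zero_diagonal = np.allclose(np.diag(alpha), 0.0)
# alpha[i, j] + alpha[j, k] == alpha[i, k] for all i, j, k, checked via broadcasting.
transitive = np.allclose(alpha[:, :, None] + alpha[None, :, :], alpha[:, None, :])
# Rows are nondecreasing in j and columns are nonincreasing in i.
monotone = bool(np.all(np.diff(alpha, axis=1) >= 0) and np.all(np.diff(alpha, axis=0) <= 0))
```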

In causal inference, IAE is analogous to the individual treatment effect. However, unlike classical causal inference with only two interventions (treatment or non-treatment), multiple actions are available in IAE. Since the action and outcome are accumulated over a day, the advertising effect is ambiguous with respect to how clicks are assigned to different time slots. Therefore, we take the expectation on the right-hand side of Eqn. (1) to eliminate the ambiguity. In this sense, IAE is the average advertising effect and should be useful across different ADs.

To learn IAE, we further define a representation function $\Phi: \mathcal{X} \to \mathcal{R}$ where $\mathcal{R}$ is the representation space. Let $h: \mathcal{R} \times \mathcal{T} \to \mathcal{Y}$ be a hypothesis function which yields the outcome. Putting it together, we denote $f(x,T) = h(\Phi(x),T)$ .

Definition 2. Given a hypothesis f, the IAE estimation for context x is:

$\hat{\alpha}_{i,j}^{f}(x) = f(x, T_j) - f(x, T_i). \tag{2}$

With a slight abuse of notation, we omit the superscript f and write $\hat{\alpha}_{i,j}(x)$ when no confusion arises.
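As an illustration of Definition 2, the sketch below composes a representation function with a hypothesis function and reads the estimated IAE matrix off the predictions. The single tanh layer, the random untrained weights, the layer sizes, and feeding the treatment to h as one extra input are all assumptions made for the example; the paper's actual network is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 8                        # context and representation sizes (hypothetical)
W_phi = rng.normal(size=(d, r))    # untrained weights of Phi
w_h = rng.normal(size=r + 1)       # untrained weights of h; the last entry multiplies T

def phi(x):
    """Representation Phi: X -> R, here a single tanh layer."""
    return np.tanh(x @ W_phi)

def h(rep, t):
    """Hypothesis h: R x T -> Y; the treatment enters as an extra input (an assumption)."""
    return np.append(rep, t) @ w_h

def f(x, t):
    """Complete function f(x, T) = h(Phi(x), T)."""
    return h(phi(x), t)

def iae_estimate(x, treatments):
    """alpha_hat[i, j] = f(x, T_j) - f(x, T_i), as in Eqn. (2)."""
    preds = np.array([f(x, t) for t in treatments])
    return preds[None, :] - preds[:, None]

x = rng.normal(size=d)
alpha_hat = iae_estimate(x, treatments=np.arange(4))
```

Because every entry is a difference of predictions from the same f, the estimated matrix is antisymmetric and transitive by construction, whatever the weights are.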

Definition 3. The IAE estimation error of the treatment pair $(T_i, T_j)$ is defined as:

$\tau_{i,j}(x) = \hat{\alpha}_{i,j}(x) - \alpha_{i,j}(x). \tag{3}$

Definition 4. The expected Precision in Estimation of Heterogeneous Effect (PEHE) [Shalit et al., 2017] loss of f is:

$\epsilon_{PEHE}(f) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \int_{\mathcal{X}} \tau_{i,j}^{2}(x) p(x) dx. \tag{4}$
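On synthetic data, where the true $m_i(x)$ is known, the integral in Eqn. (4) can be replaced by a sample average. The sketch below does this; the function name `empirical_pehe` and the array layout are assumptions for illustration. On real logs the true $m_i(x)$ is unobserved, so this quantity cannot be computed directly.

```python
import numpy as np

def empirical_pehe(pred, true_mean):
    """Sample-average version of Eqn. (4).

    pred[s, i]      = f(x_s, T_i)   (model prediction)
    true_mean[s, i] = m_i(x_s)      (known only for synthetic data)
    """
    pred = np.asarray(pred, dtype=float)
    true_mean = np.asarray(true_mean, dtype=float)
    n = pred.shape[1]
    # [s, i, j] = value at T_j minus value at T_i
    alpha_hat = pred[:, None, :] - pred[:, :, None]
    alpha = true_mean[:, None, :] - true_mean[:, :, None]
    tau_sq = (alpha_hat - alpha) ** 2          # diagonal terms (i == j) are zero
    return tau_sq.sum(axis=(1, 2)).mean() / (n * (n - 1))

true_mean = np.array([[0.0, 1.0, 2.0]])
pehe_off = empirical_pehe([[0.0, 1.0, 3.0]], true_mean)   # one treatment mispredicted
pehe_shift = empirical_pehe(true_mean + 5.0, true_mean)   # constant offset on all treatments
```

The second call returns zero: a prediction error that is constant across treatments cancels in every difference, so PEHE only penalizes errors in the estimated effects, not in the outcome level.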

For completeness, we also give the notion of the Integral Probability Metric (IPM), which is a class of distance metrics between probability distributions [Shalit et al., 2017]. For two probability density functions p, q defined over $S \subset \mathbb{R}^d$ , and for a family G of functions $g: S \to \mathbb{R}$ , the IPM is defined as

$IPM_G(p,q) := \sup_{g \in G} |\int_{S} g(s)(p(s) - q(s))ds|.$
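One well-known instance of an IPM is the maximum mean discrepancy (MMD), obtained when G is the unit ball of a reproducing kernel Hilbert space. The sketch below estimates the squared MMD from two sample sets with an RBF kernel. The choice of MMD, the kernel and the bandwidth `sigma` are assumptions for illustration; this section does not state which function family G the paper uses.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd_squared(p_samples, q_samples, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two sample sets."""
    kpp = rbf_kernel(p_samples, p_samples, sigma).mean()
    kqq = rbf_kernel(q_samples, q_samples, sigma).mean()
    kpq = rbf_kernel(p_samples, q_samples, sigma).mean()
    return kpp + kqq - 2.0 * kpq

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 2))
same = mmd_squared(a, a)            # identical samples
shifted = mmd_squared(a, a + 3.0)   # second sample shifted by 3 in each coordinate
```

The estimate vanishes for identical samples and grows as the two distributions move apart, which is the behavior needed from a distance between treatment-specific distributions.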

4 Learning to Infer Individual Advertising Effect

The key to learning IAE lies in minimizing the PEHE loss in Eqn. (4). The idea is similar to that of [Shalit et al., 2017]. Firstly, we map the original context into the representation space $\mathcal R$ via $\Phi$ . In the new space, we denote the probability density function over $\mathcal R$ given treatment T as $p_{\Phi}^T$ . To remove the selection bias present in the original space, the distance between the densities $\{p_{\Phi}^{T_i}\}_{i=1}^n$ should be as small as possible, which is the role of the representation network. Given similar context distributions in the representation space, the hypothesis network then minimizes the regression loss of fitting the advertising return. The neural network architecture is displayed in Fig. 1.

The architecture resembles that in [Shalit et al., 2017]. However, there are two key differences. Firstly, the hypothesis networks in [Shalit et al., 2017] are separate for treatment and non-treatment. In the advertising scenario, the treatments can be seen as continuous actions and the model should generalize across them, so different treatments share the same hypothesis network. Secondly, the IPM is straightforward for binary treatments, while its form is not obvious for multiple treatments. We simplify the IPM term based on the transitive property of treatment effects. In the following, we first give the theoretical upper bound on the PEHE error and elaborate on the IPM we use, and then describe the IAE estimation algorithm.

Figure 1: Neural network architecture for IAE estimation. L is a loss function and $\Phi$ is a representation of the original context x. h represents the hypothesis and f denotes the complete function.
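The overall shape of such an objective, a factual regression loss plus a penalty on the discrepancy between treatment-specific representation distributions, can be sketched as follows. This is only a rough illustration under stated assumptions: the penalty used here is the squared distance between mean representations of consecutive treatment groups (the squared MMD under a linear kernel), the weight `lam` and the names `objective` and `predict` are hypothetical, and the pairing of consecutive groups is not the IPM term derived in the paper.

```python
import numpy as np

def objective(rep, t, y, predict, lam=1.0):
    """Factual squared loss plus lam times a representation-discrepancy penalty.

    rep: (N, r) representations Phi(x_i); t: (N,) observed treatments;
    y: (N,) factual outcomes; predict: callable (rep, t) -> (N,) predictions
    from one hypothesis network shared by all treatments.
    """
    factual = np.mean((predict(rep, t) - y) ** 2)
    groups = [rep[t == v] for v in np.unique(t)]   # one group per observed treatment
    # Stand-in discrepancy (an assumption): squared distance between the mean
    # representations of consecutive treatment groups.
    penalty = sum(np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)
                  for a, b in zip(groups, groups[1:]))
    return factual + lam * penalty

rng = np.random.default_rng(0)
rep = rng.normal(size=(30, 4))
t = np.repeat([0, 1, 2], 10)
y = rng.normal(size=30)
predict = lambda r, tt: r.sum(axis=1) + tt         # placeholder hypothesis
loss_no_penalty = objective(rep, t, y, predict, lam=0.0)
loss_with_penalty = objective(rep, t, y, predict, lam=1.0)
```

In a trainable version both terms would be differentiated with respect to the parameters of $\Phi$ and h; the NumPy form above only evaluates the objective.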

4.1 Loss Error Bound

Before analyzing our main result, we first give a lemma considering the binary treatment case.

Lemma 1. [Shalit et al., 2017] Let $\Phi: \mathcal{X} \to \mathcal{R}$ be a one-to-one representation function with inverse $\Psi$ . Let $h: \mathcal{R} \times \{T_i, T_j\} \to \mathcal{Y}$ be a hypothesis. Assume there exists a constant $B_{\Phi}$ such that for $T \in \{T_i, T_j\}$ , the per-unit expected loss functions $\ell_{h,\Phi}(\Psi(r),T)$ obey $\frac{1}{B_{\Phi}} \cdot \ell_{h,\Phi}(\Psi(r),T) \in G$ , where $\ell_{h,\Phi}(x,T) = \int_{\mathcal{Y}} L(h(\Phi(x),T),Y_T)p(Y_T|x)dY_T$ . Assuming that the loss L is the squared loss, we have that