Understanding Lasserre Hierarchy

From a Probabilistic Perspective
Posted on June 5, 2025 by Yu Cong and Hongjie Qing

Useful links:
  1. https://sites.math.washington.edu/~rothvoss/lecturenotes/lasserresurvey.pdf
  2. https://web.stanford.edu/class/cs369h/
  3. Laurent’s survey [1]
  4. https://rameesh-paul.github.io/sos.pdf
  5. https://www.ams.jhu.edu/~abasu9/papers/main-Lasserre.pdf
  6. Chapter 3 of Ali Kemal Sinop’s PhD thesis

When I started writing this post, I hadn’t found so many “useful links” yet. The content below and links 1 and 2 focus only on applying the Lasserre hierarchy to LPs, while links 3–6 treat more general settings.

1 $K\subset [0,1]^n \to K\cap \{0,1\}^n$

We want to solve a 0-1 integer program. Since this task is NP-hard in general, we usually consider its linear relaxation. Different LP formulations have different integrality gaps. For example, consider the following linear relaxation of the maximum matching IP in non-bipartite graphs.

$$\begin{aligned} \sum_{e\in \delta(v)} x(e)&\leq 1 & &\forall v\in V\\ x(e)&\in [0,1] & &\forall e\in E \end{aligned}$$

Sadly, this polytope is not integral. Edmonds proved that the following formulation is integral.

$$\begin{aligned} \sum_{e\in \delta(v)} x(e)&\leq 1 & &\forall v\in V\\ x(e)&\in [0,1] & &\forall e\in E\\ \sum_{e\in E[U]} x(e) &\leq (|U|-1)/2 & &\forall U\subset V,\ |U| \text{ odd} \end{aligned}$$

Schrijver [2] showed that those odd-set constraints can be obtained by adding cutting planes to the previous polytope. Fortunately, for the matching polytope we have a polynomial-time separation oracle. However, for harder problems adding cutting planes may make the program NP-hard to solve. The Lasserre hierarchy is a method for strengthening the polytope toward the integer hull while providing provably good properties and keeping the program polynomial-time solvable (when applied a constant number of times).

2 Probability Perspective

There is a nice interpretation of the linear relaxation of 0-1 integer programs. Let $K=\left\{ x\in \mathbb{R}^n \mid Ax\geq b \right\}\subset [0,1]^n$ be the polytope of the linear relaxation. The goal of solving the integer program is to describe all possible discrete distributions over $K\cap \left\{ 0,1 \right\}^n$. Note that for a fixed distribution over the integer points, the expected position $\sum_p \mathop{\mathrm{Pr}}[X=p]\,p$, whose $i$-th coordinate is $\mathop{\mathrm{Pr}}[X_i=1]$, lies in $\mathop{\mathrm{conv}}(K\cap \left\{ 0,1 \right\}^n)$, and iterating over all possible distributions gives us the whole integer hull. Hence we could find the integral optimal solution if we had access to all distributions over the integer points.

For any $x\in K$, $x_i$ can be interpreted as the probability that $X_i=1$ in a random integral solution. We only care about discrete distributions on feasible integral points. However, each $x\in K$ only describes some marginal probabilities, and these marginals may not even be feasible. Consider the following 2D example. Any point in $\text{green area}\setminus \text{orange area}$ is not the marginal of any joint distribution over $(0,0),(1,0)$ and $(1,1)$. The idea is to iteratively prune this area.
Figure: the 2D example.

Now we need to think about how to represent all possible joint distributions. One natural way is to use a vector $y\in \mathbb{R}^{2^n}$ recording the probability of every possible integer point in $\left\{ 0,1 \right\}^n$. However, that representation does not interact well with our existing marginal probabilities. Instead, let $y\in \mathbb{R}^{2^{n}}$ be a vector indexed by subsets such that $y_I=\mathop{\mathrm{Pr}}[\bigwedge_{i\in I}(X_i=1)]$ and $y_\emptyset=1$. Computing all feasible $y$ is the same as finding all possible joint discrete distributions on the integer points. To make $y$ the probabilities arising from some joint distribution and to make $(y_{\left\{ 1 \right\}},\dots,y_{\left\{ n \right\}})^T\in K$, we have to add more constraints.

2.1 Feasible Probability

For now let’s forget about $K$ and consider $y\in [0,1]^{2^{[n]}}$ and a discrete distribution $D$ on $\left\{ 0,1 \right\}^n$. We want $y_I=\mathop{\mathrm{Pr}}_{D}[\bigwedge_{i\in I}(X_i=1)]$. In fact, there is a one-to-one correspondence between $y$ and $D$. If $D$ is given, computing $y_I$ is easy for any $I\subseteq [n]$. If $y$ is given, recovering the distribution $D$ amounts to solving a system of $2^n$ linear equations in $2^n$ variables ($2^n-1$ of the equations come from the $y_I$, and the remaining one is $\sum_p D(p)=1$). Thus, with a slight abuse of notation, we will refer to $y$ as a distribution.
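The inverse of this linear system even has a closed form: the map $D\mapsto y$ is a superset-sum (zeta) transform, so $D$ is recovered by Möbius inversion. Here is a small numpy sketch (my own illustration, not code from the post):

```python
import itertools
import numpy as np

n = 3
points = list(itertools.product([0, 1], repeat=n))
subsets = [frozenset(S) for r in range(n + 1)
           for S in itertools.combinations(range(n), r)]

rng = np.random.default_rng(0)
D = rng.random(len(points)); D /= D.sum()       # a random distribution on {0,1}^n

# y_I = Pr_D[ X_i = 1 for all i in I ]
y = {I: sum(D[k] for k, p in enumerate(points)
            if all(p[i] == 1 for i in I)) for I in subsets}

# Moebius inversion: D(p) = sum_{I >= ones(p)} (-1)^{|I| - |ones(p)|} y_I
def recover(p):
    ones = frozenset(i for i in range(n) if p[i] == 1)
    return sum((-1) ** (len(I) - len(ones)) * y[I] for I in subsets if I >= ones)

assert all(abs(recover(p) - D[k]) < 1e-9 for k, p in enumerate(points))
```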

We work with the 2D example first. Let $x=(x_1,x_2)^T\in K$ be a marginal distribution. One can see that $y=(1,x_1,x_2,\mathop{\mathrm{Pr}}[X_1=X_2=1])^T$, and the last entry is not arbitrary: $\mathop{\mathrm{Pr}}[X_1=X_2=1]$ must lie in the range $[\max(0, x_1+x_2-1),\min(x_1,x_2)]$.
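A quick numerical check of this claim (again my own illustration): a value $p=\mathop{\mathrm{Pr}}[X_1=X_2=1]$ yields nonnegative probabilities for the four points of $\{0,1\}^2$ exactly when it lies in the stated range.

```python
import numpy as np

x1, x2 = 0.7, 0.4
for p in np.linspace(0, 1, 101):
    # masses of (1,1), (1,0), (0,1), (0,0) forced by the marginals (x1, x2)
    masses = [p, x1 - p, x2 - p, 1 - x1 - x2 + p]
    valid = all(m >= -1e-9 for m in masses)
    in_range = max(0.0, x1 + x2 - 1) - 1e-9 <= p <= min(x1, x2) + 1e-9
    assert valid == in_range
```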

To make sure $y$ is indeed a probability distribution, the moment matrix is considered. The moment matrix $M(y)$ is of size $2^n \times 2^n$, and $M(y)[I,J]$ is defined as the expectation $E[\prod_{i\in I\cup J}X_i]=y_{I\cup J}$ (the only non-zero term is $1\cdot \mathop{\mathrm{Pr}}[\bigwedge_{i\in I\cup J}(X_i=1)]=y_{I\cup J}$). The expectation is taken over the distribution defined by $y$.
Lemma 1

For any probability distribution $y$, the moment matrix $M(y)$ is psd.
Proof

We need to verify $z^T M(y) z\geq 0$ for any $z$. $$\begin{aligned} z^T M(y) z &= \sum_I \sum_J z_I y_{I\cup J} z_J\\ &= \sum_I \sum_J z_I E\left[\prod_{i\in I\cup J} X_i\right] z_J\\ &= E\left[\left( \sum_I z_I \prod_{i\in I} X_i\right)^2 \right] \end{aligned}$$
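As a sanity check (my own sketch, not from the post), one can build $M(y)$ from an explicit distribution on $\{0,1\}^n$ and confirm numerically that it is psd:

```python
import itertools
import numpy as np

n = 3
points = list(itertools.product([0, 1], repeat=n))
subsets = [frozenset(S) for r in range(n + 1)
           for S in itertools.combinations(range(n), r)]

rng = np.random.default_rng(1)
D = rng.random(len(points)); D /= D.sum()       # an arbitrary distribution
y = {I: sum(D[k] for k, p in enumerate(points)
            if all(p[i] == 1 for i in I)) for I in subsets}

M = np.array([[y[I | J] for J in subsets] for I in subsets])   # M(y)[I,J] = y_{I u J}
assert np.linalg.eigvalsh(M).min() >= -1e-9
```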

Note that a sum of squares appears in the proof. The Lasserre hierarchy has deep connections with SOS optimization and is also known as the sum-of-squares hierarchy.
Lemma 2

If $M(y)$ is psd, then $y$ is a probability distribution.
Proof

It is easy to see that $y_I\in [0,1]$ for all $I\subseteq [n]$. Consider the submatrix of $M(y)$ indexed by $\emptyset$ and $I$, $$\begin{bmatrix} y_\emptyset & y_I\\ y_I & y_I \end{bmatrix}.$$

It is psd since $M(y)$ is psd. Its diagonal entries give $y_I\geq 0$, and its determinant gives $y_\emptyset y_I - y_I^2=y_I(1-y_I)\geq 0$, so $y_I\in[0,1]$.

Let $\mathop{\mathrm{Pr}}_D[p]$ be the probability of selecting $p\in\left\{ 0,1 \right\}^n$ under $D$. It remains to prove that the following system of linear equations has a solution with $\mathop{\mathrm{Pr}}_D[p]\in [0,1]$ for all $p$.

$$\begin{aligned} y_{[n]} &= \mathop{\mathrm{Pr}}_D[\mathbf 1]\\ y_{[n]\setminus \left\{ n \right\}} &= \sum_{p:\,\bigwedge_{i\in [n-1]}(p_i=1)} \mathop{\mathrm{Pr}}_D[p]\\ y_{[n]\setminus \left\{ n-1 \right\}} &= \sum_{p:\,\bigwedge_{i\in [n]\setminus \left\{ n-1 \right\}}(p_i=1)} \mathop{\mathrm{Pr}}_D[p]\\ &\;\;\vdots \\ y_{\left\{ 1 \right\}} &= \sum_{p:\,p_1=1} \mathop{\mathrm{Pr}}_D[p]\\ y_\emptyset &= \sum_p \mathop{\mathrm{Pr}}_D[p] \end{aligned}$$

I believe this can be proven with the idea of Lemma 2 here.

2.2 Projection in $K$

Let the projection of $y$ be $(y_{\left\{ 1 \right\}},\dots,y_{\left\{ n \right\}})^T$. For any feasible $y$, the projection should always lie in $K$. One may want to define moment matrices for the constraints $Ax\geq b$; these are called moment matrices of slacks. For simplicity we only consider one linear constraint $a^Tx-b\geq 0$. The slack moment matrix for this constraint is $M'(y)=\left( \sum_{i=1}^n a_i y_{I\cup J\cup \left\{ i \right\}}-b\, y_{I\cup J} \right)_{I,J\subseteq [n]}$. Then we can make a similar argument.

$$\begin{aligned} z^T M'(y) z &= \sum_I \sum_J z_I z_J \left(\sum_{i=1}^n a_i y_{I\cup J\cup \left\{ i \right\}}-b\, y_{I\cup J}\right)\\ &= \sum_I \sum_J z_I z_J \left(\sum_i a_i E\left[\prod_{k\in I\cup J\cup\left\{ i \right\}} X_k\right] - b\, E\left[\prod_{k\in I\cup J}X_k\right] \right)\\ &= E\left[ \sum_I \sum_J z_I z_J \left(\sum_i a_i X_i -b\right) \prod_{k\in I\cup J}X_k \right]\\ &= E\left[ \left(\sum_i a_i X_i -b\right) \left(\sum_I z_I \prod_{i\in I} X_i \right)^2 \right] \end{aligned}$$

Note that we can combine the expectations since they are taken over the same probability distribution. Now assume that every point in the support of the distribution satisfies $a^TX-b\geq 0$.

$$\begin{aligned} E&\left[ \left(\sum_i a_i X_i -b\right) \left(\sum_I z_I \prod_{i\in I} X_i \right)^2 \right]\\ &= \sum_p \mathop{\mathrm{Pr}}[p]\,(a^T p-b)\left(\sum_I z_I \prod_{i\in I} p_i \right)^2 \geq 0 \end{aligned}$$

Hence, if every point in the support of the distribution satisfies $a^TX\geq b$, the corresponding slack moment matrix is psd.
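The same kind of numerical check as before works for a slack moment matrix (my own sketch, not from the post): if the distribution is supported on points satisfying $a^Tx\geq b$, the matrix comes out psd.

```python
import itertools
import numpy as np

n = 3
a, b = np.array([1.0, 1.0, 1.0]), 1.0            # the constraint x1 + x2 + x3 >= 1
points = [p for p in itertools.product([0, 1], repeat=n) if np.dot(a, p) >= b]
subsets = [frozenset(S) for r in range(n + 1)
           for S in itertools.combinations(range(n), r)]

rng = np.random.default_rng(2)
D = rng.random(len(points)); D /= D.sum()        # a distribution on feasible points only
y = {I: sum(D[k] for k, p in enumerate(points)
            if all(p[i] == 1 for i in I)) for I in subsets}

Ms = np.array([[sum(a[i] * y[I | J | {i}] for i in range(n)) - b * y[I | J]
                for J in subsets] for I in subsets])
assert np.linalg.eigvalsh(Ms).min() >= -1e-9
```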

Finally, here is a more formal definition.
Definition 3 ($t$-th level of the Lasserre hierarchy)

The $t$-th level of the Lasserre hierarchy $\mathop{\mathrm{Las}}_t(K)$ of a convex polytope $K=\left\{ x\in \mathbb{R}^n \mid Ax\geq b \right\}\subset [0,1]^n$ is the set of vectors $y\in \mathbb{R}^{2^n}$ with $y_\emptyset=1$ that make the following matrices psd:
  1. the moment matrix $M_t(y):=(y_{I\cup J})_{|I|,|J|\leq t}\succeq 0$;
  2. the moment matrices of slacks $M_t^\ell(y):=\left( \sum_{i=1}^n A_{\ell i}\,y_{I\cup J\cup \left\{ i \right\}}-b_\ell\, y_{I\cup J} \right)_{|I|,|J|\leq t}\succeq 0$ for each constraint $\ell$.

Note that the $t$-th level of the Lasserre hierarchy only involves entries $y_I$ with $|I|\leq 2t+1$ (the $+1$ comes from the moment matrices of slacks). The matrices have dimension $\sum_{i=0}^{t}\binom{n}{i}=n^{O(t)}$ and there are only $m+1$ of them. Thus optimizing an objective over the $t$-th level of the Lasserre hierarchy takes $m\,n^{O(t)}$ time, which is still polynomial in the input size for constant $t$. (The separation oracle computes eigenvalues and eigenvectors: if some matrix has a negative eigenvalue, we take a corresponding eigenvector $v$, and the separating inequality is $\sum_{I,J}v_{I}v_{J}\, y_{I\cup J}\geq 0$, or the analogous inequality for a slack moment matrix. See Example 43.)
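To make the definition concrete, here is a minimal sketch (not code from the post) of optimizing over $\mathop{\mathrm{Las}}_1(K)$ for the maximum-matching LP on a triangle, written with cvxpy and assuming an SDP-capable solver such as SCS is installed; all helper names are mine. The plain LP optimum is $3/2$ (every edge at $1/2$), while the level-1 relaxation should already return a value of about $1$, the integral optimum.

```python
import itertools
import cvxpy as cp

t = 1
edges = [0, 1, 2]                           # the three edges of a triangle
vertex_pairs = [(0, 1), (1, 2), (0, 2)]     # pairs of edges sharing a vertex

# one scalar variable y_S per subset S of edges with |S| <= 2t+1
subsets = [frozenset(S) for r in range(2 * t + 2)
           for S in itertools.combinations(edges, r)]
y = {S: cp.Variable() for S in subsets}
index = [S for S in subsets if len(S) <= t]  # rows/columns of the t-th moment matrices

def psd_constraints(entries):
    """Tie a PSD matrix variable to a square array of scalar expressions."""
    m = len(entries)
    P = cp.Variable((m, m), PSD=True)
    return [P[r, c] == entries[r][c] for r in range(m) for c in range(m)]

def moment_entries(a=None, b=0.0):
    """Entries of M_t(y), or of the slack moment matrix of a^T x >= b if `a` is given."""
    mat = []
    for I in index:
        row = []
        for J in index:
            U = I | J
            if a is None:
                row.append(y[U])
            else:
                row.append(sum(a.get(i, 0.0) * y[U | {i}] for i in edges) - b * y[U])
        mat.append(row)
    return mat

cons = [y[frozenset()] == 1] + psd_constraints(moment_entries())
for e, f in vertex_pairs:                    # x_e + x_f <= 1, i.e. -x_e - x_f >= -1
    cons += psd_constraints(moment_entries(a={e: -1.0, f: -1.0}, b=-1.0))
for e in edges:                              # x_e >= 0
    cons += psd_constraints(moment_entries(a={e: 1.0}, b=0.0))

prob = cp.Problem(cp.Maximize(sum(y[frozenset({e})] for e in edges)), cons)
prob.solve()
print(prob.value)                            # expected to be roughly 1
```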

3 Properties

Almost everything in this part can be found here.

Suppose that we have the $t$-th level of the Lasserre hierarchy $\mathop{\mathrm{Las}}_t(K)$. Denote by $\mathop{\mathrm{Las}}_t^{proj}(K)$ the projection of the $t$-th level.
  1. $\mathop{\mathrm{Las}}_t(K)$ is convex
  2. $y_I\in [0,1]$ for all $y\in \mathop{\mathrm{Las}}_t(K)$
  3. $0\leq y_I \leq y_J \leq 1$ for all $J\subset I$ with $|I|,|J|\leq t$
  4. $y_{I\cup J}\leq \sqrt{y_I \cdot y_J}$
  5. $K\cap \left\{ 0,1 \right\}^n \subset \mathop{\mathrm{Las}}_t^{proj}(K)$ for all $t\in [n]$
  6. $\mathop{\mathrm{Las}}_t^{proj}(K)\subset K$
  7. $\mathop{\mathrm{Las}}_n(K)\subset \mathop{\mathrm{Las}}_{n-1}(K)\subset \dots \subset \mathop{\mathrm{Las}}_0(K)$

Properties 1, 2 and 3 show that $y$ behaves similarly to a real probability distribution.

Properties 4, 5 and 6 show that $K\cap \left\{ 0,1 \right\}^n \subset \mathop{\mathrm{Las}}_n^{proj}(K)\subset \mathop{\mathrm{Las}}_{n-1}^{proj}(K)\subset \dots \subset \mathop{\mathrm{Las}}_0^{proj}(K) = K$.

The goal of this section is to show that $K\cap \left\{ 0,1 \right\}^n = \mathop{\mathrm{Las}}_n^{proj}(K)$. When working with the Lasserre hierarchy, instead of considering only the projection $x$, we usually perform the analysis on $y$.

3.1 Convex Hull and Conditional Probability

Lemma 4

For $t\geq 1$, let $y\in \mathop{\mathrm{Las}}_t(K)$ and let $S\subset [n]$ be any subset of variables of size at most $t$. Then $$y\in \mathop{\mathrm{conv}}\left\{ z\in \mathop{\mathrm{Las}}_{t-|S|}(K)\;\middle|\; z_{\left\{ i \right\}}\in \left\{ 0,1 \right\}\ \forall i\in S \right\}.$$

For any $y\in\mathop{\mathrm{Las}}_n(K)$ and $S=[n]$, the previous lemma implies that the projection of $y$ is a convex combination of integral vectors in $K\cap \left\{ 0,1 \right\}^n$. It then follows that $\mathop{\mathrm{Las}}_n^{proj}(K)=K\cap \left\{ 0,1 \right\}^n$. This also proves that if $M_n(y)\succeq 0$ and $M_n^{\ell}(y)\succeq 0$ for all $\ell$, then $y$ is indeed a probability distribution and its projection is in $K$.
Proof

The proof is constructive and is by induction on the size of $S$.
  • $S=\left\{ i \right\}$. Assume that $y_{\left\{ i \right\}}\in (0,1)$. For simplicity I write $y_i$ for $y_{\left\{ i \right\}}$. Define two vectors $z^{(1)},z^{(2)}$ by $z^{(1)}_I=\frac{y_{I\cup\left\{ i \right\}}}{y_i}$ and $z^{(2)}_I=\frac{y_I-y_{I\cup\left\{ i \right\}}}{1-y_i}$. One can easily verify that $y=y_i z^{(1)}+(1-y_i)z^{(2)}$, $z^{(1)}_i=1$ and $z^{(2)}_i=0$. It remains to verify $z^{(1)},z^{(2)}\in \mathop{\mathrm{Las}}_{t-1}(K)$. Since $M_t(y)$ is psd, there are vectors $(v_I)_{|I|\leq t}$ such that $\langle v_I,v_J \rangle=y_{I\cup J}$ for all $|I|,|J|\leq t$. Take $v_I^{(1)}=v_{I\cup\left\{ i \right\}}/\sqrt{y_i}$. We have $\langle v_I^{(1)},v_J^{(1)} \rangle=\frac{y_{I\cup J\cup\left\{ i \right\}}}{y_i}=M_{t-1}(z^{(1)})[I,J]$ for all $|I|,|J|\leq t-1$. Thus $M_{t-1}(z^{(1)})$ is psd. Similarly, one can take $v_I^{(2)}=(v_I-v_{I\cup \left\{ i \right\}})/\sqrt{1-y_i}$ and show that $M_{t-1}(z^{(2)})$ is psd.
    For each moment matrix of slacks one can use exactly the same argument to show $M_{t-1}^{\ell}(z^{(1)})\succeq 0$ and $M_{t-1}^{\ell}(z^{(2)})\succeq 0$.
  • For the inductive step, the argument for the base case can be applied recursively to $z^{(1)},z^{(2)}$.

$y\in \mathop{\mathrm{Las}}_t(K)$ behaves like a probability distribution if we consider only $|I|\leq t$: $y_I=\mathop{\mathrm{Pr}}[\bigwedge_{i\in I}X_i=1]$. The vectors $z^{(1)},z^{(2)}$ constructed in the previous proof can be understood as conditional probabilities. $$\begin{aligned} &z^{(1)}_I=\frac{y_{I\cup\left\{ i \right\}}}{y_i}=\frac{\mathop{\mathrm{Pr}}[\bigwedge_{k\in I\cup \left\{ i \right\}}X_k=1]}{\mathop{\mathrm{Pr}}[X_i=1]}=\mathop{\mathrm{Pr}}\left[\bigwedge_{k\in I}X_k=1 \,\middle|\, X_i=1\right]\\ &z^{(2)}_I=\frac{y_I-y_{I\cup\left\{ i \right\}}}{1-y_i}=\frac{\mathop{\mathrm{Pr}}[\bigwedge_{k\in I} (X_k=1) \land X_i=0]}{\mathop{\mathrm{Pr}}[X_i=0]}=\mathop{\mathrm{Pr}}\left[\bigwedge_{k\in I}X_k=1 \,\middle|\, X_i=0\right] \end{aligned}$$

The proof is basically showing that

$$\begin{aligned} y_I &= \mathop{\mathrm{Pr}}[X_i=1]\, \mathop{\mathrm{Pr}}\left[\bigwedge_{k\in I}X_k=1 \,\middle|\, X_i=1\right]+\mathop{\mathrm{Pr}}[X_i=0]\, \mathop{\mathrm{Pr}}\left[\bigwedge_{k\in I}X_k=1 \,\middle|\, X_i=0\right]\\ &= y_i z^{(1)}_I + (1-y_i) z^{(2)}_I, \end{aligned}$$

i.e. the decomposition $y=y_i z^{(1)}+(1-y_i)z^{(2)}$ is just the law of total probability.

For any partial probability distribution $y\in\mathop{\mathrm{Las}}_t(K)$, $y_i \in (0,1)$ means that both $X_i=0$ and $X_i=1$ happen with non-zero probability, which in turn implies $z^{(1)},z^{(2)}\in \mathop{\mathrm{Las}}_{t-1}(K)$. One can also express $y$ explicitly as a convex combination and see the relation with Möbius inversion; see p. 9 of these notes.
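A small numerical sanity check of this conditioning picture (my own sketch, not from the post): build $y$ from a distribution, form $z^{(1)},z^{(2)}$ by conditioning on $X_1$, and verify the convex combination.

```python
import itertools
import numpy as np

n = 3
points = list(itertools.product([0, 1], repeat=n))
subsets = [frozenset(S) for r in range(n + 1)
           for S in itertools.combinations(range(n), r)]

rng = np.random.default_rng(3)
D = rng.random(len(points)); D /= D.sum()
y = {I: sum(D[k] for k, p in enumerate(points)
            if all(p[i] == 1 for i in I)) for I in subsets}

i = 0
yi = y[frozenset({i})]                                     # Pr[X_i = 1], strictly in (0,1) here
z1 = {I: y[I | {i}] / yi for I in subsets}                 # Pr[ . | X_i = 1]
z2 = {I: (y[I] - y[I | {i}]) / (1 - yi) for I in subsets}  # Pr[ . | X_i = 0]
assert all(abs(y[I] - (yi * z1[I] + (1 - yi) * z2[I])) < 1e-9 for I in subsets)
```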

In Lemma 4, each vector in the convex combination (those with integer values on $S$, such as $z^{(1)},z^{(2)}$) can be understood as a partial probability distribution conditioned on the event $\left[\bigwedge_{i\in I} (X_i=1) \wedge \bigwedge_{j\in J}(X_j=0)\right]$ where $I\sqcup J=S$, and the weight assigned to it in the convex combination is exactly the probability that its condition occurs. More formally, Lemma 4 implies the following.
Corollary 5

Let $y\in\mathop{\mathrm{Las}}_t(K)$. For any subset $S\subset [n]$ of size at most $t$, there is a distribution $D(S)$ over $\left\{ 0,1 \right\}^S$ such that $$\mathop{\mathrm{Pr}}_{z\sim D(S)}\left[ \bigwedge_{i\in I} (z_i=1) \right]=y_I \quad \forall I\subset S.$$

Moreover, these distributions are “locally consistent”, since the probability assigned to each vector only depends on its condition.

Since the constraints in $\mathop{\mathrm{Las}}_t$ only concern the psdness of certain matrices, one may naturally think about their Gram decompositions. This gives a vector representation $v_I$ of $y_I$ for all $|I|\leq t$ and may be helpful in rounding algorithms. For example, for $J\subset I$, $v_I$ lies on the sphere of radius $\|v_J\|/2=\sqrt{y_J}/2$ centered at $v_J/2$, since $\|v_I-v_J/2\|^2=\|v_I\|^2-\langle v_I,v_J\rangle+\|v_J\|^2/4=y_I-y_I+y_J/4$.

3.2 Decomposition Theorem

We have seen that $\mathop{\mathrm{Las}}_n^{proj}(K)$ is the integer hull. Can we get a better upper bound on the number of levels needed, based on properties of $K$? Another easy upper bound is $\max_{x\in K}|\mathop{\mathrm{ones}}(x)|+1$, where $\mathop{\mathrm{ones}}(x)=\left\{ i \mid x_i=1 \right\}$. This is because $y\in \mathop{\mathrm{Las}}_t(K)$ is, on entries $|I|\leq t$, a partial distribution that can be realized as the marginal of some distribution on $K\cap \left\{ 0,1 \right\}^n$; if $K\cap \left\{ 0,1 \right\}^n$ does not contain a point with at least $t$ ones, we certainly have $\mathop{\mathrm{Pr}}[\bigwedge_{i\in I}(X_i=1)]=0$ for $|I|\geq t$.

This fact implies that for most hard problems we should not expect $\mathop{\mathrm{Las}}_k$ to give us an integral solution for constant $k$.

Karlin, Mathieu and Nguyen [3] proved a more general form of Lemma 4 using similar arguments.
Theorem 6 (Decomposition Theorem)

Let $y\in \mathop{\mathrm{Las}}_t(K)$, $S\subset [n]$ and $k\in [0,t]$ such that $k\geq |\mathop{\mathrm{ones}}(x)\cap S|$ for all $x\in K$. Then $$y\in \mathop{\mathrm{conv}}\left\{ z \;\middle|\; z\in \mathop{\mathrm{Las}}_{t-k}(K);\ z_{\left\{ i \right\}}\in \left\{ 0,1 \right\}\ \forall i\in S \right\}.$$

4 Moment Relaxation

In this section we briefly present the non-probabilistic view of the Lasserre hierarchy and how this idea is used in polynomial optimization problems.

Everything in this section can be found in useful_link[6].

Consider the following polynomial optimization problem $$\begin{aligned} \min\;& & a(x)& & &\\ \text{s.t.}\;& & b(x)&\geq 0 & &\forall b\in B\\ & & x&\in\left\{ 0,1 \right\}^n \end{aligned}$$ where $a$ and every $b\in B$ are polynomials. We want to formulate this problem as an SDP.

We can treat the polynomials $a,b$ as multilinear polynomials: since $x_i\in \left\{ 0,1 \right\}$, we have $x_i^2=x_i$. Now introduce a variable $x_S=\prod_{i\in S}x_i$ for every $S\subseteq[n]$ and write these polynomials as linear functions. For example, we can rewrite $a(x)=\sum_{S\subset [n]}\sum_{\alpha_S:S\to \mathbb{Z}_{>0}} a_{S,\alpha_S} \prod_{i\in S}x_i^{\alpha_S(i)}$ as $\sum_{S\subset [n]} a_S x_S$, which is linear in the moment sequence $(x_\emptyset, x_{\left\{ 1 \right\}},\ldots,x_{[n]})$.
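As a tiny illustration of this multilinearization step (my own sketch), one can reduce any polynomial over $\{0,1\}^n$ to the linear form $\sum_S a_S x_S$ by keeping only the support of each monomial:

```python
from collections import defaultdict

def multilinearize(terms, n):
    """terms: list of (coefficient, exponent tuple of length n) pairs.
    Returns a map S -> a_S, using x_i^k = x_i for k >= 1 on {0,1}^n."""
    out = defaultdict(float)
    for coeff, exponents in terms:
        out[frozenset(i for i in range(n) if exponents[i] > 0)] += coeff
    return dict(out)

# 2*x0^3*x1 - x1^2 + 4   ->   2*x_{0,1} - 1*x_{1} + 4*x_{}
print(multilinearize([(2.0, (3, 1, 0)), (-1.0, (0, 2, 0)), (4.0, (0, 0, 0))], n=3))
```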

Recall that our goal is to find an SDP formulation. A common technique is to replace each variable with a vector. We consider moment vectors $[v_S\in \mathbb{R}^\gamma]_{S\in 2^{[n]}}$. Similar to the LP case, we want $\langle v_A,v_B \rangle=x_{A\cup B}$. This is exactly a Gram decomposition of the moment matrix $M(x)=(x_{I\cup J})_{I,J}$: such moment vectors exist iff the moment matrix is psd. For $b(x)\geq 0$, we consider the slack moment matrix $M^b(x)=\left( \sum_S b_S x_{I\cup J\cup S} \right)_{I,J}$.

Then the program becomes the following SDP

$$\begin{aligned} \min\;& & \sum_{S\subseteq [n]}a_S x_S& & &\\ \text{s.t.}\;& & M^b(x)&\succeq 0 & &\forall b\in B\\ & & M(x)&\succeq 0 & &\\ & & x_{\emptyset}&=1 & & \end{aligned}$$

Note that if the maximum degree of the polynomials $a,b$ is at most $d$, then the following program is a relaxation of the original polynomial optimization problem (cf. Corollary 3.2.2.).

$$\begin{aligned} \min\;& & \sum_{S\subseteq [n]}a_S x_S& & &\\ \text{s.t.}\;& & M_{F}^b(x)&\succeq 0 & &\forall b\in B\\ & & M_{F\uplus V_{\leq d}}(x)&\succeq 0 & &\\ & & x_{\emptyset}&=1 & & \end{aligned}$$

where $F\subset 2^{[n]}$, $A\uplus B=\left\{ a\cup b \mid a\in A,\,b\in B \right\}$ denotes element-wise union, $V_{\leq d}$ is the family of subsets of $[n]$ of size at most $d$, and $M_{F}$ is the principal submatrix of $M(x)$ on entries $F\times F$ (similarly for $M_F^b$). Taking $F=\binom{[n]}{\leq t}$ gives us $\mathop{\mathrm{Las}}_t$.

5 Applications

5.1 Sparsest Cut

There are lots of applications in the useful links, but none of them discusses sparsest cut [4].
Problem 7 (Sparsest Cut)

Given a vertex set $V$ and two weight functions $c,D:\binom{V}{2} \to \mathbb{R}_{\geq 0}$, find $T\subset V$ minimizing the sparsity of $T$, $$\Phi(T)=\frac{\sum_{u < v}c_{u,v}\,|\chi^T(u)-\chi^T(v)|}{\sum_{u < v}D_{u,v}\,|\chi^T(u)-\chi^T(v)|},$$ where $\chi^T$ is the indicator vector of $T$.
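For very small instances the problem can of course be solved by brute force; the sketch below (my own, not from [4]) just spells out the objective.

```python
from itertools import combinations

def sparsity(T, V, c, D):
    """Phi(T) for weights c and demands D, given as dicts keyed by pairs (u, v) with u < v.
    Assumes the cut T separates a positive amount of demand."""
    cut = lambda w: sum(w[u, v] for u, v in combinations(sorted(V), 2)
                        if (u in T) != (v in T))
    return cut(c) / cut(D)

def sparsest_cut(V, c, D):
    cuts = [set(T) for r in range(1, len(V)) for T in combinations(sorted(V), r)]
    return min(cuts, key=lambda T: sparsity(T, V, c, D))

V = {0, 1, 2, 3}
D = {p: 1.0 for p in combinations(sorted(V), 2)}   # uniform demands
c = dict(D); c[(0, 1)] = 0.1                       # one cheap pair to separate
print(sparsest_cut(V, c, D))                       # -> {0} (or {1}), sparsity 0.7
```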

In [4], Guruswami and Sinop describe the Lasserre hierarchy in a slightly different way (note that useful_link[6] is Sinop’s thesis). We have seen that $y\in [0,1]^{2^{[n]}}$ is sufficient for describing the joint distribution. However, the total number of events is $3^n$, since for each variable $X_i$ an event can specify one of three states: $X_i=0$, $X_i=1$, or $X_i$ is absent.

Instead of using $y\in [0,1]^{2^{[n]}}$, they enumerate each of the $3^n$ events and work with the vectors of a Gram decomposition. For each set $S\subset V$ of size at most $r+1$ and each 0-1 labeling $f$ of the elements of $S$, they define a vector $x_S(f)$. The pairs $S(f)$ enumerate all events, and one should understand $x_S(f)$ as the vector corresponding to the entry $y_{S,f}$ of $y\in [0,1]^{3^n}$ in the Gram decomposition, with $\langle x_S(f), x_T(g) \rangle=y_{f(S)\land g(T)}$. Then $x_S(f)$ should have the following properties:
  1. If $f(S)$ and $g(T)$ are inconsistent, i.e. there is an element $e\in S\cap T$ with $f(e)\neq g(e)$, then $\langle x_S(f), x_T(g) \rangle=y_{f(S)\land g(T)}=0$.
  2. If $f(S)\land g(T)$ and $f'(A)\land g'(B)$ are the same event, i.e. $A\cup B=S\cup T$ and the labels agree, then $\langle x_S(f), x_T(g) \rangle=\langle x_A(f'), x_B(g') \rangle$.
  3. $\|x_{\emptyset}\|^2=1$, where $\emptyset$ is the empty labeling, i.e. the union of all events (the whole space).
  4. For all $u\in V$, $\|x_u(0)\|^2+\|x_u(1)\|^2=\|x_{\emptyset}\|^2=1$.
  5. For $S\subset V$, $u\in S$ and $f\in \left\{ 0,1 \right\}^{S\setminus \left\{ u \right\}}$, $x_S(f\land (u=1))+x_S(f\land (u=0))=x_{S\setminus \left\{ u \right\}}(f)$. (Note that the two vectors on the left-hand side are orthogonal.)
Lemma 8 (Pseudo-probability)

Let $x\in \mathop{\mathrm{Las}}_t(V)$ for $t\geq 0$. Then the following hold:
  1. $\|x_S(f)\|^2 \in [0,1]$ for all $|S|\leq t+1$.
  2. $\|x_S(f)\|^2 \leq \|x_T(g)\|^2$ if $T\subset S$ and $f(v)=g(v)$ for all $v\in T$.
  3. $\|x_S(f)\|^2 = \sum_{h\in \left\{ 0,1 \right\}^{T\setminus S}} \|x_T(f\land h)\|^2$ if $S\subset T$.
  4. If $S\in \binom{V}{\leq t}$, $f\in \left\{ 0,1 \right\}^S$ and $u\notin S$, then $x_{S+u}(f\land u=1)+x_{S+u}(f\land u=0)=x_{S}(f)$.
Proof

Let $N_t=\sum_{r=0}^{t+1}\binom{|V|}{r}2^r$ be the number of vectors in $x$. Consider the moment matrix $M_t\in \mathbb{R}^{N_t\times N_t}$, whose entry $M_t[f(S),g(T)]$ is $\langle x_S(f),x_T(g)\rangle$. The moment matrix is positive semidefinite since the vectors in $x$ form a Gram decomposition of $M_t$.
  1. Consider the following submatrix of $M_t$: $$\begin{bmatrix} \langle x_\emptyset,x_\emptyset\rangle & \langle x_\emptyset,x_S(f)\rangle\\ \langle x_S(f),x_\emptyset\rangle & \langle x_S(f),x_S(f)\rangle \end{bmatrix}\succeq 0.$$ Computing the determinant gives $\|x_S(f)\|^2(1-\|x_S(f)\|^2)\geq 0$.
  2. Again consider a suitable submatrix of $M_t$: $$\begin{bmatrix} \langle{x_T(g)},{x_T(g)}\rangle & \langle{x_T(g)},{x_S(f)}\rangle\\ \langle{x_S(f)},{x_T(g)}\rangle & \langle{x_S(f)},{x_S(f)}\rangle \end{bmatrix}\succeq 0.$$ The determinant is $\|x_S(f)\|^2(\|x_T(g)\|^2-\|x_S(f)\|^2)\geq 0$.
  3. We only need to show $\|x_S(f)\|^2=\|x_{S+u}(f\land u=0)\|^2 +\|x_{S+u}(f\land u=1)\|^2$; the rest follows by induction. Note that $x_u(0)+x_u(1)=x_\emptyset$, since $\|x_u(0)\|^2+\|x_u(1)\|^2=\|x_{\emptyset}\|^2$ and the two vectors are orthogonal. Then $$\begin{aligned} \|x_{S+u}(f\land u=0)\|^2 +\|x_{S+u}(f\land u=1)\|^2 &= \langle{x_S(f)},{x_u(0)}\rangle+\langle{x_S(f)},{x_u(1)}\rangle\\ &= \langle{x_S(f)},{x_u(0)+x_u(1)}\rangle\\ &= \langle{x_S(f)},{x_\emptyset}\rangle=\|x_S(f)\|^2. \end{aligned}$$
  4. Notice that $x_{S+u}(f\land u=1)$ and $x_{S+u}(f\land u=0)$ are orthogonal. Denote by $w$ the projection of $x_S(f)$ onto the plane spanned by $x_{S+u}(f\land u=1)$ and $x_{S+u}(f\land u=0)$. One can verify that $w=x_{S+u}(f\land u=1)+x_{S+u}(f\land u=0)$. Then it remains to show $\langle w, x_S(f)\rangle=\|x_S(f)\|^2$, which immediately follows from 3.

Now write $x_u=x_{\left\{ u \right\}}(1)$. The following “SDP” is a relaxation of sparsest cut.

$$\begin{aligned} \min\;& & \frac{\sum_{u < v}c_{u,v}\|x_u-x_v\|^2}{\sum_{u < v}D_{u,v}\|x_u-x_v\|^2} & &\\ \text{s.t.}\;& & \sum_{u < v}D_{u,v}\|x_u-x_v\|^2&\geq 0 &\\ & & x\in \mathop{\mathrm{Las}}_r(V)& & \end{aligned}$$

Dividing every $x_S(f)$ by the square root of the objective’s denominator gives us a genuine SDP.

$$\begin{aligned} \min\;& & \sum_{u < v}c_{u,v}\|x_u-x_v\|^2 & &\\ \text{s.t.}\;& & \sum_{u < v}D_{u,v}\|x_u-x_v\|^2&= 1 &\\ & & x\in \mathop{\mathrm{Las}}_r(V),\ \|x_\emptyset\|^2&>0 & \end{aligned}$$

The rounding method is too complicated, so it won’t be covered here.

5.2 Matching

This application can be found in Section 3.3 of useful_link[1]. We consider the maximum matching IP in non-bipartite graphs. Let $K=\left\{ x\in \mathbb{R}_{\geq 0}^n \,\middle|\, \sum_{e\in \delta(v)}x_e\leq 1 \; \forall v\in V \right\}$ be the polytope of the degree constraints and consider $\mathop{\mathrm{Las}}_t(K)$. In the notes Rothvoss shows the following lemma.
Lemma 9

$\mathop{\mathrm{Las}}_t^{proj}(K)\subseteq \left(1+\frac{1}{2t}\right)\cdot\mathop{\mathrm{conv}}(K\cap \left\{ 0,1 \right\}^n)$.
Proof

Let $y\in \mathop{\mathrm{Las}}_t(K)$. It suffices to show that $\sum_{e\in E[U]} y_e\leq (1+\frac{1}{2t})k$ for all $|U|=2k+1$, since $\left\{ x\in K \mid x \text{ satisfies the odd-set constraints} \right\}$ is the matching polytope. When $k>t$, the degree constraints imply that $\sum_{e\in E[U]} y_e\leq k+\frac{1}{2} \leq (1+\frac{1}{2t})k$. Now consider the case $k\leq t$. Note that for fixed $U$, any $I\subset E[U]$ of size $|I|> k$ has $y_I=0$, since a matching inside $U$ has at most $k$ edges. Then by the decomposition theorem (Theorem 6, applied with $S=E[U]$), $y$ can be represented as a convex combination of solutions $z\in \mathop{\mathrm{Las}}_0(K)$ in which $z_e\in \left\{ 0,1 \right\}$ for all $e\in E[U]$. Each such $z$ restricted to $E[U]$ is a matching of $U$ and hence has at most $k$ edges, so the convex combination implies $\sum_{e\in E[U]} y_e\leq k$ when $k\leq t$.

However, one can see that Lemma 9 is not tight: $\mathop{\mathrm{Las}}_0^{proj}(K)$ should already be contained in $(1+\frac{1}{2})\cdot\mathop{\mathrm{conv}}(K\cap \left\{ 0,1 \right\}^n)$, and $\mathop{\mathrm{Las}}_n^{proj}(K)$ is exactly the integer hull. Can we prove a slightly better gap that matches these observations at $\mathop{\mathrm{Las}}_0$ and $\mathop{\mathrm{Las}}_n$? The latter part of the proof in fact shows that $y\in \mathop{\mathrm{Las}}_t(K)$ satisfies all odd-set constraints with $|U|\leq 2t+1$. Consider an odd cycle with $2t+3$ vertices: $(1/2,\ldots,1/2)^T\in \mathbb{R}^{2t+3}$ is (the projection of) a feasible solution in $\mathop{\mathrm{Las}}_t(K)$ and shows that the bound $k+\frac{1}{2}$ used in the case $k>t$ is attained.

6 Questions

6.1 Replace $M_t^\ell(y)\succeq 0$ with $y^{proj}\in K$

I don’t see any proof relying on the psdness of slack moment matrices…

It turns out that problems occur in the proof of Lemma 4. If $\mathop{\mathrm{Las}}_t(K)$ is defined as $\left\{ y \mid M_t(y)\succeq 0,\ y^{proj}\in K \right\}$, then we cannot guarantee that the projections of $z^{(1)},z^{(2)}$ are in $K$. Without Lemma 4, $\mathop{\mathrm{Las}}_n^{proj}(K)$ may not be exactly $K\cap \left\{ 0,1 \right\}^n$, and the hierarchy seems less interesting? But an alternative formulation (see the sparsest cut section, which entirely ignores the slack moment matrices) still allows good rounding even without Lemma 4. Generally speaking, if the psdness of the slack moment matrices is dropped, then we lose the law of total probability (Lemma 4); however, we still have the finite additivity of probability measures (Lemma 8(3)).

6.2 Separation Oracle for Implicit $K$

Sometimes $K$ is only given implicitly. For example, consider computing the cogirth of a matroid.

$$\begin{aligned} \min\;& & \sum_{e\in E} x_e& & &\\ \text{s.t.}\;& & \sum_{e\in B} x_e&\geq 1 & &\forall \text{ bases } B\\ & & x_e&\geq 0 & &\forall e\in E \end{aligned}$$

If $K$ is only accessible through a separation oracle, is it possible to optimize over $\mathop{\mathrm{Las}}_t(K)$ in polynomial time for constant $t$?

References

[1]
M. Laurent, A Comparison of the Sherali-Adams, Lovász-Schrijver, and Lasserre Relaxations for 0–1 Programming, Mathematics of Operations Research. 28 (2003) 470–496 10.1287/moor.28.3.470.16391.
[2]
A. Schrijver, Polyhedral proof methods in combinatorial optimization, Discrete Applied Mathematics. 14 (1986) 111–133 10.1016/0166-218X(86)90056-9.
[3]
A.R. Karlin, C. Mathieu, C.T. Nguyen, Integrality Gaps of Linear and Semi-Definite Programming Relaxations for Knapsack, in: Integer Programming and Combinatoral Optimization, Springer, Berlin, Heidelberg, 2011: pp. 301–314 10.1007/978-3-642-20807-2_24.
[4]
V. Guruswami, A.K. Sinop, Approximating non-uniform sparsest cut via generalized spectra, in: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, USA, 2013: pp. 295–305.