On Bregman-Type Distances for Convex Functions and Maximally Monotone Operators

Given two point-to-set operators, one of which is maximally monotone, we introduce a new distance in their graphs. This new concept reduces to the classical Bregman distance when both operators are the gradient of a convex function. We study the properties of this new distance and establish its continuity properties. We derive its formula for some particular cases, including the case in which both operators are linear, monotone and continuous. We also characterize all bifunctions D for which there exists a convex function h such that D is the Bregman distance induced by h.

Classical Bregman distances are induced by convex functions; our new distance is induced instead by convex representations of one of the maps. However, the way we associate a Bregman distance with a convex representation is completely different from the association of a classical Bregman distance with a differentiable convex function.
Classical Bregman distances have proved to be useful in devising algorithms for convex optimization problems, as well as for variational inequalities, in which the distance plays a penalization role. It is then natural to investigate whether one could introduce a more general notion of Bregman distance which could be useful in algorithms for solving monotone variational inequalities. Our new distance also provides a new interpretation of solutions of variational inequalities. The variational inequality problem can be formulated as follows. Let X be a Banach space and X* its dual. Given a maximally monotone operator S : X ⇒ X* and a closed and convex set C ⊆ X, a solution of the variational inequality problem VIP(S, C) is a pair (x, v) ∈ X × X* such that x ∈ C, v ∈ Sx and

⟨y − x, v⟩ ≥ 0 for all y ∈ C. (1.1)

The minimization of a convex function f constrained to the set C is a particular instance of VIP(S, C), with S = ∂f. The variational inequality problem, in turn, is a particular instance of the more general inclusion problem 0 ∈ T(x), with T being the sum of S plus the normal cone N_C, where N_C(x) := {v ∈ X* : ⟨y − x, v⟩ ≤ 0 for all y ∈ C}. Combining this definition with (1.1) shows that solutions of VIP(S, C) are those elements in the graph of S which also belong to the graph of −N_C. We will show that our distance vanishes at solutions of VIP(S, C) when applied to the maps S and −N_C, which gives a new interpretation to solutions of VIP(S, C). Moreover, our notion of Bregman distance extends the classical one. Namely, when both maps are ∂f, it reduces to the Bregman distance induced by f, in the sense of [16]. When f is convex and differentiable, it reduces to the classical Bregman distance, in the sense of [6, Section 6.2].
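As a concrete and entirely illustrative finite-dimensional instance of (1.1), the following Python sketch solves VIP(S, C) for S = ∇f with a projected-gradient iteration and verifies the defining inequality; the data f, c and the box C are hypothetical choices, not taken from the paper:

```python
import numpy as np

# Hypothetical example in R^2: f(x) = 0.5*||x - c||^2, so S = grad f = x - c,
# and C = [0,1]^2.  VIP(S, C) asks for x in C with <y - x, Sx> >= 0 for all y in C.
c = np.array([2.0, -0.5])
grad = lambda x: x - c                  # S = grad f
proj = lambda x: np.clip(x, 0.0, 1.0)   # projection onto the box C

x = np.zeros(2)
for _ in range(100):
    x = proj(x - 0.5 * grad(x))         # projected-gradient step
v = grad(x)

# Since y -> <y - x, v> is affine, it suffices to test the vertices of C.
vertices = [np.array([a, b]) for a in (0.0, 1.0) for b in (0.0, 1.0)]
assert all(np.dot(y - x, v) >= -1e-9 for y in vertices)
```

Here the iteration converges to the projection of c onto C, and the pair (x, v) satisfies (1.1).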
In the present paper, we study the basic properties of this distance and show some specific examples. We also study classical Bregman distances. We provide two axiomatic characterizations; that is, we give necessary and sufficient conditions for a bifunction defined on the product of a Banach space with itself to be the Bregman distance associated to some convex function. Moreover, we study the correspondence that assigns to each differentiable convex function f its associated Bregman distance D_f. The rest of the paper is organized as follows. Section 2 presents some preliminaries on convex functions and maximally monotone operators. In particular, we recall the basic ideas on the representability of monotone operators by convex functions, as well as the related notion of enlargement of a maximally monotone operator and its main properties. In Section 3 we introduce and study our new notion of Bregman distance. It contains two subsections: in the first one we consider the particular case when the monotone operators are linear, and the second one is devoted to the study of the lower semicontinuity properties of the newly introduced Bregman distances. Section 4 contains our characterizations of classical Bregman distances and studies the mapping f ⟼ D_f defined above.

Preliminaries
Let (X, ‖·‖) and (X*, ‖·‖_*) be a Banach space and its dual, respectively. Given a point-to-set operator T : X ⇒ X*, the set D(T) := {x ∈ X : T(x) ≠ ∅} is called the domain of T, while G(T) := {(x, x*) ∈ X × X* : x* ∈ T(x)} is the graph of T. Fix C a subset of a vector space Z. The indicator function of C is the function δ_C : Z → R ∪ {+∞} =: R∞ defined as δ_C(z) := 0 for z ∈ C and δ_C(z) := +∞ for z ∉ C. We denote by int C and bdry C the interior and the boundary of C, respectively. Given a function f : X → R∞, the subdifferential of f is the point-to-set mapping ∂f : X ⇒ X* defined by

∂f(x) := {v ∈ X* : f(y) ≥ f(x) + ⟨y − x, v⟩ for all y ∈ X}.

If C is a closed and convex set, then ∂δ_C =: N_C, the normal cone to the set C. Namely,

N_C(x) = {v ∈ X* : ⟨y − x, v⟩ ≤ 0 for all y ∈ C}.

Given ε ≥ 0, the ε-subdifferential of f is the point-to-set mapping ∂_ε f : X ⇒ X* defined by

∂_ε f(x) := {v ∈ X* : f(y) ≥ f(x) + ⟨y − x, v⟩ − ε for all y ∈ X}. (2.2)

For future use we recall the Fenchel-Young inequality for a convex and lower semicontinuous function f:

f(x) + f*(v) ≥ ⟨x, v⟩ for all (x, v) ∈ X × X*, (2.3)

and

f(x) + f*(v) = ⟨x, v⟩ if and only if v ∈ ∂f(x). (2.4)

If Y is a vector space and x, y ∈ Y with x ≠ y, we denote by [x, y], ]x, y[ and ]y, x, +∞[ the sets of points λx + (1 − λ)y, with λ ∈ [0, 1], λ ∈ ]0, 1[ and λ ∈ ]0, +∞[, respectively.
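The Fenchel-Young inequality and its equality case can be sanity-checked numerically for a simple model function. A sketch, assuming f(x) = ½x² on R, whose conjugate is f*(v) = ½v² and whose subdifferential at x is the singleton {x}:

```python
import numpy as np

# Fenchel-Young for the model function f(x) = 0.5*x**2 on R.
# Here f*(v) = 0.5*v**2 and the subdifferential of f at x is {x}.
f = lambda x: 0.5 * x**2
f_conj = lambda v: 0.5 * v**2

xs = np.linspace(-3, 3, 61)
vs = np.linspace(-3, 3, 61)
# The inequality f(x) + f*(v) >= <x, v> holds everywhere ...
assert all(f(x) + f_conj(v) >= x * v - 1e-12 for x in xs for v in vs)
# ... with equality exactly when v lies in the subdifferential (v = x),
# since f(x) + f*(v) - x*v = 0.5*(x - v)**2.
assert all(abs(f(x) + f_conj(x) - x * x) < 1e-12 for x in xs)
```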
In our analysis, we will make use of the concept of enlargement of a maximally monotone operator, which we define next. Since these objects approximate the graph of the operator, it is not surprising that they are useful in analyzing the distance induced by the graphs of these operators. The definition below was introduced in [22].

Definition 2.1 Let T : X ⇒ X*. We say that E : X × R₊ ⇒ X* is an enlargement of T when the following hold.

(E1) T(x) ⊆ E(x, ε) for every x ∈ X and every ε ≥ 0.
(E2) If 0 ≤ ε₁ ≤ ε₂, then E(x, ε₁) ⊆ E(x, ε₂) for every x ∈ X.
(E3) E satisfies the transportation formula: for every v¹ ∈ E(x¹, ε₁), v² ∈ E(x², ε₂) and λ ∈ [0, 1], setting x̂ := λx¹ + (1 − λ)x², v̂ := λv¹ + (1 − λ)v² and ε̂ := λε₁ + (1 − λ)ε₂ + λ⟨x¹ − x̂, v¹ − v̂⟩ + (1 − λ)⟨x² − x̂, v² − v̂⟩, one has ε̂ ≥ 0 and v̂ ∈ E(x̂, ε̂).

We denote by E(T) the family of all enlargements of T.
Assume T = ∂f, with f a convex and lower semicontinuous function. In this case the ε-subdifferential ∂_(·)f(·) : X × R₊ ⇒ X*, which maps (x, ε) to the set ∂_ε f(x), is a fundamental example of an enlargement.
Another important example of an enlargement is defined as follows. Given an arbitrary maximally monotone operator S : X ⇒ X*, denote by S^e : X × R₊ ⇒ X* the set-valued map defined as

S^e(x, ε) := {v ∈ X* : ⟨x − y, v − u⟩ ≥ −ε for all (y, u) ∈ G(S)}

(the set S^e(x, ε) was called S^ε(x) in [11]). The following fact collects properties of enlargements that we will need in the sequel.
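For intuition, the enlargement of the identity map on R admits a closed form: minimizing (x − y)(v − y) over y gives −(x − v)²/4, so S^e(x, ε) = {v : |v − x| ≤ 2√ε}. The following grid-based sketch (an illustration, not from the paper) checks this closed form against the defining inequality:

```python
import numpy as np

# For S = Id on R, G(S) = {(y, y)} and inf_y (x - y)(v - y) = -(x - v)^2 / 4,
# so the enlargement is S^e(x, eps) = { v : |v - x| <= 2*sqrt(eps) }.
def in_enlargement_bruteforce(x, v, eps, ys=np.linspace(-10, 10, 2001)):
    # Test the defining inequality <x - y, v - u> >= -eps on a grid of (y, u=y).
    return all((x - y) * (v - y) >= -eps - 1e-9 for y in ys)

x, eps = 1.0, 0.25
for v in np.linspace(-1.0, 3.0, 41):
    predicted = abs(v - x) <= 2 * np.sqrt(eps) + 1e-9
    assert in_enlargement_bruteforce(x, v, eps) == predicted
```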

Fact 2.2
The enlargement S^e was introduced in [7] for the finite-dimensional case, and extended first to Hilbert spaces in [9, 10] and then to Banach spaces in [11]. The following facts (whose proofs can be found in the aforementioned references) hold.

(i) The set S^e(x, ε) is weak*-closed for every fixed x and ε.
(ii) If x ∈ int D(S), then the set S^e(x, ε) is weak*-compact (see [11] and [6, Theorem 5.3.4]).
(iii) The mapping S^e is the biggest element in the family E(S) (see [22] and [6, Theorem 5.4.2]). This means that E ⊆ S^e for every E ∈ E(S).
(iv) Denote by E_C(S) the subset of E(S) consisting of all E ∈ E(S) such that E(x, ε) is weak*-closed for every x and every ε ≥ 0. Then for every E ∈ E_C(S) we have that E(x, ε) is weak*-compact for every ε ≥ 0 and every x ∈ int D(S).
(v) S^e(·, ε) is locally bounded in int D(S). Namely, for every x ∈ int D(S) there exists a neighbourhood V of x such that S^e(V, ε) is bounded (see [6, Theorem 5.3.4]). Since E(·, ε) ⊆ S^e(·, ε) for every E ∈ E(S), local boundedness in int D(S) is inherited by all E ∈ E(S).
Our distance will make use of a family of convex functions associated with maximally monotone operators. We define this family next.

Definition 2.3
Let S : X ⇒ X* be a maximally monotone operator. We say that h : X × X* → R∞ represents S if the following three conditions hold:

(i) h is convex and lower semicontinuous;
(ii) h(x, v) ≥ ⟨x, v⟩ for every (x, v) ∈ X × X*;
(iii) h(x, v) = ⟨x, v⟩ if and only if v ∈ S(x).

We denote this situation as h ∈ H(S).
Remark 2.4 Fix S : X ⇒ X* a maximally monotone operator. It is well known (see, e.g., [12]) that H(S) has a smallest element and a biggest one. The smallest element is the Fitzpatrick function associated to S:

F_S(x, v) := sup_{(y,u) ∈ G(S)} ⟨x, u⟩ + ⟨y, v⟩ − ⟨y, u⟩.

The biggest element is σ_S := F_S* = cl conv(π + δ_{G(S)}), where π : X × X* → R is defined as π(x, v) := ⟨x, v⟩. For more details on the family H(S), see [5, 12, 13, 15].
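For example, for the identity map on R the supremum defining the Fitzpatrick function is attained at y = (x + v)/2, giving F(x, v) = (x + v)²/4. A quick numeric check of the representation properties (a sketch, not from the paper):

```python
import numpy as np

# Fitzpatrick function of the identity on R:
#   F(x, v) = sup_y [ x*y + y*v - y*y ] = (x + v)^2 / 4,
# attained at y = (x + v)/2.  We check the two defining properties of a
# representation: F >= <.,.> everywhere, with equality exactly on G(Id).
F = lambda x, v: (x + v)**2 / 4.0

grid = np.linspace(-2, 2, 81)
assert all(F(x, v) >= x * v - 1e-12 for x in grid for v in grid)
assert all(abs(F(x, x) - x * x) < 1e-12 for x in grid)
# Off the graph the gap is strictly positive: F(x, v) - x*v = (x - v)^2 / 4.
assert F(1.0, -1.0) - (1.0 * -1.0) == 1.0
```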

Remark 2.5
An operator T : X ⇒ X* admitting a representing function h satisfying conditions (i), (ii) and (iii) of Definition 2.3 is necessarily monotone [17, Theorem 5], but it may not be maximally monotone. Such monotone operators are called representable. According to [17, Proposition 32], in finite-dimensional spaces the representable monotone operators are the intersections of arbitrary families of maximally monotone operators. In infinite-dimensional Banach spaces, such intersections are still representable, as follows easily from [17, Corollary 10] and the representability of maximally monotone operators, but a representable operator which cannot be expressed as an intersection of maximally monotone operators was presented in [21]. Some further results on representable monotone operators were given in [4].
We will need the following fact. For its proof, see [12, Propositions 2.6 and 3.5].

Fact 2.6 Let S : X ⇒ X* be maximally monotone and h ∈ H(S). If h(x, v) ≤ ⟨x, v⟩ + ε, then v ∈ S^e(x, ε). From the latter inclusion and the definition of D_T^h (see Definition 3.1 below) we derive that T y ∩ S^e(x, ε) ≠ ∅ whenever D_T^h(x, y) < ε.

Fact 2.6 motivates the following definition of enlargement.

Remark 2.7
Recall from [12, 13] that to a given maximally monotone operator S : X ⇒ X* and a fixed h ∈ H(S), one can associate the enlargement L^h of S defined as follows:

L^h(x, ε) := {v ∈ X* : h(x, v) ≤ ⟨x, v⟩ + ε}.

The norm-weak* lower semicontinuity of h implies that the graph of L^h(·, ·) is closed w.r.t. the strong-weak* convergence. From the minimality of the Fitzpatrick function, it can be seen that L^{F_S} = S^e; in other words, the smallest element of H(S) induces the biggest enlargement.

A Bregman Distance for Maximally Monotone Operators
We will consider the following notion, which generalizes the concept of Bregman distance as given in [16] (see Proposition 3.5 below).
Definition 3.1 Let S : X ⇒ X* be a maximally monotone operator, let T : X ⇒ X* be a point-to-set map, and fix h ∈ H(S). The distances D_T^h, D̄_T^h : X × X → R ∪ {+∞} induced by h and T are defined by

D_T^h(x, y) := inf_{v ∈ T y} h(x, v) − ⟨x, v⟩ and D̄_T^h(x, y) := sup_{v ∈ T y} h(x, v) − ⟨x, v⟩,

with the convention that both equal +∞ when T y = ∅.
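To make the definition concrete, the following sketch (an illustration under the assumption that the distance takes the form inf_{v ∈ T y}[h(x, v) − ⟨x, v⟩], as used in the proofs below) evaluates it for S = T = Id on R with h the Fitzpatrick function:

```python
import numpy as np

# Sketch for S = T = Id on R with h = F_Id(x, v) = (x + v)^2 / 4.
# Since T y = {y}, the infimum over v in T y is trivial, and
#   D(x, y) = h(x, y) - x*y = (x + y)^2/4 - x*y = (x - y)^2/4,
# i.e. the new distance is a rescaled classical Bregman distance for 0.5*x^2.
h = lambda x, v: (x + v)**2 / 4.0
D = lambda x, y: h(x, y) - x * y        # inf over the singleton T y = {y}

grid = np.linspace(-2, 2, 41)
assert all(abs(D(x, y) - (x - y)**2 / 4.0) < 1e-12 for x in grid for y in grid)
assert all(D(x, x) == 0.0 for x in grid)  # the distance vanishes on the graph
```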

Remark 3.2 From Remark 2.4 we have that every h ∈ H(S) satisfies the inequalities F_S ≤ h ≤ σ_S; hence, we have directly from the definition that

0 ≤ D_T^{F_S}(x, y) ≤ D_T^h(x, y) ≤ D_T^{σ_S}(x, y) for all x, y ∈ X.

Analogous inequalities hold for D̄_T^h.
From the definitions, we readily obtain the following facts. Recall from [16] that, to a given strictly convex function f : X → R∞, we can associate two Bregman distances, defined as follows:

D_f(x, y) := inf_{v ∈ ∂f(y)} f(x) − f(y) − ⟨x − y, v⟩ and D̄_f(x, y) := sup_{v ∈ ∂f(y)} f(x) − f(y) − ⟨x − y, v⟩.

When f is differentiable at y, we clearly have

D_f(x, y) = D̄_f(x, y) = f(x) − f(y) − ⟨x − y, ∇f(y)⟩, (3.8)

which is the classical definition of Bregman distance. We prove next that our distances reduce to D_f, D̄_f when T = ∂f.
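The classical formula lends itself to direct computation. A short sketch (illustrative; the model functions below are standard examples, not taken from the paper):

```python
import numpy as np

# Classical Bregman distance D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>,
# with two standard instances.
def bregman(f, grad_f, x, y):
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

# f = 0.5*||.||^2 gives half the squared Euclidean distance ...
sq = lambda x: 0.5 * np.dot(x, x)
sq_grad = lambda x: x
x, y = np.array([1.0, 2.0]), np.array([0.0, -1.0])
assert abs(bregman(sq, sq_grad, x, y) - 0.5 * np.dot(x - y, x - y)) < 1e-12

# ... while the negative entropy gives the (generalized) Kullback-Leibler
# divergence sum p*log(p/q) - sum p + sum q.
ent = lambda p: np.sum(p * np.log(p))
ent_grad = lambda p: np.log(p) + 1.0
p, q = np.array([0.3, 0.7]), np.array([0.5, 0.5])
kl = np.sum(p * np.log(p / q)) - np.sum(p) + np.sum(q)
assert abs(bregman(ent, ent_grad, p, q) - kl) < 1e-12
```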

Proposition 3.5 Fix a lsc and strictly convex function f : X → R∞, let T := ∂f, and take h_f ∈ H(∂f) defined by h_f(x, v) := f(x) + f*(v). Then D_T^{h_f} = D_f and D̄_T^{h_f} = D̄_f. In particular, when f is differentiable at y, for every x ∈ X we have D_T^{h_f}(x, y) = f(x) − f(y) − ⟨x − y, ∇f(y)⟩, as in the classical definition of Bregman distances.

Proof For every v ∈ ∂f(y), the Fenchel-Young equality (2.4) gives f*(v) = ⟨y, v⟩ − f(y). Hence

D_T^{h_f}(x, y) = inf_{v ∈ ∂f(y)} f(x) + f*(v) − ⟨x, v⟩ = inf_{v ∈ ∂f(y)} f(x) − f(y) − ⟨x − y, v⟩ = D_f(x, y),

as wanted. The statement for D̄_f follows the same steps. The last statement is a direct consequence of the definitions.
The following example shows that our distance can become the classical Bregman distance even when h f does not represent T .
Example 3.6 Let X be a Hilbert space and fix λ > 0. Consider the operators S := ∇f and T_λ := ∇f + λI, for f : X → R∞ a convex, coercive, and differentiable function with open domain. Under these assumptions, we have that S = ∇f is surjective. Then h_f ∈ H(S). Call u_{y,λ} := ∇f(y) + λy and let w_{y,λ} be such that ∇f(w_{y,λ}) = u_{y,λ}. We have

D_{T_λ}^{h_f}(x, y) = h_f(x, u_{y,λ}) − ⟨x, u_{y,λ}⟩ = f(x) + f*(u_{y,λ}) − ⟨x, u_{y,λ}⟩ = f(x) − f(w_{y,λ}) − ⟨x − w_{y,λ}, ∇f(w_{y,λ})⟩ = D_f(x, w_{y,λ}),

where we used the Fenchel-Young equality (2.4) in the third equality, and the definition of Bregman distance in the last one. In this way, we can express the distance induced by the operators as a classical Bregman distance.
We have seen in Remark 3.3 that different levels of "overlap" between the sets Sx and T y imply that the distances D_T^h and D̄_T^h vanish at (x, y). The next result studies the converse situation, i.e., under which conditions the vanishing of the distance implies the corresponding "overlap" between the sets Sx and T y.

Proposition 3.7
(a) Assume that T : X ⇒ X* is locally bounded in int D(T) and weak*-closed valued (i.e., T z is weak*-closed for all z ∈ D(T)). Assume also that (x, y) ∉ bdry D(S) × bdry D(T). If D_T^h(x, y) = 0, then T y ∩ Sx ≠ ∅.
Proof Let us prove part (a). Assume that D_T^h(x, y) = 0. The assumption on (x, y) implies that either x or y must be in the interior of the corresponding domain. We consider each case separately. If y ∈ int D(T), then T y is weak*-compact, so the infimum over v ∈ T y in Definition 3.1 is attained at some v̄ ∈ T y. This attainment, combined with the fact that D_T^h(x, y) = 0, gives h(x, v̄) = ⟨x, v̄⟩, and we deduce from Definition 2.3(iii) that v̄ ∈ Sx. Hence T y ∩ Sx ≠ ∅. This proves the claim in the case that y ∈ int D(T). Assume now that x ∈ int D(S). By Fact 2.2(ii) this implies that the set S^e(x, ε) is weak*-compact for every ε ≥ 0. Since D_T^h(x, y) = 0, by Fact 2.6 we have that the weak*-compact sets T y ∩ S^e(x, ε) are nonempty for every ε > 0; hence the family {T y ∩ S^e(x, ε)}_{ε>0} has the finite intersection property, which implies that T y ∩ Sx = T y ∩ (∩_{ε>0} S^e(x, ε)) ≠ ∅.

Corollary 3.8 Let C ⊆ X be closed and convex, and assume the hypotheses of Proposition 3.7(a) hold for T = −N_C and x = y ∈ C. The following statements are equivalent: (i) D_{−N_C}^h(x, x) = 0; (ii) there exists v ∈ Sx such that (x, v) is a solution of VIP(S, C).

Proof The implication (ii)→(i) follows from Remark 3.3(b) for T = −N_C, x = y, and the fact that (ii) entails the existence of v ∈ −N_C(x) such that v ∈ S(x). The converse follows from Proposition 3.7(a) for T = −N_C and x = y.

Remark 3.9 We see from Proposition 3.7 that, when x ∉ bdry D(S), having D_T^h(x, x) = 0 results in a nonempty intersection of the sets Sx and T x. Can we say something more when these distances vanish on some open set? A possible way to address this question is by using Theorem 3.10 below.
In the following theorem, the maps E_T and E_S belong to E_C(T) and E_C(S), respectively (see Fact 2.2(iv)). (ii) D ⊆ int D(S) and E_T(x, ε) ∩ E_S(x, ε) ≠ ∅ for every x ∈ D, ε > 0.
We will use this theorem to establish the coincidence result between the operators. Remark 3.12 According to [3, Theorem 9.7.2, Exercise 9.7.3], if S : X ⇒ X* is a maximally monotone operator of type (NI) (in particular, if the space is reflexive), then for every h ∈ H(S) the gap h(x, v) − ⟨x, v⟩ controls the distance from (x, v) to G(S); here d denotes the distance on X × X* defined by d((x, v), (y, w)) := (‖x − y‖² + ‖v − w‖²_*)^{1/2}. Combining this fact with Definition 3.1, we can see D_T^h(x, y) as providing us with an upper estimate of the distance between the sets {x} × T y and G(S). This result gives an alternative proof of Proposition 3.7(a).

The Linear Case
In the next result, we compute our distance induced by two linear, monotone and continuous operators A and B.
Proof (a) This follows from (1) and (3). Part (b) follows directly from part (a) and Fact 3.14(i) and (ii) for the operator A + instead of A. Part (c) follows directly from Fact 3.14(iii).
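The closed forms above can be made concrete in finite dimensions. The following sketch (an illustration, not the paper's Fact 3.14) assumes A is symmetric and positive definite, in which case the supremum defining F_A is attained at y = ½A⁻¹(Ax + v), giving F_A(x, v) = ¼⟨Ax + v, A⁻¹(Ax + v)⟩; for a linear T = B, the induced distance becomes a quadratic form:

```python
import numpy as np

# Assumption for this sketch: A symmetric positive definite, so that
#   F_A(x, v) = (1/4) <A x + v, A^{-1}(A x + v)>.
# With T = B linear, T y is the singleton {B y}, and algebra gives
#   D(x, y) = F_A(x, B y) - <x, B y> = (1/4) (A x - B y)^T A^{-1} (A x - B y).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 3.0]])
Ainv = np.linalg.inv(A)

def F_A(x, v):
    w = A @ x + v
    return 0.25 * w @ Ainv @ w

def D(x, y):
    v = B @ y
    return F_A(x, v) - x @ v

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    w = A @ x - B @ y
    assert abs(D(x, y) - 0.25 * w @ Ainv @ w) < 1e-9
    assert D(x, y) >= -1e-12          # the distance is nonnegative
```

In particular, D(x, y) vanishes exactly when Ax = By, i.e., when (x, By) lies on the graph of A.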

Continuity Properties
In this section we assume that X is a reflexive Banach space. Our aim is to establish lower semicontinuity properties of our distances. We show that D_T^h(·, y) and D_T^h(x, ·) are lsc w.r.t. the strong topology in the interior of the domains. On the other hand, D̄_T^h(·, y) is lsc w.r.t. the weak topology at every x ∈ D(S). We also provide two examples: one showing that D_T^h(x, ·) is not usc in general, and the other showing that D̄_T^h(x, ·) is not lsc in general.

Remark 3.16
In the next result, we use the Eberlein-Smulian theorem, which states that a subset of a Banach space is weakly compact if and only if it is weakly sequentially compact (see [14, Chapter III, page 18]). We also use the fact that enlargements are locally bounded at a point which is in the interior of their domains. This provides a neighbourhood of the reference point which is norm-closed and bounded, and hence weakly compact (by Bourbaki-Alaoglu's theorem and reflexivity). We then use the Eberlein-Smulian theorem to deduce that the given neighbourhood is in fact weakly sequentially compact. Since Lemmas 3.17 and 3.18 involve the strong topology in X, we can use sequences instead of nets.

Lemma 3.17
Assume that S : X ⇒ X * is maximally monotone and h ∈ H(S), and fix y ∈ D(T ).
(a) Let T : X ⇒ X* be such that T z is weakly closed for any z in its domain. Then the function D_T^h(·, y) : X → R∞ is lsc at every x ∈ int D(S) with respect to the strong topology in X.
(b) The function D̄_T^h(·, y) : X → R∞ is lsc at every x ∈ D(S) with respect to the strong topology in X.
Proof Assume (a) is not true. This means that there exist a ∈ R and a sequence {x_n} converging strongly to x such that D_T^h(x, y) > a and D_T^h(x_n, y) ≤ a. For n₀ large enough we have that

D_T^h(x_n, y) ≤ a < a + 1/n < D_T^h(x, y) for all n ≥ n₀.

The definition of D_T^h, together with the left-hand side of the above expression, implies that for each fixed n ≥ n₀ there exists v_n ∈ T y such that

h(x_n, v_n) − ⟨x_n, v_n⟩ < a + 1/n. (3.12)

By Remark 2.7, this implies that

v_n ∈ L^h(x_n, a + 1/n) ⊆ L^h(x_n, a + 1). (3.13)

Since x ∈ int D(S), we can use Fact 2.2(v) to deduce that the enlargement E(·, a + 1) := L^h(·, a + 1) is locally bounded at x. This implies the existence of two closed balls, denoted by B(x, r) ⊂ X and B₀ ⊂ X*, respectively, such that L^h(B(x, r), a + 1) ⊂ B₀. By Remark 3.16, B₀ is weakly sequentially compact. The latter fact and (3.13) imply that {v_n} ⊂ B₀ for n large enough, and hence there is a subsequence of {v_n} converging weakly to some vector v. Recalling now that {v_n} ⊂ T y and the set T y is weakly closed, we deduce that v ∈ T y. By reflexivity and Remark 2.7, the graph of L^h is closed for the strong-weak convergence. Taking limits for n tending to infinity in (3.13) yields v ∈ L^h(x, a). Taking limits in (3.12) (for the corresponding strong-weak convergent subsequence) and using the definition of D_T^h gives

D_T^h(x, y) ≤ h(x, v) − ⟨x, v⟩ ≤ a,

where we used the fact that v ∈ L^h(x, a) in the rightmost inequality. The above expression contradicts our assumption on a, completing the proof of (a).
Assume now that x ∈ D(S) and (b) is not true. For simplicity, write ψ(x) := D̄_T^h(x, y). The statement that ψ is not (strongly) lower semicontinuous at x means that there exist a < ψ(x) and a sequence {x_n} converging strongly to x such that ψ(x_n) ≤ a. Using the definition, this inequality implies that for every fixed v ∈ T y and all n we have

h(x_n, v) − ⟨x_n, v⟩ ≤ a.

Since h(·, v) is strongly lsc, ⟨·, v⟩ is continuous, and {x_n} converges strongly to x, the above inequality yields h(x, v) − ⟨x, v⟩ ≤ a. Since we can do this for every v ∈ T y, we deduce that D̄_T^h(x, y) ≤ a, contradicting our assumptions. Hence (b) holds.
The following result establishes lower semicontinuity of D_T^h(x, ·). This fact is not true for D̄_T^h(x, ·), as will be shown in Example 3.21.

Lemma 3.18
Assume that S : X ⇒ X* is maximally monotone, h ∈ H(S), and T : X ⇒ X* is locally bounded in the interior of its domain. Suppose also that the graph of T is closed w.r.t. the strong-weak topology. Fix y ∈ int D(T) and x ∈ D(S). Then the function D_T^h(x, ·) : X → R∞ is lsc at y with respect to the strong topology in X.
Proof Assume the claim is not true. Since we consider here the norm topology, this means that there exist a ∈ R and a sequence {y_n} converging strongly to y such that D_T^h(x, y) > a and D_T^h(x, y_n) ≤ a. From the second inequality, for all n we deduce the existence of v_n ∈ T y_n such that

h(x, v_n) − ⟨x, v_n⟩ < a + 1/n. (3.14)

Since y ∈ int D(T), T is locally bounded at y. Using now a similar argument as the one used in the proof of Lemma 3.17(a), we obtain a subsequence of {v_n} converging weakly to some vector v. By the strong-weak closedness of the graph of T, we deduce that v ∈ T y. Using the (strong-weak) lsc of h we can write

D_T^h(x, y) ≤ h(x, v) − ⟨x, v⟩ ≤ lim inf_n [h(x, v_n) − ⟨x, v_n⟩] ≤ a,

where we also used (3.14) in the last inequality. This expression contradicts the fact that D_T^h(x, y) > a, and therefore the claim on lower semicontinuity is true.
Example 3.20 below shows that D_T^h(x, ·) may fail to be usc. In both of the next examples, we make use of the following fact (for a proof, see [19]).

Fact 3.19
Assume that X is a Banach space and g : X → R is defined by g(x) := ‖x‖. Then ∂g(0) = B, where B is the closed unit ball of X*. For every y ≠ 0 we have ∂g(y) = {z ∈ B : ⟨y, z⟩ = ‖y‖}. If X is a Hilbert space, then for all y ≠ 0 we have

∂g(y) = {∇g(y)} = {y/‖y‖}. (3.15)

Example 3.20 Let X be a Hilbert space with dimension at least two, and let S := T := ∂g, with g as in Fact 3.19. It was proved in [5] (see also [18, Example 5]) that the set H(S) has only one element, which is then necessarily the Fitzpatrick function, given by

F_S(x, v) = ‖x‖ + δ_B(v),

with δ_B denoting the indicator function of the closed unit ball of X. Thus, for x, y ∈ X with y ≠ 0 we have

D_{S,T}(x, y) = inf_{v ∈ ∂g(y)} ‖x‖ + δ_B(v) − ⟨x, v⟩ = ‖x‖ − ⟨x, y/‖y‖⟩,

where we used (3.15) in the last equality, while D_{S,T}(x, 0) = ‖x‖ − sup_{v ∈ B} ⟨x, v⟩ = 0. If x ≠ 0 then D_{S,T}(x, ·) is not usc at 0, since for every sequence y_n ≠ 0 orthogonal to x and strongly converging to 0 one has D_{S,T}(x, y_n) = ‖x‖ > 0 = D_{S,T}(x, 0).
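The failure of upper semicontinuity in Example 3.20 can be checked numerically in R². The sketch below assumes the closed form D(x, y) = ‖x‖ − ⟨x, y/‖y‖⟩ for y ≠ 0 and D(x, 0) = 0 (an illustration of the example, not additional theory):

```python
import numpy as np

# Example 3.20 in R^2 with g(x) = ||x|| and h(x, v) = ||x|| + delta_B(v).
def D(x, y):
    if np.linalg.norm(y) == 0.0:
        # dg(0) is the whole unit ball, and sup_{v in B} <x, v> = ||x||
        return 0.0
    v = y / np.linalg.norm(y)          # dg(y) = {y/||y||} for y != 0
    return np.linalg.norm(x) - x @ v

x = np.array([1.0, 0.0])
# Along a sequence y_n orthogonal to x and tending to 0, D(x, y_n) = ||x|| > 0,
# while D(x, 0) = 0: upper semicontinuity fails at y = 0.
for n in range(1, 8):
    y_n = np.array([0.0, 1.0 / n])
    assert abs(D(x, y_n) - 1.0) < 1e-12
assert D(x, np.zeros(2)) == 0.0
```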

Example 3.21
Assume X is a Hilbert space and fix a nonzero x ∈ X. Take S = Id and T = ∂g, for g as in Fact 3.19. Using the second equality in (3.10) for A = Id we can write

D̄_T^{F_S}(x, y) = sup_{v ∈ T y} 2 sup_{z ∈ X} [⟨z, (x + v)/2⟩ − ‖z‖²/2] − ⟨x, v⟩.

Computing the supremum in the right-hand side, we obtain

D̄_T^{F_S}(x, y) = sup_{v ∈ T y} ‖x + v‖²/4 − ⟨x, v⟩.

Take now a nonzero sequence {y_n} converging to 0. Since y_n is never zero, we have from Fact 3.19 that T y_n = {y_n/‖y_n‖}, and hence we can write

D̄_T^{F_S}(x, y_n) = 2 sup_{z ∈ X} ⟨z, (x + y_n/‖y_n‖)/2⟩ − ‖z‖²/2 − ⟨x, y_n/‖y_n‖⟩.

As in the previous example, we take again the sequence {y_n} orthogonal to x and tending to zero, so the expression above yields

D̄_T^{F_S}(x, y_n) = (‖x‖² + 1)/4.

Noting that for every nonzero x we have (‖x‖² + 1)/4 < (‖x‖ + 1)²/4 = D̄_T^{F_S}(x, 0), we conclude that D̄_T^{F_S}(x, ·) is not lsc at y = 0.

A Characterization of Bregman Distances
In this section we focus on the classical Bregman distance as in (3.8). Our aim is to characterize the bifunctions D(·, ·) for which there exists a convex differentiable function h such that D = D_h. We say that a bifunction G : C × C → X*, with C ⊆ X, is additively separable when there exist two functions R and P such that G(x, y) = R(x) + P(y) for every x, y ∈ C.

Proof Clearly, properties (a)-(d) are satisfied by every Bregman distance. For (c), notice that ∇₁D_h(x, y) = ∇h(x) − ∇h(y). Conversely, assume that properties (a)-(d) hold. By (c), there exist two mappings R, U : C → X* such that

∇₁D(x, y) = R(x) + U(y) for every x, y ∈ C. (4.16)

From (d) it follows that R(x) + U(x) = 0 for every x ∈ C, hence (4.16) reduces to

∇₁D(x, y) = R(x) − R(y) for every x, y ∈ C. (4.17)

Fix y ∈ C and define h := D(·, y) + ⟨·, R(y)⟩. By (a), the function h is convex and differentiable. By (4.17), one has ∇h = R, since the expression

∇(D(·, y) + ⟨·, R(y)⟩)(x) = ∇₁D(x, y) + R(y) = R(x) (4.18)

depends only on x. Therefore the difference D(x, y) + ⟨x, R(y)⟩ − h(x) depends only on y; indeed, this follows from (4.18) and the equality ∇h = R.

We now give an alternative characterization. In its proof one uses the equality ∇₁D(x, y) + ∇h(y) = ∇(D(·, y) + ⟨·, ∇h(y)⟩)(x).
The above equality implies that
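The separability property used above, namely ∇₁D_h(x, y) = ∇h(x) − ∇h(y), can be verified numerically. A sketch with a hypothetical smooth convex function (for illustration only, not from the paper):

```python
import numpy as np

# For the (hypothetical) convex function h(x) = sum(x^4), the Bregman distance
# D_h(x, y) = h(x) - h(y) - <grad h(y), x - y> has first-slot gradient
# grad_1 D_h(x, y) = grad h(x) - grad h(y), an additively separable bifunction.
h = lambda x: np.sum(x**4)
grad_h = lambda x: 4.0 * x**3
D_h = lambda x, y: h(x) - h(y) - grad_h(y) @ (x - y)

def grad1_fd(x, y, eps=1e-6):
    # central finite-difference gradient of D_h in its first argument
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (D_h(x + e, y) - D_h(x - e, y)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
for _ in range(20):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert np.allclose(grad1_fd(x, y), grad_h(x) - grad_h(y), atol=1e-4)
```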