Mar 26, 2026 Theory 3 papers

Theory Digest — Mar 26, 2026

Today’s Digest at a Glance

Today’s digest explores advanced regret bounds and implicit bias phenomena across reinforcement learning, nuclear norm optimization, and bandit feedback settings.

Doubling Episodes for Online Learning

A key technique appearing in today’s MDP paper is the doubling episode framework for adaptive algorithm design. Traditional fixed-horizon approaches suffer from the need to pre-specify episode lengths, which either wastes samples (if too short) or delays learning (if too long). The doubling episodes technique dynamically adapts episode lengths based on observed data complexity.

The core idea maintains visit counts $N^k(s,a)$ for each state-action pair and starts a new episode $k+1$ whenever any count doubles: $N^{k+1}(s,a) \geq 2N^k(s,a)$. This ensures that episode lengths grow exponentially with the intrinsic difficulty of the problem, measured by how quickly the algorithm can distinguish between different transition probabilities.

Mathematically, this creates a natural filtration where episode $k$ ends at time $T_k = \inf{t > T_{k-1} : \exists (s,a) \text{ s.t. } N_t(s,a) \geq 2N_{T_{k-1}}(s,a)}$. The key insight is that this adaptive schedule automatically balances exploration and exploitation - early episodes are short when uncertainty is high, while later episodes become longer as confidence improves.

Nuclear Norm Constrained Optimization

The nuclear norm constraint optimization appearing in today’s multiclass learning paper addresses the challenge of promoting low-rank solutions in matrix learning problems. While $\ell_2$ regularization shrinks parameters uniformly, many machine learning problems benefit from solutions with inherent low-rank structure, such as collaborative filtering or multi-task learning.

The nuclear norm $|W|_* = \sum_i \sigma_i(W)$ (sum of singular values) serves as the convex relaxation of the rank function, analogous to how $\ell_1$ norm relaxes $\ell_0$ sparsity. However, optimizing under nuclear norm constraints requires specialized techniques because the constraint set is not as simple as an $\ell_p$ ball.

The Normalized Steepest Descent (NSD) framework handles this by solving the constrained optimization subproblem $\Delta_t = \arg\max_{|\Delta|_* \leq \gamma} \langle M_t, \Delta \rangle$ at each step, where $M_t$ is the momentum-averaged gradient. This requires computing the projection onto the nuclear norm ball, which can be done efficiently via SVD and soft-thresholding of singular values. The resulting algorithm (NucGD) implicitly biases toward low-rank solutions even when the data is separable.

Two-Point Bandit Feedback Gradient Estimation

The two-point feedback technique in today’s convex optimization paper addresses the fundamental challenge of gradient estimation when only function values (not gradients) are observable. Standard one-point methods using $g_t = \frac{d}{\alpha}[\ell_t(x_t + \alpha u_t) - \ell_t(x_t)]u_t$ suffer from high variance and poor concentration properties.

The two-point approach samples $u_t$ uniformly from the unit sphere and constructs the estimator $g_t = \frac{d(\ell_t(x_t + \alpha u_t) - \ell_t(x_t - \alpha u_t))}{2\alpha} u_t$. This symmetric difference reduces the variance by canceling out the zero-order terms, leaving only the linear component that captures the true gradient direction.

The key mathematical insight is that $\mathbb{E}[g_t] = \nabla \ell_t(x_t) + O(\alpha^2)$ with much better concentration properties than one-point methods. For strongly convex functions with parameter $\mu$, this enables $O(d(\log T + \log(1/\delta))/\mu)$ high-probability regret bounds, resolving a long-standing open question about the achievable rates with limited feedback.

Reading Guide

The first paper introduces FOCUS, demonstrating how doubling episodes can achieve optimal variance-dependent regret in infinite-horizon MDPs by automatically adapting to problem complexity. The second paper applies nuclear norm constraints within the NSD framework to characterize implicit bias in multiclass learning, showing how optimization geometry affects solution structure. The third paper uses two-point gradient estimation to finally achieve optimal high-probability bounds for strongly convex bandit optimization, completing a theoretical picture that remained open since 2010.

Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Authors: Guy Zamir, Matthew Zurek, Yudong Chen · Institution: University of Wisconsin-Madison · Category: cs.LG

First algorithm achieving optimal variance-dependent regret bounds for infinite-horizon MDPs, with tight characterization of bias span dependence and fundamental separation between algorithms with and without prior knowledge.

Tags: reinforcement learning theory infinite-horizon MDPs variance-dependent regret average-reward MDPs regret bounds burn-in cost bias span online learning

arXiv · PDF

Problem Formulation

Motivation: Online reinforcement learning in infinite-horizon MDPs is less understood than episodic settings, with existing algorithms suffering from high “burn-in” costs and failing to adapt to problem-specific structure like deterministic environments.

Mathematical setup: Consider a Markov Decision Process $(S, A, P, r, \mu_0)$ where $S$ is finite state space with $S =

$, $A$ is finite action space with $A =

$, $P: S \times A \to \Delta(S)$ is transition kernel, $r: S \times A \to [0,1]$ is reward function, and $\mu_0 \in \Delta(S)$ is initial distribution. The agent interacts for $T$ steps, observing trajectory $(s_1, a_1, \ldots, s_T, a_T)$ where $s_1 \sim \mu_0$, $a_t$ is chosen by the learner, and $s_{t+1} \sim P(\cdot

s_t, a_t)$.

Key assumptions:

The MDP is weakly communicating (for average-reward setting)
State and action spaces are finite
Rewards are bounded in $[0,1]$
Agent has access to states and rewards but not transition probabilities

The optimal bias function $h^*$ satisfies $|h^*|_{sp} < \infty$ where $|\cdot|_{sp}$ is the span semi-norm.

Toy example: When $S = 2, A = 2$ and the MDP is deterministic, the cumulative variance $\text{Var}_\gamma^* = 0$, so regret becomes $O(|h^*|_{sp} SA \log T)$ independent of $T$.

Formal objective: Minimize the average-reward regret
\[\text{Regret}(T) := \sum_{t=1}^T (\rho^* - r(s_t, a_t))\]
or the $\gamma$-regret
\[\text{Regret}_\gamma(T) := \sum_{t=1}^T ((1-\gamma)V_\gamma^*(s_t) - r(s_t, a_t))\]

Method

FOCUS (Fully Optimizing Clipped UCB Solver) is a model-based algorithm using doubling episodes and optimistic value iteration.

Key steps:

Maintain empirical transition estimates $\hat{P}^k_{s,a,s’}$ and visit counts $N^k(s,a)$
Start new episode when any state-action pair’s count doubles: $N(s,a) = 2^i$
At episode $k$, compute optimistic Q-function by iterating the clipped empirical Bellman operator until convergence:
\[\hat{T}_k(Q)(s,a) = r(s,a) + \gamma \hat{P}^k_{s,a}(\text{Clip}_H(MQ)) + \gamma b(s,a,\text{Clip}_H(MQ))\]
where the bonus is
\[b(s,a,V) = \max\left\{4\sqrt{\frac{V(\hat{P}^k_{s,a}, V)U}{\max\{N^k(s,a),1\}}}, \frac{32HU}{\max\{N^k(s,a),1\}}\right\}\]
Take greedy actions: $a_t \in \arg\max_{a} \hat{Q}^k(s_t, a)$

The clipping operator is $(\text{Clip}_H(V))(s) = \min{V(s), \min_{s’} V(s’) + H}$.

Applied to toy example: In a $2$-state deterministic MDP, the variance bonus becomes zero after one visit to each state-action pair, and the algorithm achieves $O(|h^*|_{sp})$ regret independent of $T$.

Novelty & Lineage

Prior work:

“UCRL2” (Auer et al. 2008): achieved $\tilde{O}(DS\sqrt{AT})$ regret with diameter $D$, but suboptimal and high burn-in.
“PMEVI-DT” (Boone & Zhang 2024): first tractable minimax-optimal algorithm for average-reward, but burn-in cost $|h^*|_{sp}^{10}S^{40}A^{20}$.
“UCBVI-$\gamma$” (He et al. 2021): minimax-optimal $\gamma$-regret but with suboptimal dependence on $\frac{1}{1-\gamma}$.

Delta: This paper provides the first variance-dependent regret bounds for infinite-horizon MDPs, achieving regret $\tilde{O}(\sqrt{\text{Var}^*_\gamma SA} + \text{lower-order terms})$ where $\text{Var}^*_\gamma = \sum_{t=1}^T V(P_{s_t,a_t}, V^*_\gamma)$ captures cumulative transition variance.

Theory-specific assessment:
- Main theorem is somewhat predictable given recent episodic variance-dependent results, but the extension to infinite-horizon with optimal span dependence required genuine technical innovation
- Proof technique of “full optimization” (iterating Bellman operator to convergence rather than one-step updates) is genuinely new for this setting and necessary to remove $\frac{1}{1-\gamma}$ factors
- Bounds are tight: upper bounds match lower bounds for bias span dependence in both prior-knowledge and prior-free settings
- Lower bound construction (Theorem 3.8) showing fundamental gap between algorithms with/without prior knowledge is novel and surprising
Verdict: SIGNIFICANT — First optimal variance-dependent bounds for infinite-horizon RL with novel “full optimization” technique and tight characterization of bias span dependence.

Proof Techniques

The proof uses several key techniques:

Full Optimization Analysis: Unlike prior work using one-step value iteration, FOCUS iterates the empirical Bellman operator $\hat{T}_k$ until convergence. This avoids factors of $\frac{1}{1-\gamma}$ by ensuring value estimates fully exploit available data. Key inequality controlling trajectory-dependent terms:
\[\sum_{t=1}^T (\hat{V}_{k(t)}(s_{t+1}) - \hat{V}_{k(t)}(s_t)) \leq \sum_{k=1}^m H \leq \tilde{O}(HSA)\]
Sharp Bernstein-style Bonus: The confidence width uses variance-dependent bonus:
\[b(s,a,V) = \max\left\{4\sqrt{\frac{V(\hat{P}^k_{s,a}, V)U}{\max\{N^k(s,a),1\}}}, \frac{32HU}{\max\{N^k(s,a),1\}}\right\}\]
Span Clipping: The operator $\text{Clip}_H$ ensures $|\hat{V}|_{sp} \leq H$, replacing factors of $\frac{1}{1-\gamma}$ with problem-dependent parameter $H$.
Cumulative Variance Control: Key lemma bounding the variance parameter:
\[\text{Var}^*_\gamma \leq O(\|V^*_\gamma\|_{sp}T + \|V^*_\gamma\|_{sp}^2 \log(T/\delta))\]
Average-to-Discount Reduction: For average-reward setting, uses reduction:
\[\text{Regret}(T) \leq (1-\gamma)\|V^*_\gamma\|_{sp}T + \text{Regret}_\gamma(T)\]
Setting $\gamma = 1 - 1/T$ makes the first term negligible.
Lower Bound Construction: Two-MDP construction where algorithms must explore to distinguish between high-span and low-span instances, leading to $\Omega(|h^*|_{sp}^2 SA)$ burn-in cost without prior knowledge.

Experiments & Validation

Purely theoretical. Empirical validation would test:

Variance-dependent regret on synthetic MDPs with varying stochasticity levels
Comparison of burn-in costs against PMEVI-DT and other baselines
Performance on deterministic vs. stochastic environments to verify $O(1)$ vs. $O(\sqrt{T})$ scaling
Computational efficiency verification of the $O(S^3A^2T)$ runtime claim

Limitations & Open Problems

Limitations:

Lower-order terms contain factor $\Gamma = \max_{s,a}

\text{supp}(P(\cdot

s,a))

$ - TECHNICAL (likely removable but challenging)

Algorithm requires knowledge of time horizon $T$ to set $\gamma = 1-1/T$ - TECHNICAL (could use doubling on time horizon)
Weakly communicating assumption for average-reward bounds - NATURAL (standard in the field)
Computational complexity $O(S^3A^2T)$ is higher than some EVI-based methods - TECHNICAL (due to full optimization)

Open problems:
Remove the $\Gamma$ factor from lower-order terms, as recently done in episodic settings
Extend to function approximation settings while maintaining variance-dependent guarantees

Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints

Authors: Shengping Xie, Zekun Wu, Quan Chen, Kaixu Tang · Institution: Peking University · Category: cs.LG

Extends normalized steepest descent framework to nuclear norm constraints, deriving NucGD algorithm that implicitly promotes low-rank solutions on multiclass separable data.

Tags: implicit-bias optimization-theory nuclear-norm low-rank-methods multiclass-classification steepest-descent matrix-optimization power-iteration

arXiv · PDF

Problem Formulation

Motivation: Understanding implicit bias in overparameterized models is crucial for generalization theory in deep learning. This work studies how different optimization geometries shape solutions on multiclass separable data, specifically investigating nuclear norm constraints to promote low-rank structures.

Mathematical setup: Consider multiclass classification with training data

\[(x_i, y_i)_{i=1}^n\]

where

\[x_i \in \mathbb{R}^d\]

and

\[y_i \in [k]\]

. The linear classifier is

\[f_W(x) = Wx\]

with weight matrix

\[W \in \mathbb{R}^{k \times d}\]

. Training uses cross-entropy loss:

\[L(W) = -\frac{1}{n} \sum_{i=1}^n \log S_{y_i}(Wx_i)\]

Assumptions:

Data is linearly separable:
\[\exists W : m(W) > 0\]
Each class contains at least one data point
Standard regularity conditions on the loss function

The SVM margin is defined as:
\[m(W) = \min_{i \in [n], y \neq y_i} (w_{y_i}^T x_i - w_y^T x_i)\]
The relative margin under norm
\[\|\cdot\|_{\bullet}\]
is:
\[m_{\bullet}(W) = \frac{m(W)}{\|W\|_{\bullet}}\]
Toy example: When
\[k=3\]
,
\[d=2\]
, and data points are
\[x_1=(1,0), x_2=(0,1), x_3=(-1,-1)\]
with labels
\[y_1=1, y_2=2, y_3=3\]
, the nuclear norm promotes rank-1 solutions
\[W = uv^T\]
while Frobenius norm allows full-rank
\[W\]
.

Formal objective: Find the maximum margin solution:
\[W_{\bullet} = \arg\max_{\|W\|_{\bullet}=1} m(W)\]

Method

The method uses Normalized Steepest Descent (NSD) under nuclear norm constraints. The algorithm alternates between:

Compute gradient:
\[G_t = \nabla L(W_t)\]
Update momentum:
\[M_t = \mu M_{t-1} + (1-\mu)G_t\]
Solve constrained optimization:
\[\Delta_t = \arg\max_{\|\Delta\|_* \leq \gamma} \langle M_t, \Delta \rangle\]
Update weights:
\[W_{t+1} = W_t - \Delta_t\]
Key theorem: For matrix
\[M = \sum_{i=1}^r \sigma_i u_i v_i^T\]
with SVD, the solution is:
\[\Delta_t = \gamma u_1 v_1^T\]
Efficient implementation: To avoid full SVD, use power iteration:
\[p_t = \frac{M_t M_t^T p_{t-1}}{\|M_t M_t^T p_{t-1}\|_2}\]
Then compute:
\[\Delta_t = \gamma \frac{p_t p_t^T M_t}{\|p_t^T M_t\|_2}\]
Application to toy example: For the 3×2 example, if
\[M_t\]
has dominant singular vector
\[u_1, v_1\]
, then NucGD updates
\[W_{t+1} = W_t - \gamma u_1 v_1^T\]
, progressively building a rank-1 solution aligned with the leading singular direction.

Novelty & Lineage

Prior work:

Soudry et al. (2018): Proved GD on binary separable data converges to max ℓ2-margin solution
Fan et al. (2025): Extended to multiclass via NSD framework, covering ℓ2, ℓ∞, spectral norms
Ravi et al. (2024): Direct multiclass extension of Soudry’s result for ℓ2 case

Delta: This paper adds nuclear norm to the NSD framework, deriving NucGD algorithm that promotes low-rank solutions. Key contributions:
theoretical extension to nuclear norm geometry
efficient SVD-free implementation via power iteration
empirical analysis of stochastic effects.

Theory-specific assessment:
- Main result is predictable extension of Fan et al.’s framework to nuclear norm
- Proof technique is routine application of nuclear/spectral norm duality
- No comparison with known lower bounds (none established for this setting)
- The power iteration convergence analysis (Theorem 3) is standard but provides practical value
The connection to low-rank projection methods like GaLore is interesting but largely empirical rather than providing new theoretical insight.

Verdict: INCREMENTAL — Solid extension of existing NSD framework to nuclear norm with practical implementation, but the theoretical advance is straightforward application of known duality results.

Proof Techniques

The main theoretical contributions use standard techniques:

Nuclear norm steepest descent direction: Uses duality between nuclear and spectral norms:
\[\langle M, \Delta \rangle \leq \|M\|_2 \cdot \|\Delta\|_* = \sigma_1 \|\Delta\|_*\]
The maximum is achieved by:
\[\Delta^* = \gamma u_1 v_1^T\]
Rank-1 update representation: Key insight via Remark 1:
\[u_1 v_1^T = \frac{u_1 u_1^T M}{\|u_1^T M\|_2}\]
This connects the nuclear norm update to left multiplication by dominant eigenvector projection.
Power iteration convergence: Uses contraction analysis with spectral gap assumption. For eigenvector estimate
\[p_t\]
:
\[\tan \angle(p_t, u_t) \leq \rho^2 \tan \angle(p_{t-1}, u_t)\]
where
\[\rho = \sigma_2/\sigma_1 < 1\]
is the spectral gap ratio.
Implicit bias convergence: Follows Fan et al.’s framework directly - shows normalized margin converges to maximum:
\[m_*(W_*) - m_*(W_T) \leq O\left(\frac{\log T}{T^{1/2}}\right)\]
The techniques are all standard applications of matrix analysis and convex optimization theory. The power iteration analysis is the most technical contribution but uses well-known contraction arguments.

Experiments & Validation

Datasets: Synthetic multiclass data with k=15 classes, d=25 features, 50 points per class, noise σ=0.1. Data generated by sampling class centers from N(0,I) then points from N(c_y, σ²I).

Baselines: SignGD (ℓ∞), NGD (ℓ2), Muon (spectral), NucGD (nuclear norm).

Key results:

NucGD achieves highest correlation (0.95+) with nuclear norm max-margin solution
Nuclear norm max-margin has ~4 near-zero singular values vs. flatter spectra for other norms
Mini-batch effects: smaller batches can improve alignment for NucGD but hurt Muon stability
Momentum effects: higher μ increases correlation but decreases normalized margin in stochastic setting

Metrics:
- Correlation: cosine similarity between learned W and theoretical W*
- Margin error: relative error in normalized margin values
All experiments on toy synthetic data. No evaluation on realistic datasets or comparison with practical low-rank methods like GaLore in actual training scenarios.

Limitations & Open Problems

Limitations:

Linear model restriction - RESTRICTIVE (significantly limits applicability to deep learning)
Separable data assumption - RESTRICTIVE (rarely satisfied in practice)
Synthetic experimental validation only - RESTRICTIVE (no real-world datasets tested)
Power iteration requires spectral gap assumption - TECHNICAL (σ₂/σ₁ < ρ < 1 uniformly)
Single step power iteration may be insufficient - TECHNICAL (theoretical guarantee needs stronger assumptions)

Open problems:
Extension to deep networks: Can nuclear norm implicit bias persist through nonlinear layers? Current theory limited to linear case.
Practical convergence rates: What are the finite-sample convergence guarantees for the SVD-free implementation under realistic stochastic conditions?

Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback

Authors: Haishan Ye · Institution: Xi’an Jiaotong University · Category: cs.LG

Resolves 14-year-old open problem by providing first high-probability regret bound O(d(log T + log(1/δ))/μ) for strongly convex OCO with two-point bandit feedback.

Tags: online-convex-optimization bandit-optimization high-probability-bounds concentration-inequalities regret-analysis strongly-convex-optimization

arXiv · PDF

Problem Formulation

Motivation: Online Convex Optimization (OCO) with bandit feedback is central to machine learning theory. Two-point bandit feedback allows gradient estimation but achieving tight high-probability regret bounds for strongly convex functions remained open since Agarwal et al. (2010).
Mathematical setup: Let $K \subseteq \mathbb{R}^d$ be a convex, compact constraint set with $K \subseteq DB$ where $B$ is the unit ball. At round $t$, player chooses $x_t \in K$ and adversary selects convex loss $\ell_t : K \to \mathbb{R}$. Player observes only $\ell_t(x_t + \alpha u_t)$ and $\ell_t(x_t - \alpha u_t)$ where $u_t$ is uniform on unit sphere.

Assumptions:
Each $\ell_t$ is $G$-Lipschitz: $ \ell_t(x) - \ell_t(y) \leq G|x-y|$
Each $\ell_t$ is $\mu$-strongly convex: $\ell_t(x) \geq \ell_t(y) + \nabla\ell_t(y)^T(x-y) + \frac{\mu}{2}|x-y|^2$

The gradient estimator is:
\[g_t = \frac{d(\ell_t(x_t + \alpha u_t) - \ell_t(x_t - \alpha u_t))}{2\alpha} u_t\]
Toy example: When $d=2$, $K = B$, $\alpha = 0.1$, and $u_t = (1,0)^T$, the estimator becomes $g_t = 10(\ell_t((x_t^{(1)} + 0.1, x_t^{(2)})) - \ell_t((x_t^{(1)} - 0.1, x_t^{(2)})))(1,0)^T$. The core difficulty is that $g_t$ has heavy tails despite being unbiased for the smoothed gradient.
Formal objective: Minimize regret with probability $1-\delta$:
\[R_T = \sum_{t=1}^T \ell_t(x_t) - \min_{x \in K} \sum_{t=1}^T \ell_t(x)\]

Method

The method uses Online Gradient Descent with two-point feedback (Algorithm 1):

At round $t$, sample $u_t$ uniformly from unit sphere
Query loss at $x_t + \alpha u_t$ and $x_t - \alpha u_t$
Construct gradient estimator:
\[g_t = \frac{d(\ell_t(x_t + \alpha u_t) - \ell_t(x_t - \alpha u_t))}{2\alpha} u_t\]
Update with projection onto shrunk domain:
\[x_{t+1} = \Pi_{(1-\xi)K}(x_t - \eta_t g_t)\]
Parameters: $\eta_t = \frac{2}{\mu t}$, $\alpha = \sqrt{\frac{dG^2\log(1/\delta)}{T\mu}}$, $\xi = \frac{(d+1)\log(1/\delta)}{T}$

Applied to toy example: With $d=2$, $\mu=1$, $G=1$, $T=100$, $\delta=0.1$, we get $\alpha \approx 0.21$, $\eta_t = 2/t$, $\xi \approx 0.07$. The estimator becomes $g_t = 4.76(\ell_t(x_t + 0.21u_t) - \ell_t(x_t - 0.21u_t))u_t$ and updates project onto $0.93K$.

Novelty & Lineage

Step 1 — Prior work:

Agarwal et al. (2010): “Optimal algorithms for online convex optimization with multi-point bandit feedback” achieved $O(d^2 \log T)$ regret in expectation, left high-probability bounds open
Shamir (2017): “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback” achieved $O(d \log T)$ regret in expectation using sophisticated geometric techniques
Hazan et al. (2007): “Logarithmic regret algorithms for online convex optimization” established $O(\log T)$ regret for full-information strongly convex OCO

Step 2 — Delta: This paper provides the first high-probability regret bound of $O(d(\log T + \log(1/\delta))/\mu)$ for two-point bandit feedback, resolving the open problem from Agarwal et al. (2010). Key technical contribution is extending Shamir’s expectation-based techniques to high-probability regime using novel concentration analysis.

Step 3 — Theory-specific assessment:

Main theorem addresses a 14-year-old open problem, but the result seems like a natural extension once proper concentration tools are applied
Proof technique combines Freedman’s inequality with careful variance control - solid but not revolutionary
Bounds match known minimax lower bounds in $T$ and $d$, achieving optimal rates

Verdict: SIGNIFICANT — resolves a long-standing open problem with tight bounds, though the technical approach follows expected paths.

Proof Techniques

The proof strategy has three key stages:

Bias-variance decomposition with careful step size choice: Uses step size $\eta_t = \frac{2}{\mu t}$ (instead of standard $\frac{1}{\mu t}$) to control the telescoping sum:
\[\sum_{t=1}^T \frac{\|x_t - x\|^2 - \|x_{t+1} - x\|^2}{2\eta_t} - \frac{\mu}{2}\sum_{t=1}^T \|x_t - x\|^2\]
High-probability control of bias term via Freedman’s inequality: Key insight is bounding $\sum_{t=1}^T \langle \nabla\hat{\ell}_t(x_t) - g_t, x_t - x \rangle$ using:
\[E[|\langle \nabla\hat{\ell}_t(x_t) - g_t, x_t - x \rangle|^2 | x_t] \leq c_0 dG^2 \|x_t - x\|^2\]
Applying Freedman with $R = 2(d+1)GD$ and variance proxy $\sigma^2 = c_0dG^2 \sum_{t=1}^T |x_t - x|^2$ gives:
\[\sum_{t=1}^T \langle \nabla\hat{\ell}_t(x_t) - g_t, x_t - x \rangle \leq \frac{\mu}{4}\sum_{t=1}^T \|x_t - x\|^2 + O\left(\frac{dG^2\log(1/\delta)}{\mu}\right)\]
High-probability bound on gradient norm sum: Extends Shamir’s technique using subgaussian concentration. Shows $\ell_t(x_t + \alpha u_t) - E[\ell_t(x_t + \alpha u_t)]$ is subgaussian with parameter $O(G\alpha/\sqrt{d})$, leading to:
\[\sum_{t=1}^T \frac{\|g_t\|^2}{\mu t} \leq O\left(\frac{dG^2(\log T + \log(1/\delta))}{\mu}\right)\]

Experiments & Validation

Purely theoretical. Empirical validation would require comparing against:

full-information OGD baseline
one-point bandit methods
other multi-point bandit algorithms on strongly convex functions. Key metrics would be actual regret vs. theoretical bounds across different dimensions and time horizons.

Limitations & Open Problems

Limitations:

Strong convexity assumption - RESTRICTIVE (excludes many ML objectives like hinge loss, logistic regression)
Compact constraint set assumption - NATURAL (standard in OCO theory)
Lipschitz continuity assumption - NATURAL (mild regularity condition)
Two function evaluations per round - TECHNICAL (could potentially extend to fewer queries)
Adversarial setting only - TECHNICAL (stochastic case might allow better rates)

Open problems:
Extend to non-strongly convex functions with optimal $O(\sqrt{T})$ high-probability regret
Achieve similar results with only one function evaluation per round