Imagine you are trying to learn the best route to work in a massive, foggy city. You don't have a perfect map, and the traffic patterns change every day. This is essentially what Reinforcement Learning (RL) does: it teaches an AI agent to make good decisions by trial and error.
A core part of this learning process is called Temporal Difference (TD) Learning. Think of TD learning as a student taking a daily quiz. Every day, the student guesses the "value" of a location (e.g., "Is this street usually fast or slow?"). At the end of the day, they get a little bit of feedback (a reward or a penalty) and update their guess for tomorrow. Over time, their guesses get better.
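To make the "daily quiz" concrete, here is a minimal sketch of the classic tabular TD(0) update (illustrative pseudocode in Python, not the paper's exact setup; the state names and numbers are made up):

```python
# Tabular TD(0): nudge the value estimate V[s] toward the observed
# one-step target: reward + gamma * V[s_next].

def td0_update(V, s, reward, s_next, alpha, gamma=0.9):
    """One TD(0) step with step size alpha."""
    td_error = reward + gamma * V[s_next] - V[s]  # how wrong today's guess was
    V[s] = V[s] + alpha * td_error                # update the guess for tomorrow
    return V

# Example: the student's guess for "Main St" moves partway toward the
# observed outcome (alpha = 0.5 means "move halfway").
V = {"Main St": 0.0, "Oak Ave": 2.0}
V = td0_update(V, "Main St", reward=1.0, s_next="Oak Ave", alpha=0.5)
```

The `alpha` parameter here is exactly the "step size" the next section is about.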
The Problem: The "Goldilocks" Dilemma
For a long time, making this learning process work well required a very specific, tricky ingredient: the Step Size.
Think of the step size as how much the student adjusts their guess after each piece of feedback.
- Too big: They overshoot the truth, swinging wildly back and forth like a drunk person trying to walk a straight line.
- Too small: They inch forward so slowly that they never learn anything useful before the day is over.
- Just right: They learn efficiently.
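You can see all three regimes in a toy one-dimensional example (not from the paper): repeatedly updating a guess `x` toward a target of 10.0 with the rule `x <- x + alpha * (target - x)`.

```python
# Toy illustration of the step-size trade-off.
# The update x <- x + alpha * (target - x) converges only if 0 < alpha < 2.

def run(alpha, steps=20, target=10.0):
    x = 0.0
    for _ in range(steps):
        x += alpha * (target - x)
    return x

too_small  = run(alpha=0.01)  # inches forward: still far from 10 after 20 steps
just_right = run(alpha=0.5)   # lands essentially on 10
too_big    = run(alpha=2.1)   # overshoots more each step and swings away from 10
```

With `alpha=2.1` every update overshoots the target by more than the previous error, so the iterates oscillate with growing amplitude instead of settling.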
The problem is that finding the "just right" step size usually requires knowing secret details about the city that the student doesn't have yet. For example, you need to know:
- The "Mixing Time": How long does it take for the traffic to settle into a normal pattern? (Is it chaotic for 5 minutes or 5 hours?)
- The "Eigenvalue": A mathematical measure of how "stable" or well-conditioned the learning problem is. Classical step-size tuning needs the smallest eigenvalue of the problem's underlying matrix.
In the real world, you can't know these numbers beforehand. You have to guess. If you guess wrong, the algorithm fails. This is the "Parameter-Free" problem the paper tries to solve.
The Solution: The "Exponential Decay" Strategy
The authors propose a clever, simple trick: Don't guess a fixed step size. Instead, take huge steps at the beginning and shrink them exponentially as you go.
Imagine you are running a race.
- Start: You sprint. You don't care about precision; you just want to get a feel for the terrain. You take massive, bold steps.
- Middle: You slow down to a jog. You start noticing the details.
- End: You walk very carefully, making tiny, precise adjustments to hit the finish line exactly.
The paper calls this an Exponential Step-Size Schedule. It's like a self-adjusting thermostat that automatically cools down as the room gets comfortable.
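One simple way to realize such a schedule (a sketch with illustrative constants; the paper's exact schedule and decay rate may differ) is to start with a large step size and cut it by a constant factor every fixed number of steps:

```python
# Exponentially decaying step-size schedule (sketch).
# Start at alpha0 and multiply by `decay` once per `horizon` steps,
# so the step size shrinks geometrically: sprint, jog, walk.

def exp_step_size(t, alpha0=1.0, horizon=1000, decay=0.5):
    """Step size at iteration t: alpha0 * decay^(t // horizon)."""
    return alpha0 * decay ** (t // horizon)
```

The key point is that nothing in this schedule depends on unknown quantities like the mixing time or an eigenvalue; it only needs an iteration counter.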
Two Scenarios: The Ideal vs. The Real World
The paper tests this idea in two different "worlds":
1. The "Magic Library" (i.i.d. Sampling)
Imagine a library where every book you pick is completely random and independent of the last one. This is the "Ideal World."
- The Old Way: Previous methods required you to know exactly how many books were in the library to set your reading speed.
- The New Way: The authors show that if you just start fast and slow down exponentially, you learn the best route perfectly without needing to know the library's size. You get the best possible result on your very last guess, not just an average of all your guesses.
2. The "Chaotic City" (Markovian Sampling)
This is the Real World. Here, your next step depends entirely on where you are right now. If you are in a traffic jam, the next minute is also likely a traffic jam. The data is "sticky" and correlated.
- The Old Way: To handle this chaos, previous methods had to use "training wheels" (mathematical projections) or throw away data (data dropping) to pretend the chaos didn't exist. They also needed to know the "Mixing Time" (how long the traffic jam lasts) to set their step size.
- The New Way: The authors introduce a Regularized TD method. Think of this as adding a tiny bit of "friction" or "damping" to the student's brain. This friction prevents them from getting too crazy when the traffic jams happen.
- Result: The algorithm learns the best route through the chaotic city without needing to know how long the traffic jams last. It doesn't need to throw away data, and it doesn't need to use complex "training wheels."
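The "friction" idea can be sketched as a small extra term in the linear TD(0) update that gently pulls the parameters back toward zero. This is a generic regularized-TD sketch, not the paper's exact algorithm; the feature vectors, `lam`, and all constants below are illustrative assumptions:

```python
import numpy as np

# Linear TD(0) with an extra -lam * w "friction" term (sketch).
# The damping keeps the iterates from blowing up under correlated
# (Markovian) samples, without projections or data dropping.

def regularized_td_update(w, phi_s, phi_next, reward, alpha, lam, gamma=0.9):
    """One regularized TD(0) step on weight vector w."""
    td_error = reward + gamma * phi_next @ w - phi_s @ w
    return w + alpha * (td_error * phi_s - lam * w)  # lam * w is the friction

# Example with made-up 3-dimensional features.
w = np.zeros(3)
phi_s = np.array([1.0, 0.0, 0.5])
phi_next = np.array([0.0, 1.0, 0.5])
w = regularized_td_update(w, phi_s, phi_next, reward=1.0, alpha=0.1, lam=0.01)
```

With `lam = 0` this reduces to plain linear TD(0); a small positive `lam` is the damping described above.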
Why This Matters
- No More Guessing: You don't need to be a mathematician to tune this algorithm. It adapts automatically.
- Last-Iterate Guarantee: Most old methods said, "If you average all your guesses over the last 100 days, you'll be good." This paper says, "Your final guess on the last day will be the best one." This is crucial for real-time applications where you can't wait to average things out.
- Simplicity: It uses a standard algorithm (TD(0)) with just a simple change in how the step size shrinks. No complex projections or data dropping.
The Bottom Line
The paper is like a mechanic telling you: "Stop trying to calibrate your car's engine with a complex manual that requires you to know the exact temperature of the air. Just install this new governor that automatically slows the engine down as it warms up. It works better, it's safer, and you don't need to know anything about the weather to drive it."
They have made Reinforcement Learning more robust, practical, and user-friendly by removing the need for impossible-to-know constants.