The STHTD-MP method replaces the standard feature-covariance metric with the symmetric part of the behavior-policy Bellman matrix. This change improves the update geometry of the primal-dual saddle-point formulation. Building the metric from behavior-policy transition information stabilizes gradient temporal-difference (GTD) learning, and the resulting method converges faster in off-policy prediction while using a single learning rate.
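To make the change of metric concrete, the sketch below computes both candidate metrics for a toy three-state chain. It is an illustration under stated assumptions, not the method's reference implementation. The text does not define the Bellman matrix, so the code assumes the usual linear TD form $A_b = \Phi^\top D_b (I - \gamma P_b)\Phi$, where $P_b$ is the behavior-policy transition matrix and $D_b$ holds its stationary distribution on the diagonal. The function names, the toy chain, the features, and the discount factor are all hypothetical.

```python
import numpy as np


def stationary_distribution(P):
    """Solve d^T P = d^T with sum(d) = 1 (assumes an ergodic chain)."""
    n = P.shape[0]
    lhs = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    rhs = np.concatenate([np.zeros(n), [1.0]])
    d, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return d


def behavior_bellman_matrix(Phi, P_b, gamma):
    """Assumed definition: A_b = Phi^T D_b (I - gamma * P_b) Phi."""
    D_b = np.diag(stationary_distribution(P_b))
    return Phi.T @ D_b @ (np.eye(P_b.shape[0]) - gamma * P_b) @ Phi


def symmetric_part(A):
    """Symmetric part (A + A^T) / 2 of a square matrix."""
    return 0.5 * (A + A.T)


def feature_covariance(Phi, P_b):
    """Feature-covariance metric C = Phi^T D_b Phi, as used by GTD2-style updates."""
    return Phi.T @ np.diag(stationary_distribution(P_b)) @ Phi


# Hypothetical toy problem: 3 states, 2 linearly independent features.
P_b = np.array([[0.5, 0.5, 0.0],
                [0.1, 0.6, 0.3],
                [0.2, 0.2, 0.6]])
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
gamma = 0.9

S_b = symmetric_part(behavior_bellman_matrix(Phi, P_b, gamma))
C = feature_covariance(Phi, P_b)

print("feature covariance C:\n", C)
print("symmetric part of A_b:\n", S_b)
print("eigenvalues of S_b:", np.linalg.eigvalsh(S_b))
```

Under these assumptions the symmetric part is positive definite whenever the features are linearly independent and the chain is ergodic, so it can serve as a metric in the same place as the feature covariance. Note that it is built from behavior-policy transitions alone, with no importance-sampling ratios.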