u/maiosi2

Hi everyone!

I am trying to reproduce an example from a textbook step by step, but I cannot get the same result no matter what I try.

The example is about value iteration / adaptive dynamic programming for a discrete-time LQR problem. The system is a 4-state linear system obtained by discretizing a continuous-time model with zero-order hold and a sampling time of T_s = 0.01 s.

The authors say that they use value iteration and estimate the parameters of P online using batch least squares every 15 data points collected from the trajectory. Then they update the controller and continue iterating. According to the book, this converges to the correct Riccati solution. (The example is on page 41 of this book:

https://lewisgroup.uta.edu/2019%2006%20RL%20short%20course%20SEU/RL%20papers/Optimal%20Adaptive%20Control-%20Lewis-%20full%20book.pdf)
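For reference, one common way to write a least-squares value-iteration step of this kind is sketched below. The notation is an assumption and is not copied from the book: φ(x) is the vector of quadratic monomials of the state, p_j stacks the independent entries of P_j, Q and R are the LQR weights, and K_j is the current gain.

```latex
% Assumed notation: \hat V_j(x) = x^\top P_j x = p_j^\top \phi(x)
\begin{aligned}
u_k &= -K_j x_k, \\
p_{j+1}^\top \phi(x_k) &= x_k^\top Q x_k + u_k^\top R u_k + p_j^\top \phi(x_{k+1}), \\
K_{j+1} &= (R + B^\top P_{j+1} B)^{-1} B^\top P_{j+1} A .
\end{aligned}
```

For a 4-state system a symmetric P has 4·5/2 = 10 independent entries, so a batch of 15 samples determines p_{j+1} only if the 15 regressors φ(x_k) span all 10 directions.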

I tried to reproduce exactly the same procedure in MATLAB, using the same type of quadratic features and solving the least-squares problem every 15 samples, but when I do this I do not have enough independent features to solve the problem correctly. The regression matrix quickly becomes rank deficient or nearly singular because the trajectory converges and the states lose excitation.
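The rank problem can be illustrated with a small standalone check. This is only a sketch in plain Python, not the MATLAB code linked in this post: the helper names, the direction `v`, and the decay factor 0.9 are all made up for illustration. It builds 15 rows of quadratic features from a state that decays along a single direction and compares their rank with 15 rows built from randomly reset states.

```python
import random
from itertools import combinations_with_replacement

def quadratic_features(x):
    """Monomials x_i * x_j for i <= j, i.e. n(n+1)/2 features for n states."""
    idx = combinations_with_replacement(range(len(x)), 2)
    return [x[i] * x[j] for i, j in idx]

def numerical_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting (illustrative only)."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Largest remaining entry in this column is used as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # column adds no new direction
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

# Hypothetical batch 1: the state shrinks along one fixed direction v.
v = [1.0, -0.5, 0.3, 0.8]
decaying_batch = [quadratic_features([0.9 ** k * vi for vi in v])
                  for k in range(15)]

# Hypothetical batch 2: the state is reset to a random point at every sample.
random.seed(0)
random_batch = [quadratic_features([random.uniform(-1, 1) for _ in range(4)])
                for _ in range(15)]

rank_decaying = numerical_rank(decaying_batch)
rank_random = numerical_rank(random_batch)
print("features per sample:", len(decaying_batch[0]))
print("rank, single decaying mode:", rank_decaying)
print("rank, random resets:", rank_random)
```

A real closed-loop trajectory usually excites more than one mode, so its regression matrix is not necessarily this degenerate, but it can still be badly conditioned once the fast modes have died out.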

If I artificially collect much more data, or use many random resets of the initial condition, or generate many trajectories, then the estimation starts working much better and the learned P gets close to the true solution. But the book seems to suggest explicitly that the method works just by updating every 15 points along the trajectory.

And this is my code: https://pastebin.com/YkpUiNYj

These are the results from the book:

https://preview.redd.it/iwvbdmovbh1h1.png?width=709&format=png&auto=webp&s=9fccc356111cbb9aa64d1db02b8e06f59b0c0a5e

And these are mine:

https://preview.redd.it/ph51herzbh1h1.png?width=501&format=png&auto=webp&s=c23d574e53ca6e66ca82ad3eccf9b1b143e7d411

I tried longer simulation times, different initial conditions, and checking the rank of the regression matrix, but nothing comes close to their solution.

Has anyone worked with this before? Thanks a lot for your help!

Hi guys, I know this is a pretty specific topic, but if anyone here has worked on optimal/adaptive control or RL-style value function learning, I’d really appreciate your insight.

I’ve implemented a discrete-time LQR-like setup where a neural network critic (ReLU) learns the optimal value function via TD(0). I validate performance against the analytical solution:

V(x) = x^T P x

With periodic state resets, the critic converges well and captures the expected quadratic structure.

However, when I introduce persistent excitation (e.g., sinusoids or band-limited noise added to the control input), the critic no longer converges to the optimal value function.

In general, it just diverges.

This raises a fundamental question:

How can I "excite" the system so that I have enough data to learn from before the state converges to zero? Is that even possible?

More generally:

Is this lack of convergence due to a “policy shift”, or is there a principled way to introduce excitation without biasing the value function estimation?
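A sketch of where such a bias could come from, under assumptions that are not stated above: linear dynamics x_{k+1} = A x_k + B u_k, a stage cost x^T Q x + u^T R u, an applied input u_k = -K x_k + e_k with probing signal e_k, and a TD target built from the applied (noisy) input. Expanding the target for a quadratic critic x^T P x gives:

```latex
% Assumed TD target: y_k = x_k^\top Q x_k + u_k^\top R u_k + x_{k+1}^\top P x_{k+1}
\begin{aligned}
y_k &= x_k^\top \big[ Q + K^\top R K + (A - BK)^\top P (A - BK) \big] x_k \\
    &\quad + 2\, e_k^\top \big[ B^\top P (A - BK) - R K \big] x_k
      + e_k^\top (R + B^\top P B)\, e_k .
\end{aligned}
```

The first line is the noise-free Bellman right-hand side for the gain K. The cross term vanishes when K = (R + B^T P B)^{-1} B^T P A, but the last term is nonnegative and does not average out, so under these assumptions a critic fitted to y_k would be pulled away from x^T P x whenever e_k is nonzero.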

Any thoughts, references, or similar experiences would be super helpful!

Can't reproduce an example from a textbook no matter what: Adaptive Dynamic Programming / Adaptive Optimal Control