C. Szepesvári, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, vol.4, issue.1, 2010.
DOI : 10.2200/S00268ED1V01Y201005AIM009

M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art, 2012.
DOI : 10.1007/978-3-642-27645-3

L. Buşoniu, R. Babuška, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators, 2010.

W. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2007.
DOI : 10.1002/9781118029176

D. Choi and B. Van Roy, A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning, Discrete Event Dynamic Systems, vol.16, issue.2, pp.207-239, 2006.
DOI : 10.1007/b98840

L. C. Baird, Residual Algorithms: Reinforcement Learning with Function Approximation, International Conference on Machine Learning (ICML), pp.30-37, 1995.
DOI : 10.1016/B978-1-55860-377-6.50013-X

URL : http://www.cs.cmu.edu/People/reinf/ml95/proc/baird.ps

Y. Engel, Algorithms and Representations for Reinforcement Learning, 2005.

M. Geist and O. Pietquin, Kalman Temporal Differences, Journal of Artificial Intelligence Research, 2010.
DOI : 10.1109/adprl.2009.4927543

URL : https://hal.archives-ouvertes.fr/hal-00858687

S. J. Bradtke and A. G. Barto, Linear Least-Squares algorithms for temporal difference learning, Machine Learning, pp.33-57, 1996.
DOI : 10.1007/bf00114723

URL : https://link.springer.com/content/pdf/10.1007%2FBF00114723.pdf

M. Geist and O. Pietquin, Statistically linearized least-squares temporal differences, International Congress on Ultra Modern Telecommunications and Control Systems (ICUMT), 2010.
DOI : 10.1109/ICUMT.2010.5676598

URL : https://hal.archives-ouvertes.fr/hal-00554338

R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver et al., Fast gradient-descent methods for temporal-difference learning with linear function approximation, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp.993-1000, 2009.
DOI : 10.1145/1553374.1553501

URL : http://webdocs.cs.ualberta.ca/~sutton/papers/gradTD1.pdf

H. Maei, C. Szepesvári, S. Bhatnagar, D. Precup, D. Silver et al., Convergent temporal-difference learning with arbitrary smooth function approximation, Advances in Neural Information Processing Systems (NIPS), pp.1204-1212, 2009.

H. R. Maei and R. S. Sutton, GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces, Proceedings of the 3rd Conference on Artificial General Intelligence (AGI-10), 2010.
DOI : 10.2991/agi.2010.22

H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton, Toward Off-Policy Learning Control with Function Approximation, International Conference on Machine Learning (ICML), 2010.

A. Nedić and D. P. Bertsekas, Least Squares Policy Evaluation Algorithms with Linear Function Approximation, Discrete Event Dynamic Systems: Theory and Applications, pp.79-110, 2003.

H. Yu and D. P. Bertsekas, Q-Learning Algorithms for Optimal Stopping Based on Least Squares, European Control Conference, 2007.

G. Gordon, Stable Function Approximation in Dynamic Programming, International Conference on Machine Learning (ICML), 1995.
DOI : 10.1016/B978-1-55860-377-6.50040-2

URL : http://www.cs.berkeley.edu/~pabbeel/cs287-fa09/readings/Gordon-1995.pdf

D. Ernst, P. Geurts, and L. Wehenkel, Tree-Based Batch Mode Reinforcement Learning, Journal of Machine Learning Research, vol.6, pp.503-556, 2005.

R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, vol.3, pp.9-44, 1988.
DOI : 10.1007/BF00115009

G. A. Rummery and M. Niranjan, On-line Q-learning using connectionist systems, 1994.

C. J. Watkins and P. Dayan, Q-learning, Machine Learning, pp.279-292, 1992.
DOI : 10.1007/BF00992698

Y. Engel, S. Mannor, and R. Meir, Bayes Meets Bellman: The Gaussian Process Approach to Temporal Difference Learning, International Conference on Machine Learning (ICML), pp.154-161, 2003.

J. N. Tsitsiklis and B. Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, vol.42, 1997.

G. Tesauro, Temporal difference learning and TD-Gammon, Communications of the ACM, vol.38, issue.3, 1995.
DOI : 10.1145/203330.203343

H. van Seijen, H. van Hasselt, S. Whiteson, and M. Wiering, A theoretical and empirical analysis of Expected Sarsa, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.
DOI : 10.1109/ADPRL.2009.4927542

H. Yu, Least Squares Temporal Difference Methods: An Analysis under General Conditions, SIAM Journal on Control and Optimization, vol.50, issue.6, 2010.
DOI : 10.1137/100807879

F. S. Melo, S. P. Meyn, and M. I. Ribeiro, An analysis of reinforcement learning with function approximation, Proceedings of the 25th international conference on Machine learning, ICML '08, pp.664-671, 2008.
DOI : 10.1145/1390156.1390240

A. Antos, C. Szepesvári, and R. Munos, Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning, pp.89-129, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00830201

A. Kruger, On Fréchet subdifferentials, Journal of Mathematical Sciences, vol.116, issue.3, pp.3325-3358, 2003.
DOI : 10.1023/A:1023673105317

T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 1984.

Y. Engel, S. Mannor, and R. Meir, Reinforcement learning with Gaussian processes, Proceedings of the 22nd international conference on Machine learning, ICML '05, 2005.
DOI : 10.1145/1102351.1102377

URL : http://www-ee.technion.ac.il/~rmeir/Publications/EngelMannorMeirICML05.pdf

D. Precup, R. S. Sutton, and S. P. Singh, Eligibility Traces for Off-Policy Policy Evaluation, International Conference on Machine Learning (ICML), pp.759-766, 2000.

S. J. Julier and J. K. Uhlmann, New extension of the Kalman filter to nonlinear systems, Signal Processing, Sensor Fusion, and Target Recognition VI, 1997.
DOI : 10.1117/12.280797

S. J. Julier, The scaled unscented transformation, Proceedings of the 2002 American Control Conference, pp.4555-4559, 2002.
DOI : 10.1109/ACC.2002.1025369

URL : http://www.cs.unc.edu/~welch/kalman/media/pdf/ACC02-IEEE1357.PDF

P. M. Nørgaard, N. Poulsen, and O. Ravn, New developments in state estimation for nonlinear systems, Automatica, vol.36, issue.11, pp.1627-1638, 2000.
DOI : 10.1016/S0005-1098(00)00089-3

R. van der Merwe, Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models, PhD thesis, 2004.

M. Geist and O. Pietquin, Eligibility traces through colored noises, International Congress on Ultra Modern Telecommunications and Control Systems, 2010.
DOI : 10.1109/ICUMT.2010.5676597

URL : https://hal.archives-ouvertes.fr/hal-00553910

L. Daubigney, M. Gasic, S. Chandramohan, M. Geist, O. Pietquin et al., Uncertainty management for online optimisation of a POMDP-based large-scale spoken dialogue system, Annual Conference of the International Speech Communication Association (Interspeech), pp.1301-1304, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652194

M. G. Lagoudakis and R. Parr, Least-squares policy iteration, Journal of Machine Learning Research, vol.4, pp.1107-1149, 2003.

T. Söderström and P. Stoica, Instrumental variable methods for system identification, Circuits, Systems, and Signal Processing, pp.1-9, 2002.
DOI : 10.1007/BFb0009019

A. Geramifard, M. Bowling, and R. S. Sutton, Incremental Least-Squares Temporal Difference Learning, Conference of the American Association for Artificial Intelligence (AAAI), pp.356-361, 2006.

J. Johns, M. Petrik, and S. Mahadevan, Hybrid least-squares algorithms for approximate policy evaluation, Machine Learning, 2009.
DOI : 10.1007/978-3-642-04180-8_9

URL : http://www.cs.duke.edu/%7Ejohns/pubs/johns_ml09.pdf

M. Geist and O. Pietquin, Statistically linearized recursive least squares, IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2010.
DOI : 10.1109/MLSP.2010.5589236

URL : https://hal.archives-ouvertes.fr/hal-00553168

R. S. Sutton, C. Szepesvári, and H. R. Maei, A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation, Advances in Neural Information Processing Systems (NIPS), 2008.

B. D. Ripley, Stochastic Simulation, 1987.
DOI : 10.1002/9780470316726

A. Samuel, Some studies in machine learning using the game of checkers, IBM Journal of Research and Development, pp.210-229, 1959.

R. Munos, Performance Bounds in Lp norm for Approximate Value Iteration, SIAM Journal on Control and Optimization, 2007.
DOI : 10.1137/040614384

URL : https://hal.archives-ouvertes.fr/inria-00124685

D. Ormoneit and S. Sen, Kernel-Based Reinforcement Learning, Machine Learning, pp.161-178, 2002.

M. Riedmiller, Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method, European Conference on Machine Learning (ECML), 2005.

D. P. Bertsekas and S. Ioffe, Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming, Labs for Information and Decision Systems, MIT, Tech. Rep. LIDS-P-2349, 1996.

D. P. Bertsekas, V. Borkar, and A. Nedić, Learning and Approximate Dynamic Programming, ch. Improved Temporal Difference Methods with Linear Function Approximation, pp.231-235, 2004.

D. P. Bertsekas and H. Yu, Projected equation methods for approximate solution of large linear systems, Journal of Computational and Applied Mathematics, vol.227, issue.1, pp.27-50, 2007.
DOI : 10.1016/j.cam.2008.07.037

D. P. Bertsekas, Projected Equations, Variational Inequalities, and Temporal Difference Methods, IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.

D. P. de Farias and B. Van Roy, The Linear Programming Approach to Approximate Dynamic Programming, Operations Research, vol.51, issue.6, pp.850-865, 2003.
DOI : 10.1287/opre.51.6.850.24925

V. V. Desai, V. F. Farias, and C. C. Moallemi, The Smoothed Approximate Linear Program, Advances in Neural Information Processing Systems (NIPS), 2009.

B. Scherrer, V. Gabillon, M. Ghavamzadeh, and M. Geist, Approximate Modified Policy Iteration, International Conference on Machine Learning (ICML), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00697169

S. Kakade and J. Langford, Approximately optimal approximate reinforcement learning, International Conference on Machine Learning (ICML), 2002.

A. Barto, R. Sutton, and C. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, vol.13, issue.5, pp.834-846, 1983.
DOI : 10.1109/TSMC.1983.6313077

V. R. Konda and J. N. Tsitsiklis, On Actor-Critic Algorithms, SIAM Journal on Control and Optimization, vol.42, issue.4, pp.1143-1166, 2003.
DOI : 10.1137/S0363012901385691

R. S. Sutton, D. A. Mcallester, S. P. Singh, and Y. Mansour, Policy Gradient Methods for Reinforcement Learning with Function Approximation, Neural Information Processing Systems (NIPS), pp.1057-1063, 1999.

J. Peters and S. Schaal, Natural Actor-Critic, Neurocomputing, vol.71, issue.7-9, pp.1180-1190, 2008.
DOI : 10.1016/j.neucom.2007.11.026

S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, Incremental natural actor-critic algorithms, Advances in Neural Information Processing Systems (NIPS), 2007.
DOI : 10.1016/j.automatica.2009.07.008

URL : http://www.cs.ualberta.ca/~sutton/papers/BSGL-08.pdf

M. Geist and O. Pietquin, Revisiting Natural Actor-Critics with Value Function Approximation, International Conference on Modeling Decisions for Artificial Intelligence (MDAI), ser. LNAI, pp.207-218, 2010.
DOI : 10.1007/11596448_9

URL : https://hal.archives-ouvertes.fr/hal-00554346

B. Scherrer and M. Geist, Recursive Least-Squares Learning with Eligibility Traces, European Workshop on Reinforcement Learning, 2011.
DOI : 10.1007/978-3-642-29946-9_14

URL : https://hal.archives-ouvertes.fr/hal-00644511

R. Munos, Error Bounds for Approximate Policy Iteration, International Conference on Machine Learning (ICML), pp.560-567, 2003.

B. Scherrer, Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view, International Conference on Machine Learning (ICML), 2010.
URL : https://hal.archives-ouvertes.fr/inria-00537403

A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-Sample Analysis of LSTD, International Conference on Machine Learning (ICML), 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482189

O. Pietquin, M. Geist, and S. Chandramohan, Sample Efficient On-line Learning of Optimal Dialogue Policies with Kalman Temporal Differences, International Joint Conference on Artificial Intelligence (IJCAI), pp.1878-1883, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00618252

J. A. Boyan, Technical Update: Least-Squares Temporal Difference Learning, Machine Learning, pp.233-246, 1999.

L. Daubigney, M. Geist, and O. Pietquin, Off-policy learning in large-scale POMDP-based dialogue systems, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
DOI : 10.1109/ICASSP.2012.6289040

URL : https://hal.archives-ouvertes.fr/hal-00684819

C. W. Phua and R. Fitch, Tracking value function dynamics to improve reinforcement learning with piecewise linear function approximation, Proceedings of the 24th international conference on Machine learning, ICML '07, 2007.
DOI : 10.1145/1273496.1273591

URL : http://imls.engr.oregonstate.edu/www/htdocs/proceedings/icml2007/papers/523.pdf

M. Keramati, A. Dezfouli, and P. Piray, Speed/Accuracy Trade-Off between the Habitual and the Goal-Directed Processes, PLoS Computational Biology, vol.7, issue.5, 2011.
DOI : 10.1371/journal.pcbi.1002055

URL : https://doi.org/10.1371/journal.pcbi.1002055

O. Pietquin, M. Geist, S. Chandramohan, and H. Frezza-buet, Sample-efficient batch reinforcement learning for dialogue management optimization, ACM Transactions on Speech and Language Processing, vol.7, issue.3, pp.1-7, 2011.
DOI : 10.1145/1966407.1966412

URL : https://hal.archives-ouvertes.fr/hal-00617517

M. Kearns and S. Singh, Bias-Variance Error Bounds for Temporal Difference Updates, Conference on Learning Theory (COLT), 2000.

B. Scherrer and M. Geist, Recursive least-squares off-policy learning with eligibility traces, INRIA, Tech. Rep., 2012.
DOI : 10.1007/978-3-642-29946-9_14

URL : http://hal.inria.fr/docs/00/64/45/11/PDF/ewrl.pdf

X. Xu, D. Hu, and X. Lu, Kernel-Based Least Squares Policy Iteration for Reinforcement Learning, IEEE Transactions on Neural Networks, vol.18, issue.4, pp.973-992, 2007.
DOI : 10.1109/TNN.2007.899161

URL : http://www.jilsa.net/files/ieee-tnn-paper-04267723.pdf

T. Jung and D. Polani, Kernelizing LSPE(λ), IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), pp.338-345, 2007.
DOI : 10.1109/adprl.2007.368208

G. Taylor and R. Parr, Kernelized value function approximation for reinforcement learning, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.
DOI : 10.1145/1553374.1553504

URL : http://www.cs.mcgill.ca/~icml2009/papers/467.pdf

M. Loth, M. Davy, and P. Preux, Sparse Temporal Difference Learning Using LASSO, 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.
DOI : 10.1109/ADPRL.2007.368210

URL : https://hal.archives-ouvertes.fr/inria-00117075

J. Z. Kolter and A. Y. Ng, Regularization and feature selection in least-squares temporal difference learning, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.
DOI : 10.1145/1553374.1553442

URL : http://www.cs.mcgill.ca/~icml2009/papers/439.pdf

J. Johns, C. Painter-wakefield, and R. Parr, Linear Complementarity for Regularized Policy Evaluation and Improvement, Advances in Neural Information Processing Systems (NIPS), pp.1009-1017, 2010.

M. Geist and B. Scherrer, ℓ1-Penalized Projected Bellman Residual, European Workshop on Reinforcement Learning (EWRL), 2011.
DOI : 10.1007/978-3-642-29946-9_12

URL : http://hal.inria.fr/docs/00/64/45/07/PDF/gs_ewrl_l1_cr.pdf

M. W. Hoffman, A. Lazaric, M. Ghavamzadeh, and R. Munos, Regularized Least Squares Temporal Difference Learning with Nested ℓ2 and ℓ1 Penalization, European Workshop on Reinforcement Learning (EWRL), 2011.
DOI : 10.1007/978-3-642-29946-9_13

M. Geist, B. Scherrer, A. Lazaric, and M. Ghavamzadeh, A Dantzig Selector Approach to Temporal Difference Learning, International Conference on Machine Learning (ICML), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00749480

I. Menache, S. Mannor, and N. Shimkin, Basis Function Adaptation in Temporal Difference Reinforcement Learning, Annals of Operations Research, vol.134, issue.1, pp.215-238, 2005.

URL : http://www.ee.technion.ac.il/people/shimkin/preprints/basisadaptation_dec03.pdf

P. W. Keller, S. Mannor, and D. Precup, Automatic basis function construction for approximate dynamic programming and reinforcement learning, Proceedings of the 23rd international conference on Machine learning , ICML '06, pp.449-456, 2006.
DOI : 10.1145/1143844.1143901

URL : http://www.ece.mcgill.ca/~smanno1//public/C-KellerPrecup-NCAICML-2006.pdf

R. Parr, C. Painter-wakefield, L. Li, and M. Littman, Analyzing feature generation for value-function approximation, Proceedings of the 24th international conference on Machine learning, ICML '07, 2007.
DOI : 10.1145/1273496.1273589

URL : http://www.cs.duke.edu/~parr/icml07.pdf

R. Parr, L. Li, G. Taylor, C. Painter-wakefield, and M. L. Littman, An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning, Proceedings of the 25th international conference on Machine learning, ICML '08, 2008.
DOI : 10.1145/1390156.1390251

J. Wu and R. Givan, Automatic Induction of Bellman-Error Features for Probabilistic Planning, Journal of Artificial Intelligence Research, 2010.

S. Mahadevan and M. Maggioni, Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes, Journal of Machine Learning Research, vol.8, pp.2169-2231, 2007.

D. Bertsekas and D. Castañon, Adaptive aggregation methods for infinite horizon dynamic programming, IEEE Transactions on Automatic Control, vol.34, issue.6, pp.589-598, 1989.
DOI : 10.1109/9.24227

S. Singh, T. Jaakkola, and M. Jordan, Reinforcement learning with soft state aggregation, Advances in neural information processing systems (NIPS), pp.361-368, 1995.

J. Ma and W. B. Powell, Convergence Analysis of Kernel-based On-policy Approximate Policy Iteration Algorithms for Markov Decision Processes with Continuous, Multidimensional States and Actions, 2010.

A. Barreto, D. Precup, and J. Pineau, Reinforcement learning using kernel-based stochastic factorization, Advances in Neural Information Processing Systems (NIPS), 2011.