The question is: why does Q-learning need memory at all, if you can use a neural network (NN), which, to put it bluntly, already stores previous experience in itself, to predict the action values ( Q[s',a'] )?

https://ru.wikipedia.org/wiki/Q- training
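
For concreteness, here is a minimal sketch of the update step where Q[s',a'] shows up, assuming the simplest case of a table stored as a list of lists. The names `q_update`, `alpha` and `gamma` are illustrative and not taken from any library. With a neural network, the table lookup `max(Q[s_next])` would be replaced by a forward pass of the network on s'.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step.

    Q is a list of rows, one row per state, one entry per action.
    alpha (learning rate) and gamma (discount) are assumed values.
    """
    # Best estimated value reachable from the next state: max over a' of Q[s', a']
    td_target = r + gamma * max(Q[s_next])
    # Move the current estimate a fraction alpha toward that target
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q


# Toy example: 3 states, 2 actions, all estimates start at zero
Q = [[0.0, 0.0] for _ in range(3)]
q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Here the table itself is the only thing that persists between steps; only the entry for the visited pair (s, a) changes on each update.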
