Training vision-based autonomous driving in the real world can be inefficient and impractical. Vehicle simulation allows driving skills to be learned in a virtual world and transferred to handle real-world scenarios more effectively. Across the virtual and real visual domains, common features such as the relative distance to road edges and to other vehicles over time remain consistent. These visual elements are intuitively crucial to human decision making during driving. We hypothesize that emphasizing these spatio-temporal factors in transfer learning can likewise improve generalization across domains. First, we propose a CNN+LSTM transfer learning framework that extracts spatio-temporal feature representations of these factors from the LSTM hidden states (which capture the vehicle dynamics) between the network layers. Next, we conduct an ablation study to quantitatively estimate each element's significance to the classification decision using a cosine similarity metric, which we show correlates more consistently with decision confidence than other image similarity metrics. Finally, based on the results of the ablation study, we complement the image sequences with saliency maps and the key visual elements identified from them as input to our CNN+LSTM network. Training is initialized with the learned CNN weights and LSTM latent features (capturing the intrinsic physics of the moving vehicle with respect to its surroundings) transferred from one domain to the other. Our experiments show that the proposed transfer learning framework generalizes better to unseen domains than a baseline CNN model on a binary classification task.
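The cosine-similarity ablation scoring described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the 128-dimensional "hidden state" vectors, the noise scales, and the element names (`road_edge`, `sky`) are all assumptions made for demonstration. The idea is that an element's significance is measured by how far the hidden-state features drift (in cosine similarity) when that element is masked out of the input.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ablation_significance(full_features, ablated_features):
    """Significance of a masked-out visual element: the drop in cosine
    similarity between hidden-state features computed with and without it."""
    return 1.0 - cosine_similarity(full_features, ablated_features)

# Illustrative stand-ins for LSTM hidden states (real features would come
# from forward passes of the CNN+LSTM on full vs. masked frame sequences).
rng = np.random.default_rng(0)
h_full = rng.standard_normal(128)                          # full frame
h_no_road_edge = h_full + 0.5 * rng.standard_normal(128)   # road edges masked
h_no_sky = h_full + 0.05 * rng.standard_normal(128)        # sky masked

# A larger similarity drop suggests the element matters more to the decision.
print("road_edge:", ablation_significance(h_full, h_no_road_edge))
print("sky:      ", ablation_significance(h_full, h_no_sky))
```

In this toy setup, masking the road edges perturbs the features far more than masking the sky, so it receives a higher significance score, mirroring how the ablation study ranks visual elements.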