I have heard that traditional (vanilla) recurrent neural networks aren't suitable for real-world use, and that other RNN variants, such as LSTM or GRU, are used instead. What is the reason for this?
Yes, and the reason is the so-called vanishing gradient problem. Backpropagation (through time, in the case of RNNs) relies on the gradient as its training signal. In a vanilla RNN, the gradient flowing backward through the sequence is multiplied by the recurrent weight matrix and the activation-function derivative at every time step, so over a long sequence it typically shrinks toward zero (it can also explode, but the vanishing case is what makes vanilla RNNs hard to train on long dependencies). The earliest time steps then receive almost no learning signal about how they influenced the output, and training effectively stalls: you cannot train a vanilla RNN on relatively long sequences. LSTM and GRU address this with gating mechanisms that create a more direct path for the gradient (for example, the cell state in an LSTM), so information about the first time steps is not lost as quickly and longer sequences become trainable.
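The shrinking-gradient effect is easy to see numerically. Below is a minimal illustrative sketch (not a trained network): it repeatedly applies one step of backpropagation through time for a vanilla tanh RNN, multiplying the gradient by the transposed recurrent weight matrix and the tanh derivative, and tracks the gradient norm. The weight scale, hidden size, and sequence length are arbitrary choices for demonstration.

```python
import numpy as np

# Illustrative sketch of vanishing gradients in a vanilla tanh RNN.
# At each backward step, the gradient is multiplied by W_hh^T and by
# diag(tanh'(h)). If the effective gain of that product is below 1,
# the gradient norm shrinks geometrically with sequence length.

rng = np.random.default_rng(0)
hidden_size = 50
# Small random recurrent weights (arbitrary scale, for illustration only)
W_hh = rng.normal(0.0, 0.1, (hidden_size, hidden_size))

grad = np.ones(hidden_size)   # gradient arriving at the last time step
norms = []
for t in range(100):          # backpropagate through 100 time steps
    h = rng.normal(0.0, 1.0, hidden_size)    # stand-in pre-activations
    tanh_deriv = 1.0 - np.tanh(h) ** 2       # tanh'(h), always in (0, 1]
    grad = W_hh.T @ (tanh_deriv * grad)      # one BPTT step
    norms.append(np.linalg.norm(grad))

print(f"gradient norm after 1 step:    {norms[0]:.3e}")
print(f"gradient norm after 100 steps: {norms[-1]:.3e}")
```

By the 100th backward step the gradient norm is many orders of magnitude smaller than at the start, which is why the early time steps in a long sequence stop contributing to learning. An LSTM's additive cell-state update avoids this repeated matrix multiplication along the main gradient path.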