We could use vanilla dense (fully connected) neural networks for computer vision, but this would be inefficient. There are at least two reasons why we don't use them.
1) Efficiency. Images contain a lot of pixels, so we would need a huge number of neurons to process them. This leads to enormously large networks that are slow to train and demand significant compute and memory. For a sandbox problem such as digit recognition on small images (for example, 20x20 pixels, grayscale), a dense network can work fine. But a 1920x1080 RGB image has 1920 × 1080 × 3 ≈ 6.2 million input values, so even a single dense layer with 1,000 neurons would need over six billion weights, which is impractical to train. Convolutions (along with pooling layers) drastically reduce the number of learnable parameters, as the sketch below shows.
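To make the difference concrete, here is a minimal sketch in PyTorch. The 1,000-unit dense layer and the 64-filter 3x3 conv layer are arbitrary illustrative choices, not a recommended architecture:

```python
import torch.nn as nn

# Dense layer on a flattened 1920x1080 RGB image.
# (Counted arithmetically -- actually allocating ~6.2e9 float32
# weights would require roughly 25 GB of memory.)
n_inputs = 1920 * 1080 * 3                    # 6,220,800 input features
dense_params = n_inputs * 1000 + 1000         # weights + biases for 1,000 units
print(f"Dense layer: {dense_params:,} parameters")  # 6,220,801,000

# Convolutional layer with 64 filters of size 3x3 on the same image.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
conv_params = sum(p.numel() for p in conv.parameters())
print(f"Conv layer:  {conv_params:,} parameters")   # 1,792
```

Note that the conv layer's parameter count depends only on the filter size and the number of channels, not on the image resolution, which is exactly why convolutions scale to large images.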
2) Quality of the results. Even if we could train dense networks for CV tasks, we should understand that CNNs perform better. The logic behind convolutions is that they help derive high-level features of the image: a CNN picks up on the shapes of objects, their colors, the presence or absence of particular objects, and so on. A dense network, by contrast, treats each pixel as an independent input feature: it has no built-in notion of spatial locality, so it cannot exploit the fact that neighboring pixels are related, and it must relearn the same pattern separately for every position in the image. This leads to poor performance compared with CNNs, as the sketch below illustrates.
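A minimal sketch of this intuition, again in PyTorch. The tiny image and the hand-set edge filter are purely illustrative; a real CNN learns its filters from data:

```python
import torch
import torch.nn.functional as F

# A synthetic 8x8 grayscale image: dark on the left, bright on the right.
img = torch.zeros(1, 1, 8, 8)
img[..., 4:] = 1.0

# A hand-set 3x3 vertical-edge filter (Sobel-like).
kernel = torch.tensor([[[[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]]]])

response = F.conv2d(img, kernel)
print(response[0, 0])
# Strong activations exactly where the 3x3 window straddles the edge,
# zeros elsewhere. The same 9 weights scan every location (weight
# sharing), so the filter detects the edge wherever it appears --
# a dense layer would need separate weights for every position.
```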