Text and images may seem completely different, but only to humans. Computers operate on all types of information in numeric form. You probably know that in computer vision, images are represented as matrices of numbers. An image with 50x50 dimensions has 2500 pixels, and if it is a color image, each pixel carries 3 additional values. These values represent the RGB components (the mix of red, green, and blue).
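To make this concrete, here is a minimal sketch of an image as a numeric array using NumPy (the shapes and pixel values are illustrative, not from any real dataset):

```python
import numpy as np

# A 50x50 RGB image as a numeric array: 2500 pixels, 3 channel values each.
image = np.zeros((50, 50, 3), dtype=np.uint8)

# Set the top-left pixel to pure red: (R, G, B) = (255, 0, 0).
image[0, 0] = [255, 0, 0]

print(image.shape)  # (50, 50, 3)
print(image.size)   # 7500 values = 2500 pixels * 3 channels
```

So for the computer, the whole picture is just this grid of numbers.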
Now, when you use a CNN on an image, you specify a convolution that takes several neighboring pixels and processes them together, then moves on to the next block of pixels, and so on. For example, if you choose a convolution of size 2x2, your CNN will analyze a patch of 4 pixels at each step. The aim is to generate features that will then be processed by the usual fully connected layers of the network. Roughly speaking, by acting in this way a CNN can detect features such as edges, colors, shapes, objects, etc. in the image. In short, the purpose of convolution in a CNN is to produce high-level features from raw pixels, which are then used by the traditional neural network layers at the end of the CNN.
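The sliding 2x2 window can be sketched in plain NumPy. This is a hand-written "valid" convolution on a tiny toy image, with a kernel chosen to react to vertical edges (the image and kernel values are made up for illustration; a real CNN learns its kernel weights):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the image and take
    the elementwise product-sum of each patch it covers."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny 4x4 "image" with a vertical edge between columns 1 and 2,
# and a 2x2 kernel that responds to that kind of edge.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[1, -1],
                   [1, -1]], dtype=float)

print(conv2d(image, kernel))
```

Notice that the output is non-zero only at the positions where the window straddles the edge: that is the sense in which a convolution "detects" a feature.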
For text, each word can also be represented as numbers. How exactly this transformation is performed is not important here (it is called text vectorization; you can read more about it in NLP articles and tutorials). The main point is that by applying convolution to text, you look at a couple of neighboring words and try to detect some high-level feature from these simple words. Such high-level features are harder to imagine for text than the image features mentioned earlier, but you can think of them as mini-topics in the text: simple actions, descriptions, etc. Hopefully you catch the idea.
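The same sliding-window idea over words can be sketched as a 1D convolution. Here the word vectors are random placeholders (a real system would learn or load them), and one filter spans 2 neighboring words, the text analogue of a 2x2 pixel patch:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 6-word sentence, each word as a hypothetical 4-dimensional vector.
sentence = rng.random((6, 4))

# One convolution filter spanning a window of 2 neighboring words (a bigram).
window = 2
filter_weights = rng.random((window, 4))

# Slide the window over the sentence: each step looks at 2 adjacent
# words and produces a single feature value for that position.
features = np.array([
    np.sum(sentence[i:i + window] * filter_weights)
    for i in range(len(sentence) - window + 1)
])

print(features.shape)  # (5,) — one feature value per bigram position
```

A real text CNN applies many such filters in parallel, so each position in the sentence yields a whole vector of these "mini-topic" features.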