While the abundance of visual content on the Internet, and the ease with which any user can access it, allows us to find relevant content quickly, it also poses challenges. For example, if a parent wants to restrict the visual content their child can see, this content must either be tagged as offensive in advance, or a computer vision algorithm must be trained to detect offensive content automatically. One type of potentially offensive content is sexually explicit or provocative imagery. An image may be sexually provocative if it portrays nudity, but the sexual innuendo can also lie in the body posture or facial expression of the human subject shown in the photo. Existing methods simply analyze skin exposure, and thus fail to capture the intent behind an image; as a result, they miss several important ways in which an image might be sexually provocative, and hence offensive to children. We propose to address this problem by extracting a unified feature descriptor comprising the percentage of skin exposure, the body posture of the human subject in the image, and his or her gestures and facial expressions. We learn to predict these cues, then train a hierarchical model that combines them. Our experiments show that this model detects the sexual innuendo in images more accurately than existing approaches.
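To illustrate the hierarchical idea, the following is a minimal sketch, not the authors' implementation: stage-one predictors are assumed to output per-cue scores (skin exposure, posture, expression), and a stage-two fusion step, here a hypothetical logistic combination with made-up weights, maps them to a single provocativeness probability.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into a (0, 1) probability."""
    return 1.0 / (1.0 + math.exp(-x))


def combine_cues(cue_scores, weights, bias):
    """Stage-two fusion: weighted sum of per-cue scores -> probability.

    cue_scores: outputs of the stage-one predictors, each in [0, 1],
    e.g. [skin_exposure, posture_score, expression_score].
    weights, bias: hypothetical learned fusion parameters.
    """
    z = bias + sum(w * s for w, s in zip(weights, cue_scores))
    return sigmoid(z)


# Hypothetical stage-one outputs for one image: little skin exposure,
# but a suggestive posture and expression.
cue_scores = [0.10, 0.85, 0.70]
weights = [2.0, 1.5, 1.0]  # illustrative values, not learned from data
bias = -2.0

p = combine_cues(cue_scores, weights, bias)
print(f"provocativeness probability: {p:.3f}")
```

A skin-exposure-only baseline would score this image low; the fusion step lets the posture and expression cues raise the prediction, which is the motivation for combining cues rather than thresholding skin exposure alone.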