I have a rich text area where the user can type something. I am trying to prevent JavaScript injection using the following regex:
return input == null ? null : input.replaceAll("(?i)<script.*?>.*?</script.*?>", "") // case 1
.replaceAll("(?i)<.*?javascript:.*?>.*?</.*?>", "") // case 2
.replaceAll("(?i)<.*?\\s+on.*?>.*?</.*?>", ""); // case 3
Above, input is the text from the rich text area, and I am using this regex to avoid possible JavaScript injections.
The problem is case 3. If the user's text contains "on", all the text before "on" gets removed.
How can I make the last case more rigid and avoid the above problem?
If you want to remove the "on" attribute and everything after it up to the end of the tag, you can use this: .replaceAll("(?i)(<.*?\\s+)on.*?(>.*?)", "$1$2");
This renders "ACD" as "ACD". But be aware that if someone puts a ">" character inside the script, it will mess up the regex...
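To make the behaviour concrete, here is a minimal sketch that runs the replacement above on two sample inputs. The class name, method names, sample strings, and the second "tight" pattern are illustrative assumptions, not part of the original question or answer. The tight variant uses [^>] so that a match cannot cross a tag boundary and requires an "on...=" attribute; it is only one possible way to make case 3 stricter, and it is still not a safe sanitizer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are made up for illustration.
final class OnAttributeDemo {

    // The pattern suggested above: keep the start of the tag ($1) and the
    // closing '>' ($2), drop everything from "on" up to that '>'.
    static final Pattern ANSWER =
            Pattern.compile("(?i)(<.*?\\s+)on.*?(>.*?)");

    // Assumed stricter variant: [^>] keeps the match inside a single tag,
    // and on\w+\s*= requires something that looks like an event attribute.
    static final Pattern TIGHT =
            Pattern.compile("(?i)(<[^>]*?\\s+)on\\w+\\s*=[^>]*(>)");

    static String stripAnswer(String input) {
        return input == null ? null : ANSWER.matcher(input).replaceAll("$1$2");
    }

    static String stripTight(String input) {
        return input == null ? null : TIGHT.matcher(input).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String handler = "<a onclick=\"evil()\">C</a>D";
        String plain = "<p>Click on the <b>button</b></p>";

        // Both patterns drop the handler (and anything after it in the tag).
        System.out.println(stripAnswer(handler)); // <a >C</a>D
        System.out.println(stripTight(handler));  // <a >C</a>D

        // ".*?" can run past '>' into the text, so ordinary prose is damaged.
        System.out.println(stripAnswer(plain));   // <p>Click >button</b></p>

        // "[^>]" stops at the end of the tag, so the prose is left alone.
        System.out.println(stripTight(plain));    // <p>Click on the <b>button</b></p>
    }
}
```

Both patterns still break if an attribute value contains a ">" character, and both discard any attributes that follow the handler inside the same tag, which is why this kind of regex is fragile.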
EDIT: the moral of my remark is that I would not recommend custom parsing to remove JavaScript code. I suggest you read the answers to the following question: "Java: Best way to remove Javascript from HTML", and probably use Jsoup.clean (if that is possible in your environment).