RegEx as a way to better data validation

Someone recently sent me a link to an MSDN article about 10 must have tools.  While looking it over I saw The Regulator listed and started thinking about validation of client data.


Everyone has by now heard that you must validate your data from a client before you act on it, but what does that mean.  It doesn’t mean that you just put a maximum length on your input box in your HTML and call it done.  That is no security at all.  Instead you have to do server side checks against what you expected to recieve from the client.  If you then want to go the extra measure and strip out dangerous characters after that then enjoy!  You can’t catch everything with this latter approach.  Take this example.  Suppose you have a page that takes a value and uses it in the where clause of a dynamic SQL Statement (you shouldn’t of course, but people still do).  Now you want to avoid SQL Injection and for some reason have ruled out parameterized queries using ADO.Net.  Given this classic SQL Injection vulnerable scenario, suppose you are trying to defend against the string like: ‘ OR 1=1 –


Single quotes are easy to strip out and in some cases you can even look for the OR as a literal string.  Suppose further that you have a field that asks for age and uses that to execute our dynamic SQL query.  The shrewd (or maybe just lucky) hacker can put in: O’R 1=1 –


Now this won’t destroy your server, but I am trying to prove a point, not educate hackers.  If your home grown script to kill SQL Injection like characters checks for the literal OR it will find that it isn’t present.  The single quote defeats that check, but it also breaks the OR.  If your next move is to strip out single quotes then your own defenses fix the hacker’s statement.  Also since the datatype of the field isn’t text, I don’t need the single quote anyways, that was just a way to put in a character to hide my OR and that you will strip out for me.  This can be done in billions of ways depending on your order of operations.  What this proves is what Erik Olsen of Microsoft has said in the past and I agree with totally, namely, YOU CAN’T ENSURE SOMETHING ISN’T WHAT YOU WANT, ONLY THAT IT IS WHAT YOU WANT.  If I expect an zip code from inside the United States then I check that the value is 5 characters, all of them digits.  Done, I know it is what I expected and can now move on.  If you have to strip something bad out of user data then maybe you shouldn’t strip it out, but instead should reject the request completely?


If any of this sounds reasonable then you really have to get going on your RegEx skills and that brings us back to the utilities list.  Try The Regulator, find something else, write your own or just learn the RegEx syntax.