AoC 2022 Day 17: Secure Coding III
Input Validation Foundations
Generally, an effective way to validate input is first to know how a specific piece of data is going to be processed by the rest of your application. Data then goes through syntax and semantic validation checks to ensure that the user-provided values are both proper in their syntax (the answer follows the proper context asked by the question) and logical value (the values make sense for the question).
Going back to Day 15:
- You cannot manually type your CV in the input field – the form asks for a file, so it’s simply not what’s being asked
- You cannot just upload any file – it should follow a set of very specific rules, the implementation of which was all discussed in the latter parts of the task.
Then comes whitelisting, where you can be very specific with what your forms would accept and immediately strip or even drop the ones that don’t fit in the predefined allowed category.
HTML5 and Regex
HTML5’s built-in features help a lot with the validation of user-provided input, minimizing the need to rely on JavaScript for the same objective. The <input>
element specifically has an array of very helpful capabilities centered around form validation.
For instance, the <input>
type, which can be set to specifically filter for an email, a URL, or even a file, among others, promptly checks whether or not the user-provided input fits the type of data that the form is asking for, and so, feedback on its validity is immediately returned to the user as a result.
For even more granular control of the input being provided, regular expressions (regex) can be integrated into the mix. Simply use it in the “pattern” attribute within the <input>
element, and you’re all set. Here is a nice resource to get started with regular expressions. A couple of examples are shown below.
1. <input type="text" id="uname" name="uname" pattern="[a-zA-Z0-9]+">
2. <input type="email" id="email" name="email" pattern=".+@tryhackme\.com">
The pattern in the first line of code above is easily one of the most foundational regular expression patterns one can use. The instruction here is to match any strings specifically composed of only letters and numbers – an alphanumeric pattern that is case-insensitive.
The pattern in the second line of code above is a bit more pointed in its instruction, specifying that an email can have any characters at the beginning as long as it ends with “@tryhackme.com”.
Developing regular expressions can be very daunting as its nature is complex; however, its capability to match very specific patterns is what makes it special. Well-built regular expressions introduce a great way to immediately filter out user-provided input that doesn’t fit the specific requirements that you have set.
Regex 101
This section will talk about some regex tips to get you started. To match any lowercase character from the English alphabet, the regex pattern is [a-z]
. We can deconstruct it as follows:
- The square brackets indicate that you’re trying to match one character within the set of characters inside of them. For example, if we’re trying to match any vowel of the English alphabet, we construct our regex as follows:
[aeiou]
. The order of the characters doesn’t matter, and it will match the same. - Square brackets can also accept a range of characters by adding a hyphen, as seen in our original example.
- You can also mix and match sets of characters within the bracket.
[a-zA-Z]
means you want to match any character from the English alphabet regardless of case, while[a-z0-9]
means you want to match any lowercase alphanumeric character.
We also need to talk about regex operators. The simplest one is the wildcard operator, denoted by .
. This means regex will match any character, and it’s quite powerful when used with the operators *
, +
, and {min,max}
. The asterisk or star operator is used if you don’t care if the preceding token matches anything or not, while the plus operator is used if you want to make sure that it matches at least once. The curly braces operator, on the other hand, specifies the number of characters you want to match. Let’s look at the following examples:
- To match a string that is alphanumeric and case-insensitive, our pattern would be
[a-zA-Z0-9]+
. The plus operator means that we want to match a string, and we don’t care how long it is, as long as it’s composed of letters and numbers regardless of their case. - If we want to ensure that the first part of the string is composed of letters and we want it to match regardless if there are numbers thereafter, it would be
^[a-zA-Z]+[0-9]*$
. The^
and$
operators are called anchors and denote the start and end of the string we want to match, respectively. Since we wanted to ensure that the start of the string is composed of only letters, adding the caret operator is required. - If we want to match just lowercase letters that are between 3 and 9 characters in length, our pattern would be
^[a-z]{3,9}$
. - If we want a string that starts with 3 letters followed by any 3 characters, our pattern would be
^[a-zA-Z]{3}.{3}$
.
There’s also the concept of grouping and escaping, denoted by ()
and \
operators, respectively. Grouping is done to manage the matching of specific parts of the regex better while escaping is used so we can match strings that contain regex operators. Finally, there’s the ?
operator, which is used to denote that the preceding token is optional. Let’s look at the following example:
- If we want to match both www.tryhackme.com and tryhackme.com, our pattern would be
^(www\.)?tryhackme\.com$
. This pattern would also avoid matching .tryhackme.com. ^(www\.)?
: The^
operator marks the start of the string, followed by the grouping of www and the escaped.
, and immediately followed by the question mark operator. The grouping allowed the question mark operator to work its magic, matching both strings with or without the www. at the beginning.tryhackme\.com$
: The$
operator marks the end of the string, preceded by the string tryhackme, an escaped.
, and the string com. If we don’t escape the.
operator, the regex engine will think that we want to match any character between tryhackme and com as well.
It’s also imperative to note that the wildcard operator can lead to laziness and, consequently misuse. As such, it’s always better to use character sets through the brackets especially when we’re validating input as we want it to be perfect.
Here’s a table to summarize everything above:
[] | Character Set: matches any single character/range of characters inside |
. | Wildcard: matches any character |
* | Star / Asterisk Quantifier: matches the preceding token zero or more times |
+ | Plus Quantifier: matches the preceding token one or more times |
{min,max} | Curly Brace Quantifier: specifies how many times the preceding token can be repeated |
() | Grouping: groups a specific part of the regex for better management |
\ | Escape: escapes the regex operator so it can be matched |
? | Optional: specifies that the preceding token is optional |
^ | Anchor Beginning: specifies that the consequent token is at the beginning of the string |
$ | Anchor Ending: specifies that the preceding token is at the end of the string |
The Unique Case of Free-Form Text
All of the techniques that we have covered in this task thus far are mainly concerned with structured data – data that we already expect what it should look like. However, compared to the validation of structured data, free text is more complex and not very straightforward.
Structured data have inherent characteristics that allow us to set them apart quite quickly. For example, names shouldn’t consist of special characters (some names have, but these can be whitelisted), and age should only consist of numbers with an absolute maximum of 3 characters. Syntax and semantic validation checks can be immediately done on them, and the chaos suddenly doesn’t seem too bad.
In contrast, free text fields are more free-for-all, so validations checks are more limited, and the challenge of securing it is generally vaguer. Yet, like all great engineers before us, we power through these challenges and make the best of what we’re given. Listed below are some considerations to ponder.
- First, we start again with the question of how this piece of data is going to be processed by the rest of the application.
- What will be the context for which this free-form text field will be used? Free text fields for blog posts are tackled very differently than text fields used for a small comment section or a descriptive text field.
- Is the free text field necessary for your business purposes in the first place, or could it be implemented differently while still achieving the same goal? This is essential to know, too, as part of writing secure code is avoiding writing vulnerable ones!
Then we go to the tricky part. Since syntax and semantic checks are pretty much impossible due to the nature of free-form text fields, our best bet in having a bit of control is through whitelisting. This OWASP Cheat Sheet lists the general techniques to perform whitelisting, which can be summarized as follows:
- Ensure no invalid characters are present through proper encoding, and
- Whitelist expected characters and character sets
Client-side vs Server-side
Our discussion until this point has been geared towards using input validation to ensure that the pieces of data that we’re getting from the users are not garbage. Proper input validation on the client-side limits misuse and minimizes the attack surface of the application, however, it doesn’t actively cover malicious use cases.
As opposed to the above examples of validating input implemented on the client side, output escaping and using parametrized queries are implemented on the server side, adding computational load to the server, but ultimately helping in securing the application as a whole. This is done as an exercise of inherent distrust in the user-provided input despite validation.
The example below uses the built-in htmlentities()
PHP function to escape any character that can affect the application.
$name = htmlentities($_GET['name'], ENT_QUOTES | ENT_HTML5, "UTF-8")
On the other hand, the one below uses the HTMLPurifier library to help with some use cases, such as our <textarea>
conundrum above.
Both examples are done on the server side to escape characters and purify content, respectively both effective ways to prevent user misuse and XSS attacks. Implementing both client and server-side validations ensures that no stones are left unturned.
This exact case highlights the importance of layering defenses and shows point-blank the limitation of client-side input validation. It also gives rise to the notion that input validation is not a one-stop shop for securing your application.
Specific cases warrant specific security requirements, and so on this factor alone, it’s already apparent that there exist multiple ways of attacking the challenge of securing user-provided input, and most of the time, they are implemented on top of each other.
Input validation is an important layer of security that all production-level applications should have. However, no matter how ‘perfect’ your input validation is for your specific use case, there simply are limitations to all kinds of security implementations – and input validation is not exempt from it.
You cannot make your application fully XSS-proof through input validation alone. Controls may be put in place at the input level that may lessen the attack surface of the application, but it doesn’t fully remediate it. Full remediation of an XSS vulnerability in your application requires additional layers of defenses, such as mitigation on the browser level (via secure cookies, content-security-policy headers, etc.) and escaping and purifying user-provided input.