So I've been building a C# html sanitizer using html agility with a white list. It works fine, except for cases like these:
I DO want to allow the src attribute, just not malicious stuff within it obviously. All of the stuff i've looked up has just recommended a whitelist for tags and their attributes. How would you handle something like this though? I know this won't work in any newer browser, but i'm not very familiar with security and i'm sure there are some other clever things attackers could do.
Cross-site scripting (XSS) attacks exploit vulnerabilities in Web page validation by injecting client-side script code. Common vulnerabilities that make your Web applications susceptible to cross-site scripting attacks include failing to properly validate input, failing to encode output, and trusting the data retrieved from a shared database. To protect your application against cross-site scripting attacks, assume that all input is malicious. Constrain and validate all input. Encode all output that could, potentially, include HTML characters. This includes data read from files and databases.
One of the most serious examples of a cross-site scripting attack occurs when an attacker writes script to retrieve the authentication cookie that provides access to a trusted site and then posts the cookie to a Web address known to the attacker. This enables the attacker to spoof the legitimate user's identity and gain illicit access to the Web site.
Common vulnerabilities that make your Web application susceptible to cross-site scripting attacks include:
The two most important countermeasures to prevent cross-site scripting attacks are to:
Summary of Steps
To prevent cross-site scripting, perform the following steps:
Step 1. Check that ASP.NET request validation is enabled.
Step 2. Review ASP.NET code that generates HTML output.
Step 3. Determine whether HTML output includes input parameters.
Step 4. Review potentially dangerous HTML tags and attributes.
Step 5. Evaluate countermeasures.
For detail see the 2nd reference.
You can safely allow the src attribute, provided that you sanitize and handle the input properly. To do this you should first sanitize it through a whitelist of valid URL characters, canonicalize it, and then verify that it points to a valid image.
The whitelist you mentioned is the first step (and an important one at that). To implement the whitelist, simply strip out every character that isn't valid for a URL. Also verify that the URL is properly formed, meaning that it points to a valid resource that the user should be able to access. For example, the user shouldn't be accessing a local file on the server by passing in
file://sensitive.txt or something. If http or https are the only protocols that should be used, check that the URL starts with those. If you are extra paranoid you may reject the request altogether since it is obvious it has been tampered with. Whitelisting is important, but whitelisting alone however will not keep the feature secure.
Canonicalization is important because many attacks depend on submitting URLs that eventually take you to a certain location, but may abuse the computer's innate lack of reasoning to get at things it shouldn't. This will also help to eliminate duplicated paths to the same resource which may improve performance (or at least allow you to improve performance by not rechecking a known file that hasn't changed since the last time you checked it. Be careful with this though because it is possible to spoof a last modified date so an attacker could swap a malicious file in after you've already "checked and trusted" it).
To verify that you are pointing to a valid image, open the file and read in the first few bytes. Do not simply trust the file extension, though do check it first before opening the file (for performance and for security). Every image format has a certain pattern of bytes that you can check. A good one to look at first is JPEG. It may still be possible for a malicious user to put shellcode or other attack code in an image file that contains the proper headers, but it is much more difficult to do. This will be a performance bottleneck so plan appropriately if you implement this.