zen of coding

Save now, sanitize later…

Good discussion came up on IRC today, regarding sanitizing of the data.

I’d like to clear up a little confusion and hopefully set the record straight on the sanitizing of data.

First, we need to keep in mind that there are two types of “evil” data, which most web developers are worried about.
One can hurt your database and the other one that can hurt your HTML pages.

Therefore, it is generally advisable to protect your web application against:

1. SQL Injection
and
2. XSS, or cross-site scripting.

It is, however, important not to mix up the two and realize at which point it is appropriate to protect yourself (or your app) from harm.

First, the good news. With CakePHP you are automatically protected from SQL injection IF you use CakePHP’s ORM methods (such as find() and save()) and proper array notation as outlined in the manual. The bad news is that if you are playing around with plain SQL or do not adhere to proper CakePHP syntax of writing queries, you could still be vulnerable to an SQL injection attack.

To be a little more specific, CakePHP will quote and escape all fields and values in your queries, assuming you’re following the rules, to keep your DB nice and safe.
(If you care about more details, see here).
It is important to note, that updateAll() does not escape the fields, so be cautious when using it.

So, technically you have to worry very little about SQL injection when working with CakePHP.

Now, what about user inputting some evil scripts into your blog to hijack your site?

I think the most important point here is that any HTML script, tag or trick is perfectly safe to store in your database as is.
The basic rule is that you do not need to sanitize HTML data in any way before saving it into the database.

When you are concerned about XSS it is the output of data from the database that you need to worry about. If you display raw data from the DB, it is quite possible that you’ll have “evil scripts” injected into your page. With CakePHP it is easy to combat enough with a handy h() method.

Are there cases where you must sanitize HTML before saving it to the database?
Probably, but I can’t really think of any.
Even if your application requires it, you might want to create two fields in the relevant table raw_data and sanitized_data, as an example.

P.S. AD7six just proposed an excellent question, which he also answered just as well… Why allow script code to be stored in some user’s profile? The answer is that having this raw (although unwanted) data is useful for tracking and dealing with malicious users. That’s where having two fields in the table to store sanitized and raw data might come in very handy. If you remove the script tags prior to storing the user information, it’ll become much harder to find and flag users who are obviously trying to screw with your app.

  • The only issue I see with save first then filter is potentially performance. I allow HTML and Markdown in my blog comments so I run them through HTMLPurifier to strip any truly heinous XSS or other script attacks. It would seem that doing this before save would get me considerable performance savings than running every page view through HTMLPurifier. Just a thought.

  • There is also $html->clean() which will remove things like script tags and inline js events. h() is a good sledgehammer, but if you need to allow some inline HTML using a cleaner/purifier is a good option.

  • @Mark Story

    Thanks, that’s a good pointer.

    @darren_n

    That’s a good example, in this case you obviously know what needs to be done. However, the problem (and hence some general rules) is that people are experiencing problems where source data gets replaced, mangled or “eaten” by the DB, without realizing the potential issues once that data needs to be returned as it was meant to be.

  • Howard

    In terms of sanitizing the output but allowing specific elements, i devised 2 simple regex strings which could probably be worked on some. Let me know what you all think in terms of efficiency using these vs a full fledged purifier

    To limit only specific tags you can use this regex string:
    @(<(?!/)|@i and to prevent specific element attributes, all you need to do is @(target=|onabort|onblur|onchange|onclick|ondblclick|ondragdrop|onerror|onfocus|onkeydown|onkeypress|onkeyup|onload|onmousedown|onmousemove|onmouseout|onmouseover|onmouseup|onmove|onreset|onresize|onselect|onselect|onsubmit|onunload)@i The latter could probably be improved upon with non-grouping lookaheads and lookbehinds to keep the system clean, but it's fast, and clean.

  • Howard

    Let’s try the first one again, it seems it was caught by the very same filters discussed here :)

    @({opentag}(?!/)|{opentag}/)(?!p|strong|span|em|br|a|img|div|ol|ul|li|sup|sub|style|object|param|embed).+?{closetag}@i

    Replace {closetag} with >
    Replace {opentag} with < we'll see if this one works..

  • @Howard

    Nice, thanks for sharing. I think that’ll work quite well.

  • It’s funny, I was just having this argument with a colleague the other day. I sit pretty firmly (strong opinions, loosely held and all that) in your camp, but he offered a compelling argument that I hope he’ll explain (I’ve sent him this link).

    I’d rather not store sanitized data (in favor of scrubbing it on output) because:

    1. I frequently need to access data from multiple sources (not just the app). I don’t want other sources pulling data sanitized specifically for the app.
    2. I like having access to raw data, as you say, to see who’s gaming the system in a worst case scenario or, in a better case, to better debug an issue. If something they enter causes a problem, but is now gone, I can’t debug as effectively.
    3. It could impact the user experience if I scrub something that they know they entered.

    The hybrid solution of storing both feels like clutter to me since it’s easy enough to scrub on the outbound trip, but maybe that’s the happy medium in all of this. It’s certainly one I could live with if I were forced to make that decision.

  • @Rob Wilkerson

    Very interesting. There are certainly exceptions to every rule, but the goal here was to explain the difference between escaping (saving) and sanitizing (output). Especially in terms of cake, since there is a little automagic involved.

    As darren_n pointed out, there are certainly cases where you might need to sanitize on input (performance wise). Still, as you say, I see no specific reason not to store “evil” data and sanitized data together.

  • Thanks guys, I was confused with cake Sanitize Object specially with clean() method. Sanitizing output now seems logical to me.

  • @Md. Rayhan Chowdhury

    Glad it helped.

  • I am not an expert web dev, so please correct me if I am wrong, but this is how I see it.

    1) I am less interested in trying to monitor which users are trying to inject bad HTML, than in preventing _anyone_ from doing it. I assume all users are evil. :)

    2) If the data is displayed more often than it is modified, escaping it prior to the save and displaying as-is later on, will yield better performance than escaping on every display.

    Therefore, the sensible default position seems to be escaping the saved data, not escaping displayed data.

  • @Chris

    As explained, you are risking stripping away important data, which was not meant to be harmful, but now is no longer recorded.

    Nowadays the performance issue with running something like h() on your output data is simply negligible. Especially if you use caching as needed.

    Yet, there are definitely exceptions to this…

  • RedCapeMan

    I might disagree as well (and I understand each is entitled to his/her opinion) on the same principle that I do not leave my door unlocked and set a trap inside my house to to try catch the robber

    • Petruza

      This analogy would be true for SQL injection, because like the robber, they can do permanent damage. XSS attacks on the other hand, are only harmful for the user for whom they’re rendered, and if they aren’t rendered or they are rendered escaped, they are harmless, but you may still want to have that code, in case it **looks** like an XSS attack but is actually valuable information for your application.

  • From my experience with content management systems, there is one big reason to sanitize on output – you might change the rules.

    If you make them more restrictive you will at a minimum have to write a script to re-process all the stored data and clean it again. If you make them less restrictive you may have years of data that was “cleaned” and can’t be changed back (more likely people were limiting themselves but you never know).

    If you sanitize on output then you just change one line of code or a configuration option and it works.

  • TuningGuide

    In my case I sadly can’t rely on my collegues that the sanitize the data before the use it, so I have to sanitize them before I save them. To your P.S. statement: the h method double encodes entities if the second param is defaultly true, so if you decode the data you know what the user have had input originally.

%d bloggers like this: