The Problem With innerHTML

The innerHTML property is extremely popular because it provides a simple way to completely replace the contents of an HTML element. Another way to do that is to use the DOM Level 2 API (removeChild, createElement, appendChild) but using innerHTML is by far the easiest and most efficient way to modify the DOM tree. However, innerHTML has few problems of its own that you need to be aware of:

  • Improper handling of the innerHTML property can enable script-injection attacks on Internet Explorer when the HTML string contains a script tag marked as deffered: <script defer>...<script>
  • Setting innerHTML will destroy existing HTML elements that have event handlers attached to them, potentially creating a memory leak on some browsers.

There are a few other minor drawbacks worth mentioning:

  • You don’t get back a reference to the element(s) you just created, forcing you to add code to retrieve those references manually (using the DOM APIs…)
  • You can’t set the innerHTML property on all HTML elements on all browsers (for instance, Internet Explorer won’t let you set the innerHTML property of a table row element)

I am more concerned with the security and memory issues associated with using the innerHTML property. Obviously, this problem is nothing new, and very bright people have already figured out ways to work around some of these problems.

Douglas Crockford wrote a purge function that takes care of breaking some circular references caused by attaching event handlers to HTML elements, allowing the garbage collector to release all the memory associated with these HTML elements.

Removing the script tags from the HTML string is not as easy as it seems. A regular expression should do the trick, although it’s hard to know whether it covers all possible cases. Here is the one I came up with:

/<script[^>]*>[\S\s]*?<\/script[^>]*>/ig

Now, let’s put these two techniques together in a single setInnerHTML function (Update: Thanks to those who commented on this article. I fixed the errors/holes you mentioned, and also decided to bind the setInnerHTML function to YAHOO.util.Dom)

YAHOO.util.Dom.setInnerHTML = function (el, html) {
    el = YAHOO.util.Dom.get(el);
    if (!el || typeof html !== 'string') {
        return null;
    }

    // Break circular references.
    (function (o) {

        var a = o.attributes, i, l, n, c;
        if (a) {
            l = a.length;
            for (i = 0; i < l; i += 1) {
                n = a[i].name;
                if (typeof o[n] === 'function') {
                    o[n] = null;
                }
            }
        }

        a = o.childNodes;

        if (a) {
            l = a.length;
            for (i = 0; i < l; i += 1) {
                c = o.childNodes[i];

                // Purge child nodes.
                arguments.callee(c);

                // Removes all listeners attached to the element via YUI's addListener.
                YAHOO.util.Event.purgeElement(c);
            }
        }

    })(el);

    // Remove scripts from HTML string, and set innerHTML property
    el.innerHTML = html.replace(/<script[^>]*>[\S\s]*?<\/script[^>]*>/ig, "");

    // Return a reference to the first child
    return el.firstChild;
};

Voila! Let me know if there is anything else that should be part of this function, or if I missed anything obvious in the regular expression.

Update: There are obviously many more ways to inject malicious code in a web page. The setInnerHTML function barely normalizes the <script> tag execution behavior across all A-grade browsers. If you are going to inject HTML code that cannot be trusted, make sure you sanitize it first on the server side. There are many libraries available for this.

Update: IE8 has a new toStaticHTML function attached to the window object that removes any potentially executable content from an HTML string!

Medical Devices http://www.kalinka-store.com/.

30 thoughts on “The Problem With innerHTML

  1. Andrew Mattie

    Forgive me for asking what might seem like a dumb question, but what’s the point of replacing script tags in the innerHTML? If you had to replace script tags, it would mean that you couldn’t trust it to be free of XSS attacks (which is certainly a valid concern). However, if the code wasn’t trustworthy, couldn’t someone just set the innerHTML to something like ”? Or, better yet, the innerHTML could be something like ”

    Additionally, it seems that your regex may have a slight flaw in it. A (.) in JS’s regex parser doesn’t match a \n, so if I throw in at least one of those into my evil script, I can get stuff through the regex. Just as a check, I threw up a quick test page to see what happens w/ that regex and some \n chars: http://www.akmattie.net/blog/testing/regex-script-replacement.html. I think a better solution might be to use /]*>(.|\n)*/g or /]*>(.|\n)*?/g (depending on your preference for greedy vs lazy in case the evil person tries to throw in a \n//\n into the attempted XSS attack).

    Anyway, am I seeing that correctly?

  2. Anonymous for the time being

    Regarding the regular expression for scripts – how about ingoring case with «i» modifier and detecting unclosed script tags? And what is «\\?» for? «»?

  3. Anonymous for the time being

    Argh it even doesn’t allow me to post ending script tag with zero-width joiners between every letter. I’ll try with a space: «».

  4. Pingback: The Problem With innerHTML | David Bisset: Web Designer, Coder, Wordpress Guru

  5. Robert Nyman

    Good to see that you acknowledge that, in real-life scenarios, innerHTML is sometimes the best approach (especially with certain AJAX scenarios). Douglas solution is great, and nice with your addition.

    I can’t help of being wary of using a variable named html, though. Seems like a perfect scenario where IE would get a messed up reference in certain cases… :-)

  6. tony

    Although a great idea, the code still references Douglas’ purge function (which in your example) is not included.

    I too wonder about the script stuff….

    a.) As long as I am not exposing my method to other 3rd parties to use, then I don’t see the worry.
    b.) as mentioned by others, it should be a case insensitive regex
    c.) since I may want to insert script tags (quite likely), what I would rather have, is a regex to remove the defer attribute!

  7. Nicholas C. Zakas

    Doug’s purge method works only when you’ve attached event handlers directly to the element, via element.onclick, etc. If you attach events using attachEvent(), the circular references remain and there’s no way to know that the event handlers are still attached before changing innerHTML.

    Another problem with innerHTML is that it doesn’t stay true to the string you’ve used. It will change tag case, add/remove attributes, etc., meaning that you can’t necessarily test innerHTML to see if it contains the text you had previous set it to.

    Conclusion: innerHTML is nifty but far from perfect. :)

    P.S. Looks like you’ve got some spam comments.

  8. Julien Lecomte Post author

    @Nicholas

    Indeed, I forgot to mention that important point. I would like to see something similar to setInnerHTML as part of the YUI library. That way, events attached using the DOM Level 2 API could be automatically removed. Your second point is less of an issue for most uses of the innerHTML property.

    I don’t see any spam comments.

  9. Mathieu 'p01' Henri

    Script tags are just one out of many ways to inject malicious code. Stripping only the script tag is sub par. You must also be careful of IFRAMEs, OBJECT, EMBED, IMG … and even some attributes ( some browsers still allow script execution in style=”background:url(javascript:…);”.

    The only way to sanitize properly markup to be injected is to use a whitelist of tags and attributes … or simply don’t inject markup using innerHTML.

  10. Pingback: Javascript News » Blog Archive » The problem with innerHTML

  11. Pingback: The problem with innerHTML « outaTiME

  12. Pingback: Ajax Girl » Blog Archive » The problem with innerHTML

  13. Pingback: Julien Lecomte de Yahoo! nos cuenta los problemas de innerHTML - elWebmaster.com

  14. Dylan Oudyk

    Nice article, now I won’t feel so dirty using innerHTML as much as I do. Regarding Nicholas’ comment couldn’t you use YUI’s YAHOO.util.Event.purgeElement method to remove the event handlers set with attachEvent (or would that only be for those set with the Event utility). Either way it’s a nice solution but will definitely be a task to forge a silver bullet to solve all scenarios.

    thanks again,

    Dylan

  15. Oliver Tse

    To Nicholas’ point, you break circular references differently with DOM 0 events ( ex. inline onclick events, etc. ) than with DOM 2 events ( ex. attachEvent, addEventListener ).

    So, to be effective, you need a way to do both.

  16. Kyle Butt

    Another Caveat with innerHTML. Internet Explorer normalizes all the space when inserting into innerHTML. Including a pre-tag. I ran into this with an app I wrote that processed xml using xslt, and inserted the results into the tree. The xslt preserved space, and went back and changed \n to ‘s after the fact, but that didn’t work in IE.

  17. Miguel Ventura

    There isn’t really much use on using regex to filter tags. Regex aren’t powerful enough on their own, without grammars (<a title=”x”>bbb<a title=”"> kind of stuff would filter too much) and therefore you can always get around it. Try filtering something like

    html = “<scrivoid(0);pt defer=\”true\”>alert(‘foo’);”

    So there’s no real security benefit. Filtering scripts before adding them to any DOM structure (even one that doesn’t belong to the document tree) won’t be an easy job…

  18. Pingback: Blog do Márcio d’Ávila

  19. Steven Levithan

    /<script[^>]*>((.|[\r\n])*?)<\\?\/script>/ig

    …should be:

    /<script[^>]*>[\S\s]*?<\/script>/ig

    For one, it’s more readable, and secondly it’s much more efficient. If you want to also match self-closed script elements you could use:

    /<script[^>]*(?:\/>|>[\S\s]*?<\/script>)/ig

  20. Steven Levithan

    One more security hole to plug is </script> tags with whitespace or other attributes, which I believe browsers allow (at least whitespace). So you could modify the last regex to /<script[^>]*(?:\/>|>[\S\s]*?<\/script[^>]*>)/ig

  21. Amit Jaiswal

    What I have noticed while using innerHTML is that if I create a select box through innerHTML in a DIV which is in a form and then I submit the for the select box is not captured in request(JAVA) in case of Firefox but gets captured in case of IE.
    Is this due to
    “You don’t get back a reference to the element(s) you just created, forcing you to add code to retrieve those references manually (using the DOM APIs…)”

  22. Joseph

    Really nice article Julien!

    but, simple using innerHTML is not enough, especially it is problematic with html form elements, “table” and “option” elements.

    I think your code needs some improvements to fix all noted problems with various tags. see code below:


    wrapper = document.createElement('div');

    if (/^t(body|foot|head)$/i.test(el.tagName)) {
    wrapper.innerHTML = '' + html + '';
    wrapper = wrapper.getElementsByTagName('tbody')[0];
    } else if (/^select$/i.test(el.tagName)) {
    wrapper.innerHTML = '' + html + '';
    wrapper = wrapper.getElementsByTagName('select')[0];
    } else {
    wrapper.innerHTML = html;
    }

    while(wrapper.firstChild) {
    el.appendChild(wrapper.firstChild);
    }

  23. Joseph

    I’m sorry, code was posted incorrectly:

    wrapper = document.createElement('div');

    if (/^t(body|foot|head)$/i.test(el.tagName)) {
    wrapper.innerHTML = '<table>' + html + '</table>';
    wrapper = wrapper.getElementsByTagName('tbody')[0];
    } else if (/^select$/i.test(el.tagName)) {
    wrapper.innerHTML = '<select name="tmp">' + html + '</select>';
    wrapper = wrapper.getElementsByTagName('select')[0];
    } else {
    wrapper.innerHTML = html;
    }

    while(wrapper.firstChild) {
    el.appendChild(wrapper.firstChild);
    }

  24. Frank Thuerigen

    @Joseph,

    if you have too many table or select/option entries – recreate the whole table or select. If it is still too many, your app has a faulty design anyway, since bloated selects are not user-friendly and max column tables should be replaced by an optimized design as well.

    my $0.02 of course.

    Merry Christmas time everybody,

    Frankie / Berlin / Germany

Comments are closed.