Many feed elements may contain HTML markup, and many feed aggregators use a web browser (or browser component) to display content. By default, Universal Feed Parser sanitizes HTML markup in several elements, removing HTML tags and attributes that could introduce JavaScript or other security risks.
These elements are sanitized by default:
The following HTML tags are allowed by default (all others are stripped): a, abbr, acronym, address, area, b, big, blockquote, br, button, caption, center, cite, code, col, colgroup, dd, del, dfn, dir, div, dl, dt, em, fieldset, font, form, h1, h2, h3, h4, h5, h6, hr, i, img, input, ins, kbd, label, legend, li, map, menu, ol, optgroup, option, p, pre, q, s, samp, select, small, span, strike, strong, sub, sup, table, tbody, td, textarea, tfoot, th, thead, tr, tt, u, ul, var
The following HTML attributes are allowed by default (all others are stripped): abbr, accept, accept-charset, accesskey, action, align, alt, axis, border, cellpadding, cellspacing, char, charoff, charset, checked, cite, class, clear, cols, colspan, color, compact, coords, datetime, dir, disabled, enctype, for, frame, headers, height, href, hreflang, hspace, id, ismap, label, lang, longdesc, maxlength, media, method, multiple, name, nohref, noshade, nowrap, prompt, readonly, rel, rev, rows, rowspan, rules, scope, selected, shape, size, span, src, start, summary, tabindex, target, title, type, usemap, valign, value, vspace, width
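To make the whitelist idea concrete, here is a minimal sketch of tag and attribute filtering built on Python's standard `html.parser` module. It is not Universal Feed Parser's own sanitizer: the names `WhitelistSanitizer`, `sanitize`, `ALLOWED_TAGS`, and `ALLOWED_ATTRS` are hypothetical, and the two sets are small subsets of the lists above, chosen only to keep the example short.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative subsets only (assumption); the full default lists are given above.
ALLOWED_TAGS = {"a", "b", "em", "p", "span", "strong"}
ALLOWED_ATTRS = {"class", "href", "rel", "title"}


class WhitelistSanitizer(HTMLParser):
    """Hypothetical sketch: keep whitelisted tags and attributes, drop the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # the tag is dropped; its text content is still emitted as escaped text
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS  # anything not listed (style, on*, ...) is stripped
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so it can never be reinterpreted as markup.
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('<span style="color: red" class="note">hello</span>'))
```

In this sketch the style attribute disappears while class survives, because the filter only asks whether a name is on the list and never tries to judge whether a value looks dangerous.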
The unit tests for HTML sanitizing show many different examples of dangerous markup that Universal Feed Parser sanitizes by default.
One emerging technology that affects feed parsing is the inclusion of microformats within syndicated content. Briefly, publishers can add extra semantics to their HTML content using rel and class attributes. Universal Feed Parser does not currently parse microformat content within embedded HTML markup, but it doesn't destroy it either. Both the rel and class attributes survive HTML sanitizing, so applications built on Universal Feed Parser that wish to parse microformat content are free to do so.
I am often asked why Universal Feed Parser is so hard-assed about HTML sanitizing. This topic usually comes up when someone notices that Universal Feed Parser strips all style attributes by default.
Here is an incomplete list of potentially dangerous HTML tags and attributes:
- script, which can contain malicious script
- applet, embed, and object, which can automatically download and execute malicious code
- meta, which can contain malicious redirects
- onload, onunload, and all other on* attributes, which can contain malicious script
- style, link, and the style attribute, which can contain malicious script
style? Yes, style. CSS definitions can contain executable code.
Example: Embedding JavaScript in CSS
This sample is taken from http://feedparser.org/docs/examples/rss20.xml:
<description>Watch out for <span style="background: url(javascript:window.location='http://example.org/')"> nasty tricks</span></description>
This sample is more advanced, and does not contain the keyword javascript: that many naive HTML sanitizers scan for:
<description>Watch out for <span style="any: expression(window.location='http://example.org/')"> nasty tricks</span></description>
Internet Explorer for Windows will execute the JavaScript in both of these examples.
Now consider that in HTML, attribute values may be entity-encoded in several different ways.
Example: Embedding encoded JavaScript in CSS
To a browser, this:
<span style="any: expression(window.location='http://example.org/')">
is the same as this (without the line breaks):
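<span style="any: &#101;&#120;&#112;&#114;&#101;
&#115;&#115;&#105;&#111;&#110;&#40;&#119;&#105;&#110;&#100;&#111;&#119;
&#46;&#108;&#111;&#99;&#97;&#116;&#105;&#111;&#110;&#61;&#39;&#104;
&#116;&#116;&#112;&#58;&#47;&#47;&#101;&#120;&#97;&#109;&#112;&#108;
&#101;&#46;&#111;&#114;&#103;&#47;&#39;&#41;">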
which is the same as this (without the line breaks):
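<span style="any: &#x65;&#x78;&#x70;&#x72;
&#x65;&#x73;&#x73;&#x69;&#x6F;&#x6E;&#x28;&#x77;&#x69;&#x6E;
&#x64;&#x6F;&#x77;&#x2E;&#x6C;&#x6F;&#x63;&#x61;&#x74;&#x69;
&#x6F;&#x6E;&#x3D;&#x27;&#x68;&#x74;&#x74;&#x70;&#x3A;&#x2F;
&#x2F;&#x65;&#x78;&#x61;&#x6D;&#x70;&#x6C;&#x65;&#x2E;&#x6F;
&#x72;&#x67;&#x2F;&#x27;&#x29;">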
And so on, plus several other variations, plus every combination of every variation.
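A short Python sketch shows why scanning attribute text for known-bad strings breaks down here. The helper `naive_blacklist_hit` is hypothetical and stands in for the kind of naive keyword check described above; `html.unescape` from the standard library is used as a rough stand-in for the decoding a browser performs on attribute values.

```python
from html import unescape

plain = "any: expression(window.location='http://example.org/')"

# Encode every character after the "any: " prefix as a character reference,
# once in decimal and once in hexadecimal.
payload = plain[len("any: "):]
decimal = "any: " + "".join(f"&#{ord(c)};" for c in payload)
hexadecimal = "any: " + "".join(f"&#x{ord(c):X};" for c in payload)


def naive_blacklist_hit(raw_value):
    """Hypothetical naive check: look for known-bad keywords in the raw text."""
    lowered = raw_value.lower()
    return "javascript:" in lowered or "expression(" in lowered


for raw in (plain, decimal, hexadecimal):
    # First column: did the blacklist notice anything?
    # Second column: does the value decode to the same dangerous string?
    print(naive_blacklist_hit(raw), unescape(raw) == plain)
```

Only the unencoded form trips the keyword check, yet all three decode to the identical style value. A blacklist would have to decode every encoding variation, and every mix of variations, before it could even begin to match.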
The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why Universal Feed Parser uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about elements or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code. I will not attempt to preserve “just the good styles”. All styles are stripped.