Written by Rhodri James; article also available at kynesim.blogspot.co.uk.
I'm in the intriguing position of being employed to work on a free software project by the Linux Foundation. I'm expanding the test coverage of the Expat library under the eagle eye of maintainer Sebastian Pipping, which is an excellent way of discovering how the brilliant, twisted and almost entirely undocumented internals work. (There will be articles. Many articles.)
One thing you expect to come across when writing tests is the occasional bug. I came across a serious one in the parsing of external parameter entities. CVE-2017-9233 (to give it its formal identification) says that bad XML in an external entity will cause the parser to go into an infinite loop and never return control to the application.
For those of you who haven't been saturated in XML terminology for the last however long, an example is in order. Suppose you have the following trivial piece of XML:
<!DOCTYPE doc SYSTEM "http://example.com/level1.dtd">
<doc/>
And the DTD level1.dtd
that it reads:
<!ELEMENT doc EMPTY>
<!ENTITY % e SYSTEM "http://example.com/level2.ent">
%e;
And the external entity definition in yet another resource, level2.ent
:
<!ELEMENT el EMPTY>
<el/>
Now a quick riffle through the XML standards will tell you that
ordinary elements such as <el>
aren't allowed in DTDs. All you can
legally use are the DTD declarations <!ELEMENT>
, <!ATTLIST>
,
<!ENTITY>
and <!NOTATION>
. When our entity %e
in level.dtd gets
substituted in, it will put the <el/>
element straight into the DTD,
leaving us with a malformed DTD. The parser should detect this and
reject the whole shebang.
What actually happened is hard to follow in the source code. (Many
articles.) When parsing reached the 'e' of <el/>
, the tokenizer
recognised it as a valid start of an ordinary element (which is true)
and returned the appropriate value, XML_TOK_INSTANCE_START
.
The calling code knew how to recognise the limited number of tokens that are legal and most of the tokens that aren't legal in a DTD, but unfortunately it missed this specific case. Without any better instructions from its decision logic, the code assumed that it had found something valid and tried tokenising again, in case there was more text to be parsed.
It did this expecting that its internal pointers were updated to point
to that unparsed text. That would have been true if it had dealt with
something legal, but since "<e" isn't a legal part of a DTD the
pointers had been left alone. The tokenizer therefore saw "<e" again,
returned XML_TOK_INSTANCE_START
again, and the whole thing repeated
until the application was killed.
In brief:
"How does this affect me?" you may well ask. "I don't use Expat. I don't even write in C. How can I possibly be affected?" The answer is, you may well be using Expat unknowingly. There are wrapper libraries for Python, Java and many other languages for Expat. Many other application and libraries (particularly C and C++ libraries like Poco, libDOM or libwww) use Expat under the hood.
The libraries often enable external entity parsing, which does potentially allow this bug to bite your application. Some of them will, or used to, download the URIs for you, which is a serious problem. Some of the applications are just parsing local configuration files, but even those might follow a DTD if one was inserted into the configuration file somehow. Other applications explicitly use external DTDs, leaving you vulnerable if those sites are compromised or malicious.
In short, it's entirely possible for you to be using Expat and not know it.
First and most obviously, upgrade. The current version of libexpat has patched this bug, and a few other things; read the change log for details.
The other thing you should always do is consider how you use your XML parser. I am not a security expert by any stretch of the imagination, but even I know not to download arbitrary URIs, for example. Reading "https://www.w3c.org/xml/fluffy.ent" is probably safe; reading "http://evil.haxx0rs.org/xml/i-pwn-u.dtd" probably isn't. You should check that any library you use doesn't automatically download arbitrary URIs for you, and you should be careful about what URIs you do allow to be downloaded.
If you want a moral from this article: always test. Murphy's Law ensures there will always be bugs in your code, and you should make at least some effort to find them before they annoy other people. And if by chance you should be one of those people annoyed by a bug, let the author know. He or she will usually be glad for the feedback and want it fixed as much as you do!