Character data may be split across multiple handler calls

If you register a character data handler with XML_SetCharacterDataHandler(), you might expect that you would be called just once for each section of character data. Unfortunately you would be wrong. Expat can and will split up character data in an arbitrary manner, presenting each set of characters in separate calls. The introductory parser walkthrough shows an example of that, as the parser makes separate calls to the handler for a newline and for a pair of spaces.

Your character data handler will need to be able to cope with this potentially broken-up input. The usual approach is to accumulate characters into a user-defined data structure — the test suite has a good example of this in its CharData structure and support functions — and process the whole thing in one go using start and end element handlers. If you know how your character data will be structured, you may well be able to do better than that.

External entity sub-parsers must be created after parsing has started

There is a note in the description of XML_ExternalEntityParserCreate() in expat.h that is easy to miss:

XML_ExternalEntityParserCreate() can be called at any point after the first call to an ExternalEntityRefHandler [...]

Unfortunately this conflicts with a common programming pattern, that of creating everything ahead of time; in particular, creating sub-parsers before starting to parse the input, which will be well before any ExternalEntityRefHandler is called. This appears to work, but the sub-parsers will be unable to communicate their results back to the main parser.

There are a number of "fixes" to this issue that generally cause more serious problems of their own. The best solution is to do what expat.h says and not create a sub-parser until the appropriate handler is called. The test suite has a lot of examples of creating and disposing of parsers in the ExternalEntityRefHandler itself.

Strings pass to handlers are only temporary

Most of the handlers that users register are passed strings of one form or another. A start element handler, for instance, is passed the name of the element that has started and an array of attribute names and values. All of these strings are allocated in dynamic memory, so handlers should not store pointers to them for later use. If you really need one of these string parameters for later, make a copy of the string itself (and remember to free that when you're done!).

Element declaration handlers must free their content models

The description of the XML_ElementDeclHandler type in expat.h includes the following remark:

It's the caller's responsibility to free model when finished with it.

This is a little misleading; "the caller" in this context is the user's program, not the library as you might assume. It is up to the user code to call XML_FreeContentModel() on the model parameter passed to an element declaration handler, to free the dynamic memory used for the model.

Note that this does not mean that the content model must be freed in the element declaration handler itself. If the user wishes to keep content models around, for example to validate elements later on, they are perfectly at liberty to do so.

XML 1.1 and XML 1.0 Fifth Edition are not supported

Expat supports XML 1.0 Fourth Edition. It does not support:

If Expat is asked to parse documents that are targetting a version of the XML standard younger than XML 1.0 Fourth Edition, you may receive parse errors.