Sunday, March 15, 2009

UTF-8 Parser Stress Testing

If your product accepts UTF-8 text as input, then Markus Kuhn is your friend. His "UTF-8 decoder capability and stress test" is a plain-text file containing dozens of malformed UTF-8 byte sequences. This file should be part of your standard acceptance suite: you need to know whether your UTF-8 parser misbehaves (for instance, crashes) when it is handed malformed input.

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
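To make the idea concrete, here is a minimal Python sketch of what such an acceptance check might look like. It is not taken from Kuhn's file or from any particular product: the names `stress_decoder`, `strict_decode`, and `SAMPLES` are made up for illustration, the handful of byte sequences are just well-known kinds of malformed UTF-8 (a lone continuation byte, an overlong encoding, an encoded surrogate, a truncated sequence, an invalid byte), and Python's built-in strict decoder stands in for whatever parser you actually ship.

```python
from typing import Callable, Iterable, List, Tuple


def strict_decode(data: bytes) -> str:
    """Stand-in for the decoder under test (assumption: yours goes here)."""
    return data.decode("utf-8", errors="strict")


def stress_decoder(
    decode: Callable[[bytes], str], lines: Iterable[bytes]
) -> List[Tuple[int, str]]:
    """Feed each byte string to `decode` and record how it reacted."""
    results = []
    for number, line in enumerate(lines, start=1):
        try:
            decode(line)
            outcome = "accepted"
        except UnicodeDecodeError:
            # A clean, expected rejection of malformed input.
            outcome = "rejected"
        except Exception as exc:
            # Anything else is the kind of failure the test is hunting for.
            outcome = "crashed: " + type(exc).__name__
        results.append((number, outcome))
    return results


# Hypothetical sample inputs: two valid lines, then five malformed ones.
SAMPLES = [
    b"plain ASCII",
    "\u03ba\u1f79\u03c3\u03bc\u03b5".encode("utf-8"),  # valid multi-byte text
    b"\x80",          # lone continuation byte
    b"\xc0\xaf",      # overlong encoding of "/"
    b"\xed\xa0\x80",  # UTF-8-encoded surrogate U+D800
    b"\xe2\x82",      # truncated three-byte sequence
    b"\xff",          # byte that never appears in UTF-8
]

if __name__ == "__main__":
    for number, outcome in stress_decoder(strict_decode, SAMPLES):
        print(number, outcome)
```

To run the same loop over the real test file, one option is to save it locally, read it in binary mode, and split on newlines, for example `open("UTF-8-test.txt", "rb").read().split(b"\n")`, then pass the resulting list in place of `SAMPLES`. Reading in binary matters here: opening the file in text mode would let Python's own decoder choke on the malformed bytes before your parser ever sees them.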

And if you're weak on the whole concept of Unicode and character encoding in general, Joel Spolsky's primer is a good place to start: http://www.joelonsoftware.com/articles/Unicode.html
