Hmmm...ok, I tried reproducing this yesterday with that string, to no avail, but I'll try again. In the meantime, I'm attaching a patch to install.xml which has new compiled libraries with the aforementioned possible fix.
Also reproduced for me with prebuilt Snowglobe 1.2.0 (2778) Sep 22 2009 00:11:52 (Snowglobe Test Build)
Ok, I've reproed the problem, and verified that
Revision 2784
Fix for Files affected:
The patch in the current jsoncpp lib solves the problem for escape sequences below \u0080 (128). For code points at or above that, the UTF-8 encoding differs from the actual code point (which is what the variable {unicode} is set to). To be sure we're not inserting invalid UTF-8 strings, it needs to either discard code points above 127 or properly encode them as UTF-8.
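To illustrate the "properly encode them as UTF-8" option, here is a minimal sketch of what such an encoder could look like. This is not the attached patch and not jsoncpp's own code; the function name codePointToUtf8 is hypothetical, and dropping lone surrogates and out-of-range values is an assumption made for the example.

```cpp
#include <string>

// Hypothetical helper, not part of jsoncpp or the attached patch:
// encode a single Unicode code point as a UTF-8 byte sequence.
// Assumption: lone surrogates and values above U+10FFFF yield an empty string.
std::string codePointToUtf8(unsigned int cp)
{
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx (identical to the code point itself)
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out; // lone surrogate: not a valid scalar value, drop it
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For \u0026 this returns the single byte "&", which is why appending the raw char happens to work for the case seen so far; anything from \u0080 upward needs two or more bytes.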
I have only ever seen Google Translate use the sequence \u0026, but I don't think that's guaranteed to stay that way. I'm attaching 2 alternative patches, one that does UTF-8 encoding and one that just discards code points above 127. I tested the UTF-8 encoding function pretty thoroughly, but I wrote it as a learning exercise, so it might be doing something stupid.

Thickbrick, could you submit your patch upstream? Since the problem you've identified is a theoretical one rather than one we've seen in the wild, I'm hesitant to get further out of sync with the main jsoncpp folks.
The case where we have a repro is fixed.
I'd be more comfortable if we at least discard the character if unicode > 127 (i.e. the non-UTF-8 patch I attached above).
If Google ever decides to change this behavior (is it documented somewhere?) and send an escape sequence for any character > 127, a lot of viewers in the wild will start seeing either question marks (not so bad) or random glyphs (slightly worse). Anyway, I'll try to get my patch into upstream Jsoncpp.
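For comparison, the discard-only alternative can be sketched in a few lines. Again, this is an illustration under assumptions, not the attached patch: the name appendIfAscii and the bool return value are hypothetical.

```cpp
#include <string>

// Hypothetical helper, not part of jsoncpp or the attached patch:
// append the code point only if it is 7-bit ASCII (<= 127), where the
// code point and its UTF-8 encoding are the same single byte.
// Returns false when the code point was discarded.
bool appendIfAscii(std::string& decoded, unsigned int cp)
{
    if (cp > 0x7F) {
        return false; // would need multi-byte UTF-8; drop instead of corrupting
    }
    decoded += static_cast<char>(cp);
    return true;
}
```

The trade-off is that characters above 127 silently vanish from the translated text, but the resulting string stays valid UTF-8.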
To repro: with your translate language set to English, have an object say "Estoy corriendo" (might not mean anything - I don't speak Spanish, but it translates to "I'm running").
For me (Debian Linux, system locale set to en_US.UTF-8) this produces the following:
I have only seen this happen for the "&" character, but if Google ever does that with code points higher than 127, jsoncpp needs to do proper UTF-8 encoding, and not just append the char.