
Key: SNOW-241
Type: Sub-task
Status: Resolved
Resolution: Fixed
Priority: Normal
Assignee: Rob Linden
Reporter: Rob Linden
Votes: 0
Watchers: 1
6. Second Life Snowglobe - SNOW
SNOW-93

Problems with unicode in automatic translations

Created: 22/Sep/09 04:49 PM   Updated: 01/Oct/09 06:35 PM
Component/s: None
Affects Version/s: None
Fix Version/s: Snowglobe 1.2

File Attachments: 1. Text File SNOW-241-v2.patch (2 kB)
2. Text File SNOW-241.patch (2 kB)
3. File SNOW-241_jsoncpp_escaped_non_UTF-8.diff (0.6 kB)
4. File SNOW-241_jsoncpp_escaped_utf-8.diff (3 kB)



Description
Thickbrick wrote in a comment in SNOW-93:

Google sometimes returns stuff like "\u0026#39;" which fails to be substituted into an apostrophe.
for example, the result of this:
http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=Estoy%20corriendo&langpair=|en
Maybe this is more of a google bug, but it's pretty noticeable.

  • Maybe add a string_replace_all( translation, "\\u0026","&"); before doing html entities replacement.
  • Like Rob said above, possibly use generic html entity stripping.
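The first suggestion above can be sketched as follows. This is only an illustration, not the viewer's code: `replaceAll` and `cleanTranslation` are hypothetical names standing in for whatever string-replacement helper the viewer actually uses, and the only entity handled is `&#39;`.

```cpp
#include <string>

// Hypothetical helper: replace every occurrence of `from` in `text` with `to`.
std::string replaceAll(std::string text, const std::string& from, const std::string& to) {
    if (from.empty()) return text;  // nothing to search for; avoids an endless loop
    std::string::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // resume searching after the inserted text
    }
    return text;
}

// Hypothetical cleanup step: undo the undecoded JSON escape first,
// so that the HTML entity replacement can then find "&#39;".
std::string cleanTranslation(const std::string& raw) {
    std::string s = replaceAll(raw, "\\u0026", "&");  // literal backslash + "u0026"
    return replaceAll(s, "&#39;", "'");
}
```

The order matters: the entity replacement only matches once the literal `\u0026` has been turned back into `&`. Decoding the escape in the JSON parser itself makes this workaround unnecessary.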

I'm not able to repro this problem yet, so even though I've got a possible fix (see http://bitbucket.org/rob_linden/jsoncpp/changeset/28fca04e7817/ ), I'd like to repro the problem before committing.



Thickbrick Sleaford added a comment - 23/Sep/09 05:52 AM
To clear up what I wrote there: what I am seeing is that in any translation that includes an "&#NN;" entity, the "&" character comes from Google as \u0026. When jsoncpp doesn't replace this with a "&", the html entity replacement fails (in LLTranslate::TranslationReceiver::completedRaw()).

To repro: with your translate language set to English, have an object say "Estoy corriendo" (might not mean anything - I don't speak Spanish, but it translates to "I'm running").

For me (Debian linux, system locale set to en_US.UTF-8) this produces the following:

Object: Estoy corriendo (I#39;m running)

I have only seen this happen for the "&" character, but if Google ever does that with code points higher than 127, jsoncpp needs to do proper UTF-8 encoding, and not just append the char.
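To illustrate why the missing "&" matters, here is a minimal sketch of decimal entity replacement. It is not the code in LLTranslate::TranslationReceiver::completedRaw(); `decodeNumericEntities` is a hypothetical function that only handles "&#NN;" entities in the ASCII range.

```cpp
#include <cctype>
#include <string>

// Replace decimal entities of the form "&#NN;" with the ASCII character they
// name. Anything that does not match that exact shape is copied unchanged.
std::string decodeNumericEntities(const std::string& in) {
    std::string out;
    std::string::size_type i = 0;
    while (i < in.size()) {
        if (in[i] == '&' && i + 2 < in.size() && in[i + 1] == '#') {
            std::string::size_type j = i + 2;
            unsigned int value = 0;
            // Accumulate digits; stop early once the value leaves the ASCII range.
            while (j < in.size() && value <= 127 &&
                   std::isdigit(static_cast<unsigned char>(in[j]))) {
                value = value * 10 + static_cast<unsigned int>(in[j] - '0');
                ++j;
            }
            if (j > i + 2 && j < in.size() && in[j] == ';' &&
                value >= 1 && value <= 127) {
                out += static_cast<char>(value);
                i = j + 1;  // skip past the terminating ';'
                continue;
            }
        }
        out += in[i];
        ++i;
    }
    return out;
}
```

With the "&" present, "I&#39;m running" decodes to "I'm running". Without it, "I#39;m running" contains nothing that looks like an entity, so it passes through unchanged, which matches the output reported above.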


Rob Linden added a comment - 23/Sep/09 02:04 PM
Hmmm...ok, I tried reproducing this yesterday with that string, to no avail, but I'll try again. In the meantime, I'm attaching a patch to install.xml which has new compiled libraries with the aforementioned possible fix.

Thickbrick Sleaford added a comment - 23/Sep/09 02:36 PM
Also reproduced for me with prebuilt Snowglobe 1.2.0 (2778) Sep 22 2009 00:11:52 (Snowglobe Test Build)

Rob Linden added a comment - 23/Sep/09 05:52 PM
Ok, I've reproed the problem, and verified that SNOW-241-v2.patch fixes the problem (at least on Mac and Linux)

lindenrobot added a comment - 23/Sep/09 06:26 PM
Revision 2784 by rob.linden on 2009-09-23 20:26:55 -0500 (Wed, 23 Sep 2009)

Fix for SNOW-241 (Problems with unicode in automatic translations)

Files affected:
U projects/2009/snowglobe/trunk/install.xml


Thickbrick Sleaford added a comment - 24/Sep/09 08:24 AM - edited
The patch in the current jsoncpp lib solves the problem for escape sequences below \u0080 (128). For code points above that, the UTF-8 encoding differs from the actual code point (which is what the variable unicode is set to). To be sure we're not inserting invalid UTF-8 strings, it needs to either discard code points above 127, or properly encode them as UTF-8.

I have only ever seen google translate use the sequence \u0026 , but I don't think that's guaranteed to stay that way.

I'm attaching 2 alternative patches, one that does UTF-8 encoding and one that just discards code points above 127.

I tested the UTF-8 encoding function pretty thoroughly, but I wrote it as a learning exercise, so it might be doing something stupid.
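For reference, the two alternatives can be sketched roughly as below. This is not the content of either attached patch; `codePointToUtf8` and `asciiOrDiscard` are hypothetical names, and the sketch assumes the escape has already been parsed into an unsigned integer code point.

```cpp
#include <string>

// Alternative 1: encode one Unicode code point as UTF-8.
// Returns an empty string for values that are not valid scalar values.
std::string codePointToUtf8(unsigned int cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // lone surrogate: reject
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Alternative 2: keep ASCII code points and silently drop everything else.
std::string asciiOrDiscard(unsigned int cp) {
    return cp <= 0x7F ? std::string(1, static_cast<char>(cp)) : std::string();
}
```

Both functions return "&" for \u0026 (0x26), so the known repro case behaves the same either way. They differ only above 127: \u00e9 becomes the two bytes C3 A9 under the first policy and is dropped under the second. A single \uXXXX escape only reaches 0xFFFF, so characters outside that range arrive as a surrogate pair that would have to be combined before encoding; this sketch does not do that.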


Rob Linden added a comment - 30/Sep/09 11:33 AM
Thickbrick, could you submit your patch upstream? Since the problem you've identified is a theoretical one rather than one we've seen in the wild, I'm hesitant to further get out of sync with the main jsoncpp folks.

Rob Linden added a comment - 30/Sep/09 11:34 AM
The case where we have a repro is fixed.

Thickbrick Sleaford added a comment - 01/Oct/09 06:35 PM
I'd be more comfortable if we at least discard the character if unicode > 127 (i.e. the non_UTF-8 patch I attached above)

If Google ever decides to change this behavior (is it documented somewhere?) and send an escape sequence for any character > 127, a lot of viewers in the wild will start seeing either question marks (not so bad) or random glyphs (slightly worse.)

Anyway, I'll try to get my patch into upstream Jsoncpp.