Accent problems with Freeze::map

Hi!

I am trying to implement an application using Freeze, and I am experiencing problems when working with accented characters like "é" or "è".

When I insert such a string as a key (or as a value) in a Freeze::map, I cannot retrieve it through Freeze anymore (the string comes back empty), even though the insertion did reach the db, as I can check with db_dump.

I am not sure whether this is a problem with Freeze, with Berkeley DB, with my locale settings, or with a UTF-8 / ISO-8859-15 conflict, but either way I cannot make it work. Below I am pasting information that may be relevant.

I am using Linux, kernel version 2.4 (Mandrake 9.1). It seems to me that this Mandrake distribution has trouble with encodings, but I don't know enough about the subject at the moment.

The following code shows the problem:

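#include <Freeze/Freeze.h>
#include <StringStringMap.h>
#include <iostream>

int
main(int argc, char* argv[])
{
    // Set up Ice, the Freeze environment in directory "db", and the database "simple".
    Ice::CommunicatorPtr communicator = Ice::initialize(argc, argv);
    Freeze::DBEnvironmentPtr dbEnv = Freeze::initialize(communicator, "db");
    Freeze::DBPtr simpleDB = dbEnv->openDB("simple", true);
    StringStringMap Map(simpleDB);

    // Start from an empty map.
    Map.clear();

    // One plain ASCII value and one value containing an accented character.
    Map.insert(std::make_pair("yes", "elephant"));
    Map.insert(std::make_pair("non", "élephant"));

    // Print every key/value pair stored in the map.
    for(StringStringMap::const_iterator p = Map.begin(); p != Map.end(); ++p)
    {
        std::cout << p->first << " " << p->second << std::endl;
    }

    // Shut everything down in reverse order of creation.
    simpleDB->close();
    dbEnv->close();
    communicator->destroy();
    return 0;
}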

When running this code, the output is:

non
yes elephant

instead of

non élephant
yes elephant


The command "db_dump -p simple" outputs:

VERSION=3
format=print
type=btree
db_pagesize=4096
HEADER=END
\0a<Key>non</Key>
\0a<Value>\e9lephant</Value>
\0a<Key>yes</Key>
\0a<Value>elephant</Value>
DATA=END
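
A note on reading this dump: db_dump -p prints non-printable bytes as backslash-hex escapes, so \e9 is the single byte 0xE9, which is "é" in ISO-8859-15. In UTF-8, a byte in the range 0xE0-0xEF announces a three-byte sequence and must be followed by two continuation bytes, so the stored value "\e9lephant" is not well-formed UTF-8. The following is a minimal sketch of a structural UTF-8 check that makes this visible. The name looksLikeUtf8 is hypothetical and not part of Ice or Freeze, and the check is deliberately simplified: it verifies lead and continuation bytes only and does not reject every overlong or surrogate encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an Ice/Freeze API): returns true if every byte of
// `s` fits the lead-byte/continuation-byte structure of UTF-8.
// Simplified on purpose: some overlong and surrogate encodings still pass.
bool looksLikeUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);

        // Number of continuation bytes announced by the lead byte.
        std::size_t extra = 0;
        if (c < 0x80)                    extra = 0;  // plain ASCII
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;  // 2-byte sequence
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;  // 3-byte sequence
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;                           // stray or invalid lead byte

        // The announced continuation bytes must all be present...
        if (extra >= s.size() - i) return false;

        // ...and each one must have the bit pattern 10xxxxxx.
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

With this sketch, looksLikeUtf8("\xE9lephant") returns false because the byte after 0xE9 is the ASCII letter "l", while looksLikeUtf8("\xC3\xA9lephant") returns true because 0xC3 0xA9 is the two-byte UTF-8 encoding of "é".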


The "StringStringMap" class was generated with:
slice2freeze --dict StringStringMap,string,string StringStringMap

My c++ compiler is gcc 3.3.

The "locale" command outputs:

LANG=fr_CH.ISO-8859-15
LC_CTYPE=fr_CH.ISO-8859-15
LC_NUMERIC=fr_CH.ISO-8859-15
LC_TIME=fr_CH.ISO-8859-15
LC_COLLATE=fr_CH.ISO-8859-15
LC_MONETARY=fr_CH.ISO-8859-15
LC_MESSAGES=fr_CH.ISO-8859-15
LC_PAPER=fr_CH.ISO-8859-15
LC_NAME=fr_CH.ISO-8859-15
LC_ADDRESS=fr_CH.ISO-8859-15
LC_TELEPHONE=fr_CH.ISO-8859-15
LC_MEASUREMENT=fr_CH.ISO-8859-15
LC_IDENTIFICATION=fr_CH.ISO-8859-15
LC_ALL=


I hope I have given you enough information.

Thank you in advance for any hint.

Comments

  • marc
    marc Florida
    Can you try the following C++ code:

    std::string s = "élephant";
    std::cout << s << std::endl;

    What does it print?

    As a general note, I recommend using Unicode and std::wstring, and then converting to std::string:

    std::wstring ws = ... "élephant" in unicode format ...
    std::string s = IceUtil::wstringToString(ws); // s now holds "élephant" in utf-8

    I don't think that this is the cause of your problems. But the on-the-wire string representation for the Ice protocol is UTF-8, not ISO-8859-15. If you don't use UTF-8, you won't be able to interoperate for example with Ice for Java if you use non-ASCII strings.
  • Originally posted by marc
    Can you try the following C++ code:

    std::string s = "élephant";
    std::cout << s << std::endl;

    What does it print?

    It prints:
    élephant

    Originally posted by marc
    As a general note, I recommend using Unicode and std::wstring, and then converting to std::string:

    std::wstring ws = ... "élephant" in unicode format ...
    std::string s = IceUtil::wstringToString(ws); // s now holds "élephant" in utf-8

    I tried the following modification of my previous code example:
    (...)
    std::string ss="élephant";
    std::wstring ws(ss.begin(),ss.end());// in unicode format ...
    Map.insert(std::make_pair("yes","elephant"));
    Map.insert(std::make_pair("non",IceUtil::wstringToString(ws)));
    (...)
    But the output is still wrong. By the way:
    std::string ss="élephant";
    std::wstring ws(ss.begin(),ss.end());// in unicode format ...
    std::string s = IceUtil::wstringToString(ws); // s now holds "élephant" in utf-8
    std::cout << s << std::endl;

    prints:
    élephant
    as well...

    Originally posted by marc
    I don't think that this is the cause of your problems. But the on-the-wire string representation for the Ice protocol is UTF-8, not ISO-8859-15. If you don't use UTF-8, you won't be able to interoperate for example with Ice for Java if you use non-ASCII strings.

    I think I will have to learn a little bit about wchar_t, UTF-8, and all that stuff. That was on my "to learn" list anyway...

    Thank you.
  • marc
    marc Florida
    Originally posted by sylvain

    I tried the following modification of my previous code example:
    (...)
    std::string ss="élephant";
    std::wstring ws(ss.begin(),ss.end());// in unicode format ...
    Map.insert(std::make_pair("yes","elephant"));
    Map.insert(std::make_pair("non",IceUtil::wstringToString(ws)));
    (...)

    That won't work. You must put "élephant" in Unicode format into your editor.

    The reason why all this doesn't work is that you are using "élephant" in ISO format, but the XML encoding in Freeze expects it in Unicode format.

    You can either use an editor that supports Unicode, or enter the escape sequence that represents "élephant" in Unicode, or look for a method that converts ISO to Unicode.

    Note that future versions of Freeze will be less sensitive to such problems, once we use binary encodings for Freeze. However, it's still wrong, because you are using ISO strings where Ice expects Unicode.

    Cheers,
    Marc
  • Obviously, I should dive a little bit more into the subject...

    But there is something I don't understand. It seems that the Ice runtime correctly received the accented word "élephant", because in the db the entry is
    "\0a<Value>\e9lephant</Value>"
    for "élephant", so with a \e9 for the é (\00e9 is the Unicode code point for 'é', no?)
    and
    "\0a<Value>elephant</Value>"
    for "elephant".

    So why does the iteration through the map retrieve an empty string for the value "\e9lephant"? This string was written by the Ice runtime, so the Ice runtime should be able to retrieve it. Because of the ISO/Unicode mismatch, I don't expect the exact "élephant" string to be retrieved, but maybe something like "lephant" ...

    I also tried accented words like "mariés", where the accented letter is not the first one, but I still get an empty string. Where have my non-accented characters gone?

    regards,
    Sylvain
  • Just for your information, in case this is relevant:

    I tested a little further by adding a Java client that sends a string to a (C++) servant, which inserts it in a StringStringMap in the same way as the code in my first post and then prints out the whole content of the Freeze::map.

    The Java client is just a text field (Swing) and a button. The content of the text field is sent to the servant when the button is clicked, using a simple operation from an Ice interface: "void add(string s)".

    My Java system is the one bundled with the Sun NetBeans 3.5 IDE (so it's jsdk/jre 1.4.2).

    When I enter "élephant", the servant says it received " élephant " and puts it in the db, but it is unable to retrieve it from the db. More precisely, the value returned is an empty string (not a garbled string). The same happens for words whose accented characters are not at the beginning.

    A "db_dump -p" of the db file outputs:

    \0a<Value>\c3\a9lephant</Value>

    I don't know exactly how my Java handles the accented character or which charset it uses, but I think it should be independent of the system's charset...

    Anyway, as far as I understand, the problem is the same: the Ice runtime is writing to the db something it cannot retrieve. If I write a distributed program, how can I ensure that all the clients are running on systems with the correct charset?


    Regards,
    Sylvain
  • mes
    mes California
    Hi,

    I've been able to reproduce this problem. It looks like an issue with Xerces-C++, which is used in the XML encoding of Freeze maps. As an alternative, you can use the binary encoding by specifying the option --binary to slice2freeze.

    I will reply again when I've resolved this issue.

    - Mark
  • mes
    mes California
    This was caused by a bug in Ice, and has been fixed for the next release. If you would like a patch sooner, please let me know.

    Thanks for the bug report.

    - Mark
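
Marc's suggestion to "look for a method that converts ISO to Unicode" can be sketched in plain C++ without any Ice calls. The following is a minimal sketch, assuming the input bytes are ISO-8859-15 (the locale shown in the first post); the names latin9ToCodePoint, appendUtf8 and latin9ToUtf8 are hypothetical helpers, not Ice or Freeze APIs. It only addresses the ISO/UTF-8 mismatch and is independent of the Ice bug that Mark reports as fixed.

```cpp
#include <string>

// Hypothetical helper: map one ISO-8859-15 byte to its Unicode code point.
// ISO-8859-15 matches ISO-8859-1 (code point == byte value) except for the
// eight positions listed here.
unsigned int latin9ToCodePoint(unsigned char c)
{
    switch (c)
    {
        case 0xA4: return 0x20AC; // euro sign
        case 0xA6: return 0x0160; // S with caron
        case 0xA8: return 0x0161; // s with caron
        case 0xB4: return 0x017D; // Z with caron
        case 0xB8: return 0x017E; // z with caron
        case 0xBC: return 0x0152; // OE ligature
        case 0xBD: return 0x0153; // oe ligature
        case 0xBE: return 0x0178; // Y with diaeresis
        default:   return c;
    }
}

// Hypothetical helper: append the UTF-8 encoding of `cp` to `out`.
// Only code points below U+10000 are handled, which covers ISO-8859-15.
void appendUtf8(std::string& out, unsigned int cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a whole ISO-8859-15 string to UTF-8.
std::string latin9ToUtf8(const std::string& in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        appendUtf8(out, latin9ToCodePoint(static_cast<unsigned char>(in[i])));
    }
    return out;
}
```

For example, latin9ToUtf8("\xE9lephant") yields the bytes 0xC3 0xA9 followed by "lephant", which is the same byte sequence that db_dump showed for the string sent by the Java client. A string converted this way would be passed to the map in place of the raw ISO-8859-15 literal.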