Python and Unicode

For languages other than C++, Ice encodes strings in their native Unicode representation, so applications can transparently use characters from non-English alphabets.
...says chapter 32.21 of the Ice manual. But if that is the intent, shouldn't IcePy map Slice strings to Python Unicode strings instead of 8-bit strings?

For example, the Python implementation of the "Hello World" demo server:
class PrinterI(Demo.Printer):
  def printString(self, s, current=None):
    print s
prints correctly only under a UTF-8 locale. 's' is an 8-bit string that will always be UTF-8-encoded, as long as printString() is called from a correctly written client. A more correct server implementation would look something like:
class PrinterI(Demo.Printer):
  def printString(self, s, current=None):
    print s.decode('utf8')
which would, however, generate a UnicodeEncodeError if a client sends a string with characters that are not representable in the server's locale, so more effort is needed to have robust printing in the server (the best I've been able to come up with is
print s.decode('utf8').encode(locale.getpreferredencoding(), 'replace')
which is not that trivial any more...).
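A minimal sketch of the robust-printing idea above, wrapped in a small helper. It is written in Python 3 syntax, where the 8-bit string is a `bytes` object, so it is not a drop-in for the Python 2 servant shown earlier. The name `to_printable` and its `encoding` parameter are hypothetical and not part of Ice or IcePy.

```python
import locale


def to_printable(utf8_bytes, encoding=None):
    """Re-encode UTF-8 bytes for a target encoding.

    Characters that the target encoding cannot represent are
    replaced with '?' instead of raising UnicodeEncodeError.
    """
    if encoding is None:
        # Fall back to the encoding preferred by the current locale.
        encoding = locale.getpreferredencoding()
    text = utf8_bytes.decode("utf-8")
    return text.encode(encoding, "replace")


if __name__ == "__main__":
    # "Hällö Wörld!" as it would arrive from a UTF-8 client.
    incoming = "H\u00e4ll\u00f6 W\u00f6rld!".encode("utf-8")
    print(to_printable(incoming, "latin-1"))
    print(to_printable(incoming, "ascii"))
```

Passing the encoding explicitly, as in the usage lines, makes the behavior independent of the machine's locale, which is convenient for testing.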

Likewise, in a Python client, I would like to be able to
printer.printString(u"Hällö Wörld!")
directly (after setting the proper coding for the Python script, of course) instead of
printer.printString(u"Hällö Wörld!".encode('utf8'))
but this gives me a "ValueError: invalid value for argument 1 in operation `printString'" from Ice.

Alternatively, if IcePy uses 8-bit strings for Slice strings, it should provide an automatic string conversion facility as in C++. Our applications have to run with a Latin1 locale for legacy reasons. In C++ this works nicely and transparently after installing a UTF-8 <-> Latin1 StringConverter, but in Python it gets ugly and increases the potential for mistakes (are encode/decode applied correctly to all strings that go over the Ice interface?).
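Until such a facility exists, one way to reduce the risk of a forgotten encode/decode is to route every string through a single pair of boundary helpers. The sketch below assumes the application holds Latin-1 encoded bytes and the wire format is UTF-8. It uses Python 3 syntax, and the names `to_wire`, `from_wire`, `WIRE_ENCODING` and `LOCAL_ENCODING` are hypothetical, not part of IcePy.

```python
WIRE_ENCODING = "utf-8"     # what Slice strings carry on the wire
LOCAL_ENCODING = "latin-1"  # what the legacy application expects


def to_wire(local_bytes):
    """Convert application (Latin-1) bytes to wire (UTF-8) bytes."""
    return local_bytes.decode(LOCAL_ENCODING).encode(WIRE_ENCODING)


def from_wire(wire_bytes, errors="strict"):
    """Convert wire (UTF-8) bytes to application (Latin-1) bytes.

    With errors="strict", a character that has no Latin-1 mapping
    raises UnicodeEncodeError; with errors="replace" it becomes '?'.
    """
    return wire_bytes.decode(WIRE_ENCODING).encode(LOCAL_ENCODING, errors)


if __name__ == "__main__":
    local = b"H\xe4ll\xf6 W\xf6rld!"  # "Hällö Wörld!" in Latin-1
    wire = to_wire(local)
    print(wire)
    print(from_wire(wire) == local)
```

This only centralizes the conversion; every call site still has to use the helpers, which is the weakness the post describes.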

I guess the best option I currently have is to patch the C++ code of the IcePy module to install a StringConverter there?

In any case, it would be nice if IcePy could marshal Unicode strings to Slice strings instead of raising a ValueError.

Comments

  • bernard
    bernard Jupiter, FL
    Hi Christian,

    Thanks for your analysis: these issues will be addressed in Ice 3.3.0.

    We'll provide the ability to plug in a string converter (with the underlying Ice for C++ communicator), and you'll be able to pass Unicode strings as in-parameters for remote operations that take Slice strings.

    Best regards,
    Bernard
  • bernard
    bernard Jupiter, FL
    Hi Christian,
    which would, however, generate a UnicodeEncodeError if a client sends a string with characters that are not representable in the server's locale, so more effort is needed to have robust printing in the server (the best I've been able to come up with is
    print s.decode('utf8').encode(locale.getpreferredencoding(), 'replace')
    which is not that trivial any more...).

    We provide 3 string converter implementations in Ice 3.3:
    • UnicodeWstringConverter
      Converts UTF-16 or UTF-32 wstrings to/from UTF-8 sequences. By default, it's "lenient", i.e. some malformed input sequences are transformed into the Unicode replacement character. In 3.3.0, you'll be able to get the strict behavior as well (no replacement character).
    • IconvStringConverter
      Converts narrow or wstrings from the specified iconv encoding to/from UTF-8. It's always strict, i.e. if there is no mapping, you get an exception.
    • WindowsStringConverter
      Converts narrow strings encoded in a given code-page to/from UTF-8. As with the Iconv converter, if there is no mapping, you get an exception.

    So if you want to use a replacement character, you'll probably need to write your own C++ string converter.
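The difference between "lenient" and "strict" conversion can be illustrated with Python's own codecs. This is only an analogy in Python 3 syntax: it does not call the Ice string converters, and the function names are hypothetical.

```python
def decode_lenient(data):
    """Decode UTF-8, turning malformed bytes into U+FFFD."""
    return data.decode("utf-8", errors="replace")


def decode_strict(data):
    """Decode UTF-8, raising UnicodeDecodeError on malformed input."""
    return data.decode("utf-8", errors="strict")


def encode_strict(text, encoding="latin-1"):
    """Encode to a narrow charset, raising if a character has no mapping."""
    return text.encode(encoding, errors="strict")


if __name__ == "__main__":
    malformed = b"H\xffi"  # 0xFF can never appear in valid UTF-8
    print(repr(decode_lenient(malformed)))
    try:
        decode_strict(malformed)
    except UnicodeDecodeError as exc:
        print("strict decode failed:", exc.reason)
    try:
        encode_strict("\u20ac")  # the euro sign has no Latin-1 mapping
    except UnicodeEncodeError as exc:
        print("strict encode failed:", exc.reason)
```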

    Cheers,
    Bernard
  • Hi Bernard,

    Thanks for taking the time to look into this. I'll be looking forward to Ice 3.3, then. :)