Hmm, bring up a python interpreter and type the last line in with the imports. Works fine.
It’s the first line that is the problem: I made a typo and wrote utf-7 instead of utf-8. I suppose it means that it is case-insensitive. Still, it wasn’t too clear, to me at least, what was going on.
@Craig: UTF-7 is simply another encoding for Unicode, which only uses 7 bits per byte. It is not widely used, but it would be an alternative to UTF-8 or UTF-16 + Base64. Now, indeed, the character plus (just character plus, or Unicode character plus if you like, but not UTF-8 plus, as it is an abstract character, independent of the encoding) is encoded in UTF-7 as two bytes that would be read as +- when incorrectly decoded as ASCII.
What you say is true, but his file is probably not actually saved as UTF-7, so the conversion goes the other way. The problem is that b'+ ', when interpreted as UTF-7, yields just u' ': the b'+' marks the start of a Base64 block, and the block ends at the first non-[A-Za-z0-9+/] character, which is the space immediately after the +, so just the + gets consumed.
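The scanning rule described in the comment above can be sketched in a few lines of Python. This is only a toy model of where a Base64 block starts and ends; `shift_block` and `BASE64_CHARS` are hypothetical names, and the sketch deliberately does not call a real UTF-7 codec, since decoders may differ in how they treat an empty block.

```python
import string

# Characters allowed inside a UTF-7 Base64 block (the Base64 alphabet).
BASE64_CHARS = set(string.ascii_letters + string.digits + "+/")


def shift_block(text, plus_index):
    """Return (payload, end) for the block opened by the '+' at plus_index.

    payload is the run of Base64 characters after the '+';
    end is the index of the first character that terminates the block.
    """
    end = plus_index + 1
    while end < len(text) and text[end] in BASE64_CHARS:
        end += 1
    return text[plus_index + 1:end], end


line = "now() + datetime.timedelta(days=30)"
payload, end = shift_block(line, line.index("+"))

# The block is empty and is terminated by the space right after the '+',
# so the '+' is the only character swallowed by the block.
print(repr(payload), repr(line[end]))
```

With the line from the post, the payload is empty and the terminating character is the space, which matches the explanation that only the + disappears.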
11 thoughts on “A python utf gotcha”
Craig Small: A python utf gotcha http://t.co/IrIKiEfuii #debian #linux
Small Dropbear | A python utf gotcha http://t.co/IrIKiEfuii
Planet Debian: Craig Small: A python utf gotcha http://t.co/wYz0DiYgqH
The problem isn’t that it is case-insensitive, but that u'+' encoded in UTF-7 is b'+-'.
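A quick way to check this in a Python 3 interpreter, as a minimal sketch using only the built-in `utf-7` codec (the sample string `"a + b"` is just an illustration):

```python
# Encoding a literal plus sign in UTF-7 produces the two bytes "+-".
plus_encoded = "+".encode("utf-7")
print(plus_encoded)  # b'+-'

# The same happens inside a longer string; everything else stays as ASCII.
expr_encoded = "a + b".encode("utf-7")
print(expr_encoded)  # b'a +- b'

# Decoding the properly encoded bytes gives the original text back.
round_trip = expr_encoded.decode("utf-7")
print(round_trip)  # a + b
```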
Ahh, is that it? I thought it was getting DateTime mixed with datetime but that was a furphy.
So UTF-8 plus is UTF-7 plus, minus.
That must do some odd things! Thanks for the note.
@planetdebian: Craig Small: A python utf gotcha http://t.co/iRX60BBj2R #arsipweb
Craig Small: A python utf gotcha: This one had me stumped for a while:
# -*- coding: utf-7 -*-
import dateti… http://t.co/RVVaBvZ5LR
$ echo 'default_due_date = datetime.datetime.now() + datetime.timedelta(days=30)' | iconv -f utf7 -t utf8
default_due_date = datetime.datetime.now() datetime.timedelta(days=30)
The ' + ' sequence becomes just ' '.