Python Standalone Packager

I need them so infrequently, I forget what different projects are out there:

android2po: Managing Android translations

I’ve always liked gettext a lot. Rather than asking you to maintain a database of strings, assigning an id to each, it simply uses the original strings itself as the string id. To me, it’s a classical example of choosing practicality over purity.

The Android localization system, of course, uses the former approach. Each string is a resource with an id, and each language, essentially, has one or more XML files with the proper localized string mapped to each id.

For my apps, I initially had only the original English version and a German translation, those being the only languages I speak, more or less anyway. Whenever I added a new English string, or changed an existing one, I immediately updated the German version as well – simple enough.

For A World Of Photo, I decided to ask the community for help with translations into more languages. Clearly, things were not so simple anymore.

See, with gettext, when the set of strings an app uses changes as part of a new version, you can simply “merge” the new string catalog into each of the translations. Strings that have been removed from the app are removed from the translation files, new strings are added, and strings that have been changed are flagged as “fuzzy”, at least to the extent that the merge tool detects it as a change, rather than a completely new string. That last part is possible because each translation file contains not only the translations, but also the original string that was translated. Remember, it’s the string that is the database key.

As a result, translators simply have to go through the list of new or fuzzy strings, update those, and they’re done.
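To make the merge behaviour concrete, here is a rough Python sketch. It is not how gettext's own merge tool is implemented; the function name, the use of difflib and the 0.6 similarity cutoff are all assumptions made for illustration.

```python
from difflib import get_close_matches


def merge_catalog(new_msgids, old_translations, cutoff=0.6):
    """Merge a fresh list of source strings into an existing translation.

    old_translations maps msgid -> translated string.
    Returns a dict mapping msgid -> (translation, is_fuzzy).
    """
    merged = {}
    # Old source strings that no longer exist in the app.
    unused_old = sorted(set(old_translations) - set(new_msgids))
    for msgid in new_msgids:
        if msgid in old_translations:
            # Unchanged string: keep the translation as it is.
            merged[msgid] = (old_translations[msgid], False)
            continue
        # New or changed string: look for a similar string that disappeared.
        close = get_close_matches(msgid, unused_old, n=1, cutoff=cutoff)
        if close:
            # Probably an edited string: reuse the translation, flag it fuzzy.
            merged[msgid] = (old_translations[close[0]], True)
        else:
            # Genuinely new string: no translation yet.
            merged[msgid] = ("", False)
    # Removed strings are dropped simply because they are never copied over.
    return merged
```

A translator would then only need to look at entries that are flagged fuzzy or have an empty translation.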

Now, Android’s system has no equivalent tools. Frankly, I wonder how other people do this. I mean, you surely don’t want to have your localization team go through the full list of strings every time you release a new version. Even if you decide you don’t need the ability to detect strings that have changed (you could simply have a policy of using a new id when such a change is necessary), you still need tools to merge changes in your main strings.xml file into each language’s XML resource with new/removed strings (do any such tools exist?).

I suppose you could also have your translators work off a diff, but that seems inconvenient. There’s this huge ecosystem around gettext with all kinds of desktop and web apps that could be utilized.

Google seems to use something internally, because Android’s own string resources are marked with msgid= attributes.

So, I decided the best way for me to deal with this would be to simply convert Android’s XML resources to gettext, do the translations, then import the result back to Android. I found out that the OpenIntents project was doing the same, essentially using a generic xml2po tool found somewhere in the depths of gnome-doc-utils. I kinda got it to work, but ran into a lot of little issues; in the end it felt just too hacky.
The final thing that convinced me that writing a special purpose tool might be worth my while was the fact that Android’s XML resource format has a bunch of different escaping rules and peculiarities (which I plan to write a separate post on) that translators shouldn’t really have to deal with.
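As a rough idea of what the export direction involves, here is a minimal Python sketch that turns a default strings.xml plus an optional translated one into PO entries. This is not android2po's code: the function names are made up, only plain <string> elements are handled, the resource name is put into a "#:" reference comment purely as an illustrative choice, and none of Android's escaping rules are dealt with.

```python
import xml.etree.ElementTree as ET


def po_quote(text):
    """Quote a string for a PO file (handles backslash, quote, newline only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"%s"' % escaped


def strings_xml_to_po(default_xml, translated_xml=None):
    """Build PO text from the default strings.xml and an optional translation."""
    translations = {}
    if translated_xml is not None:
        for node in ET.fromstring(translated_xml).findall("string"):
            translations[node.get("name")] = node.text or ""

    entries = []
    for node in ET.fromstring(default_xml).findall("string"):
        name = node.get("name")
        msgid = po_quote(node.text or "")
        # Strings without a translation get an empty msgstr.
        msgstr = po_quote(translations.get(name, ""))
        entries.append("#: %s\nmsgid %s\nmsgstr %s\n" % (name, msgid, msgstr))
    return "\n".join(entries)
```

The import direction would do the reverse: read msgid/msgstr pairs and write one XML file per language, keyed by the resource names.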

So, have a look at android2po. You can install it via PyPI:

easy_install android2po

There’s also a README file which explains the basic usage, which is really just a2po init, a2po export and a2po import calls, though at this point there are also various configuration options that should make it quite flexible.

The biggest thing it doesn’t support yet is the <plurals> tag, mainly because I haven’t needed it myself yet. Apart from that, I do believe it should work just fine for most projects.

git fast-import: Empty path component found in input

When you get “fatal: Empty path component found in input” errors from git fast-import, check that your export tool doesn’t write out path values that start with a slash. In my case, my rule file for svn-all-fast-export matched paths like “/project/trunk”, when I should’ve used “/project/trunk/” (note the trailing slash).
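If you want to check an export before feeding it to git, something like the following Python sketch can point out the offending paths. It is deliberately naive: it only looks at unquoted "M" (file modify) and "D" (file delete) command lines, and it will also misfire on data blocks that happen to contain such lines. The function names are made up for this example.

```python
def has_empty_component(path):
    """Return True if a slash-separated path contains an empty component."""
    return any(part == "" for part in path.split("/"))


def suspicious_paths(stream_lines):
    """Collect paths from M and D commands that have an empty component."""
    found = []
    for line in stream_lines:
        line = line.rstrip("\n")
        if line.startswith("M "):
            # Format: M <mode> <dataref> <path>
            parts = line.split(" ", 3)
            if len(parts) == 4 and has_empty_component(parts[3]):
                found.append(parts[3])
        elif line.startswith("D "):
            # Format: D <path>
            path = line[2:]
            if has_empty_component(path):
                found.append(path)
    return found
```

A path like "/project/trunk/README" is flagged because splitting on the slash yields an empty first component.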

Pro-Tip for svn-all-fast-export: Use --metadata=no to get rid of the svn info in the generated git commits. It’s not really advertised as an option.

Fun with encodings in MySQL

Since my post on timezones in MySQL turned out to be so useful (I keep coming back to it every other month), I thought it would be time well spent if I jotted down some notes about another area that sends me googling every time I run into it: Encodings in MySQL.

[One] Text columns in MySQL are annotated with the encoding their data is supposed to be in. If a column doesn’t specify an encoding, a default can be given on the table, database and server levels.

MySQL will use its knowledge about the encoding of a column to make sure that data getting in and out is properly transcoded to whatever encoding the client is using. This is determined by variables:

  • character_set_results determines the encoding used for the bytes sent to the client.
  • character_set_client lets the server know which encoding the bytes use that it receives from the client.
  • There’s also character_set_connection, which is described in the documentation as the encoding the server translates an incoming statement to. I’m not sure though why MySQL can’t just directly convert from the client charset to the charset used by the relevant columns, which it ultimately will do anyway. I can only imagine that this setting might be useful with respect to binary columns.

So to recap the process: the MySQL client sends a sequence of bytes to the server, the server assumes those bytes use whatever “character_set_client” is set to, converts the data to the encoding that the target column declares, and then stores the result, again as an encoded byte string.
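The recap can be mimicked in a couple of lines of Python. This is only a model of the idea, not of MySQL's internals; the function name is made up, and Python codec names stand in for MySQL charset names.

```python
def server_store(client_bytes, character_set_client, column_charset):
    """Model the round trip: interpret the incoming bytes, re-encode for the column."""
    # The server trusts character_set_client to interpret what it receives.
    text = client_bytes.decode(character_set_client)
    # What ends up on disk is the text in the column's declared encoding.
    return text.encode(column_charset)


# A latin1 client sends "é" as one byte; a utf8 column stores it as two.
stored = server_store("é".encode("latin-1"), "latin-1", "utf-8")
print(stored)  # b'\xc3\xa9'
```

As long as character_set_client matches what the client really sends, the stored bytes are correct for the column.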

[Two] To change the encoding of a complete table, you can use:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Or modify a single column only:

ALTER TABLE tbl_name MODIFY column1 CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Note that in both cases, MySQL will automatically convert the data stored from the old to the new encoding. This is usually what you want, except when you have one particular problem:

[Three] The data actually stored in the database uses a different encoding than the one that is declared in the column meta data, or worse, the stored data has no valid encoding at all.

You see, it’s quite easy to get MySQL to store invalidly encoded data. For example, if you declare a column as “utf8”, set the character_set_client setting to “latin1”, and then send data in “utf8”, MySQL will apply a Latin1ToUtf8() transcoding function to the utf8 source data before storing it. So effectively, your text has now been encoded in utf8 twice.

Since the default for character_set_client is usually latin1, all you have to do is pipe a UTF8-encoded SQL dump (one that doesn’t set the proper charset variables) into the mysql command line client, and you have a mess.
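Here is what that double encoding looks like at the byte level, sketched in Python. Python's "latin-1" codec is used as a stand-in for MySQL's latin1; as far as I know the two differ for some bytes in the 0x80 to 0x9F range, which this example avoids.

```python
def double_encode(text):
    """Simulate utf8 data sent over a connection that claims to be latin1."""
    sent = text.encode("utf-8")            # what the client really sends
    misread = sent.decode("latin-1")       # the server believes it is latin1
    return misread.encode("utf-8")         # and transcodes it for the utf8 column


stored = double_encode("é")
print(stored.hex())            # c383c2a9, instead of the expected c3a9
print(stored.decode("utf-8"))  # Ã©
```

The hex string is the kind of output to compare against when inspecting the raw bytes of a column.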

On the bright side, you can fix those issues directly through MySQL as well. If you have a simple encoding mismatch, it may be enough to simply change the charset declaration of the columns to match the encoding the data actually uses:

ALTER TABLE tbl_name MODIFY column1 VARCHAR(100) CHARACTER SET binary;
ALTER TABLE tbl_name MODIFY column1 VARCHAR(100) CHARACTER SET utf8;

If you naively apply an ALTER TABLE with the wanted charset, MySQL will automatically transcode the data based on what it incorrectly thinks the column’s charset is. So the trick here is to use two steps, and convert to “binary” first, which essentially amounts to “no encoding”. As a result, MySQL won’t touch the data. It simply drops the encoding annotation in the first statement, and sets a new one in the second.

If the data in the table is actually incorrectly encoded, you can fix this using CONVERT().

First, you may want to investigate what actually is wrong, i.e. what data exactly is stored as opposed to what you would like to see stored. You can determine the actual bytes, avoiding any charset conversion by MySQL, using:

SELECT HEX(column1) FROM tbl_name;

Now take for example the case above. Say the column is declared as UTF8, but the UTF8 data we sent was incorrectly passed through a Latin1ToUtf8() conversion. The following statement will then reverse the effect:

UPDATE tbl_name SET
    column1=CONVERT(CONVERT(CONVERT(column1 USING latin1) USING binary) USING utf8);

CONVERT transcodes the text from whatever encoding MySQL thinks the data is currently in to the encoding given by USING. We use the same binary-trick as before. MySQL thinks the data in column1 is UTF8, so the innermost CONVERT will apply a Utf8ToLatin1() transcoding, reversing the Latin1ToUtf8() function that should never have happened in the first place. However, MySQL now thinks the result is in latin1. If we were to just save that into a UTF8 column, it would be converted back right away. So we first drop the charset annotation by switching to binary, and then we set the charset to utf8, which should now match what the data actually contains. If you wonder whether that last step can be omitted – yes, I believe so. We could just write the data returned by the second CONVERT call directly; it should have the same effect.
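The same three steps can be followed on raw bytes in Python. This mirrors the reasoning rather than MySQL's implementation; the function name is made up, and Python's "latin-1" codec again stands in for MySQL's latin1.

```python
def undo_double_encoding(stored_bytes):
    """Reverse a utf8 -> (misread as latin1) -> utf8 double encoding."""
    # Innermost CONVERT: the column claims utf8, so transcode utf8 -> latin1.
    as_latin1 = stored_bytes.decode("utf-8").encode("latin-1")
    # "USING binary" drops the label and "USING utf8" sets a new one.
    # The bytes themselves stay the same, we only read them as utf8 now.
    return as_latin1.decode("utf-8")


print(undo_double_encoding(b"\xc3\x83\xc2\xa9"))  # é
```

If the data was never double encoded, the first step is likely to fail or produce garbage, so it is worth trying this on a copy of a few rows first.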

The MySQL documentation also has a bunch of info on charsets and converting.

Twisted Twistd Autoreload

While working on the Twisted server for A World Of Photo, I quickly began missing the convenience of having it automatically restart during development when I had made changes to the code. It turns out that the autoreload module that Django uses is actually pretty generic [1]. One thing Twisted doesn’t like is that the code which checks for file changes is run inside the main thread, and the actual app in a separate thread. That’s easily reversed though. You can find a patched version on bitbucket.

Then, all you need is a simple twistd wrapper:

from twisted.scripts import twistd
from pyutils import autoreload

autoreload.main(twistd.run)

[1] http://twistedmatrix.com/trac/ticket/4072
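For context, the core of such a reloader is a periodic check for changed files. The sketch below is not Django's autoreload code, just the basic mtime-polling idea using the standard library, with made-up function names.

```python
import os


def snapshot(paths):
    """Record the modification time of every file that currently exists."""
    return {path: os.stat(path).st_mtime for path in paths if os.path.exists(path)}


def changed_files(before, after):
    """Return the paths whose mtime differs, or that appeared or vanished."""
    return sorted(
        path for path in set(before) | set(after)
        if before.get(path) != after.get(path)
    )
```

A reloader would take a snapshot of all loaded module files, sleep for a second, take another one, and restart the process as soon as changed_files() returns something.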

Django Tree Libraries

django-mptt

  • Nested Set trees.
  • A register() call is used to set things up; it adds the necessary fields to the model.
  • A tree model still has a foreign key to itself. This is the API you use to manage the tree. Signals are used so that the hidden tree fields are updated when the parent ForeignKey changes. No add_child() required.
  • Using the foreign key to self means that deletion is handled automatically by Django and/or the database. The other libraries need to implement a custom Queryset subclass to handle deletes.
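For anyone who hasn't come across Nested Sets: every node gets a left and a right number from a depth-first walk, and a node's descendants are exactly the nodes whose numbers fall between its own. The Python sketch below shows the idea on a plain dict; it is not django-mptt's code, and the function names are made up.

```python
def number_tree(children, root):
    """Assign (left, right) numbers to each node of a tree given as {node: [children]}."""
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        # The right number is assigned once all children have been numbered.
        bounds[node] = (left, counter)
        counter += 1

    visit(root)
    return bounds


def descendants(bounds, node):
    """All nodes whose interval lies strictly inside the given node's interval."""
    left, right = bounds[node]
    return sorted(n for n, (l, r) in bounds.items() if left < l and r < right)


tree = {"root": ["a", "b"], "a": ["a1", "a2"]}
bounds = number_tree(tree, "root")
print(bounds["root"])            # (1, 10)
print(descendants(bounds, "a"))  # ['a1', 'a2']
```

The upside is that fetching a whole subtree is a single range query; the downside is that inserting or moving nodes means renumbering, which is the bookkeeping these libraries take care of.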

django-treebeard

  • Has an awesome name.
  • In addition to the common Nested Set/MPTT approach, supports two other tree implementations. Materialized Path in particular is interesting.
  • You inherit your models from abstract base classes, which I like.
  • The tree has to be managed manually, that is, there are specific APIs like add_child() you have to call.
  • Unfortunately, those APIs are classmethods on the model, rather than methods on the model manager, which would be the Django way.

django-easy-tree

  • Apparently a fork of django-treebeard, but only supports Nested Set trees.
  • But has a prettier API that fits very well into Django: Nicer class names, properly puts methods into the manager when they belong there, options are specified inside “Meta” rather than on the model itself.
  • Has an interesting concept of validators. Included is a SingleRootAllowedValidator.
  • No tests!

Clearly, somebody needs to write a django-treebeard that uses the django-easy-tree API design and django-mptt’s signal approach.

Speeding up Django tests using a RAM-bound MySQL server

A while ago, Django’s testing framework got transaction-based rollback, which obviously did wonders in terms of test performance. One thing that still bothered me though was the slow, initial table setup. For example, in a modestly sized project of mine with about 40 tables, this would take up to almost a minute. In particular when writing new tests, which is going to be an iterative process, that’s really not acceptable.

Now, one obvious thing to do is to use an in-memory SQLite database for testing purposes. I’ve tried that at times, but ultimately, various MySQL-specific stuff and raw SQL queries always made this an unsatisfying experience.

I’ve now finally realized that there is an easy solution, and I’m perplexed it didn’t occur to me earlier (maybe Linux, to which I’ve recently switched, just puts these kinds of options closer to one’s grasp). And it really is pretty straightforward: Mount a tmpfs, run a second MySQL instance on a different socket/port using this mount as a data dir, and tell Django to use it.
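On the Django side, "tell Django to use it" can be as simple as switching the port when running tests. The fragment below is a sketch for a settings.py, assuming a Django version that uses the DATABASES dict; the database name, credentials and port 3307 are placeholders, not values taken from my setup.

```python
import sys

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "myproject",      # placeholder
        "USER": "root",           # placeholder
        "PASSWORD": "",           # placeholder
        # Use an IP rather than "localhost" so the client connects via TCP
        # instead of falling back to the default unix socket.
        "HOST": "127.0.0.1",
        "PORT": "3306",
    }
}

# When running "manage.py test", point Django at the RAM-backed instance.
if "test" in sys.argv:
    DATABASES["default"]["PORT"] = "3307"  # assumed port of the second mysqld
```

The regular development server keeps talking to the normal MySQL instance; only test runs hit the tmpfs-backed one.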

I’ve put the shell script that I’m using on github.

You might want to customize the location of the data directory or the bind options, then simply do:

sudo ./mysqld-ram.sh

and when you’re done, shut it down with Ctrl+C.

The tables which previously took a minute to set up now only need two and a half seconds. It even cuts the runtime of the actual tests, which were already using transaction-rollback before, in half. Not surprisingly, my motivation to actually write tests and keep them up-to-date has noticeably improved.

Windows BCD file is just a registry hive

The Vista/Windows 7 Boot Manager data in Boot/BCD is simply a registry hive and can be read using a tool like reged.
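A quick way to convince yourself is to look at the first four bytes of the file: registry hives start with the signature "regf". A minimal Python check, with a made-up function name, could look like this:

```python
def looks_like_registry_hive(path):
    """Return True if the file starts with the registry hive signature."""
    with open(path, "rb") as handle:
        return handle.read(4) == b"regf"
```

Calling it on a copy of Boot/BCD should return True; it says nothing about whether the rest of the file is intact.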

It contains stuff like /Description/TreatAsSystem, /Description/GuidCache and a whole bunch of guids under /Objects. Presumably, the actually interesting data is there, but unfortunately, it’s all binary.

A guy named Geoff Chappell has some info on what it all might mean.

Downloading drivers for old 3ware products

On the download site, make sure to select All Releases and go through the form wizard. Do not use Click here to View all our products – it’ll lead you to a huge, impenetrable list of possible downloads for some products.

Use different terminal colors while inside ssh

Since this was a source of confusion for me in the past, I like to make it visually obvious when I’m inside an ssh session in gnome-terminal, vs. on the local machine. This is the best solution I have found so far:

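# Reset the terminal to normal; can also be called by hand after Ctrl+C.
ssh-done() {
        setterm -term linux -inversescreen off
}
# Wrap ssh: invert the screen for the duration of the session.
ssh() {
        setterm -term linux -inversescreen on
        # "$@" keeps arguments containing spaces intact, unlike $*.
        /usr/bin/env ssh "$@"
        ssh-done
}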

The reason why ssh-done is exposed as a separate function is that when ending ssh through Ctrl+C (for example, while at the password prompt), this gives you the ability to manually reset the terminal to normal again.

setterm in theory would also allow you to manually select a foreground and background color, though this didn’t work too well for me; in particular, it broke in various cases when commands tried to colorize their own output.

Totally awesome would be the ability to script gnome-terminal to switch the profile, but this doesn’t seem to exist yet.
