VBA Unicode done right

IMPORTANT UPDATE: It seems that this is one of the most popular posts on this site, so VBA's useless unicode support is obviously a big issue. Seeing as it's so commonly searched for, and seeing as many people don't know the first thing about Perl, I've knocked up a small Windows tool that does the job. You can find it at the updated version of this post. Everything below this line is obsolete!


Some time ago I wrote a short article on forcing VBA to use some form of unicode to allow simple insertion of non-latin1 text into VBA modules. It sort of worked, for the most part, kind of. Well, not really.

The problem, you see, is that the VBA editor is extremely stupid so won't display non-latin1 characters unless that's your machine's native codepage. Couple that with the compiler, which is also pretty stupid, that actually breaks encoding rules if you have the audacity to sign your modules with a proper certificate. It basically forces a revert to the native codepage making non-latin1 text look ... well, ridiculous.

So how do you do it properly? Really? You want to know? OK. Well, you're supposed to use ChrW$() to generate unicode characters individually. Yes, really. Like this.

Say you want to include the Chinese character 汉 in your latin1 text. The only way to do it is to use the text like this:

    MyString = "This is my string with a " &
    ChrW$(&H6C49) & " in it."

Imagine if you have entire sentences made up of such characters. Doesn't bear thinking about, does it? However, I've made a simple way to work with it. The trick is to write your VBA modules outside the editor using UTF-8 encoding. For instance, this little proto-module:

Sub Tomato
    MsgBox("Chinese or the Sinitic language(s) (汉语/漢語, pinyin: Hànyǔ; 中文, pinyin: Zhōngwén) can be considered a language or a language family and is originally the indigenous languages spoken by the Han Chinese in China.");
End Sub

That won't work with VBA due to the Chinese characters. However, they display fine in UTF-8. Hrm. What to do. Well, you can use my little utility, convert.pl. I've included the full listing inline here.

#!perl

use strict;

my $in = shift;
my $out = shift;

die("usage: convert.pl <infile> <outfile>") unless length $in && length $out;
die("Input and Output files must differ") if $in eq $out;

open FH, "<:utf8", $in or die("Couldn't open input file '$in'");
open OFH, ">", $out or die("Couldn't open output file '$out'");

print "Converting $in to $out ...\n";

my $cnt = 0;
my $err = 0;
while (<FH>) {

    $cnt++;
    my $line = $_;
    my $converted = "";
    for (my $i = 0; $i < length $line; $i++) {
        my $chr = substr($line, $i, 1);
        my $ord = ord($chr);
        if ($ord > 0xFF) {
            $converted .= sprintf(qq|" & ChrW\$(&H%X) & "|, $ord);
        } else {
            $converted .= $chr;
        }
    }

    $converted =~ s/& "" &/&/g;
    $converted =~ s/"" & //g;
    $converted =~ s/ & ""//g;

    print OFH $converted;
    if (length($converted) > 1000) {
        print STDERR "Line $cnt is ",length($converted)," characters long.\n";
        $err = 1;
    }
}

print STDERR "\nAt least one warning was raised. Please check that all\n".
    "noted lines compile, or are split over multiple lines\nif they do not.\n" if $err > 0;

print "\n$in successfully converted to $out.\nPlease check the file for errors.\n";

You can also download convert.pl directly.

This nice little utility converts all characters outside the normal latin1 ASCII range on Windows (which, let's face it, is the only place VBA runs) and converts them to ChrW$() sequences. It has to be run on a latin1 machine to make it all match up, I think, but other than that it'll be fine. Consider the module I pasted in a while back. The output of the command to parse that results in the following output:

$ perl convert.pl chinese.txt chinese-converted.txt 
Converting chinese.txt to chinese-converted.txt ...

chinese.txt successfully converted to chinese-converted.txt.
Please check the file for errors.
$

The resulting file, chinese-converted.txt, contains the following:

Sub Tomato
    MsgBox("Chinese or the Sinitic language(s) (" & ChrW$(&H6C49) & ChrW$(&H8BED) & "/" & ChrW$(&H6F22) & ChrW$(&H8A9E) & ", pinyin: Hàny" & ChrW$(&H1D4) & "; " & ChrW$(&H4E2D) & ChrW$(&H6587) & ", pinyin: Zh" & ChrW$(&H14D) & "ngwén) can be considered a language or a language family and is originally the indigenous languages spoken by the Han Chinese in China.");
End Sub

That should be fine for inclusion directly in VBA and all will work properly. One problem is that the extension of characters from 1 character to around 18 means that some of the lines are too long for the VBA editor. The convert.pl script will warn you if lines are longer than 1000 characters so you can manually check that they compile.

I'd also recommend safely storing your UTF-8 originals. They can come in very handy when you later need to change some text and can't read ChrW$() sequences by eye.