Sunday, January 28, 2018

Convert UTF-16LE to UTF-8

Jack Jr. reported an interesting problem: under manager@ms14:www/secure, he tried to add the RCS keyword $Date$ to index.htm, hoping that RCS would automatically update it with the modification time. 
However, it did not work: the string was not replaced. What puzzled him was that when he created a new file for testing, everything worked fine.

Let us inspect the checked-out index.htm file. When you open it in vim, the status line shows "[converted][dos]". The "[dos]" flag simply indicates that the file was created in a DOS/Windows environment, where each line of a text file ends with CR+LF (\015\012). In Unix, by contrast, a line ends with a single LF. However, this is not what prevented RCS from working properly.

Let's inspect the contents of index.htm with a better tool. As the saying goes, a craftsman who wants to do good work must first sharpen his tools. You may use the "dump.cpp" program you developed in the VoIP class, which we used to inspect WAV files. You may also use the Unix command "od -t x1 index.htm" to inspect each byte. Now you should see the problem: each character is stored as two bytes. In ASCII, the string "Date" is "44 61 74 65", but in the file it is stored as "44 00 61 00 74 00 65 00". That's why RCS failed to recognize the keyword.
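To see the mismatch without the real file, here is a minimal Python sketch. The helper name hex_bytes is made up for this illustration, and the sketch assumes that RCS looks for the keyword as plain single-byte characters, which is consistent with what we observed but is not taken from the RCS source.

```python
def hex_bytes(data: bytes) -> str:
    """Return the bytes as space-separated hex, roughly like `od -t x1`."""
    return " ".join(f"{b:02X}" for b in data)


keyword = "$Date$"
ascii_form = keyword.encode("ascii")      # what a byte-wise keyword search looks for
utf16_form = keyword.encode("utf-16-le")  # how the keyword sits in the UTF-16LE file

print(hex_bytes("Date".encode("ascii")))      # one byte per character
print(hex_bytes("Date".encode("utf-16-le")))  # each character followed by a 00 byte

# The single-byte pattern never occurs in the two-byte encoding,
# because a 00 byte separates every pair of characters.
print(ascii_form in utf16_form)
```

The last line prints False, which is the situation a byte-oriented tool faces when it scans the UTF-16LE file for "$Date$".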

At the very beginning of index.htm we see the bytes "FF FE", which are a BOM (byte order mark, https://en.wikipedia.org/wiki/Byte_order_mark). From an article on Unix Stack Exchange (https://unix.stackexchange.com/…/process-a-file-that-starts…) we learned that this BOM indicates that the file is encoded in UTF-16, little endian!
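The BOM check can also be sketched in Python using the constants from the standard codecs module. The function name guess_encoding_from_bom and the table BOMS are hypothetical names for this example, and a file without a BOM simply yields None rather than a guess.

```python
import codecs

# Longer BOMs come first: the UTF-32LE BOM (FF FE 00 00) also starts with FF FE.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def guess_encoding_from_bom(head: bytes):
    """Return the encoding named by a leading BOM, or None if there is no BOM."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


# First bytes of a file like ours: FF FE followed by "<h" in UTF-16LE.
print(guess_encoding_from_bom(b"\xff\xfe<\x00h\x00"))

# With the real file, one would read only the first few bytes:
#   with open("index.htm", "rb") as f:
#       print(guess_encoding_from_bom(f.read(4)))
```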

So the solution should be simple. We need to convert this file from UTF-16LE to UTF-8. Let's type the command "mv index.htm index.bak; iconv -f UTF-16LE -t UTF-8 index.bak > index.htm". 
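If iconv is not at hand, the same conversion can be sketched in Python. This is only an illustration, not the command we ran: utf16_to_utf8 is a made-up helper name, and the sample bytes stand in for the real index.htm. One detail worth knowing is that Python's "utf-16" codec reads the BOM and drops it, while "utf-16-le" keeps it as the character U+FEFF, which then shows up as EF BB BF at the start of the UTF-8 output.

```python
def utf16_to_utf8(raw: bytes) -> bytes:
    """Decode UTF-16 (byte order taken from the BOM) and re-encode as UTF-8."""
    text = raw.decode("utf-16")  # consumes the BOM
    return text.encode("utf-8")


# A stand-in for the file contents: BOM, keyword, DOS line ending.
sample = "\ufeff$Date$\r\n".encode("utf-16-le")

converted = utf16_to_utf8(sample)
print(converted)               # the keyword is now plain single-byte text
print(b"$Date$" in converted)  # a byte-wise search can find it again

# Decoding with the explicit "utf-16-le" codec keeps the BOM in the output.
kept_bom = sample.decode("utf-16-le").encode("utf-8")
print(kept_bom[:3].hex(" "))
```

Note that the CR+LF line endings pass through unchanged; only the character encoding is converted.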

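Before we check in the new UTF-8 file, please inspect its size. It is almost exactly half the size of the UTF-16 file, because nearly every character in this file is ASCII, which takes two bytes in UTF-16 but only one byte in UTF-8. Now check in index.htm and check it out again. The converted file works fine with RCS keywords.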
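The halving holds only for text that is mostly ASCII. A small Python sketch, with made-up sample strings and a hypothetical helper size_ratio, shows how the ratio depends on the content:

```python
def size_ratio(text: str) -> float:
    """Return (UTF-8 size) / (UTF-16 size with BOM) for the given text."""
    utf16_size = len(text.encode("utf-16"))  # the "utf-16" codec prepends a 2-byte BOM
    utf8_size = len(text.encode("utf-8"))
    return utf8_size / utf16_size


# ASCII-only markup: one byte per character in UTF-8, two in UTF-16.
ascii_page = "<html><body>$Date$</body></html>\r\n" * 100

# Chinese text: three bytes per character in UTF-8, two in UTF-16.
cjk_page = "工欲善其事" * 100

print(round(size_ratio(ascii_page), 3))  # just under 0.5
print(round(size_ratio(cjk_page), 3))    # greater than 1
```

So a page written mostly in Chinese would actually grow after conversion to UTF-8, even though RCS keywords would still be recognized.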