A Developer's Diary

Nov 4, 2012

Byte Order Mark (BOM) character

BOM character
1. The BOM is the Unicode character U+FEFF, placed at the start of a text file or stream to identify its byte order (endianness).
2. The UTF-8 representation of the BOM is the byte sequence 0xEF, 0xBB, 0xBF.
3. A text editor using ISO-8859-1 as its character encoding will display the BOM as the characters ï»¿.
4. The BOM has no meaning in UTF-8 apart from signalling that the byte stream that follows is encoded in UTF-8.
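
The points above can be checked with a few lines of Java. The following is a minimal sketch, not part of the original program: the class name BomBytes and the helper toHex are made up for illustration. It encodes U+FEFF in UTF-8 and in both UTF-16 byte orders, then decodes the UTF-8 bytes as ISO-8859-1 to show where the three stray characters come from.

```java
import java.nio.charset.StandardCharsets;

/**
 * File: BomBytes.java (hypothetical name, used only for this sketch)
 *
 * Prints the bytes of U+FEFF in several encodings and shows what the
 * UTF-8 bytes turn into when they are decoded as ISO-8859-1.
 */
class BomBytes
{
    /** Formats bytes as upper-case hex pairs separated by spaces. */
    static String toHex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes)
        {
            if (sb.length() > 0)
            {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes the UTF-8 bytes of the BOM as ISO-8859-1. */
    static String bomMisreadAsLatin1()
    {
        byte[] utf8 = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        String bom = "\uFEFF";

        // The byte order of the encoding decides the order of FE and FF
        System.out.println("UTF-8:    " + toHex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println("UTF-16BE: " + toHex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println("UTF-16LE: " + toHex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE

        // One character in UTF-8 becomes three characters in ISO-8859-1
        System.out.println("Misread length: " + bomMisreadAsLatin1().length());            // 3
    }
}
```

The three misread characters are U+00EF, U+00BB and U+00BF, which is exactly what an ISO-8859-1 editor shows at the top of a UTF-8 file that starts with a BOM.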

import java.nio.charset.Charset;

/**
 * File: BOM.java
 * 
 * Converts a string carrying BOM characters from ISO-8859-1
 * to UTF-8 and back, printing the string length at each step
 */
public class BOM
{
    public static void main(String[] args) throws Exception
    {
        System.out.println("Default Encoding: " + Charset.defaultCharset());

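        //
        // A simple string with a BOM-derived prefix of six characters.
        // They are what the BOM looks like after being misread twice:
        // its UTF-8 bytes EF BB BF read as ISO-8859-1, and the UTF-8
        // bytes of that result read one byte per character again.
        // Unicode escapes keep the literal independent of the encoding
        // used to compile this file. println uses the system default
        // character encoding.
        //
        String bomString = "\u00C3\u00AF\u00C2\u00BB\u00C2\u00BFHello World";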
        System.out.println(bomString + " Length: " + bomString.length());

        //
        // Reinterpret the string's ISO-8859-1 bytes as UTF-8
        //
        byte[] byteArrayISO = bomString.getBytes("ISO-8859-1");
        String utfString = new String(byteArrayISO, "UTF-8");
        System.out.println(utfString + " Length: " + utfString.length());

        //
        // Convert back: reinterpret the UTF-8 bytes as ISO-8859-1
        //
        byte[] byteArrayUTF = utfString.getBytes("UTF-8");
        String winString = new String(byteArrayUTF, "ISO-8859-1");
        System.out.println(winString + " Length: " + winString.length());
    }
}

Output of the above program when run on a console that supports UTF-8:
$ java BOM
Default Encoding: windows-1252
ï»¿Hello World Length: 17
Hello World Length: 14
ï»¿Hello World Length: 17
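
Because the UTF-8 BOM carries no byte-order information, code that reads such data often just skips it. Java's UTF-8 decoder does not do this on its own: a leading BOM comes through as the character U+FEFF. The following is a minimal sketch of one way to handle that, assuming the whole input is already in a byte array. The class name BomStrip and its methods hasUtf8Bom and decodeUtf8 are illustrative names, not an existing API.

```java
import java.nio.charset.StandardCharsets;

/**
 * File: BomStrip.java (hypothetical name, used only for this sketch)
 *
 * Detects a UTF-8 BOM at the start of a byte array and skips it
 * while decoding.
 */
class BomStrip
{
    /** Returns true when the data starts with the bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] data)
    {
        return data.length >= 3
            && data[0] == (byte) 0xEF
            && data[1] == (byte) 0xBB
            && data[2] == (byte) 0xBF;
    }

    /** Decodes UTF-8 bytes, skipping a leading BOM if one is present. */
    static String decodeUtf8(byte[] data)
    {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        byte[] withBom = "\uFEFFHello World".getBytes(StandardCharsets.UTF_8);

        // Plain decoding keeps the BOM as an invisible first character
        String plain = new String(withBom, StandardCharsets.UTF_8);
        System.out.println("Plain length:    " + plain.length());               // 12

        // Skipping the three BOM bytes leaves only the visible text
        System.out.println("Stripped length: " + decodeUtf8(withBom).length()); // 11
    }
}
```

Checking the raw bytes before decoding avoids the round trip through ISO-8859-1 shown in the program above, and input without a BOM passes through unchanged.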
