Reading a Unicode File One Character at a Time using C#

As part of a recent project, I needed to convert a file that used a custom format into a plain text file. Most of the custom file consisted of sentences stored as Unicode text. But there were also some custom bytes mixed in whose meaning I wasn’t sure of, and I wanted to remove them if possible.

So I needed to dissect the custom file one character at a time. I was using the C# language and I found that the information available on the Internet was not much help.

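I ended up using a low-level approach where I read each 2-byte character, one at a time. In .NET, Encoding.Unicode means UTF-16 little-endian, so every character in the Basic Multilingual Plane occupies exactly two bytes. I used the C# BinaryReader class. The code looks like this: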

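using System;
using System.IO;
using System.Text;

class Program
{
  static void Main(string[] args)
  {
    string fn = "..\\..\\WeirdFile.phr";

    // disposing the BinaryReader also closes the underlying FileStream
    FileStream ifs = new FileStream(fn, FileMode.Open, FileAccess.Read);
    using (BinaryReader br = new BinaryReader(ifs,
      Encoding.Unicode))
    {
      long len = br.BaseStream.Length;
      while (br.BaseStream.Position < len)
      {
        // read one 2-byte UTF-16 code unit; a trailing odd byte
        // decodes to the replacement character U+FFFD
        byte[] bytes = br.ReadBytes(2);
        char[] c = Encoding.Unicode.GetChars(bytes);
        Console.Write(c);
      }
    }

    Console.WriteLine("\nDone \n");
    Console.ReadLine();
  }
}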

There’s actually quite a lot going on in this short amount of code. My next step is to use this code to examine the file and try to figure out the meaning of the bytes that aren’t part of the text.
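
For that kind of examination, it can help to print each 2-byte value in hex next to the character it decodes to, so the non-text bytes stand out. The following is only a minimal sketch of the idea, not the code from the project: the names CodeUnitDump and Describe are hypothetical, and it assumes the whole file is little-endian UTF-16 with no byte order mark to skip.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class CodeUnitDump
{
    // Hypothetical helper: reads a stream as 16-bit little-endian code units
    // and returns one line per unit: byte offset, hex value, character.
    public static List<string> Describe(Stream input)
    {
        var lines = new List<string>();
        using (var br = new BinaryReader(input))
        {
            long len = br.BaseStream.Length;
            // stop before a trailing odd byte, which cannot form a full unit
            while (br.BaseStream.Position + 1 < len)
            {
                long offset = br.BaseStream.Position;
                ushort unit = br.ReadUInt16();  // BinaryReader is little-endian
                char c = (char)unit;
                // show control characters and lone surrogates as '.'
                string shown = (char.IsControl(c) || char.IsSurrogate(c))
                    ? "." : c.ToString();
                lines.Add($"{offset:X6}: {unit:X4}  {shown}");
            }
        }
        return lines;
    }
}
```

A call such as CodeUnitDump.Describe(File.OpenRead(fn)) returns the lines, which can then be written with Console.WriteLine. Values that show up as a dot, or that fall far outside the range of the surrounding text, are good candidates for the custom bytes.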
