Wednesday, August 06, 2008

Calling Java from JRuby to do the dirty work

I recently had to read a UTF-16LE encoded text file (with a BOM) from Ruby. I couldn't figure out how to get Ruby to deal with the double-byte characters in the file, but I did know that Java has very good support for UTF encodings, so I decided to let Java do the heavy lifting for me. I ended up with something like this:
require 'java'

import java.io.InputStreamReader
import java.io.ByteArrayInputStream
import java.io.BufferedReader

data = ''
File.open("text.txt", 'rb'){|f| data = f.read}

#strip the BOM (Byte Order Marker)
data.slice!(0..1)
#let Java deal with the UTF-16 encoding
reader = BufferedReader.new(
  InputStreamReader.new(
    ByteArrayInputStream.new(data.to_java_bytes), 'UTF-16LE'))

while ((s = reader.read_line) != nil)
  puts s
end
It may not be the perfect solution (I'd still like to see a regular Ruby solution), but I suppose leveraging the Java libraries was the whole point of JRuby in the first place, wasn't it?
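For what it's worth, something along these lines should work as a plain-Ruby alternative using the iconv standard library (I haven't tested it, and it assumes your Ruby or JRuby build ships iconv):

require 'iconv'

data = ''
File.open("text.txt", 'rb'){|f| data = f.read}

#strip the BOM (Byte Order Marker)
data.slice!(0..1)
#convert the UTF-16LE bytes to UTF-8 with iconv
utf8 = Iconv.conv('UTF-8', 'UTF-16LE', data)
utf8.each_line{|line| puts line}

Ruby 1.9's new encoding support should eventually make this sort of thing simpler still, but I haven't tried it yet.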

(Executing File.open("text.txt", 'rb'){|f| puts f.read} looked fine on my Mac, except for the BOM, but looked terrible in the console on Windows. The JRuby solution above actually converts the text from double-byte to single-byte characters.)
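If you wanted to keep the converted text around rather than just print it, a variation like this ought to do it (again untested, and the output file name text-utf8.txt is just an example):

require 'java'

import java.io.InputStreamReader
import java.io.ByteArrayInputStream
import java.io.BufferedReader
import java.io.OutputStreamWriter
import java.io.FileOutputStream
import java.io.BufferedWriter

data = ''
File.open("text.txt", 'rb'){|f| data = f.read}

#strip the BOM (Byte Order Marker)
data.slice!(0..1)

#read UTF-16LE in, write UTF-8 out
reader = BufferedReader.new(
  InputStreamReader.new(
    ByteArrayInputStream.new(data.to_java_bytes), 'UTF-16LE'))
writer = BufferedWriter.new(
  OutputStreamWriter.new(
    FileOutputStream.new("text-utf8.txt"), 'UTF-8'))

while ((s = reader.read_line) != nil)
  writer.write(s)
  writer.new_line
end

writer.close
reader.close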