Thursday, September 27, 2007

Marching on...

I developed applications with PowerBuilder (PB) for about 8 years, starting with PB 1.0.  In the beginning it was new and exciting.  Windows was new.  Client/Server was new, and even object-oriented development in a mainstream tool was relatively new (at least in the corporate software development scene).  But as technology progressed I began to see exciting things elsewhere.  At the time Delphi was the cool new product.  The speed, the beauty, the elegance... ahhh, I was smitten.  But the other PB developers around me didn't see it.  I fought for Delphi and even did one or two projects with it.  Unfortunately, Delphi never became my mainstream development tool.  PB was entrenched and the developers around it protected it with a passion.  It was disappointing.

Then Java came along.  To my eye Java had a lot in common with Delphi.  I liked it too.  But still the PB developers around me couldn't seem to appreciate this new tool.  In fact, a lot of them were pretty hostile to anything that might threaten the sacred cow that was PB.  Fortunately for me, the hype of the Internet and the .COM boom turned Java into something people couldn't ignore.  So I decided to ride the wave and finally leave PowerBuilder behind.

Now another 8 years have gone by and I find myself in the middle of a community of Java developers who are extraordinarily similar to that crowd of PowerBuilder developers.  They only see Java.  They've built a career around it, they're entrenched and they protect it with a passion.  I, on the other hand, see the appeal of other languages and tools.  In particular, Ruby is beautiful and Rails elegantly solves many of my Java web development headaches.  But many of the people around me don't see it and, quite frankly, I doubt that they've even looked.  And that's the disappointing thing.  They're just like the PB guys with the blinders on.  They're hostile to anything that isn't Java.

In my current group there seems to be a new love forming for GWT.  Personally, I think the Google Web Toolkit and its unmistakable Swing/AWT flavor of Web development just seems wrong.  It's definitely web development for Java developers and I can see why they like it, but it's an artificial abstraction that doesn't sit well with me.  Ruby and Rails aren't perfect either, but even a couple of years after I started looking at them I still think they're better than GWT.  Rails embraces the browser technologies of HTML, CSS, and Javascript and makes them easy to work with.  GWT, on the other hand, puts me behind a Java facade where I can pretend to be developing Swing and it will generate the HTML and Javascript for me.  Ick.  No thanks, Google.

Anyway, my point is that I think I've reached another stage in my career where I need to move somewhere with like-minded people.  I'm tired of trying to get others to see what I think would be self-evident if only they'd look...

Friday, September 07, 2007

Java & Ruby HTTP Clients: Part 2

Several months ago I wrote an article about how to create an HTTP client in Java or Ruby. I included examples in both languages for getting past BASIC and FORM-based authentication. I also showed you how to resubmit the value of an HTTP cookie, which many websites use to store state.

What I didn't mention is that quite often web applications store state not in a cookie but within the HTML itself. In order for you to programmatically interact with the website, you'll need to get that data out of the HTML and put it in your next request.

Java

So the basic recipe is to use HttpClient (the way I demonstrated last time) to get the raw HTML. Then feed that text into NekoHTML, an HTML parser. It can correct the various problems you see in old-school HTML, namely unbalanced tags, missing parent tags, and mismatched elements.

Neko returns a standard W3C DOM Document object, but personally, I don't care much for the standard DOM API. I prefer the simpler API of DOM4J. So the next step is to take the document that NekoHTML produces and feed it into DOM4J so that I can use XPath expressions to find what I need.

Now the question becomes what data you need to post. My answer is to simply look through the FORM you're trying to submit and resubmit everything. Look for input tags of type 'hidden', 'text', and 'password', as well as any select tags, and gather the names and values of all of them. Override the ones where you need to provide the information (like id and password, for example) and then do the POST.
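
The gather-and-override step works the same way in either language. Here is a rough sketch of it in Ruby (the language of the script further down), using only the standard library. It assumes the form's fields have already been pulled out of the HTML as name/value pairs. The helper name build_post_body, the field names, and the values are all made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: merge the fields scraped from the form with the
# values we want to supply ourselves, then URL-encode the result so it
# can be sent as the body of a form POST.
def build_post_body(scraped_fields, overrides)
  form_data = {}
  scraped_fields.each { |name, value| form_data[name] = value.to_s }
  form_data.merge!(overrides)
  URI.encode_www_form(form_data)
end

# Made-up data standing in for what the HTML parser would return.
scraped = [
  ['session_token', 'abc123'], # hidden state the server expects back
  ['username', ''],
  ['password', '']
]

body = build_post_body(scraped,
                       'username' => 'my_username',
                       'password' => 'my_secret')
puts body
```

Everything the form already contained goes back to the server untouched. Only the fields you name explicitly are replaced.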

Ruby

The Ruby approach is exactly the same: get the raw HTML, parse it, use XPath to get the name-value pairs of the FORM elements, override some of the values, and resubmit. The only rubygem added since my last article is hpricot. It provides the same functionality as NekoHTML and DOM4J do in Java. A typical script might look like this:


require 'net/http'
require 'rubygems'
require 'hpricot'

client = Net::HTTP.new('myserver', 80)
# client.set_debug_output $stderr  # uncomment this to get console debug info
client.start do |http|
  # go to the first page
  get = Net::HTTP::Get.new('/home.aspx')
  response = http.request(get)

  # collect the cookie information: keep only the name=value part
  # of each Set-Cookie header and drop attributes such as path
  set_cookies = response.get_fields('set-cookie') || []
  cookies = set_cookies.map { |c| c.split(';').first.strip }.join('; ')

  # collect the existing form data
  doc = Hpricot(response.body)
  form_data = {}
  ['text', 'password', 'hidden'].each do |t|
    elements = doc.search("//form[@name='loginForm']//input[@type='#{t}']")
    elements.each do |e|
      form_data[e['name']] = e['value'].to_s
    end
  end

  # override some of the values
  form_data['username'] = 'my_username'
  form_data['password'] = 'my_secret'

  # login
  post = Net::HTTP::Post.new('/login.aspx')
  post.set_form_data(form_data)
  post['Cookie'] = cookies unless cookies.empty?

  # print the HTML of the page returned by the login
  puts http.request(post).body
end

In this case all I did was print out the resulting HTML. At the very least you'd probably do the hpricot thing one more time to retrieve the data in which you're interested.

Other

Finally, one last gotcha... At least half the websites I've tried to scrape do something "interesting" with Javascript to set various form elements. Since we don't have a Javascript engine executing, you should expect to have to read the HTML and Javascript yourself to figure out what's going on, and then set the fields manually in your script. I highly recommend Firefox with the "Web Developer" and Firebug plugins for inspecting the HTML, the JS files, and the HTTP requests that the browser submits.
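
To illustrate setting such a field by hand, here is a minimal sketch. It assumes the page fills in a hidden field from an inline script that assigns a literal value. The page fragment, the field name token, and the regular expression are all invented for this example, and a real site will need its own pattern worked out from what you see in Firebug.

```ruby
# Made-up page fragment: the hidden field is empty in the HTML and is
# filled in by script when the page loads in a real browser.
html = <<~HTML
  <form name="loginForm">
    <input type="hidden" name="token" value="">
  </form>
  <script>
    document.loginForm.token.value = 'f00ba4';
  </script>
HTML

# What parsing the form alone would give us: the field is still empty.
form_data = { 'token' => '' }

# Pull the literal out of the script text ourselves. This pattern only
# fits the invented snippet above and must be adapted for each site.
match = html.match(/loginForm\.token\.value\s*=\s*'([^']*)'/)
form_data['token'] = match[1] if match

puts form_data['token']
```

If the script computes the value instead of assigning a literal, a regular expression won't be enough and you'll have to reproduce the calculation in your own code.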

P.S.: I was a bit lazy this time and didn't provide any Java code. If you're really having trouble and can't get it working, leave me a message and I'll put together an example. Secondly, there are other Java HTML parsers out there that may work just as well or maybe even better than Neko, but since I don't have any personal experience with them I didn't mention them. If you like something else, leave a message.