What I didn't mention is that quite often web applications store state not in a cookie but within the HTML itself. In order for you to programmatically interact with the website, you'll need to get that data out of the HTML and put it in your next request.
Java
So the basic recipe is to use HttpClient (the way I demonstrated last time) to get the raw HTML. Then feed that text into NekoHTML, an HTML parser. It can correct the various problems you see in old-school HTML, namely unbalanced tags, missing parent tags, and mismatched elements.
Neko returns a standard Document object, but personally, I don't care much for the standard XML API. Instead I prefer to use the simpler API of DOM4J. So the next step is to take the XML that NekoHTML provides and feed it into DOM4J so that I can use XPath expressions to find what I need.
Now the question becomes what data you need to post. My answer is to simply look through the FORM you're trying to submit and resubmit everything. Look for input tags of type 'hidden', 'text', and 'password', plus any select tags, and gather the names and values of all of them. Override the ones where you need to provide the information (like id and password, for example) and then do the POST.
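The gather-and-override step is the same in any language. As a rough sketch in Ruby (standard library only), and assuming the form fields have already been scraped into a hash, it might look something like this. The helper name build_post_body and the field names are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: merge the scraped form defaults with the values
# you want to supply yourself, then URL-encode the result as a POST body.
def build_post_body(scraped_fields, overrides)
  merged = scraped_fields.merge(overrides)
  URI.encode_www_form(merged)
end

# Made-up field names, in the style of an ASP.NET login form.
scraped = { '__VIEWSTATE' => 'dDwtMTA4;', 'username' => '', 'password' => '' }
overrides = { 'username' => 'my_username', 'password' => 'my_secret' }

body = build_post_body(scraped, overrides)
puts body
```

The point of merging rather than building the hash by hand is that hidden fields you never looked at still get sent back unchanged.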
Ruby
The Ruby approach is exactly the same: get the raw HTML, parse it, use XPath to get the name-value pairs of the FORM elements, override some of the values, and resubmit. The only rubygem I need beyond what I used in my last article is hpricot. It provides the same functionality as NekoHTML and DOM4J in Java. The typical script might look like this:
require 'net/http'
require 'rubygems'
require 'hpricot'

res = Net::HTTP.new('myserver', 80)
# res.set_debug_output $stderr # uncomment this to get console debug info
res.start do |http|
  # go to the first page
  get = Net::HTTP::Get.new('/home.aspx')
  response = http.request(get)

  # collect the cookie information: keep only the name=value part of
  # each Set-Cookie header and drop attributes like path and expires
  set_cookies = response.get_fields('set-cookie') || []
  cookies = set_cookies.map { |c| c.split(';').first }.join('; ')

  # collect the existing form data
  doc = Hpricot(response.body)
  form_data = {}
  ['text', 'password', 'hidden'].each do |t|
    elements = doc.search("//form[@name='loginForm']//input[@type='#{t}']")
    elements.each do |e|
      form_data[e['name']] = e['value'].to_s
    end
  end

  # override some of the values
  form_data['username'] = 'my_username'
  form_data['password'] = 'my_secret'

  # login
  post = Net::HTTP::Post.new('/login.aspx')
  post.set_form_data(form_data)
  post['Cookie'] = cookies

  # print the HTML of the page that comes back
  puts http.request(post).body
end
In this case all I did was print out the resulting HTML. At the very least you'd probably do the hpricot thing one more time to retrieve the data in which you're interested.
Other
Finally, one last gotcha... At least half the websites I've tried to scrape do something "interesting" with JavaScript to set various form elements. Since we don't have a JavaScript engine executing, you should expect that you'll have to read the HTML and JavaScript yourself to figure out what's going on, and set the fields manually in your script. I highly recommend Firefox and the "Web Developer" and Firebug plugins for inspecting the HTML, JS files and the HTTP requests that the browser submits.
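For the simplest cases, where a script just assigns a string literal to a field, a regular expression can be enough to pull the value out. Here's a minimal sketch, assuming an inline assignment of the form field.value = '...'. The page fragment, the token field, and the js_assigned_value helper are all hypothetical, and anything more dynamic than a literal will need a closer look by hand.

```ruby
# Made-up page fragment: a script block that fills in a hidden field on load.
html = <<~HTML
  <form name="loginForm">
    <input type="hidden" name="token" value="">
  </form>
  <script type="text/javascript">
    document.forms['loginForm'].token.value = 'a1b2c3';
  </script>
HTML

# Hypothetical helper: find the string literal a script assigns to a named
# form field. It only covers simple `field.value = 'literal'` assignments
# and returns nil when nothing matches.
def js_assigned_value(html, field)
  pattern = /\.#{Regexp.escape(field)}\.value\s*=\s*(['"])(.*?)\1/
  match = html.match(pattern)
  match && match[2]
end

# Pretend this hash came from scraping the form's input tags.
form_data = { 'token' => '' }

token = js_assigned_value(html, 'token')
form_data['token'] = token if token
puts form_data.inspect
```

If the helper returns nil, the scraped value is left alone, which makes it obvious in the debug output when a site has changed its script.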
P.S.: I was a bit lazy this time and didn't provide any Java code. If you're really having trouble and can't get it working, leave me a message and I'll put together an example. Secondly, there are other Java HTML parsers out there that may work just as well or maybe even better than Neko, but since I don't have any personal experience with them I didn't mention them. If you like something else, leave a message.
3 comments:
Take a look at the mechanize gem, which builds on plain Net::HTTP and Hpricot to give you a much simpler interface, handling all the cookies and form stuff for you.
Mechanize is a great gem. I only discovered it a couple weeks after I made this post.
This was very helpful to me today. Thank you very much!