In a mythical IT universe designed around a service-oriented architecture (SOA), you could assemble several loosely coupled, autonomous services into a business solution. Each service would communicate in a platform- and technology-agnostic manner, and XML would be the lingua franca. For example, if you needed to integrate data from several business partners, you could call various SOAP or RESTful services, get structured XML, and then transform the results into something you could use.
Unfortunately, that's not what happens in the real world. Instead of nice composable services, you usually get web sites targeted at people, not machines. That means you need to automate what would typically be browser conversations with various web sites to get the data you need. And then you need to deal with the format of the resulting data. If you're lucky, the data may be structured in a comma-separated values (CSV) text file. More often you'll have to parse an unstructured text representation of a report, or get the data out of an Excel spreadsheet or even a PDF. It's not pretty.
But let's forget that unpleasantness for the moment and deal with the first problem you'll encounter when trying to automate a browser conversation: getting past the various authentication mechanisms. Let's look at BASIC authentication first. It's pretty common and well supported in both the Java-based Jakarta Commons HttpClient library and the Ruby Net::HTTP Standard Library.
BASIC Authentication
To start things off, I created a simple Java-based "Dynamic Web Project" using the Eclipse Web Tools Project (WTP) plugins. To keep things simple, I created a servlet that returns a string. Then I configured the application to protect the URL for that servlet with basic authentication in web.xml. Finally, in Tomcat's server.xml, I modified the context element for my web app to include a reference to the default Tomcat in-memory user database:
<Context docBase="MyWebApp" path="/MyWebApp" [...]>
    <Realm className="org.apache.catalina.realm.UserDatabaseRealm"
           debug="0" resourceName="UserDatabase"/>
</Context>
With that in place and the server running, we can write a couple of methods to access the servlet. In Java:
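import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthPolicy;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;

public static void basicAuthDemo()
        throws HttpException, IOException {
    HttpClient client = new HttpClient();
    // Restrict the schemes HttpClient will try to BASIC only
    List<String> authPrefs = new ArrayList<String>();
    authPrefs.add(AuthPolicy.BASIC);
    client.getParams().setParameter(
        AuthPolicy.AUTH_SCHEME_PRIORITY, authPrefs);
    // Register credentials for this host, port and realm
    client.getState().setCredentials(
        new AuthScope("localhost", 8080, "localhost:8080"),
        new UsernamePasswordCredentials("tomcat", "tomcat"));
    GetMethod get = new GetMethod(
        "http://localhost:8080/MyWebApp/myservlet");
    get.setDoAuthentication(true);
    try {
        client.executeMethod(get);
        System.out.println(get.getResponseBodyAsString());
    } finally {
        // Always hand the connection back, even on failure
        get.releaseConnection();
    }
}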
In this example I limited HttpClient's authentication mechanisms to BASIC. I know what my target system uses, so why complicate matters with DIGEST or NTLM? After that, it was a simple matter of defining the credentials, telling HttpClient to use them automatically, and executing the HTTP GET method. It looks surprisingly similar in Ruby:
require 'net/http'
require 'uri'

def basic_auth_demo
  url = URI.parse('http://localhost:8080/MyWebApp/myservlet')
  get = Net::HTTP::Get.new(url.path)
  # Attach the credentials directly to this request
  get.basic_auth('tomcat', 'tomcat')
  response = Net::HTTP.new(url.host, url.port).start do |http|
    http.request(get)
  end
  puts response.body
end
The difference between the two implementations is that the Java HttpClient is doing some housekeeping for you. You define a scope for your authentication, and as long as your GetMethod is configured to do authentication, it will automatically pick up any necessary credentials from the HttpClient instance. In Ruby you need to set the credentials explicitly on each Net::HTTP::Get request.
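To make the Ruby side a little less opaque, here is a minimal sketch of what `basic_auth` appears to put on the request: an `Authorization` header whose value is the word `Basic` followed by the Base64 encoding of `user:password`. The `tomcat`/`tomcat` credentials and the servlet path come from the example above; the helper name `basic_auth_header` is hypothetical and exists only for illustration.

```ruby
require 'net/http'

# Hypothetical helper: builds the Authorization header value by hand,
# as the BASIC scheme defines it (Base64 of "user:password").
def basic_auth_header(user, password)
  # pack('m0') Base64-encodes without inserting line breaks
  'Basic ' + ["#{user}:#{password}"].pack('m0')
end

get = Net::HTTP::Get.new('/MyWebApp/myservlet')
get.basic_auth('tomcat', 'tomcat')

# The header set by Net::HTTP and the hand-built one should match.
puts get['Authorization']
puts basic_auth_header('tomcat', 'tomcat')
```

Nothing is sent over the network here; the request object is only built and inspected. Note that the credentials are merely encoded, not encrypted, so anyone who can read the request can recover them.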
FORM-based authentication
The next most common method is form-based authentication. When you make a request for a web resource, the response contains a cookie that identifies your session on the server. If that session indicates that you haven't been authenticated yet, you're redirected to a form to enter your id and password. You fill in the values and submit the form. Assuming you entered the right credentials, your session on the server will now indicate that you're authenticated, and every subsequent request (which includes the cookie identifying your now-authenticated session) will execute normally. There are variations on this theme that may add more than one cookie, so be sure to capture all the cookies and keep submitting them on every request in your conversation.
In order to test this I modified the web.xml file for my Java web app:
<login-config>
    <auth-method>FORM</auth-method>
    <form-login-config>
        <form-login-page>/login.jsp</form-login-page>
        <form-error-page>/login-error.jsp</form-error-page>
    </form-login-config>
</login-config>
and added the requisite JSP pages. The login.jsp contains a form that looks like this:
<form method="POST" action="j_security_check">
    Username:<input type="text" name="j_username"><br/>
    Password:<input type="password" name="j_password"><br/>
    <input type="submit" value="Login">
</form>
The Java code to access the servlet using form-based authentication looks like this:
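import java.io.IOException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;

public static void formAuthDemo()
        throws IOException, HttpException {
    HttpClient client = new HttpClient();
    // Make the initial GET to obtain the JSESSIONID cookie;
    // HttpClient stores it in its state automatically
    GetMethod get = new GetMethod(
        "http://localhost:8080/MyWebApp/myservlet");
    client.executeMethod(get);
    get.releaseConnection();
    // Authenticate by posting the login form fields
    PostMethod post = new PostMethod(
        "http://localhost:8080/MyWebApp/j_security_check");
    NameValuePair[] data = {
        new NameValuePair("j_username", "tomcat"),
        new NameValuePair("j_password", "tomcat")
    };
    post.setRequestBody(data);
    client.executeMethod(post);
    post.releaseConnection();
    // Resubmit the original request, now within an authenticated session
    client.executeMethod(get);
    String response = get.getResponseBodyAsString();
    get.releaseConnection();
    System.out.println(response);
}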
The Ruby code looks like this:
require 'net/http'

def form_auth_demo
  Net::HTTP.new('localhost', 8080).start do |http|
    # Make the initial GET to obtain the JSESSIONID cookie
    get = Net::HTTP::Get.new('/MyWebApp/myservlet')
    response = http.request(get)
    # Keep only the name=value part, dropping attributes such as Path
    cookie = response['set-cookie'].split(';')[0]
    # Authenticate by posting the login form fields
    post = Net::HTTP::Post.new('/MyWebApp/j_security_check')
    post.set_form_data('j_username' => 'tomcat', 'j_password' => 'tomcat')
    post['Cookie'] = cookie
    http.request(post)
    # Resubmit the original request with the session cookie attached
    get['Cookie'] = cookie
    response = http.request(get)
    puts response.body
  end
end
Again, the two implementations are remarkably similar. The biggest difference is that the Java HttpClient library is again doing the housekeeping by tracking and automatically resubmitting the cookies for you. In the Ruby code you have to fetch the cookie from the response's Set-Cookie header yourself and set the Cookie header on every subsequent request.
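If the server sets more than one cookie, the manual housekeeping in Ruby needs a little more care, because reading `response['set-cookie']` joins all the Set-Cookie headers into one comma-separated string. The following is a minimal sketch of one way to handle that, using `get_fields` to read each header separately. The helper name `cookie_header` and the cookie names and values are made up for illustration, and the response is built by hand so the sketch runs without a server.

```ruby
require 'net/http'

# Hypothetical helper: takes the name=value part of every Set-Cookie
# header on a response and joins them into one Cookie header value.
def cookie_header(response)
  set_cookies = response.get_fields('set-cookie') || []
  set_cookies.map { |c| c.split(';').first.strip }.join('; ')
end

# Stand-in for a real server response (assumed values, not real output)
response = Net::HTTPResponse.new('1.1', '200', 'OK')
response.add_field('Set-Cookie', 'JSESSIONID=ABC123; Path=/MyWebApp')
response.add_field('Set-Cookie', 'lb=node1; Path=/')

puts cookie_header(response)
```

In a real conversation you would assign the result to `request['Cookie']` on each later request, in the same way the single cookie is set in the Ruby example above. This sketch ignores cookie attributes such as Path and expiry, which a fuller implementation would need to respect.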
So there you go: you're past the web site's authentication and ready to make whatever requests you need to get the data you require.