The Lazy Developer’s Guide to Loading Datasets into GeoCommons
Loading KML Files
So let’s say you have a bunch of KML files you want to load into GeoCommons. Of course, it’s fairly easy to load these through the web UI, but if you need to do this often enough, it would be nice to have a program to do it for you. After all, as Larry Wall said, laziness is one of the three virtues of great programmers.
Frankly, it’s not exactly obvious from our API documentation what the best way to do this is. And if you aren’t familiar with curl, the examples are probably not going to help you much, so I’ll be doing this code in Java. Of course, we here at GeoIQ are Ruby programmers, and thus have a natural disdain for anything to do with Java, so I’m probably losing serious Ruby street cred just posting this, but anything for the good of the cause. We will be using the occasionally obtuse GeoCommons REST API, but I’ll try to steer you around some of the not so obvious pitfalls.
The basic idea of the program is that you run it from the command line, passing in your login, password, and a directory. It then scans the directory for KML files and uploads them to GeoCommons. Fortunately, Java handles getting the files elegantly with a FilenameFilter:
public class Loader {

    public static void main(String[] args) {
        // the directory to scan is the third command line argument
        String dirPath = args[2];
        for (String filePath : kmlFilesIn(dirPath)) {
            // load file here
        }
    }

    /**
     * Lists the kml files in a directory
     *
     * @param dirPath path to the directory to scan
     * @return array of filenames for kml files in specified directory
     */
    private static String[] kmlFilesIn(String dirPath) {
        String[] filePaths = new File(dirPath).list(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.endsWith(".kml");
            }
        });
        // File.list returns null when the path is not a readable directory
        if (filePaths == null) {
            throw new RuntimeException("Invalid directory: " + dirPath);
        }
        return filePaths;
    }
}
POSTing the file
So far, it’s pretty basic Java stuff. Let’s start implementing the Loader itself. The first thing we need to do is submit a POST request to GeoCommons with our KML data. Looking at the API examples, there is nothing about sending KML. There is this cryptic curl example for CSV:
cat 98633.csv | curl -i -X POST -u "username:password" --data-binary @- -H "Content-Type: text/csv" http://geocommons.com/datasets.json
You might think we could do something similar and send the raw KML data in the POST with the content type set to “application/vnd.google-earth.kml+xml”, but I’m not sure this will work. Even if it did, as a general approach it has some major shortcomings. For example, you can’t send any other data in the request. Instead, we will be sending the data as if it came from a form, with a Content-Type of “multipart/form-data”. While it may be more complex at first, it can be used in virtually all cases where we are POSTing or PUTting data to GeoCommons.
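If you’re curious what a multipart/form-data body actually looks like on the wire, here is a rough sketch built with nothing but the standard library. The class name, method name, boundary string, and file name are made up for illustration, and the Content-Type I put on the file part is my assumption rather than something GeoCommons documents. The field name dataset[kml] is the one used in the post method later in this article. This is only meant to show the format, not to replace a real HTTP library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hand-rolled illustration of a multipart/form-data body; not production code. */
class MultipartSketch {

    /**
     * Builds a request body containing a single file part.
     *
     * @param boundary  separator string; must not occur inside the content
     * @param fieldName form field name, e.g. "dataset[kml]"
     * @param fileName  file name reported to the server
     * @param content   raw file bytes
     * @return the bytes that would be sent as the request body
     */
    static byte[] buildBody(String boundary, String fieldName,
                            String fileName, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // each part starts with "--" + boundary, then its own headers, then a blank line
        String partHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/vnd.google-earth.kml+xml\r\n"
                + "\r\n";
        byte[] headerBytes = partHeader.getBytes(StandardCharsets.UTF_8);
        out.write(headerBytes, 0, headerBytes.length);
        out.write(content, 0, content.length);
        // the body ends with the boundary followed by two extra dashes
        byte[] closing = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);
        out.write(closing, 0, closing.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] kml = "<kml></kml>".getBytes(StandardCharsets.UTF_8);
        byte[] body = buildBody("XyZ123", "dataset[kml]", "sample.kml", kml);
        System.out.print(new String(body, StandardCharsets.UTF_8));
    }
}
```

The request itself would then carry a header like Content-Type: multipart/form-data; boundary=XyZ123, which is how the server knows where each part starts and ends.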
For some reason, multipart form data doesn’t seem to be well supported in Java’s networking libraries (or Ruby’s, for that matter). The code to format multipart form data isn’t particularly hard to write, but it can be a little tedious to get correct, particularly if you can’t see what’s happening on the server side. Instead we’ll be using the HttpComponents project from Apache. You can download HttpCore and HttpClient (I got versions 4.1.3 and 4.1.2 respectively) from here.
So, let’s add a method to our class to post a file:
/**
 * Posts a file to a given URL
 *
 * @param url URL string to post to
 * @param file file to post
 * @return location returned
 * @throws LoaderException if we don't get a good response from the server
 * @throws IOException if the file can't be read or the request fails
 */
private String post(String url, File file) throws IOException, LoaderException {
    HttpClient httpclient = new DefaultHttpClient();
    HttpPost request = new HttpPost(url);
    // send the file as a form field named dataset[kml]
    MultipartEntity entity = new MultipartEntity();
    entity.addPart("dataset[kml]", new FileBody(file));
    request.setEntity(entity);
    HttpResponse response = httpclient.execute(request);
    // 201 Created means the dataset was accepted
    if (response.getStatusLine().getStatusCode() == 201) {
        return response.getFirstHeader("Location").getValue();
    } else {
        throw new LoaderException("Failed to process file: "
                + EntityUtils.toString(response.getEntity()));
    }
}
Basically, we create HttpClient and HttpPost objects, add our file, submit the request, and handle the response. In handling the response, we have a couple of options. When you POST a new object to GeoCommons, it should respond with a representation of the data posted (based on the extension you gave in the URL, “json” in this case). I recommend getting your response as JSON in most cases, since it is handled more consistently across the GeoCommons API. In addition to the JSON, the response will have a URL in the “Location” header that you can use as an HTTP endpoint for further REST requests. Since 90% of the time we are going to need that URL, for reasons that will become obvious later, this method returns the URL rather than parsing the JSON content.
You might be wondering what is in that JSON response:
{ "success": true, "state": "processing", "id": 1234, "title": "myfile", "desc": "" }
The two important fields here are the id, which you are going to need to do things like add a dataset to a map, and the state, which will be described later. You can actually get the id from the URL in the “Location” header (or vice versa), since it would just be “http://geocommons.com/1234.json”.
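As a small illustration of pulling the id back out of that URL, here is a sketch that grabs the last numeric path segment. The class and method names are hypothetical, and I’m assuming the URL always ends in digits followed by an optional extension, as in the example above. If GeoCommons returns a differently shaped URL, the pattern would need adjusting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: recover a dataset id from a Location URL (assumed URL shape). */
class DatasetIds {

    // a final path segment made of digits, optionally followed by an extension
    private static final Pattern TRAILING_ID =
            Pattern.compile("/(\\d+)(?:\\.\\w+)?/?$");

    /**
     * @param location URL taken from the Location header
     * @return the numeric id found at the end of the URL
     * @throws IllegalArgumentException if the URL does not end in a numeric segment
     */
    static long idFromLocation(String location) {
        Matcher matcher = TRAILING_ID.matcher(location);
        if (!matcher.find()) {
            throw new IllegalArgumentException("No dataset id in: " + location);
        }
        return Long.parseLong(matcher.group(1));
    }

    public static void main(String[] args) {
        System.out.println(idFromLocation("http://geocommons.com/1234.json"));
    }
}
```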
Authenticating
There are a number of options for authenticating with GeoCommons. For now, we’ll be using HTTP Basic Authentication, because it’s easy to implement. Unfortunately, it’s not very secure. Maybe in some future blog post I’ll cover other authentication methods.
HttpClient seems to have support for doing authentication, but after half an hour of searching the JavaDocs, reading the examples, and trying out code, I was unable to figure out how it’s supposed to work. Fortunately, all we really need to do for Basic Authentication is set a header in the request with the username and password encoded using Base64 encoding, so we can skip all those confusing authentication classes and just do:
/**
 * Add basic authentication header to request
 *
 * @param request request to add the header to
 */
private void addAuthentication(HttpRequestBase request) {
    // Basic authentication is just "login:password", Base64 encoded
    String usernamePassword = login + ":" + password;
    String encodedUsernamePassword
            = DatatypeConverter.printBase64Binary(usernamePassword.getBytes());
    request.addHeader("Authorization", "Basic " + encodedUsernamePassword);
}
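One caveat if you are on a newer JDK: DatatypeConverter lives in javax.xml.bind, which is no longer bundled with JDK 11 and later. A possible alternative, sketched below, is java.util.Base64, which has been in the standard library since Java 8. The class and method names here are my own, and I’m assuming UTF-8 is an acceptable encoding for the credentials.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch: build a Basic Authentication header value with java.util.Base64. */
class BasicAuthSketch {

    /**
     * @param login    account login
     * @param password account password
     * @return value for the Authorization header, e.g. "Basic dXNlcjpwYXNz"
     */
    static String basicAuthHeaderValue(String login, String password) {
        // credentials are joined with a colon, then Base64 encoded
        String credentials = login + ":" + password;
        byte[] bytes = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(basicAuthHeaderValue("mylogin", "mypassword"));
    }
}
```

The returned string could be passed straight to request.addHeader("Authorization", ...) in place of the DatatypeConverter call.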
Waiting for Completion
The processing of an uploaded file can be a time-consuming process, particularly for large files. If the process takes too long, the HTTP server (Apache) will eventually time out the request. Rather than having clients keep the HTTP connection open and risk a timeout, GeoCommons will in many cases process the file asynchronously. This means your POST request will come back with a response before the file has completed processing. Before you do anything else with that file (except maybe deleting it), you need to check to see if processing is complete. This is done by sending a GET request to the URL we got back from our post method. The response will be a JSON object, somewhat similar to the one we got back from the original POST request, but possibly with more fields in it. At this point, we are only interested in the state. For our purposes, there are four possible values:
- complete – we’re done with this file, and can move on
- processing – GeoCommons is still processing the file, we need to wait
- errored – Some unrecoverable error occurred; we’re done with the file, but should inform the user
- anything else – GeoCommons stopped processing the file, and it’s now in some intermediate state waiting for further input.
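One way to keep those four cases straight in code is to map the state string to a small enum. This is just a sketch: the enum, its constant names, and the method name are my own invention, and treating a missing state the same as “anything else” is an assumption on my part.

```java
/** Sketch: classify the "state" field of a dataset into what the loader should do. */
class DatasetStates {

    enum Action { DONE, WAIT, FAIL, NEEDS_INPUT }

    /**
     * @param state value of the "state" field, may be null if the field is missing
     * @return what the loader should do next
     */
    static Action actionFor(String state) {
        // constant-first equals keeps this safe when state is null
        if ("complete".equals(state)) {
            return Action.DONE;
        } else if ("processing".equals(state)) {
            return Action.WAIT;
        } else if ("errored".equals(state)) {
            return Action.FAIL;
        }
        // anything else: processing stopped in some intermediate state
        return Action.NEEDS_INPUT;
    }

    public static void main(String[] args) {
        System.out.println(actionFor("processing"));
    }
}
```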
Unfortunately, Java (at least as of version 1.6) doesn’t have a built-in library for parsing JSON, but as usual, there are a ton of third-party libraries that will do JSON just fine. Being a simple person, I ended up using json-simple. You can download version 1.1 from here.
So, our code to check the state of the dataset looks like:
/**
 * Checks the specified dataset url to see if the dataset has completed.
 *
 * @param url dataset url returned by the post method
 * @return true once the dataset is no longer in the processing state
 * @throws IOException in the unlikely event that the response can't be processed
 * @throws LoaderException if we don't get a good response from the server or geocommons
 * has a problem with the file
 */
private boolean isComplete(String url) throws IOException, LoaderException {
    String datasetJson = get(url);
    JSONObject dataset = (JSONObject) JSONValue.parse(datasetJson);
    String state = (String) dataset.get("state");
    if (state.equals("errored")) {
        throw new LoaderException("Error processing file");
    }
    return !state.equals("processing");
}

/**
 * Sends a GET request to a URL, and returns the response body as a string
 *
 * @param url url to get
 * @return string representing the response content
 * @throws IOException in the unlikely event that the response body can't be processed
 * @throws LoaderException if we don't get a good response from the server
 */
private String get(String url) throws IOException, LoaderException {
    HttpClient httpclient = new DefaultHttpClient();
    HttpGet request = new HttpGet(url);
    addAuthentication(request);
    HttpResponse response = httpclient.execute(request);
    if (response.getStatusLine().getStatusCode() == 200) {
        return EntityUtils.toString(response.getEntity());
    } else {
        throw new LoaderException(
                "Failed to get dataset: " + EntityUtils.toString(response.getEntity()));
    }
}
We are sending the GET request, then parsing the response content as JSON and pulling out the “state” property.
Putting it all together, the loader will now load each file, wait until it finishes loading, then move on to the next one:
/**
 * Loads multiple kml files to GeoCommons.
 *
 * This program will upload all kml files in the specified directory to geocommons.
 *
 * To run:
 *
 * java com.geoiq.Loader mylogin mypassword /dir/to/files
 */
package com.geoiq;

import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;

import javax.xml.bind.DatatypeConverter;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.entity.mime.MultipartEntity;
import org.apache.http.entity.mime.content.FileBody;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;
import org.json.simple.JSONObject;
import org.json.simple.JSONValue;

public class Loader {

    public static final String SITE_URL = "http://geocommons.com";

    private String login;
    private String password;

    public Loader(String login, String password) {
        this.login = login;
        this.password = password;
    }

    /**
     * @param args command line arguments
     *
     * args[0] - geocommons login
     * args[1] - geocommons password
     * args[2] - directory
     */
    public static void main(String[] args) {
        // bail out early if the three expected arguments are missing
        if (args.length != 3) {
            System.err.println("Usage: java com.geoiq.Loader login password directory");
            System.exit(1);
        }
        String login = args[0];
        String password = args[1];
        String dirPath = args[2];
        Loader loader = new Loader(login, password);
        for (String filePath : kmlFilesIn(dirPath)) {
            loader.load(new File(dirPath, filePath));
        }
    }

    /**
     * Lists the kml files in a directory
     *
     * @param dirPath path to the directory to scan
     * @return array of filenames for kml files in specified directory
     */
    private static String[] kmlFilesIn(String dirPath) {
        String[] filePaths = new File(dirPath).list(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.endsWith(".kml");
            }
        });
        // File.list returns null when the path is not a readable directory
        if (filePaths == null) {
            throw new RuntimeException("Invalid directory: " + dirPath);
        }
        return filePaths;
    }

    /**
     * Loads the specified kml file to geocommons, tracking state until load completes
     *
     * @param file file to load
     */
    public void load(File file) {
        System.out.println("loading " + file.getPath());
        try {
            String url = post(SITE_URL + "/datasets.json", file);
            // poll every three seconds until geocommons stops processing the file
            while (!isComplete(url)) {
                Thread.sleep(3000);
            }
        } catch (LoaderException e) {
            System.err.println("Failed to load " + file.getPath());
            e.printStackTrace(System.err);
        } catch (IOException e) {
            System.err.println("Failed to load " + file.getPath());
            e.printStackTrace(System.err);
        } catch (InterruptedException e) {
            System.err.println("Interrupted");
            // restore the interrupt flag so callers can see it
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Checks the specified dataset url to see if the dataset has completed.
     *
     * @param url dataset url returned by the post method
     * @return true once the dataset is no longer in the processing state
     * @throws IOException in the unlikely event that the response can't be processed
     * @throws LoaderException if we don't get a good response from the server or
     * geocommons has a problem with the file
     */
    private boolean isComplete(String url) throws IOException, LoaderException {
        String datasetJson = get(url);
        JSONObject dataset = (JSONObject) JSONValue.parse(datasetJson);
        String state = (String) dataset.get("state");
        if (state.equals("errored")) {
            throw new LoaderException("Error processing file");
        }
        return !state.equals("processing");
    }

    /**
     * Sends a GET request to a URL, and returns the response body as a string
     *
     * @param url url to get
     * @return string representing the response content
     * @throws IOException in the unlikely event that the response can't be processed
     * @throws LoaderException if we don't get a good response from the server
     */
    private String get(String url) throws IOException, LoaderException {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet request = new HttpGet(url);
        addAuthentication(request);
        HttpResponse response = httpclient.execute(request);
        if (response.getStatusLine().getStatusCode() == 200) {
            return EntityUtils.toString(response.getEntity());
        } else {
            throw new LoaderException(
                    "Failed to get dataset: " + EntityUtils.toString(response.getEntity()));
        }
    }

    /**
     * Posts a file to a given URL using basic authentication
     *
     * @param url URL string to post to
     * @param file file to post
     * @return location returned
     * @throws LoaderException if we don't get a good response from the server
     * @throws IOException if the file can't be read or the request fails
     */
    private String post(String url, File file) throws IOException, LoaderException {
        HttpClient httpclient = new DefaultHttpClient();
        HttpPost request = new HttpPost(url);
        addAuthentication(request);
        // send the file as a form field named dataset[kml]
        MultipartEntity entity = new MultipartEntity();
        entity.addPart("dataset[kml]", new FileBody(file));
        request.setEntity(entity);
        HttpResponse response = httpclient.execute(request);
        // 201 Created means the dataset was accepted
        if (response.getStatusLine().getStatusCode() == 201) {
            return response.getFirstHeader("Location").getValue();
        } else {
            throw new LoaderException(
                    "Failed to process file: " + EntityUtils.toString(response.getEntity()));
        }
    }

    /**
     * Add basic authentication header to request
     *
     * @param request request to add the header to
     */
    private void addAuthentication(HttpRequestBase request) {
        String usernamePassword = login + ":" + password;
        String encodedUsernamePassword
                = DatatypeConverter.printBase64Binary(usernamePassword.getBytes());
        request.addHeader("Authorization", "Basic " + encodedUsernamePassword);
    }

    /**
     * Exception thrown when geocommons doesn't behave as expected.
     */
    private static class LoaderException extends Exception {
        private static final long serialVersionUID = 1L;

        public LoaderException(String message) {
            super(message);
        }
    }
}
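One thing worth noting about the load method above: its while loop keeps polling for as long as the dataset stays in the processing state, so a stuck dataset would hang the program. A possible refinement, sketched here under my own assumptions, is to cap the number of attempts. The class name, method name, and parameters are hypothetical and not part of the GeoCommons API or the Loader class.

```java
import java.util.concurrent.Callable;

/** Sketch: poll a condition a limited number of times instead of forever. */
class BoundedPoller {

    /**
     * Calls check until it returns true or maxAttempts is reached.
     *
     * @param check       condition to test, e.g. a call to isComplete
     * @param maxAttempts maximum number of times to run the check
     * @param delayMillis pause between attempts, in milliseconds
     * @return true if the check succeeded, false if we gave up
     * @throws Exception whatever the check throws, or InterruptedException from sleep
     */
    static boolean pollUntil(Callable<Boolean> check, int maxAttempts, long delayMillis)
            throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (check.call()) {
                return true;
            }
            // no point sleeping after the final attempt
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis);
            }
        }
        return false;
    }
}
```

On Java 8 or later, the loop inside load could then become something like pollUntil(() -> isComplete(url), 100, 3000), with a false return value reported to the user as a timeout. The limit of 100 attempts is an arbitrary number picked for the example.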