My boss recently stumbled across a project named Gisgraphy. A big part of what we do involves geocoding. We have generally been using geocode.com for batch geocoding, but there's a cost to that, and they only cover the US and Canada. There are many other geocoding services, but if you're doing heavy volume, you're generally excluded from the free options, and the paid options can get expensive.
Gisgraphy is an open source project that you can set up on your own server. It will pull in data from freely-available sources, load it all into a local database, then allow you to use a REST web service to geocode addresses. A little testing with some US addresses leads me to believe that it's generally accurate to street level, but not quite to house level. So, I'm not sure that we'll want to use it for all of our geocoding, but it ought to be generally useful.
We decided to set it up on an AWS EC2 instance. We started messing with EC2 VMs for another project, and it seemed like EC2 would be a good fit for this project too. I started out with a small Linux instance, but switched to a medium instance, since the importer was really stressing the small one. I will probably switch back to small after the import is done. That's one nice thing about EC2: being able to adjust the horsepower available to your VM.
Gisgraphy uses several technologies that are outside my comfort zone. I'm primarily a Windows / .NET / SQL Server guy, with a reasonable amount of experience with Linux / MySQL / PHP. Gisgraphy runs on Linux (also on Windows, but it's obviously more at home on Linux), so that's ok. But it's written in Java, and uses PostgreSQL as its back-end database. I have only very limited experience with Java and PostgreSQL. And, of course, I'm new to AWS/EC2 also.
So, setting this all up was a bit of a challenge. The instructions are ok, but somewhat out of date. I'm using Ubuntu 12.04 LTS on EC2, and many things aren't found in the same places as they were under whatever Linux environment the author based the instructions on. For the sake of anyone else who might need a little help getting the basic setup done under a recent version of Ubuntu, I thought I'd list a few pointers where I had to do things a bit differently from the Linux instructions:
- Java: I installed Java like this: "sudo apt-get install openjdk-6-jdk openjdk-6-jre".
- And JAVA_HOME should be /usr/lib/jvm/java-6-openjdk-i386/ or /usr/lib/jvm/java-6-openjdk-amd64/.
- PostgreSQL: I installed the most recent versions of PostgreSQL and PostGIS like this: "sudo apt-get install postgresql postgresql-contrib postgis postgresql-9.1-postgis".
- Config files were in /etc/postgresql/9.1/main and data files were in /var/lib/postgresql/9.1/main.
- PostGIS: In the instructions for configuring PostGIS, the "createlang" command wasn't necessary.
- And the SQL scripts you need to run are /usr/share/postgresql/9.1/contrib/postgis-1.5/postgis.sql and spatial_ref_sys.sql.
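For anyone who prefers to see the pointers above in one place, here is a rough shell sketch. The package names and paths come straight from the list; the function names are just made up for illustration, and the database name passed to load_postgis is an assumption (use whatever name your Gisgraphy setup expects, and create the database first). Nothing runs until you call the functions yourself.

```shell
#!/bin/sh
# Sketch of the Ubuntu 12.04 setup pointers above. Function names are
# illustrative only; nothing is executed until a function is called.

# Map a Debian architecture name (as printed by "dpkg --print-architecture")
# to the OpenJDK 6 directory listed above.
java_home_for_arch() {
    case "$1" in
        amd64|i386) printf '/usr/lib/jvm/java-6-openjdk-%s/\n' "$1" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Install Java, PostgreSQL and PostGIS using the package names from the list.
install_packages() {
    sudo apt-get install openjdk-6-jdk openjdk-6-jre
    sudo apt-get install postgresql postgresql-contrib postgis postgresql-9.1-postgis
}

# Run the two PostGIS SQL scripts against an existing database.
# $1 is the database name (an assumption; not specified in this post).
load_postgis() {
    contrib=/usr/share/postgresql/9.1/contrib/postgis-1.5
    sudo -u postgres psql -d "$1" -f "$contrib/postgis.sql"
    sudo -u postgres psql -d "$1" -f "$contrib/spatial_ref_sys.sql"
}
```

A typical run would be install_packages, then export JAVA_HOME="$(java_home_for_arch "$(dpkg --print-architecture)")", then load_postgis with your database name.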
That's about it for now, I think. I want to write up another blog entry on Gisgraphy, once I've got it fully up & running. And there might be some value in a blog entry on EC2. But now I have to get back to finishing my laundry!