I have a new repository on GitHub that demonstrates a couple of basic machine learning and AI techniques, principally picked up from CS_373 and Stanford's Introduction to AI. It's all explained there, and I intend to add to it as I continue my education in the field.
Machine learning is something I rarely hear talked about in the spatial developer field. This is unfortunate, as it actually lends itself to spatial data quite nicely.
Eliminating duplicates semantically (tuples not perfectly identical)
Feature identification with computer vision
A number of former GIS Analysts have pivoted and started calling themselves data scientists. If you are among them, this is some good stuff to know. And it's likely you've done at least some of the clustering/classification work before.
A word on the documentation: I used pycco, a very easy to use annotated code document generator (port of Docco). Here is a particle filter. Here is a Kalman filter.
My company bought an ESRI Flex API application that tried to duplicate ArcMap in the browser - even the UX that ESRI, in retrospect, would have probably avoided if they rewrote the thing (search/selection). While fully featured enough for everybody, it was too much for folks with a customer on the phone just trying to find an address.
I couldn't modify what we bought significantly, so I put together a quick webmap with OpenLayers and the ArcGIS For Server REST API and the inevitable happened - it snowballed. Now the one-off application that was meant for quick lookups grew rapidly, and the haphazard way in which features were added is obvious. It was technical debt, and it was time to pay some of it down.
Refactoring
Like a lot of people who eventually use backbone, I started by writing a poor reimplementation of its core concepts.
"Cramming the data into OpenLayers classes and using accessors all over the place doesn't scale - I should have generic objects that interact with other libraries as necessary"
"Polling the server for data is a common enough operation it could be abstracted to the generic objects"
"Separation of concerns is essential to making this stuff reusable, I'll abstract away the actual display/UI portions from the objects holding the data"
If you find yourself doing any of this, I encourage you to look at backbone. It does all of that and has sane event handling and awesome operations on collections and a bunch of other stuff.
Before and After
I won't show all of the code I shifted over, but I want to mention that only some of the application has moved over to backbone. It's possible to do this incrementally and without a massive rewrite.
There are any number of problems with the above code, some resulting from the haste of the implementation (icon related variables are redefined repeatedly, AJAX call within an AJAX call), but the biggest single issue for maintainability:
What the layer is and what the layer does and how it is displayed is all conflated in one giant self-calling function
The below isn't perfect, but it largely fixes that giant issue:
A Vehicle is a Model. Vehicles is a Collection of these models. If something resets the collection (like an update from the server) it will automatically reproject it. Getting that update? Just call fetch.
So now we have the entity itself as a self contained part of the application. Now the UI component:
The display of Vehicles is separate from the entity itself, so there can be many variations. In the case of a LayerView, I want it as an OpenLayers.Layer on an OpenLayers map. Another view may be displaying the vehicles in a table or a cool graph with d3.js showing speed over time.
The next time I want to display vehicles differently, or add a new (semi) realtime layer, or use the vehicle collection somewhere else for client side analysis, I can do so without much difficulty. I can also modify the view without breaking Vehicle(s), and largely vice versa.
Some of the most important spatial data is old. It was built up and maintained over decades by paper and early computer systems, and it represents power lines, roads, water pipes, and property lines. It would be good to know the precise location of this stuff.
The PLC power system was designed on lotlines drawn in-house. Today, the difference between those lotlines and the actual parcel locations is as much as 100 ft, and in no consistent direction. What follows are attempts to correct the location of more than 20,000 structures without doing a significant portion by hand, using some techniques picked up in Stanford's Free AI class.
The correct location is the "Hidden" bit
Education in some very advanced and useful algorithms is now within the grasp of anyone with an internet connection and a decade old computer. More than a hundred thousand participated in the recently completed Stanford AI course, including myself. One particular technique caught my eye:
The problem being solved above is one of location - that is the hidden variable that needs to be estimated in continuous space. Why couldn't I do something similar for static assets like poles and underground vaults? With enough control points I could then move everything else relative to them (inverse distance weighted rubbersheeting) and vastly improve the data.
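To make the rubbersheeting idea concrete, here is a minimal sketch of inverse distance weighting in Python. It is an illustration only, not the code used on the real data: the function name idw_shift, the power of 2, and the sample control points are all assumptions. Each control point carries a known correction, and every other point is moved by the distance-weighted average of those corrections.

```python
import math


def idw_shift(point, control_points, power=2.0):
    """Move `point` by the inverse-distance-weighted average of the
    displacements observed at the control points.

    control_points: list of ((x, y), (dx, dy)) pairs, where (dx, dy) is the
    correction already worked out for the asset at (x, y).
    """
    if not control_points:
        return point  # nothing to interpolate from

    x, y = point
    sum_dx = sum_dy = sum_w = 0.0
    for (cx, cy), (dx, dy) in control_points:
        distance = math.hypot(x - cx, y - cy)
        if distance == 0.0:
            # The point is itself a control point: use its correction as-is.
            return (x + dx, y + dy)
        weight = 1.0 / distance ** power
        sum_dx += weight * dx
        sum_dy += weight * dy
        sum_w += weight
    return (x + sum_dx / sum_w, y + sum_dy / sum_w)


# Hypothetical control points: two assets whose true positions are known.
controls = [((0.0, 0.0), (10.0, 0.0)), ((10.0, 0.0), (0.0, 10.0))]
print(idw_shift((5.0, 0.0), controls))  # a point halfway between the two
```

A higher power makes nearby control points dominate, which is probably what you want when the error has no consistent direction across the system.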
A Naive Approach
I wanted to start with the simplest possible implementation. I loaded the lotlines (old, hand-drawn), parcel polygons, and the poles into PostGIS. I then converted the lines and polygons to points, and decided to use the total sum distance as the mechanism for comparing candidate particles to the poles.
Again, very naive (and the data is too noisy for it to work), but it served a purpose - getting everything set up for my next iteration: comparing candidates based on tangent and distance as the robot sensors above undoubtedly do.
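One possible reading of that naive comparison, sketched in Python. This is not the project's code. The function names, the Gaussian weighting, and the sigma value are assumptions made for illustration. The "measurement" is the pole's total distance to the old lotline points, and each candidate particle is weighted by how closely its total distance to the parcel points matches it.

```python
import math
import random


def total_distance(point, references):
    """Sum of straight-line distances from `point` to every reference point."""
    return sum(math.dist(point, ref) for ref in references)


def weigh_particles(particles, parcel_points, observed_total, sigma=10.0):
    """Weight each candidate by how well its total distance to the parcel
    points matches the total the pole had against the old lotline points."""
    weights = []
    for particle in particles:
        error = total_distance(particle, parcel_points) - observed_total
        weights.append(math.exp(-(error ** 2) / (2 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every candidate is hopeless: fall back to uniform weights.
        return [1.0 / len(particles)] * len(particles)
    return [w / total for w in weights]


def resample(particles, weights, rng=random):
    """Draw a new particle set, favouring the better-scoring candidates."""
    return rng.choices(particles, weights=weights, k=len(particles))


# Hypothetical data: the parcels are the lotlines shifted by (5, -4).
lotline_points = [(0, 0), (10, 0), (10, 10), (0, 10)]
parcel_points = [(x + 5, y - 4) for x, y in lotline_points]
pole = (2, 3)
observed = total_distance(pole, lotline_points)
candidates = [(7, -1), (20, 20), (0, 0)]
print(weigh_particles(candidates, parcel_points, observed))
```

The weakness is visible in the score itself: many different locations share the same total distance, so the measurement cannot separate them, and noisy data makes it worse.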
A power company needs to know which transformers service which customers - for load calculations, outage calls, etc. The data looks like this:
Simple enough - this can be done with minimal manipulation of the Utility Network tools in ArcGIS. Put a flag down on the transformer, then Trace Downstream. But there are no geoprocessing tools that help you do this if you have 20,000 transformers. To do so, you would have to dive into ArcObjects and instantiate half a dozen classes (ITopologyGraph, etc) and implement a graph search.
Thankfully, networkx and nx_spatial make this task completely trivial (though my iteration through the connected components could probably stand some improvement, it works pretty fast).
And you don't need any ESRI software to do this. Switch to read_shp, pass it in a folder full of shapefiles, and it uses ogr.
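For anyone curious what the graph search amounts to, here is a dependency-free Python sketch of the idea. It is not the script described above: the edge list, the node ids, and the function names are made up for illustration. With networkx, the grouping step is what connected_components gives you.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Yield sets of node ids that are linked together by the given edges."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    for start in list(adjacency):
        if start in seen:
            continue
        # Breadth-first search outward from an unvisited node.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        yield component


def customers_by_transformer(edges, transformers, customers):
    """Map each transformer to the customers in its connected component."""
    served = {}
    for component in connected_components(edges):
        for transformer in component & transformers:
            served[transformer] = sorted(component & customers)
    return served


# Hypothetical secondary network: T* are transformers, C* are customers.
edges = [("T1", "a"), ("a", "C1"), ("a", "C2"), ("T2", "C3")]
print(customers_by_transformer(edges, {"T1", "T2"}, {"C1", "C2", "C3"}))
```

This assumes each connected piece of the secondary network holds a single transformer. If two transformers share a component, both get the same customer list, which is worth flagging as a data problem.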
Stanford has an Artificial Intelligence (AI) class, and until late last year you would need to be prepared to pay tens of thousands of dollars in tuition to enjoy it. But times have changed. Education is going to be the next big thing that will face mass disruption from the ever wider scope of the internet. See Khan Academy, Academic Earth, and MITx.
More than a hundred thousand people took the class, and I was among them (I got a B overall).
Why?
Not because I see myself going into AI research. Rather, the interest is in general problem solving. Already I've started exploring applications for some of what I have learned for a couple of the problems I have run up against at work:
Incomplete or inaccurate data - hidden Markov models and/or particle filters
Power outage detection and recovery - search and Bayesian networks
Abnormal conditions and auditing - clustering and regression
I'm excited to see how my attempts turn out and look forward to seeing the effect of the other 150,000+ people that completed the course. That is a gigantic number of people who will be trying to apply better and faster techniques in all sorts of fields. In terms of raw social good, I'll bet this one experiment exceeds the impact of all of the classes these professors (Sebastian Thrun and Peter Norvig) have taught before.
I encourage everyone to go take a look at the courses being offered soon from Stanford, MIT, and a bunch of other places. I am going to take the Machine Learning and Algorithm courses next.
The MapBox/DevelopmentSeed team has created one of the last pieces really needed for mainstream open source GIS to gain really massive appeal.
TileMill is used for making web maps - or more specifically - for generating tiles that make up the now-ubiquitous slippy maps we see online.
There are other desktop applications that do this, the most notable being ArcGIS Desktop. But Desktop was built for other things first: advanced analysis tools, some pretty powerful editing capabilities, and authoring paper maps.
TileMill does one thing and it does it well. It costs nothing (compared to several thousand for some flavor of ArcMap), and outputs an open tile format that you can wire up to a webmap or iPad in less time than it takes to install ArcMap.
And it is smooth. The user experience is the best I have had with a desktop application in a long while.
It also has sane, plaintext css-like styling (MSS). This may sound like a no-brainer, but your options before this were basically some proprietary binary format from ESRI (not extensible, difficult to automate, limiting, vendor specific) or SLD, which is an open standard but widely regarded as something of a mess for other reasons.
There is also the training issue. ArcMap is giant and powerful - and extremely complex. The market for "GIS Analysts" is still strong in large part because of this complexity. Less experienced users will find TileMill easier to pick up and web designers (of which there is a large pool of talent) will find it very easy.
It is out for every operating system of note.
Seriously, go give it a try.
What else is needed
Conversion - minimal, well documented, mostly automated steps from ESRI -> TileMill/FOSS
Samples like crazy - more or less emulate the ESRI samples, complete with documentation
Recently went to an HTML5 Hackathon at Google Kirkland. My group's project was an in-browser IDE Chrome extension that zipped up a user-provided series of HTML/CSS/JS files into a package that could be uploaded to the Chrome Store. Issac Lewis came up with the idea after trying to develop chrome extensions on his chromebook and finding it basically impossible to do. Storing the files was a perfect use case for the FileSystem API, but I spent most of my time beating my head against the wall to get it working. Here are some of the things I wish I knew going in.
The FileSystem API is not LocalStorage.
LocalStorage is a key-value store; the FileSystem API really is an entire virtual file system, sandboxed on a user's local file system. You write, read, and create files asynchronously. It's also currently implemented only in Chrome. The documentation says 9+, but I hit errors until I switched from Chromium 12 to Chrome 13.
There's no limit to the storage, currently.
Hell yeah, cache all your map data on the user's local file system without needing an explicit download or local client built for it. That's a big deal for conditions or places with little to no connectivity. Also a big deal for massive games with a ton of art assets. They go through some good use cases here.
Debugging is a pain.
You will hit the dreaded SECURITY_ERR or QUOTA_EXCEEDED_ERR at some point, and it will be because debugging locally (file://) doesn't work well in my experience. The documentation suggests it's possible by opening Chrome with the --unlimited-quota-for-files and --allow-file-access-from-files flags, but my problems were only resolved when I started debugging as an extension rather than as a local file.
You also need to be careful about the flux the API is in. Throwing around BlobBuilder() and other pieces of the newer APIs can throw errors that are difficult to track down. BlobBuilder didn't work for me; I needed window.WebKitBlobBuilder. That webkit prefixing shows up elsewhere as well (like window.webkitRequestFileSystem).
Feel no guilt in lifting gratuitously from the sample docs when starting out.
Async file access isn't really any weirder than any other browser async work, but there is some boilerplate code that is worth snapping up. Example:
//error handling
function errorHandler(e) {
  var msg = '';
  switch (e.code) {
    case FileError.QUOTA_EXCEEDED_ERR:
      msg = 'QUOTA_EXCEEDED_ERR';
      break;
    case FileError.NOT_FOUND_ERR:
      msg = 'NOT_FOUND_ERR';
      break;
    case FileError.SECURITY_ERR:
      msg = 'SECURITY_ERR';
      break;
    case FileError.INVALID_MODIFICATION_ERR:
      msg = 'INVALID_MODIFICATION_ERR';
      break;
    case FileError.INVALID_STATE_ERR:
      msg = 'INVALID_STATE_ERR';
      break;
    default:
      msg = 'Unknown Error';
      break;
  }
  console.log('Error: ' + msg);
}
//file system instantiation
window.requestFileSystem(window.PERSISTENT, 5*1024*1024 /*5MB*/, FSCreatedSuccess, errorHandler);
This kind of thing is okay starting out, but you'll want a lot more out of the error handling eventually. The message is fine, but the code tells you nothing about where the error occurred and in reference to what object or operation.
It's not CRUD, mostly.
Don't look for an explicit create method somewhere, the default is get or create via [filesystem_obj].[directory].get[Directory|File]. All reading, writing, and updating is probably going to live in a closure that starts with that first get.
Don't rush.
I made the mistake of looking at the limited time allocated and just throwing the example code in willy-nilly. This is not what you do with an unfamiliar and very new API. The typical help online is not there yet because the API hasn't been used in a widespread way, so throwing those error messages into Google is not going to help you (unless that is how you got to this page, naturally). Start with the example code, sure, but I would carefully read the entirety of the short intro before trying random things to get it to work.
By Kirk Van Gorkum I only caught the last half of this because I didn't know Kirk was doing this one and there was another good talk going on. User experience is everyone's job, and some GIS developers are behind their web developer cousins/alter-egos in understanding this.
By Mansour Raad There is a sick combination here. First, it is being done by Mansour Raad, who is easily the most entertaining ESRI presenter I found during the conference. Second, any portion of the title has some interesting stuff for just about everyone - building cross platform mobile apps (the Android and iOS bit), collaborative mobile applications, and apparently there are some people that really like Flex. After this presentation you can count me among them. Much of the content of this talk is from one of his blog posts, but you're cheating yourself if you don't give this a watch.
By Brian Noyle, Dave Bouwman, Mike Juniper Great stuff here on the state of HTML5 - in and out of the geo world - and some cool demos showing off applications working well on mobile/tablet/laptop devices with relatively little additional work (in these demos, a custom view engine for ASP MVC and Modernizr).
By Glenn Goodrich An important thing to keep in mind about every presentation you see is that the specific technology being demonstrated typically isn't that important in the long term - the field just evolves too rapidly. More important are the general techniques, thought processes, and tricks/hacks you can pick up from the presenter(s). Glenn's presentation is full of this stuff, even if the legend bit is more or less now done by later versions of ArcServer.
By Dave Bouwman and Mike Juniper Grassroots open source development can drive a lot of innovation on a platform (gems for Ruby, easy_install/pip apps for Python, etc), but it doesn't seem to be as common in closed-source commercial software - software improvements instead tend to come top-down. It is thus encouraging to see the DTS folks setting their sights on vector tile caching, which is kinda a big deal for those of us that want to do client side vector manipulation/analysis with big data sets (like say an entire electric system).
No doubt I missed a bunch with this list, so please let me know in the comments if you insist I see one of them.
It usually isn't worth having unit tests for every single function. No test coverage at all will cost you time, patience, and money. So what is the optimum pattern for writing tests?
I don't know. But I do know exactly when I have failed to write a test that needed to be there.
I've been writing a data access layer for some of the systems at work. It had become quickly apparent that a lot of the information that would make for good analysis/viewing in conjunction with the GIS resided in other databases, and that getting at the data in the fastest/easiest way (database level views and linked servers) is undesirable because:
Low level hard-coded links are a bad business, even with views to abstract the underlying data structures.
Linked servers like these are not supported by every kind of datastore.
Can't be usefully published to trusted third parties or the public.
Most of our data is stored in plain old relational databases (primarily MSSQL), so it should be relatively easy to create some simple data access libraries. After that I could expose the data however I'd like. The libraries return everything in common abstract types, so combining them is simple - or so I thought.
var a = from x in db1.GetStuff()
        let g = db2.GetStuff(x.fid)
        select new {
            //a bunch of shared attributes
        };
return a.Cast<SharedAbstractClass>();
I've been messing around in Python, Ruby, and JavaScript too much at home I suppose, because .NET won't let you get away with this even if all of the shared attributes make sense. By itself, a simple mistake like this doesn't impact anything. The error is obvious seconds after typing it and fixing it was pretty simple - just make a private concrete class that inherits from the abstract class and go on your way (maybe there is a more elegant solution, but this worked fine).
The problem became more apparent when I realized db1 was giving me back inconsistent fids. And sometimes records were wrong and db2 wouldn't return anything. The new private instance had some non-nullable types to assign to, so it would blow up if it didn't find g. I thought this was simple glue code, but it quickly spiraled into resolving underlying database issues. Databases that I didn't have access to fix myself.
How did I know when to test? When I started doing it the hard way - pressing F5 to debug over and over again when the errors occurred. I was doing what some tests should have been doing for me. Yes, I would have picked up on the problem if I rigidly practiced test-first development (and I do, when it is obvious I'll be dealing with behavior rather than simple data access), but 99% of the rest of the code I was using for this application was boilerplate and didn't suffer from the problems with db1 and db2.
I love doing side projects. There are many reasons:
No better way to learn some technology than to put it into practice.
Can actually turn into a real thing, even a day job (a good example is Instapaper, which is the "killer app" for ebook readers like the Kindle as far as I am concerned).
Good for showing off to current and future employers, particularly if it is semi-related to your current domain.
I had read quite a bit of gushing on Ruby on Rails from Dave Bouwman, and I had a stupid idea for a web app that probably already existed, so I thought "why not"? I searched around and couldn't find what I had pictured in my head.
This is what I wanted:
Simple two/three line webform where a user enters where they are and what home improvement project they are looking at. They get back a list of what their local utility company would pay for and what contractors are available for it.
The utility information was basically public and there are actually quite a few conservation programs - every kilowatt hour saved is less that needs to be generated by building new, usually dirty power plants.
Contractors could add themselves and maybe pay to show up first. I wired up a prototype last weekend and showed a coworker who deals with conservation credits like these.
Her: "I like it."
Me: "Yeah I was going for something simple and direct like Hipmunk"
Her: *looks at Hipmunk* "Yeah, you just need an adorable mascot like they do"
Me: "Exactly! How about a big cartoony green brontosaurus with a bulbous nose?"
The next day, of course, I see a link via Twitter to a website that not only does more or less exactly what I was planning, but even has the mascot I was thinking of; only orange instead of green. The mascot bit might have been subconscious, because I remember reading a blog post from the same site (via hackernews) without ever looking at their homepage.
I'm not an expert in web design (mostly backend stuff), but I think I would prefer the simple form rather than their current landing page. I think it would be more focused on responding to what a homeowner is looking for. The energy savings calculator though, is probably the best I've seen in terms of visual style.
I might keep going with this project. The fact I couldn't easily find it on my initial search (my conservation coworkers hadn't heard of it either), the landing page, and the low barrier to entry (what do I have to lose, exactly?) means I could build something pretty quick I think.
Maybe not. I have a lot of other ideas for side projects - an open source outage management system has been rattling in my head since I've had to deal with a wonky one at work. That would be a great opportunity to do some high scalability/performance stuff outside of work.