DUI-heatmap Finland 2012

A month ago I created a heatmap visualization of driving-under-the-influence (DUI) records in Finland in 2012. The story and visualization were fairly popular on svenska.yle.fi during the week of release, ranking as the most-read news article (see inrikes category artikel.har-aker-rattfylleristerna-fast-se-karta.sivu). It also drew a fair amount of feedback through the comments section about the usability of heatmap visualizations in general.

Heatmap of DUI incidents in the Helsinki area

Heatmaps are a bit tricky to use as visualizations; an XKCD comic has tackled the essence of the problem. If a map correlates heavily with the population distribution, it may only tell the viewer what they already expect: things happen where there are a lot of people. When the visualization is dynamic, you also get details on demand. Zooming in reveals more detail and zooming out hides it, which can mean that at a certain distance all you see is a huge red blob on top of the map.

The data for the map was acquired by a journalist via an information request to the police. We received it as an Excel file with 20,352 lines of raw data on DUI incidents in Finland. The data included an identification number for the police records, the municipal area and a mysterious area code, the street address of the incident, and the weekday, date and time of the incident. As usual the data was somewhat 'dirty' and required a thorough clean-up to fit our needs.

Raw data in Excel

Data itself may reveal patterns and information about the measurement and how it was collected. In our case we noticed that the naming convention for incident addresses is fairly imaginative. There is an address field that police officers fill in, among other things, when they report a DUI incident into the system. This field should contain the address of the incident so it can be pinpointed later if necessary. However, there doesn't seem to be a unified practice for reporting the addresses.

The address field is filled in many different ways, which makes exact pinpointing of an incident a fairly difficult task later on. For example, an incident at the crossroad of two roads might be marked as road1 X road2, or road1, road2 crossroad, or road2xroad1, or road1 at crossroad road2, and so on. Sometimes the address field has been used to give extra information about the place of the incident instead of just an address, e.g. Address xyz, from the yard.
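Variants like these can be collapsed into one canonical key with a few regular expressions so that all spellings of the same crossroad count as one location. The sketch below is purely illustrative (the function name and patterns are my own, not the cleanup actually used for the article):

```javascript
// Collapse the crossroad spelling variants described above into one
// canonical key. Hypothetical helper, not the article's actual cleanup.
function normalizeCrossroad(raw) {
  const s = raw.trim().toLowerCase();
  const patterns = [
    /^(.+?)\s+x\s+(.+)$/,               // "road1 X road2"
    /^(.+?),\s*(.+?)\s+crossroad$/,     // "road1, road2 crossroad"
    /^(.+?)\s+at\s+crossroad\s+(.+)$/,  // "road1 at crossroad road2"
    /^(\S+)x(\S+)$/,                    // "road2xroad1" (heuristic: a street
                                        // name containing 'x' would misfire)
  ];
  for (const p of patterns) {
    const m = s.match(p);
    if (m) {
      // Sort the two street names so both orderings map to the same key.
      return [m[1].trim(), m[2].trim()].sort().join(' x ');
    }
  }
  return s; // plain street address: keep as-is
}
```

With this, 'Hämeenkatu X Aleksanterinkatu', 'Aleksanterinkatu, Hämeenkatu crossroad' and 'hämeenkatuxaleksanterinkatu' all collapse to the same key.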

My personal favourite entry in the address field was 'Kylän Kohdalla Jäällä' = 'at the region of the village, on the ice (of a lake)', which couldn't be more vague in terms of locating the exact spot.

Sure, all these addresses could probably be pinpointed by someone with knowledge of the local surroundings, but for outsiders the location stays a mystery. For a future implementation of such a system I'd highly recommend adding a field for longitude and latitude coordinates. But enough ranting; back to the topic.

The dataset is fairly large (20k+ addresses), and if one wants an overview of geographically distributed data, one approach is to plot the incidents onto a map. But what tool to use?

I personally prefer to use existing methods and tools to visualize and gain insight into data, and after googling for a while I bumped into a JavaScript heatmap library called Heatmap.js. So it was chosen as the technology for implementing the test case, and Tampere was picked as the test city.

The next task was to figure out how to translate the data from Excel into a format that Heatmap.js accepts. Heatmap.js takes a list of longitude and latitude coordinates, each with a weight telling how many incidents fall on those coordinates. The weight can be calculated in Excel by counting the occurrences of each address, but the addresses themselves need to be converted into coordinates. This conversion of an address into longitude and latitude is usually referred to as geocoding.
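The counting step described above (which I did in Excel) amounts to tallying how many incidents share each address before geocoding. A minimal sketch, assuming a hypothetical row format with an `address` field:

```javascript
// Tally incidents per address to get the per-location weight.
// Row shape ({address: ...}) is an assumption for illustration.
function countByAddress(rows) {
  const counts = new Map();
  for (const row of rows) {
    const key = row.address.trim().toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // One {address, weight} record per unique address.
  return Array.from(counts, ([address, weight]) => ({ address, weight }));
}
```

Each unique address then needs to be geocoded only once, which matters when there are 20k+ rows but far fewer distinct locations.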

Geocoding multiple addresses through the Yahoo API

There exists a fair number of web-based tools for quickly geocoding a single address into GPS coordinates. I used http://www.gpsvisualizer.com/geocoder/. It's probably not the best, but it gets the job done. Its downside lies in geocoding multiple addresses: gpsvisualizer.com offers the option of using the Yahoo geocoder for batches of addresses, but sadly it's fairly inaccurate in Finland. For example, when I tried to geocode street addresses from Tampere, I ended up with a wad of near-identical coordinates.

It is, however, possible to geocode a single address at a time with the Google geocoder at gpsvisualizer.com, which gives much more accurate readings but, as you would guess, is also a lot slower. With a deadline breathing down my neck and 20k+ addresses to geocode, I gave it a shot. In my case that meant roughly a week of typing and copy-pasting, so clearly a slow method.

Geocoding a single address through the Google API

A slightly better way is to use the Google Geocoding API and build your own geocoder. This is the method I would use now if I had to do another address-based visualization. But at the time, the deadline was looming and I was busy pushing the visualization forward, so I accepted the fact that I'd be spending a considerable amount of time smashing the same button combination over and over.

Geocoding multiple addresses through the Google API
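A do-it-yourself batch geocoder along those lines could look roughly like this. The endpoint and the `{status, results[].geometry.location}` response shape are Google's Geocoding API; the helper names, the 200 ms delay, and the reduced error handling are my own assumptions (note that nowadays the API also requires a key, which it did not back then):

```javascript
// Sketch of a batch geocoder against the Google Geocoding API.
// Helper names and pacing are illustrative, not production-ready.
const ENDPOINT = 'https://maps.googleapis.com/maps/api/geocode/json';

function geocodeUrl(address, region, key) {
  return ENDPOINT + '?address=' + encodeURIComponent(address) +
         '&region=' + region +   // e.g. 'fi' to bias results toward Finland
         '&key=' + key;          // API key, required by the current API
}

// Pull lat/lng out of a Geocoding API response body.
function parseGeocodeResponse(body) {
  if (body.status !== 'OK' || !body.results.length) return null;
  const loc = body.results[0].geometry.location;
  return { lat: loc.lat, lng: loc.lng };
}

// Geocode a list of addresses one by one, pausing between requests
// to stay under the rate limit.
async function geocodeAll(addresses, region, key) {
  const out = [];
  for (const address of addresses) {
    const res = await fetch(geocodeUrl(address, region, key));
    out.push({ address, ...parseGeocodeResponse(await res.json()) });
    await new Promise(resolve => setTimeout(resolve, 200));
  }
  return out;
}
```

Even at a throttled pace, a script like this would churn through 20k addresses in a couple of hours rather than a week of copy-pasting.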

Once the data was in an appropriate format for the visualization tool, it was a fairly straightforward job to fine-tune it and release it.
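The last transform is shaping the geocoded, weighted records into the `{max, data}` object that Heatmap.js-style map overlays consume via `setData()`. Exact option names vary between Heatmap.js versions and map-overlay plugins, so treat this as a sketch:

```javascript
// Shape geocoded points into the {max, data} payload used by
// heatmap.js map overlays. Field names follow the common
// {lat, lng, count} convention; verify against your plugin version.
function toHeatmapData(points) {
  return {
    max: Math.max(...points.map(p => p.count)), // hottest spot sets the scale
    data: points.map(p => ({ lat: p.lat, lng: p.lng, count: p.count })),
  };
}
```

The `max` value matters: it sets the top of the colour scale, so one extreme outlier can wash out the rest of the map, which ties back to the red-blob problem mentioned earlier.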

What I learned from the project, or at least had reconfirmed, is that 80 percent of the time goes into cleaning, transforming and fiddling with the data to get it into a presentable format, and the remaining 20 percent into fine-tuning and adding extra functionality. An interesting dataset and a fun project; and when the data is interesting, it seems to translate fairly well into 'news'.

Who is Jarno?

Greetings everyone!

My name is Jarno Marttila and I am the new 'Teemo', or, in case you don't yet know Teemo: I'm the new data journalist for Yle Svenska. I joined the merry band of YLE just a few weeks ago, in mid-January.

Well, who am I? I guess I'm a lot of things, or maybe one could even say, in Finnish, a 'jokapaikan höylä' (a jack of all trades) when it comes to data and information analysis and visualization. I'm a 28-year-old Diploma Engineer with a major in Hypermedia, though I have studied a little bit of this and a little bit of that along the way.

For the past three years I have worked as a researcher at the Intelligent Information Systems Laboratory, previously known as the Hypermedia Laboratory, at Tampere University of Technology, on tasks involving all kinds of cool things, including but not limited to social network analysis, information visualization and web development. My tools of choice for graph and network visualization have been Gephi, Gource, and JavaScript libraries such as d3.js, JIT and Highcharts. In web development I've mainly dealt with Drupal.

In my Master's thesis I studied data-driven social network analysis in the context of the Children's Parliament of Finland. Lately I've been into information visualization techniques and methods for creating insight into complex datasets. At TUT I've done many projects related to gaining and communicating insight into different kinds of data, whether studying the impact of a government official in social media or mapping service potential for customers in heavy industry.

Cliques, networks, outliers, factors and facts that explain why data is what it is, what connects to what, and why things are the way they are excite me. Hence the jump into telling stories with information visualization and data analysis in the context of news, or in grander terms data journalism, was a natural leap of faith for me.

At YLE I wish to create interesting and important data-journalistic stories for people to consume; there's almost nothing more intriguing than finding stories in data and implementing them so that they speak to readers.

You can also find me at: