Saturday, October 26, 2013

Using EUROSTAT's data with D3.js

EUROSTAT is the European Union office for statistics. Its goal is to provide statistics at European level that enable comparisons between regions and countries.

The NUTS classification (Nomenclature of territorial units for statistics) is a hierarchical system for dividing up the economic territory of the EU. Many of the EUROSTAT data is relative to these regions, and this data is a good candidate to be mapped.

The problem is that EUROSTAT provides some shapefiles in one hand, but without the actual names for the regions, nor other information, and some tables  with the name and population for every region. This data has to be joined to be able to map properly.

In this example, he EUROSTAT's criminality statistics are mapped using these provided files.

You can see the working example at bl.ocks.org, and get all the source code at GitHub.
There is also a second example to see how to change the data easily.

Organizing the data

At their site, EUROSTAT provides different Shapefiles for different scales. I think that the best option for maps of all Europe is to use NUTS_2010_10M_SH.zip. Using the 1:60 million scale could be an option, but the file doesn't contain the NUTS3 polygons, so only data at country or big region level is possible.

After downloading the polygons, you will see that no names or population are inside the fields, only the NUTS code. So to make a map, we have to label every polygon. To do it, I downloaded  the NUTS2010.zip file, which contains a csv file with the names for every NUTS region. At the web, it's also possible to find the population in a file, looking for the text Annual average population (1 000) by sex - NUTS 3 regions. Then, I made the following script:

'''
Joins a shp file from EUROSTAT with the xls with the region names so
the resulting shp file has the name of every region and the code.
Roger Veciana, oct 2013
'''
from osgeo import ogr
from os.path import exists
from os.path import basename
from os.path import splitext
from os import remove

def create_shp(in_shp, out_shp, csv_data):
    print "Extracting data"
    
    in_ds = ogr.Open( in_shp )
    if in_ds is None:
        print "Open failed.\n"
        sys.exit( 1 )
    in_lyr = in_ds.GetLayerByName( splitext(basename(in_shp))[0] )
    if in_lyr is None:
        print "Error opening layer"
        sys.exit( 1 )

    if exists(out_shp):
        remove(out_shp)
    driver_name = "ESRI Shapefile"
    drv = ogr.GetDriverByName( driver_name )
    if drv is None:
        print "%s driver not available.\n" % driver_name
        sys.exit( 1 )
    out_ds = drv.CreateDataSource( out_shp )
    if out_ds is None:
        print "Creation of output file failed.\n"
        sys.exit( 1 )
    proj = in_lyr.GetSpatialRef()
    ##Creating the layer with its fields
    out_lyr = out_ds.CreateLayer( 
        splitext(basename(out_shp))[0], proj, ogr.wkbMultiPolygon )
    lyr_def = in_lyr.GetLayerDefn ()
    for i in range(lyr_def.GetFieldCount()):
        out_lyr.CreateField ( lyr_def.GetFieldDefn(i) )

    field_defn = ogr.FieldDefn( "NAME", ogr.OFTString )
    out_lyr.CreateField ( field_defn )

    field_defn = ogr.FieldDefn( "COUNTRY", ogr.OFTString )
    out_lyr.CreateField ( field_defn )

    field_defn = ogr.FieldDefn( "COUNTRY_CO", ogr.OFTString )
    out_lyr.CreateField ( field_defn )

    field_defn = ogr.FieldDefn( "POPULATION", ogr.OFTInteger )
    out_lyr.CreateField ( field_defn )
    

    ##Writing all the features
    in_lyr.ResetReading()
    for feat_in in in_lyr:
        value = feat_in.GetFieldAsString(feat_in.GetFieldIndex('NUTS_ID'))
        if value in csv_data:
            #print csv_data[value]
            feat_out = ogr.Feature( out_lyr.GetLayerDefn())
            feat_out.SetField( 'NUTS_ID', value )
            feat_out.SetField( 'NAME', csv_data[value]['label'] )
            feat_out.SetField( 'POPULATION', csv_data[value]['population'] )
            feat_out.SetField( 'COUNTRY_CO', csv_data[value]['id_country'] )
            feat_out.SetField( 'COUNTRY', csv_data[csv_data[value]['id_country']]['label'] )
            feat_out.SetField( 'STAT_LEVL_', csv_data[value]['level'] )
            feat_out.SetField( 'SHAPE_AREA', feat_in.GetFieldAsString(feat_in.GetFieldIndex('SHAPE_AREA')) )
            feat_out.SetField( 'SHAPE_LEN', feat_in.GetFieldAsString(feat_in.GetFieldIndex('SHAPE_LEN')) )

            feat_out.SetGeometry(feat_in.GetGeometryRef())
            out_lyr.CreateFeature(feat_out)
    in_ds = None
    out_ds = None   

def read_population(csv_file):
    '''
    Reads the NUTS csv population file and returns the data in a dict
    '''
    csv_data = {}
    f = open(csv_file, "r")
    f.readline(); #Skip header
    for line in f:
        line_data = line.split(";")
        try:
            csv_data[line_data[0]] = int(float(line_data[4]) * 1000)
        except Exception, e:
            csv_data[line_data[0]] = -9999
    f.close

    return csv_data
def read_names(csv_file, population_data):
    '''
    Reads a NUTS csv file and returns the data in a dict
    '''
    csv_data = {}
    f = open(csv_file, "r")
    f.readline(); #Skip header
    for line in f:
        line_data = line.split(";")
        csv_data[line_data[0]] = {'label': line_data[1],
            'level': line_data[2],
            'id_country': line_data[3],
            'code_country': line_data[4],
            'population': 0}
        if line_data[0] in population_data:
            csv_data[line_data[0]]['population'] = population_data[line_data[0]]

    f.close

    return csv_data

if __name__ == '__main__':
    population_data = read_population('demo_r_d3avg_1_Data.csv')
    csv_data = read_names('NUTS_2010.csv', population_data)
    create_shp('NUTS_2010_10M_SH/Data/NUTS_RG_10M_2010.shp', 'regions.shp', csv_data)
  • Basically, we iterate for every polygon in the shp file, and look into the csv data for the attributes to add. 
  • Population data and region name are in tow separate files. To read them, there are two functions: read_population and read_names.
  • create_shp function reads the geometries, joins them with the population and name, and finally writes the output file.
  • Notice that NUTS regions overlap. There are levels 1, 2 and 3. The first one is at country level, the second takes big areas from the countries, and the third one are the actual departments in every country. To map any data is necessary to choose the level at which the map has to be represented, or mix the different levels if it's necessary, as in this case.
To create a map, I chose the criminality data. The file is again at EUROSTAT, under the text Crimes recorded by the police - NUTS 3 regions. Despite its name, some regions are NUTS3, and some NUTS2. Most of  the files from the site labeled with NUTS regions should work with only little changes in the map code.

Creating a D3.js map to show the data

To draw the data in a map I've used a similar solution as the one in the post D3js Electoral Map:

Crimes recorded by the police - NUTS 3 regions


Crimes recorded by the police - NUTS 3 regions

  • Lines 5 to 18 give style to the map background and the tooltip
  • Lies 29 to 43 set up the map, with its size, projection and svg element
  • Line 45 sets the color scale (this is different to the electoral map example). The colors will vary from blue to red. Discrete colors are an other option. Just changing this line, you can change the whole map colors.
  • Lines 47 to 53 read and parse the csv with the criminality data into  an object (we will only use 2010 data)
  • Line 55 downloads the TopoJSON containing the NUTS regions
  • Then lines 56 to 78, the map is actually drawn. Note
    • Line 60 filters only the regions that have data in the CSV file. For some countries, this is the NUTS2 region (see Italy) and for others, NUTS3 (like Germany). The following lines are only run if there is data (see D3js filter)
    • Line 66 sets the color of the region. Note that calls the scale defined at line 45, and calculates the number by multiplying the crimes by 100000 and dividing by the region population to get the density
    • Line 71 sets the region opacity to 0 if there is no data for it. With the filter could be avoided
    • Lines 78 and 81 set the tooltip box, taken from this example.
  • Line 98 calls the function that draws the legend. The function is in a separate file to make it easy to reuse
 
var legend = function(accessor){
    return function(selection){
      
      //Draw the legend for the map
      var legend = selection.append("g");
      var legendText = accessor.title;
      var numSquares = accessor.elements;
      var xLegend = 0;
      var yLegend = 0;
      var widthLegend = 400;
      
      var title_g = legend.append("g");
      var values_g = legend.append("g");
      
      var legendTitle = title_g.append("text")
        .text("Legend")
        .attr("font-family", "sans-serif")
        .attr("font-size", "18px")
        .attr("fill", "black");
        var bbox = legendTitle.node().getBBox();
        legendTitle.attr("x", xLegend + widthLegend/2 - bbox.width/2);
        legendTitle.attr("y", yLegend + 20);
        
      var legendTextEl = title_g.append("text")
        .text(legendText)
        .attr("y", yLegend + 75)
        .attr("font-family", "sans-serif")
        .attr("font-size", "14px")
        .attr("fill", "black");
      var bbox = legendTextEl.node().getBBox();
      legendTextEl.attr("x", xLegend + widthLegend/2 - bbox.width/2);

      for (i=0; i < numSquares; i++){
        values_g.append("rect")
          .attr("x", xLegend + (i+0.1*i/numSquares)*(widthLegend/numSquares))
          .attr("y", yLegend + 25)
          .attr("width", (widthLegend/numSquares)*0.9)
          .attr("height", 20)
          .attr("fill", color(accessor.domain[0] + i * accessor.domain[1]/(numSquares-1)));
        var value_text = values_g.append("text")
          .text(accessor.domain[0] + i * accessor.domain[1]/(numSquares-1))
          .attr("y", yLegend + 55)
          .attr("font-family", "sans-serif")
          .attr("font-size", "12px")
          .attr("fill", "black");
        var bbox = value_text.node().getBBox();
        value_text
          .attr("x", xLegend + (i+0.1*i/numSquares)*(widthLegend/numSquares) + (widthLegend/numSquares)*(0.9/2)- bbox.width/2);

      }
      var scale = accessor.width / 400;
      legend.attr("transform","translate("+accessor.posx+","+accessor.posy+") scale("+scale+") ");

};
};

  •  Note how the function is called. The argument accessor is the one thas have the object properties (if called from a geometry, for instance), and selection is the node where the function is called from.
  • The code is not very D3jsnic, using things as a for loop but it works and can be reused for drawing other legends from a D3js scale.

4 comments:

  1. Hi Roger,
    Have you considered James Fee over at planetgs.com to link to your very useful blog?
    JAMES.FEE@GMAIL.COM

    Karl Zinglersen

    ReplyDelete
  2. Hi Roger,
    Thanks for sharing your methodology. In order to make web maps based on NUTS regions, you could also consider using this: https://github.com/jgaffuri/Nuts2json
    DB

    ReplyDelete
    Replies
    1. Thank you for the link, I think it's much easier with it.

      Delete