Geospatial Analysis of NYC Crime Data with GridDB

GridDB Community Edition 4.1 and newer feature a Geometry data type that allows developers to combine both Time Series and Geospatial analysis in their applications. We’ve covered GridDB’s Geometry features before in our posts on Using Geometry Values with GridDB and Geometry Data Applications.

This blog post delves into practical use of the Geometry data type by looking at New York City’s historical crime complaint data. This data provides Latitude and Longitude of reported crimes so we will be using GridDB to see in what area a complaint occurred in. After demonstrating how to ingest the crime data from NYC Open Data, we’ll look at seeing how the amount of crime differs month to month in Central Park. We will also load external polygon data by seeing complaint counts in individual precincts.

The primary purpose of use of GridDB’s Geometry data type with Geospatial data points is being able to search where points, polylines (paths), or polygons (areas) intersect. Points, polylines, and polygons are defined using Well-known-text (WKT) which is a markup language that defines vector geometry objects on a map.

Ingest

We fetched the historic crime data from NYC Open Data. The following table shows both the CSV data fields and the GridDB schema.

CSV Values GridDB Schema
  • CMPLNT_NUM (Unique ID)
  • CMPLNT_FR_DT (Complaint Date)
  • CMPLNT_FR_TM (Complaint Time)
  • CMPLNT_TO_DT
  • CMPLNT_TO_TM
  • ADDR_PCT_CD
  • RPT_DT
  • KY_CD
  • OFNS_DESC
  • PD_CD
  • PD_DESC
  • CRM_ATPT_CPTD_CD
  • LAW_CAT_CD
  • BORO_NM
  • LOC_OF_OCCUR_DESC
  • PREM_TYP_DESC
  • JURIS_DESC
  • JURISDICTION_CODE
  • PARKS_NM
  • HADEVELOPT
  • HOUSING_PSA
  • X_COORD_CD
  • Y_COORD_CD
  • SUSP_AGE_GROUP
  • SUSP_RACE
  • SUSP_SEX
  • TRANSIT_DISTRICT
  • Latitude (Floating Point Latitude)
  • Longitude (Floating Point Longitude)
  • Lat_Lon (Lat Lon WKT)
  • PATROL_BORO
  • STATION_NAME
  • VIC_AGE_GROUP
  • VIC_RACE
  • VIC_SEX
public class Complaint {
    int CMPLNT_NUM;
    Date CMPLNT_FR_DT;
    Date CMPLNT_TO_DT;
    int ADDR_PCT_CD;
    Date RPT_DT;
    int KY_CD;
    String OFNS_DESC;
    int PD_CD;
    String PD_DESC;
    String CRM_ATPT_CPTD_CD;
    String LAW_CAT_CD;
    String BORO_NM;
    String LOC_OF_OCCUR_DESC;
    String PREM_TYP_DESC;
    String JURIS_DESC;
    int JURISDICTION_CODE;
    String PARKS_NM;
    String HADEVELOPT;
    String HOUSING_PSA;
    int X_COORD_CD;
    int Y_COORD_CD;
    String SUSP_AGE_GROUP;
    String SUSP_RACE;
    String SUSP_SEX;
    int TRANSIT_DISTRICT;
    float Latitude;
    float Longitude;
    Geometry Lat_Lon;
    String PATROL_BORO;
    String STATION_NAME;
    String VIC_AGE_GROUP;
    String VIC_RACE;
    String VIC_SEX;
}

For simplicity sake, we’re only using one container instead of splitting data into multiple containers.

Parsing the CSV easy with the CSVParser Library:

   Iterable records = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in);
   for (CSVRecord record : records) {
        Complaint c = parseCsvRecord(record);
        if(c != null)
            col.put(c);
   }
   col.commit();

There are a few changes to the data that need to be made within the parseCsvRecord function. First, the complaint time is a non-standard but easily parse-able format of MM/DD/YYYY and HH:MM:SS.

String dt[] = r.get("CMPLNT_FR_DT").split("/");
String tm[] = r.get("CMPLNT_FR_TM").split(":");
c.CMPLNT_FR_DT = new Date(Integer.parseInt(dt[2])-1900, Integer.parseInt(dt[0])-1, Integer.parseInt(dt[1]), Integer.parseInt(tm[0]), Integer.parseInt(tm[1]), Integer.parseInt(tm[2]));

While the raw CSV contains WKT text for the point where the crime occurred in “lat lon” but the accepted WKT format is POINT(x y”) and Latitude is denotes Y-axis and Longitude the X-axis so we flip them.

c.Lat_Lon =   Geometry.valueOf("POINT("+c.Longitude+" "+c.Latitude+")");

Crime by Precinct

NYC Open Data also provides WKT polygons of individual police precincts available here https://data.cityofnewyork.us/Public-Safety/Police-Precincts/78dh-3ptz
. Like the complaint data, it’s easily loaded with CSVParser but since each precinct may consist of multiple polygons and uses the WKT MULTIPOLYGON type some further processing is required to split the MULTIPOLYGON into simple POLYGONs.


String polys[] = record.get("the_geom").split("\\),");
int count=0;
for(int i=0; i < polys.length; i++) {
    String subpoly = polys[i].replace("MULTIPOLYGON (", "").replace(")))", ")");
    query = col.query("select * where ST_MBRIntersects(Lat_Lon, ST_GeomFromText('POLYGON"+subpoly+")') )");
    rs = query.fetch(false);
    count =+ rs.size();
}

The results are as follows:

Precinct 1: 243         Precinct 52: 888
Precinct 5: 177         Precinct 60: 185
Precinct 6: 216         Precinct 61: n/a
Precinct 71: 227        Precinct 62: 210
Precinct 72: 262        Precinct 63: 324
Precinct 7: 132         Precinct 66: 261
Precinct 9: 233         Precinct 68: 233
Precinct 22: 345        Precinct 69: 220
Precinct 10: 203        Precinct 70: 369
Precinct 13: 400        Precinct 76: 135
Precinct 14: 428        Precinct 77: 334
Precinct 17: 174        Precinct 78: 211
Precinct 20: 132        Precinct 81: 12
Precinct 18: 379        Precinct 83: 498
Precinct 19: 225        Precinct 84: 175
Precinct 23: 225        Precinct 88: 174
Precinct 24: 147        Precinct 90: 290
Precinct 25: 336        Precinct 94: 102
Precinct 79: 266        Precinct 100: 8
Precinct 26: 217        Precinct 101: 0
Precinct 28: 213        Precinct 102: 283
Precinct 30: 206        Precinct 103: 367
Precinct 32: 361        Precinct 104: 511
Precinct 73: 410        Precinct 105: 481
Precinct 33: 152        Precinct 106: 227
Precinct 34: 224        Precinct 107: 294
Precinct 75: 529        Precinct 108: 262
Precinct 40: 444        Precinct 109: 299
Precinct 41: 304        Precinct 110: 431
Precinct 42: 487        Precinct 111: 138
Precinct 43: 408        Precinct 112: 178
Precinct 48: 495        Precinct 113: n/a
Precinct 44: 559        Precinct 114: 28
Precinct 45: 323        Precinct 115: 246
Precinct 46: 400        Precinct 120: 228
Precinct 47: 441        Precinct 121: 194
Precinct 49: 267        Precinct 122: 217
Precinct 50: 219        Precinct 123: 83
Precinct 67: 504

Central Park

Our first geospatial analysis is looking at crime in Central Park month to month too see if crime complaints increase or decrease with the temperature. As Central Park is not aligned to the North-South axis, you can't simply use a bounding box (where lat > min && lat < max && lon > min && lon < max) as you would rudimentary geospatial analysis queries. Instead we can build a polygon of each corner of Central Park and query crime complaints that intersect with that polygon. To find the points of Central Park and build the WKT object a useful tool is Wicket.

Using Wicket To Build Geospatial Analysis Queries

With that, we can build the TQL query

String CentralParkWKT = "POLYGON((-73.97308900174315 40.764422448981996,-73.98192956265623 40.76812781417226,-73.9584064734938 40.80087951931638,-73.94982340464614 40.797240957024385,-73.97308900174315 40.764422448981996))";

for(int month=0; month <= 11; month++) {
    int count=0;
    for (int year=108; year <= 118; year++) {
        Date start = new Date(year, month, 1);
        Date end = new Date(year, month+1, 1);

	Query query = col.query("select * where ST_MBRIntersects(Lat_Lon, ST_GeomFromText('"+CentralParkWKT+"')) and CMPLNT_FR_DT >= TO_TIMESTAMP_MS("+start.getTime()+")  and CMPLNT_FR_DT < TO_TIMESTAMP_MS("+ end.getTime()+") ");
        RowSet rs = query.fetch(false);
        count += rs.size(); 
    }
    System.out.println(month+": "+count);
}

Now to answer our initial question, does the amount of crime change as the temperature changes?

January: 30
February: 23
March: 36
April: 33
May: 29
June: 19
July: 26
August: 49
September: 25
October: 23
November: 26
December: 18

The December and August totals would lend support to the argument, but January and June disagree thus the results are inconclusive.

If you're interested in exploring the data further or having a look at the complete code, it is available here.