We recommend that users install
sf before proceeding with the installation of
stlcsb. Windows users should be able to install
sf without significant issues, but macOS and Linux users will need to install several open source spatial libraries to get
sf itself up and running. The easiest approach for macOS users is to install the GDAL 2.0 Complete framework from Kyng Chaos.
For Linux users, steps will vary based on the flavor being used. Our configuration file for Travis CI and its associated bash script should be useful in determining the necessary components to install. Linux users will need to make sure the dependency
lwgeom is also installed correctly.
sf is installed, the easiest way to get
stlcsb is to install it from CRAN:
Alternatively, the development version of
stlcsb can be accessed from GitHub with
csb_to see all of the functions available in this package, or
cat_to see all of the category data objects.
Before using the other functions, you will need to acquire the CSB data. The
csb_get_data function will download the most recent version of the data, and return it as a single tibble. By default, variable names are clean-up automatically and two variables that largely containing
NA values are removed (
PROBZIP). Variables are also reordered in a more logical fashion.
> csb <- csb_get_data() trying URL 'https://www.stlouis-mo.gov/data/upload/data-files/csb.zip' Content type 'application/x-zip-compressed' length 60999726 bytes (58.2 MB) ================================================== downloaded 58.2 MB > > names(csb)  "requestid" "datetimeinit" "probaddress" "probaddtype"  "callertype" "neighborhood" "ward" "problemcode"  "description" "submitto" "status" "dateinvtdone"  "datetimeclosed" "prjcompletedate" "datecancelled" "srx"  "sry"
tidy = FALSE argument for
csb_get_data() will return a tibble with both variables as well as the original variable names in the order they are disseminated by the City:
> csb <- csb_get_data(tidy = FALSE) trying URL 'https://www.stlouis-mo.gov/data/upload/data-files/csb.zip' Content type 'application/x-zip-compressed' length 60999726 bytes (58.2 MB) ================================================== downloaded 58.2 MB > > names(csb)  "CALLERTYPE" "DATECANCELLED" "DATEINVTDONE" "DATETIMECLOSED"  "DATETIMEINIT" "DESCRIPTION" "NEIGHBORHOOD" "PRJCOMPLETEDATE"  "PROBADDRESS" "PROBADDTYPE" "PROBCITY" "PROBLEMCODE"  "PROBZIP" "REQUESTID" "SRX" "SRY"  "STATUS" "SUBMITTO" "WARD"
It is also possible to only import a single year or a range of years:
csb_load_variables() function is available to provide descriptions of each variable. Like
csb_get_data(), this function returns the tidy version by default, but can be set to return the regular version as well (with
tidy = FALSE):
> csb_load_variables() trying URL 'https://www.stlouis-mo.gov/data/upload/data-files/CSB-Request-Dataset-Field-Definitions.xlsx' Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 4976 bytes ================================================== downloaded 4976 bytes # A tibble: 17 x 2 Name Description <chr> <chr> 1 requestid System generated unique record number 2 datetimeinit Date/Time the request was initiated... 3 probaddress The address where the problem is occurring <OUTPUT TRUNCATED>
The most important challenge with these data are the lack of intelligible categorization of Problem codes. These categories were created by hand - we selected like problem codes and created meta-categories that represented an overarching concept.
Each category is available in a named character vector that begins with
cat_. They are used internally and are also exported for reference and extension:
> cat_vacant  "Vacant Unit Appeal" "Vacant Unit Appeal t" "Debris-Vacant Bldg"  "Debris-Vacant Lot" "Missed Cut - V Lot" "Missed Cut - VBldg"  "Unsatisfy Cut - VLot" "Unsatisfy Cut -VBldg" "Weeds-Vacant Bldg"  "Weeds-Vacant Lot" "WTR-VACANT-BLDG" "VACANT BLDG INITIV"  "Vacnt Bldg Unsecured" "Vcnt weed check-web" "LRA Board up"  "Misc-LRA" "Misc-Op Brightside" "Building Collapse"  "Property Damage-LRA"
There are several functions available to represent data categorically.
csb_categorize() can be used to categorize data based on its problem code:
csb_filter() can be used to filter the data based on these same categories. Category takes single or vector arguments:
csb_vacant() is used for appending a logical vector indicating vacancy related problem codes. It is important to know that vacant exists inclusive of other categories, or in other words, a problem code will vacancy related in addition to its association with a more general category.
csb_canceled() is used for the removal of CSB requests that were canceled. By default it deletes the column specified (in this case
datecancelled) since it will contain only
NA values after the data are subset.
stlcsb package also provides helpful functions for dealing with spatial data. The
csb_missingXY() function is used to identify observations with incomplete or invalid spatial data:
csb_projectXY is used to create a simple features object, which can then be manipulated using the
sf package or plotted using any of the mapping packages for
R. It is critical missing spatial data be removed prior to executing this function:
The package offers two functions for handling time and date. The
csb_date_filter()function is used to subset observations for a specified time period. It is flexible in syntax, accepting 3 letter abbreviations, full month names, or a numeric argument:
csb_date_filter(csb, datetimeinit, day = 1) csb_date_filter(csb, datetimeinit, day = 1:15, month = 1) csb_date_filter(csb, datetimeinit, month = "January", year = 09) csb_date_filter(csb, datetimeinit, month = c("jan", "feb", "Mar", "Apr"), year = 2009) csb_date_filter(csb, datetimeinit, day = 1:15, month = 1:6, year = 08:13)
csb_date_parse() is used to parse part of, or whole dates from observations. It too can delete the original date variable. Arguments are the names of the variables to be appended to the data.
csb_date_parse(csb, var = datetimeinit, day = dayInit) csb_date_parse(csb, var = datetimeinit, day = dayInit, month = monthInit) csb_date_parse(csb, var = datetimeinit, month = monthInit) csb_date_parse(csb, var = datetimeinit, month = monthInit, year = yearInit) csb_date_parse(csb, var = datetimeinit, day = dayInit, month = monthInit, year = yearInit, drop = TRUE)
The following is an example of a full workflow for importing the data, filtering for vacancy indicating problem codes, filtering by time, removing observations with missing spatial data, and projecting it to an sf object. Notice that the pipe operator
magrittr is used. This passes the output of the prior function to the first argument of the next function, reducing typing and making code easier to read.
csb <- csb_get_data(year = 2017) csb %>% csb_filter(var = problemcode, category = cat_vacant) %>% csb_date_filter(var = datetimeinit, month = c("Jun", "Jul", "Aug")) %>% csb_missingXY(varX = srx, varY = sry, newVar = missingXY) %>% dplyr::filter(missingXY == FALSE) %>% csb_projectXY(varX = srx, varY = sry) -> vacant_sf
Once the data have been projected, they can be previewed with a package like
These data can also be mapped using
ggplot2 once they have been projected:
Ritself, welcome! Hadley Wickham’s R for Data Science is an excellent way to get started with data manipulation in the tidyverse, which
stlcsbis designed to integrate seamlessly with.
R, we strongly encourage you check out the excellent new Geocomputation in R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow.
stlcsb, you are encouraged to use the RStudio Community forums. Please create a
reprexbefore posting. Feel free to tag Chris (
@chris.prener) in any posts about
reprexand then open an issue on GitHub.