postmastr is designed to be an opinionated toolkit for parsing street addresses using R. It was originally created to standardize addresses prior to geocoding them, in an effort to increase the geocoder’s ability to correctly match a given address with the appropriate coordinates.
While street addresses have a significant amount of variation, they also are ordered in a relatively standardized fashion. Take, for example, an address from sushi1 (one of the example data sets in postmastr): 601 Clark Ave Suite 104, St. Louis, MO 63102-1719. We can break down the address in the following way:
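601 - house number
Clark - street name
Ave - street suffix
Suite - unit name
104 - unit number
St. Louis - city
MO - state
63102-1719 - postal code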
There is sometimes additional information as well. Imagine that the house number in the example above was 601-603 instead. We could break this more complex number down in the following way:
601-603
601
603
Another permutation has to do with the house number suffix. Imagine a house number that is 601R or 601 Rear:
601
R
Finally, streets in the United States sometimes contain either a street prefix direction (601 North Clark Ave) or a street suffix direction (601 Clark Ave North).
This basic anatomy of a street address forms our grammar of street addresses - a specific language for thinking about how addresses are formed and therefore can be parsed.
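The house number variations described above can be pictured with a small base R sketch. The helper `split_house()` is hypothetical and written only for illustration; it is not a postmastr function and does not reflect how the package parses house numbers internally.

```r
# Hypothetical helper for illustration only; not part of postmastr.
split_house <- function(x) {
  # Anything after the leading digits/hyphen, e.g. "R" or "Rear"
  suffix <- sub("^[0-9-]+[[:space:]]*", "", x)
  # The numeric part with any trailing letters removed, e.g. "601" or "601-603"
  number <- sub("[[:space:]]*[A-Za-z]+$", "", x)
  # A range such as "601-603" splits into a low and a high value
  parts <- strsplit(number, "-", fixed = TRUE)[[1]]
  list(
    low    = parts[1],
    high   = if (length(parts) > 1) parts[2] else NA_character_,
    suffix = if (nzchar(suffix)) suffix else NA_character_
  )
}

str(split_house("601-603"))
str(split_house("601 Rear"))
```

The same three pieces (low value, high value, suffix) cover a plain number, a range, and a suffixed number, which is why they can be treated as one grammatical element.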
postmastr
To parse our grammar of street addresses, functions can be grouped in two ways. All functions begin with the prefix pm_ in order to take advantage of RStudio’s auto-complete functionality.
First, we have major groups of functions based on their associated grammatical element:
house - house number
houseAlpha - alphanumeric house number
houseFrac - fractional house number
street - street name
streetDir - street prefix and suffix direction
streetSuf - street suffix
unit - unit name and number
city - city
state - state
postal - postal code
For each group of functions, we have a similar menu of options that describe the verb (action) the function implements. For the state family of functions, for instance:
pm_state_detect() - does a given street address contain a state name or abbreviation?
pm_state_any() - does any street address contain a state name or abbreviation?
pm_state_all() - do all street addresses contain a state name or abbreviation?
pm_state_none() - returns a tibble of street addresses that do not contain a state name or abbreviation
pm_state_parse() - parses street addresses that do contain a state name or abbreviation
pm_state_std() - standardizes the parsed state data to return upper-case abbreviations
Dictionaries are a critical component of the postmastr workflow because they allow you to define the terms that need to be parsed and standardized. Using well-defined dictionaries can speed up the parsing process and ensure that it is accurate.
postmastr comes with a built-in dictionary of states and their abbreviations that can be expanded and filtered using pm_dictionary() and pm_append(). On its own, pm_dictionary() will return that built-in dictionary with a variety of entries based on what we specify for case:
> pm_dictionary(type = "state", case = "title", locale = "us")
# A tibble: 124 x 2
state.output state.input
<chr> <chr>
1 AA AA
2 AA Armed Forces Americas
3 AE AE
4 AE Armed Forces Europe, the Middle East, and Canada
5 AK AK
6 AK Alaska
7 AL AL
8 AL Alabama
9 AP AP
10 AP Armed Forces Pacific
# … with 114 more rows
We can customize this mix of cases by using "upper" and "lower" as well. For example:
> pm_dictionary(type = "state", case = c("title", "upper"), locale = "us")
# A tibble: 186 x 2
state.output state.input
<chr> <chr>
1 AA AA
2 AA Armed Forces Americas
3 AA ARMED FORCES AMERICAS
4 AE AE
5 AE Armed Forces Europe, the Middle East, and Canada
6 AE ARMED FORCES EUROPE, THE MIDDLE EAST, AND CANADA
7 AK AK
8 AK Alaska
9 AK ALASKA
10 AL AL
# … with 176 more rows
If only a subset of states are included in your data, you can improve postmastr’s performance by limiting your state dictionary’s contents. The filter argument can accept scalar or vector inputs of two-letter state abbreviations. For instance, we could construct a state dictionary that contains only the states along the Gulf of Mexico:
> pm_dictionary(type = "state", filter = c("AL", "FL", "LA", "MS", "TX"), case = "title", locale = "us")
# A tibble: 10 x 2
state.output state.input
<chr> <chr>
1 AL AL
2 AL Alabama
3 FL FL
4 FL Florida
5 LA LA
6 LA Louisiana
7 MS MS
8 MS Mississippi
9 TX TX
10 TX Texas
The state dictionary can also be expanded. For instance, there are common alternative abbreviations for Mississippi - “Miss” and “MISS”. We can create an appendix for these with pm_append(), stored here in an object named miss.
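As a rough picture of what an appendix contributes, the base R sketch below binds two extra spellings onto a small dictionary and looks up the standardized output. The data frames are hand-built stand-ins that borrow the state.output and state.input column names from the dictionary output shown earlier; they are not produced by pm_dictionary() or pm_append(), and `lookup_state()` is a hypothetical helper.

```r
# Stand-in for a small state dictionary (hand-built, not pm_dictionary() output)
gulf <- data.frame(
  state.output = c("MS", "MS", "TX", "TX"),
  state.input  = c("MS", "Mississippi", "TX", "Texas"),
  stringsAsFactors = FALSE
)

# Stand-in for an appendix that maps extra spellings to the standard abbreviation
miss_rows <- data.frame(
  state.output = c("MS", "MS"),
  state.input  = c("Miss", "MISS"),
  stringsAsFactors = FALSE
)

# Combining the two yields a dictionary in which every spelling resolves to "MS"
custom <- rbind(gulf, miss_rows)

# Hypothetical lookup: return the standardized output for each raw input
lookup_state <- function(x, dict) dict$state.output[match(x, dict$state.input)]

lookup_state(c("Miss", "MISS", "Mississippi", "Texas"), custom)
```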
Once this has been created, we can combine it with the default state dictionary to create our custom dictionary output:
> pm_dictionary(type = "state", append = miss,
+ filter = c("AL", "FL", "LA", "MS", "TX"),
+ case = c("title", "upper"), locale = "us")
# A tibble: 17 x 2
state.output state.input
<chr> <chr>
1 AL AL
2 AL Alabama
3 AL ALABAMA
4 FL FL
5 FL Florida
6 FL FLORIDA
7 LA LA
8 LA Louisiana
9 LA LOUISIANA
10 MS MS
11 MS Mississippi
12 MS Miss
13 MS MISSISSIPPI
14 MS MISS
15 TX TX
16 TX Texas
17 TX TEXAS
All of the different inputs for Mississippi will now return the same MS output.
The only required dictionary for U.S. street addresses is the city dictionary, because municipalities and place names in the U.S. are so numerous that we cannot build a single, efficient object to use as a default. postmastr comes with default U.S. dictionaries for states (shown above), street directionals, street suffixes, and unit types. It is also possible to build street name dictionaries to correct common misspellings and to create dictionaries for house suffix values (e.g. “Front” or “Rear” addresses).
To illustrate the core components of the postmastr workflow, we’ll use some data on sushi restaurants in the St. Louis, Missouri region. These are “long” data - some restaurants appear multiple times. Here is a quick preview of the data:
> sushi1
# A tibble: 30 x 3
name address visit
<chr> <chr> <chr>
1 BaiKu Sushi Lounge 3407 Olive St, St. Louis, Missouri 63103 3/20/18
2 Blue Ocean Restaurant 6335 Delmar Blvd, St. Louis, MO 63112 10/26/18
3 Cafe Mochi 3221 S Grand Boulevard, St. Louis, MO 63118 10/10/18
4 Drunken Fish - Ballpark Village 601 Clark Ave #104, St. Louis, MO 63102-1719 4/28/18
5 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 5/10/18
6 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 8/7/18
7 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108 12/2/18
8 I Love Mr Sushi 9443 Olive Blvd, St. Louis, Missouri 63132 1/1/18
9 Kampai Sushi Bar 4949 W Pine Blvd, St. Louis, MO 63108 2/13/18
10 Midtown Sushi & Ramen 3674 Forest Park Ave, St. Louis, MO 63108 3/4/18
# … with 20 more rows
Some problems should already be apparent. For instance, Cafe Mochi uses the full word “Boulevard” while the entry for Blue Ocean uses the proper abbreviation “Blvd”. For Drunken Fish, the suite number is listed both with the pound sign (“#”) and with the word “Suite”. Some of the entries, including those for BaiKu and I Love Mr Sushi, use the full name “Missouri” while the rest use the proper two-letter abbreviation “MO”. Finally, Drunken Fish - Ballpark Village uses the “zip+4” format, whereas the other visible addresses contain only the five-digit zip code.
Some of the other entries have additional issues. For example, Sushi Koi has its address fully capitalized.
Similarly, Wasabi Sushi Bar has its address fully capitalized, but the prefix direction “SOUTH” appears as a word rather than the proper “S”.
This vignette will walk through the process of addressing these issues.
The postmastr package has a single high-level function for parsing, pm_parse(), which wraps all of the preparatory, parsing, and reconstruction functions into a single call. This can be used if the problem-space is exceptionally well defined - all dictionaries need to be created ahead of time, so you must know what dictionary elements are necessary. This may be possible for you if you consistently work with specific data sources and have a good understanding of what cities, states, and other elements of the grammar are present. If you are not sure what dictionary elements are needed, you will need to use the workflow illustrated below to develop these objects.
For the sushi1 data, the required dictionaries are:
> dirs <- pm_dictionary(type = "directional", filter = c("N", "S", "E", "W"), locale = "us")
> mo <- pm_dictionary(type = "state", filter = "MO", case = c("title", "upper"), locale = "us")
> cities <- pm_append(type = "city",
+ input = c("Brentwood", "Clayton", "CLAYTON", "Maplewood",
+ "St. Louis", "SAINT LOUIS", "Webster Groves"),
+ output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA))
Once those dictionaries are built, we can parse the data. The input argument is used to define how the address data are structured, and output is used to specify what type of output you receive.
> sushi1 %>%
+ dplyr::filter(name != "Drunken Fish - Ballpark Village") %>%
+ pm_parse(input = "full", address = "address", output = "full",
+ dir_dict = dirs, city_dict = cities, state_dict = mo)
# A tibble: 27 x 4
name address visit pm.address
<chr> <chr> <chr> <chr>
1 BaiKu Sushi Lounge 3407 Olive St, St. Louis, Missouri 63103 3/20/18 3407 Olive St St. Louis MO 63103
2 Blue Ocean Restaurant 6335 Delmar Blvd, St. Louis, MO 63112 10/26/18 6335 Delmar Blvd St. Louis MO 63112
3 Cafe Mochi 3221 S Grand Boulevard, St. Louis, MO 63118 10/10/18 3221 S Grand Blvd St. Louis MO 63118
4 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108 12/2/18 1 Maryland Plz St. Louis MO 63108
5 I Love Mr Sushi 9443 Olive Blvd, St. Louis, Missouri 63132 1/1/18 9443 Olive Blvd St. Louis MO 63132
6 Kampai Sushi Bar 4949 W Pine Blvd, St. Louis, MO 63108 2/13/18 4949 W Pine Blvd St. Louis MO 63108
7 Midtown Sushi & Ramen 3674 Forest Park Ave, St. Louis, MO 63108 3/4/18 3674 Forest Park Ave St. Louis MO 63108
8 Mizu Sushi Bar 1013 Washington Avenue, St. Louis, MO 63101 9/12/18 1013 Washington Ave St. Louis MO 63101
9 Robata Maplewood 7260 Manchester Road, Maplewood, MO 63143 11/1/18 7260 Manchester Rd Maplewood MO 63143
10 SanSai Japanese Grill Maplewood 1803 Maplewood Commons Dr, St. Louis, MO 63143 2/14/18 1803 Maplewood Commons Dr St. Louis MO 63143
# … with 17 more rows
We can limit our output to just the house and street data:
> sushi1 %>%
+ dplyr::filter(name != "Drunken Fish - Ballpark Village") %>%
+ pm_parse(input = "full", address = "address", output = "short",
+ dir_dict = dirs, city_dict = cities, state_dict = mo)
# A tibble: 27 x 4
name address visit pm.address
<chr> <chr> <chr> <chr>
1 BaiKu Sushi Lounge 3407 Olive St, St. Louis, Missouri 63103 3/20/18 3407 Olive St
2 Blue Ocean Restaurant 6335 Delmar Blvd, St. Louis, MO 63112 10/26/18 6335 Delmar Blvd
3 Cafe Mochi 3221 S Grand Boulevard, St. Louis, MO 63118 10/10/18 3221 S Grand Blvd
4 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108 12/2/18 1 Maryland Plz
5 I Love Mr Sushi 9443 Olive Blvd, St. Louis, Missouri 63132 1/1/18 9443 Olive Blvd
6 Kampai Sushi Bar 4949 W Pine Blvd, St. Louis, MO 63108 2/13/18 4949 W Pine Blvd
7 Midtown Sushi & Ramen 3674 Forest Park Ave, St. Louis, MO 63108 3/4/18 3674 Forest Park Ave
8 Mizu Sushi Bar 1013 Washington Avenue, St. Louis, MO 63101 9/12/18 1013 Washington Ave
9 Robata Maplewood 7260 Manchester Road, Maplewood, MO 63143 11/1/18 7260 Manchester Rd
10 SanSai Japanese Grill Maplewood 1803 Maplewood Commons Dr, St. Louis, MO 63143 2/14/18 1803 Maplewood Commons Dr
# … with 17 more rows
We can also add the city, state, and postal code data as separate columns:
> sushi1 %>%
+ dplyr::filter(name != "Drunken Fish - Ballpark Village") %>%
+ pm_parse(input = "full", address = "address", output = "short", keep_parsed = "limited",
+ dir_dict = dirs, city_dict = cities, state_dict = mo)
# A tibble: 27 x 8
name address visit pm.address pm.city pm.state pm.zip pm.zip4
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 BaiKu Sushi Lounge 3407 Olive St, St. Louis, Missouri 63103 3/20/18 3407 Olive St St. Louis MO 63103 NA
2 Blue Ocean Restaurant 6335 Delmar Blvd, St. Louis, MO 63112 10/26/18 6335 Delmar Blvd St. Louis MO 63112 NA
3 Cafe Mochi 3221 S Grand Boulevard, St. Louis, MO 63118 10/10/18 3221 S Grand Blvd St. Louis MO 63118 NA
4 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108 12/2/18 1 Maryland Plz St. Louis MO 63108 NA
5 I Love Mr Sushi 9443 Olive Blvd, St. Louis, Missouri 63132 1/1/18 9443 Olive Blvd St. Louis MO 63132 NA
6 Kampai Sushi Bar 4949 W Pine Blvd, St. Louis, MO 63108 2/13/18 4949 W Pine Blvd St. Louis MO 63108 NA
7 Midtown Sushi & Ramen 3674 Forest Park Ave, St. Louis, MO 63108 3/4/18 3674 Forest Park Ave St. Louis MO 63108 NA
8 Mizu Sushi Bar 1013 Washington Avenue, St. Louis, MO 63101 9/12/18 1013 Washington Ave St. Louis MO 63101 NA
9 Robata Maplewood 7260 Manchester Road, Maplewood, MO 63143 11/1/18 7260 Manchester Rd Maplewood MO 63143 NA
10 SanSai Japanese Grill Maplewood 1803 Maplewood Commons Dr, St. Louis, MO 63143 2/14/18 1803 Maplewood Commons Dr St. Louis MO 63143 NA
# … with 17 more rows
postmastr Workflow
If you do not have a strong sense of the problem-space you are confronted with, and therefore do not know exactly what is needed for dictionaries, postmastr has a full-featured workflow for step-by-step parsing of street addresses.
postmastr
postmastr’s functionality rests on an order of operations that must be followed to ensure correct parsing:
If no street address in the given data set contains one of the grammatical elements (for example, if none of the addresses contain units), the associated functions in the workflow can be skipped. For “short” addresses that do not include cities, states, or postal codes, the order of operations should be:
There are two initial preparatory steps that must be taken with these data. First, we want to ensure that they have a unique identification number for each row (pm.id, which preserves the original sort order) as well as a unique identification number for each unique street address string (pm.uid). These can both be applied using pm_identify():
> sushi1 <- pm_identify(sushi1, var = "address")
> sushi1
# A tibble: 30 x 5
pm.id pm.uid name address visit
<int> <int> <chr> <chr> <chr>
1 1 1 BaiKu Sushi Lounge 3407 Olive St, St. Louis, Missouri 63103 3/20/18
2 2 2 Blue Ocean Restaurant 6335 Delmar Blvd, St. Louis, MO 63112 10/26/18
3 3 3 Cafe Mochi 3221 S Grand Boulevard, St. Louis, MO 63118 10/10/18
4 4 4 Drunken Fish - Ballpark Village 601 Clark Ave #104, St. Louis, MO 63102-1719 4/28/18
5 5 5 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 5/10/18
6 6 5 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 8/7/18
7 7 6 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108 12/2/18
8 8 7 I Love Mr Sushi 9443 Olive Blvd, St. Louis, Missouri 63132 1/1/18
9 9 8 Kampai Sushi Bar 4949 W Pine Blvd, St. Louis, MO 63108 2/13/18
10 10 9 Midtown Sushi & Ramen 3674 Forest Park Ave, St. Louis, MO 63108 3/4/18
# … with 20 more rows
Notice that the Drunken Fish has three different unique identifiers applied for pm.uid - two for the Ballpark Village location based on how the suite number is indicated and one for the Central West End location. Since address data are often numerous, postmastr is designed to operate on unique street address strings rather than the full original data set to improve efficiency. We’ll create our minimal postmastr object using pm_prep():
> sushi1_min <- pm_prep(sushi1, var = "address")
> sushi1_min
# A tibble: 24 x 2
pm.uid pm.address
<int> <chr>
1 1 3407 Olive St St. Louis Missouri 63103
2 2 6335 Delmar Blvd St. Louis MO 63112
3 3 3221 S Grand Boulevard St. Louis MO 63118
4 4 601 Clark Ave #104 St. Louis MO 63102-1719
5 5 601 Clark Ave Suite 104 St. Louis MO 63102-1719
6 6 1 Maryland Plaza St. Louis MO 63108
7 7 9443 Olive Blvd St. Louis Missouri 63132
8 8 4949 W Pine Blvd St. Louis MO 63108
9 9 3674 Forest Park Ave St. Louis MO 63108
10 10 1013 Washington Avenue St. Louis MO 63101
# … with 14 more rows
Notice that all extraneous information has been removed, and that there are now only 24 rows instead of the original 30.
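The identification and preparation steps above can be pictured in base R. This is only a sketch of the idea: the column names row_id and uniq_id are hypothetical stand-ins for pm.id and pm.uid, and the comma removal simply mirrors what is visible in the output above rather than the package's actual cleaning rules.

```r
address <- c(
  "601 Clark Ave #104, St. Louis, MO 63102-1719",
  "601 Clark Ave Suite 104, St. Louis, MO 63102-1719",
  "601 Clark Ave Suite 104, St. Louis, MO 63102-1719",
  "1 Maryland Plaza, St. Louis, MO 63108"
)

ids <- data.frame(
  row_id  = seq_along(address),               # one id per row, preserving sort order
  uniq_id = match(address, unique(address)),  # one id per distinct address string
  address = address,
  stringsAsFactors = FALSE
)

# Keep one row per distinct address string and drop the commas
minimal <- ids[!duplicated(ids$uniq_id), c("uniq_id", "address")]
minimal$address <- gsub(",", "", minimal$address, fixed = TRUE)
rownames(minimal) <- NULL

ids
minimal
```

Working on the de-duplicated table means each distinct string is parsed once, and the shared identifier is what later allows the parsed results to be matched back to every original row.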
Once we have our data prepared, we can begin working our way down the order of operations list. To see if it is possible to skip a step, we should first use the appropriate pm_any_ function. With the sushi1_min data, it will return TRUE because postal codes are present in the data.
We can also use pm_all_ functions to determine whether all of the addresses have a postal code.
If this returned a FALSE result, we would want to use pm_has and pm_no_ functions to explore whether postal codes are not being detected because they (a) actually do not exist or (b) are not being found because they are mis-formatted. Since we get a TRUE result, we can move on to parsing. If pm_parse_postal() detects the presence of carrier routes (the four-digit additions to the typical five-digit zip-codes), it will parse those as well so that two postal code columns (pm.zip and pm.zip4) are returned:
> sushi1_min <- pm_parse_postal(sushi1_min)
> sushi1_min
# A tibble: 24 x 4
pm.uid pm.address pm.zip pm.zip4
<int> <chr> <chr> <chr>
1 1 3407 Olive St St. Louis Missouri 63103 NA
2 2 6335 Delmar Blvd St. Louis MO 63112 NA
3 3 3221 S Grand Boulevard St. Louis MO 63118 NA
4 4 601 Clark Ave #104 St. Louis MO 63102 1719
5 5 601 Clark Ave Suite 104 St. Louis MO 63102 1719
6 6 1 Maryland Plaza St. Louis MO 63108 NA
7 7 9443 Olive Blvd St. Louis Missouri 63132 NA
8 8 4949 W Pine Blvd St. Louis MO 63108 NA
9 9 3674 Forest Park Ave St. Louis MO 63108 NA
10 10 1013 Washington Avenue St. Louis MO 63101 NA
# … with 14 more rows
Had no carrier routes been present, these data would have been returned with only the pm.zip column. Note that pm_parse_postal() also updates the pm.address column and removes any postal codes that have been identified. This facilitates additional parsing in subsequent steps.
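The idea of peeling a postal code off the end of the address string can be sketched in base R with regular expressions. The pattern and variable names below are assumptions made for illustration; they are not taken from postmastr's source and may differ from what pm_parse_postal() does internally.

```r
address <- c(
  "3407 Olive St St. Louis Missouri 63103",
  "601 Clark Ave #104 St. Louis MO 63102-1719",
  "1 Main St"
)

# Five digits at the end of the string, optionally followed by "-" and four digits
zip_pattern <- "[[:space:]][0-9]{5}(-[0-9]{4})?$"
has_zip  <- grepl(zip_pattern, address)
has_zip4 <- has_zip & grepl("-[0-9]{4}$", address)

# Extract the five-digit code and the optional four-digit addition
zip5 <- ifelse(has_zip, sub("^.*[[:space:]]([0-9]{5})(-[0-9]{4})?$", "\\1", address), NA_character_)
zip4 <- ifelse(has_zip4, sub("^.*-([0-9]{4})$", "\\1", address), NA_character_)

# Remove the postal code from the address so later steps see a shorter string
rest <- ifelse(has_zip, sub(zip_pattern, "", address), address)

data.frame(address = rest, zip = zip5, zip4 = zip4, stringsAsFactors = FALSE)
```

Anchoring the pattern to the end of the string is what keeps a five-digit house number at the start of an address from being mistaken for a postal code.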
To parse the states out of pm.address, we’ll start by creating a state-level dictionary object that contains only references to Missouri since we don’t have data outside of that state:
> moDict <- pm_dictionary(locale = "us", type = "state", filter = "MO", case = "title")
> moDict
# A tibble: 2 x 2
state.output state.input
<chr> <chr>
1 MO MO
2 MO Missouri
With our dictionary created in the object moDict, we can use it to test whether state names or abbreviations are found at the end of our address strings with pm_any_state() and pm_all_state():
> pm_any_state(sushi1_min, dictionary = moDict)
[1] TRUE
> pm_all_state(sushi1_min, dictionary = moDict)
[1] FALSE
These results indicate that our dictionary is returning matches, but that our list of possible inputs is not complete. We can explore the un-matched addresses with pm_no_state() to determine whether these observations (a) actually do not contain states or (b) are not being matched because an entry is missing from our dictionary:
> pm_no_state(sushi1_min, dictionary = moDict)
# A tibble: 1 x 4
pm.uid pm.address pm.zip pm.zip4
<int> <chr> <chr> <chr>
1 24 16 SOUTH CENTRAL AVE CLAYTON MISSOURI 63105 NA
Our dictionary does not contain MISSOURI, only Missouri, so we’ll need to add that option. We can use case = c("title", "upper") to do this:
> moDict <- pm_dictionary(locale = "us", type = "state", filter = "MO", case = c("title", "upper"))
> moDict
# A tibble: 3 x 2
state.output state.input
<chr> <chr>
1 MO MO
2 MO Missouri
3 MO MISSOURI
For more complex mis-matches, we would want to use pm_append() to construct an appendix and then re-build our dictionary with that appendix included. We can verify that our dictionary is now complete by repeating our use of pm_all_state().
With a TRUE result, we are ready to parse state names and abbreviations out of our data:
> sushi1_min <- pm_parse_state(sushi1_min, dictionary = moDict)
> sushi1_min
# A tibble: 24 x 5
pm.uid pm.address pm.state pm.zip pm.zip4
<int> <chr> <chr> <chr> <chr>
1 1 3407 Olive St St. Louis MO 63103 NA
2 2 6335 Delmar Blvd St. Louis MO 63112 NA
3 3 3221 S Grand Boulevard St. Louis MO 63118 NA
4 4 601 Clark Ave #104 St. Louis MO 63102 1719
5 5 601 Clark Ave Suite 104 St. Louis MO 63102 1719
6 6 1 Maryland Plaza St. Louis MO 63108 NA
7 7 9443 Olive Blvd St. Louis MO 63132 NA
8 8 4949 W Pine Blvd St. Louis MO 63108 NA
9 9 3674 Forest Park Ave St. Louis MO 63108 NA
10 10 1013 Washington Avenue St. Louis MO 63101 NA
# … with 14 more rows
If we inspect this object, we’ll see that pm.state contains the state abbreviation for Missouri in all cases. As with postal codes, the state names have been removed from pm.address to facilitate the next phase in parsing.
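The dictionary-driven matching used for states can be sketched in base R. The helper `parse_state()` is hypothetical, and it assumes case-sensitive matching against the end of the string, which is consistent with the MISSOURI mismatch seen above but is not confirmed as the package's exact behaviour. The dictionary is a hand-built stand-in using the same column names as the pm_dictionary() output.

```r
# Hand-built stand-in for a Missouri-only dictionary
dict <- data.frame(
  state.output = c("MO", "MO", "MO"),
  state.input  = c("MO", "Missouri", "MISSOURI"),
  stringsAsFactors = FALSE
)

address <- c(
  "3407 Olive St St. Louis Missouri",
  "6335 Delmar Blvd St. Louis MO",
  "16 SOUTH CENTRAL AVE CLAYTON MISSOURI"
)

# Hypothetical helper: match dictionary inputs at the end of each address
parse_state <- function(x, dict) {
  state <- rep(NA_character_, length(x))
  rest  <- x
  # Try longer inputs first so a short entry cannot pre-empt a longer one
  dict <- dict[order(-nchar(dict$state.input)), ]
  for (i in seq_len(nrow(dict))) {
    suffix <- paste0(" ", dict$state.input[i])
    hit <- is.na(state) & endsWith(x, suffix)
    state[hit] <- dict$state.output[i]
    rest[hit]  <- substr(x[hit], 1, nchar(x[hit]) - nchar(suffix))
  }
  data.frame(address = rest, state = state, stringsAsFactors = FALSE)
}

# With only the title-case entries, the upper-case address goes unmatched
all(!is.na(parse_state(address, dict[1:2, ])$state))

# With the upper-case entry included, every address is matched and shortened
parse_state(address, dict)
```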
The workflow for cities is similar. Two options exist for American cities - you can either create a full list of cities by state using pm_dictionary() (useful if you are not sure which cities appear in the data) or use pm_append() on its own to create an appendix (faster if you know that there are a limited number of cities). This appendix for cities can be used on its own. We’ll choose this second option:
> cityDict <- pm_append(type = "city",
+ input = c("Brentwood", "Clayton", "Maplewood", "St. Louis", "Webster Groves"))
We’ll then use pm_any_city() and pm_all_city() to verify that our dictionary is working and complete:
> pm_any_city(sushi1_min, dictionary = cityDict)
[1] TRUE
> pm_all_city(sushi1_min, dictionary = cityDict)
[1] FALSE
As with the state data, we can use pm_no_city() to identify why our dictionary is incomplete (or verify that some addresses do not contain cities):
> pm_no_city(sushi1_min, dictionary = cityDict)
# A tibble: 2 x 5
pm.uid pm.address pm.state pm.zip pm.zip4
<int> <chr> <chr> <chr> <chr>
1 17 4 N EUCLID AVE SAINT LOUIS MO 63108 NA
2 24 16 SOUTH CENTRAL AVE CLAYTON MO 63105 NA
There are two city names in all upper-case. We’ll re-create our dictionary, adding both "SAINT LOUIS" and "CLAYTON" and specifying an output vector that is NA for all the correctly formatted cities but includes entries for our two incorrectly formatted cities. We can then verify with pm_all_city() that our dictionary is now complete:
> cityDict <- pm_append(type = "city",
+ input = c("Brentwood", "Clayton", "CLAYTON", "Maplewood",
+ "St. Louis", "SAINT LOUIS", "Webster Groves"),
+ output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA))
> pm_all_city(sushi1_min, dictionary = cityDict)
[1] TRUE
We can now move on to parsing using pm_parse_city():
> sushi1_min <- pm_parse_city(sushi1_min, dictionary = cityDict)
> sushi1_min
# A tibble: 24 x 6
pm.uid pm.address pm.city pm.state pm.zip pm.zip4
<int> <chr> <chr> <chr> <chr> <chr>
1 1 3407 Olive St St. Louis MO 63103 NA
2 2 6335 Delmar Blvd St. Louis MO 63112 NA
3 3 3221 S Grand Boulevard St. Louis MO 63118 NA
4 4 601 Clark Ave #104 St. Louis MO 63102 1719
5 5 601 Clark Ave Suite 104 St. Louis MO 63102 1719
6 6 1 Maryland Plaza St. Louis MO 63108 NA
7 7 9443 Olive Blvd St. Louis MO 63132 NA
8 8 4949 W Pine Blvd St. Louis MO 63108 NA
9 9 3674 Forest Park Ave St. Louis MO 63108 NA
10 10 1013 Washington Avenue St. Louis MO 63101 NA
The column pm.city has been added to our data set and city names have been parsed as well as standardized.
Since the example data do not contain fractional addresses or house suffix values, parsing them is straightforward (dealing with less common street addresses will be the subject of an additional vignette). We’ll use an abbreviated version of the sushi data that do not contain city, state, or postal code data. These are sushi restaurants located within the City of St. Louis proper:
> postmastr::sushi2 %>%
+ dplyr::filter(name != "Drunken Fish - Ballpark Village") %>%
+ pm_identify(var = address) -> sushi2
>
> sushi2_min <- pm_prep(sushi2, var = address)
>
> sushi2_min
# A tibble: 11 x 2
pm.uid pm.address
<int> <chr>
1 1 3407 Olive St
2 2 3221 S Grand Boulevard
3 3 1 Maryland Plaza
4 4 4949 W Pine Blvd
5 5 3674 Forest Park Ave
6 6 1013 Washington Avenue
7 7 3043 Olive St
8 8 308 N Euclid Ave
9 9 910 Olive St
10 10 910 Olive Street
11 11 4 N EUCLID AVE
Our next task in the order of operations is to parse out house numbers. We do this with pm_house_parse():
> sushi2_min <- pm_house_parse(sushi2_min)
>
> sushi2_min
# A tibble: 11 x 3
pm.uid pm.address pm.house
<int> <chr> <chr>
1 1 Olive St 3407
2 2 S Grand Boulevard 3221
3 3 Maryland Plaza 1
4 4 W Pine Blvd 4949
5 5 Forest Park Ave 3674
6 6 Washington Avenue 1013
7 7 Olive St 3043
8 8 N Euclid Ave 308
9 9 Olive St 910
10 10 Olive Street 910
11 11 N EUCLID AVE 4
There is no dictionary for this step - the first word of any address will be parsed out so long as it contains at least some numbers. If your house number contains a range (123-125 Main St) or a fractional value (123 1/2 Main St), the separate pm_houseRange_ and pm_houseFrac_ families of functions deal with these special cases.
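The rule described above (take the first word if it contains at least one digit) can be sketched in base R. The variable names are chosen for illustration, and this is not postmastr's implementation.

```r
address <- c("3407 Olive St", "S Grand Boulevard", "601R Clark Ave")

# The first word is everything before the first whitespace character
first_word <- sub("[[:space:]].*$", "", address)

# Treat it as a house number only if it contains at least one digit
has_house <- grepl("[0-9]", first_word)

house <- ifelse(has_house, first_word, NA_character_)
rest  <- ifelse(has_house, sub("^[^[:space:]]+[[:space:]]*", "", address), address)

data.frame(address = rest, house = house, stringsAsFactors = FALSE)
```

The second address has no leading number, so it is left untouched and its house value stays missing.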
Our addresses have two types of prefix and suffix data. There are directionals, like the “S” (for south) in S Grand Boulevard, and the street suffix, which is “Boulevard”. The United States Postal Service prefers abbreviations for both directionals and suffix values, and so postmastr returns the preferred abbreviations whenever full names are found. We can parse out the directionals first with pm_streetDir_parse():
> sushi2_min <- pm_streetDir_parse(sushi2_min, dictionary = dirs)
>
> sushi2_min
# A tibble: 11 x 4
pm.uid pm.address pm.house pm.preDir
<int> <chr> <chr> <chr>
1 1 Olive St 3407 NA
2 2 Grand Boulevard 3221 S
3 3 Maryland Plaza 1 NA
4 4 Pine Blvd 4949 W
5 5 Forest Park Ave 3674 NA
6 6 Washington Avenue 1013 NA
7 7 Olive St 3043 NA
8 8 Euclid Ave 308 N
9 9 Olive St 910 NA
10 10 Olive Street 910 NA
11 11 EUCLID AVE 4 N
postmastr will automatically detect both prefix and suffix directionals (i.e. S Grand Boulevard and Grand Boulevard S).
Once directionals have been parsed, we can parse and standardize the street suffix values as well with pm_streetSuf_parse():
> sushi2_min <- pm_streetSuf_parse(sushi2_min)
>
> sushi2_min
# A tibble: 11 x 5
pm.uid pm.address pm.house pm.preDir pm.streetSuf
<int> <chr> <chr> <chr> <chr>
1 1 Olive 3407 NA St
2 2 Grand 3221 S Blvd
3 3 Maryland 1 NA Plz
4 4 Pine 4949 W Blvd
5 5 Forest Park 3674 NA Ave
6 6 Washington 1013 NA Ave
7 7 Olive 3043 NA St
8 8 Euclid 308 N Ave
9 9 Olive 910 NA St
10 10 Olive 910 NA St
11 11 EUCLID 4 N Ave
If a street has a directional name (e.g. North Ave), the word North will be added back into pm.address after the street suffix is parsed out.
Our final parsing task is to convert whatever remains in pm.address to the street name with pm_street_parse(). As part of the standardization process, names like Second will be converted to ordinals (i.e. 2nd). This creates shorter, more compact output street names. The same standardization functionality allows for optional correction of commonly misspelled street names as well.
> sushi2_min <- pm_street_parse(sushi2_min, ordinal = TRUE, drop = TRUE)
>
> sushi2_min
# A tibble: 11 x 5
pm.uid pm.house pm.preDir pm.street pm.streetSuf
<int> <chr> <chr> <chr> <chr>
1 1 3407 NA Olive St
2 2 3221 S Grand Blvd
3 3 1 NA Maryland Plz
4 4 4949 W Pine Blvd
5 5 3674 NA Forest Park Ave
6 6 1013 NA Washington Ave
7 7 3043 NA Olive St
8 8 308 N Euclid Ave
9 9 910 NA Olive St
10 10 910 NA Olive St
11 11 4 N Euclid Ave
The pm_street_ family of functions does not have the logical test functions at this time, since street names are assumed to be whatever is left over from the parsing process to this point.
Once we have parsed our data, we add the parsed data back into the source data frame with pm_replace():
> sushi2_parsed <- pm_replace(sushi2_min, source = sushi2)
>
> sushi2_parsed
# A tibble: 15 x 9
pm.id pm.uid name address visit pm.house pm.preDir pm.street pm.streetSuf
<int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 BaiKu Sushi Lounge 3407 Olive St 3/20/18 3407 NA Olive St
2 2 2 Cafe Mochi 3221 S Grand Boulevard 10/10/18 3221 S Grand Blvd
3 3 3 Drunken Fish - Central West End 1 Maryland Plaza 12/2/18 1 NA Maryland Plz
4 4 4 Kampai Sushi Bar 4949 W Pine Blvd 2/13/18 4949 W Pine Blvd
5 5 5 Midtown Sushi & Ramen 3674 Forest Park Ave 3/4/18 3674 NA Forest Park Ave
6 6 6 Mizu Sushi Bar 1013 Washington Avenue 9/12/18 1013 NA Washington Ave
7 7 7 Sapporo 2 3043 Olive St 3/1/18 3043 NA Olive St
8 8 7 Sapporo 2 3043 Olive St 7/3/18 3043 NA Olive St
9 9 8 Sub Zero Vodka Bar 308 N Euclid Ave 12/7/18 308 N Euclid Ave
10 10 9 Sushi Ai 910 Olive St 3/29/18 910 NA Olive St
11 11 9 Sushi Ai 910 Olive St 5/20/18 910 NA Olive St
12 12 10 Sushi Ai 910 Olive Street 6/17/18 910 NA Olive St
13 13 10 Sushi Ai 910 Olive Street 8/25/18 910 NA Olive St
14 14 10 Sushi Ai 910 Olive Street 10/30/18 910 NA Olive St
15 15 11 SUSHI KOI 4 N EUCLID AVE 1/17/18 4 N Euclid Ave
The replacement process includes an unnest argument, which will convert the house range list-columns into individual observations.
With our addresses replaced, we can then rebuild the address strings with pm_rebuild():
> sushi2_parsed <- pm_rebuild(sushi2_parsed, start = pm.house, end = pm.streetSuf, keep_parsed = "no")
>
> sushi2_parsed
# A tibble: 15 x 4
name address visit pm.address
<chr> <chr> <chr> <chr>
1 BaiKu Sushi Lounge 3407 Olive St 3/20/18 3407 Olive St
2 Cafe Mochi 3221 S Grand Boulevard 10/10/18 3221 S Grand Blvd
3 Drunken Fish - Central West End 1 Maryland Plaza 12/2/18 1 Maryland Plz
4 Kampai Sushi Bar 4949 W Pine Blvd 2/13/18 4949 W Pine Blvd
5 Midtown Sushi & Ramen 3674 Forest Park Ave 3/4/18 3674 Forest Park Ave
6 Mizu Sushi Bar 1013 Washington Avenue 9/12/18 1013 Washington Ave
7 Sapporo 2 3043 Olive St 3/1/18 3043 Olive St
8 Sapporo 2 3043 Olive St 7/3/18 3043 Olive St
9 Sub Zero Vodka Bar 308 N Euclid Ave 12/7/18 308 N Euclid Ave
10 Sushi Ai 910 Olive St 3/29/18 910 Olive St
11 Sushi Ai 910 Olive St 5/20/18 910 Olive St
12 Sushi Ai 910 Olive Street 6/17/18 910 Olive St
13 Sushi Ai 910 Olive Street 8/25/18 910 Olive St
14 Sushi Ai 910 Olive Street 10/30/18 910 Olive St
15 SUSHI KOI 4 N EUCLID AVE 1/17/18 4 N Euclid Ave
The keep_parsed argument has options to retain some data, including city, state, and postal code, when a full address was present in the source data (use keep_parsed = "limited") or to keep all parsed data (with keep_parsed = "yes"). The keep_ids argument will retain the pm.id and pm.uid variables. Finally, there is a new_address argument to specify the name of the new variable with the rebuilt address. If not specified, pm.address is used as the default variable name.
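Conceptually, the replace and rebuild steps amount to a lookup by unique identifier followed by pasting the parsed pieces together. The base R sketch below illustrates that idea only; the column names (uniq_id, house, dir, street, suffix, rebuilt) are hypothetical stand-ins, and the code does not reproduce how pm_replace() or pm_rebuild() are implemented.

```r
# Source rows: two visits share one unique address id
source_data <- data.frame(
  uniq_id = c(1L, 1L, 2L),
  name    = c("Sapporo 2", "Sapporo 2", "Sub Zero Vodka Bar"),
  stringsAsFactors = FALSE
)

# Parsed results: one row per unique address
parsed <- data.frame(
  uniq_id = c(1L, 2L),
  house   = c("3043", "308"),
  dir     = c(NA, "N"),
  street  = c("Olive", "Euclid"),
  suffix  = c("St", "Ave"),
  stringsAsFactors = FALSE
)

# Replace: look up each source row's parsed record by its unique id
idx <- match(source_data$uniq_id, parsed$uniq_id)
combined <- cbind(source_data, parsed[idx, c("house", "dir", "street", "suffix")])
rownames(combined) <- NULL

# Rebuild: paste the non-missing pieces back into a single string
pieces <- as.matrix(combined[, c("house", "dir", "street", "suffix")])
combined$rebuilt <- unname(apply(pieces, 1, function(p) paste(p[!is.na(p)], collapse = " ")))

combined[, c("name", "rebuilt")]
```

Dropping missing pieces before pasting is what keeps an address without a directional from picking up a literal "NA" in the rebuilt string.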
If you are new to R itself, welcome! Hadley Wickham’s R for Data Science is an excellent way to get started with data manipulation in the tidyverse, which postmastr is designed to integrate seamlessly with.
If you have questions about postmastr, you are encouraged to use the RStudio Community forums. Please create a reprex before posting. Feel free to tag Chris (@chris.prener) in any posts about postmastr.
If you think you have found a bug, please create a reprex and then open an issue on GitHub.