Address Parsing in R

postmastr is designed to be an opinionated toolkit for parsing street addresses using R. It was originally created to standardize addresses prior to geocoding, in an effort to increase the geocoder’s ability to correctly match a given address with the appropriate coordinates.

A Grammar of Street Addresses

The Anatomy of an American Street Address

While street addresses have a significant amount of variation, they also are ordered in a relatively standardized fashion. Take, for example, an address from sushi1 (one of the example data sets in postmastr): 601 Clark Ave Suite 104, St. Louis, MO 63102-1719. We can break down the address in the following way:

house number - 601
street name - Clark
street suffix - Ave
unit type - Suite
unit number - 104
city - St. Louis
state - MO
postal code - 63102-1719

There is sometimes additional information as well. Imagine that the house number in the example above was 601-603 instead. We could break this more complex number down in the following way:

house number - 601-603
house number lower value - 601
house number upper value - 603

Another permutation has to do with the house number suffix. Imagine a house number that is 601R or 601 Rear:

house number - 601
house number suffix - R

Finally, streets in the United States sometimes contain either a street prefix direction (601 North Clark Ave) or a street suffix direction (601 Clark Ave North).

This basic anatomy of a street address forms our grammar of street addresses - a specific language for thinking about how addresses are formed and therefore can be parsed.
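To make the grammar concrete, here is a minimal base R sketch that splits the example address into its elements with a single regular expression. It is an illustration only, not how postmastr parses addresses: the pattern and the element names are assumptions, and the expression only works for an address that contains every element in exactly this order.

```r
# Illustration only: one regular expression for a single, fully populated
# address. It will not generalize to messier inputs.
address <- "601 Clark Ave Suite 104, St. Louis, MO 63102-1719"

pattern <- paste0(
  "^(\\d+) ",               # house number
  "(.+) ",                  # street name
  "(\\w+) ",                # street suffix
  "(\\w+) (\\w+), ",        # unit type, unit number
  "(.+), ",                 # city
  "([A-Z]{2}) ",            # state
  "(\\d{5}(?:-\\d{4})?)$"   # postal code, with optional zip+4
)

# regexec() returns the full match followed by each captured group;
# dropping the first element leaves only the groups
parts <- regmatches(address, regexec(pattern, address, perl = TRUE))[[1]][-1]
names(parts) <- c("house", "street", "suffix", "unit_type",
                  "unit_num", "city", "state", "postal")
parts
```

A single pattern like this breaks as soon as an element is missing or formatted differently, which is why a step-by-step, dictionary-driven approach is more robust for real data.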

Basic Organization of `postmastr`

The functions that parse our grammar of street addresses can be grouped in two ways. All functions begin with the prefix pm_ in order to take advantage of RStudio’s auto-complete functionality.

First, we have major groups of functions based on their associated grammatical element:

house - house number
houseAlpha - alphanumeric house number
houseFrac - fractional house number
street - street name
streetDir - street prefix and suffix direction
streetSuf - street suffix
unit - unit name and number
city - city
state - state
postal - postal code

For each group of functions, we have a similar menu of options that describe the verb (action) the function implements. For the state family of functions, for instance:

pm_state_detect() - does a given street address contain a state name or abbreviation?
pm_state_any() - does any street address contain a state name or abbreviation?
pm_state_all() - do all street addresses contain a state name or abbreviation?
pm_state_none() - returns a tibble of street addresses that do not contain a state name or abbreviation
pm_state_parse() - parses street addresses that do contain a state name or abbreviation
pm_state_std() - standardizes the parsed state data to return upper-case abbreviations
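The detect, any, all, and none verbs listed above can be thought of as simple vector operations over a dictionary match. The following base R sketch illustrates the idea; the addresses, the two-entry dictionary, and the matching rule (a case-sensitive match at the end of the string) are assumptions made for illustration, not postmastr's implementation.

```r
addresses <- c("3407 Olive St St. Louis Missouri",
               "6335 Delmar Blvd St. Louis MO",
               "16 SOUTH CENTRAL AVE CLAYTON MISSOURI")

# Hypothetical minimal dictionary of accepted state inputs
state_inputs <- c("MO", "Missouri")

# One pattern that matches any dictionary entry at the end of the string
pattern <- paste0("\\b(", paste(state_inputs, collapse = "|"), ")$")

detected  <- grepl(pattern, addresses, perl = TRUE)  # "detect": one logical per address
any_state <- any(detected)                           # "any": is at least one matched?
all_state <- all(detected)                           # "all": are all matched?
unmatched <- addresses[!detected]                    # "none": the unmatched addresses
```

Because the match is case-sensitive, the upper-case "MISSOURI" is not detected, so `all_state` is FALSE and `unmatched` points to the entry the dictionary is missing.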

Creating Dictionaries

Dictionaries are a critical component of the postmastr workflow because they allow you to define the terms that need to be parsed and standardized. Using well-defined dictionaries can speed up the parsing process and ensure that it is accurate.

State Dictionaries

postmastr comes with a built-in dictionary of states and their abbreviations that can be expanded and filtered using pm_dictionary() and pm_append(). On its own, pm_dictionary() will return that built-in dictionary with a variety of entries based on what we specify for case:

> pm_dictionary(type = "state", case = "title", locale = "us")
# A tibble: 124 x 2
   state.output state.input                                     
   <chr>        <chr>                                           
 1 AA           AA                                              
 2 AA           Armed Forces Americas                           
 3 AE           AE                                              
 4 AE           Armed Forces Europe, the Middle East, and Canada
 5 AK           AK                                              
 6 AK           Alaska                                          
 7 AL           AL                                              
 8 AL           Alabama                                         
 9 AP           AP                                              
10 AP           Armed Forces Pacific                            
# … with 114 more rows

We can customize this mix of cases by using "upper" and "lower" as well. For example:

> pm_dictionary(type = "state", case = c("title", "upper"), locale = "us")
# A tibble: 186 x 2
   state.output state.input                                     
   <chr>        <chr>                                           
 1 AA           AA                                              
 2 AA           Armed Forces Americas                           
 3 AA           ARMED FORCES AMERICAS                           
 4 AE           AE                                              
 5 AE           Armed Forces Europe, the Middle East, and Canada
 6 AE           ARMED FORCES EUROPE, THE MIDDLE EAST, AND CANADA
 7 AK           AK                                              
 8 AK           Alaska                                          
 9 AK           ALASKA                                          
10 AL           AL                                              
# … with 176 more rows

If only a subset of states are included in your data, you can improve postmastr’s performance by limiting your state dictionary’s contents. The filter argument can accept scalar or vector inputs of two-letter state abbreviations. For instance, we could construct a state dictionary that contains only the states along the Gulf of Mexico:

> pm_dictionary(type = "state", filter = c("AL", "FL", "LA", "MS", "TX"), case = "title", locale = "us")
# A tibble: 10 x 2
   state.output state.input
   <chr>        <chr>      
 1 AL           AL         
 2 AL           Alabama    
 3 FL           FL         
 4 FL           Florida    
 5 LA           LA         
 6 LA           Louisiana  
 7 MS           MS         
 8 MS           Mississippi
 9 TX           TX         
10 TX           Texas

The state dictionary can also be expanded. For instance, there are common abbreviations for Mississippi, such as “Miss” and “MISS”. We can create an appendix with pm_append():

> miss <- pm_append(type = "state", input = "Miss", output = "MS", locale = "us")

Once this has been created, we can combine it with the default state dictionary to create our custom dictionary output:

> pm_dictionary(type = "state", append = miss, 
+     filter = c("AL", "FL", "LA", "MS", "TX"), 
+     case = c("title", "upper"), locale = "us")
# A tibble: 17 x 2
   state.output state.input
   <chr>        <chr>      
 1 AL           AL         
 2 AL           Alabama    
 3 AL           ALABAMA    
 4 FL           FL         
 5 FL           Florida    
 6 FL           FLORIDA    
 7 LA           LA         
 8 LA           Louisiana  
 9 LA           LOUISIANA  
10 MS           MS         
11 MS           Mississippi
12 MS           Miss       
13 MS           MISSISSIPPI
14 MS           MISS       
15 TX           TX         
16 TX           Texas      
17 TX           TEXAS

All of the different inputs for Mississippi will now return the same MS output.
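Conceptually, a dictionary of this kind is a two-column lookup table built by filtering a base list, expanding it by case, and adding any appended entries. The base R sketch below reproduces that logic with R's built-in `state.abb` and `state.name` vectors. The helper `build_dict()` and its arguments are hypothetical and are not part of postmastr; the built-in vectors also cover only the 50 states, so the row counts differ from the unfiltered postmastr dictionary.

```r
# Hypothetical helper: build a state lookup table from base R's state vectors
build_dict <- function(filter, case = "title", extra = NULL) {
  keep <- state.abb %in% filter

  # Abbreviations always map to themselves
  abbs <- data.frame(state.output = state.abb[keep],
                     state.input  = state.abb[keep],
                     stringsAsFactors = FALSE)

  # Full names, plus any appended inputs
  named <- data.frame(state.output = state.abb[keep],
                      state.input  = state.name[keep],
                      stringsAsFactors = FALSE)
  if (!is.null(extra)) named <- rbind(named, extra)

  pieces <- list(abbs)
  if ("title" %in% case) pieces <- c(pieces, list(named))
  if ("upper" %in% case) {
    pieces <- c(pieces, list(transform(named, state.input = toupper(state.input))))
  }

  out <- unique(do.call(rbind, pieces))
  out[order(out$state.output), ]
}

# Appendix entry, analogous in spirit to the pm_append() call above
miss <- data.frame(state.output = "MS", state.input = "Miss",
                   stringsAsFactors = FALSE)

gulf <- build_dict(c("AL", "FL", "LA", "MS", "TX"),
                   case = c("title", "upper"), extra = miss)

# Standardizing is then a simple lookup
raw <- c("Miss", "MISSISSIPPI", "Texas")
standardized <- gulf$state.output[match(raw, gulf$state.input)]
```

With five states, two cases, and one appended input, the sketch yields the same 17 rows as the output shown above.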

Other Dictionaries

The only required dictionary for U.S. street addresses is the city dictionary, because municipalities and place names in the U.S. are so numerous that we cannot build a single, efficient object to use as a default. postmastr comes with default U.S. dictionaries for states (shown above), street directionals, street suffixes, and unit types. It is also possible to build street name dictionaries to correct common misspellings and to create dictionaries for house suffix values (e.g. “Front” or “Rear” addresses).

Sample Data

To illustrate the core components of the postmastr workflow, we’ll use some data on sushi restaurants in the St. Louis, Missouri region. These are “long” data - some restaurants appear multiple times. Here is a quick preview of the data:

> sushi1
# A tibble: 30 x 3
   name                            address                                           visit   
   <chr>                           <chr>                                             <chr>   
 1 BaiKu Sushi Lounge              3407 Olive St, St. Louis, Missouri 63103          3/20/18 
 2 Blue Ocean Restaurant           6335 Delmar Blvd, St. Louis, MO 63112             10/26/18
 3 Cafe Mochi                      3221 S Grand Boulevard, St. Louis, MO 63118       10/10/18
 4 Drunken Fish - Ballpark Village 601 Clark Ave #104, St. Louis, MO 63102-1719      4/28/18 
 5 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 5/10/18 
 6 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 8/7/18  
 7 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108             12/2/18 
 8 I Love Mr Sushi                 9443 Olive Blvd, St. Louis, Missouri 63132        1/1/18  
 9 Kampai Sushi Bar                4949 W Pine Blvd, St. Louis, MO 63108             2/13/18 
10 Midtown Sushi & Ramen           3674 Forest Park Ave, St. Louis, MO 63108         3/4/18  
# … with 20 more rows

Some problems should already be apparent. For instance, Cafe Mochi uses the full word “Boulevard” while the entry for Blue Ocean uses the proper abbreviation “Blvd”. For Drunken Fish, the suite number is listed both using the pound sign (“#”) and with the word “Suite”. Some of the entries, including those for BaiKu and I Love Mr Sushi, use the full name “Missouri” while the rest use the proper two-letter abbreviation “MO”. Finally, Drunken Fish - Ballpark Village uses the “zip+4” format, whereas the remaining visible addresses contain only the five-digit zip-code.

Some of the other entries have additional issues. For example, Sushi Koi has its address fully capitalized:

> sushi1$address[[22]]
[1] "4 N EUCLID AVE, SAINT LOUIS, MO 63108"

Similarly, Wasabi Sushi Bar has its address fully capitalized, but the prefix direction “SOUTH” appears as a word rather than the proper “S”:

> sushi1$address[[30]]
[1] "16 SOUTH CENTRAL AVE, CLAYTON, MISSOURI 63105"

This vignette will walk through the process of addressing these issues.

Omnibus Parsing Functionality

The postmastr package has a single high-level function for parsing, pm_parse(), which wraps all of the preparatory, parsing, and reconstruction functions into a single call. This can be used if the problem-space is exceptionally well defined - all dictionaries need to be created ahead of time, so you must know what dictionary elements are necessary. This may be possible for you if you consistently work with specific data sources and have a good understanding of what cities, states, and other elements of the grammar are present. If you are not sure what dictionary elements are needed, you will need to use the workflow illustrated below to develop these objects.

For the sushi1 data, the required dictionaries are:

> dirs <- pm_dictionary(type = "directional", filter = c("N", "S", "E", "W"), locale = "us")
> mo <- pm_dictionary(type = "state", filter = "MO", case = c("title", "upper"), locale = "us")
> cities <- pm_append(type = "city",
+                       input = c("Brentwood", "Clayton", "CLAYTON", "Maplewood", 
+                                 "St. Louis", "SAINT LOUIS", "Webster Groves"),
+                       output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA))

Once those dictionaries are built, we can parse the data. The input argument is used to define how the address data are structured, and output is used to specify what type of output you receive.

> sushi1 %>%
+   dplyr::filter(name != "Drunken Fish - Ballpark Village") %>%
+   pm_parse(input = "full", address = "address", output = "full",
+          dir_dict = dirs, city_dict = cities, state_dict = mo)
# A tibble: 27 x 4
   name                            address                                        visit    pm.address                                  
   <chr>                           <chr>                                          <chr>    <chr>                                       
 1 BaiKu Sushi Lounge              3407 Olive St, St. Louis, Missouri 63103       3/20/18  3407 Olive St St. Louis MO 63103            
 2 Blue Ocean Restaurant           6335 Delmar Blvd, St. Louis, MO 63112          10/26/18 6335 Delmar Blvd St. Louis MO 63112         
 3 Cafe Mochi                      3221 S Grand Boulevard, St. Louis, MO 63118    10/10/18 3221 S Grand Blvd St. Louis MO 63118        
 4 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108          12/2/18  1 Maryland Plz St. Louis MO 63108           
 5 I Love Mr Sushi                 9443 Olive Blvd, St. Louis, Missouri 63132     1/1/18   9443 Olive Blvd St. Louis MO 63132          
 6 Kampai Sushi Bar                4949 W Pine Blvd, St. Louis, MO 63108          2/13/18  4949 W Pine Blvd St. Louis MO 63108         
 7 Midtown Sushi & Ramen           3674 Forest Park Ave, St. Louis, MO 63108      3/4/18   3674 Forest Park Ave St. Louis MO 63108     
 8 Mizu Sushi Bar                  1013 Washington Avenue, St. Louis, MO 63101    9/12/18  1013 Washington Ave St. Louis MO 63101      
 9 Robata Maplewood                7260 Manchester Road, Maplewood, MO 63143      11/1/18  7260 Manchester Rd Maplewood MO 63143       
10 SanSai Japanese Grill Maplewood 1803 Maplewood Commons Dr, St. Louis, MO 63143 2/14/18  1803 Maplewood Commons Dr St. Louis MO 63143
# … with 17 more rows

We can limit our output to just the house and street data:

> sushi1 %>%
+   dplyr::filter(name != "Drunken Fish - Ballpark Village") %>%
+   pm_parse(input = "full", address = "address", output = "short",
+          dir_dict = dirs, city_dict = cities, state_dict = mo)
# A tibble: 27 x 4
   name                            address                                        visit    pm.address               
   <chr>                           <chr>                                          <chr>    <chr>                    
 1 BaiKu Sushi Lounge              3407 Olive St, St. Louis, Missouri 63103       3/20/18  3407 Olive St            
 2 Blue Ocean Restaurant           6335 Delmar Blvd, St. Louis, MO 63112          10/26/18 6335 Delmar Blvd         
 3 Cafe Mochi                      3221 S Grand Boulevard, St. Louis, MO 63118    10/10/18 3221 S Grand Blvd        
 4 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108          12/2/18  1 Maryland Plz           
 5 I Love Mr Sushi                 9443 Olive Blvd, St. Louis, Missouri 63132     1/1/18   9443 Olive Blvd          
 6 Kampai Sushi Bar                4949 W Pine Blvd, St. Louis, MO 63108          2/13/18  4949 W Pine Blvd         
 7 Midtown Sushi & Ramen           3674 Forest Park Ave, St. Louis, MO 63108      3/4/18   3674 Forest Park Ave     
 8 Mizu Sushi Bar                  1013 Washington Avenue, St. Louis, MO 63101    9/12/18  1013 Washington Ave      
 9 Robata Maplewood                7260 Manchester Road, Maplewood, MO 63143      11/1/18  7260 Manchester Rd       
10 SanSai Japanese Grill Maplewood 1803 Maplewood Commons Dr, St. Louis, MO 63143 2/14/18  1803 Maplewood Commons Dr
# … with 17 more rows

We can also add the city, state, and postal code data as separate columns:

> sushi1 %>%
+   dplyr::filter(name != "Drunken Fish - Ballpark Village") %>%
+   pm_parse(input = "full", address = "address", output = "short", keep_parsed = "limited", 
+          dir_dict = dirs, city_dict = cities, state_dict = mo)
# A tibble: 27 x 8
   name                            address                                        visit    pm.address                pm.city   pm.state pm.zip pm.zip4
   <chr>                           <chr>                                          <chr>    <chr>                     <chr>     <chr>    <chr>  <chr>  
 1 BaiKu Sushi Lounge              3407 Olive St, St. Louis, Missouri 63103       3/20/18  3407 Olive St             St. Louis MO       63103  NA     
 2 Blue Ocean Restaurant           6335 Delmar Blvd, St. Louis, MO 63112          10/26/18 6335 Delmar Blvd          St. Louis MO       63112  NA     
 3 Cafe Mochi                      3221 S Grand Boulevard, St. Louis, MO 63118    10/10/18 3221 S Grand Blvd         St. Louis MO       63118  NA     
 4 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108          12/2/18  1 Maryland Plz            St. Louis MO       63108  NA     
 5 I Love Mr Sushi                 9443 Olive Blvd, St. Louis, Missouri 63132     1/1/18   9443 Olive Blvd           St. Louis MO       63132  NA     
 6 Kampai Sushi Bar                4949 W Pine Blvd, St. Louis, MO 63108          2/13/18  4949 W Pine Blvd          St. Louis MO       63108  NA     
 7 Midtown Sushi & Ramen           3674 Forest Park Ave, St. Louis, MO 63108      3/4/18   3674 Forest Park Ave      St. Louis MO       63108  NA     
 8 Mizu Sushi Bar                  1013 Washington Avenue, St. Louis, MO 63101    9/12/18  1013 Washington Ave       St. Louis MO       63101  NA     
 9 Robata Maplewood                7260 Manchester Road, Maplewood, MO 63143      11/1/18  7260 Manchester Rd        Maplewood MO       63143  NA     
10 SanSai Japanese Grill Maplewood 1803 Maplewood Commons Dr, St. Louis, MO 63143 2/14/18  1803 Maplewood Commons Dr St. Louis MO       63143  NA     
# … with 17 more rows

The `postmastr` Workflow

If you do not have a strong sense of the problem-space you are confronted with, and therefore do not know exactly what is needed for dictionaries, postmastr has a full-featured workflow for step-by-step parsing of street addresses.

Order of Operations in `postmastr`

postmastr’s functionality rests on an order of operations that must be followed to ensure correct parsing:

prep
postal code
state
city
unit
house number
ranged house number
fractional house number
house suffix
street directionals
street suffix
street name
reconstruct

If no street address in the given data set contains one of the grammatical elements (for example, if none of the addresses contain units), the associated functions in the workflow can be skipped. For “short” addresses that do not include cities, states, or postal codes, the order of operations should be:

prep
unit
house number
ranged house number
fractional house number
house suffix
street directionals
street suffix
street name
reconstruct
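One way to see why the order matters is that the postal code, state, and city sit at the end of the string, so each can be removed in turn to expose the next element. The base R sketch below illustrates this right-to-left peeling under stated assumptions: `peel()` is a hypothetical helper and the patterns are hard-coded stand-ins for dictionaries, not postmastr functions.

```r
address <- "3407 Olive St St. Louis Missouri 63103"

# Hypothetical helper: pull a trailing match off the string and
# return both the matched value and the remaining text
peel <- function(x, pattern) {
  hit <- regmatches(x, regexpr(pattern, x, perl = TRUE))
  list(value = if (length(hit) == 0) NA_character_ else trimws(hit),
       rest  = trimws(sub(pattern, "", x, perl = TRUE)))
}

step1 <- peel(address,    "\\s\\d{5}(-\\d{4})?$")        # postal code first
step2 <- peel(step1$rest, "\\s(MO|Missouri)$")           # then state
step3 <- peel(step2$rest, "\\s(St\\. Louis|Clayton)$")   # then city

c(postal = step1$value, state = step2$value,
  city = step3$value, street = step3$rest)
```

If the state step ran before the postal code was removed, the end-of-string pattern for the state would not match, because the string would still end in the zip-code.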

Prep

There are two initial preparatory steps that must be taken with these data. First, we want to ensure that they have a unique identification number for each row (to preserve the original sort order; pm.id) as well as a unique identification number for each unique street address string (pm.uid). These can both be applied using pm_identify():

> sushi1 <- pm_identify(sushi1, var = "address")
> sushi1
# A tibble: 30 x 5
   pm.id pm.uid name                            address                                           visit   
   <int>  <int> <chr>                           <chr>                                             <chr>   
 1     1      1 BaiKu Sushi Lounge              3407 Olive St, St. Louis, Missouri 63103          3/20/18 
 2     2      2 Blue Ocean Restaurant           6335 Delmar Blvd, St. Louis, MO 63112             10/26/18
 3     3      3 Cafe Mochi                      3221 S Grand Boulevard, St. Louis, MO 63118       10/10/18
 4     4      4 Drunken Fish - Ballpark Village 601 Clark Ave #104, St. Louis, MO 63102-1719      4/28/18 
 5     5      5 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 5/10/18 
 6     6      5 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 8/7/18  
 7     7      6 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108             12/2/18 
 8     8      7 I Love Mr Sushi                 9443 Olive Blvd, St. Louis, Missouri 63132        1/1/18  
 9     9      8 Kampai Sushi Bar                4949 W Pine Blvd, St. Louis, MO 63108             2/13/18 
10    10      9 Midtown Sushi & Ramen           3674 Forest Park Ave, St. Louis, MO 63108         3/4/18  
# … with 20 more rows

Notice that the Drunken Fish has three different unique identifiers applied for pm.uid - two for the Ballpark Village location based on how the suite number is indicated and one for the Central West End location. Since address data are often numerous, postmastr is designed to operate on unique street address strings rather than the full original data set to improve efficiency. We’ll create our minimal postmastr object using pm_prep():

> sushi1_min <- pm_prep(sushi1, var = "address")
> sushi1_min
# A tibble: 24 x 2
   pm.uid pm.address                                     
    <int> <chr>                                          
 1      1 3407 Olive St St. Louis Missouri 63103         
 2      2 6335 Delmar Blvd St. Louis MO 63112            
 3      3 3221 S Grand Boulevard St. Louis MO 63118      
 4      4 601 Clark Ave #104 St. Louis MO 63102-1719     
 5      5 601 Clark Ave Suite 104 St. Louis MO 63102-1719
 6      6 1 Maryland Plaza St. Louis MO 63108            
 7      7 9443 Olive Blvd St. Louis Missouri 63132       
 8      8 4949 W Pine Blvd St. Louis MO 63108            
 9      9 3674 Forest Park Ave St. Louis MO 63108        
10     10 1013 Washington Avenue St. Louis MO 63101      
# … with 14 more rows

Notice that all extraneous information has been removed, and that there are now only 24 rows instead of the original 30.
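The idea behind the two identifiers can be sketched in base R. This is an illustration under stated assumptions rather than postmastr's implementation: the column names mirror the output above, and the comma stripping is inferred from the printed pm.address values.

```r
df <- data.frame(
  name    = c("Drunken Fish - Ballpark Village",
              "Drunken Fish - Ballpark Village",
              "Drunken Fish - Ballpark Village",
              "Drunken Fish - Central West End"),
  address = c("601 Clark Ave #104, St. Louis, MO 63102-1719",
              "601 Clark Ave Suite 104, St. Louis, MO 63102-1719",
              "601 Clark Ave Suite 104, St. Louis, MO 63102-1719",
              "1 Maryland Plaza, St. Louis, MO 63108"),
  stringsAsFactors = FALSE
)

# pm.id preserves the original row order
df$pm.id <- seq_len(nrow(df))

# pm.uid is shared by rows with an identical address string
df$pm.uid <- match(df$address, unique(df$address))

# Minimal object: one row per unique address, commas stripped
min_df <- unique(df[, c("pm.uid", "address")])
min_df$pm.address <- gsub(",", "", min_df$address, fixed = TRUE)
min_df <- min_df[, c("pm.uid", "pm.address")]
```

Parsing then runs once per row of `min_df`, and the results can be joined back to the full data on `pm.uid`.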

Postal Codes

Once we have our data prepared, we can begin working our way down the order of operations list. To see if it is possible to skip a step, we should first use the appropriate pm_any_ function. With the sushi1_min data, it will return TRUE because postal codes are present in the data:

> pm_any_postal(sushi1_min)
[1] TRUE

We can also use pm_all_ functions to determine whether all of the addresses have a postal code:

> pm_all_postal(sushi1_min)
[1] TRUE

If this returned a FALSE result, we would want to use the pm_has_ and pm_no_ functions to explore whether postal codes are not being detected because they (a) actually do not exist or (b) are mis-formatted. Since we get a TRUE result, we can move on to parsing. If pm_parse_postal() detects the presence of zip+4 add-on codes (the four-digit additions to the typical five-digit zip-codes), it will parse those as well so that two postal code columns (pm.zip and pm.zip4) are returned:

> sushi1_min <- pm_parse_postal(sushi1_min)
> sushi1_min
# A tibble: 24 x 4
   pm.uid pm.address                           pm.zip pm.zip4
    <int> <chr>                                <chr>  <chr>  
 1      1 3407 Olive St St. Louis Missouri     63103  NA     
 2      2 6335 Delmar Blvd St. Louis MO        63112  NA     
 3      3 3221 S Grand Boulevard St. Louis MO  63118  NA     
 4      4 601 Clark Ave #104 St. Louis MO      63102  1719   
 5      5 601 Clark Ave Suite 104 St. Louis MO 63102  1719   
 6      6 1 Maryland Plaza St. Louis MO        63108  NA     
 7      7 9443 Olive Blvd St. Louis Missouri   63132  NA     
 8      8 4949 W Pine Blvd St. Louis MO        63108  NA     
 9      9 3674 Forest Park Ave St. Louis MO    63108  NA     
10     10 1013 Washington Avenue St. Louis MO  63101  NA     
# … with 14 more rows

Had no zip+4 add-on codes been present, these data would have been returned with only the pm.zip column. Note that pm_parse_postal() also updates the pm.address column and removes any postal codes that have been identified. This facilitates additional parsing in subsequent steps.
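The behavior described here can be approximated in base R with regular expressions. The sketch below is an assumption about the logic, not postmastr's source: it expects the postal code at the very end of the string, in either five-digit or zip+4 form.

```r
x <- c("601 Clark Ave #104 St. Louis MO 63102-1719",
       "1 Maryland Plaza St. Louis MO 63108",
       "3407 Olive St St. Louis MO")

zip_pattern <- "\\s(\\d{5})(-\\d{4})?$"

has_zip  <- grepl(zip_pattern, x, perl = TRUE)
has_zip4 <- grepl("\\s\\d{5}-\\d{4}$", x, perl = TRUE)

# Five-digit code: first captured group of the trailing match
pm.zip <- ifelse(has_zip,
                 sub(paste0("^.*", zip_pattern), "\\1", x, perl = TRUE),
                 NA_character_)

# Four-digit add-on: only filled in when the zip+4 form is present
pm.zip4 <- ifelse(has_zip4,
                  sub("^.*-(\\d{4})$", "\\1", x, perl = TRUE),
                  NA_character_)

# Remove the postal code so the state is now the last element
pm.address <- trimws(sub(zip_pattern, "", x, perl = TRUE))
```

Both columns are kept as character vectors so that leading zeros in zip-codes are preserved.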

States

To parse the states out of pm.address, we’ll start by creating a state-level dictionary object that contains only references to Missouri since we don’t have data outside of that state:

> moDict <- pm_dictionary(locale = "us", type = "state", filter = "MO", case = "title")
> moDict
# A tibble: 2 x 2
  state.output state.input
  <chr>        <chr>      
1 MO           MO         
2 MO           Missouri

With our dictionary created in the object moDict, we can use that to test whether state names or abbreviations are found at the end of our address string with pm_any_state() and pm_all_state():

> pm_any_state(sushi1_min, dictionary = moDict)
[1] TRUE
> pm_all_state(sushi1_min, dictionary = moDict)
[1] FALSE

These results indicate that our dictionary is returning matches, but that our list of possible inputs is not complete. We can explore the un-matched addresses with pm_no_state() to determine whether these observations (a) actually do not contain states or (b) are not being matched because an entry is missing from our dictionary:

> pm_no_state(sushi1_min, dictionary = moDict)
# A tibble: 1 x 4
  pm.uid pm.address                            pm.zip pm.zip4
   <int> <chr>                                 <chr>  <chr>  
1     24 16 SOUTH CENTRAL AVE CLAYTON MISSOURI 63105  NA

Our dictionary does not contain MISSOURI, only Missouri, so we’ll need to add that option. We can use case = c("title", "upper") to do this:

> moDict <- pm_dictionary(locale = "us", type = "state", filter = "MO", case = c("title", "upper"))
> moDict
# A tibble: 3 x 2
  state.output state.input
  <chr>        <chr>      
1 MO           MO         
2 MO           Missouri   
3 MO           MISSOURI

For more complex mis-matches, we would want to use pm_append() to construct an appendix and then re-build our dictionary with that appendix included. We can verify that our dictionary is now complete by repeating our use of pm_all_state():

> pm_all_state(sushi1_min, dictionary = moDict)
[1] TRUE

With a TRUE result, we are ready to parse state names and abbreviations out of our data:

> sushi1_min <- pm_parse_state(sushi1_min, dictionary = moDict)
> sushi1_min
# A tibble: 24 x 5
   pm.uid pm.address                        pm.state pm.zip pm.zip4
    <int> <chr>                             <chr>    <chr>  <chr>  
 1      1 3407 Olive St St. Louis           MO       63103  NA     
 2      2 6335 Delmar Blvd St. Louis        MO       63112  NA     
 3      3 3221 S Grand Boulevard St. Louis  MO       63118  NA     
 4      4 601 Clark Ave #104 St. Louis      MO       63102  1719   
 5      5 601 Clark Ave Suite 104 St. Louis MO       63102  1719   
 6      6 1 Maryland Plaza St. Louis        MO       63108  NA     
 7      7 9443 Olive Blvd St. Louis         MO       63132  NA     
 8      8 4949 W Pine Blvd St. Louis        MO       63108  NA     
 9      9 3674 Forest Park Ave St. Louis    MO       63108  NA     
10     10 1013 Washington Avenue St. Louis  MO       63101  NA     
# … with 14 more rows

If we inspect this object, we’ll see that pm.state contains the state abbreviation for Missouri in all cases. As with postal codes, the state names have been removed from pm.address to facilitate the next phase in parsing.

Cities

The workflow for cities is similar. Two options exist for American cities - you can either create a full list of cities by state using pm_dictionary() (useful if you are not sure which cities appear in the data) or use pm_append() on its own to create an appendix (faster if you know that there are a limited number of cities). This appendix for cities can be used on its own. We’ll choose this second option:

> cityDict <- pm_append(type = "city",
+                       input = c("Brentwood", "Clayton", "Maplewood", "St. Louis", "Webster Groves"))

We’ll then use pm_any_city() and pm_all_city() to verify that our dictionary is working and complete:

> pm_any_city(sushi1_min, dictionary = cityDict)
[1] TRUE
> pm_all_city(sushi1_min, dictionary = cityDict)
[1] FALSE

As with the state data, we can use pm_no_city() to identify why our dictionary is incomplete (or verify that some addresses do not contain cities):

> pm_no_city(sushi1_min, dictionary = cityDict)
# A tibble: 2 x 5
  pm.uid pm.address                   pm.state pm.zip pm.zip4
   <int> <chr>                        <chr>    <chr>  <chr>  
1     17 4 N EUCLID AVE SAINT LOUIS   MO       63108  NA     
2     24 16 SOUTH CENTRAL AVE CLAYTON MO       63105  NA

There are two city names in all upper-case. We’ll re-create our dictionary, adding both "SAINT LOUIS" and "CLAYTON" as inputs. We’ll also specify an output vector that is NA for all of the correctly formatted cities but includes a preferred output for our two incorrectly formatted cities. We can then verify using pm_all_city() that our dictionary is now complete:

> cityDict <- pm_append(type = "city",
+                       input = c("Brentwood", "Clayton", "CLAYTON", "Maplewood", 
+                                 "St. Louis", "SAINT LOUIS", "Webster Groves"),
+                       output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA))
> pm_all_city(sushi1_min, dictionary = cityDict)
[1] TRUE
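The role of the NA entries in the output vector can be sketched in base R: an NA output means the input is already in its preferred form, while a non-NA output replaces the input. The column names `city.input` and `city.output` are assumptions for this illustration and may not match the object that pm_append() returns.

```r
# Hypothetical appendix layout: input strings and their preferred outputs
city_dict <- data.frame(
  city.input  = c("Brentwood", "Clayton", "CLAYTON", "Maplewood",
                  "St. Louis", "SAINT LOUIS", "Webster Groves"),
  city.output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA),
  stringsAsFactors = FALSE
)

# City strings as they might be found at the end of each address
found <- c("SAINT LOUIS", "Clayton", "CLAYTON", "St. Louis")

# Look up the preferred output; fall back to the input when it is NA
out <- city_dict$city.output[match(found, city_dict$city.input)]
standardized <- ifelse(is.na(out), found, out)
```

This keeps the dictionary compact: only the variants that need correcting carry an explicit output.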

We can now move on to parsing using pm_parse_city():

> sushi1_min <- pm_parse_city(sushi1_min, dictionary = cityDict)
> sushi1_min
# A tibble: 24 x 6
   pm.uid pm.address              pm.city   pm.state pm.zip pm.zip4
    <int> <chr>                   <chr>     <chr>    <chr>  <chr>  
 1      1 3407 Olive St           St. Louis MO       63103  NA     
 2      2 6335 Delmar Blvd        St. Louis MO       63112  NA     
 3      3 3221 S Grand Boulevard  St. Louis MO       63118  NA     
 4      4 601 Clark Ave #104      St. Louis MO       63102  1719   
 5      5 601 Clark Ave Suite 104 St. Louis MO       63102  1719   
 6      6 1 Maryland Plaza        St. Louis MO       63108  NA     
 7      7 9443 Olive Blvd         St. Louis MO       63132  NA     
 8      8 4949 W Pine Blvd        St. Louis MO       63108  NA     
 9      9 3674 Forest Park Ave    St. Louis MO       63108  NA     
10     10 1013 Washington Avenue  St. Louis MO       63101  NA

The column pm.city has been added to our data set and city names have been parsed as well as standardized.

Units

This functionality is not enabled yet.

House Numbers

Since the example data do not contain fractional addresses or house suffix values, parsing them is straightforward (dealing with less common street addresses will be the subject of an additional vignette). We’ll use an abbreviated version of the sushi data that do not contain city, state, or postal code data. These are sushi restaurants located within the City of St. Louis proper:

> postmastr::sushi2 %>%
+   dplyr::filter(name != "Drunken Fish - Ballpark Village") %>%
+   pm_identify(var = address) -> sushi2
>
> sushi2_min <- pm_prep(sushi2, var = address)
>
> sushi2_min
# A tibble: 11 x 2
   pm.uid pm.address            
    <int> <chr>                 
 1      1 3407 Olive St         
 2      2 3221 S Grand Boulevard
 3      3 1 Maryland Plaza      
 4      4 4949 W Pine Blvd      
 5      5 3674 Forest Park Ave  
 6      6 1013 Washington Avenue
 7      7 3043 Olive St         
 8      8 308 N Euclid Ave      
 9      9 910 Olive St          
10     10 910 Olive Street      
11     11 4 N EUCLID AVE

Our next task in the order of operations is to parse out house numbers. We do this with pm_house_parse():

> sushi2_min <- pm_house_parse(sushi2_min)
>
> sushi2_min
# A tibble: 11 x 3
   pm.uid pm.address        pm.house
    <int> <chr>             <chr>   
 1      1 Olive St          3407    
 2      2 S Grand Boulevard 3221    
 3      3 Maryland Plaza    1       
 4      4 W Pine Blvd       4949    
 5      5 Forest Park Ave   3674    
 6      6 Washington Avenue 1013    
 7      7 Olive St          3043    
 8      8 N Euclid Ave      308     
 9      9 Olive St          910     
10     10 Olive Street      910     
11     11 N EUCLID AVE      4

There is no dictionary for this step - the first word of any address will be parsed out so long as it contains at least some numbers. If your house number contains a range (123-125 Main St) or fractional value (123 1/2 Main St), there are separate sets of functions, pm_houseRange_ and pm_houseFrac_, for dealing with these special cases.
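The rule described above (take the first word if it contains at least one digit) can be sketched in base R as follows. This is an illustration of the rule as stated, not postmastr's implementation, and the handling of the ranged value at the end is likewise only an assumption about how a range could be split into lower and upper values.

```r
x <- c("3407 Olive St", "4 N EUCLID AVE", "Olive St")

# First word of each address
first <- sub("\\s.*$", "", x, perl = TRUE)

# Only treat it as a house number if it contains a digit
has_num  <- grepl("[0-9]", first)
pm.house <- ifelse(has_num, first, NA_character_)

# Drop the house number from the address when one was found
pm.address <- ifelse(has_num,
                     trimws(sub("^\\S+", "", x, perl = TRUE)),
                     x)

# A ranged house number can be split into lower and upper values
bounds <- as.integer(strsplit("123-125", "-", fixed = TRUE)[[1]])
```

House numbers are stored as character values, which leaves room for alphanumeric forms such as 601R.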

Street Prefix and Suffix Data

Our addresses have two types of prefix and suffix data. There are directionals, like the “south” in S Grand Boulevard, and the suffix value, which is “Boulevard”. The United States Postal Service prefers abbreviations for both directionals and suffix values, and so postmastr returns the preferred abbreviations whenever full names are found. We can parse out the directionals first with pm_streetDir_parse():

> sushi2_min <- pm_streetDir_parse(sushi2_min, dictionary = dirs)
>
> sushi2_min
# A tibble: 11 x 4
   pm.uid pm.address        pm.house pm.preDir
    <int> <chr>             <chr>    <chr>    
 1      1 Olive St          3407     NA       
 2      2 Grand Boulevard   3221     S        
 3      3 Maryland Plaza    1        NA       
 4      4 Pine Blvd         4949     W        
 5      5 Forest Park Ave   3674     NA       
 6      6 Washington Avenue 1013     NA       
 7      7 Olive St          3043     NA       
 8      8 Euclid Ave        308      N        
 9      9 Olive St          910      NA       
10     10 Olive Street      910      NA       
11     11 EUCLID AVE        4        N

postmastr will automatically detect both prefix and suffix directionals (e.g. S Grand Boulevard and Grand Boulevard S).
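To make the prefix and suffix detection concrete, here is a minimal base R sketch. It assumes a tiny hand-written lookup (dir_lookup) in place of a real directional dictionary, and parse_dir() and the pm.sufDir column name are illustrative assumptions rather than postmastr's implementation.

```r
# Illustrative lookup only; a real directional dictionary is built separately.
dir_lookup <- c(north = "N", n = "N", south = "S", s = "S",
                east = "E", e = "E", west = "W", w = "W")

# Hypothetical helper, not part of postmastr
parse_dir <- function(address) {
  words <- strsplit(address, " ", fixed = TRUE)
  out <- lapply(words, function(w) {
    pre <- NA_character_
    suf <- NA_character_
    # Only treat a leading or trailing word as a directional when other
    # words remain to serve as the rest of the address.
    if (length(w) > 1 && tolower(w[1]) %in% names(dir_lookup)) {
      pre <- unname(dir_lookup[tolower(w[1])])
      w <- w[-1]
    }
    if (length(w) > 1 && tolower(w[length(w)]) %in% names(dir_lookup)) {
      suf <- unname(dir_lookup[tolower(w[length(w)])])
      w <- w[-length(w)]
    }
    data.frame(pm.address = paste(w, collapse = " "),
               pm.preDir = pre, pm.sufDir = suf,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

dirs_parsed <- parse_dir(c("S Grand Boulevard", "Grand Boulevard S", "Olive St"))
dirs_parsed
```

Both spellings of the Grand Boulevard address end up with the same remaining pm.address value, with the directional stored in the prefix or suffix column respectively.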

Once directionals have been parsed, we can parse and standardize the street suffix values as well with pm_streetSuf_parse():

> sushi2_min <- pm_streetSuf_parse(sushi2_min)
>
> sushi2_min
# A tibble: 11 x 5
   pm.uid pm.address  pm.house pm.preDir pm.streetSuf
    <int> <chr>       <chr>    <chr>     <chr>       
 1      1 Olive       3407     NA        St          
 2      2 Grand       3221     S         Blvd        
 3      3 Maryland    1        NA        Plz         
 4      4 Pine        4949     W         Blvd        
 5      5 Forest Park 3674     NA        Ave         
 6      6 Washington  1013     NA        Ave         
 7      7 Olive       3043     NA        St          
 8      8 Euclid      308      N         Ave         
 9      9 Olive       910      NA        St          
10     10 Olive       910      NA        St          
11     11 EUCLID      4        N         Ave

If a street has a directional name (e.g. North Ave), the word North will be added back into pm.address after the street suffix is parsed out.
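Suffix standardization amounts to a dictionary lookup on the last word of what remains in the address. The sketch below assumes a small illustrative lookup (suf_lookup) containing only the suffixes seen in the example data; parse_suffix() is a hypothetical helper and does not reflect how postmastr is written internally.

```r
# Illustrative subset of suffix spellings mapped to the abbreviations
# shown in the output above.
suf_lookup <- c(street = "St", st = "St", avenue = "Ave", ave = "Ave",
                boulevard = "Blvd", blvd = "Blvd", plaza = "Plz", plz = "Plz")

# Hypothetical helper, not part of postmastr
parse_suffix <- function(address) {
  # final word of each address, lower-cased for matching
  last <- tolower(sub("^.*\\s", "", address, perl = TRUE))
  # a single-word address has no separate suffix to remove
  found <- last %in% names(suf_lookup) & grepl("\\s", address, perl = TRUE)
  data.frame(
    pm.address = ifelse(found,
                        sub("\\s+\\S+$", "", address, perl = TRUE),
                        address),
    pm.streetSuf = ifelse(found, unname(suf_lookup[last]), NA_character_),
    stringsAsFactors = FALSE
  )
}

suf_parsed <- parse_suffix(c("Grand Boulevard", "Maryland Plaza",
                             "EUCLID AVE", "Forest Park Ave"))
suf_parsed
```

Matching on the lower-cased word is what lets AVE, Ave, and Avenue all map to the same abbreviation.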

Street Names

Our final parsing task is to convert whatever remains in pm.address to the street name with pm_street_parse(). As part of the standardization process, names like Second will be converted to ordinals (i.e. 2nd). This creates shorter, more compact output street names. This same standardization functionality allows for optional standardization of commonly misspelled street names as well.

> sushi2_min <- pm_street_parse(sushi2_min, ordinal = TRUE, drop = TRUE)
>
> sushi2_min
# A tibble: 11 x 5
   pm.uid pm.house pm.preDir pm.street   pm.streetSuf
    <int> <chr>    <chr>     <chr>       <chr>       
 1      1 3407     NA        Olive       St          
 2      2 3221     S         Grand       Blvd        
 3      3 1        NA        Maryland    Plz         
 4      4 4949     W         Pine        Blvd        
 5      5 3674     NA        Forest Park Ave         
 6      6 1013     NA        Washington  Ave         
 7      7 3043     NA        Olive       St          
 8      8 308      N         Euclid      Ave         
 9      9 910      NA        Olive       St          
10     10 910      NA        Olive       St          
11     11 4        N         Euclid      Ave

The pm_street_ family of functions does not have logical test functions at this time, since street names are assumed to be whatever is left over from the parsing process to this point.
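The street name standardization in this section (ordinal conversion such as Second to 2nd, plus the change from EUCLID to Euclid visible in the output) can be approximated in base R. The lookup below covers only a handful of ordinals, and std_street() is a hypothetical name; treat it as a sketch of the idea under those assumptions.

```r
# Illustrative subset of spelled-out ordinals
ordinal_lookup <- c(first = "1st", second = "2nd", third = "3rd",
                    fourth = "4th", fifth = "5th")

# Hypothetical helper, not part of postmastr
std_street <- function(street) {
  vapply(strsplit(street, " ", fixed = TRUE), function(w) {
    key <- tolower(w)
    is_ord <- key %in% names(ordinal_lookup)
    # title-case every word, then overwrite spelled-out ordinals
    w <- paste0(toupper(substring(key, 1, 1)), substring(key, 2))
    w[is_ord] <- ordinal_lookup[key[is_ord]]
    paste(w, collapse = " ")
  }, character(1))
}

streets <- std_street(c("Second", "EUCLID", "Forest Park"))
streets
```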

Putting It All Back Together

Once we have parsed the data, we add the parsed elements back into the source data frame with pm_replace():

> sushi2_parsed <- pm_replace(sushi2_min, source = sushi2)
> 
> sushi2_parsed
# A tibble: 15 x 9
   pm.id pm.uid name                            address                visit    pm.house pm.preDir pm.street   pm.streetSuf
   <int>  <int> <chr>                           <chr>                  <chr>    <chr>    <chr>     <chr>       <chr>       
 1     1      1 BaiKu Sushi Lounge              3407 Olive St          3/20/18  3407     NA        Olive       St          
 2     2      2 Cafe Mochi                      3221 S Grand Boulevard 10/10/18 3221     S         Grand       Blvd        
 3     3      3 Drunken Fish - Central West End 1 Maryland Plaza       12/2/18  1        NA        Maryland    Plz         
 4     4      4 Kampai Sushi Bar                4949 W Pine Blvd       2/13/18  4949     W         Pine        Blvd        
 5     5      5 Midtown Sushi & Ramen           3674 Forest Park Ave   3/4/18   3674     NA        Forest Park Ave         
 6     6      6 Mizu Sushi Bar                  1013 Washington Avenue 9/12/18  1013     NA        Washington  Ave         
 7     7      7 Sapporo 2                       3043 Olive St          3/1/18   3043     NA        Olive       St          
 8     8      7 Sapporo 2                       3043 Olive St          7/3/18   3043     NA        Olive       St          
 9     9      8 Sub Zero Vodka Bar              308 N Euclid Ave       12/7/18  308      N         Euclid      Ave         
10    10      9 Sushi Ai                        910 Olive St           3/29/18  910      NA        Olive       St          
11    11      9 Sushi Ai                        910 Olive St           5/20/18  910      NA        Olive       St          
12    12     10 Sushi Ai                        910 Olive Street       6/17/18  910      NA        Olive       St          
13    13     10 Sushi Ai                        910 Olive Street       8/25/18  910      NA        Olive       St          
14    14     10 Sushi Ai                        910 Olive Street       10/30/18 910      NA        Olive       St          
15    15     11 SUSHI KOI                       4 N EUCLID AVE         1/17/18  4        N         Euclid      Ave

The replacement process includes an unnest argument, which will convert the house range list-columns into individual observations.
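The output above, where 11 unique addresses are expanded back to 15 observations, is consistent with a left join on pm.uid. The base R sketch below illustrates that idea on toy data; it is an assumption about equivalent behavior, not a description of how pm_replace() is implemented.

```r
# Toy source data: three visits, two unique addresses (pm.uid)
source_df <- data.frame(
  pm.id = 1:3,
  pm.uid = c(1L, 1L, 2L),
  address = c("3043 Olive St", "3043 Olive St", "910 Olive Street"),
  stringsAsFactors = FALSE
)

# Toy parsed data: one row per unique address
parsed_df <- data.frame(
  pm.uid = 1:2,
  pm.house = c("3043", "910"),
  pm.street = c("Olive", "Olive"),
  pm.streetSuf = c("St", "St"),
  stringsAsFactors = FALSE
)

# Left join keeps every source row and repeats parsed values as needed
replaced <- merge(source_df, parsed_df, by = "pm.uid", all.x = TRUE)
replaced <- replaced[order(replaced$pm.id), ]
replaced
```

Parsing each unique address once and joining afterwards avoids repeating the same work for every duplicated address in the source data.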

With our addresses replaced, we can then rebuild the address strings with pm_rebuild():

> sushi2_parsed <- pm_rebuild(sushi2_parsed, start = pm.house, end = pm.streetSuf, keep_parsed = "no")
>
> sushi2_parsed
# A tibble: 15 x 4
   name                            address                visit    pm.address          
   <chr>                           <chr>                  <chr>    <chr>               
 1 BaiKu Sushi Lounge              3407 Olive St          3/20/18  3407 Olive St       
 2 Cafe Mochi                      3221 S Grand Boulevard 10/10/18 3221 S Grand Blvd   
 3 Drunken Fish - Central West End 1 Maryland Plaza       12/2/18  1 Maryland Plz      
 4 Kampai Sushi Bar                4949 W Pine Blvd       2/13/18  4949 W Pine Blvd    
 5 Midtown Sushi & Ramen           3674 Forest Park Ave   3/4/18   3674 Forest Park Ave
 6 Mizu Sushi Bar                  1013 Washington Avenue 9/12/18  1013 Washington Ave 
 7 Sapporo 2                       3043 Olive St          3/1/18   3043 Olive St       
 8 Sapporo 2                       3043 Olive St          7/3/18   3043 Olive St       
 9 Sub Zero Vodka Bar              308 N Euclid Ave       12/7/18  308 N Euclid Ave    
10 Sushi Ai                        910 Olive St           3/29/18  910 Olive St        
11 Sushi Ai                        910 Olive St           5/20/18  910 Olive St        
12 Sushi Ai                        910 Olive Street       6/17/18  910 Olive St        
13 Sushi Ai                        910 Olive Street       8/25/18  910 Olive St        
14 Sushi Ai                        910 Olive Street       10/30/18 910 Olive St        
15 SUSHI KOI                       4 N EUCLID AVE         1/17/18  4 N Euclid Ave

The keep_parsed argument has options to retain some data, including city, state, and postal code, when a full address was present in the source data (use keep_parsed = "limited") or to keep all parsed data (with keep_parsed = "yes"). The keep_ids argument will retain the pm.id and pm.uid variables. Finally, there is a new_address argument to specify the name of the new variable with the rebuilt address. If not specified, pm.address is used as the default variable name.
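Conceptually, rebuilding an address means pasting the parsed elements together in order while skipping any that are missing. The following base R sketch shows that idea with a hypothetical rebuild() helper; it is not postmastr's own code and ignores the keep_parsed, keep_ids, and new_address options.

```r
# Hypothetical helper, not part of postmastr: concatenate the non-missing
# parts of each row, in the column order given by cols.
rebuild <- function(df, cols) {
  unname(apply(df[, cols, drop = FALSE], 1, function(parts) {
    paste(parts[!is.na(parts)], collapse = " ")
  }))
}

parts <- data.frame(
  pm.house = c("3221", "1"),
  pm.preDir = c("S", NA),
  pm.street = c("Grand", "Maryland"),
  pm.streetSuf = c("Blvd", "Plz"),
  stringsAsFactors = FALSE
)

rebuilt <- rebuild(parts, c("pm.house", "pm.preDir", "pm.street", "pm.streetSuf"))
rebuilt
```

Dropping missing values before pasting is what keeps addresses without a directional from picking up a stray "NA" or a doubled space.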

Getting Help

If you are new to R itself, welcome! Hadley Wickham’s R for Data Science is an excellent way to get started with data manipulation in the tidyverse, which postmastr is designed to integrate seamlessly with.
If you have questions about using postmastr, you are encouraged to use the RStudio Community forums. Please create a reprex before posting. Feel free to tag Chris (@chris.prener) in any posts about postmastr.
If you think you’ve found a bug, please create a reprex and then open an issue on GitHub.

Christopher Prener, Ph.D.

2019-03-27
