postmastr is designed to be an opinionated toolkit for parsing street addresses using R. It was originally created to standardize addresses prior to geocoding them, in an effort to increase the geocoder’s ability to correctly match a given address with the appropriate coordinates.

A Grammar of Street Addresses

The Anatomy of an American Street Address

While street addresses have a significant amount of variation, they also are ordered in a relatively standardized fashion. Take, for example, an address from sushi1 (one of the example data sets in postmastr): 601 Clark Ave Suite 104, St. Louis, MO 63102-1719. We can break down the address in the following way:

  • house number - 601
  • street name - Clark
  • street suffix - Ave
  • unit type - Suite
  • unit number - 104
  • city - St. Louis
  • state - MO
  • postal code - 63102-1719

There is sometimes additional information as well. Imagine that the house number in the example above was 601-603 instead. We could break this more complex number down in the following way:

  • house number - 601-603
  • house number lower value - 601
  • house number upper value - 603

Another permutation has to do with the house number suffix. Imagine a house number that is 601R or 601 Rear:

  • house number - 601
  • house number suffix - R

Finally, streets in the United States sometimes contain either a street prefix direction (601 North Clark Ave) or a street suffix direction (601 Clark Ave North).

This basic anatomy of a street address forms our grammar of street addresses - a specific language for thinking about how addresses are formed and therefore can be parsed.

Basic Organization of postmastr

To parse our grammar of street addresses, postmastr’s functions can be grouped in two ways. All functions begin with the prefix pm_ in order to take advantage of RStudio’s auto-complete functionality.

First, we have major groups of functions based on their associated grammatical element:

  • house - house number
  • houseAlpha - alphanumeric house number
  • houseFrac - fractional house number
  • street - street name
  • streetDir - street prefix and suffix direction
  • streetSuf - street suffix
  • unit - unit name and number
  • city - city
  • state - state
  • postal - postal code

For each group of functions, we have a similar menu of options that describe the verb (action) the function implements. For the state family of functions, for instance:

  • pm_state_detect() - does a given street address contain a state name or abbreviation?
  • pm_state_any() - does any street address contain a state name or abbreviation?
  • pm_state_all() - do all street addresses contain a state name or abbreviation?
  • pm_state_none() - returns a tibble of street addresses that do not contain a state name or abbreviation
  • pm_state_parse() - parses street addresses that do contain a state name or abbreviation
  • pm_state_std() - standardizes the parsed state data to return upper-case abbreviations

Creating Dictionaries

Dictionaries are a critical component of the postmastr workflow because they allow you to define the terms that need to be parsed and standardized. Using well-defined dictionaries can speed up the parsing process and ensure that it is accurate.

State Dictionaries

postmastr comes with a built-in dictionary of states and their abbreviations that can be expanded and filtered using pm_dictionary() and pm_append(). On its own, pm_dictionary() will return that built-in dictionary with a variety of entries based on what we specify for case:
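For instance, a call along these lines returns the dictionary in title case (a sketch; the printed tibble is omitted):

> pm_dictionary(type = "state", case = "title", locale = "us")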

We can customize this mix of cases by using "upper" and "lower" as well. For example:
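A sketch including title-, upper-, and lower-case entries:

> pm_dictionary(type = "state", case = c("title", "upper", "lower"), locale = "us")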

If only a subset of states are included in your data, you can improve postmastr’s performance by limiting your state dictionary’s contents. The filter argument accepts scalar or vector inputs of two-letter state abbreviations. For instance, we could construct a state dictionary that contains only the states along the Gulf of Mexico:
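A sketch covering Alabama, Florida, Louisiana, Mississippi, and Texas:

> gulf <- pm_dictionary(type = "state", filter = c("AL", "FL", "LA", "MS", "TX"),
+                       case = "title", locale = "us")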

The state dictionary can also be expanded. For instance, there are several common abbreviations for Mississippi - “Miss” and “MISS”. We can create an appendix with pm_append():

> miss <- pm_append(type = "state", input = "Miss", output = "MS", locale = "us")

Once this has been created, we can combine it with the default state dictionary to create our custom dictionary output:
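A sketch, assuming pm_dictionary() accepts the appendix via an append argument:

> stateDict <- pm_dictionary(type = "state", append = miss, case = "title", locale = "us")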

All of the different inputs for Mississippi will now return the same MS output.

Other Dictionaries

The only required dictionary for U.S. street addresses is the city dictionary, because municipalities and place names in the U.S. are so numerous that we cannot build a single, efficient object to use as a default. postmastr comes with default U.S. dictionaries for states (shown above), street directionals, street suffixes, and unit types. It is also possible to build street name dictionaries to correct common misspellings and to create dictionaries for house suffix values (e.g. “Front” or “Rear” addresses).

Sample Data

To illustrate the core components of the postmastr workflow, we’ll use some data on sushi restaurants in the St. Louis, Missouri region. These are “long” data - some restaurants appear multiple times. Here is a quick preview of the data:
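For instance, printing the tibble at the console (output omitted here):

> sushi1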

Some problems should already be apparent. For instance, Cafe Mouchi uses the full word “Boulevard” while the entry for Blue Ocean uses the proper abbreviation “Blvd”. For Drunken Fish, the suite number is listed both using the pound sign (“#”) and using the word “Suite”. Additionally, some of the entries, including those for BaiKu and I Love Mr Sushi, use the full name “Missouri” while the rest use the proper two-letter abbreviation “MO”. Finally, Drunken Fish - Ballpark Village uses the “zip+4” format, while the remainder of the visible addresses contain only the five-digit zip code.

Some of the other entries have additional issues. For example, Sushi Koi has its address fully capitalized:

Similarly, Wasabi Sushi Bar has its address fully capitalized, and the prefix direction “SOUTH” appears as a full word rather than the proper abbreviation “S”:

This vignette will walk through the process of addressing these issues.

Omnibus Parsing Functionality

The postmastr package has a single high-level function for parsing, pm_parse(), which wraps all of the preparatory, parsing, and reconstruction functions into a single call. This can be used if the problem-space is exceptionally well defined - all dictionaries need to be created ahead of time, so you must know what dictionary elements are necessary. This may be the case if you consistently work with specific data sources and have a good understanding of what cities, states, and other elements of the grammar are present. If you are not sure what dictionary elements are needed, you will need to use the workflow illustrated below to develop these objects.

For the sushi1 data, the required dictionaries are:

> dirs <- pm_dictionary(type = "directional", filter = c("N", "S", "E", "W"), locale = "us")
> mo <- pm_dictionary(type = "state", filter = "MO", case = c("title", "upper"), locale = "us")
> cities <- pm_append(type = "city",
+                       input = c("Brentwood", "Clayton", "CLAYTON", "Maplewood", 
+                                 "St. Louis", "SAINT LOUIS", "Webster Groves"),
+                       output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA))

Once those dictionaries are built, we can parse the data. The input argument is used to define how the address data are structured, and output is used to specify what type of output you receive.
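A sketch of the omnibus call; the dictionary argument names (dir_dict, city_dict, state_dict) are assumptions based on the package documentation:

> sushi1 %>%
+   pm_identify(var = address) %>%
+   pm_parse(input = "full", address = address, output = "full",
+            dir_dict = dirs, city_dict = cities, state_dict = mo)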

We can limit our output to just the house and street data:
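For instance (a sketch):

> sushi1 %>%
+   pm_identify(var = address) %>%
+   pm_parse(input = "full", address = address, output = "short",
+            dir_dict = dirs, city_dict = cities, state_dict = mo)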

We can also add the city, state, and postal code data as separate columns:
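One possible call, assuming pm_parse() forwards the keep_parsed argument described later under pm_rebuild():

> sushi1 %>%
+   pm_identify(var = address) %>%
+   pm_parse(input = "full", address = address, output = "short",
+            keep_parsed = "limited", # assumed pass-through argument
+            dir_dict = dirs, city_dict = cities, state_dict = mo)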

The postmastr Workflow

If you do not have a strong sense of the problem-space you are confronted with, and therefore do not know exactly what is needed for dictionaries, postmastr has a full-featured workflow for step-by-step parsing of street addresses.

Order of Operations in postmastr

postmastr’s functionality rests on an order of operations that must be followed to ensure correct parsing:

  1. prep
  2. postal code
  3. state
  4. city
  5. unit
  6. house number
  7. ranged house number
  8. fractional house number
  9. house suffix
  10. street directionals
  11. street suffix
  12. street name
  13. reconstruct

If no street address in the given data set contains one of the grammatical elements (for example, if none of the addresses contain units), the associated functions in the workflow can be skipped. For “short” addresses that do not include cities, states, or postal codes, the order of operations should be (a code sketch follows the list):

  1. prep
  2. unit
  3. house number
  4. ranged house number
  5. fractional house number
  6. house suffix
  7. street directionals
  8. street suffix
  9. street name
  10. reconstruct
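As a sketch, the short workflow chains together like this, where short_data stands in for a hypothetical data frame of short addresses and dirs is a directional dictionary built beforehand; pm_replace() and pm_rebuild() would close out the reconstruct step, as shown later in this vignette:

> short_data %>%
+   pm_identify(var = address) %>%
+   pm_prep(var = address) %>%
+   pm_house_parse() %>%
+   pm_streetDir_parse(dictionary = dirs) %>%
+   pm_streetSuf_parse() %>%
+   pm_street_parse()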

Prep

There are two initial preparatory steps that must be taken with these data. First, we want to ensure that they have a unique identification number for each row (to preserve the original sort order; pm.id) as well as a unique identification number for each unique street address string (pm.uid). These can both be applied using pm_identify():
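A sketch, assuming the address column in sushi1 is named address:

> sushi1 <- pm_identify(sushi1, var = address)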

Notice that the Drunken Fish has three different unique identifiers applied for pm.uid - two for the Ballpark Village location based on how the suite number is indicated and one for the Central West End location. Since address data are often numerous, postmastr is designed to operate on unique street address strings rather than the full original data set to improve efficiency. We’ll create our minimal postmastr object using pm_prep():
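Continuing the sketch:

> sushi1_min <- pm_prep(sushi1, var = address)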

Notice that all extraneous information has been removed, and that there are now only 24 rows instead of the original 30.

Postal Codes

Once we have our data prepared, we can begin working our way down the order of operations list. To see if it is possible to skip a step, we should first use the appropriate _any function. With the sushi1_min data, it will return TRUE because postal codes are present in the data:
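Following the pm_state_any() naming pattern, the postal check would look something like this, with the TRUE result described above:

> pm_postal_any(sushi1_min)
[1] TRUE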

We can also use the corresponding _all function to determine whether all of the addresses have a postal code:
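Again following the naming pattern:

> pm_postal_all(sushi1_min)
[1] TRUE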

If this returned a FALSE result, we would want to use the _detect and _none functions to explore whether postal codes are not being detected because they (a) actually do not exist or (b) are not being found because they are mis-formatted. Since we get a TRUE result, we can move on to parsing. If pm_postal_parse() detects the presence of carrier routes (the four-digit additions to the typical five-digit zip codes), it will parse those as well so that two postal code columns (pm.zip and pm.zip4) are returned:
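In code:

> sushi1_min <- pm_postal_parse(sushi1_min)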

Had no carrier routes been present, these data would have been returned with only the pm.zip column. Note that pm_postal_parse() also updates the pm.address column and removes any postal codes that have been identified. This facilitates additional parsing in subsequent steps.

States

To parse the states out of pm.address, we’ll start by creating a state-level dictionary object that contains only references to Missouri, since we don’t have data outside of that state:
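A sketch of that call (title case only, which is the source of the problem identified below):

> moDict <- pm_dictionary(type = "state", filter = "MO", case = "title", locale = "us")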

With our dictionary created in the object moDict, we can use it to test whether state names or abbreviations are found at the end of our address strings with pm_state_any() and pm_state_all():
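Given the results described below, the checks would look like:

> pm_state_any(sushi1_min, dictionary = moDict)
[1] TRUE
> pm_state_all(sushi1_min, dictionary = moDict)
[1] FALSE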

These results indicate that our dictionary is returning matches, but that our list of possible inputs is not complete. We can explore the un-matched addresses with pm_state_none() to determine whether these observations (a) actually do not contain states or (b) are not being matched because an entry is missing from our dictionary:
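In code (the returned tibble is omitted):

> pm_state_none(sushi1_min, dictionary = moDict)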

Our dictionary does not contain MISSOURI, only Missouri, so we’ll need to add that option. We can use case = c("title", "upper") to do this:
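This mirrors the state dictionary used with pm_parse() earlier:

> moDict <- pm_dictionary(type = "state", filter = "MO", case = c("title", "upper"), locale = "us")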

For more complex mis-matches, we would want to use pm_append() to construct an appendix and then re-build our dictionary with that appendix included. We can verify that our dictionary is now complete by repeating our use of pm_state_all():
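With the expanded dictionary:

> pm_state_all(sushi1_min, dictionary = moDict)
[1] TRUE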

With a TRUE result, we are ready to parse state names and abbreviations out of our data:
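A sketch of the parsing step:

> sushi1_min <- pm_state_parse(sushi1_min, dictionary = moDict)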

If we inspect this object, we’ll see that pm.state contains the state abbreviation for Missouri in all cases. As with postal codes, the state names and abbreviations have been removed from pm.address to facilitate the next phase in parsing.

Cities

The workflow for cities is similar. Two options exist for American cities - you can either create a full list of cities by state using pm_dictionary() (useful if you are not sure which cities appear in the data) or use pm_append() on its own to create an appendix (faster if you know that there are a limited number of cities). Unlike the state appendix above, a city appendix can be used on its own as a dictionary. We’ll choose this second option:

> cityDict <- pm_append(type = "city",
+                       input = c("Brentwood", "Clayton", "Maplewood", "St. Louis", "Webster Groves"))

We’ll then use pm_city_any() and pm_city_all() to verify that our dictionary is working and complete:
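Based on the dictionary gaps identified below, these return:

> pm_city_any(sushi1_min, dictionary = cityDict)
[1] TRUE
> pm_city_all(sushi1_min, dictionary = cityDict)
[1] FALSE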

As with the state data, we can use pm_city_none() to identify why our dictionary is incomplete (or verify that some addresses do not contain cities):
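A sketch (the returned tibble is omitted):

> pm_city_none(sushi1_min, dictionary = cityDict)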

There are two city names in all upper-case. We’ll re-create our dictionary, adding both "SAINT LOUIS" and "CLAYTON" to the inputs and specifying an output vector that is NA for the correctly formatted cities but contains standardized entries for the two incorrectly formatted ones. We can then verify with pm_city_all() that our dictionary is now complete:

> cityDict <- pm_append(type = "city",
+                       input = c("Brentwood", "Clayton", "CLAYTON", "Maplewood", 
+                                 "St. Louis", "SAINT LOUIS", "Webster Groves"),
+                       output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA))
> pm_city_all(sushi1_min, dictionary = cityDict)
[1] TRUE

We can now move on to parsing using pm_city_parse():
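For example:

> sushi1_min <- pm_city_parse(sushi1_min, dictionary = cityDict)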

The column pm.city has been added to our data set and city names have been parsed as well as standardized.

Units

This functionality is not enabled yet.

House Numbers

Since the example data do not contain fractional addresses or house suffix values, parsing them is straightforward (dealing with less common street addresses will be the subject of an additional vignette). We’ll use an abbreviated version of the sushi data that do not contain city, state, or postal code data. These are sushi restaurants located within the City of St. Louis proper:
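Assuming the abbreviated data set is the package’s sushi2 example data, preparation mirrors the steps above:

> sushi2_min <- pm_identify(sushi2, var = address)
> sushi2_min <- pm_prep(sushi2_min, var = address)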

Our next task in the order of operations is to parse out house numbers. We do this with pm_house_parse():
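For example:

> sushi2_min <- pm_house_parse(sushi2_min)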

There is no dictionary for this step - the first word of any address will be parsed out so long as it contains at least some numbers. If your house number contains a range (123-125 Main St) or a fractional value (123 1/2 Main St), there are separate families of functions, pm_houseRange_ and pm_houseFrac_, for dealing with these special cases.

Street Prefix and Suffix Data

Our addresses have two types of prefix and suffix data. There are directionals, like the “south” in S Grand Boulevard, and the suffix value, which is “Boulevard”. The United States Postal Service prefers abbreviations for both directionals and suffix values, and so postmastr returns the preferred abbreviations whenever full names are found. We can parse out the directionals first with pm_streetDir_parse():
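A sketch, reusing a directional dictionary like the dirs object built earlier:

> sushi2_min <- pm_streetDir_parse(sushi2_min, dictionary = dirs)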

postmastr will automatically detect both prefix and suffix directionals (e.g. S Grand Boulevard and Grand Boulevard S).

Once directionals have been parsed, we can parse and standardize the street suffix values as well with pm_streetSuf_parse():
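For example, using the package’s built-in suffix dictionary:

> sushi2_min <- pm_streetSuf_parse(sushi2_min)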

If a street has a directional name (e.g. North Ave), the word North will be added back into pm.address after the street suffix is parsed out.

Street Names

Our final parsing task is to convert whatever remains in pm.address to the street name with pm_street_parse(). As part of the standardization process, names like Second will be converted to ordinals (i.e. 2nd). This creates shorter, more compact output street names. This same standardization functionality allows for optional standardization of commonly misspelled street names as well.
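A sketch of the call:

> sushi2_min <- pm_street_parse(sushi2_min)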

The pm_street_ family of functions does not have the logical test functions at this time, since street names are assumed to be whatever is left over from the parsing process to this point.

Putting It All Back Together

Once we have parsed data, we add our parsed data back into the source data frame with pm_replace():
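A sketch, assuming pm_replace() accepts the original data frame via a source argument:

> sushi2_replaced <- pm_replace(sushi2_min, source = sushi2)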

The replacement process includes an unnest argument, which will convert the house range list-columns into individual observations.

With our addresses replaced, we can then rebuild the address strings with pm_rebuild():
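One possible call, using the arguments described below (the value "no" for keep_parsed is an assumption):

> sushi2_rebuilt <- pm_rebuild(sushi2_replaced, keep_parsed = "no", keep_ids = FALSE)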

The keep_parsed argument has options to retain some data, including city, state, and postal code, when a full address was present in the source data (use keep_parsed = "limited") or to keep all parsed data (with keep_parsed = "yes"). The keep_ids argument will retain the pm.id and pm.uid variables. Finally, there is a new_address argument to specify the name of the new variable containing the rebuilt address. If not specified, pm.address is used as the default variable name.

Getting Help

  • If you are new to R itself, welcome! Hadley Wickham’s R for Data Science is an excellent way to get started with data manipulation in the tidyverse, which postmastr is designed to integrate seamlessly with.
  • If you have questions about using postmastr, you are encouraged to use the RStudio Community forums. Please create a reprex before posting. Feel free to tag Chris (@chris.prener) in any posts about postmastr.
  • If you think you’ve found a bug, please create a reprex and then open an issue on GitHub.