A wrapper around the parse functions that can be used to shorten all of postmastr's core code down to a single function call once dictionaries have been created and tested against the data.

pm_parse(.data, input, address, output, new_address, ordinal = TRUE,
    operator = "at", unnest = FALSE, include_commas = FALSE, include_units = TRUE,
    keep_parsed = "no", side = "right", left_vars, keep_ids = FALSE, houseSuf_dict,
    dir_dict, street_dict, suffix_dict, unit_dict, city_dict, state_dict,
    locale = "us")

Arguments

.data

A source data set to be parsed

input

Describes the format of the source address. One of either "full" or "short". A short address contains, at the most, a house number, street directionals, a street name, a street suffix, and a unit type and number. A full address contains all of the elements of a short address as well as, at the most, a city, state, and postal code.

address

A character variable containing address data to be parsed

output

Describes the format of the output address. One of either "full" or "short". A short address contains, at the most, a house number, street directionals, a street name, a street suffix, and a unit type and number. A full address contains all of the elements of a short address as well as, at the most, a city, state, and postal code.

new_address

Name of new variable to store rebuilt address in.

ordinal

A logical scalar; if TRUE, street names that contain numeric word values (e.g. "Second") will be converted and standardized to ordinal values (e.g. "2nd"). The default is TRUE because it returns much more compact clean addresses (e.g. "168th St" as opposed to "One Hundred Sixty Eighth St").
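To make the effect of ordinal concrete, here is a minimal base R sketch of the idea. It is not postmastr's implementation: the lookup table `ordinal_lookup` and the helper `to_ordinal()` are hypothetical names introduced for illustration, and the table covers only three words.

```r
# Hypothetical, deliberately tiny lookup table; not postmastr's own dictionary
ordinal_lookup <- c(First = "1st", Second = "2nd", Third = "3rd")

# Replace each whole-word match in a street name with its ordinal form
to_ordinal <- function(street) {
  for (word in names(ordinal_lookup)) {
    pattern <- paste0("\\b", word, "\\b")
    street <- gsub(pattern, ordinal_lookup[[word]], street, perl = TRUE)
  }
  street
}

to_ordinal(c("Second St", "Olive St"))
#> [1] "2nd St"   "Olive St"
```

Street names without a numeric word pass through unchanged, which is the behavior the argument description implies for ordinary names.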

operator

A character scalar to be used as the intersection operator (between the 'x' and 'y' sides of the intersection).

unnest

A logical scalar; if TRUE, house ranges will be unnested (i.e. a house range that has been expanded to cover four addresses with pm_houseRange_parse will be converted from a single observation to four observations, one for each house number). If FALSE (default), the single observation will remain.
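The difference between a nested and an unnested house range can be sketched in base R. This only illustrates the reshaping described above and is not how postmastr performs it; the data frame `ranges`, its column names, and the house numbers are hypothetical.

```r
# Hypothetical data: the first observation holds an expanded range of four
# house numbers in a list-column, the second holds a single house number
ranges <- data.frame(
  id = c(1, 2),
  street = c("Olive St", "Delmar Blvd"),
  stringsAsFactors = FALSE
)
ranges$house_range <- list(c("3407", "3409", "3411", "3413"), "6335")

# Unnest: repeat each row once per house number stored in its list-column
n <- lengths(ranges$house_range)
unnested <- data.frame(
  id = rep(ranges$id, n),
  house = unlist(ranges$house_range),
  street = rep(ranges$street, n),
  stringsAsFactors = FALSE
)

nrow(ranges)
#> [1] 2
nrow(unnested)
#> [1] 5
```

With unnest = FALSE the data keep the shape of `ranges` (one row per original observation); with unnest = TRUE they take the shape of `unnested`.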

include_commas

A logical scalar; if TRUE, a comma is added both before and after the city name in rebuilt addresses. If FALSE (default), no punctuation is added.
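As a rough illustration of what this option changes, the base R sketch below assembles a full address with and without commas around the city. The helper `rebuild_full()` and the postal code are hypothetical and stand in for postmastr's own rebuilding step, which may differ in detail.

```r
# Hypothetical helper: join parsed pieces into a single full address string
rebuild_full <- function(street, city, state, zip, include_commas = FALSE) {
  if (include_commas) {
    # comma before and after the city name
    paste0(street, ", ", city, ", ", state, " ", zip)
  } else {
    # no punctuation added between elements
    paste(street, city, state, zip)
  }
}

# "63103" is an illustrative value only
rebuild_full("3407 Olive St", "St. Louis", "MO", "63103", include_commas = TRUE)
#> [1] "3407 Olive St, St. Louis, MO 63103"
rebuild_full("3407 Olive St", "St. Louis", "MO", "63103")
#> [1] "3407 Olive St St. Louis MO 63103"
```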

include_units

A logical scalar; if TRUE (default), the unit name and number (if given) will be included in the output string. Otherwise if FALSE, the unit name and number will not be included.

keep_parsed

Character string; if "yes", all parsed elements will be added to the source data after replacement. If "limited", only the pm.city, pm.state, and postal code variables will be retained. Otherwise, if "no", only the rebuilt address will be added to the source data (default).

side

One of either "left" or "right" - should parsed data be placed to the left or right of the original data? Placing data to the left may be useful in particularly wide data sets.

left_vars

A character scalar or vector of variables to place on the left-hand side of the output when side is equal to "middle".

keep_ids

Logical scalar; if TRUE, the identification numbers will be kept in the source data after replacement. Otherwise, if FALSE, they will be removed (default).

houseSuf_dict

Optional; name of house suffix dictionary object. Standardization and parsing are skipped if none is specified.

dir_dict

Optional; name of directional dictionary object. If none is specified, the full default directional dictionary will be used.

street_dict

Optional; name of street dictionary object. Standardization is skipped if none is specified.

suffix_dict

Optional; name of street suffix dictionary object. If none is specified, the full default street suffix dictionary will be used.

unit_dict

Optional; name of unit dictionary object - NOT CURRENTLY ENABLED

city_dict

Required for "full" addresses; name of city dictionary object.

state_dict

Optional; name of state dictionary object. If none is specified, the full default state dictionary will be used.

locale

A string indicating the country these data represent; the only current option is "us" but this is included to facilitate future expansion.

Value

An updated version of the source data with, at a minimum, a new variable containing standardized street addresses for each observation. Options allow for columns containing parsed elements to be returned as well.

Examples

# construct dictionaries
dirs <- pm_dictionary(type = "directional", filter = c("N", "S", "E", "W"), locale = "us")
sufs <- pm_dictionary(type = "suffix", locale = "us")
mo <- pm_dictionary(type = "state", filter = "MO", case = c("title", "upper"), locale = "us")
cities <- pm_append(type = "city",
    input = c("Brentwood", "Clayton", "CLAYTON", "Maplewood",
        "St. Louis", "SAINT LOUIS", "Webster Groves"),
    output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA))

# add example data
df <- sushi1

# identify
df <- pm_identify(df, var = address)

# temporary code to subset unit
df <- dplyr::filter(df, name != "Drunken Fish - Ballpark Village")

# parse, full output
pm_parse(df, input = "full", address = address, output = "full", keep_parsed = "no",
    dir_dict = dirs, suffix_dict = sufs, city_dict = cities, state_dict = mo)
#> # A tibble: 27 x 4
#>    name              address                   visit   pm.address
#>    <chr>             <chr>                     <chr>   <chr>
#>  1 BaiKu Sushi Loun… 3407 Olive St, St. Louis… 3/20/18 3407 Olive St St. Louis …
#>  2 Blue Ocean Resta… 6335 Delmar Blvd, St. Lo… 10/26/… 6335 Delmar Blvd St. Lou…
#>  3 Cafe Mochi        3221 S Grand Boulevard, … 10/10/… 3221 S Grand Blvd St. Lo…
#>  4 Drunken Fish - C… 1 Maryland Plaza, St. Lo… 12/2/18 1 Maryland Plz St. Louis…
#>  5 I Love Mr Sushi   9443 Olive Blvd, St. Lou… 1/1/18  9443 Olive Blvd St. Loui…
#>  6 Kampai Sushi Bar  4949 W Pine Blvd, St. Lo… 2/13/18 4949 W Pine Blvd St. Lou…
#>  7 Midtown Sushi & … 3674 Forest Park Ave, St… 3/4/18  3674 Forest Park Ave St.…
#>  8 Mizu Sushi Bar    1013 Washington Avenue, … 9/12/18 1013 Washington Ave St. …
#>  9 Robata Maplewood  7260 Manchester Road, Ma… 11/1/18 7260 Manchester Rd Maple…
#> 10 SanSai Japanese … 1803 Maplewood Commons D… 2/14/18 1803 Maplewood Commons D…
#> # … with 17 more rows

# parse, short output
pm_parse(df, input = "full", address = address, output = "short", keep_parsed = "no",
    new_address = clean_address, dir_dict = dirs, suffix_dict = sufs,
    city_dict = cities, state_dict = mo)
#> # A tibble: 27 x 4
#>    name                 address                       visit  clean_address
#>    <chr>                <chr>                         <chr>  <chr>
#>  1 BaiKu Sushi Lounge   3407 Olive St, St. Louis, Mi… 3/20/… 3407 Olive St
#>  2 Blue Ocean Restaura… 6335 Delmar Blvd, St. Louis,… 10/26… 6335 Delmar Blvd
#>  3 Cafe Mochi           3221 S Grand Boulevard, St. … 10/10… 3221 S Grand Blvd
#>  4 Drunken Fish - Cent… 1 Maryland Plaza, St. Louis,… 12/2/… 1 Maryland Plz
#>  5 I Love Mr Sushi      9443 Olive Blvd, St. Louis, … 1/1/18 9443 Olive Blvd
#>  6 Kampai Sushi Bar     4949 W Pine Blvd, St. Louis,… 2/13/… 4949 W Pine Blvd
#>  7 Midtown Sushi & Ram… 3674 Forest Park Ave, St. Lo… 3/4/18 3674 Forest Park A…
#>  8 Mizu Sushi Bar       1013 Washington Avenue, St. … 9/12/… 1013 Washington Ave
#>  9 Robata Maplewood     7260 Manchester Road, Maplew… 11/1/… 7260 Manchester Rd
#> 10 SanSai Japanese Gri… 1803 Maplewood Commons Dr, S… 2/14/… 1803 Maplewood Com…
#> # … with 17 more rows