How can I parse an address into its individual components?

c# html-agility-pack parsing regex


I've been tasked with building a parser for a particular web page, so that our employees can bulk-import a customer's user data into that customer's web site with our company.

I've used HtmlAgilityPack to parse the page, and I've mapped each table row's cells to properties on my Map class.

However, one column is causing me a lot of grief. The Address column is the thorn in my side for an assortment of reasons.

Sample Data:

6313 SW 203rd Ave <br> Portland, OR 97224
16600 Lomita Way <br> El Dorado Hills, CA 95762
PO Box #42 <br> Hampton Bays, NY 11946

Each one of those addresses is wrapped like so (obviously the addresses will vary based on the customer we are importing users for):

     <td> 6313 SW 203rd Ave <br> Portland, OR 97224 </td>

I'm trying to implement a regular expression to split this in the proper places, so that each part can be assigned to the corresponding property:

public string Unit { get; set; }
public string Street { get; set; }
public string City { get; set; }
public string State { get; set; }
public string Zip { get; set; }

However the addresses don't provide much to anchor off of:

Issue One: If I anchor off the <br>, I only separate the two lines. That doesn't fully split the address into the proper segments.

Issue Two: The same problem applies to anchoring off the single comma.

Issue Three: If I anchor on numeric values for the Zip, the pattern may be invalid for Canada, and it may split incorrectly when the street line also contains digits.

What is the best way to separate items for an address? With Regex?

Accepted Answer

Okay, so the Address field was quite painful to parse. However, I did manage to parse the data based on my particular requirements.

  • The Address always has a <br> between the Street & City.

So I did the following:

var splitBasedOnHTML = Regex.Split(column[2], @"\b<br>");

The column list holds my address at index two. After that call, the Unit and Street end up at index zero of the result, and the City, State, and Zip at index one.

So I did another split, to break the City, State, and Zip like this:

var splitBasedOnSpace = splitBasedOnHTML[1].Split(' ');

After that I now end up with the following:

6313 SW 203rd Ave // splitBasedOnHTML[0]
Portland, // splitBasedOnSpace[0]
OR // splitBasedOnSpace[1]
97224 // splitBasedOnSpace[2]

So I simply mapped my properties to those individual array indexes.

This solution assumes that the Unit is a part of the Street, which was an acceptable sacrifice because the data is being imported into another web site and can be corrected by particular people later on.

That is how I solved the parsing issues. This solution may not be viable for others in this boat, but hopefully it is a useful alternative or points in a good direction. This is what the method looks like:

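    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static Map AddressMapper(IList<string> column)
    {
        var map = new Map();

        // column[2] holds the address cell, e.g. "6313 SW 203rd Ave<br>Portland, OR 97224".
        // Note: \b only matches when a word character immediately precedes <br>.
        var splitBasedOnHTML = Regex.Split(column[2], @"\b<br>");
        // "Portland, OR 97224" -> { "Portland,", "OR", "97224" }
        var splitBasedOnSpace = splitBasedOnHTML[1].Split(' ');

        map.Street = splitBasedOnHTML[0];
        map.City = splitBasedOnSpace[0].TrimEnd(','); // drop the trailing comma
        map.State = splitBasedOnSpace[1];
        map.Zip = splitBasedOnSpace[2];

        return map;
    }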
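
One limitation of splitting on a single space is that a multi-word city such as "El Dorado Hills" from the sample data gets spread across several array elements. A minimal sketch of a variant is shown here, assuming the second line always has the shape "City, ST ZIP" with a single-token postal code. It takes everything before the last comma as the city and also tolerates whitespace around the <br>. The Map class below is a stand-in with only the properties from the question, and AddressSplitter.Parse is a hypothetical name, not part of the original solution.

```csharp
using System;
using System.Text.RegularExpressions;

// Stand-in for the Map class from the question (address properties only).
public class Map
{
    public string Street { get; set; }
    public string City { get; set; }
    public string State { get; set; }
    public string Zip { get; set; }
}

public static class AddressSplitter
{
    // Hypothetical helper: takes the inner HTML of the address cell,
    // e.g. " 6313 SW 203rd Ave <br> Portland, OR 97224 ".
    public static Map Parse(string cellHtml)
    {
        // Split on <br>, <br/> or <br />, swallowing the whitespace around it.
        var lines = Regex.Split(cellHtml.Trim(), @"\s*<br\s*/?>\s*", RegexOptions.IgnoreCase);
        if (lines.Length != 2)
            throw new FormatException("Expected exactly one <br> in: " + cellHtml);

        // Everything before the last comma is the city, whatever its word count.
        var comma = lines[1].LastIndexOf(',');
        if (comma < 0)
            throw new FormatException("Expected 'City, ST ZIP' in: " + lines[1]);

        // Assumption: what follows the comma is exactly a state and a one-token postal code.
        var tail = lines[1].Substring(comma + 1)
                           .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (tail.Length != 2)
            throw new FormatException("Expected state and postal code in: " + lines[1]);

        return new Map
        {
            Street = lines[0],
            City = lines[1].Substring(0, comma).Trim(),
            State = tail[0],
            Zip = tail[1]
        };
    }
}
```

Throwing on an unexpected shape is a design choice for this sketch: in a bulk import it makes malformed rows visible so they can be reviewed by hand rather than silently mapped to the wrong properties. A Canadian postal code written with a space would be rejected by this version.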

Popular Answer

Parsing addresses is hard; really hard. There is no truly uniform format for addresses, especially across country borders. It's highly unlikely that you will be able to do this using a single regex.

See this other post for a few examples and a more in-depth explanation: "How to parse freeform street/postal address out of text, and into components".
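
To illustrate how quickly a pattern grows, here is a rough sketch that handles only the "City, ST ZIP" line and only two formats: US ZIP codes (with optional +4) and Canadian postal codes. CityLineParser and TryParse are hypothetical names chosen for this example, and the pattern is an assumption about the data, not a general address grammar. Anything it does not recognize is reported as a failure so the row can be flagged for manual review.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CityLineParser
{
    // Assumed shape: "City, RR POSTAL" where RR is a two-letter state or province,
    // and POSTAL is either 12345, 12345-6789, or a Canadian code such as K1A 0B1.
    private static readonly Regex Pattern = new Regex(
        @"^\s*(?<city>[^,]+?)\s*,\s*(?<region>[A-Za-z]{2})\s+" +
        @"(?<postal>\d{5}(?:-\d{4})?|[A-Za-z]\d[A-Za-z][ ]?\d[A-Za-z]\d)\s*$");

    // Returns false (and empty out values) when the line does not fit the assumed shape.
    public static bool TryParse(string line, out string city, out string region, out string postal)
    {
        var match = Pattern.Match(line ?? string.Empty);
        city = match.Success ? match.Groups["city"].Value : string.Empty;
        region = match.Success ? match.Groups["region"].Value : string.Empty;
        postal = match.Success ? match.Groups["postal"].Value : string.Empty;
        return match.Success;
    }
}
```

Even this narrow version says nothing about the street line, unit numbers, or countries other than the two it was written for, which is the point the answer above is making.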

