Address cleanup for logistics companies using ML

Beginning with the Data Itself

Address cleanup using machine learning is a sought-after solution among Indian logistics companies. The problem is much more complex than it first appears. While companies like Flipkart and Snapdeal have had their fair share of success because of their access to vast amounts of user address data, there are some standard ways in which newer logistics companies can build in-house solutions to their own address problems.

Data scientists working on AI and ML commonly report spending more than half of their time cleaning up data. The actual modeling and number crunching is much less time-consuming. This holds across the industry for almost any kind of machine learning problem. The more accurate the data, the more accurate the models and the more accurate the output.

So for the address “cleanup” problem, how do you get the cleaned-up data? Here are some of the problems the data will throw in your face…

Data Does Not Follow a Particular Format and Is Out of Order

There is no “standardization” of the format for addresses in the Indian landscape. E-commerce sites provide consumers with standard form-based inputs in a wizard-like sequence, but people still have the freedom to choose what data they type into each field.

Addresses shared with logistics companies are not exactly “standardized” either. The input can be free-form, and this is a practice that cannot be monitored. All the address data could have been stuffed into one text field, or the street address could be mixed up with the locality. There is more than one way to ruffle the cat, proverbially speaking. This makes address standardization a problem for logistics companies. However, machine learning algorithms can learn “key phrases” and associate them with the correct context. This may need a lot of data “annotation” or “labeling”, which is a manual process.

Data Is Spelled Incorrectly

Take places like Bangalore, aka Bengaluru: there are many ways to spell a locality whose name is hard to pronounce. Long-tail address keywords run into this problem. People in the same locality will spell it differently!

Let us take the case of Kadugondanahalli, a locality in Bengaluru. This locality can be spelled as…

  • Kadugondanahalli - original
  • Kadugondanhalli - “a” missing
  • Kadu gondan halli - “a” missing and split into three words
  • Kadu gondanhalli - “a” missing and split into two words
  • Kadu gondan Halli - split into three words, with a capital “H”
  • Kadu gondanahalli - split into two words
  • Kadugondanahali - single “l” instead of double
  • Kadugondanhali - “a” missing and single “l”
  • Kadagondanahalli - “u” replaced by “a”

Most of the time, a conventional RDBMS will not be able to find a match, because it looks for an “exact” one. However, databases like MySQL and SQL Server offer “soundex” matching, which can return “similar sounding” words. Matters are complicated by the fact that the keyword itself could be split into multiple words depending on user input.
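One way to cope with both the misspellings and the split words is to glue the words back together and then compare by similarity rather than equality. The sketch below is only an illustration, not a production matcher: it uses Python's standard difflib module instead of soundex, and the KNOWN_LOCALITIES list, the function names and the 0.8 cutoff are all assumptions made for the example.

```python
import difflib
import re

# Hypothetical "golden" list of localities; a real one would come from your own data.
KNOWN_LOCALITIES = ["Kadugondanahalli", "Koramangala", "Kalyan Nagar"]


def normalize(text):
    """Lowercase and keep letters only, so that split words are rejoined."""
    return re.sub(r"[^a-z]", "", text.lower())


def match_locality(raw, cutoff=0.8):
    """Return the most similar known locality, or None if nothing is close enough."""
    keys = {normalize(name): name for name in KNOWN_LOCALITIES}
    hits = difflib.get_close_matches(normalize(raw), list(keys), n=1, cutoff=cutoff)
    return keys[hits[0]] if hits else None


variants = [
    "Kadugondanahalli", "Kadugondanhalli", "Kadu gondan halli",
    "Kadu gondanhalli", "Kadu gondan Halli", "Kadu gondanahalli",
    "Kadugondanahali", "Kadugondanhali", "Kadagondanahalli",
]
for variant in variants:
    print(variant, "->", match_locality(variant))
```

The cutoff is a trade-off: set it too low and different localities with similar names start to collide, set it too high and genuine misspellings are rejected.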

The Problem in a Nutshell

The problem in a nutshell is this: there is more than one way to “spell” the cat!

So What Exactly Was the Zip Code?

Yes, this is a more prevalent problem than it seems on the surface. Many people in India do not remember their “postal or zip” codes. You could be a recent migrant to the city, or in an age group where you simply forget. This complicates matters, since the exact same city name can be found in different states. In a geography as large as India, the numbers can be an eye-opener: one article counts 32 Rampurs in India, some of them in the same state! This is where the zip code matters. In large cities, something like a “Ghanta Ghar” or a “Hari Nagar” could be at multiple places within the same city. Sounds interesting?
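A first step is simply to find out whether a PIN code is present at all, so that addresses without one can be routed to a follow-up or a manual check. The sketch below assumes the usual six-digit format with a non-zero first digit, optionally written with a space in the middle. The function name and the sample inputs are made up for illustration.

```python
import re

# Six digits, not starting with 0, optionally written as "560 045".
# The look-arounds keep longer digit runs (such as phone numbers) from matching.
PIN_RE = re.compile(r"(?<!\d)[1-9]\d{2}\s?\d{3}(?!\d)")


def extract_pin(address):
    """Return the first PIN-like token in the address, or None if there is none."""
    match = PIN_RE.search(address)
    return re.sub(r"\s", "", match.group()) if match else None


# Sample inputs only; the PIN values are not verified against postal data.
print(extract_pin("Hari Nagar, New Delhi 110 064"))
print(extract_pin("Ghanta Ghar, call 9876543210"))
```

A pattern this simple will also pick up any other six-digit number in the text, so the result is best treated as a candidate to be checked against the city and state.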

Google! Why Did the Geocoding Change the Address Completely? I Just Added the Flat Number.

A lot of times, users are in the habit of adding prefixes like these…

  • house number, house#, house no., house num, h.no., ho. no. …
  • flat number, flat no., flat num, flat#, f.no., fo. no. …
  • room number, room num, room no., room #
  • shop number, shop num, shop no., shop #
  • flat no H-74, flat number H-74, flat number H 74, flat number H74

So why is that a problem?

Just add this extra description to your address, query Google for a geocoding lookup, and you will see that it sometimes goes haywire. The additional information can confuse the address lookup engine. With API-based queries, the responses (top 10) could be far away (sometimes by many kilometres) from the address that was meant. This has a direct impact on your shipping costs and on your route management and scheduling algorithms. For a startup, this can mean unnecessary cash burn.
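One possible mitigation is to pull the unit designator out of the address before sending it for lookup, and keep it aside for the delivery agent. The sketch below covers only the prefix styles listed above and simple unit ids such as "12" or "H-74". The pattern and the split_unit name are assumptions for this example, and no geocoding API is called here.

```python
import re

# Keyword part: "house no.", "flat#", "room number", "h.no.", "fo. no." and so on.
KEYWORD = (
    r"(?:(?:house|flat|room|shop)\b\s*(?:number|num|no|#)?\.?"
    r"|(?:ho|fo|h|f)\.\s*no\.?)"
)
# Unit id: an optional letter, an optional space or hyphen, digits, an optional letter.
UNIT_ID = r"[A-Za-z]?[\s-]?\d+[A-Za-z]?"
UNIT_RE = re.compile(
    r"\b" + KEYWORD + r"\s*[:\-]?\s*(" + UNIT_ID + r")\b[\s,]*",
    re.IGNORECASE,
)


def split_unit(address):
    """Return (unit, remaining address); unit is None when no designator is found."""
    match = UNIT_RE.search(address)
    if not match:
        return None, address.strip()
    unit = re.sub(r"[\s-]", "", match.group(1)).upper()
    rest = (address[:match.start()] + address[match.end():]).strip(" ,")
    return unit, rest


print(split_unit("flat no H-74, MG Road, Bengaluru"))
print(split_unit("h.no. 12, 3rd Cross"))
```

Only the remaining address would go to the lookup engine. The unit value is normalized so that "H-74", "H 74" and "H74" all end up as the same string.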

Now Wait—That’s the Phone Number Inside the Address?

Yes, this can happen too. Data can arrive with a phone number sitting inside the address fields. While a human being can figure that out, it can confuse address parsers as to what the field means. Some customers ship gifts to friends whose complete address they do not have. Asking would look odd, so they put the friend’s phone number in the shipping details. This is not that rare, since the customer “expects” the delivery person to call before delivering the package. Who said life is fair?
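A phone number can usually be lifted out with a pattern before the address reaches the parser. The sketch below assumes Indian mobile numbers: ten digits starting with 6 to 9, optionally prefixed with +91, 91 or 0. Landlines and other formats are not handled, and the pull_phone name is made up for this example.

```python
import re

# Optional +91 / 91 / 0 prefix, then ten digits that may be written as "98765 43210".
PHONE_RE = re.compile(r"(?<!\d)(?:\+?91[\s-]?|0)?([6-9]\d{4}[\s-]?\d{5})(?!\d)")


def pull_phone(address):
    """Return (phone, remaining address); phone is None when no number is found."""
    match = PHONE_RE.search(address)
    if not match:
        return None, address.strip()
    phone = re.sub(r"\D", "", match.group(1))
    remainder = address[:match.start()] + address[match.end():]
    # Re-join the comma-separated parts, dropping the one left empty by the removal.
    parts = [part.strip() for part in remainder.split(",") if part.strip()]
    return phone, ", ".join(parts)


print(pull_phone("Near Ghanta Ghar, 98765 43210, Meerut"))
```

Because the pattern insists on exactly ten digits, a six-digit PIN code in the same address is left alone.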

Did I Mention Which Floor?

Many customers will tell you which floor to deliver to, and in doing so can confuse your poor parser! The floor could be stated as…

  • first floor, 1st floor, Ist floor
  • mezzanine floor
  • top floor
  • ground floor
  • bottom floor

This kind of information about the floor can be a great aid to humans, but address parsers may not know how to qualify these keywords. This kind of data is difficult to classify. Nevertheless, the parser needs to classify these keywords intelligently.
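A small lookup table is often enough to get started on floor keywords. In the sketch below, the alias table, the output labels and the decision to treat "bottom floor" as the ground floor are all assumptions. A real table would grow from whatever your own data contains.

```python
import re

# Assumed canonical labels: "G" for ground, "M" for mezzanine, "TOP" for top floor.
FLOOR_ALIASES = {
    "ground": "G", "bottom": "G",
    "first": "1", "1st": "1", "ist": "1",
    "mezzanine": "M",
    "top": "TOP",
}
FLOOR_RE = re.compile(r"\b([a-z0-9]+)\s+floor\b", re.IGNORECASE)


def extract_floor(address):
    """Return a canonical floor label, or None if no known floor phrase is found."""
    for match in FLOOR_RE.finditer(address):
        label = FLOOR_ALIASES.get(match.group(1).lower())
        if label is not None:
            return label
    return None


print(extract_floor("Ist floor, H-74, Hari Nagar"))
print(extract_floor("Top Floor, above the pharmacy"))
```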

Are We “In Front of” the Landmark? Or “To the Side”? Or “Behind” It? Or Simply Adjacent? I Need to Be Accurate!

Ideally we should describe the landmark or the address in the fewest keywords that make the “most” sense. Clarity and brevity are key. However, as human beings, we want to give the other person as much information as possible about how to reach the actual place. We try to be informative, and that can be a problem to handle algorithmically.

Here are some keywords that you will find inside addresses, especially in landmark data, and that need to be handled…

  • [Adjacent to], [Adj to], [Adj. to], [Adj]
  • [Behind], [behind]
  • [In front of], [in front]
  • [Above]

Very informative to a human being but confusing to a stubborn parser.
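These relation keywords can be separated from the landmark itself, so that the landmark can be looked up on its own and the relation kept as a hint. The sketch below handles only the keywords listed above and assumes that the landmark runs up to the next comma. The canonical relation names and the extract_landmark function are made up for this example.

```python
import re

# Map every spelling of a relation to one assumed canonical name.
RELATION_ALIASES = {
    "adjacent to": "adjacent", "adj. to": "adjacent",
    "adj to": "adjacent", "adj": "adjacent",
    "behind": "behind",
    "in front of": "in_front", "in front": "in_front",
    "above": "above",
}
# Longest phrases first, so "in front of" wins over "in front".
_phrases = sorted(RELATION_ALIASES, key=len, reverse=True)
RELATION_RE = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in _phrases) + r")(?![a-z])\.?\s+([^,]+)",
    re.IGNORECASE,
)


def extract_landmark(address):
    """Return (relation, landmark), or None if no relation keyword is present."""
    match = RELATION_RE.search(address)
    if not match:
        return None
    return RELATION_ALIASES[match.group(1).lower()], match.group(2).strip()


print(extract_landmark("H-74, behind City Mall, Hari Nagar"))
print(extract_landmark("Adj. to SBI Bank, Hari Nagar"))
```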

It Is Not Calcutta! It Is Kolkata! Or Is It Kolkata?

With more than 25 Indian cities having changed their names (Baroda to Vadodara, Cochin to Kochi, Benaras to Varanasi), keeping up with the latest and best copy of the data (often called the golden copy) takes serious data cleaning and updating effort. This is not that difficult a problem to solve, as long as you have “proactive users” or a support center team that keeps feeding the tech team complaints, changes and updates reported by customers whenever the tech team has not kept itself up to date with recent geo-political changes.
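In code, the renames boil down to an alias table that maps every older name to the current one. The sketch below lists only the renames mentioned in this article. The table and function names are assumptions, and a real table would be maintained as part of the golden copy.

```python
# Old name (lowercase) -> current name. Only the renames mentioned in this article.
CITY_ALIASES = {
    "calcutta": "Kolkata",
    "baroda": "Vadodara",
    "cochin": "Kochi",
    "benaras": "Varanasi",
    "bangalore": "Bengaluru",
}


def canonical_city(name):
    """Return the current name for a known old name, else the input unchanged."""
    cleaned = name.strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)


print(canonical_city("Calcutta"))
print(canonical_city("Kolkata"))
```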

Hey UIDAI, Did You Give the GIS-Based PIN Code on the Aadhaar Card?

An interesting news article reported that three people in one family got different PIN codes because the UIDAI, when issuing Aadhaar cards, printed a GIS-based PIN on their cards. The postal service in India is an autonomous organization. It has sole ownership of, and responsibility for, assigning PIN codes to regions in India. However, there have been cases of people getting PIN codes derived from GIS data, as in that article. Although UIDAI authorities deny any role of theirs, such confusion has cropped up over the last few years. Thanks UIDAI! This was all the help we needed!

Address Standardization in the Indian Context

In the US, address standardization is an industry-wide practice; it is yet to take strong root in countries like India, although the Indian postal service is taking some steps. A standardized address would have fields mandated in a particular sequence, like number, locality, street, state, country, zip etc. However, there is no such mandate from the government or the postal service in India. Under different governments and different development plans, some cities have been divided into blocks and streets, while others have been divided into sectors and phases.

Can we standardize the format of addresses in India, whose nomenclature and legacy date back to British times?

Now That We Have “Spelt” the Cat, How Do We Skin the Cat Using Machine Learning?

Here is a multi-pronged approach that can be taken to get to better-looking data.

Cleaning Up the House

The first step in the entire process is creating the “golden copy” of the data, with clearly defined, manually labeled data. Without this “golden reference”, it is challenging to reach a high level of accuracy. There should be processes in place that “enrich” this golden copy with newer updates. Manual review should be in place before the golden copy gets updated.

This would become an essential step when creating a supervised model, where the data would need manual interventions and labeling for learning purposes.

Modeling the “Postman’s” Mental Models

One crucial step in the entire process is capturing the mental models by which the courier agent identifies or “maps” a given address to the final location. These need to be “simulated”, which calls for close interaction with the logistics team. The mental models are specific to the local demography, and the logic may change from region to region. They form the core of the engine for address parsing and geolocation mapping. While companies like MapMyIndia maintain data using deep mapping technology, translating user-driven address input into a pinpointed geolocation is a different problem to solve.

Learn, Learn, and Learn…

In the end, you have to play with the data to learn more from it. Ultimately you will have multiple models to choose from and optimize, and from them you can create a recommendation system that proposes locations for a given address. Each recommendation would carry a probability score stating how confident your system is in the match.

While there is more than one way to “skin” the cat, you have to find the best ones for every cat that’s thrown your way.
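One simple way to turn several models into a single ranked list is a weighted average of their scores. The sketch below is only one of many possible blending schemes. The model names, weights and scores are invented for the example, and a weighted average of this kind is a ranking score rather than a calibrated probability.

```python
def combine_scores(model_scores, weights):
    """Blend per-model scores into one ranked list of (candidate, score).

    model_scores: {model_name: {candidate: score between 0 and 1}}
    weights:      {model_name: relative weight of that model}
    A candidate that a model did not score counts as 0 for that model.
    """
    total_weight = sum(weights[name] for name in model_scores)
    blended = {}
    for name, scores in model_scores.items():
        for candidate, score in scores.items():
            blended[candidate] = blended.get(candidate, 0.0) + weights[name] * score
    ranked = [(candidate, total / total_weight) for candidate, total in blended.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up scores from two hypothetical models for the input "Rampur".
scores = {
    "spelling_model": {"Rampur (district A)": 0.9, "Rampur (district B)": 0.9},
    "pin_code_model": {"Rampur (district A)": 1.0},
}
weights = {"spelling_model": 0.5, "pin_code_model": 0.5}
for candidate, score in combine_scores(scores, weights):
    print(f"{candidate}: {score:.2f}")
```

Here the spelling alone cannot separate the two Rampurs, and it is the PIN code signal that breaks the tie.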