Heuristic Search for e-mails

AlNipper49

Huge Member
Dope
SoSH Member
Apr 3, 2001
44,840
Mtigawi
I have a client with a problem. He came to me for the answer, and I don't even know where to point him.

He has a large number of people's names and their cities. He also has a database of e-mail addresses.

He's not looking to spam them, but his business deals with thousands of signups a day, and approximately 20% of the time he gets e-mail addresses that are just written wrong. Because of the software he's running, he doesn't have the opportunity to check the e-mails when they're entered. He'd obviously like the right e-mails without dedicating his admin to calling people up to confirm, etc.

He has multiple other products / services which most of his customers also buy. So let's assume that the correct addresses for that 20% of bad e-mails exist in these other data sources.

Mind you, this is a "nice to have". He's just trying to save some manpower.

I've done some looking into data enrichment before, but frankly it's not my forte. What I really need is a system where I can feed in a CSV full of names and addresses and have the software heuristically try to match it up with the other data sources. It could theoretically be done with something like a vlookup, but given the amount of data here I'm not confident that would scale to the level he's looking to expand to.
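To make the idea concrete, here is a rough sketch of the kind of CSV match described above, using a dictionary index instead of a vlookup so each lookup stays cheap as the data grows. The column names (`name`, `city`, `email`), the sample rows, and the helper names are all assumptions for illustration, and it only handles exact matches on a normalized name and city; fuzzy matching would have to be layered on top.

```python
import csv
import io

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(text.lower().split())

def build_index(reference_rows):
    """Map (name, city) -> list of e-mails from the trusted data source."""
    index = {}
    for row in reference_rows:
        key = (normalize(row["name"]), normalize(row["city"]))
        index.setdefault(key, []).append(row["email"])
    return index

def match_signups(signup_rows, index):
    """Attach candidate e-mails from the index to each signup row."""
    results = []
    for row in signup_rows:
        key = (normalize(row["name"]), normalize(row["city"]))
        results.append({**row, "candidates": index.get(key, [])})
    return results

# Hypothetical sample data standing in for the real CSV files.
signups = io.StringIO(
    "name,city,email\n"
    "Jon Smith,Boston,jon.smtih@gmial.com\n"
    "Ann Lee,Salem,ann@lee.example\n"
)
reference = io.StringIO(
    "name,city,email\n"
    "jon  smith,boston,jon.smith@gmail.com\n"
)

index = build_index(csv.DictReader(reference))
matched = match_signups(csv.DictReader(signups), index)
for row in matched:
    print(row["name"], row["email"], row["candidates"])
```

Building the index is one pass over the reference data and each signup is then a single dictionary lookup, which is the part a spreadsheet lookup tends to struggle with at volume.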

I'm playing around with Open Semantic right now. It seems powerful, but I really don't know how to get from Point A to Point Z.

Thanks!

PS: There are companies out there that can do this for like .06/record so that's always an option.
 

InstaFace

The Ultimate One
SoSH Member
Sep 27, 2016
21,591
Pittsburgh, PA
The problem with the outsourcers is that they'll put it through Mechanical Turk or the equivalent, and the data quality you get on the other end may not be any better. Excel / vlookup isn't going to do it either, unless you've got crazy VB macro skills.

You could try some string-distance algorithms. Jaro-Winkler is probably the best-suited to this task; play around with it a bit to see what thresholds give you the best results. Depends on how many records we're talking about as to whether this is worthwhile, though.
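For anyone who wants to experiment, here is a minimal pure-Python sketch of Jaro-Winkler plus a small helper that picks the closest known address. The helper name `best_match`, the 0.9 threshold, and the sample addresses are assumptions for illustration only; the right threshold is something you'd have to tune on real data.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

def best_match(bad_email, known_emails, threshold=0.9):
    """Return the closest known e-mail, or None if nothing clears the threshold."""
    best, best_score = None, threshold
    for candidate in known_emails:
        score = jaro_winkler(bad_email.lower(), candidate.lower())
        if score >= best_score:
            best, best_score = candidate, score
    return best

# Hypothetical addresses, not real data.
known = ["jon.smith@gmail.com", "jane.doe@example.com"]
print(best_match("jon.smtih@gmial.com", known))
```

Comparing every bad address against every known address gets expensive quickly, so in practice you'd probably narrow the candidates first (for example, to records with the same name and city) and only then score them.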
 

tonyandpals

Well-Known Member
Lifetime Member
SoSH Member
Mar 18, 2004
7,853
Burlington
How do you determine what is a correct email address? Are all the known good emails in the CSV file and you're trying to match based on best guess from your names files?
 

AlNipper49

Huge Member
Dope
SoSH Member
Apr 3, 2001
44,840
Mtigawi
tonyandpals said:
How do you determine what is a correct email address? Are all the known good emails in the CSV file and you're trying to match based on best guess from your names files?
It's a best-effort type of thing. My client gets a list of "bad" email addresses. The service he is using right now charges about .06/record and only matches about 33% of them, and from talking to them, they basically have the issue that InstaFace references above.

(it's a service that notifies car warranty holders of 'high priority' recalls)

Hell, at .06/record they do maybe 100,000 a month. He'd be just as happy to send that business my way if I could write something.