This paper addresses the name matching (duplicate detection) problem in the US patent dataset. The dataset contains more than 400K unique company name spellings. To solve the matching problem, we choose an appropriate string similarity measure and clustering approach and estimate their parameters. Finally, we apply them to the whole dataset and estimate the positive and negative rates.
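As a rough illustration of the pipeline the abstract describes (a string similarity measure followed by clustering), here is a minimal sketch. The abstract names neither the measure nor the clustering method, so everything in it is an assumption: `difflib.SequenceMatcher` stands in for the similarity measure, single-linkage grouping via union-find stands in for the clustering approach, and the `normalize`, `similarity`, and `cluster_names` helpers, the 0.85 threshold, and the sample names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and replace punctuation with spaces so trivial variants align."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Assumed stand-in measure in [0, 1]; not the measure chosen in the paper."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_names(names: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Single-linkage clustering: link every pair with similarity >= threshold."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    groups: dict[int, list[str]] = {}
    for idx, name in enumerate(names):
        groups.setdefault(find(idx), []).append(name)
    return list(groups.values())


# Hypothetical sample spellings, not taken from the patent dataset.
sample = [
    "Hewlett-Packard Company",
    "Hewlett Packard Company",
    "Hewlett-Packard Compnay",
    "International Business Machines",
]
clusters = cluster_names(sample)
```

The all-pairs loop is only practical for small samples: 400K names give roughly 8 × 10^10 pairs, so a run over the whole dataset would need some way to limit the candidate pairs that are compared. The threshold is the kind of parameter the abstract says is estimated, and its value here is arbitrary.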
Source: HP Labs