record linkage that I wrote at MIH. We use it to identify similar groups of records in our data files and databases. I received permission to release the core library under the open source GNU General Public License.
The source repository is hosted on GitHub. A packaged release is not available.
PyDedupe was written to remove the inherent limitations of existing record linkage tools by supporting a general model of tabular data and decoupling the input formats from the algorithm. FEBRL, the only other Python record linkage library, supported only scalar-valued fields. PyDedupe supports row transformations for generated fields, multi-valued fields derived from delimited values in a column or combined from several columns, and compound values (such as geographic coordinates). The API operates on iterables of tuples, so it is decoupled from input formats (such as a database or delimited text file). A convenience module is provided for loading records from CSV files and re-writing them with similar records grouped together.
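To make the field model concrete, here is a minimal sketch in plain Python. It does not use the PyDedupe API: the helper names (`split_multi`, `combine_columns`, `coordinates`, `transform`), the column layout and the sample row are all hypothetical, chosen only to show a multi-valued field from a delimited column, a multi-valued field combined from several columns, and a compound value, produced from an iterable of tuples.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, ...]


def split_multi(value: str, sep: str = ";") -> frozenset:
    """Multi-valued field derived from delimited values in one column."""
    return frozenset(v.strip().lower() for v in value.split(sep) if v.strip())


def combine_columns(*values: str) -> frozenset:
    """Multi-valued field combined from several columns (blanks dropped)."""
    return frozenset(v.strip().lower() for v in values if v.strip())


def coordinates(lat: str, lon: str) -> Tuple[float, float]:
    """Compound value: a (latitude, longitude) pair."""
    return (float(lat), float(lon))


def transform(rows: Iterable[Row]) -> Iterator[tuple]:
    """Row transformation: turn raw string tuples into typed field tuples.

    Assumed (hypothetical) column order:
    name, emails, phone1, phone2, lat, lon
    """
    for name, emails, phone1, phone2, lat, lon in rows:
        yield (
            name.strip().lower(),              # generated scalar field
            split_multi(emails),               # multi-valued, one column
            combine_columns(phone1, phone2),   # multi-valued, several columns
            coordinates(lat, lon),             # compound value
        )


# Any iterable of tuples works as input: a CSV reader, a DB cursor, a list.
rows = [
    ("Acme Ltd", "info@acme.test; sales@acme.test", "555-0100", "", "-33.92", "18.42"),
]
records = list(transform(rows))
```

Because `transform` only consumes an iterable of tuples, the same function could be fed from `csv.reader` or a database cursor without change, which is the kind of decoupling described above.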
The general strategy for record linkage is:
Record linkage on a database requires writing additional code to retrieve tuples, use the PyDedupe API to index, compare and classify them, and then either tag the pairs of linked records in the database or present a user interface for manually merging them.
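The database workflow described above can be sketched roughly as follows. This is an illustration only, not PyDedupe code: the `index`, `compare` and `classify` functions are hypothetical stand-ins written with the standard library (`difflib` for string similarity), and the `person` table, its `linked_to` column and the 0.85 threshold are assumptions made for the example.

```python
import sqlite3
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical table held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT, linked_to INTEGER)"
)
conn.executemany(
    "INSERT INTO person (id, name, city) VALUES (?, ?, ?)",
    [
        (1, "John Smith", "Cape Town"),
        (2, "Jon Smith", "Cape Town"),
        (3, "Mary Jones", "Durban"),
        (4, "John Smith", "Durban"),
    ],
)


def index(records):
    """Index: group records by city so only plausible pairs are compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[2].lower()].append(rec)
    return blocks


def compare(a, b):
    """Compare: similarity of the two name fields, between 0.0 and 1.0."""
    return SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()


def classify(score, threshold=0.85):
    """Classify: treat a pair as linked when the score reaches the threshold."""
    return score >= threshold


# 1. Retrieve tuples from the database.
records = conn.execute("SELECT id, name, city FROM person ORDER BY id").fetchall()

# 2. Index, compare and classify the tuples.
linked = []
for block in index(records).values():
    for a, b in combinations(block, 2):
        if classify(compare(a, b)):
            linked.append((a[0], b[0]))

# 3. Tag each linked pair: the later record points at the earlier one.
conn.executemany("UPDATE person SET linked_to = ? WHERE id = ?", linked)
conn.commit()
```

In this sample data, records 1 and 2 fall in the same block and have similar names, so they are tagged as a pair. Records 1 and 4 share a name but sit in different blocks and are never compared, which shows how the choice of index key trades recall for fewer comparisons.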