
Record Linkage

PyDedupe is a Python library for performing record linkage that I wrote at MIH.  We use it to identify similar groups of records in our data files and databases. I received permission to release the core library under the open source GNU General Public License.  

The source repository is hosted on GitHub.  A packaged release is not available.

PyDedupe was written to remove the inherent limitations of existing record linkage tools by supporting a general model of tabular data and decoupling the input formats from the algorithm. FEBRL, the only other Python record linkage library, supported only scalar-valued fields. PyDedupe supports row transformations for generated fields, multi-valued fields derived from delimited values in a column or combined from several columns, and compound values (such as geographic coordinates). The API operates on iterables of tuples, so it is decoupled from input formats (such as a database or delimited text file). A convenience module is provided for loading records from CSV files and rewriting them with similar records grouped together.
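The field types described above can be illustrated with a short sketch in plain Python. This is not PyDedupe's API: the `transform` and `jaccard` functions, the column layout and the sample rows are all hypothetical, chosen only to show a generated field, a multi-valued field and a compound field derived from an iterable of tuples.

```python
def transform(rows):
    """Yield one derived tuple per input row (hypothetical column layout)."""
    for name, landlines, mobile, lat, lon in rows:
        # Multi-valued field: split a delimited column, then combine it
        # with a value taken from a second column.
        phones = {p.strip() for p in landlines.split(";") if p.strip()}
        if mobile.strip():
            phones.add(mobile.strip())
        yield (
            " ".join(name.lower().split()),  # generated field: normalised name
            frozenset(phones),               # multi-valued field
            (float(lat), float(lon)),        # compound field: coordinates
        )


def jaccard(a, b):
    """Set overlap in [0, 1]; one possible similarity for multi-valued fields."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


rows = [
    ("  Jon  Smith", "021 555 0100; 021 555 0101", "082 555 0199", "-33.92", "18.42"),
    ("John Smith", "021 555 0100", "", "-33.93", "18.42"),
]
prepared = list(transform(rows))
```

Because `transform` only consumes an iterable of tuples, the same function would work whether the rows come from a CSV reader or a database cursor.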

The general strategy for record linkage is:
  1. Index records into blocks
  2. Compare all pairs of records in each block with a similarity function
  3. Cluster record pairs into "matches" and "non-matches" from the vector of similarity values.
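
The three steps above can be sketched with the standard library alone. This is an illustration of the general strategy, not PyDedupe code: the blocking key, the `difflib`-based similarity, the fixed threshold and the sample records are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    ("Jon Smith", "Cape Town"),
    ("John Smith", "Cape Town"),
    ("Jane Doe", "Durban"),
    ("Joan Smythe", "Durban"),
    ("Mary Jones", "Durban"),
]


def block_key(record):
    # Step 1: a cheap index key (here, the first letter of the name).
    return record[0][:1].lower()


def similarity(a, b):
    # Step 2: one similarity value in [0, 1] per field.
    return tuple(
        SequenceMatcher(None, x.lower(), y.lower()).ratio() for x, y in zip(a, b)
    )


def is_match(vector, threshold=0.8):
    # Step 3: a stand-in classifier, mean similarity against a fixed threshold.
    return sum(vector) / len(vector) >= threshold


blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec)].append(rec)

matches, non_matches = [], []
for block in blocks.values():
    # Only pairs inside the same block are ever compared.
    for a, b in combinations(block, 2):
        (matches if is_match(similarity(a, b)) else non_matches).append((a, b))
```

Blocking is what keeps the comparison step tractable: the five records here would give ten pairs if compared exhaustively, but only the six pairs inside the "j" block are scored.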
The PyDedupe API can be used at multiple levels of abstraction:
  1. Low-level functions to
    1. Normalise values
    2. Generate indexed values
    3. Compare values for similarity
    4. Do binary classification of floating-point vectors
  2. Higher level classes to
    1. Index records into blocks
    2. Compare pairs of records for similarity vectors
    3. Classify pairs of records as matches/non-matches
    4. Group records together
  3. Highest level API to
    1. Use a record linkage strategy
    2. Accept records from CSV input and write groups to CSV output
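
As an example of the lowest level, binary classification of floating-point vectors can be sketched as a two-centroid clustering. Whether PyDedupe classifies this way is not stated here, so treat `two_means` as a hypothetical stand-in: it seeds one centroid at all zeros ("non-match") and one at all ones ("match"), then refines them.

```python
import math


def two_means(vectors, iterations=10):
    """Label each similarity vector True (match) or False (non-match)."""
    dims = len(vectors[0])
    # Seed the centroids at the two extremes of the similarity range.
    centroids = [tuple(0.0 for _ in range(dims)), tuple(1.0 for _ in range(dims))]
    labels = []
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            0 if math.dist(v, centroids[0]) <= math.dist(v, centroids[1]) else 1
            for v in vectors
        ]
        # Move each centroid to the mean of its members, if it has any.
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return [lab == 1 for lab in labels]


vectors = [(0.95, 1.0), (0.2, 0.1), (0.9, 0.85), (0.3, 0.0)]
decisions = two_means(vectors)
```

Unlike a fixed threshold, this approach adapts the decision boundary to the data, at the cost of assuming that both matches and non-matches are present.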
Record linkage on CSV files requires writing a small script that defines the strategies for indexing, comparison and classification, then calls a high-level function with the name of the CSV input file and a folder in which to write the output.   Records may be linked either within a single file, or between two files.
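
The shape of such a script can be sketched with the standard library alone. The `link_csv` function, its threshold and the `groups.csv` output name are hypothetical and do not reproduce PyDedupe's high-level function. For brevity the sketch compares all pairs on the first column within a single file, without blocking, and groups linked records transitively.

```python
import csv
import os
from difflib import SequenceMatcher
from itertools import combinations


def link_csv(input_path, output_dir, threshold=0.85):
    """Group similar rows of one CSV file and write them out with a group id."""
    with open(input_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [tuple(r) for r in reader]

    # Union-find over row indices, so that linked pairs merge into groups.
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(rows)), 2):
        score = SequenceMatcher(None, rows[i][0].lower(), rows[j][0].lower()).ratio()
        if score >= threshold:
            parent[find(i)] = find(j)

    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, "groups.csv")
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["group"] + header)
        # Sorting by group id writes similar records next to each other.
        for i in sorted(range(len(rows)), key=find):
            writer.writerow([find(i)] + list(rows[i]))
    return output_path
```

A call such as `link_csv("people.csv", "output")` (file names assumed) would then write `output/groups.csv` with a leading `group` column.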

Record linkage on a database requires writing additional code to retrieve tuples, use the PyDedupe API to index, compare and classify them, and then either tag the pairs of linked records in the database or present a user interface for manually merging them.
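
A minimal sketch of the database workflow, using an in-memory SQLite database. The `person` and `link` tables, the `score` function and the threshold are all hypothetical, and the comparison is a plain `difflib` stand-in for the indexing, comparison and classification that the PyDedupe API would perform.

```python
import sqlite3
from difflib import SequenceMatcher
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE link (id_a INTEGER, id_b INTEGER, score REAL);
    """
)
conn.executemany(
    "INSERT INTO person (name, city) VALUES (?, ?)",
    [("Jon Smith", "Cape Town"), ("Mary Jones", "Durban"), ("John Smith", "Cape Town")],
)

# Retrieve the records as plain tuples.
rows = conn.execute("SELECT id, name, city FROM person ORDER BY id").fetchall()


def score(a, b):
    # Mean per-field similarity, skipping the id column.
    ratios = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a[1:], b[1:])
    ]
    return sum(ratios) / len(ratios)


# Tag each linked pair by recording it in the link table.
links = []
for a, b in combinations(rows, 2):
    s = score(a, b)
    if s >= 0.85:
        links.append((a[0], b[0], s))
conn.executemany("INSERT INTO link (id_a, id_b, score) VALUES (?, ?, ?)", links)
conn.commit()
```

Recording the pairs in a separate table leaves the source records untouched, which suits a later manual review or merge step.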
