No Ground Truth? No Problem: Improving Administrative Data Linking Using Active Learning and a Little Bit of Guile / Sarah Tahamont, Zubin Jelveh, Melissa McNeill, Shi Yan, Aaron Chalfin, Benjamin Hansen.

Material type: Text
Series: Working Paper Series (National Bureau of Economic Research) ; no. w31100.
Publication details: Cambridge, Mass. National Bureau of Economic Research 2023.
Description: 1 online resource: illustrations (black and white)
Other classification:
  • C15
  • C88
Available additional physical forms:
  • Hardcopy version available to institutional subscribers
Abstract: While linking records across large administrative datasets ["big data"] has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms which use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when an algorithm has access to "ground-truth" examples -- matches which can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use "active learning" algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm popular intuition that data linking can be dramatically improved with the availability of ground truth examples. But critically, in many real-world applications, only a relatively small number of tactically-selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground truth examples using a readily available off-the-shelf tool.
Holdings
Item type: Working Paper
Home library: Biblioteca Digital
Collection: Colección NBER
Call number: nber w31100
Status: Not For Loan
Total holds: 0

April 2023.

System requirements: Adobe [Acrobat] Reader required for PDF files.

Mode of access: World Wide Web.

Print version record
