Comprehensive Crash Course - Data Linkage

Introduction

Many big data companies struggle with record linkage, facing challenges that range from defining the appropriate matching logic to making the process scalable and cost-effective.

Record linkage is a method in data management that identifies and links related records within or across datasets, thereby greatly improving data quality and consistency. This technology is essential in the era of big data, as it aids organizations in handling large volumes of data by ensuring data accuracy and integrity. Since its development, record linkage has evolved to incorporate more advanced algorithms and techniques, enhancing the efficiency and accuracy of data management.
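
As a minimal illustration (with made-up records), consider two entries from different systems that describe the same person but disagree in spelling and formatting; record linkage is the process of deciding that they refer to the same real-world entity:

```python
from datetime import datetime

# Two hypothetical records from different systems that describe the same person.
record_a = {"name": "Jon Smith",  "dob": "1985-03-02", "city": "New York"}
record_b = {"name": "John Smith", "dob": "02/03/1985", "city": "NYC"}

# Record linkage asks: do these refer to the same real-world entity?
# A first step is normalization, e.g. parsing both dates into one format...
dob_a = datetime.strptime(record_a["dob"], "%Y-%m-%d").date()
dob_b = datetime.strptime(record_b["dob"], "%d/%m/%Y").date()
print(dob_a == dob_b)  # True

# ...followed by similarity scoring on noisy fields such as the name and city.
```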

The real-world applications of this technology underscore its significance in enhancing data management across various industries. Use cases include pharmacies, media, and analytics.

The problem and task definition

Let's assume the FBI requires us to build a system that tracks the movements of individuals from a list of wanted fugitives at airports. Hypothetically, these fugitives may travel on commercial flights using different names, dates of birth, and places of origin. However, they cannot significantly alter the image on their passport, nor can they change their gender, height, or eye color. Our task is to collect daily data streams from airport passport control scanners and compare them with the FBI's central database to identify and alert about any matches.
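
To make the matching rule concrete, here is a hedged sketch (not the final implementation used later in the tutorial): it treats gender, height, and eye color as hard constraints, and compares the name only fuzzily, since fugitives may travel under altered names. The field names and thresholds below are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical record layouts; the real schema comes with the JSONL files below.
fugitive = {"name": "Ivan Petrov", "gender": "M", "height_cm": 183, "eye_color": "brown"}
scan     = {"name": "I. Petroff",  "gender": "M", "height_cm": 183, "eye_color": "brown"}

def is_candidate_match(fugitive: dict, scan: dict, name_threshold: float = 0.6) -> bool:
    """Immutable attributes must match exactly; the (easily faked) name only fuzzily."""
    if fugitive["gender"] != scan["gender"]:
        return False
    if abs(fugitive["height_cm"] - scan["height_cm"]) > 2:  # allow small measurement error
        return False
    if fugitive["eye_color"] != scan["eye_color"]:
        return False
    name_score = SequenceMatcher(None, fugitive["name"].lower(), scan["name"].lower()).ratio()
    return name_score >= name_threshold

print(is_candidate_match(fugitive, scan))  # True for these example records
```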

Building roadmap

  • In this tutorial we will not cover data-stream handling and data normalization (those will come in another post). We will assume that we have a local dump of the central database and, say, one day's list of passengers from two airports. This gives us three JSONL files (in the next iteration we will improve the project and switch to the Parquet format), which you can download from here.
  • Next, we will build a web service (in an enhanced version we will build a gRPC service). It is important that the service is a distributed system sitting behind a load balancer; a minimal sketch of such a service appears after this list.
  • At this point we will build a workflow that loads the data from the JSONL files into a Postgres database with Spark (see the PySpark sketch after this list). Why these technologies were chosen will be elaborated later in the post.
  • Finally, we will use this data to send requests to the web service and find matches.
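
As a preview of the loading step, here is a hedged sketch, assuming a local Postgres instance and made-up file, table, and credential names (the actual workflow is built later in the post):

```python
from pyspark.sql import SparkSession

# Assumed file and connection details; adjust to your environment.
JSONL_PATH = "data/airport_1_passengers.jsonl"
JDBC_URL = "jdbc:postgresql://localhost:5432/linkage"

spark = (
    SparkSession.builder
    .appName("load-passengers")
    # Ship the Postgres JDBC driver with the job.
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Spark reads JSONL natively: one JSON object per line.
passengers = spark.read.json(JSONL_PATH)

(
    passengers.write
    .format("jdbc")
    .option("url", JDBC_URL)
    .option("dbtable", "passengers")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)

spark.stop()
```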
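
For the web-service step, a minimal sketch of a matching endpoint could look as follows. FastAPI is used here purely for illustration; the endpoint path, payload fields, and the in-memory fugitive list are assumptions, and in the real setup several instances of the service would run behind a load balancer with candidates coming from Postgres:

```python
from difflib import SequenceMatcher
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScanRecord(BaseModel):
    name: str
    gender: str
    height_cm: int
    eye_color: str

# Stub data; in the full tutorial the candidates are loaded from Postgres.
FUGITIVES = [
    {"name": "Ivan Petrov", "gender": "M", "height_cm": 183, "eye_color": "brown"},
]

def looks_like(fugitive: dict, scan: ScanRecord, name_threshold: float = 0.6) -> bool:
    """Exact match on immutable attributes, fuzzy match on the name."""
    if (fugitive["gender"], fugitive["eye_color"]) != (scan.gender, scan.eye_color):
        return False
    if abs(fugitive["height_cm"] - scan.height_cm) > 2:
        return False
    score = SequenceMatcher(None, fugitive["name"].lower(), scan.name.lower()).ratio()
    return score >= name_threshold

@app.post("/match")
def match(scan: ScanRecord) -> dict:
    hits = [f for f in FUGITIVES if looks_like(f, scan)]
    return {"match": bool(hits), "candidates": hits}
```

A client (or our replay script for the final step) could then shoot a scanned record at the service:

```python
import requests  # assumes the service above runs locally on port 8000

resp = requests.post(
    "http://localhost:8000/match",
    json={"name": "I. Petroff", "gender": "M", "height_cm": 183, "eye_color": "brown"},
)
print(resp.json())  # {'match': True, 'candidates': [{'name': 'Ivan Petrov', ...}]}
```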

Flow Diagram


Logic for solution

Before building our own solution, we will briefly review existing record linkage tools such as Splink, note where they fall short for this use case, and then explain the matching logic we will apply here.

Code

The full tutorial follows in three parts: preparation, code, and results.

The complete code for this tutorial is available on GitHub.