Abstract:
At present, many public and private organizations collect huge amounts of data. Later, these data are processed and analyzed to discover interesting knowledge that supports proper decision making. Developing efficient techniques for cleaning and linking large data sets to support knowledge discovery has gained high importance in both academia and industry. Solving record linkage problems with an incremental approach is a relatively new research area. Few studies have been performed in the field of incremental record linkage, and those target linkage quality or efficiency. However, the privacy issue regarding the incremental approach has not yet been discussed. Privacy preservation is essential when linking sensitive records, e.g., health records and financial records. In this regard, we have come up with a novel concept that combines privacy-preserving techniques with an incremental record linkage approach.
In this thesis, we focus on the healthcare domain. A problem with real health data is that they are noisy by nature. Another problem with health data is the presence of missing values. We propose a novel phonetic algorithm to reduce the noise in patients' names and thereby improve the performance of record linkage. For handling missing data, we extend the widely used MICE algorithm to impute missing values of both categorical and numeric attributes.
For preserving privacy, we use different privacy techniques such as phonetic encoding, hashing, and generalization. For handling incremental updates and internal linkage, we use the Naive incremental clustering approach. We perform various experiments to test the privacy and linkage quality of our proposed framework. We compare our work with the existing incremental record linkage framework and also with existing privacy-preserving record linkage techniques. It is apparent from our results that, other than a small trade-off in linkage quality, our framework works better as a combined package of privacy and linkage solution, which no existing framework yet provides.
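
As a rough illustration of how the three privacy techniques named above can be combined on a single record, the following minimal Python sketch may help. It is not the framework proposed in this thesis: standard American Soundex stands in for the novel phonetic algorithm, and the function names, the record fields (`name`, `birth_year`), and the secret key are all hypothetical choices made only for this example.

```python
import hashlib
import hmac

# Standard American Soundex digit groups (a stand-in for the thesis's own
# phonetic algorithm, which is not reproduced here).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Phonetic encoding: map similar-sounding names to the same 4-character code."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            prev = code
    return (encoded + "000")[:4]


def keyed_hash(value, key):
    """Hashing: HMAC-SHA256 so the code cannot be read back without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def generalize_year(year, width=10):
    """Generalization: replace an exact year with the start of its interval."""
    return year - (year % width)


def mask_record(record, key):
    """Apply all three techniques to a hypothetical {'name', 'birth_year'} record."""
    return {
        "name_token": keyed_hash(soundex(record["name"]), key),
        "birth_decade": generalize_year(record["birth_year"]),
    }


if __name__ == "__main__":
    key = b"example-shared-secret"  # hypothetical key agreed on by the data holders
    a = mask_record({"name": "Robert", "birth_year": 1984}, key)
    b = mask_record({"name": "Rupert", "birth_year": 1987}, key)
    print(a == b)  # the two noisy variants produce identical masked records
```

In this sketch, two spellings of a similar-sounding name born in the same decade yield identical masked values, so they can be compared without exposing the original name or exact birth year. The same coarsening is also where a loss in linkage quality can arise, since distinct individuals may collide on the same masked values.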