Topic

Introduction to Project Magellan
By: Ancy Phillip
Date: April 13, 2017, 7 p.m.

Day by day, the world is becoming more data driven, making data science extremely popular. Data wrangling and data analysis are two important stages in any data science problem, and entity matching (EM) is critical in the latter. EM has been a long-standing challenge in data management, yet most current EM work focuses only on developing matching algorithms. Magellan addresses this gap: it is a new kind of EM system, open sourced and built on top of the PyData ecosystem. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users carry out these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do. (3) The tools are built on top of the data analysis and Big Data stacks in Python, allowing Magellan to borrow a rich set of capabilities in data cleaning, information extraction (IE), visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick "patching" of the system. Magellan is used at Walmart Labs, Johnson Controls, and Marshfield Clinic, and as a teaching tool in UWM classes.
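For readers new to EM, the blocking and matching steps mentioned above can be illustrated with a minimal sketch in plain Python. This is not Magellan's API: the toy tables, the same-city blocking rule, the 0.8 threshold, and the function names block, name_similarity, and match are all assumptions made up for illustration, and only the Python standard library is used.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy tables; the records and field names are invented for illustration.
table_a = [
    {"id": "a1", "name": "Dave Smith", "city": "Madison"},
    {"id": "a2", "name": "Joe Wilson", "city": "San Jose"},
]
table_b = [
    {"id": "b1", "name": "David Smith", "city": "Madison"},
    {"id": "b2", "name": "Joseph Wilson", "city": "Austin"},
    {"id": "b3", "name": "Dan Brown", "city": "Madison"},
]


def block(left, right):
    """Blocking: keep only record pairs that share the same city.

    This cheap filter avoids scoring every pair in the cross product.
    """
    return [(l, r) for l, r in product(left, right) if l["city"] == r["city"]]


def name_similarity(l, r):
    """Return a string similarity in [0, 1] for the two name fields."""
    return SequenceMatcher(None, l["name"].lower(), r["name"].lower()).ratio()


def match(candidates, threshold=0.8):
    """Matching: declare a pair a match when name similarity meets the threshold.

    The threshold is an assumed value, not one taken from Magellan.
    """
    return [
        (l["id"], r["id"])
        for l, r in candidates
        if name_similarity(l, r) >= threshold
    ]


candidates = block(table_a, table_b)
matches = match(candidates)
print(f"{len(candidates)} candidate pairs after blocking")
print("matches:", matches)
```

In a real pipeline the hand-written threshold rule would typically be replaced by a learned matcher, and steps such as sampling, labeling, and debugging would surround blocking and matching; the sketch only shows the two core steps.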