Big Gorilla:
A Hands-on Tutorial

The main goal of this tutorial is to highlight the tools that BigGorilla provides for “entity matching” problems. The workflow presented here integrates two movie datasets that are acquired from different sources. The entity-matching step is discussed in the last part of the tutorial (i.e., Part 4), but we recommend readers to read Parts 1-3 where we showcase how existing python packages can be deployed to prepare the data for the entity-matching task.

Part 1: Data Acquistion

We will start by using urllib, a popular python package for fetching data across the web, to download the datasets that we need for this tutorial.

Step 1: Downloading the “Kaggle 5000 Movie Dataset”

The desired dataset is a .csv file with a url that is specified in the code snippet below.

In [1]:

				
					# Importing urlib
import urllib
import os

# Creating the data folder
if not os.path.exists('./data'):
	os.makedirs('./data')

# Obtaining the dataset using the url that hosts it
kaggle_url = 'https://github.com/sundeepblue/movie_rating_prediction/raw/master/movie_metadata.csv'
if not os.path.exists('./data/kaggle_dataset.csv'):     # avoid downloading if the file exists
	response = urllib.urlretrieve(kaggle_url, './data/kaggle_dataset.csv')
				
			

Step 2: Downloading the “IMDB Plain Text Data”

The IMDB Plain Text Data (see here) is a collection of files where each files describe one or a few attributes of a movie. We are going to focus on a subset of movie attribues which subsequently means that we are only interested in a few of these files which are listed below:

  • genres.list.gz
  • ratings.list.gz

** Note: The total size of files mentioned above is roughly 30M. Running the following code may take a few minutes.

In [2]:

				
					import gzip

# Obtaining IMDB's text files
imdb_url_prefix = 'ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/'
imdb_files_list = ['genres.list.gz', 'ratings.list.gz']
for name in imdb_files_list:
	if not os.path.exists('./data/' + name):
		response = urllib.urlretrieve(imdb_url_prefix + name, './data/' + name)
		urllib.urlcleanup()   # urllib fails to download two files from a ftp source. This fixes the bug!
		with gzip.open('./data/' + name) as comp_file, open('./data/' + name[:-3], 'w') as reg_file:
			file_content = comp_file.read()
			reg_file.write(file_content)
				
			

Step 3: Downloading the “IMDB Prepared Data”

During this tutorial, we discuss how the contents of genres.list.gz and ratings.list.gz files can be integrated. However, to make the tutorial more concise, we avoid including the same process for all the files in the “IMDB Plain Text Data”. The “IMDB Prepared Data” is the dataset that we obtained by integrating a number of files from the “IMDB Plain Text Data” which we will use during later stages of this tutorial. The following code snippet downloads this dataset.

In [3]:

				
					imdb_url = 'https://anaconda.org/BigGorilla/datasets/1/download/imdb_dataset.csv'
if not os.path.exists('./data/imdb_dataset.csv'):     # avoid downloading if the file exists
	response = urllib.urlretrieve(kaggle_url, './data/imdb_dataset.csv')
				
			

Part 2: Data Extraction

The “Kaggle 5000 Movie Dataset” is stored in a .csv file which is alreday structured and ready to use. On the other hand, the “IMDB Plain Text Data” is a collection of semi-structured text files that need to be processed to extract the data. A quick look at the first few lines of each files shows that each file has a different format and has to be handled separately.

Content of “ratings.list” data file

In [4]:

				
					with open("./data/ratings.list") as myfile:
	head = [next(myfile) for x in range(38)]
print (''.join(head[28:38]))   # skipping the first 28 lines as they are descriptive headers
				
			
0000000125  1728818   9.2  The Shawshank Redemption (1994)
0000000125  1181412   9.2  The Godfather (1972)
0000000124  810055   9.0  The Godfather: Part II (1974)
0000000124  1714042   8.9  The Dark Knight (2008)
0000000133  461310   8.9  12 Angry Men (1957)
0000000133  885509   8.9  Schindler's List (1993)
0000000123  1354135   8.9  Pulp Fiction (1994)
0000000124  1241908   8.9  The Lord of the Rings: The Return of the King (2003)
0000000123  514540   8.9  Il buono, il brutto, il cattivo (1966)
0000000133  1380148   8.8  Fight Club (1999)
Content of the “genres.list” data file

In [5]:

				
					with open("./data/genres.list") as myfile:
	head = [next(myfile) for x in range(392)]
print (''.join(head[382:392]))   # skipping the first 382 lines as they are descriptive header
				
			
"!Next?" (1994)						Documentary
"#1 Single" (2006)					Reality-TV
"#15SecondScare" (2015)					Horror
"#15SecondScare" (2015)					Short
"#15SecondScare" (2015)					Thriller
"#15SecondScare" (2015) {Who Wants to Play with the Rabbit? (#1.2)}	Drama
"#15SecondScare" (2015) {Who Wants to Play with the Rabbit? (#1.2)}	Horror
"#15SecondScare" (2015) {Who Wants to Play with the Rabbit? (#1.2)}	Short
"#15SecondScare" (2015) {Who Wants to Play with the Rabbit? (#1.2)}	Thriller
"#1MinuteNightmare" (2014)				Horror

Step 1: Extracting the information from “genres.list”

The goal of this step is to extract the movie titles and their production year from “movies.list”, and store the extracted data into a dataframe. Dataframe (from the python package pandas) is one of the key tools that is commonly used for data profiling and cleaning. To extract the desired information from the text, we rely on regular expressions which are implemented in the python package “re”.

In [6]:

				
					import re
import pandas as pd

with open("./data/genres.list") as genres_file:
	raw_content = genres_file.readlines()
	genres_list = []
	content = raw_content[382:]
	for line in content:
		m = re.match(r'"?(.*[^"])"? \(((?:\d|\?){4})(?:/\w*)?\).*\s((?:\w|-)+)', line.strip())
		genres_list.append([m.group(1), m.group(2), m.group(3)])
	genres_data = pd.DataFrame(genres_list, columns=['movie', 'year', 'genre'])
				
			

Step 2: Extracting the information from “ratings.list”

Let’s start by finding out how many movies are listed in each dataframe.

In [7]:

				
					with open("./data/ratings.list") as ratings_file:
	raw_content = ratings_file.readlines()
	ratings_list = []
	content = raw_content[28:]
	for line in content:
		m = re.match(r'(?:\d|\.|\*){10}\s+\d+\s+(1?\d\.\d)\s"?(.*[^"])"? \(((?:\d|\?){4})(?:/\w*)?\)', line.strip())
		if m is None: continue
		ratings_list.append([m.group(2), m.group(3), m.group(1)])
	ratings_data = pd.DataFrame(ratings_list, columns=['movie', 'year', 'rating'])
				
			

Note that one has to repeat the information extraction procedure for other data files as well if he is interested in their content. For now (and to keep the tutorial simple), we assume that we are only interested in genres and ratings of movies. The above code snippets store the extracted data on these two attributes into two dataframes (namely, genres_list and ratings_list.)

Part 3: Data Profiling & Cleaning

The high-level goal in this stage of data prepration is to look into the data that we have acquired and extracted so far. This helps us to get familiar with data, understand in what ways the data needs cleaning or transformation, and finally enables us to prepare the data for the following steps of the data integration task.

Step 1: Loading the “Kaggle 5000 Movie Dataset”

For this step, we rely on dataframes (from the python package pandas) as they are designed to assist users in data exploration and data profiling tasks. In Part 2 of the tutorial, we stored the extracted data from “IMDB Plain Text Data” into dataframes. It would be appropriate to load the “Kaggle 5000 Movies Dataset” into a dataframe as well and follow the same data profiling procedure for all datasets.

In [8]:

				
					import pandas as pd

# Loading the Kaggle dataset from the .csv file (kaggle_dataset.csv)
kaggle_data = pd.read_csv('./data/kaggle_dataset.csv')
				
			

Step 2: Calculating Some Basic Statistics (Profiling)

Let’s start by finding out how many movies are listed in each dataframe.

In [9]:

				
					import pandas as pd

# Loading the Kaggle dataset from the .csv file (kaggle_dataset.csv)
kaggle_data = pd.read_csv('./data/kaggle_dataset.csv')
				
			
Number of movies in kaggle_data: 5043
Number of movies in genres_data: 2384400
Number of movies in ratings_data: 691621

We can also check to see if we have duplicates (i.e., a movie appearing more than once) in the data. We consider an entry duplicate if we can find another entry with the same movie title and production year.

In [10]:

				
					print ('Number of duplicates in kaggle_data: {}'.format(
	sum(kaggle_data.duplicated(subset=['movie_title', 'title_year'], keep=False))))
print ('Number of duplicates in genres_data: {}'.format(
	sum(genres_data.duplicated(subset=['movie', 'year'], keep=False))))
print ('Number of duplicates in ratings_data: {}'.format(
	sum(ratings_data.duplicated(subset=['movie', 'year'], keep=False))))
				
			
Number of duplicates in kaggle_data: 241
Number of duplicates in genres_data: 1807712
Number of duplicates in ratings_data: 286515

Step 3: Dealing with duplicates (cleaning)

There are many strategies to deal with duplicates. Here, we are going to use a simple method for dealing with duplicates and that is to only keep the first occurrence of a duplicated entry and remove the rest.

In [11]:

				
					kaggle_data = kaggle_data.drop_duplicates(subset=['movie_title', 'title_year'], keep='first').copy()
genres_data = genres_data.drop_duplicates(subset=['movie', 'year'], keep='first').copy()
ratings_data = ratings_data.drop_duplicates(subset=['movie', 'year'], keep='first').copy()
				
			
Number of movies in kaggle_data: 5043
Number of movies in genres_data: 2384400
Number of movies in ratings_data: 691621

Step 4: Normalizing the text (cleaning)

The key attribute that we will use to integrate our movie datasets is the movie titles. So it is important to normalize these titles. The following code snippet makes all movie titles lower case, and then removes certain characters such as “‘” and “?”, and replaces some other special characters (e.g., “&” is replaced with “and”).

In [12]:

				
					def preprocess_title(title):
	title = title.lower()
	title = title.replace(',', ' ')
	title = title.replace("'", '')    
	title = title.replace('&', 'and')
	title = title.replace('?', '')
	title = title.decode('utf-8', 'ignore')
	return title.strip()

kaggle_data['norm_movie_title'] = kaggle_data['movie_title'].map(preprocess_title)
genres_data['norm_movie'] = genres_data['movie'].map(preprocess_title)
ratings_data['norm_movie'] = ratings_data['movie'].map(preprocess_title)
				
			

Step 5: Looking at a few samples

The goal here is to a look at a few sample entries from each dataset for a quick sanity check. To keep the tutorial consice, we just present this step for the “Kaggle 5000 Movies Dataset” which is stored in the kaggle_data dataframe.

In [13]:

				
					kaggle_data.sample(3, random_state=0)

				
			

Out [13]:

 colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenreslanguagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likesnorm_movie_title
4422ColorSimeon Rice6.093.06.056.0Lisa Brave393.0NaNAction|Horror|ThrillerEnglishUSAR1500000.02014.0191.05.52.35307unsullied
1022ColorDoug Liman214.0108.0218.0405.0Ty Burrell6000.09528092.0Biography|Drama|ThrillerEnglishUSAPG-1322000000.02010.03000.06.82.359000fair game
3631ColorJonathan Levine147.099.0129.0362.0Aaron Yoo976.02077046.0Comedy|Drama|RomanceEnglishUSAR6000000.02008.0617.07.02.350the wackness

3 rows × 29 columns

Looking at the data guides us to decide in what ways we might want to clean the data. For instance, the small sample data shown above, reveals that the title_year attribute is stored as floats (i.e., rational numbers). We can add another cleaning step to transform the title_year into strings and replace the missing title years with symbol “?”.

In [14]:

				
					def preprocess_year(year):
	if pd.isnull(year):
		return '?'
	else:
		return str(int(year))

kaggle_data['norm_title_year'] = kaggle_data['title_year'].map(preprocess_year)
kaggle_data.head()


				
			

Out [14]:

 colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenrescountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likesnorm_movie_titlenorm_title_year
0ColorJames Cameron723.0178.00.0855.0Joel David Moore1000.0760505847.0Action|Adventure|Fantasy|Sci-FiUSAPG-13237000000.02009.0936.07.91.7833000avatar2009
1ColorGore Verbinski302.0169.0563.01000.0Orlando Bloom40000.0309404152.0Action|Adventure|FantasyUSAPG-13300000000.02007.05000.07.12.350pirates of the caribbean: at worlds end2007
2ColorSam Mendes602.0148.00.0161.0Rory Kinnear11000.0200074175.0Action|Adventure|ThrillerUKPG-13245000000.02015.0393.06.82.3585000spectre2015
3ColorChristopher Nolan813.0164.022000.023000.0Christian Bale27000.0448130642.0Action|ThrillerUSAPG-13250000000.02012.023000.08.52.35164000the dark knight rises2012
4NaNDoug WalkerNaNNaN131.0NaNRob Walker131.0NaNDocumentaryNaNNaNNaNNaN12.07.1NaN0star wars: episode vii – the force awakens?

5 rows × 30 columns

Part 4: Data Matching & Merging

The main goal in this part is go match the data that we have acquired from different sources to create a single rich dataset. Recall that in Part 3, we transformed all datasets into a dataframe which we used to clean the data. In this part, we continue using the same dataframes for the data that we have prepared so far.

Step 1: Integrating the “IMDB Plain Text Data” files

Note that both ratings_data and genres_data dataframes contain data that come from the same source (i.e., “the IMDB Plain Text data”). Thus, we assume that there are no inconsistencies between the data stored in these dataframe and to combine them, all we need to do is to match the entries that share the same title and production year. This simple “exact match” can be done simply using dataframes.

In [15]:

				
					brief_imdb_data = pd.merge(ratings_data, genres_data, how='inner', on=['norm_movie', 'year'])

brief_imdb_data.head()
				
			

Out [15]:

 movie_xyearratingnorm_moviemovie_ygenre
0The Shawshank Redemption19949.2the shawshank redemptionThe Shawshank RedemptionCrime
1The Godfather19729.2the godfatherThe GodfatherCrime
2The Godfather: Part II19749.0the godfather: part iiThe Godfather: Part IICrime
3The Dark Knight20088.9the dark knightThe Dark KnightAction
412 Angry Men19578.912 angry men12 Angry MenCrime

We refer to the dataset created above as the brief_imdb_data since it only contains two attributes (namely, genre and rating). Henceforth, we are going to use a richer version of the IMDB dataset which we created by integrating a number of files from the “IMDB Plain Text Data”. If you have completed the first part of this tutorial, then this dataset is already downloaded and stored in “imdb_dataset.csv” under the “data” folder. The following code snippet loads this dataset, does preprocessing on the title and production year of movies, removes the duplicates as before, and prints the size of the dataset.

In [16]:

				
					# reading the new IMDB dataset
imdb_data = pd.read_csv('./data/imdb_dataset.csv')
# let's normlize the title as we did in Part 3 of the tutorial
imdb_data['norm_title'] = imdb_data['title'].map(preprocess_title)
imdb_data['norm_year'] = imdb_data['year'].map(preprocess_year)
imdb_data = imdb_data.drop_duplicates(subset=['norm_title', 'norm_year'], keep='first').copy()
imdb_data.shape
				
			

Out [16]:

(869178, 27)

Step 2: Integrating the Kaggle and IMDB datasets

A simple approach to integrate the two datasets is to simply join entries that share the same movie title and year of production. The following code reveals that 4,248 matches are found using this simple approach.

In [17]:

				
					data_attempt1 = pd.merge(imdb_data, kaggle_data, how='inner', left_on=['norm_title', 'norm_year'],
						 right_on=['norm_movie_title', 'norm_title_year'])
data_attempt1.shape
				
			

Out [17]:

(4248, 57)

 

But given that IMDB and Kaggle datasets are collected from different sources, chances are that the name of a movie would be slightly different in these datasets (e.g. “Wall.E” vs “WallE”). To be able to find such matches, one can look at the similarity of movie titles and consider title with high similarity to be the same entity. BigGorilla provides a python pacakge named py_stringsimjoin for doing similarity join across two datasets. The following code snippet uses the py_stringsimjoin to match all the titles that have an edit distance of one or less (i.e., there is at most one character that needs to be changed/added/removed to make both titles identical). Once the similarity join is complete, it only selects the title pairs that are produced in the same year.

In [18]:

				
					import py_stringsimjoin as ssj
import py_stringmatching as sm

imdb_data['id'] = range(imdb_data.shape[0])
kaggle_data['id'] = range(kaggle_data.shape[0])
similar_titles = ssj.edit_distance_join(imdb_data, kaggle_data, 'id', 'id', 'norm_title',
										'norm_movie_title', l_out_attrs=['norm_title', 'norm_year'],
										 r_out_attrs=['norm_movie_title', 'norm_title_year'], threshold=1)
# selecting the entries that have the same production year
data_attempt2 = similar_titles[similar_titles.r_norm_title_year == similar_titles.l_norm_year]
data_attempt2.shape
				
			
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:02:46

Out [18]:

(4689, 8)

We can see that using the similarity join 4,689 titles were matched. Let’s look at some of the titles that are matched by the similarity join but are not identical.

In [19]:

				
					data_attempt2[data_attempt2.l_norm_title != data_attempt2.r_norm_movie_title].head()
				
			

Out [19]:

 _idl_idr_idl_norm_titlel_norm_yearr_norm_movie_titler_norm_title_year_sim_score
14414485273646world war v2013world war z20131.0
16216228164956grave2012brave20121.0
18018083149058walle2008wall·e20081.0
23623681618867upe2009up20091.0
24324381736667ut2009up20091.0

 

Step 3: Using Magellan for Data Matching

Substep A: Finding a candidate set (Blocking)

The goal of this step is to limit the number of pairs that we consider as potential matches using a simple heuristic. For this task, we can create a new column in each dataset that combines the values of important attributes into a single string (which we call the mixture). Then, we can use the string similarity join as before to find a set of entities that have some overlap in the values of the important columns. Before doing that, we need to transform the columns that are part of the mixture to strings. The py_stringsimjoin package allows us to do so easily.

In [20]:

				
					# transforming the "budget" column into string and creating a new **mixture** column
ssj.utils.converter.dataframe_column_to_str(imdb_data, 'budget', inplace=True)
imdb_data['mixture'] = imdb_data['norm_title'] + ' ' + imdb_data['norm_year'] + ' ' + imdb_data['budget']

# repeating the same thing for the Kaggle dataset
ssj.utils.converter.dataframe_column_to_str(kaggle_data, 'budget', inplace=True)
kaggle_data['mixture'] = kaggle_data['norm_movie_title'] + ' ' + kaggle_data['norm_title_year'] + \
						 ' ' + kaggle_data['budget']
				
			

Now, we can use the mixture columns to create a desired candidate set which we call C.

In [21]:

				
					

C = ssj.overlap_coefficient_join(kaggle_data, imdb_data, 'id', 'id', 'mixture', 'mixture', sm.WhitespaceTokenizer(), 
								 l_out_attrs=['norm_movie_title', 'norm_title_year', 'duration',
											  'budget', 'content_rating'],
								 r_out_attrs=['norm_title', 'norm_year', 'length', 'budget', 'mpaa'],
								 threshold=0.65)
C.shape


				
			
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:01:08

Out [21]:

(18317, 14)

We can see that by doing a similarity join, we already reduced the candidate set to 18,317 pairs.


Substep B: Specifying the keys

The next step is to specify to the py_entitymatching package which columns correspond to the keys in each dataframe. Also, we need to specify which columns correspond to the foreign keys of the the two dataframes in the candidate set.

In [22]:

				
					import py_entitymatching as em
em.set_key(kaggle_data, 'id')   # specifying the key column in the kaggle dataset
em.set_key(imdb_data, 'id')     # specifying the key column in the imdb dataset
em.set_key(C, '_id')            # specifying the key in the candidate set
em.set_ltable(C, kaggle_data)   # specifying the left table 
em.set_rtable(C, imdb_data)     # specifying the right table
em.set_fk_rtable(C, 'r_id')     # specifying the column that matches the key in the right table 
em.set_fk_ltable(C, 'l_id')     # specifying the column that matches the key in the left table 
				
			

Out [22]:

True

 


Substep C: Debugging the blocker

Now, we need to make sure that the candidate set is loose enough to include pairs of movies that are not very close. If this is not the case, there is a chance that we have eliminated pair that could be potentially matched together. By looking at a few pairs from the candidate set, we can judge whether the blocking step has been too harsh or not.

Note: The py_entitymatching package provides some tools for debugging the blocker as well.

In [23]:

				
					C[['l_norm_movie_title', 'r_norm_title', 'l_norm_title_year', 'r_norm_year',
   'l_budget', 'r_budget', 'l_content_rating', 'r_mpaa']].head()
				
			

Out [23]:

 l_norm_movie_titler_norm_titlel_norm_title_yearr_norm_yearl_budgetr_budgetl_content_ratingr_mpaa
0dude wheres my dog!#hacked201420142000020000PGNaN
1road hard#horror2015201515000001500000NaNNaN
2me you and five bucks#horror2015201515000001500000NaNNaN
3checkmate#horror2015201515000001500000NaNNaN
4#horror#horror2015201515000001500000Not RatedNaN

 

Based on the above sample we can see that the blocking seems to be reasonable.

Substep D: Sampling from the candidate set

The goal of this step is to obtain a sample from the candidate set and manually label the sampled candidates; that is, to specify if the candidate pair is a correct match or not.

In [24]:

				
					# Sampling 500 pairs and writing this sample into a .csv file
sampled = C.sample(500, random_state=0)
sampled.to_csv('./data/sampled.csv', encoding='utf-8')
				
			

In order to label the sampled data, we can create a new column in the .csv file (which we call label) and put value 1 under that column if the pair is a correct match and 0 otherwise. To avoid overriding the files, let’s rename the new file as labeled.csv.

In [25]:

				
					# If you would like to avoid labeling the pairs for now, you can download the labled.csv file from
# BigGorilla using the following command (if you prefer to do it yourself, command the next line)
response = urllib.urlretrieve('https://anaconda.org/BigGorilla/datasets/1/download/labeled.csv',
							  './data/labeled.csv')
labeled = em.read_csv_metadata('data/labeled.csv', ltable=kaggle_data, rtable=imdb_data,
							   fk_ltable='l_id', fk_rtable='r_id', key='_id')
labeled.head()
				
			
No handlers could be found for logger "py_entitymatching.io.parsers"

Out [25]:

 Unnamed: 0_idl_idr_idl_norm_movie_titlel_norm_title_yearl_durationl_budgetl_content_ratingr_norm_titler_norm_yearr_lengthr_budgetr_mpaa_sim_scorelabel
0477147712639235925eye of the beholder1999109.015000000Reye of the beholder1999109.035000000R0.8333331
111478114782001600301rocky balboa2006139.024000000PGrocky balboa2006139.024000000PG1.0000001
213630136304160691766from russia with love1963115.02000000Approvedthe aeolians: from russia with love2012NaN20000NaN0.6666670
3197219721248101029sex tape201494.040000000Rblended2014117.040000000PG-130.6666670
41590315903722758133the scorch trials2015132.061000000PG-13the scorch trials2015132.061000000PG-131.0000001

 

Substep E: Traning machine learning algorithms

Now we can use the sampled dataset to train various machine learning algorithms for our prediction task. To do so, we need to split our dataset into a training and a test set, and then select the desired machine learning techniques for our prediction task.

In [26]:

				
					split = em.split_train_test(labeled, train_proportion=0.5, random_state=0)
train_data = split['train']
test_data = split['test']

dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')
				
			

Before we can apply any machine learning technique, we need to extract a set of features. Fortunately, the py_entitymatching package can automatically extract a set of features once we specify which columns in the two datasets correspond to each other. The following code snippet starts by specifying the correspondence between the column of the two datasets. Then, it uses the py_entitymatching package to determine the type of each column. By considering the types of columns in each dataset (stored in variables l_attr_types and r_attr_types), and using the tokenizers and similarity functions suggested by the package, we can extract a set of instructions for extracting features. Note that variable F is not the set of extracted features, rather it encodes the instructions for computing the features.

In [27]:

				
					attr_corres = em.get_attr_corres(kaggle_data, imdb_data)
attr_corres['corres'] = [('norm_movie_title', 'norm_title'), 
						 ('norm_title_year', 'norm_year'),
						('content_rating', 'mpaa'),
						 ('budget', 'budget'),
]

l_attr_types = em.get_attr_types(kaggle_data)
r_attr_types = em.get_attr_types(imdb_data)

tok = em.get_tokenizers_for_matching()
sim = em.get_sim_funs_for_matching()

F = em.get_features(kaggle_data, imdb_data, l_attr_types, r_attr_types, attr_corres, tok, sim)
				
			

Given the set of desired features F, we can now calculate the feature values for our training data and also impute the missing values in our data. In this case, we choose to replace the missing values with the mean of the column.

In [28]:

				
					train_features = em.extract_feature_vecs(train_data, feature_table=F, attrs_after='label', show_progress=False) 
train_features = em.impute_table(train_features,  exclude_attrs=['_id', 'l_id', 'r_id', 'label'], strategy='mean')
				
			

Using the calculated features, we can evaluate the performance of different machine learning algorithms and select the best one for our matching task.

In [29]:

				
					result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=train_features, 
						   exclude_attrs=['_id', 'l_id', 'r_id', 'label'], k=5,
						   target_attr='label', metric='f1', random_state=0)
result['cv_stats']
				
			

Out [29]:

 NameMatcherNum foldsFold 1Fold 2Fold 3Fold 4Fold 5Mean score
0DecisionTree<py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x15d828090>51.0000000.9677421.01.0000001.0000.993548
1RF<py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x15d828550>gt;51.0000000.9677421.01.0000001.0000.993548
2SVM<py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x15d8284d0>50.9565220.9677421.01.0000000.8750.959853
3LinReg<py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x15d8560d0>51.0000000.9677421.01.0000001.0000.993548
4LogReg<py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x15d8281d0>51.0000000.9677421.00.9565221.0000.984853
5NaiveBayes<py_entitymatching.matcher.nbmatcher.NBMatcher object at 0x111b2c290>51.0000000.9677421.01.0000001.0000.993548

We can observe based on the reported accuracy of different techniques that the “random forest (RF)” algorithm achieves the best performance. Thus, it is best to use this technique for the matching.


Substep F: Evaluating the quality of our matching

It is important to evaluate the quality of our matching. We can now, use the traning set for this purpose and measure how well the random forest predicts the matches. We can see that we are obtaining a high accuracy and recall on the test set as well.

In [30]:

				
					best_model = result['selected_matcher']
best_model.fit(table=train_features, exclude_attrs=['_id', 'l_id', 'r_id', 'label'], target_attr='label')

test_features = em.extract_feature_vecs(test_data, feature_table=F, attrs_after='label', show_progress=False)
test_features = em.impute_table(test_features, exclude_attrs=['_id', 'l_id', 'r_id', 'label'], strategy='mean')

# Predict on the test data
predictions = best_model.predict(table=test_features, exclude_attrs=['_id', 'l_id', 'r_id', 'label'], 
								 append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'label', 'predicted')
em.print_eval_summary(eval_result)
				
			
Precision : 94.44% (51/54)
Recall : 100.0% (51/51)
F1 : 97.14%
False positives : 3 (out of 54 positive predictions)
False negatives : 0 (out of 196 negative predictions)

Substep G: Using the trained model to match the datasets

Now, we can use the trained model to match the two tables as follows:

In [31]:

				
					candset_features = em.extract_feature_vecs(C, feature_table=F, show_progress=True)
candset_features = em.impute_table(candset_features, exclude_attrs=['_id', 'l_id', 'r_id'], strategy='mean')
predictions = best_model.predict(table=candset_features, exclude_attrs=['_id', 'l_id', 'r_id'],
								 append=True, target_attr='predicted', inplace=False)
matches = predictions[predictions.predicted == 1]
				
			
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:01:05

Note that the matches dataframe contains many columns storing the extracted features for both datasets. The following code snippet removes all the unnecessary columns and creates a nice formatted dataframe that has the resulting integrated dataset.

In [32]:

				
					from py_entitymatching.catalog import catalog_manager as cm
matches = matches[['_id', 'l_id', 'r_id', 'predicted']]
matches.reset_index(drop=True, inplace=True)
cm.set_candset_properties(matches, '_id', 'l_id', 'r_id', kaggle_data, imdb_data)
matches = em.add_output_attributes(matches, l_output_attrs=['norm_movie_title', 'norm_title_year', 'budget', 'content_rating'],
								   r_output_attrs=['norm_title', 'norm_year', 'budget', 'mpaa'],
								   l_output_prefix='l_', r_output_prefix='r_',
								   delete_from_catalog=False)
matches.drop('predicted', axis=1, inplace=True)
matches.head()
				
			

Out [32]:

 _idl_idr_idl_norm_movie_titlel_norm_title_yearl_budgetl_content_ratingr_norm_titler_norm_yearr_budgetr_mpaa
044352106#horror20151500000Not Rated#horror20151500000NaN
182726450crocodile dundee ii198815800000PGcrocodile dundee ii198814000000NaN
2113406838500 days of summer20097500000PG-13(500) days of summer20097500000PG-13
3243631187210 cloverfield lane201615000000PG-1310 cloverfield lane201615000000PG-13
4262965188110 days in a madhouse201512000000R10 days in delaware20150NaN