In this two-part article series, we will learn how to build a recommendation engine in Python, from basic models to content-based and collaborative filtering recommender systems.

Recommender engines, or recommender systems, are among the most popular applications of data science today. They are used to predict the “rating” or “preference” that a user would give to an item. Almost every major tech company has applied them in some form: Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow. What’s more, for some companies, such as Netflix and Spotify, the business model and its success revolve around the quality of their recommendations.

Netflix’s long list of suggested movies and TV shows is a fantastic example of personalized user experience. In fact, about 70 percent of everything users watch is a personalized recommendation, according to the company.

Getting to that point hasn’t been easy, and improving on its recommendation system is an ongoing process. Netflix has spent well over a decade developing and refining its recommendations.

In a very general way:

Recommender systems are algorithms aimed at suggesting relevant items to users (items being movies to watch, text to read, products to buy or anything else depending on industries).

**Why are recommender systems needed?**

As the World Wide Web continues to grow at an exponential rate, the size and complexity of many web sites grow along with it. For the users of these web sites, it becomes increasingly difficult and time-consuming to find the information they are looking for. User interfaces could help users find the information that is in accordance with their interests by personalizing a web site.

Some web sites present users with personalized information by letting them choose from a set of predefined topics of interest. Users, however, do not always know beforehand what they are interested in, and their interests may change over time, which would require them to update their selection frequently. **Recommender systems** provide personalized information by learning the user’s interests from traces of interaction with that user.

Broadly, Recommender Systems can be classified into 3 types:

- **Simple recommenders**: offer generalized recommendations to every user, based on the popularity of the restaurant. The basic idea behind this system is that restaurants that are more popular and critically acclaimed will have a higher probability of being liked by the average audience.
- **Content-based recommenders**: suggest similar items based on a particular item. This system uses item metadata, such as locality, cuisine, and rating for restaurants, to make these recommendations. The general idea behind these recommender systems is that if a person liked a particular item, he or she will also like an item that is similar to it.
- **Collaborative filtering engines**: these systems try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Collaborative filters do not require item metadata, unlike their content-based counterparts.

In this **1st part** of the series, we will see how to build basic versions of simple and content-based recommender systems. While these models will be nowhere close to the industry standard in terms of complexity, quality or accuracy, they will help you get started with building more complex models that produce even better results.

In the **2nd part**, we will create a collaborative filtering recommender system using the K-Nearest Neighbour (K-NN) machine learning algorithm. It helps to be comfortable with the K-NN algorithm before moving on to collaborative filtering.

Let's start building a restaurant recommendation engine, using the techniques discussed above, that is capable of recommending the restaurants that best suit you.

We will use Zomato restaurants data and it can be downloaded from here.

# Simple Recommenders

As described in the previous section, simple recommenders are basic systems that recommend the top items based on a certain score. In this section, you will build a simplified model which will give you the top 10 restaurants of your city based on user rating and score.

The following are the steps involved:

- Decide on the metric or score to rate restaurants.
- Calculate the score for every restaurant.
- Sort the restaurants based on the score and output the top results.
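The three steps above can be sketched generically in plain Python. This is only a minimal illustration: the `top_n` helper, the sample records, and the placeholder metric (the raw rating) are all hypothetical and are not part of the Zomato dataset or of any library.

```python
def top_n(items, score_fn, n=10):
    """Score every item, sort by score (highest first), and return the top n."""
    scored = [(score_fn(item), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:n]]


# Hypothetical sample records, for illustration only
restaurants = [
    {"name": "A", "rating": 4.1, "votes": 120},
    {"name": "B", "rating": 4.6, "votes": 900},
    {"name": "C", "rating": 3.2, "votes": 45},
]


# Step 1: decide on a metric (here simply the raw rating, as a placeholder)
def raw_rating(restaurant):
    return restaurant["rating"]


# Steps 2 and 3: score every restaurant, sort, and keep the best two
best = top_n(restaurants, raw_rating, n=2)
print([r["name"] for r in best])
```

Only the scoring function changes when a better metric is chosen; the sort and cut-off stay the same.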

Load the libraries

`import numpy as np `

import pandas as pd

import re

from plotly.offline import init_notebook_mode, iplot

init_notebook_mode()

import plotly.graph_objs as go

import seaborn as sns

import matplotlib.pyplot as plt

Load the datasets

data = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/zomato.csv', encoding ='latin1')

country = pd.read_excel("/Users/nageshsinghchauhan/Downloads/ML/recommend/Country-Code.xlsx")

Now, let's merge the two files.

`data1 = pd.merge(data, country, on='Country Code')`
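With only `on='Country Code'`, `pd.merge` performs an inner join, so each restaurant row picks up the matching country name. The sketch below shows the same call on two toy frames; the rows are made up for illustration. The optional `validate="many_to_one"` argument is an addition not used in the article: it makes pandas raise an error if a country code appears more than once in the lookup table.

```python
import pandas as pd

# Toy stand-ins for the two files; the values are made up for illustration
restaurants = pd.DataFrame({
    "Restaurant Name": ["R1", "R2", "R3"],
    "Country Code": [1, 1, 216],
})
countries = pd.DataFrame({
    "Country Code": [1, 216],
    "Country": ["India", "United States"],
})

# Inner join on the shared key; validate guards against duplicate codes
merged = pd.merge(restaurants, countries, on="Country Code",
                  how="inner", validate="many_to_one")
print(merged)
```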

Start with some data exploration.

1. Let us check the countries where the largest number of restaurants are registered on Zomato.

labels = list(data1.Country.value_counts().index)

values = list(data1.Country.value_counts().values)

fig = {

"data":[

{

"labels" : labels,

"values" : values,

"hoverinfo" : 'label+percent',

"domain": {"x": [0, .9]},

"hole" : 0.6,

"type" : "pie",

"rotation":120,

},

],

"layout": {

"title" : "Zomato's Presence around the World",

"annotations": [

{

"font": {"size":20},

"showarrow": True,

"text": "Countries",

"x":0.2,

"y":0.9,

},

]

}

}

iplot(fig)

2. Let us check Zomato’s presence in the top 10 Indian cities.

res_India = data1[data1.Country == 'India']

labels1 = list(res_India.City.value_counts().index)

values1 = list(res_India.City.value_counts().values)

labels1 = labels1[:10]

values1 = values1[:10]

fig = {

"data":[

{

"labels" : labels1,

"values" : values1,

"hoverinfo" : 'label+percent',

"domain": {"x": [0, .8]},

"hole" : 0.6,

"type" : "pie",

"rotation":120,

},

],

"layout": {

"title" : "",

"annotations": [

{

"font": {"size":20},

"showarrow": True,

"text": "Cities",

"x":0.2,

"y":0.9,

},

]

}

}

iplot(fig)

3. Number of restaurants in NCR (the four cities `New Delhi, Gurgaon, Noida, Faridabad` are together called NCR) with an aggregate rating ranging from 1.9 to 4.9.

`NCR = ['New Delhi','Gurgaon','Noida','Faridabad']`

res_NCR = res_India[(res_India.City == NCR[0])|(res_India.City == NCR[1])|(res_India.City == NCR[2])|

(res_India.City == NCR[3])]

agg_rat = res_NCR[res_NCR['Aggregate rating'] > 0]

f, ax = plt.subplots(1,1, figsize = (14, 4))

ax = sns.countplot(x=agg_rat['Aggregate rating'], ax=ax)

plt.show()

4. Top 10 Cuisines served by restaurants.

`res_India['Cuisines'].value_counts().sort_values(ascending=False).head(10)`

res_India['Cuisines'].value_counts().sort_values(ascending=False).head(10).plot(kind='pie',figsize=(10,6),

title="Most Popular Cuisines", autopct='%1.2f%%')

plt.axis('equal')
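Note that `value_counts()` on the `Cuisines` column counts each distinct string as one category. If the column stores several cuisines per restaurant as a comma-separated string (an assumption about the data layout, worth checking against your copy of the file), a combination such as "North Indian, Chinese" is counted separately from "Chinese". A possible way to count individual cuisines, shown on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample; assumes cuisines are stored as comma-separated strings
sample = pd.DataFrame({
    "Cuisines": ["North Indian, Chinese", "Chinese", "North Indian, Mughlai", None],
})

cuisine_counts = (
    sample["Cuisines"]
    .dropna()          # skip restaurants without cuisine information
    .str.split(",")    # one list of cuisines per restaurant
    .explode()         # one row per (restaurant, cuisine) pair
    .str.strip()       # remove the space left after each comma
    .value_counts()
)
print(cuisine_counts.head(10))
```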

Now, we are going to select Country as “India” and the city as NCR(New Delhi, Gurgaon, Noida, Faridabad).

`res_India = data1[data1.Country == 'India']`

NCR = ['New Delhi','Gurgaon','Noida','Faridabad']

res_NCR = res_India[(res_India.City == NCR[0])|(res_India.City == NCR[1])|(res_India.City == NCR[2])|

(res_India.City == NCR[3])]
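The four chained comparisons above can also be written with `Series.isin`, which keeps the filter short if the list of cities changes. A minimal sketch on a toy frame; the `_demo` names and rows are made up for illustration:

```python
import pandas as pd

NCR = ["New Delhi", "Gurgaon", "Noida", "Faridabad"]

# Toy stand-in for res_India; the rows are made up
res_India_demo = pd.DataFrame({
    "Restaurant Name": ["R1", "R2", "R3", "R4"],
    "City": ["New Delhi", "Mumbai", "Noida", "Pune"],
})

# Keep only the rows whose city is one of the NCR cities
res_NCR_demo = res_India_demo[res_India_demo["City"].isin(NCR)]
print(res_NCR_demo)
```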

One of the most basic metrics you can think of is the rating. However, using this metric has a few caveats. For one, it does not take into consideration the popularity of a restaurant. Therefore, a restaurant with a rating of 4.9 from 10 voters will be considered ‘better’ than a restaurant with a rating of 4.8 from 10,000 voters.

On a related note, this metric will also tend to favor restaurants with a small number of voters and skewed and/or extremely high ratings. As the number of voters increases, the rating of a restaurant stabilizes and approaches a value that reflects the restaurant’s quality. It is more difficult to discern the quality of a restaurant with extremely few voters.

Taking these shortcomings into consideration, it is necessary that you come up with a weighted rating that takes into account the average rating and the number of votes it has garnered.

Here, we will use a weighted rating formula as our metric/score. Mathematically, it is represented as:

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where *WR* is the weighted rating and:

- *v* is the number of votes for the restaurant;
- *m* is the minimum number of votes required to be listed in the chart;
- *R* is the average rating of the restaurant;
- *C* is the mean rating across all restaurants in the data.
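To see what the weighting does, the sketch below evaluates WR = v/(v+m) · R + m/(v+m) · C for two hypothetical restaurants: one with a very high rating from a handful of voters and one with a slightly lower rating from many voters. The values of `m` and `C` here are illustrative, not computed from the dataset.

```python
def weighted_rating(v, R, m, C):
    """WR = v/(v+m) * R + m/(v+m) * C"""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Illustrative values only (not taken from the dataset)
m = 200   # minimum number of votes required to be listed
C = 2.5   # mean rating across all restaurants

few_votes = weighted_rating(v=10, R=4.9, m=m, C=C)
many_votes = weighted_rating(v=10_000, R=4.8, m=m, C=C)

print(round(few_votes, 2), round(many_votes, 2))
```

With few votes, the score is pulled strongly towards the overall mean `C`; with many votes, it stays close to the restaurant's own rating `R`, so the well-established restaurant ranks higher.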

You already have the values of *v* (`Votes`) and *R* (`Aggregate rating`) for each restaurant in the dataset. It is also possible to directly calculate *C* from this data.

What you need to determine is an appropriate value for *m*, the minimum number of votes required to be listed in the chart. There is no right value for *m*. You can view it as a preliminary filter that ignores restaurants with fewer than a certain number of votes. The selectivity of your filter is up to your discretion.

In this case, you will use the 90th percentile as your cutoff. In other words, for a restaurant to feature in the charts, it must have more votes than at least 90% of the restaurants in the list. (On the other hand, if you had chosen the 75th percentile, you would have considered the top 25% of the restaurants in terms of the number of votes garnered. As percentile decreases, the number of restaurants considered increases. Feel free to play with this value and observe the changes in your final chart).
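The effect of the percentile on the number of restaurants considered can be checked on synthetic data. The sketch below uses a made-up series of 100 vote counts (1 to 100) and a hypothetical helper, `count_qualifying`; it is only meant to illustrate how the cutoff behaves.

```python
import pandas as pd

# Synthetic vote counts: 100 restaurants with 1, 2, ..., 100 votes
votes = pd.Series(range(1, 101))


def count_qualifying(votes, q):
    """Number of restaurants with at least as many votes as the q-th quantile."""
    cutoff = votes.quantile(q)
    return int((votes >= cutoff).sum())


kept_90 = count_qualifying(votes, 0.90)
kept_75 = count_qualifying(votes, 0.75)
print(kept_90, kept_75)
```

Lowering the percentile from 0.90 to 0.75 lets more restaurants through the filter.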

As a first step, let’s calculate the value of *C*, the mean rating across all restaurants:

data_new_delphi = res_NCR[['Restaurant Name','Cuisines','Locality','Aggregate rating', 'Votes']]

C = data_new_delphi['Aggregate rating'].mean()

print(C)

#2.39583438526

The average rating of a restaurant in the **NCR region** is around 2.4, on a scale of 5.

Next, let’s calculate the number of votes, *m*, received by a restaurant in the 90th percentile. The `pandas` library makes this task trivial using the `.quantile()` method of a pandas Series:

`m = data_new_delphi['Votes'].quantile(0.90)`

print(m)

#234.0

Next, you can filter the restaurants that qualify for the chart, based on their vote counts:

`# Filter out all qualified restaurants into a new DataFrame`

q_restaurant = data_new_delphi.copy().loc[data_new_delphi['Votes'] >= m]

q_restaurant.shape

#(795, 5)

The `.copy()` method ensures that the new `q_restaurant` DataFrame is independent of the original `data_new_delphi` DataFrame. In other words, any changes made to `q_restaurant` do not affect the original data.

You see that there are 795 restaurants which qualify to be in this list. Now, you need to calculate your metric for each qualified restaurant. To do this, you will define a function, `weighted_rating()`, and a new feature, `score`, whose value you'll calculate by applying this function to your DataFrame of qualified restaurants:

# Function that computes the weighted rating of each restaurant

def weighted_rating(x, m=m, C=C):

    v = x['Votes']

    R = x['Aggregate rating']

    # Calculating the score

    return (v/(v+m) * R) + (m/(m+v) * C)

# Define a new feature 'score' and calculate its value with `weighted_rating()`

q_restaurant['score'] = q_restaurant.apply(weighted_rating, axis=1)
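The row-wise `apply` call above can also be expressed as a vectorized column computation, which is usually faster on large DataFrames. The sketch below checks on a toy frame that both approaches give the same scores; the rows and the values of `m` and `C` are illustrative, not the ones computed from the dataset.

```python
import pandas as pd

# Toy stand-in for q_restaurant; the values are made up
demo = pd.DataFrame({
    "Restaurant Name": ["R1", "R2", "R3"],
    "Votes": [250, 1200, 5000],
    "Aggregate rating": [4.5, 4.2, 3.9],
})
m = 234.0  # illustrative vote threshold
C = 2.4    # illustrative mean rating


def weighted_rating(x, m=m, C=C):
    v = x["Votes"]
    R = x["Aggregate rating"]
    return (v / (v + m) * R) + (m / (m + v) * C)


# Row-wise version, as in the article
row_wise = demo.apply(weighted_rating, axis=1).astype(float)

# Vectorized version: the same formula applied to whole columns
v = demo["Votes"]
R = demo["Aggregate rating"]
vectorized = (v / (v + m)) * R + (m / (v + m)) * C

print((row_wise - vectorized).abs().max())
```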

Finally, let’s sort the DataFrame based on the `score` feature and output the `Restaurant Name`, `Votes`, `Aggregate rating`, and weighted rating (`score`) of the top 10 restaurants.

#Sort restaurant based on score calculated above

q_restaurant = q_restaurant.sort_values('score', ascending=False)

#Print the top 10 restaurants in Delhi NCR

q_restaurant[['Restaurant Name','Cuisines', 'Locality','Votes', 'Aggregate rating', 'score']].head(10)

As you can see, the simple recommender ranks restaurants by the weighted `score`, which combines each restaurant's `Aggregate rating` with its number of `Votes`.
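For reuse, the whole simple recommender can be wrapped in one function. The sketch below is one possible arrangement, not the article's own code: `top_restaurants` is a hypothetical helper, it assumes the column names `Votes` and `Aggregate rating` used throughout this section, and it is demonstrated on made-up rows.

```python
import pandas as pd


def top_restaurants(df, quantile=0.90, n=10):
    """Return the n best restaurants by weighted rating."""
    C = df["Aggregate rating"].mean()        # mean rating across all restaurants
    m = df["Votes"].quantile(quantile)       # vote threshold
    qualified = df.loc[df["Votes"] >= m].copy()
    v = qualified["Votes"]
    R = qualified["Aggregate rating"]
    qualified["score"] = (v / (v + m)) * R + (m / (v + m)) * C
    return qualified.sort_values("score", ascending=False).head(n)


# Made-up rows, for illustration only
demo = pd.DataFrame({
    "Restaurant Name": ["R1", "R2", "R3", "R4", "R5"],
    "Votes": [5, 40, 300, 800, 1500],
    "Aggregate rating": [4.9, 3.0, 4.1, 4.4, 3.8],
})

top = top_restaurants(demo, quantile=0.5, n=2)
print(top[["Restaurant Name", "Votes", "Aggregate rating", "score"]])
```

In this toy example, the restaurant with the highest raw rating (R1, with only 5 votes) does not pass the vote threshold and therefore does not appear in the result.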