08.07.2019       Issue 290 (08.07.2019 - 14.07.2019)       Articles

Building a Restaurant Recommendation Engine. Part 1



In this two-part article series, we will learn how to build your own recommendation engine with Python, from basic models to content-based and collaborative filtering recommender systems.

Recommender engines, or recommender systems, are among the most popular applications of data science today. They are used to predict the “rating” or “preference” that a user would give to an item. Almost every major tech company has applied them in some form: Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow. What’s more, for some companies, such as Netflix and Spotify, the business model and its success revolve around the potency of their recommendations.

Netflix’s long list of suggested movies and TV shows is a fantastic example of personalized user experience. In fact, about 70 percent of everything users watch is a personalized recommendation, according to the company.

Getting to that point hasn’t been easy, and improving on its recommendation system is an ongoing process. Netflix has spent well over a decade developing and refining its recommendations.

In a very general way:

Recommender systems are algorithms aimed at suggesting relevant items to users (items being movies to watch, text to read, products to buy or anything else depending on industries).

Why are recommender systems needed?

As the World Wide Web continues to grow at an exponential rate, the size and complexity of many web sites grow along with it. For the users of these web sites, it becomes increasingly difficult and time-consuming to find the information they are looking for. User interfaces could help users find the information that is in accordance with their interests by personalizing a web site.

Some web sites present users with personalized information by letting them choose from a set of predefined topics of interest. Users, however, do not always know what they are interested in beforehand, and their interests may change over time, which would require them to change their selection frequently. Recommender systems provide personalized information by learning the user’s interests from traces of interaction with that user.

Broadly, Recommender Systems can be classified into 3 types:

  • Simple recommenders: offer generalized recommendations to every user, based on the popularity of the restaurant. The basic idea behind this system is that restaurants that are more popular and critically acclaimed will have a higher probability of being liked by the average audience.
  • Content-based recommenders: suggest similar items based on a particular item. This system uses item metadata, such as Locality, Cuisine, rating, etc. for restaurants, to make these recommendations. The general idea behind these recommender systems is that if a person liked a particular item, he or she will also like an item that is similar to it.
  • Collaborative filtering engines: these systems try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Unlike their content-based counterparts, collaborative filters do not require item metadata.

In this first part of the series, we will see how to build basic versions of both simple and content-based recommender systems. While these models will be nowhere close to the industry standard in terms of complexity, quality or accuracy, they will help you get started with building more complex models that produce even better results.

In the second part, we will create a collaborative filtering recommender system using the K-Nearest Neighbours (K-NN) machine learning algorithm. It helps to get comfortable with the K-NN algorithm before going ahead with collaborative filtering.

Let's start building a restaurant recommendation engine, using the techniques discussed above, that can recommend the restaurants that best suit you.

We will use the Zomato restaurants dataset, which can be downloaded from here.


Simple Recommenders

As described in the previous section, simple recommenders are basic systems that recommend the top items based on a certain score. In this section, you will build a simplified model that lists the top 10 restaurants in a chosen region, ranked by a score derived from user ratings and vote counts.

The following are the steps involved:

  • Decide on the metric or score to rate restaurants.
  • Calculate the score for every restaurant.
  • Sort the restaurants based on the score and output the top results.
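
The three steps above can be sketched on a toy table. The restaurant names and ratings below are made up for illustration, and the score is simply the raw rating; the weighted score actually used in this article is introduced later in this section.

```python
import pandas as pd

# Hypothetical toy data; not taken from the Zomato dataset.
toy = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C", "D"],
    "Aggregate rating": [4.1, 3.2, 4.7, 3.9],
})

# Step 1: decide on a score (here, simply the rating itself).
# Step 2: calculate the score for every restaurant.
toy["score"] = toy["Aggregate rating"]

# Step 3: sort by score and output the top results.
top = toy.sort_values("score", ascending=False).head(3)
print(top)
```

Swapping in a different score only changes step 2; the sorting and selection stay the same.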

Load the libraries

import numpy as np 
import pandas as pd
import re
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import seaborn as sns
import matplotlib.pyplot as plt

Load the datasets

data = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/zomato.csv', encoding='latin1')
country = pd.read_excel("/Users/nageshsinghchauhan/Downloads/ML/recommend/Country-Code.xlsx")

Now, let's merge the two files.

data1 = pd.merge(data, country, on='Country Code')
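
As a minimal sketch of what this merge does: pd.merge() performs an inner join by default, so a restaurant whose Country Code has no match in the country table would be dropped. The two small tables below are hypothetical stand-ins for the real files.

```python
import pandas as pd

# Hypothetical stand-ins for zomato.csv and Country-Code.xlsx.
restaurants = pd.DataFrame({
    "Restaurant Name": ["X", "Y", "Z"],
    "Country Code": [1, 216, 999],  # 999 has no match on purpose
})
codes = pd.DataFrame({
    "Country Code": [1, 216],
    "Country": ["India", "United States"],
})

# Default inner join: only rows with a matching code survive.
merged = pd.merge(restaurants, codes, on="Country Code")

# A left join keeps every restaurant and leaves Country empty where unmatched.
left = pd.merge(restaurants, codes, on="Country Code", how="left")

print(len(merged), len(left))
```

Comparing the row counts before and after the merge is a quick way to check whether any restaurants were lost.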

Start with some data exploration.

1. Let us check which countries have the most restaurants registered on Zomato.

init_notebook_mode(connected=True)  # enable offline plotly output in the notebook

labels = list(data1.Country.value_counts().index)
values = list(data1.Country.value_counts().values)

fig = {
    "data": [
        {
            "labels": labels,
            "values": values,
            "hoverinfo": "label+percent",
            "domain": {"x": [0, .9]},
            "hole": 0.6,
            "type": "pie",
        },
    ],
    "layout": {
        "title": "Zomato's Presence around the World",
        "annotations": [
            {
                # annotation position (x, y) is not given in the source; plotly defaults apply
                "font": {"size": 20},
                "showarrow": True,
                "text": "Countries",
            },
        ],
    },
}

iplot(fig)

pie chart showing countries where the maximum number of restaurants are registered on Zomato

2. Let us check Zomato’s presence in the top 10 Indian cities.

res_India = data1[data1.Country == 'India']
labels1 = list(res_India.City.value_counts().index)
values1 = list(res_India.City.value_counts().values)
# Keep only the ten cities with the most restaurants
labels1 = labels1[:10]
values1 = values1[:10]

fig = {
    "data": [
        {
            "labels": labels1,
            "values": values1,
            "hoverinfo": "label+percent",
            "domain": {"x": [0, .8]},
            "hole": 0.6,
            "type": "pie",
        },
    ],
    "layout": {
        "title": "",
        "annotations": [
            {
                # annotation position (x, y) is not given in the source; plotly defaults apply
                "font": {"size": 20},
                "showarrow": True,
                "text": "Cities",
            },
        ],
    },
}

iplot(fig)

pie chart showing the top 10 cities where the maximum number of restaurants are registered on Zomato

3. Number of restaurants in NCR (the four cities New Delhi, Gurgaon, Noida and Faridabad, together called NCR) with aggregate ratings ranging from 1.9 to 4.9.

NCR = ['New Delhi', 'Gurgaon', 'Noida', 'Faridabad']
# Keep only the restaurants located in one of the four NCR cities
res_NCR = res_India[res_India.City.isin(NCR)]
# Keep only restaurants with a non-zero aggregate rating
agg_rat = res_NCR[res_NCR['Aggregate rating'] > 0]
f, ax = plt.subplots(1, 1, figsize=(14, 4))
# Count the restaurants at each aggregate rating value
sns.countplot(x='Aggregate rating', data=agg_rat, ax=ax)

Average rating in NCR region

4. Top 10 Cuisines served by restaurants.

title="Most Popular Cuisines", autopct='%1.2f%%')

Top 10 Cuisines served by restaurants.
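
Only the tail of the plotting call survives above. As a minimal sketch of one way to obtain the ten most frequent values of a column, assuming the Cuisines column is counted as-is with value_counts() (the toy rows below are hypothetical, and the DataFrame the original call was applied to is not known):

```python
import pandas as pd

# Hypothetical toy rows standing in for the restaurant table.
toy = pd.DataFrame({
    "Cuisines": [
        "North Indian", "Chinese", "North Indian",
        "Fast Food", "North Indian", "Chinese",
    ],
})

# value_counts() sorts by frequency (descending); head(10) keeps the top ten.
top_cuisines = toy["Cuisines"].value_counts().head(10)
print(top_cuisines)
```

The resulting Series can then be drawn with its .plot(kind='pie', ...) method, which accepts keyword arguments such as title and autopct.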

Now, we are going to select the country as “India” and the city as NCR (New Delhi, Gurgaon, Noida, Faridabad).

res_India = data1[data1.Country == 'India']
NCR = ['New Delhi', 'Gurgaon', 'Noida', 'Faridabad']
res_NCR = res_India[res_India.City.isin(NCR)]

One of the most basic metrics you can think of is the rating. However, using this metric has a few caveats. For one, it does not take into consideration the popularity of a restaurant. Therefore, a restaurant with a rating of 9 from 10 voters will be considered ‘better’ than a restaurant with a rating of 8.9 from 10,000 voters.

On a related note, this metric will also tend to favor restaurants with a small number of voters and skewed or extremely high ratings. As the number of voters increases, the rating of a restaurant stabilizes and approaches a value that reflects the restaurant’s quality. It is more difficult to discern the quality of a restaurant with extremely few voters.

Taking these shortcomings into consideration, it is necessary that you come up with a weighted rating that takes into account the average rating and the number of votes it has garnered.

Here, we will use the following weighted rating formula as our metric/score. Mathematically, it is represented as:

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where WR is the weighted rating,

  • v is the number of votes for the restaurant;
  • m is the minimum votes required to be listed in the chart;
  • R is the average rating of the restaurant; and
  • C is the mean rating across all restaurants in the chart.
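
To see how the weighted rating behaves, the sketch below applies WR = v/(v+m) * R + m/(v+m) * C (the same expression used in the weighted_rating() function later in this article) to the two restaurants from the example above: a rating of 9 from 10 voters and a rating of 8.9 from 10,000 voters. The values of m and C, and the helper name weighted_rating_value, are assumptions chosen only for illustration.

```python
def weighted_rating_value(v, R, m, C):
    """Weighted rating: WR = v/(v+m) * R + m/(v+m) * C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Assumed values for illustration only.
m_assumed = 1000   # minimum votes required to be listed
C_assumed = 7.0    # mean rating across all restaurants

few_votes = weighted_rating_value(v=10, R=9.0, m=m_assumed, C=C_assumed)
many_votes = weighted_rating_value(v=10_000, R=8.9, m=m_assumed, C=C_assumed)

print(round(few_votes, 2), round(many_votes, 2))  # 7.02 8.73
```

With few votes the score stays close to the overall mean C, while with many votes it stays close to the restaurant's own rating R, so the restaurant with 10,000 voters now ranks higher.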

You already have the values of v (“Votes”) and R (“Aggregate rating”) for each restaurant in the dataset. It is also possible to calculate C directly from this data.

What remains is to determine an appropriate value for m, the minimum votes required to be listed in the chart. There is no right value for m. You can view it as a preliminary negative filter that ignores restaurants with fewer than a certain number of votes. The selectivity of your filter is up to your discretion.

In this case, you will use the 90th percentile as your cutoff. In other words, for a restaurant to feature in the charts, it must have more votes than at least 90% of the restaurants in the list. (On the other hand, if you had chosen the 75th percentile, you would have considered the top 25% of the restaurants in terms of the number of votes garnered. As percentile decreases, the number of restaurants considered increases. Feel free to play with this value and observe the changes in your final chart).
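
The effect of the percentile on the cutoff can be sketched with a small, made-up series of vote counts (these numbers are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical vote counts for ten restaurants.
votes = pd.Series([5, 12, 30, 45, 80, 150, 300, 600, 1200, 5000])

cutoffs = {}
counts = {}
for q in (0.75, 0.90):
    # quantile() interpolates linearly between neighbouring values by default.
    cutoffs[q] = votes.quantile(q)
    # Restaurants with at least that many votes qualify for the chart.
    counts[q] = int((votes >= cutoffs[q]).sum())
    print(q, cutoffs[q], counts[q])
```

On this toy series the 75th percentile keeps three restaurants and the 90th percentile keeps only one, which illustrates how a lower percentile widens the chart.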

As a first step, let’s calculate the value of C, the mean rating across all restaurants:

data_new_delphi = res_NCR[['Restaurant Name', 'Cuisines', 'Locality', 'Aggregate rating', 'Votes']]
C = data_new_delphi['Aggregate rating'].mean()

The average rating of a restaurant in the NCR region is around 2.39, on a scale of 5.

Next, let’s calculate m, the number of votes received by a restaurant at the 90th percentile. The pandas library makes this task trivial with the .quantile() method of a pandas Series:

m = data_new_delphi['Votes'].quantile(0.90)

Next, you can filter the restaurants that qualify for the chart, based on their vote counts:

# Filter the qualified restaurants into a new, independent DataFrame
q_restaurant = data_new_delphi.loc[data_new_delphi['Votes'] >= m].copy()
q_restaurant.shape  # (795, 5)

You use the .copy() method to ensure that the new q_restaurant DataFrame is independent of the original data_new_delphi DataFrame. In other words, changes made to q_restaurant do not affect the original data.
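
A minimal sketch of that independence, using a hypothetical three-row table rather than the real data:

```python
import pandas as pd

# Hypothetical toy table standing in for the restaurant data.
original = pd.DataFrame({
    "Votes": [10, 500, 900],
    "Aggregate rating": [4.5, 3.9, 4.2],
})

# Filter first, then copy, so the result owns its own data.
qualified = original.loc[original["Votes"] >= 500].copy()

# Adding a column to the copy leaves the original untouched.
qualified["score"] = 0.0

print(list(original.columns))
print(list(qualified.columns))
```

Without the explicit copy, pandas may emit a SettingWithCopyWarning when a new column is assigned to the filtered result.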

You see that there are 795 restaurants which qualify to be in this list. Now, you need to calculate your metric for each qualified restaurant. To do this, you will define a function, weighted_rating() and define a new feature score, of which you'll calculate the value by applying this function to your DataFrame of qualified restaurants:

# Function that computes the weighted rating of each restaurant
def weighted_rating(x, m=m, C=C):
    v = x['Votes']
    R = x['Aggregate rating']
    # Calculating the score
    return (v/(v+m) * R) + (m/(m+v) * C)

# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_restaurant['score'] = q_restaurant.apply(weighted_rating, axis=1)
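
The same score can also be computed without a row-wise apply(), using column arithmetic. The sketch below compares both approaches on a hypothetical toy table; the names toy, m_toy, C_toy and weighted_rating_row are assumptions used only here.

```python
import pandas as pd

# Hypothetical toy table standing in for the qualified restaurants.
toy = pd.DataFrame({
    "Votes": [100, 2500, 40],
    "Aggregate rating": [4.0, 4.5, 3.0],
})
m_toy = toy["Votes"].quantile(0.90)
C_toy = toy["Aggregate rating"].mean()

def weighted_rating_row(x, m=m_toy, C=C_toy):
    # Row-wise version, mirroring weighted_rating() above.
    v = x["Votes"]
    R = x["Aggregate rating"]
    return (v / (v + m) * R) + (m / (m + v) * C)

row_wise = toy.apply(weighted_rating_row, axis=1)

# Vectorized version: the same formula applied to whole columns at once.
v = toy["Votes"]
R = toy["Aggregate rating"]
vectorized = v / (v + m_toy) * R + m_toy / (m_toy + v) * C_toy

print((row_wise - vectorized).abs().max())
```

Both versions give the same values; the vectorized form tends to be faster on large tables because it avoids calling a Python function for every row.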

Finally, let’s sort the DataFrame based on the score feature and output the Restaurant Name, Votes, Aggregate rating and weighted rating or score of the top 10 restaurants.

#Sort restaurant based on score calculated above
q_restaurant = q_restaurant.sort_values('score', ascending=False)

#Print the top 10 restaurants in Delhi NCR
q_restaurant[['Restaurant Name','Cuisines', 'Locality','Votes', 'Aggregate rating', 'score']].head(10)

Top 10 restaurants in NCR region

As you can see, the chart ranks restaurants by their weighted score, which combines the aggregate rating with the number of votes.
