01.10.2019       Issue 302 (30.09.2019 - 06.10.2019)       Articles

Analyzing Oktoberfest data: Pandas, Matplotlib, Plotly

Oktoberfest: Quick analysis using Pandas, Matplotlib, and Plotly

Oktoberfest 2019 has started! Oktoberfest is the world’s largest beer festival and has been held annually in Munich since 1810. It lasts between 16 and 18 days, running from mid or late September to the first Sunday in October, with more than 6 million visitors every year. 🍺 🍺

Munchen.de is the official portal of the city of Munich, which contains more than 140 data sets covering a wide range of topics such as economy, transport, tourism, or culture. More and more European cities now provide an open data portal, allowing companies, citizens, researchers, and other public institutions to make use of the data generated. For this article, we employ one of the data sets available on Munchen.de, containing information about Oktoberfest from 1985 until now. Although most of the years are not available (remember, Oktoberfest started in 1810!), the data set provides valuable insights into how the festival has evolved over the last 30 years.

Soooo let’s get started! 🍻

Information about libraries

To evaluate the dataset, we use three different Python libraries: Pandas, Matplotlib, and Plotly.

  • Pandas is a Python open source library for data science that allows us to easily work with structured data, such as csv files, SQL tables, or Excel spreadsheets. It provides tools for reading and writing data in different formats, carrying out exploratory analysis, and cleaning data (reshaping data sets, handling missing data, or merging data sets).
  • Matplotlib is a Python 2D plotting library that can be used in Python scripts, Jupyter notebooks, and IPython shells, among other environments, producing high quality figures. Matplotlib offers a wide range of visualizations such as histograms, bar charts, scatter plots, box plots, or pie charts.
  • The Plotly Python graphing library provides, like Matplotlib, a wide range of visualizations. But, unlike Matplotlib, the visualizations provided by Plotly are interactive and more intuitive. 😉

After this overview of the libraries used in this analysis, we are ready to start the next step: exploratory data analysis and data cleaning.

Exploratory data analysis and data cleaning

Exploratory data analysis consists of analyzing the main characteristics of a data set, usually by means of visualization methods and summary statistics. The objective is to understand the data, discover patterns and anomalies, and check assumptions before we perform further evaluations.

After downloading the csv file from Munchen.de, we can load it into a Pandas data frame using the pandas.read_csv function and visualize the first 5 rows using the pandas.DataFrame.head method.

The data set contains 8 columns: (1) year, (2) duration, (3) visitors in total, (4) visitors per day, (5) price of beer, (6) consumption of beer, (7) price of chicken, (8) consumption of chicken.
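Loading and previewing the file can be sketched roughly as follows. This is only an illustration: an in-memory string stands in for the downloaded csv file, every column name except besucher_tag is a guess, and the numbers are placeholders rather than the real figures.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. Column names other than besucher_tag are
# assumed, and all values are made-up placeholders, not the real data.
csv_text = """jahr,dauer,besucher_gesamt,besucher_tag,bier_preis,bier_konsum,hendl_preis,hendl_konsum
1985,16,7.0,440,3.0,54000,4.5,620000
1986,16,6.5,400,3.2,53000,4.7,600000
1987,16,6.3,390,3.3,52000,4.9,580000
1988,16,5.8,360,3.4,50000,5.0,560000
1989,16,6.1,380,3.6,51000,5.2,570000
1990,16,6.6,410,3.8,54000,5.4,590000
"""

# pandas.read_csv accepts a path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default.
print(df.head())
```

With the real file, the only change would be passing its path to pandas.read_csv (possibly with a different separator, depending on how the file is exported).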

Inappropriate data types and missing values are the most common problems of data sets. We can easily analyze both using the pandas.DataFrame.info method. This method prints information about a DataFrame including column types, number of non-null values and memory usage.

Other alternatives to check whether there are null entries in the data set are the following:
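Two common ways of doing this check are counting the null entries per column and asking whether the frame contains any null entry at all. The small frame below is made up (with one missing value on purpose) just to show the calls; it is not the Oktoberfest data.

```python
import pandas as pd

# Tiny made-up frame with one missing value, only to demonstrate the calls.
df = pd.DataFrame({"jahr": [1985, 1986, 1987], "bier_preis": [3.0, None, 3.3]})

# Number of null entries in each column.
nulls_per_column = df.isnull().sum()
print(nulls_per_column)

# True if the frame contains at least one null entry.
has_nulls = df.isnull().values.any()
print(has_nulls)
```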

We can also evaluate the data types using the pandas.DataFrame.dtypes attribute. This returns a Series with the data type of each column.
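As a quick sketch of the attribute in action (again with placeholder values and an assumed column name):

```python
import pandas as pd

# Placeholder values; the column names are assumptions.
df = pd.DataFrame({"jahr": [1985, 1986], "bier_preis": [3.0, 3.2]})

# dtypes is an attribute, not a method, so there are no parentheses.
column_types = df.dtypes
print(column_types)
```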

The data set does not contain null values and the data types are the expected ones, so no major cleaning tasks are required; however, there are a few minor changes that we can perform to facilitate the further analysis of the data set.

  1. Drop the column besucher_tag (visitors per day) as we won’t use it in our evaluation.
  2. Translate column names into English.
  3. Set year as the index of the data frame, to easily access the information of different years.
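The three steps above could look roughly like this. The German column names (apart from besucher_tag) and the English names chosen for them are assumptions, and the two rows are placeholders.

```python
import pandas as pd

# Placeholder rows; German column names other than besucher_tag are assumed.
df = pd.DataFrame({
    "jahr": [1985, 1986],
    "dauer": [16, 16],
    "besucher_gesamt": [7.0, 6.5],
    "besucher_tag": [440, 400],
    "bier_preis": [3.0, 3.2],
    "bier_konsum": [54000, 53000],
    "hendl_preis": [4.5, 4.7],
    "hendl_konsum": [620000, 600000],
})

# 1. Drop the visitors-per-day column.
df = df.drop(columns="besucher_tag")

# 2. Translate the column names (the English names are an arbitrary choice).
df = df.rename(columns={
    "jahr": "year",
    "dauer": "duration",
    "besucher_gesamt": "visitors_total",
    "bier_preis": "beer_price",
    "bier_konsum": "beer_consumption",
    "hendl_preis": "chicken_price",
    "hendl_konsum": "chicken_consumption",
})

# 3. Use the year as the index so that rows can be selected by year.
df = df.set_index("year")

print(df.loc[1985])
```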

After performing the changes, we are ready to easily draw conclusions using the data. 💪 🍻 Let’s get started! ▶️

Answering questions and drawing conclusions

Exploratory data analysis and data cleaning are the steps that allow us to get a feel for the data set and to get it ready for drawing conclusions. Now we are ready to answer the following questions using the data set.

How did beer price change over the last 33 years? And chicken price? Is there a tendency?

As we can see in the following plot, the prices have increased steadily from 1985 until 2018, with the price in 2018 being around three times that of 1985.

In 1985, we could enjoy a meal at the Oktoberfest for only 7.97 euros. Three decades later, we need almost three times more money to enjoy a meal at the festival.💰

We can also create an interactive plot of the prices using Plotly! 💚

We can observe that the prices can be approximated by a linear relation (more accurate in the case of beer). Later, we will use simple linear regression to predict the prices in 2030!

How did beer consumption change from 1985 till 2018? And chicken consumption? Is there a tendency?

The following plot shows the consumption of beer from 1985 till now.

As we can observe, the consumption of beer shows a growing trend. The vertical lines indicate the years in which Oktoberfest coincided with the Landwirtschaftsfest. This festival has been celebrated every four years since 1996 (before that, every three) at the Theresienwiese at the same time as Oktoberfest. We can see that the small decreases in 2004, 2008, and 2012 could be related to the celebration of this festival. In contrast, the significant decreases in beer consumption in 2001 and 2016 could be related to the September 11 terrorist attacks in 2001, the November 2015 Paris terrorist attacks, and the 2016 Nice truck attack. The number of visitors decreased in those years (2001 and 2016), showing the lowest values since 1985.
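A line plot with vertical marker lines like the one described can be sketched with Matplotlib as follows. The consumption values are placeholders, not the real figures, and only the three festival years named in the text are marked.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder consumption values (not the real figures), indexed by year.
beer = pd.Series(
    [60, 61, 59, 62, 63, 61, 64, 65, 63],
    index=range(2004, 2013),
    name="beer_consumption",
)

# Years named in the text as coinciding with the agricultural festival.
festival_years = [2004, 2008, 2012]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(beer.index, beer.values, marker="o")

# One dashed vertical line per festival year.
for year in festival_years:
    ax.axvline(x=year, color="gray", linestyle="--")

ax.set_xlabel("Year")
ax.set_ylabel("Beer consumption")
ax.set_title("Beer consumption at Oktoberfest")
```

In a notebook the figure is displayed automatically; in a script, drop the matplotlib.use line and call plt.show() at the end.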

The following plot represents the consumption of chicken from 1985 till 2018. As we can easily observe, the consumption of chicken presents a decreasing trend, and it was also affected by the terrorist attacks in 2001 and 2016.

How did the number of visitors change from 1985 till 2018? Is there a tendency?

We can easily analyze the number of visitors by using a line plot. As we can observe, the number of visitors does not follow any trend, ranging from 5.5 to 7.1 million.

So, the consumption of beer is not increasing because the festival has more visitors, but because people drink more. 😆 🍻

Is the price of beer also increasing in other folk festivals in Bavaria?

Oktoberfest may be the most famous folk festival in Bavaria, but of course it is not the only one! 🍻 The Bavarians know how to celebrate properly, and they organize festivals in many cities across Bavaria, not only in Munich ❤️! Now we will analyze the evolution of prices in other famous folk festivals in Bavaria. Have the prices also increased steadily in recent years? ▶️ Let’s see!

To perform this analysis, we use the beer prices found in the following article.

The article contains information about 8 different folk festivals in Bavaria. We can easily download the table available in the article by using the pandas.read_html function as follows:

The table includes: (1) Name of the festival, (2) Date of the festival, (3) Number of visitors, (4) Organizer, (5) Information about beer price, and (6) Beer price evolution from 2008 until 2016. Sorry, the article is in German! 😊

We are only interested in analyzing beer prices, thus we can create a data frame that only contains the prices available in the last column. These prices have to be in different columns, as tidy data sets contain each value placed in its own cell. To extract both years and prices, we use regular expressions and the re.findall(pattern, string, flags=0) function. This function returns all matches of a pattern in a string as a list of strings (or a list of tuples when the pattern contains more than one group). Then, we create a dictionary for each festival, where the keys contain the years and the values contain the prices of beer. After creating the dictionaries, we can place them in a list. Finally, we use this list of dictionaries as the argument of the pandas.DataFrame function, obtaining the following data frame.
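The extraction step might look something like the sketch below. The festival names, the text format of the cells, and the prices are all hypothetical, since the real wording of the table is not reproduced here; the pattern would have to be adapted to the actual cell contents.

```python
import re

import pandas as pd

# Hypothetical cell contents: the real wording and prices in the article may differ.
raw_cells = {
    "Festival A": "2008: 6,90 Euro; 2009: 7,10 Euro; 2010: 7,40 Euro",
    "Festival B": "2008: 5,20 Euro; 2009: 5,40 Euro; 2010: 5,50 Euro",
}

# Group 1 captures a four-digit year, group 2 a price with a decimal comma.
pattern = r"(\d{4}):\s*(\d+,\d+)"

records = []
for text in raw_cells.values():
    # With two groups, findall returns a list of (year, price) tuples.
    pairs = re.findall(pattern, text)
    # One dictionary per festival: year -> price (decimal comma replaced by a dot).
    records.append({year: price.replace(",", ".") for year, price in pairs})

# A list of dictionaries becomes one row per festival and one column per year.
prices = pd.DataFrame(records, index=list(raw_cells.keys()))
print(prices)
```

Note that the prices are still strings at this point.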

We have to perform a few modifications in the data frame before we visualize it. First, we have to modify the data types of all columns, as they must contain floats instead of strings. Then, we have to transpose the data frame, since the data frame index is by default the x axis of the plot, and each column of the data frame is drawn as a separate line in the visualization. Finally, we include the price of beer at Oktoberfest from 2008 till 2016.
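These three modifications can be sketched as follows, using a small frame with hypothetical festival names and placeholder prices (including the Oktoberfest column).

```python
import pandas as pd

# Prices as strings, one row per festival (names and values are placeholders).
prices = pd.DataFrame(
    {"2008": ["6.90", "5.20"], "2009": ["7.10", "5.40"]},
    index=["Festival A", "Festival B"],
)

# 1. Convert every column from string to float.
prices = prices.astype(float)

# 2. Transpose: years become the index (x axis), festivals become columns (lines).
prices_by_year = prices.T

# 3. Add the Oktoberfest prices for the same years (placeholder values).
prices_by_year["Oktoberfest"] = [8.0, 8.3]

print(prices_by_year)
```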

Now, we are ready to visualize the evolution of prices at different festivals across Bavaria. As we can observe, the prices of beer are also increasing at other festivals across Bavaria. However, the prices at Oktoberfest are higher than at other folk festivals, with Dachauer Volksfest being the cheapest one.

The prices of beer at most festivals are pretty similar; therefore, a clear visualization using Matplotlib is not possible. Alternatively, we can employ the Plotly library as before. This library makes interactive plots, allowing us to hover over points, to select and deselect lines, and to zoom in and out.

We can also calculate the percentage change in beer prices between each year and the previous one.
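One way to compute this is pandas.DataFrame.pct_change, which compares each row with the previous one. The sketch below assumes the years are the index and uses placeholder prices.

```python
import pandas as pd

# Placeholder prices with years as the index and one column per festival.
prices_by_year = pd.DataFrame(
    {"Festival A": [8.0, 8.4, 8.4], "Festival B": [5.0, 5.5, 6.05]},
    index=[2008, 2009, 2010],
)

# pct_change returns a fraction relative to the previous row; scale to percent.
change = prices_by_year.pct_change() * 100

# The first row is NaN because there is no previous year to compare with.
print(change.round(1))
```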

The table shows that prices increase more evenly at Oktoberfest than at other folk festivals, with an increment of around 3% every year. Other folk festivals show less uniform price increments; however, in some years the increments are much larger than those at Oktoberfest.

Plotly allows us to make animated scatter plots on maps so that we can visualize not only beer prices in different years, but also where the different folk festivals are located. First, we have to include in the data frame the latitude and longitude of the cities where the festivals are celebrated. It would be more accurate to include the exact latitude and longitude of the festival grounds, but for this visualization the latitude and longitude of the city is a pretty good approximation.

To obtain both the latitude and the longitude of the cities, we employ geopy. After obtaining the location of the cities, we include this information in the previously created data frame. We have to transpose the data frame again, since we want the festivals to be the index of the frame.

Depending on the visualization we want to display, the data frame should have one shape or another. To use the Plotly Express function px.scatter_mapbox, we have to first melt our data frame, displaying year as a variable. We can take a look at pandas.melt documentation to understand how it works. 👌
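As a rough sketch of the melting step: the frame below has festivals as the index, coordinate columns, and one column per year. The festival names, coordinates, prices, and the column names lat, lon, and beer_price are all placeholders chosen for illustration.

```python
import pandas as pd

# Wide frame: one row per festival, one column per year (placeholder values).
wide = pd.DataFrame(
    {"lat": [48.0, 48.5], "lon": [11.0, 11.5], "2008": [8.0, 5.0], "2009": [8.4, 5.5]},
    index=pd.Index(["Festival A", "Festival B"], name="festival"),
)

# Move the index into a column, then melt the year columns into rows.
tidy = wide.reset_index().melt(
    id_vars=["festival", "lat", "lon"],  # columns kept as identifiers
    var_name="year",                     # former column names
    value_name="beer_price",             # former cell values
)

print(tidy)
```

Each row of the result now describes one festival in one year, which is the long format that Plotly Express works with when year is used as the animation variable.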

Now, we have a tidy data frame to create a scatter map plot using px.scatter_mapbox. Again, I encourage you to read the documentation to understand the arguments provided to the function.

And voilà! We have our scatter map plot, where we can visualize not only the prices, but also the location of the festivals from 2008 till 2016. 💪

What are beer prices this year at the most popular tents?

The information was obtained again from a web page. After cleaning the data, we obtain a data set that contains the prices in 2018 and 2019 as well as the percentage change in price.

The following plot shows the prices in increasing order. As we can observe, beer prices are between 11.4 and 11.8 euros, with Augustiner being the cheapest one.

Prediction of prices for 2030

The rise in prices follows a linear trend (more accurate in beer prices). Therefore, we can create a linear regression model to predict both prices (chicken and beer) in 2030. To do so, we can use the linear regression model available in the scikit-learn library as follows:
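A minimal sketch of such a model for the beer price is shown below. The five data points are placeholders that lie exactly on a line, so the numbers it prints say nothing about the real prediction; with the real data, the full price column from 1985 to 2018 would be used instead, and the same steps repeated for chicken.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: years as a 2D feature array, prices as the target.
years = np.array([2014, 2015, 2016, 2017, 2018]).reshape(-1, 1)
beer_price = np.array([10.0, 10.3, 10.6, 10.9, 11.2])

# Fit price = intercept + slope * year.
model = LinearRegression()
model.fit(years, beer_price)

# predict expects a 2D array, even for a single year.
predicted_2030 = model.predict(np.array([[2030]]))[0]

print("Yearly increase:", round(model.coef_[0], 2))
print("Predicted price in 2030:", round(predicted_2030, 2))
```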

According to the linear prediction model, a meal at Oktoberfest will cost 28.33 € in 2030 😮

Key Takeaways

  • The prices of beer and chicken at Oktoberfest have been increasing steadily over the last 30 years.
  • Visitors drink more and eat less. The consumption of beer shows an increasing trend over the years. In contrast, the consumption of chicken is decreasing. The consumption was affected by the terrorist attacks in 2001 and 2016. In those years, fewer people visited the festival.
  • The number of visitors does not follow any trend, ranging from 5.5 to 7.1 million.
  • The prices of beer are also increasing at other folk festivals across Bavaria. The prices at Dachauer Volksfest are half of those at Oktoberfest. If you want to get drunk without going broke, go to Dachau. 😆 It is near Munich!
  • Beer price at Oktoberfest has been increasing by around 3% every year since 2008. The prices at other folk festivals across Bavaria are increasing less uniformly; however, in some years the increment is much larger than at Oktoberfest.
  • Larger tents at Oktoberfest offer beer prices between 11.4 € and 11.8 € this year.
  • The price of a meal at Oktoberfest (beer+chicken) will be around 28€ in 2030 if prices continue increasing.

Thanks for reading 🍀





