Insightful Analysis Of Munich AirBnB Data

Marvin Lüthe
Aug 11, 2019

Introduction

In this post I will analyze AirBnB data for Munich, Germany. I will focus on the prices and customer reviews as well as on the preferred hotspots in Munich. For the purpose of this analysis, I collected the following datasets from Inside Airbnb, an independent project that publishes data scraped from the AirBnB website (http://insideairbnb.com/get-the-data.html):

  • calendar.csv: detailed calendar data
  • listings.csv: summary information and metrics for listings
  • reviews.csv: summary review data and listing ID.

First, I will analyze the prices of all AirBnB apartments listed in Munich and search for patterns in the price development over the course of a calendar year.

Then, I will use a generative statistical model from natural language processing to identify the main topics Munich customers at AirBnB care about.

Finally, I will show which urban districts are the most popular in Munich.

Part I: Are there patterns in the price development?

Munich is one of the most expensive cities in Germany when it comes to living expenses. Therefore, I am particularly interested in the pricing of AirBnB listings in Munich. The dataset includes 10,057 listings. Below we can see a histogram of the nightly prices of the AirBnB apartments listed in Munich from June 2019 to July 2020.

For the sake of visualization, I removed outliers, as the prices of the most expensive apartments go up to 5,000€ per night. The distribution is right-skewed; the dotted line indicates the median of 80€.

Figure 1: Price Histogram Of All AirBnB Listings In Munich
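The preparation behind a histogram like Figure 1 can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: it assumes that prices arrive as text such as "$1,250.00" (as in some Inside Airbnb exports), and both the sample values and the plotting cut-off MAX_PLOT_PRICE are made up.

```python
import statistics

# Hypothetical cut-off used only for plotting; the post does not state the real one.
MAX_PLOT_PRICE = 500.0

def parse_price(raw):
    """Convert a price string such as '$1,250.00' to a float."""
    return float(raw.replace("$", "").replace(",", "").strip())

# Made-up sample values standing in for the price column of the listings.
raw_prices = ["$80.00", "$45.00", "$120.00", "$1,250.00", "$65.00", "$5,000.00", "$95.00"]

prices = [parse_price(p) for p in raw_prices]

# The median is robust against the few very expensive listings.
median_price = statistics.median(prices)

# Only the histogram input is trimmed; the full list is kept for statistics.
plot_prices = [p for p in prices if p <= MAX_PLOT_PRICE]

print(f"median: {median_price:.2f}, listings kept for plotting: {len(plot_prices)}")
```

Trimming only the plotted values keeps the long right tail from squashing the bulk of the distribution into a single bar.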

In order to find patterns in the pricing data, I decided to visualize price boxplots per calendar week.

Figure 2: Price Boxplots Per Calendar Week
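The grouping step behind weekly boxplots can be sketched as follows. It assumes that each calendar row carries an ISO date string and a numeric price; the rows here are invented, and the function name prices_by_calendar_week is hypothetical.

```python
from collections import defaultdict
from datetime import date
import statistics

def prices_by_calendar_week(rows):
    """Group (ISO date string, price) pairs by ISO calendar week number."""
    weeks = defaultdict(list)
    for day, price in rows:
        week = date.fromisoformat(day).isocalendar()[1]
        weeks[week].append(price)
    return dict(weeks)

# Made-up rows; the real calendar.csv has one row per listing and day.
rows = [
    ("2019-09-02", 100.0),
    ("2019-09-04", 110.0),
    ("2019-09-23", 180.0),
    ("2019-09-25", 220.0),
    ("2019-09-27", 200.0),
]

weekly = prices_by_calendar_week(rows)

# One median per week; a boxplot would draw the full list of each week.
weekly_median = {week: statistics.median(values) for week, values in weekly.items()}
print(weekly_median)
```

Since the data spans more than twelve months, a week number can occur in two different years, so a real analysis may want to group by year and week together.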

Figure 2 depicts the price boxplots per calendar week. We can clearly see that apartment prices do not fluctuate much over most of the year. However, there seems to be a period of two to three weeks in which the prices go up.

The most likely reason for the price increase of the AirBnB apartments during calendar weeks 38, 39 and 40 is the Oktoberfest.

The Oktoberfest, held in Munich, is the world’s largest Volksfest. It runs from mid or late September to the first weekend in October, with more than six million people from around the world attending the event every year.

Let’s run a hypothesis test to figure out whether the price increase during this phase is statistically significant. For the hypothesis test I compare the prices during the Oktoberfest with the prices during the rest of the year. I use the following set-up for the t-test:

  • One-tailed t-test
  • Null Hypothesis H0: μ = 111€ per night
  • Alternative Hypothesis H1: μ > 111€ per night
  • α = 0.05
  • Critical value: t = 1.645

The t-statistic of 56.66 yields a p-value of almost zero. Therefore, we can reject the null hypothesis and conclude that the price increase is statistically significant.
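A one-tailed, one-sample test of this kind can be sketched with the standard library alone. The sketch below is an approximation and not necessarily how the numbers above were computed: it uses the standard normal distribution in place of the t distribution, which is only reasonable for large samples (and is where the critical value of about 1.645 comes from). The sample prices are invented and far too few for a real test.

```python
import math
import statistics
from statistics import NormalDist

def one_sample_test_greater(sample, mu0, alpha=0.05):
    """Test H0: mean == mu0 against H1: mean > mu0 (large-sample approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1)
    t_stat = (mean - mu0) / (sd / math.sqrt(n))
    # Assumption: for large n the t distribution is close to the standard normal.
    p_value = 1.0 - NormalDist().cdf(t_stat)
    critical = NormalDist().inv_cdf(1.0 - alpha)
    return t_stat, p_value, critical

# Invented Oktoberfest prices, only to make the sketch runnable.
oktoberfest_prices = [150, 160, 170, 180, 190, 200, 140, 155, 165, 175]

t_stat, p_value, critical = one_sample_test_greater(oktoberfest_prices, mu0=111)
reject_h0 = t_stat > critical
print(f"t = {t_stat:.2f}, critical = {critical:.3f}, reject H0: {reject_h0}")
```

If SciPy is available, scipy.stats.ttest_1samp with alternative="greater" performs the exact t-test instead of this approximation.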

Part II: What do AirBnB guests in Munich care about?

For this part I will apply topic modeling to the customer reviews to find out what AirBnB guests look for in Munich. One popular topic modeling technique is known as Latent Dirichlet Allocation (LDA).

LDA imagines a fixed set of topics. Each topic represents a set of words. The goal of LDA is to map all documents to the topics in such a way that the words in each document are mostly captured by those imaginary topics. LDA assumes that each document can be described by a distribution of topics and each topic can be described by a distribution of words.
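To make these two assumptions concrete, here is a toy sketch of the generative story that LDA assumes: a document draws its own topic mixture, and every word is then drawn from one of the topics. The two topics, their word probabilities and the helper names are invented for illustration. Fitting an LDA model does the reverse: it infers such distributions from the observed reviews.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate a document: pick a topic per word, then a word from that topic."""
    topic_names = list(topics)
    # Each document has its own distribution over topics.
    theta = sample_dirichlet([alpha] * len(topic_names), rng)
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=theta)[0]
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical topics: each one is a distribution over words.
topics = {
    "location": {"central": 0.5, "station": 0.3, "walk": 0.2},
    "comfort": {"clean": 0.5, "quiet": 0.3, "kitchen": 0.2},
}

rng = random.Random(0)
theta, words = generate_document(topics, alpha=0.5, length=8, rng=rng)
print(theta, words)
```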

If you want to dive deeper into the mechanics of LDA, I can recommend reading this post: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158.

Once the LDA model is built, we would like to interpret the results of this model. To this end, we will use the Python package pyLDAvis. pyLDAvis is designed to help users interpret the topics in a model that has been fit to a corpus of text data. It extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Figure 3: Visualization Of The Trained LDA model

Figure 3 shows the pyLDAvis HTML file which helps us in identifying the main topics in the customer reviews. On the left part of the page, we see topics as circles. The areas of the circles are proportional to the relative prevalences of the topics in the corpus. On the right-hand side, we see the terms which best describe the given topic. On the upper right, we can adjust the value for λ. By adjusting λ, we change the order in which the terms are ranked. A user study by researchers from Iowa State University and AT&T Labs Research suggested that λ = 0.6 is a good value for interpreting the underlying topics of a corpus (cf. https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf). This visualization technique is powerful because it is interactive. Therefore, I recommend checking out the lda.html file in my GitHub repository (https://github.com/scientist94/AirBnB_Analysis).
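The effect of λ can be illustrated with the relevance measure defined in the paper linked above: relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The sketch below implements that formula directly; the word probabilities are made up, and the function names are hypothetical rather than part of the pyLDAvis API.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.6):
    """Relevance of a term for a topic, weighting probability against lift."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

def rank_terms(topic_probs, corpus_probs, lam=0.6):
    """Return the terms of one topic sorted by decreasing relevance."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )

# Invented probabilities: p(w | topic) and the overall p(w) in the corpus.
topic_probs = {"great": 0.30, "station": 0.20, "subway": 0.10}
corpus_probs = {"great": 0.25, "station": 0.03, "subway": 0.01}

by_probability = rank_terms(topic_probs, corpus_probs, lam=1.0)
by_lift = rank_terms(topic_probs, corpus_probs, lam=0.0)
balanced = rank_terms(topic_probs, corpus_probs, lam=0.6)
print(by_probability, by_lift, balanced)
```

With λ = 1 the frequent but generic word "great" ranks first, while with λ = 0 the rare but topic-specific word "subway" does. Values in between trade off the two.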

For the sake of this analysis, I have created multiple LDA models with different numbers of topics. Finally, I decided to retrieve 10 topics. I will not go through every topic as there are intersections between the topics and some topics are themselves combinations of different topics. The LDA algorithm is not always able to separate the topics cleanly.

Based on my interpretation, the main topics that AirBnB guests in Munich care about are:

  • A comfortable, stylish apartment with a clean kitchen and bathroom. They also appreciate it if it is quiet and friendly.
  • The customers also care very much about how easily accessible the flat is. They prefer flats that are within walking distance and that are very close to the public transport and the central station. Central flats are definitely the preferred choice.
  • The customers appreciate it if grocery stores are nearby.
  • Another important part is a smooth reservation process. Quick response times and flexibility are aspects that AirBnB hosts should definitely keep in mind to satisfy their guests.

Part III: What are the most popular places in Munich?

The text-mining approach already showed that AirBnB guests prefer central flats that are easily accessible and close to public transportation.

In this part, I will verify this information by looking at latitudes and longitudes of the rented flats. I will plot these flats on a heatmap using the Python package gmplot.

Figure 4: Munich AirBnB Heatmaps

Figure 4 depicts a heatmap for Munich. We clearly see that central urban districts are very popular. However, some flats on the outskirts are in very high demand as well. For example, on the right-hand side of the map there are some red areas which indicate a high demand. If we zoom into the map, we will see that these areas are near the exhibition center in Munich.
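The aggregation idea behind such a heatmap can be sketched without any mapping library by counting listings per grid cell. This is only an illustration of the concept and not the gmplot code used for Figure 4; the coordinates are invented, and the cell size of 0.01 degrees is an arbitrary assumption.

```python
import math
from collections import Counter

def bin_listings(coords, cell_size=0.01):
    """Count listings per grid cell; a cell is identified by floored lat/lon indices."""
    counts = Counter()
    for lat, lon in coords:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[cell] += 1
    return counts

# Invented coordinates: three central listings and one further east.
coords = [
    (48.1372, 11.5756),
    (48.1379, 11.5751),
    (48.1351, 11.5762),
    (48.1335, 11.6990),
]

density = bin_listings(coords)
hottest_cell, hottest_count = density.most_common(1)[0]
print(hottest_cell, hottest_count)
```

Cells with high counts correspond to the red areas on the map; a plotting library then only has to color them.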

The three most popular quarters are Maxvorstadt, Ludwigsvorstadt-Isarvorstadt and Schwabing-Freimann.

If you are thinking about becoming an AirBnB host in Munich, you could use this map to estimate the demand for your flat based on its location.

Conclusion

This analysis looked at the prices and customer reviews of Munich AirBnB data. We used both descriptive and inferential statistical approaches to analyse the AirBnB pricing. Furthermore, we applied a natural language processing technique to get insights from the customer reviews. Finally, we identified the most popular areas where AirBnB tourists like to spend their time in Munich.

I hope you enjoyed reading my post. Feel free to contact me with any feedback.

The code is available on GitHub: https://github.com/scientist94/AirBnB_Analysis
