https://images.unsplash.com/photo-1551036746-04a04d0aaa22?ixlib=rb-4.0.3&q=85&fm=jpg&crop=entropy&cs=srgb

I am looking for an apartment in Berlin for the third time in my life. For the first two times, I was looking for a place to hang my hat. So, I was not very picky about it. This time I am pickier because I have leverage as my current apartment is decent and I don’t need to move out anytime soon.

What I want at the end is an affordable apartment near my office. This means I am looking for an apartment with a cold rent below the Berlin average, in Kreuzberg or in the quarters nearby.

My questions are:

  1. How does Kreuzberg, the quarter and borough compare to others in Berlin?
  2. Should I get a studio apartment or an apartment with two small rooms?
  3. How about amenities? Does a built-in-kitchen, balcony or a garden increase the rent?

There are probably great reports on the real estate market in Germany that could answer these questions, but they all hide behind a very tall paywall. So, I decided to scrape data from ImmobilientsScout24, one of the biggest real estate websites in Germany.

I scraped the data from Immobilienscout24 using a script that I have found here.

You can find the all data cleaning and analysis code in Python on my GitHub page.

You can find this project’s Tableau Dashboard here.

Data Cleaning

First I load up the scraped data and start exploring it:

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from matplotlib.pyplot import figure

df_orig = pd.read_csv('.../DE_rentals.csv',sep=';')
df = df_orig.copy()
df.head()

There are some unnecessary column. I’ll remove them:

df.drop(['url', 'firstname', 'lastname', 'phoneNumber','lat', 
					'lon', 'creation', 'ID', 'address'], axis=1, inplace=True)

In the dataset some values are marked with "no value" or "None". I should change that to standard null value:

df = df.replace({"no value":pd.NA})
df = df.replace({"None":pd.NA})

for col in df.columns:
    pct_missing = (np.mean(df[col].isnull())*100)
    print('{} - {}%'.format(col, pct_missing))

Untitled

About 21 per cent of listings have their company column missing. These listings are probably from private sellers. I’ll drop rows where company column is missing and not offered by a private person. Then, I’ll fill the remaining null “Company rows with the string “Private Seller”: