{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Netherlands Accommodation Prices (FCG)\n",
    "    \n",
    "The Dutch housing crisis is one of the biggest problems that residents face.\n",
    "\n",
    "Due to multiple factors, such as population growth and a shortage of construction workers, the availability of housing has decreased significantly. This decline has pushed rent to sky-high prices, which leaves many wondering whether they are being taken advantage of.\n",
    "\n",
    "In order to answer this question, you are tasked to predict the rent of a house given its data (i.e. location, size, facilities etc.)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Uncomment next cell to download the libraries._ We use a `!` before `pip` to run the command in the terminal instead of python. When using a computer locally, it is sufficient to execute the below line in the terminal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install numpy\n",
    "# !pip install pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading data\n",
    "\n",
    "Download the csv data files from Kaggle and place it in a folder `./datasets/` in the same directory as this notebook. In python to load csv files we use a library called `pandas`. Pandas is a open source library created to handle the manipulation and storage of large scale datasets.\n",
    "\n",
    "- Pandas is optimized for handling large scale datasets\n",
    "- This optimization results in massive speedups when compared to other libraries like csv and pickle\n",
    "- Pandas contains several useful features for Machine Learning such as:\n",
    "    - Data cleaning\n",
    "    - Data inspection\n",
    "    - Statistical analysis\n",
    "    - Data normalization\n",
    "    - Loading and storing data\n",
    "\n",
    "\n",
    "A CSV file is a text file which uses a comma to separate values. An example is provided below\n",
    "\n",
    "```\n",
    "id,title,city,postalCode\n",
    "0,West-Varkenoordseweg,Rotterdam,3974HN\n",
    "3,Ruiterakker,Assen,9407BG\n",
    "8,Brusselseweg,Maastricht,6217GX\n",
    "10,Donkerslootstraat,Rotterdam,3074WL\n",
    "12,Vorselenburgstraat,Alphen aan den Rijn,2405XJ\n",
    "```\n",
    "\n",
    "While the above data might be difficult for a human to read at a first glance, machines can parse these files quickly. To read the file, we'll use a `read_csv` method.\n",
    "\n",
    "Read more about the methonds avaliable in the [documentation](https://pandas.pydata.org/docs/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>city</th>\n",
       "      <th>postalCode</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>areaSqm</th>\n",
       "      <th>firstSeenAt</th>\n",
       "      <th>lastSeenAt</th>\n",
       "      <th>isRoomActive</th>\n",
       "      <th>rawAvailability</th>\n",
       "      <th>...</th>\n",
       "      <th>matchAge</th>\n",
       "      <th>matchGender</th>\n",
       "      <th>matchCapacity</th>\n",
       "      <th>matchLanguages</th>\n",
       "      <th>matchStatus</th>\n",
       "      <th>coverImageUrl</th>\n",
       "      <th>additionalCosts</th>\n",
       "      <th>rent</th>\n",
       "      <th>deposit</th>\n",
       "      <th>registrationCost</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>West-Varkenoordseweg</td>\n",
       "      <td>Rotterdam</td>\n",
       "      <td>3074HN</td>\n",
       "      <td>51.896601</td>\n",
       "      <td>4.514993</td>\n",
       "      <td>14</td>\n",
       "      <td>2019-07-14 11:25:46.511000+00:00</td>\n",
       "      <td>2019-07-26 22:18:23.142000+00:00</td>\n",
       "      <td>True</td>\n",
       "      <td>26-06-'19 - Indefinite period</td>\n",
       "      <td>...</td>\n",
       "      <td>16 years - 99 years</td>\n",
       "      <td>Not important</td>\n",
       "      <td>1 person</td>\n",
       "      <td>Not important</td>\n",
       "      <td>Not important</td>\n",
       "      <td>https://resources.kamernet.nl/image/913b4b03-5...</td>\n",
       "      <td>50.0</td>\n",
       "      <td>500</td>\n",
       "      <td>500.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ruiterakker</td>\n",
       "      <td>Assen</td>\n",
       "      <td>9407BG</td>\n",
       "      <td>53.013494</td>\n",
       "      <td>6.561012</td>\n",
       "      <td>16</td>\n",
       "      <td>2019-07-14 11:25:46.988000+00:00</td>\n",
       "      <td>2019-07-18 22:00:31.174000+00:00</td>\n",
       "      <td>False</td>\n",
       "      <td>16-06-'19 - Indefinite period</td>\n",
       "      <td>...</td>\n",
       "      <td>18 years - 32 years</td>\n",
       "      <td>Female</td>\n",
       "      <td>1 person</td>\n",
       "      <td>Not important</td>\n",
       "      <td>Student, Working student</td>\n",
       "      <td>https://resources.kamernet.nl/image/84e95365-6...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>290</td>\n",
       "      <td>290.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Brusselseweg</td>\n",
       "      <td>Maastricht</td>\n",
       "      <td>6217GX</td>\n",
       "      <td>50.860841</td>\n",
       "      <td>5.671673</td>\n",
       "      <td>16</td>\n",
       "      <td>2019-07-14 11:25:47.814000+00:00</td>\n",
       "      <td>2019-08-10 00:14:27.130000+00:00</td>\n",
       "      <td>True</td>\n",
       "      <td>15-07-'19 - Indefinite period</td>\n",
       "      <td>...</td>\n",
       "      <td>16 years - 40 years</td>\n",
       "      <td>Male</td>\n",
       "      <td>4 persons</td>\n",
       "      <td>Dutch English</td>\n",
       "      <td>Student</td>\n",
       "      <td>https://resources.kamernet.nl/image/6e625591-d...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>425</td>\n",
       "      <td>425.0</td>\n",
       "      <td>25.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Donkerslootstraat</td>\n",
       "      <td>Rotterdam</td>\n",
       "      <td>3074WL</td>\n",
       "      <td>51.893195</td>\n",
       "      <td>4.516478</td>\n",
       "      <td>25</td>\n",
       "      <td>2019-07-14 11:25:48.140000+00:00</td>\n",
       "      <td>2019-07-16 06:05:32.183000+00:00</td>\n",
       "      <td>False</td>\n",
       "      <td>01-08-'19 - Indefinite period</td>\n",
       "      <td>...</td>\n",
       "      <td>21 years - 99 years</td>\n",
       "      <td>Not important</td>\n",
       "      <td>4 persons</td>\n",
       "      <td>Dutch English Spanish French Italian German Po...</td>\n",
       "      <td>Student, Working student, Working, Looking for...</td>\n",
       "      <td>https://resources.kamernet.nl/image/ea3aea77-0...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>600</td>\n",
       "      <td>1200.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Vorselenburgstraat</td>\n",
       "      <td>Alphen aan den Rijn</td>\n",
       "      <td>2405XJ</td>\n",
       "      <td>52.122335</td>\n",
       "      <td>4.661434</td>\n",
       "      <td>10</td>\n",
       "      <td>2019-07-14 11:25:48.465000+00:00</td>\n",
       "      <td>2019-08-01 00:02:40.516000+00:00</td>\n",
       "      <td>True</td>\n",
       "      <td>08-07-'19 - Indefinite period</td>\n",
       "      <td>...</td>\n",
       "      <td>22 years - 40 years</td>\n",
       "      <td>Not important</td>\n",
       "      <td>1 person</td>\n",
       "      <td>Dutch English</td>\n",
       "      <td>Student, Working student, Working</td>\n",
       "      <td>https://resources.kamernet.nl/image/d0780298-b...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>425</td>\n",
       "      <td>425.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 36 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                   title                 city postalCode   latitude  \\\n",
       "id                                                                    \n",
       "0   West-Varkenoordseweg            Rotterdam     3074HN  51.896601   \n",
       "3            Ruiterakker                Assen     9407BG  53.013494   \n",
       "8           Brusselseweg           Maastricht     6217GX  50.860841   \n",
       "10     Donkerslootstraat            Rotterdam     3074WL  51.893195   \n",
       "12    Vorselenburgstraat  Alphen aan den Rijn     2405XJ  52.122335   \n",
       "\n",
       "    longitude  areaSqm                       firstSeenAt  \\\n",
       "id                                                         \n",
       "0    4.514993       14  2019-07-14 11:25:46.511000+00:00   \n",
       "3    6.561012       16  2019-07-14 11:25:46.988000+00:00   \n",
       "8    5.671673       16  2019-07-14 11:25:47.814000+00:00   \n",
       "10   4.516478       25  2019-07-14 11:25:48.140000+00:00   \n",
       "12   4.661434       10  2019-07-14 11:25:48.465000+00:00   \n",
       "\n",
       "                          lastSeenAt isRoomActive  \\\n",
       "id                                                  \n",
       "0   2019-07-26 22:18:23.142000+00:00         True   \n",
       "3   2019-07-18 22:00:31.174000+00:00        False   \n",
       "8   2019-08-10 00:14:27.130000+00:00         True   \n",
       "10  2019-07-16 06:05:32.183000+00:00        False   \n",
       "12  2019-08-01 00:02:40.516000+00:00         True   \n",
       "\n",
       "                  rawAvailability  ...             matchAge    matchGender  \\\n",
       "id                                 ...                                       \n",
       "0   26-06-'19 - Indefinite period  ...  16 years - 99 years  Not important   \n",
       "3   16-06-'19 - Indefinite period  ...  18 years - 32 years         Female   \n",
       "8   15-07-'19 - Indefinite period  ...  16 years - 40 years           Male   \n",
       "10  01-08-'19 - Indefinite period  ...  21 years - 99 years  Not important   \n",
       "12  08-07-'19 - Indefinite period  ...  22 years - 40 years  Not important   \n",
       "\n",
       "   matchCapacity                                     matchLanguages  \\\n",
       "id                                                                    \n",
       "0       1 person                                      Not important   \n",
       "3       1 person                                      Not important   \n",
       "8      4 persons                                      Dutch English   \n",
       "10     4 persons  Dutch English Spanish French Italian German Po...   \n",
       "12      1 person                                      Dutch English   \n",
       "\n",
       "                                          matchStatus  \\\n",
       "id                                                      \n",
       "0                                       Not important   \n",
       "3                            Student, Working student   \n",
       "8                                             Student   \n",
       "10  Student, Working student, Working, Looking for...   \n",
       "12                  Student, Working student, Working   \n",
       "\n",
       "                                        coverImageUrl additionalCosts rent  \\\n",
       "id                                                                           \n",
       "0   https://resources.kamernet.nl/image/913b4b03-5...            50.0  500   \n",
       "3   https://resources.kamernet.nl/image/84e95365-6...             NaN  290   \n",
       "8   https://resources.kamernet.nl/image/6e625591-d...             NaN  425   \n",
       "10  https://resources.kamernet.nl/image/ea3aea77-0...             NaN  600   \n",
       "12  https://resources.kamernet.nl/image/d0780298-b...             NaN  425   \n",
       "\n",
       "   deposit registrationCost  \n",
       "id                           \n",
       "0    500.0              0.0  \n",
       "3    290.0              NaN  \n",
       "8    425.0             25.0  \n",
       "10  1200.0              0.0  \n",
       "12   425.0              NaN  \n",
       "\n",
       "[5 rows x 36 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# read the csv file from path, and index it by id\n",
    "train = pd.read_csv('datasets/train.csv', index_col='id')\n",
    "\n",
    "# show first 5 rows\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data analysis and machine learning\n",
    "\n",
    "Now, when data is loaded, we can preform data analysis and train machine learning models. This is a job for you!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making a submission file\n",
    "\n",
    "Once the machine learning model is trained, we can feed it the testing data which does not contain goal states. To evaluate our model, we need to create a submission file and upload it to the Kaggle. The submission file is a csv file which consists of the `id`s of the datapoints and their predicted goal value. An example of the submission file for the competition's dataset is given below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rent</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    rent\n",
       "id      \n",
       "1      0\n",
       "2      0\n",
       "4      0\n",
       "5      0\n",
       "6      0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# For demonstration purposes, we will load the submission file from an existing csv\n",
    "submission = pd.read_csv('datasets/sample_submission.csv', index_col='id')\n",
    "\n",
    "submission.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rent</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1391</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2419</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5381</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>5224</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    rent\n",
       "id      \n",
       "1   1391\n",
       "2   2419\n",
       "4   5381\n",
       "5   5972\n",
       "6   5224"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Normally, a submission dataframe would be generated when we feed the test data to the model\n",
    "# But we will fill it with random numbers\n",
    "\n",
    "for i in submission.index:\n",
    "    submission['rent'][i] = np.random.randint(\n",
    "        train['rent'].min(), \n",
    "        train['rent'].max()\n",
    "    )\n",
    "    \n",
    "submission.head()"
   ]
  },
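  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a trained model, the random numbers above would be replaced by the model's predictions. As a minimal sketch, assume a deliberately naive baseline that predicts the median training rent for every listing. The frames `toy_train` and `toy_submission` and the index `test_ids` are hypothetical stand-ins for the real data, not part of the competition files.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-ins for the training data and the test ids\n",
    "toy_train = pd.DataFrame({'rent': [500, 290, 425, 600, 425]})\n",
    "test_ids = pd.Index([1, 2, 4, 5, 6], name='id')\n",
    "\n",
    "# Naive baseline: predict the median training rent for every test id\n",
    "baseline = toy_train['rent'].median()\n",
    "toy_submission = pd.DataFrame({'rent': baseline}, index=test_ids)\n",
    "\n",
    "print(toy_submission)\n",
    "```\n",
    "\n",
    "A constant baseline like this is mainly useful as a reference score that a real model should beat."
   ]
  },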
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the submission file is obtained, we can convert it into a csv file using the `to_csv()` method which takes the desired file name as an argument"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "submission.to_csv('submission.csv', index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the submission file in csv format, we can upload it to Kaggle for evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.10"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}