{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Netherlands Accommodation Prices (FCG)\n", " \n", "The Dutch housing crisis is one of the biggest problems that residents face.\n", "\n", "Due to multiple factors, such as population growth and a shortage of construction workers, the availability of housing has decreased significantly. This decline has pushed rents to sky-high levels, which leaves many tenants wondering whether they are being taken advantage of.\n", "\n", "To help answer this question, your task is to predict the rent of a house given its data (e.g. location, size, facilities)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Uncomment the next cell to install the libraries._ We put a `!` before `pip` so that the notebook runs the command in the system shell instead of interpreting it as Python. When working on a local machine, it is sufficient to run the same commands, without the `!`, in a terminal." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# !pip install numpy\n", "# !pip install pandas" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading data\n", "\n", "Download the CSV data files from Kaggle and place them in a folder `./datasets/` in the same directory as this notebook. To load CSV files in Python, we use a library called `pandas`. 
Pandas is an open-source library created to handle the manipulation and storage of large-scale datasets.\n", "\n", "- Pandas is optimized for handling large-scale datasets\n", "- Its column-wise operations are typically much faster than processing a file row by row with the standard-library `csv` module\n", "- Pandas contains several useful features for machine learning, such as:\n", " - Data cleaning\n", " - Data inspection\n", " - Statistical analysis\n", " - Data normalization\n", " - Loading and storing data\n", "\n", "\n", "A CSV (comma-separated values) file is a text file that uses commas to separate values. An example is provided below.\n", "\n", "```\n", "id,title,city,postalCode\n", "0,West-Varkenoordseweg,Rotterdam,3074HN\n", "3,Ruiterakker,Assen,9407BG\n", "8,Brusselseweg,Maastricht,6217GX\n", "10,Donkerslootstraat,Rotterdam,3074WL\n", "12,Vorselenburgstraat,Alphen aan den Rijn,2405XJ\n", "```\n", "\n", "While the above data might be difficult for a human to read at first glance, machines can parse these files quickly. To read the file, we'll use the `read_csv` function.\n", "\n", "Read more about the methods available in the [documentation](https://pandas.pydata.org/docs/)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title city postalCode latitude \\\n", "id \n", "0 West-Varkenoordseweg Rotterdam 3074HN 51.896601 \n", "3 Ruiterakker Assen 9407BG 53.013494 \n", "8 Brusselseweg Maastricht 6217GX 50.860841 \n", "10 Donkerslootstraat Rotterdam 3074WL 51.893195 \n", "12 Vorselenburgstraat Alphen aan den Rijn 2405XJ 52.122335 \n", "\n", " longitude areaSqm firstSeenAt \\\n", "id \n", "0 4.514993 14 2019-07-14 11:25:46.511000+00:00 \n", "3 6.561012 16 2019-07-14 11:25:46.988000+00:00 \n", "8 5.671673 16 2019-07-14 11:25:47.814000+00:00 \n", "10 4.516478 25 2019-07-14 11:25:48.140000+00:00 \n", "12 4.661434 10 2019-07-14 11:25:48.465000+00:00 \n", "\n", " lastSeenAt isRoomActive \\\n", "id \n", "0 2019-07-26 22:18:23.142000+00:00 True \n", "3 2019-07-18 22:00:31.174000+00:00 False \n", "8 2019-08-10 00:14:27.130000+00:00 True \n", "10 2019-07-16 06:05:32.183000+00:00 False \n", "12 2019-08-01 00:02:40.516000+00:00 True \n", "\n", " rawAvailability ... matchAge matchGender \\\n", "id ... \n", "0 26-06-'19 - Indefinite period ... 16 years - 99 years Not important \n", "3 16-06-'19 - Indefinite period ... 18 years - 32 years Female \n", "8 15-07-'19 - Indefinite period ... 16 years - 40 years Male \n", "10 01-08-'19 - Indefinite period ... 21 years - 99 years Not important \n", "12 08-07-'19 - Indefinite period ... 22 years - 40 years Not important \n", "\n", " matchCapacity matchLanguages \\\n", "id \n", "0 1 person Not important \n", "3 1 person Not important \n", "8 4 persons Dutch English \n", "10 4 persons Dutch English Spanish French Italian German Po... \n", "12 1 person Dutch English \n", "\n", " matchStatus \\\n", "id \n", "0 Not important \n", "3 Student, Working student \n", "8 Student \n", "10 Student, Working student, Working, Looking for... \n", "12 Student, Working student, Working \n", "\n", " coverImageUrl additionalCosts rent \\\n", "id \n", "0 https://resources.kamernet.nl/image/913b4b03-5... 
50.0 500 \n", "3 https://resources.kamernet.nl/image/84e95365-6... NaN 290 \n", "8 https://resources.kamernet.nl/image/6e625591-d... NaN 425 \n", "10 https://resources.kamernet.nl/image/ea3aea77-0... NaN 600 \n", "12 https://resources.kamernet.nl/image/d0780298-b... NaN 425 \n", "\n", " deposit registrationCost \n", "id \n", "0 500.0 0.0 \n", "3 290.0 NaN \n", "8 425.0 25.0 \n", "10 1200.0 0.0 \n", "12 425.0 NaN \n", "\n", "[5 rows x 36 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the CSV file from the given path and use the id column as the index\n", "train = pd.read_csv('datasets/train.csv', index_col='id')\n", "\n", "# show the first 5 rows\n", "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data analysis and machine learning\n", "\n", "Now that the data is loaded, we can perform data analysis and train machine learning models. This is a job for you!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making a submission file\n", "\n", "Once the machine learning model is trained, we can feed it the test data, which does not contain the target values (here, `rent`). To evaluate our model, we need to create a submission file and upload it to Kaggle. The submission file is a CSV file that consists of the `id`s of the data points and their predicted target values. An example of the submission file for the competition's dataset is given below." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " rent\n", "id \n", "1 0\n", "2 0\n", "4 0\n", "5 0\n", "6 0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# For demonstration purposes, we will load the submission file from an existing csv\n", "submission = pd.read_csv('datasets/sample_submission.csv', index_col='id')\n", "\n", "submission.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " rent\n", "id \n", "1 1391\n", "2 2419\n", "4 5381\n", "5 5972\n", "6 5224" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Normally, the submission dataframe would be filled with the model's predictions on the test data.\n", "# For demonstration purposes, we fill it with random integers drawn from [min rent, max rent).\n", "# Assigning the whole column at once avoids chained indexing (`submission['rent'][i] = ...`),\n", "# which pandas may apply to a temporary copy instead of the dataframe itself.\n", "submission['rent'] = np.random.randint(\n", "    train['rent'].min(),\n", "    train['rent'].max(),\n", "    size=len(submission)\n", ")\n", "\n", "submission.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the submission dataframe is filled in, we can write it to a CSV file using the `to_csv()` method, which takes the desired file name as an argument." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "submission.to_csv('submission.csv', index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the submission file in CSV format, we can upload it to Kaggle for evaluation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }