Predicting House Prices - Real Estate in Colombia¶

By Hayden Hoopes

My wife is from Bogotá, Colombia and moved to the United States when we were married back in 2020. During the past three years of our marriage, we have often discussed the idea of buying a house somewhere in Colombia that we can use as an investment property or that my wife's parents could move into when they retire.

Because of this, I am interested in being able to predict the price of houses in Colombia given some factors in order to determine which houses are a "good deal". In other words, if the price that I predict ends up being significantly more than the actual price of the home, I could in theory purchase the home and automatically have made money, being able to resell it for an instant profit.

I want to do a regression analysis to determine 1) how accuractely I can predict the price of a home, given a set of variables, 2) which of those variables are the most important in determining the price of a home, and 3) how well the predictions could help me make a good investment.

The Data¶

The data for this analysis comes from a web scraper that I built to collect data from the website Finca Raíz, a website used often in Colombia to list houses for sale. The website was scraped in two separate sessions with varying results, producing a data set of around 700 houses from various parts of the country.

In [362]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
pd.options.display.max_columns = None

df = pd.read_pickle('colombia_real_estate.pkl').reset_index().drop('index', axis=1)
df.head()
Out[362]:
Tipo Propiedad Habitaciones Baños Área construída Área privada Estrato Antigüedad Administración Precio m² Dirección Barrio Ciudad Precio (COP) Descripción general No. Fotos Fecha Sacada Enlace Balcón Baño Auxiliar Cocina Integral
0 Usado Apartamento 2 1 41.000000 0.000000 3 Menor a 1 año 100000 4878048 Cra 67 cll 53 # 09 sur Bogotá Bogotá 200000000 Vendo o permuto !!!!! Hermoso apt en conjunto reserva de madelena, barrio madelena acabados de ... 19 2023-01-27 11:44:51.166921 https://www.fincaraiz.com.co/inmueble/apartamento-en-venta/bogota/bogota-dc/10006740 0 0 0
1 Usado Apartamento 2 1 32.400002 32.400002 3 9 a 15 años 58000 5092592 Transversal 73 11B 33 Int 7 Apt 101 Bogotá Bogotá 165000000 Apartamento tipo apartaestudio, con patio, un baño, dos habitaciones, cocina semiintegral. 7 2023-01-27 11:44:52.096644 https://www.fincaraiz.com.co/inmueble/apartamento-en-venta/bogota/bogota-dc/10001169 0 0 0
2 Usado Apartamento 3 1 52.000000 0.000000 3 9 a 15 años 110000 3557692 Diagonal 4 b # 39-90 Bogotá Bogotá 185000000 Apartamento iluminado, 3 habitaciones, 1 baño, cocina integral, ascensor, 5to piso, parqueadero ... 7 2023-01-27 11:44:53.900864 https://www.fincaraiz.com.co/inmueble/apartamento-en-venta/bogota/bogota-dc/10003762 0 0 0
3 Usado Apartamento 3 2 70.000000 70.000000 3 16 a 30 años 280000 4142857 Kra. 103 d # 83-82 Bogotá Bogotá 290000000 apartamento de tres alcobas sala comedor cocina zona de ropas alcoba principal con baño y otro b... 11 2023-01-27 11:44:54.797900 https://www.fincaraiz.com.co/inmueble/apartamento-en-venta/bogota/bogota-dc/10003565 0 1 1
4 Usado Apartamento 2 1 43.000000 0.000000 2 1 a 8 años 66000 2790697 CLL 104 sur 14-80 Bogotá Bogotá 120000000 Apartamento con excelente vista a parque natural, ubicado en el piso séptimo, esquinero y consta... 9 2023-01-27 11:44:55.683734 https://www.fincaraiz.com.co/inmueble/apartamento-en-venta/bogota/bogota-dc/10005549 0 0 0
In [363]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691 entries, 0 to 690
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Tipo                 691 non-null    object 
 1   Propiedad            691 non-null    object 
 2   Habitaciones         691 non-null    int16  
 3   Baños                691 non-null    int8   
 4   Área construída      691 non-null    float32
 5   Área privada         691 non-null    float32
 6   Estrato              691 non-null    int8   
 7   Antigüedad           691 non-null    object 
 8   Administración       691 non-null    int64  
 9   Precio m²            691 non-null    int64  
 10  Dirección            691 non-null    object 
 11  Barrio               691 non-null    object 
 12  Ciudad               691 non-null    object 
 13  Precio (COP)         691 non-null    int64  
 14  Descripción general  691 non-null    object 
 15  No. Fotos            691 non-null    int64  
 16  Fecha Sacada         691 non-null    object 
 17  Enlace               691 non-null    object 
 18  Balcón               691 non-null    int32  
 19  Baño Auxiliar        691 non-null    int32  
 20  Cocina Integral      691 non-null    int32  
dtypes: float32(2), int16(1), int32(3), int64(4), int8(2), object(9)
memory usage: 86.5+ KB
In [364]:
df.isna().sum()
Out[364]:
Tipo                   0
Propiedad              0
Habitaciones           0
Baños                  0
Área construída        0
Área privada           0
Estrato                0
Antigüedad             0
Administración         0
Precio m²              0
Dirección              0
Barrio                 0
Ciudad                 0
Precio (COP)           0
Descripción general    0
No. Fotos              0
Fecha Sacada           0
Enlace                 0
Balcón                 0
Baño Auxiliar          0
Cocina Integral        0
dtype: int64
In [365]:
df.shape
Out[365]:
(691, 21)

Data Cleaning¶

The first step in creating this analysis is to clean the data.

The things that I notice that I need to do before I can process this data in a regression analysis are the following:

  • Barrio - This is the neighborhood of the house. This varies highly and different cities can have the same neighborhood name, so this isn't a good way to measure house price.
  • Descripción general - Perhaps this could be used in a future analysis with text features, but this general description of the property won't be very useful for this regression analysis.
  • Enlace - This is the link to the house. I'll keep it in the original data set but will drop it for the regression.
  • Fecha Sacada - This is the date that the data was collected. It's purely for informational purposes but won't be useful in regression.
  • Dirección - This is the street address ofthe property. Won't be useful in this analysis.
  • Estrato - In Colombia, areas of the city are organized into classes called estratos. The lower the number, the poorer the sector. The Estrato column should be converted to a categorical column, since it may not share a linear relationship with the price.

The other categorical columns will be converted to dummy columns.

In [366]:
# Drop columns that aren't needed (save Enlace for late)
enlaces = df['Enlace']
df = df.drop(['Barrio', 'Descripción general', 'Enlace', 'Fecha Sacada', 'Dirección'], axis=1)

# Make Estrato a categorical variable
df['Estrato'] = df['Estrato'].apply(lambda x : 'Estrato ' + str(x))
In [367]:
# Change data type of columns
df['Habitaciones'] = df['Habitaciones'].astype('int16')
df['Baños'] = df['Baños'].astype('int8')
df['Área construída'] = df['Área construída'].astype('float32')
df['Área privada'] = df['Área privada'].astype('float32')
df['Administración'] = df['Administración'].astype('int64')
df['Precio m²'] = df['Precio m²'].astype('int64')
df['Precio (COP)'] = df['Precio (COP)'].astype('int64')
df['No. Fotos'] = df['No. Fotos'].astype('int64')
In [368]:
# Get dummy values for the categorical columns
clean_df = df.copy()

for col in df.select_dtypes(include='object').columns:
    clean_df = pd.concat([clean_df, pd.get_dummies(df[col])], axis=1)
    clean_df = clean_df.drop(col, axis=1)
In [369]:
clean_df
Out[369]:
Habitaciones Baños Área construída Área privada Administración Precio m² Precio (COP) No. Fotos Balcón Baño Auxiliar Cocina Integral Duplex Inmobiliaria Loft Negociable Oportunidad PentHouse Usado Apartamento Casa Estrato 0 Estrato 1 Estrato 2 Estrato 3 Estrato 4 Estrato 5 Estrato 6 1 a 8 años 16 a 30 años 9 a 15 años Menor a 1 año Mnan Más de 30 años Armenia Bogotá Cali Fusagasugá Medellín Santa Marta Santa marta Villavicencio
0 2 1 41.000000 0.000000 100000 4878048 200000000 19 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
1 2 1 32.400002 32.400002 58000 5092592 165000000 7 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
2 3 1 52.000000 0.000000 110000 3557692 185000000 7 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
3 3 2 70.000000 70.000000 280000 4142857 290000000 11 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0
4 2 1 43.000000 0.000000 66000 2790697 120000000 9 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
686 3 3 89.000000 0.000000 0 4213483 375000000 24 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
687 3 1 9.260000 9.260000 0 19438 180000000 8 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
688 3 1 92.000000 92.000000 0 1847826 170000000 8 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
689 4 3 94.000000 94.000000 385000 3925531 369000000 30 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
690 3 2 54.000000 54.000000 115000 2870370 155000000 11 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0

691 rows × 41 columns

Regression Analysis¶

Now that the data is ready, I can start to perform the regression analysis. In this case, the response variable is Precio (COP) and the predictor variables are everything else.

Get train and validation set¶

In this step, I will create a training data set that I can use to tune the model and then a training set that I can use to test the model.

In [370]:
from sklearn.model_selection import train_test_split

X = clean_df.drop(['Precio (COP)', 'Precio m²'], axis=1)
y = clean_df['Precio (COP)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=1)

Initial Regression¶

Here, I run a blind regression analysis to get an idea of what kind of data I am working with.

The results of this initial regression are unsatisfying, to say the least. The R-squared value is only 0.044, meaning that the predictor variables weren't able to explain the response variable very well. This could be due to random chance, or it could be due to bad data. It doesn't really make sense that none of the predictor variables impact the price significantly, so I'll check my

In [371]:
results = sm.OLS(y_train, sm.add_constant(X_train)).fit()
results.summary()
Out[371]:
OLS Regression Results
Dep. Variable: Precio (COP) R-squared: 0.043
Model: OLS Adj. R-squared: -0.023
Method: Least Squares F-statistic: 0.6528
Date: Tue, 25 Apr 2023 Prob (F-statistic): 0.933
Time: 02:46:13 Log-Likelihood: -13117.
No. Observations: 518 AIC: 2.630e+04
Df Residuals: 484 BIC: 2.645e+04
Df Model: 33
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -4.185e+09 3.46e+09 -1.210 0.227 -1.1e+10 2.61e+09
Habitaciones -4.779e+08 1.04e+09 -0.461 0.645 -2.51e+09 1.56e+09
Baños 3.017e+09 1.5e+09 2.012 0.045 7.1e+07 5.96e+09
Área construída -4.61e+06 1.59e+07 -0.291 0.771 -3.58e+07 2.65e+07
Área privada -4.311e+06 1.14e+07 -0.377 0.707 -2.68e+07 1.82e+07
Administración -6413.7365 4458.025 -1.439 0.151 -1.52e+04 2345.737
No. Fotos 6.829e+07 1.86e+08 0.366 0.714 -2.98e+08 4.35e+08
Balcón 4.364e+09 2.61e+09 1.670 0.096 -7.71e+08 9.5e+09
Baño Auxiliar -3.421e+09 2.59e+09 -1.322 0.187 -8.51e+09 1.67e+09
Cocina Integral 3.212e+09 2.43e+09 1.324 0.186 -1.55e+09 7.98e+09
Duplex -6.247e+09 8.32e+09 -0.751 0.453 -2.26e+10 1.01e+10
Inmobiliaria 1.096e+09 5.29e+09 0.207 0.836 -9.3e+09 1.15e+10
Loft -9.944e+08 8.74e+09 -0.114 0.909 -1.82e+10 1.62e+10
Negociable -3.508e-06 1.31e-05 -0.267 0.789 -2.93e-05 2.23e-05
Oportunidad 3.865e+09 1.31e+10 0.294 0.769 -2.2e+10 2.97e+10
PentHouse -5.259e+09 2.17e+10 -0.243 0.808 -4.78e+10 3.73e+10
Usado 3.354e+09 5.46e+09 0.614 0.539 -7.38e+09 1.41e+10
Apartamento -6.941e+08 2.07e+09 -0.336 0.737 -4.75e+09 3.36e+09
Casa -3.491e+09 2.47e+09 -1.412 0.158 -8.35e+09 1.37e+09
Estrato 0 -4.371e+07 9.29e+09 -0.005 0.996 -1.83e+10 1.82e+10
Estrato 1 2.233e+09 6.03e+09 0.371 0.711 -9.61e+09 1.41e+10
Estrato 2 -1.638e+09 3.77e+09 -0.434 0.665 -9.05e+09 5.78e+09
Estrato 3 -1.203e+09 2.57e+09 -0.468 0.640 -6.26e+09 3.85e+09
Estrato 4 5.561e+08 2.88e+09 0.193 0.847 -5.1e+09 6.21e+09
Estrato 5 -2.052e+09 3.38e+09 -0.608 0.544 -8.68e+09 4.58e+09
Estrato 6 -2.039e+09 4.57e+09 -0.446 0.656 -1.1e+10 6.94e+09
1 a 8 años -1.981e+09 2.42e+09 -0.818 0.414 -6.74e+09 2.78e+09
16 a 30 años -6.876e+08 2.5e+09 -0.275 0.783 -5.6e+09 4.22e+09
9 a 15 años 3.352e+09 2.63e+09 1.273 0.204 -1.82e+09 8.53e+09
Menor a 1 año -2.466e+09 5.51e+09 -0.447 0.655 -1.33e+10 8.36e+09
Mnan -8.519e+08 2.7e+09 -0.316 0.752 -6.15e+09 4.45e+09
Más de 30 años -1.55e+09 4.03e+09 -0.384 0.701 -9.48e+09 6.38e+09
Armenia 6.112e+08 3.89e+09 0.157 0.875 -7.02e+09 8.25e+09
Bogotá 4.325e+09 3.99e+09 1.083 0.279 -3.52e+09 1.22e+10
Cali 8.834e+08 3.51e+09 0.252 0.801 -6.01e+09 7.78e+09
Fusagasugá -1.626e+09 6.71e+09 -0.242 0.809 -1.48e+10 1.16e+10
Medellín 7.63e+08 3.58e+09 0.213 0.831 -6.27e+09 7.79e+09
Santa Marta -8.77e+08 3.41e+09 -0.257 0.797 -7.58e+09 5.82e+09
Santa marta -5.372e+09 1.33e+10 -0.404 0.686 -3.15e+10 2.07e+10
Villavicencio -2.893e+09 3.46e+09 -0.837 0.403 -9.69e+09 3.9e+09
Omnibus: 1247.824 Durbin-Watson: 1.980
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4804563.386
Skew: 21.248 Prob(JB): 0.00
Kurtosis: 472.893 Cond. No. 3.36e+16


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.81e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Assumptions¶

Here I will check the assumptions of linear regression.

Linear relationship¶

In this first plot, it is apparent that there is at least one outlier greatly skewing the data set for Área construída. This point is clearly an outlier and I will remove it from my analysis for the sake of the regression.

In [372]:
plt.scatter(clean_df['Área construída'], clean_df['Precio (COP)'])
plt.xlabel('Área construída')
plt.ylabel('Precio')
Out[372]:
Text(0, 0.5, 'Precio')
In [373]:
# Remove the outlier
clean_df = clean_df[clean_df['Precio (COP)'] < 2e10]

# Re-create the X and y variables
X = clean_df.drop(['Precio (COP)', 'Precio m²'], axis=1)
y = clean_df['Precio (COP)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=1)

This second plot shows a bettter linear relationship, showing that as the area of the property increases, so does the price, in general.

In [374]:
plt.scatter(clean_df['Área construída'], clean_df['Precio (COP)'])
plt.xlabel('Área construída')
plt.ylabel('Precio')
Out[374]:
Text(0, 0.5, 'Precio')

The Área privada column also seems to have a similar linear relationship, but seems to be more influenced by some outliers on the left side. I probably can't simply remove those points, but I may be able to drop the entire column since "private area" and "constructed area" are likely very correlated.

In [375]:
plt.scatter(clean_df['Área privada'], clean_df['Precio (COP)'])
plt.xlabel('Área privada')
plt.ylabel('Precio')
Out[375]:
Text(0, 0.5, 'Precio')

No. Fotos (number of pictures) seems to have a very light positive correlation to price.

In [376]:
plt.scatter(clean_df['No. Fotos'], clean_df['Precio (COP)'])
plt.xlabel('No. Fotos')
plt.ylabel('Precio')
Out[376]:
Text(0, 0.5, 'Precio')

The Baños (bathrooms) column seems to be indicative of an increasing house price.

In [377]:
plt.scatter(clean_df['Baños'], clean_df['Precio (COP)'])
plt.xlabel('Baños')
plt.ylabel('Precio')
Out[377]:
Text(0, 0.5, 'Precio')

The Administración (administration, a secutiry cost) is also linearly related to the response variable, although there are some lower outliers that will likely skew the model.

In [378]:
plt.scatter(clean_df['Administración'], clean_df['Precio (COP)'])
plt.xlabel('Administración')
plt.ylabel('Precio')
Out[378]:
Text(0, 0.5, 'Precio')
In [379]:
results = sm.OLS(y_train, sm.add_constant(X_train)).fit()
results.summary()
Out[379]:
OLS Regression Results
Dep. Variable: Precio (COP) R-squared: 0.701
Model: OLS Adj. R-squared: 0.680
Method: Least Squares F-statistic: 33.31
Date: Tue, 25 Apr 2023 Prob (F-statistic): 1.16e-104
Time: 02:46:13 Log-Likelihood: -10810.
No. Observations: 517 AIC: 2.169e+04
Df Residuals: 482 BIC: 2.184e+04
Df Model: 34
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 8.32e+07 4.22e+07 1.973 0.049 3.47e+05 1.66e+08
Habitaciones -3.435e+07 1.24e+07 -2.764 0.006 -5.88e+07 -9.93e+06
Baños 5.943e+07 1.76e+07 3.377 0.001 2.48e+07 9.4e+07
Área construída 2.447e+06 1.8e+05 13.564 0.000 2.09e+06 2.8e+06
Área privada 5.425e+05 1.31e+05 4.130 0.000 2.84e+05 8.01e+05
Administración 106.0197 56.731 1.869 0.062 -5.450 217.490
No. Fotos -2.809e+06 2.29e+06 -1.224 0.221 -7.32e+06 1.7e+06
Balcón -1.597e+07 3.14e+07 -0.509 0.611 -7.76e+07 4.57e+07
Baño Auxiliar -8.054e+07 3.1e+07 -2.594 0.010 -1.42e+08 -1.95e+07
Cocina Integral -1.3e+07 2.93e+07 -0.444 0.657 -7.05e+07 4.45e+07
Duplex -5.879e+06 1.06e+08 -0.055 0.956 -2.15e+08 2.03e+08
Inmobiliaria -4.82e+07 6.15e+07 -0.784 0.434 -1.69e+08 7.26e+07
Loft -1.052e+07 1.01e+08 -0.104 0.917 -2.08e+08 1.87e+08
Negociable 2.458e+07 2.77e+08 0.089 0.929 -5.2e+08 5.69e+08
Oportunidad -2.278e+07 1.63e+08 -0.140 0.889 -3.43e+08 2.98e+08
PentHouse 2.151e+08 1.65e+08 1.303 0.193 -1.09e+08 5.39e+08
Usado -6.906e+07 6.81e+07 -1.014 0.311 -2.03e+08 6.47e+07
Apartamento 4.095e+07 2.39e+07 1.714 0.087 -5.99e+06 8.79e+07
Casa 4.225e+07 3.16e+07 1.337 0.182 -1.98e+07 1.04e+08
Estrato 0 -2.206e+07 1.36e+08 -0.162 0.871 -2.9e+08 2.46e+08
Estrato 1 -9.109e+07 7.63e+07 -1.194 0.233 -2.41e+08 5.88e+07
Estrato 2 -1.683e+08 4.81e+07 -3.498 0.001 -2.63e+08 -7.37e+07
Estrato 3 -8.054e+07 3.43e+07 -2.349 0.019 -1.48e+08 -1.32e+07
Estrato 4 1.287e+07 3.69e+07 0.349 0.727 -5.96e+07 8.54e+07
Estrato 5 7.565e+07 4.09e+07 1.847 0.065 -4.81e+06 1.56e+08
Estrato 6 3.567e+08 5.55e+07 6.425 0.000 2.48e+08 4.66e+08
1 a 8 años -3.568e+06 2.96e+07 -0.121 0.904 -6.16e+07 5.45e+07
16 a 30 años -1.441e+07 3.09e+07 -0.466 0.641 -7.52e+07 4.63e+07
9 a 15 años 1.23e+07 3.09e+07 0.399 0.690 -4.83e+07 7.29e+07
Menor a 1 año 2.755e+08 7.11e+07 3.873 0.000 1.36e+08 4.15e+08
Mnan -7.795e+06 3.5e+07 -0.223 0.824 -7.65e+07 6.09e+07
Más de 30 años -1.788e+08 4.77e+07 -3.745 0.000 -2.73e+08 -8.5e+07
Armenia 5.41e+07 4.67e+07 1.158 0.247 -3.77e+07 1.46e+08
Bogotá 1.19e+08 4.89e+07 2.431 0.015 2.28e+07 2.15e+08
Cali -9.185e+06 4.27e+07 -0.215 0.830 -9.31e+07 7.47e+07
Fusagasugá -4.401e+07 8.16e+07 -0.540 0.590 -2.04e+08 1.16e+08
Medellín 1.685e+08 4.31e+07 3.911 0.000 8.39e+07 2.53e+08
Santa Marta -9.156e+06 4.21e+07 -0.217 0.828 -9.19e+07 7.36e+07
Santa marta -1.641e+08 1.62e+08 -1.016 0.310 -4.82e+08 1.53e+08
Villavicencio -3.195e+07 4.23e+07 -0.755 0.451 -1.15e+08 5.12e+07
Omnibus: 387.531 Durbin-Watson: 2.008
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11242.681
Skew: 2.916 Prob(JB): 0.00
Kurtosis: 25.088 Cond. No. 4.01e+16


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.81e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [380]:
fig = sm.qqplot(results.resid, line="s")
plt.show()

Heteroskedastic Data¶

This data set looks like it exhibits slight heteroskedasticity. Not all of the residuals are equally varied around the predicted data points for the variable Área construída, which is the variable that I believe will be the best predictor of house price. For smaller houses, at least, the price seems to be easier to predict than houses that are larger in area.

In [381]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=1)

fig, ax = plt.subplots(figsize=(15,10))
sm.graphics.plot_regress_exog(results,'Área construída', fig=fig)
plt.show()
eval_env: 1

Only using the Área construída variable in the regression analysis led to an R-squared value of 0.51, meaning that this single variable can explain 51% of the variance in the home's price.

In [382]:
results = sm.OLS(y_train, sm.add_constant(X_train['Área construída'])).fit()
results.summary()
Out[382]:
OLS Regression Results
Dep. Variable: Precio (COP) R-squared: 0.512
Model: OLS Adj. R-squared: 0.511
Method: Least Squares F-statistic: 541.2
Date: Tue, 25 Apr 2023 Prob (F-statistic): 2.34e-82
Time: 02:46:14 Log-Likelihood: -10937.
No. Observations: 517 AIC: 2.188e+04
Df Residuals: 515 BIC: 2.189e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 3.989e+07 2.55e+07 1.565 0.118 -1.02e+07 9e+07
Área construída 3.186e+06 1.37e+05 23.263 0.000 2.92e+06 3.46e+06
Omnibus: 394.014 Durbin-Watson: 1.975
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10975.611
Skew: 3.010 Prob(JB): 0.00
Kurtosis: 24.755 Cond. No. 289.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Outliers¶

In the plot below, I can see that there are several outliers in the Área construída column. Most of these values are extremely high, meaning that there are a few homes with varying house sizes that are outliers in some other dimension besides area.

In [383]:
results = sm.OLS(y_train, sm.add_constant(X_train)).fit()

stud_res = results.outlier_test()

sns.scatterplot(y=stud_res['student_resid'], x=X_train['Área construída'], hue=(stud_res['student_resid']>3)|(stud_res['student_resid']<-3))
plt.axhline(y=0, color='black', linestyle='--')
plt.show()
C:\Users\HaydenH\AppData\Local\Programs\Python\Python310\lib\site-packages\numpy\linalg\linalg.py:2146: RuntimeWarning: overflow encountered in det
  r = _umath_linalg.det(a, signature=signature)
C:\Users\HaydenH\AppData\Local\Programs\Python\Python310\lib\site-packages\statsmodels\stats\outliers_influence.py:696: RuntimeWarning: invalid value encountered in sqrt
  return self.resid / sigma / np.sqrt(1 - hii)

Leverage Points¶

In terms of leverage points, if I were to use the three variables Área construída, Baños, Habitaciones, there is one distinct outlier which is a house in Medellin that has 5 bathrooms and 18 bedrooms. Because the regression model sees additional rooms as subtracting from the price (which doesn't make sense to me), the huge number of rooms seems to be giving this house a lot of weight in the model.

In [384]:
results = sm.OLS(y_train, sm.add_constant(X_train[['Área construída', 'Baños', 'Habitaciones']])).fit()
results.summary()
Out[384]:
OLS Regression Results
Dep. Variable: Precio (COP) R-squared: 0.564
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 221.6
Date: Tue, 25 Apr 2023 Prob (F-statistic): 3.62e-92
Time: 02:46:18 Log-Likelihood: -10908.
No. Observations: 517 AIC: 2.182e+04
Df Residuals: 513 BIC: 2.184e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 7.065e+07 4.21e+07 1.678 0.094 -1.21e+07 1.53e+08
Área construída 2.811e+06 1.88e+05 14.984 0.000 2.44e+06 3.18e+06
Baños 1.089e+08 1.7e+07 6.408 0.000 7.55e+07 1.42e+08
Habitaciones -8.061e+07 1.27e+07 -6.326 0.000 -1.06e+08 -5.56e+07
Omnibus: 411.682 Durbin-Watson: 1.940
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12413.628
Skew: 3.190 Prob(JB): 0.00
Kurtosis: 26.142 Cond. No. 517.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [385]:
sm.graphics.influence_plot(results)
Out[385]:
In [386]:
df.loc[250]
Out[386]:
Tipo               Inmobiliaria
Propiedad                  Casa
Habitaciones                 18
Baños                         5
Área construída           428.0
Área privada                0.0
Estrato               Estrato 4
Antigüedad         16 a 30 años
Administración                0
Precio m²               3037383
Ciudad                 Medellín
Precio (COP)         1300000000
No. Fotos                     8
Balcón                        1
Baño Auxiliar                 1
Cocina Integral               0
Name: 250, dtype: object

Collinearity¶

From the variance inflation factors shown below, it appears that the number of bathrooms and bedrooms that a property has is highly covariant with other variables in the house. This makes sense logically: a house with a greater area is more likely to have more bathrooms, and more bedrooms. This indicates to me that I may not actually need to use Baños or Habitaciones in the analysis at all, since the information that they carry might be contained in some other, more predictive/useful variable.

In [387]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = clean_df.select_dtypes(['int64', 'float64', 'int32', 'float32', 'int8', 'int16']).drop(columns=['Precio (COP)'])

vif_df = pd.DataFrame()
vif_df['feature'] = X.columns
vif_df['VIF'] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

vif_df.sort_values(by='VIF', ascending=False)
Out[387]:
feature VIF
1 Baños 13.908978
0 Habitaciones 7.588467
6 No. Fotos 6.541586
2 Área construída 5.739424
5 Precio m² 4.110849
3 Área privada 2.153872
8 Baño Auxiliar 2.090382
7 Balcón 2.027569
9 Cocina Integral 1.855900
4 Administración 1.729555

Preprocessing¶

In this step, I plan to convert the Precio (COP) column to the Precio (USD) column using today's exchange rate. This will make the analysis a little bit more interpretable to me. Then, I'll run another regression to see if I can interpret some of the variables. Finally, I plan to scale both the X and y variables to be able to more easily interpret which coefficients are the most important in determining the price of a house.

In [388]:
clean_df['Precio (USD)'] = round(clean_df['Precio (COP)'] / 4500, 2)
C:\Users\HaydenH\AppData\Local\Temp\ipykernel_31004\883967646.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  clean_df['Precio (USD)'] = round(clean_df['Precio (COP)'] / 4500, 2)
In [389]:
clean_df
Out[389]:
Habitaciones Baños Área construída Área privada Administración Precio m² Precio (COP) No. Fotos Balcón Baño Auxiliar Cocina Integral Duplex Inmobiliaria Loft Negociable Oportunidad PentHouse Usado Apartamento Casa Estrato 0 Estrato 1 Estrato 2 Estrato 3 Estrato 4 Estrato 5 Estrato 6 1 a 8 años 16 a 30 años 9 a 15 años Menor a 1 año Mnan Más de 30 años Armenia Bogotá Cali Fusagasugá Medellín Santa Marta Santa marta Villavicencio Precio (USD)
0 2 1 41.000000 0.000000 100000 4878048 200000000 19 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 44444.44
1 2 1 32.400002 32.400002 58000 5092592 165000000 7 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 36666.67
2 3 1 52.000000 0.000000 110000 3557692 185000000 7 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 41111.11
3 3 2 70.000000 70.000000 280000 4142857 290000000 11 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 64444.44
4 2 1 43.000000 0.000000 66000 2790697 120000000 9 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 26666.67
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
686 3 3 89.000000 0.000000 0 4213483 375000000 24 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 83333.33
687 3 1 9.260000 9.260000 0 19438 180000000 8 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 40000.00
688 3 1 92.000000 92.000000 0 1847826 170000000 8 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 37777.78
689 4 3 94.000000 94.000000 385000 3925531 369000000 30 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 82000.00
690 3 2 54.000000 54.000000 115000 2870370 155000000 11 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 34444.44

690 rows × 42 columns

The regression below shows that each additional meter-squared of space in a home in Colombia increases the price of the home by a little bit more than $700. When only using this variable to predict the home's price, the R-squared value is 51.2% and the confidence interval for the variable is clearly non-zero.

In [390]:
X = clean_df.drop(['Precio (COP)', 'Precio m²', 'Precio (USD)'], axis=1)
y = clean_df['Precio (USD)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=1)

results = sm.OLS(y_train, sm.add_constant(X_train['Área construída'])).fit()
results.summary()
Out[390]:
OLS Regression Results
Dep. Variable: Precio (USD) R-squared: 0.512
Model: OLS Adj. R-squared: 0.511
Method: Least Squares F-statistic: 541.2
Date: Tue, 25 Apr 2023 Prob (F-statistic): 2.34e-82
Time: 02:46:19 Log-Likelihood: -6588.4
No. Observations: 517 AIC: 1.318e+04
Df Residuals: 515 BIC: 1.319e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 8865.4148 5663.087 1.565 0.118 -2260.178 2e+04
Área construída 708.0216 30.435 23.263 0.000 648.229 767.814
Omnibus: 394.014 Durbin-Watson: 1.975
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10975.611
Skew: 3.010 Prob(JB): 0.00
Kurtosis: 24.755 Cond. No. 289.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Next, I'll try to transform all of the variables to see which ones are the most impactful on the price of the house. These need to be transformed first-- otherwise, the variables with greater magnitude could dominate variables with smaller magnitude and seem to be more important.

In [391]:
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
y_scaler = StandardScaler()

X_train_std = pd.DataFrame(X_scaler.fit_transform(X_train), columns=X_train.columns)
y_train_std = y_scaler.fit_transform(pd.DataFrame(y_train))

results = sm.OLS(y_train_std, sm.add_constant(X_train_std)).fit()
results.summary()
Out[391]:
OLS Regression Results
Dep. Variable: y R-squared: 0.701
Model: OLS Adj. R-squared: 0.680
Method: Least Squares F-statistic: 33.31
Date: Tue, 25 Apr 2023 Prob (F-statistic): 1.16e-104
Time: 02:46:19 Log-Likelihood: -421.13
No. Observations: 517 AIC: 912.3
Df Residuals: 482 BIC: 1061.
Df Model: 34
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 5.985e-17 0.025 2.4e-15 1.000 -0.049 0.049
Habitaciones -0.0963 0.035 -2.764 0.006 -0.165 -0.028
Baños 0.1529 0.045 3.377 0.001 0.064 0.242
Área construída 0.5497 0.041 13.564 0.000 0.470 0.629
Área privada 0.1224 0.030 4.130 0.000 0.064 0.181
Administración 0.0578 0.031 1.869 0.062 -0.003 0.119
No. Fotos -0.0359 0.029 -1.224 0.221 -0.093 0.022
Balcón -0.0149 0.029 -0.509 0.611 -0.072 0.043
Baño Auxiliar -0.0750 0.029 -2.594 0.010 -0.132 -0.018
Cocina Integral -0.0121 0.027 -0.444 0.657 -0.065 0.041
Duplex 0.0106 0.025 0.427 0.670 -0.038 0.059
Inmobiliaria 0.0009 0.021 0.042 0.966 -0.040 0.041
Loft 0.0105 0.025 0.419 0.676 -0.039 0.060
Negociable 0.0061 0.026 0.233 0.816 -0.045 0.057
Oportunidad 0.0038 0.025 0.150 0.881 -0.046 0.053
PentHouse 0.0376 0.026 1.461 0.145 -0.013 0.088
Usado -0.0158 0.023 -0.696 0.486 -0.060 0.029
Apartamento -0.0006 0.017 -0.035 0.972 -0.034 0.033
Casa 0.0006 0.017 0.035 0.972 -0.033 0.034
Estrato 0 -0.0063 0.025 -0.249 0.804 -0.056 0.044
Estrato 1 -0.0349 0.027 -1.292 0.197 -0.088 0.018
Estrato 2 -0.1050 0.025 -4.179 0.000 -0.154 -0.056
Estrato 3 -0.0851 0.021 -4.150 0.000 -0.125 -0.045
Estrato 4 -0.0029 0.020 -0.146 0.884 -0.043 0.037
Estrato 5 0.0434 0.022 1.939 0.053 -0.001 0.087
Estrato 6 0.1996 0.028 7.031 0.000 0.144 0.255
1 a 8 años 0.0050 0.020 0.247 0.805 -0.034 0.044
16 a 30 años -0.0038 0.020 -0.188 0.851 -0.044 0.036
9 a 15 años 0.0168 0.020 0.824 0.410 -0.023 0.057
Menor a 1 año 0.0866 0.025 3.527 0.000 0.038 0.135
Mnan 0.0012 0.022 0.053 0.958 -0.043 0.045
Más de 30 años -0.0838 0.025 -3.403 0.001 -0.132 -0.035
Armenia 0.0048 0.024 0.197 0.844 -0.043 0.053
Bogotá 0.0536 0.032 1.651 0.099 -0.010 0.117
Cali -0.0378 0.023 -1.628 0.104 -0.083 0.008
Fusagasugá -0.0274 0.026 -1.046 0.296 -0.079 0.024
Medellín 0.0863 0.025 3.497 0.001 0.038 0.135
Santa Marta -0.0395 0.024 -1.667 0.096 -0.086 0.007
Santa marta -0.0299 0.026 -1.159 0.247 -0.081 0.021
Villavicencio -0.0531 0.024 -2.239 0.026 -0.100 -0.006
Omnibus: 387.531 Durbin-Watson: 2.008
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11242.681
Skew: 2.916 Prob(JB): 0.00
Kurtosis: 25.088 Cond. No. 1.36e+17


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.19e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Let's only look at the variables that have a significant P-value and whose coefficients confidentally do not pass through 0.

In [392]:
results_as_html = results.summary().tables[1].as_html()
results_df = pd.read_html(results_as_html, header=0, index_col=0)[0]

results_df[
    (results_df['P>|t|'] < 0.05) &
    (results_df['[0.025'] * results_df['0.975]'] > 0)
].sort_values(by='coef', ascending=False)
Out[392]:
coef std err t P>|t| [0.025 0.975]
Área construída 0.5497 0.041 13.564 0.000 0.470 0.629
Estrato 6 0.1996 0.028 7.031 0.000 0.144 0.255
Baños 0.1529 0.045 3.377 0.001 0.064 0.242
Área privada 0.1224 0.030 4.130 0.000 0.064 0.181
Menor a 1 año 0.0866 0.025 3.527 0.000 0.038 0.135
Medellín 0.0863 0.025 3.497 0.001 0.038 0.135
Villavicencio -0.0531 0.024 -2.239 0.026 -0.100 -0.006
Baño Auxiliar -0.0750 0.029 -2.594 0.010 -0.132 -0.018
Más de 30 años -0.0838 0.025 -3.403 0.001 -0.132 -0.035
Estrato 3 -0.0851 0.021 -4.150 0.000 -0.125 -0.045
Habitaciones -0.0963 0.035 -2.764 0.006 -0.165 -0.028
Estrato 2 -0.1050 0.025 -4.179 0.000 -0.154 -0.056

What factors are important in determining the value of a house in Colombia?¶

In the table above, all variables which contribute to the value of a house in Colombia that are confidently non-zero and are statistically significant are listed. They are also ordered from top to bottom based on the weight of their coefficients. Because these variables were normalized before the model was fitted, the coefficient is an indication of how much weight is given to each variable in terms of its power in determining the price of a house.

As expected, Área construída was the most impactful variable with a normalized coefficient of 0.5497. This means that, as expected, as the area of a house increases, so does its price.

Next, we can see that houses that are Estrato 6 are worth significantly more money than houses in other estratos. This is likely an indication of the zone that the house is in, sinsce estrato 6 is the highest possible class in Colombia and houses there are few and far between.

Next, we can also see that the number of bathrooms Baños tends to push up the price, and that houses that are less than 1 year old Menor a 1 año are more expensive than houses that are more than 30 years old Más de 30 años. We can also see that houses in Medellin tend to be more expensive than houses in Villavicencio (two different cities in Colombia) and houses in poorer, working class zones (estratos 3 and 2) tend to be cheaper than other houses.

Forward subset selection¶

Here, I would like to create a model that is able to predict house prices at a high level with the least number of variables. To do this, I will use the code below to add variables one at a time to see how each additional one affects the performance of the regression model. Each additional predictor will be penalized via the Bayesian Information Criterion (BIC).

In [393]:
%%time

models_fwd = pd.DataFrame(columns=["RSS", "MSE", "model", "AIC", "BIC", "AdjR2"])

predictors = []

for i in range(len(X_train.columns)):
    remaining_predictors = [p for p in X_train_std.columns if p not in predictors]
    results = []
    
    for p in remaining_predictors:
        pred = predictors + [p]
        model = sm.OLS(y_train_std, sm.add_constant(X_train_std[pred]))
        regr = model.fit()
        
        model_info = {
            'model': regr,
            'RSS': ((regr.predict(sm.add_constant(X_train_std[pred])) - pd.DataFrame(y_train_std)[0]) ** 2).sum(),
            'AIC': regr.aic,
            'BIC': regr.bic,
            'AdjR2': regr.rsquared_adj
        }
        results.append(model_info)
        
    models = pd.DataFrame(results)
    best_model = models.loc[models['RSS'].argmin()]
    
    models_fwd.loc[i, 'RSS'] = best_model['RSS']
    models_fwd.loc[i, 'AIC'] = best_model['AIC']
    models_fwd.loc[i, 'BIC'] = best_model['BIC']
    models_fwd.loc[i, 'model'] = best_model['model']
    models_fwd.loc[i, 'AdjR2'] = best_model['AdjR2']
    
    exog_names = best_model['model'].model.exog_names
    exog_names.remove('const')
    
    variable_added = list(set(exog_names).difference(set(predictors)))[0]
    predictors.append(variable_added)
    
CPU times: total: 8.94 s
Wall time: 7.43 s

The graph below shows how the number of predictor variables (x-axis) affects the BIC. As you can see, the lowest BIC occurs somewhere between 9 and 13 predictor variables. Technically, the best model would be the one that occurs at about 13 predictors, but I believe that this is random chance and would prefer to use a model that only uses around 10 predictor variables instead. This will give me nearly the exact same performance but with more interpretable results.

In [394]:
plt.plot(np.arange(len(models_fwd)) +1, models_fwd['BIC'])
plt.title('BIC as Number of Predictor Variables Increase')
plt.ylabel('BIC')
plt.xlabel('Number of Predictors')
plt.show()
In [395]:
models_fwd.sort_values(by='BIC').iloc[0, 2].model.exog_names
Out[395]:
['Área construída',
 'Estrato 6',
 'Estrato 5',
 'Estrato 4',
 'Menor a 1 año',
 'Área privada',
 'Medellín',
 'Bogotá',
 'Más de 30 años',
 'Baño Auxiliar',
 'Administración',
 'Baños',
 'Habitaciones']

The list above shows that variables that I should use in the optimal model. However, I opted to use the second best model instead, since it uses less variables and has an extremely similar performance. The second model's variables are shown below (got rid of number of bathrooms and number of bedrooms).

In [396]:
models_fwd.sort_values(by='BIC').iloc[1, 2].model.exog_names
Out[396]:
['Área construída',
 'Estrato 6',
 'Estrato 5',
 'Estrato 4',
 'Menor a 1 año',
 'Área privada',
 'Medellín',
 'Bogotá',
 'Más de 30 años',
 'Baño Auxiliar',
 'Administración']
In [397]:
variables_to_use = models_fwd.sort_values(by='BIC').iloc[1, 2].model.exog_names

results = sm.OLS(y_train_std, sm.add_constant(X_train_std[variables_to_use])).fit()
results.summary()
Out[397]:
OLS Regression Results
Dep. Variable: y R-squared: 0.685
Model: OLS Adj. R-squared: 0.678
Method: Least Squares F-statistic: 99.66
Date: Tue, 25 Apr 2023 Prob (F-statistic): 6.82e-119
Time: 02:46:27 Log-Likelihood: -435.29
No. Observations: 517 AIC: 894.6
Df Residuals: 505 BIC: 945.6
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -1.258e-17 0.025 -5.03e-16 1.000 -0.049 0.049
Área construída 0.5850 0.029 19.871 0.000 0.527 0.643
Estrato 6 0.2964 0.032 9.137 0.000 0.233 0.360
Estrato 5 0.1592 0.029 5.539 0.000 0.103 0.216
Estrato 4 0.1052 0.028 3.790 0.000 0.051 0.160
Menor a 1 año 0.0918 0.025 3.638 0.000 0.042 0.141
Área privada 0.1147 0.029 4.019 0.000 0.059 0.171
Medellín 0.1161 0.029 4.038 0.000 0.060 0.173
Bogotá 0.0712 0.028 2.567 0.011 0.017 0.126
Más de 30 años -0.0773 0.026 -2.957 0.003 -0.129 -0.026
Baño Auxiliar -0.0689 0.026 -2.633 0.009 -0.120 -0.018
Administración 0.0756 0.029 2.609 0.009 0.019 0.132
Omnibus: 388.030 Durbin-Watson: 2.053
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11156.288
Skew: 2.926 Prob(JB): 0.00
Kurtosis: 24.992 Cond. No. 2.56


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Using the variables determined using forward subset selection, I was able to create a model that has no collinearity and which has a high R-squared value at 0.685. This means that I can use these fewer variables to predict house prices with some degree of accuracy.

How well does the model work?¶

Below, I normalized the test set of data and attempted to predict the prices of houses using the model previously created. Upon doin so, I obtained a mean squared error of 0.542. This means that when this model predicts housing prices, it is usually off from the true price of the house by about 0.542 standard deviations squared. This does not seem like a great score for this model and suggests to me that I may need either more data or more tuning of this regression model in the future to make it even better.

In [451]:
X_test_std = pd.DataFrame(X_scaler.transform(X_test), columns=X_test.columns)[variables_to_use]
y_test_std = y_scaler.transform(pd.DataFrame(y_test))

predicted = results.predict(sm.add_constant(X_test_std))
In [399]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test_std, predicted)
Out[399]:
0.5418669876320477

An example¶

The main idea behind this regression is that it doesn't have to be perfect-- it just has to be good enough. If I were to run predictions on the test set, for example, I might see the table below appear displaying the price that my model predicted and the house's actual price. As seen below, there was a positive error of $328,865.

That means that, in theory, the regression model saw the house as being worth much more money than it is actually being sold for. I can go to the link for house number 677 and check to see if purchasing it is a deal that's "too good to be true".

The houses that are really expensive are probably not as safe investments using this model given the high amount of heteroskedacity that exists as houses get bigger. Houses that are smaller or lower in price, but which the model predicts to have a higher price, are likely the best properties to invest in to get a high rate of return.

In [400]:
predicted = np.array(predicted.values.tolist()).reshape(-1, 1)
predicted = np.ravel(y_scaler.inverse_transform(predicted))

predictions = pd.DataFrame({'Predicted Price': predicted, 'Actual Price': y_test})
predictions['Error'] = predictions['Predicted Price'] - predictions['Actual Price']
In [401]:
predictions.sort_values(by='Error', ascending=False).head()
Out[401]:
Predicted Price Actual Price Error
677 706642.981395 377777.78 328865.201395
671 335328.019435 200000.00 135328.019435
102 199433.732672 95555.56 103878.172672
603 163728.140858 60000.00 103728.140858
258 216364.436065 120000.00 96364.436065
In [404]:
pd.options.display.max_colwidth = 100

enlaces[677]
Out[404]:
'https://www.fincaraiz.com.co/inmueble/casa-en-venta/calima/cali/7387424'

In the case above, the high price of the house makes me nervous. I would probably prefer to look more into house number 102, since it is a low priced house but which the model predicted to have a much higher value. In this case, this house in Santa Marta actually does seem like a steal (only $95,000) and is located in a coastal town and is very beautiful.

In [407]:
enlaces[102]
Out[407]:
'https://www.fincaraiz.com.co/inmueble/apartamento-en-venta/san-jose/santa-marta/6431027'

Turning the predictions table the other way could give similar insights to homes that are overvalued. If the actual price is significantly higher than the predicted price, the model did not see the value that the sellers are reflecting in the price. Again, the model will likely give more stable results for houses whose actual price is lower, due to the high heteroskedasticity of the model.

In [408]:
predictions.sort_values(by='Error', ascending=True).head()
Out[408]:
Predicted Price Actual Price Error
249 480144.617638 1222222.22 -742077.602362
305 596894.006345 1111111.11 -514217.103655
83 366645.633554 666666.67 -300021.036446
572 5985.152453 255555.56 -249570.407547
315 402700.062228 511111.11 -108411.047772
In [409]:
enlaces[249]
Out[409]:
'https://www.fincaraiz.com.co/inmueble/casa-en-venta/medellin/medellin/7213889'

Non-Parametric Models¶

Just for exploration's sake, I want to see how well a random forest regressor and support vector regressor can perform on this data set.

In [456]:
y_train_std = pd.DataFrame(y_train_std)[0]
y_test_std = pd.DataFrame(y_test_std)[0]

X_test_std = pd.DataFrame(X_scaler.transform(X_test), columns=X_test.columns)
X_test_std.insert(0, 'const', 1)

Random Forest¶

In [427]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators': [100, 200, 500],
    'criterion': ['squared_error', 'absolute_error'],
    'max_features': ['sqrt', 'log2', None]
}

grid = GridSearchCV(RandomForestRegressor(), params, cv=10)
grid.fit(sm.add_constant(X_train_std), y_train_std)
Out[427]:
GridSearchCV(cv=10, estimator=RandomForestRegressor(),
             param_grid={'criterion': ['squared_error', 'absolute_error'],
                         'max_features': ['sqrt', 'log2', None],
                         'n_estimators': [100, 200, 500]})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=10, estimator=RandomForestRegressor(),
             param_grid={'criterion': ['squared_error', 'absolute_error'],
                         'max_features': ['sqrt', 'log2', None],
                         'n_estimators': [100, 200, 500]})
RandomForestRegressor()
RandomForestRegressor()

Support Vector Regressor¶

In [428]:
from sklearn.svm import SVR

params = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [.1, .5, 1, 5, 10]
}

svr_grid = GridSearchCV(SVR(), params, cv=10)
svr_grid.fit(sm.add_constant(X_train_std), y_train_std)
Out[428]:
GridSearchCV(cv=10, estimator=SVR(),
             param_grid={'C': [0.1, 0.5, 1, 5, 10],
                         'kernel': ['linear', 'poly', 'rbf', 'sigmoid']})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=10, estimator=SVR(),
             param_grid={'C': [0.1, 0.5, 1, 5, 10],
                         'kernel': ['linear', 'poly', 'rbf', 'sigmoid']})
SVR()
SVR()

Comparison¶

By comparing the models below, it actually looks like linear regression performed the best after all. Out of random forests, support vector machines, and linear regression, linear regression had the lowest mean squared error out of all the models.

In [452]:
# Linear Regression MSE
mean_squared_error(y_test_std, predicted)
Out[452]:
0.5418669876320477
In [449]:
# Random Forest MSE
rf = RandomForestRegressor(**grid.best_params_)
rf.fit(sm.add_constant(X_train_std), y_train_std)
predicted = rf.predict(X_test_std)
mean_squared_error(y_test_std, predicted)
Out[449]:
0.6395066158850118
In [457]:
# Support Vector Regressor
svr = SVR(**svr_grid.best_params_)
svr.fit(sm.add_constant(X_train_std), y_train_std)
predicted = svr.predict(X_test_std)
mean_squared_error(y_test_std, predicted)
Out[457]:
0.6116958259105769

Conclusion¶

In this analysis, I gathered and prepared data for a regression analysis and was able to identify reasons that this data set may not completely conform to the assumptions of linear regression. However, I was also able to use normalization to standardize the variables and determine which ones have the greatest impact on the price of a house in Colombia. I also used forward subset selection to create a model with fewer variables that can still predict the price of a house in Colombia with some (limited) degree of accuracy.

In the future, I will try to gather more data and see if I can tune this model to allow me to make good purchasing decisions when it comes to investing in a house in Colombia.