House Price Prediction with Python | Aman Kharwal (2024)

Predicting house prices can help determine the selling price of a house in a particular region and can help people find the right time to buy a home. In this article, I will introduce you to a machine learning project on house price prediction with Python.

House Price Prediction

In this House Price Prediction project, our task is to use data from the California census to create a machine learning model that predicts house prices in the state. The data includes features such as the population, median income, and median house price for each block group in California.

Block groups are the smallest geographic unit in this data, and each typically has a population of 600 to 3,000 people. We can call them districts for short. Ultimately, our machine learning model should learn from this data and be able to predict the median house price in any district, given all the other metrics.

House Price Prediction with Python

I hope you have understood the above problem statement about predicting the house prices. Now, I will take you through a machine learning project on House Price prediction with Python. Let’s start by importing the necessary Python libraries and the dataset:

import pandas as pd

housing = pd.read_csv("housing.csv")
housing.head()

Each row represents a district and there are 10 attributes in the dataset. Now let’s use the info() method, which is useful for getting a quick description of the data, especially the total number of rows, the type of each attribute, and the number of non-null values:

housing.info()
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

There are 20,640 instances in the dataset. Note that the total_bedrooms attribute has only 20,433 non-null values, which means 207 districts are missing this value. We will have to deal with that later.
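A quick way to confirm how many values are missing in each column is to count the nulls directly. The sketch below uses a small made-up DataFrame (named toy here, not the real housing.csv) so that it runs on its own; with the real data you would call the same methods on housing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# isnull() marks missing entries; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows that have no value for total_bedrooms.
n_missing = int(toy["total_bedrooms"].isnull().sum())
print(n_missing)
```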

All attributes are numeric except for the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but since the data was loaded from a CSV file, it holds text. You can find out which categories exist in that column and how many districts belong to each category by using the value_counts() method:

housing.ocean_proximity.value_counts()

Another quick way to get a feel for what kind of data you’re dealing with is to plot a histogram for each numerical attribute:

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(10, 8))
plt.show()

The next step in this task of House Price Prediction is to split the data into training and test sets. Creating a test set is theoretically straightforward: select some instances at random, typically 20% of the dataset (or less if your dataset is very large), and set them aside:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
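To see what test_size and random_state do in practice, here is a minimal sketch on a 10-row toy DataFrame (made up for illustration). It checks the sizes of the two sets and shows that fixing random_state gives the same split every time, which keeps the test set stable across runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows standing in for the housing data.
toy = pd.DataFrame({"value": range(10)})

# Two calls with the same seed.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

# 20% of the rows go to the test set, the rest to the training set.
print(len(train_a), len(test_a))

# Same seed, same split: the test rows are identical in both calls.
print(test_a.index.equals(test_b.index))
```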

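Let’s take a closer look at the histogram of median income: most median income values cluster around 1.5 to 6, but some go well beyond 6. To make sure the test set is representative of the various income levels in the whole dataset, we can group the districts into income categories (strata) and sample from each one.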

It is important to have a sufficient number of instances in your dataset for each stratum; otherwise, the estimate of the importance of a stratum may be biased. This means that you should not have too many strata and that each stratum should be large enough. The following code creates an income category attribute with five categories, labeled 1 to 5:

import numpy as np

housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing['income_cat'].hist()
plt.show()
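To make the binning concrete, here is a small sketch that applies the same bins and labels to a handful of made-up income values. Note that pd.cut() uses right-closed intervals by default, so a value exactly on a bin edge, such as 1.5, falls into the lower category.

```python
import numpy as np
import pandas as pd

# Made-up median incomes: one per category, plus a value on a bin edge (1.5).
incomes = pd.Series([0.9, 1.5, 2.7, 4.0, 5.9, 12.3])

income_cat = pd.cut(
    incomes,
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5],
)
print(income_cat.tolist())  # → [1, 1, 2, 3, 4, 5]
```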

Stratified Sampling on Dataset

The next step is to perform stratified sampling on the dataset, so that the test set contains the same proportion of each income category as the full dataset. You are now ready to perform stratified sampling based on the income category. For this, you can use the StratifiedShuffleSplit class of Scikit-Learn:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64
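To see why stratification is worth the extra step, the following sketch compares the category proportions of a stratified test set and a purely random test set against the full dataset. The data is synthetic (drawn with roughly the proportions printed above), so the numbers are only illustrative and not results from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic income categories (made-up data, not the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1000,
                             p=[0.04, 0.32, 0.35, 0.18, 0.11]),
})

overall = toy["income_cat"].value_counts(normalize=True)

# Stratified split: category proportions are matched to the full dataset.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_idx]  # split() yields positional indices

# Purely random split for comparison.
_, rand_test = train_test_split(toy, test_size=0.2, random_state=42)

comparison = pd.DataFrame({
    "overall": overall,
    "stratified": strat_test["income_cat"].value_counts(normalize=True),
    "random": rand_test["income_cat"].value_counts(normalize=True),
}).sort_index()
print(comparison)

# Largest deviation of the stratified proportions from the overall ones.
strat_error = (comparison["stratified"] - comparison["overall"]).abs().max()
print(strat_error)
```

The stratified column should track the overall column closely, while the random column can drift by a larger margin, especially for the small categories.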

Now you need to remove the income_cat attribute that we added, to get the data back to its original form:


for set_ in (strat_train_set, strat_test_set):
    set_.drop('income_cat', axis=1, inplace=True)

housing = strat_train_set.copy()

Now before creating a machine learning model for house price prediction with Python let’s visualize the data in terms of longitude and latitude:

housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
             s=housing['population']/100, label='population', figsize=(12, 8),
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True)
plt.legend()
plt.show()

The graph shows house prices in California: red is expensive, blue is cheap, and larger circles indicate areas with a larger population.

Finding Correlations

Since the dataset is not too large, you can easily calculate the standard correlation coefficient between each pair of attributes using the corr() method:

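# numeric_only=True skips the text column ocean_proximity
# (recent pandas versions raise an error on it otherwise)
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix.median_house_value.sort_values(ascending=False))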
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

The correlation coefficient ranges from -1 to 1. When it is close to 1, there is a strong positive correlation, and when it is close to -1, there is a strong negative correlation. When it is close to 0, there is no linear correlation.
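The word "linear" matters here. As a small illustration on made-up numbers, the sketch below computes the coefficient for two perfectly linear relationships and for one relationship that is completely determined but not linear. The last one comes out at roughly 0 even though y depends entirely on x.

```python
import numpy as np

# Evenly spaced values, symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)

# Perfect linear relationships give coefficients of +1 and -1.
r_pos = np.corrcoef(x, 2 * x + 3)[0, 1]
r_neg = np.corrcoef(x, -0.5 * x)[0, 1]

# y = x**2 depends entirely on x, but not linearly, so the coefficient is ~0.
r_nonlinear = np.corrcoef(x, x ** 2)[0, 1]

print(round(r_pos, 3), round(r_neg, 3), round(abs(r_nonlinear), 3))
```

So a coefficient near 0 in the list above only rules out a linear relationship with the median house value, not every kind of relationship.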

Another way to check the correlation between attributes is to use the pandas scatter_matrix() function, which plots each numeric attribute against every other numeric attribute:
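The code for this plot is not included here, so the following is only a sketch of how scatter_matrix() from pandas.plotting can be called. It uses a synthetic DataFrame (toy, with made-up values) and an assumed selection of four attributes; with the real data you would pass housing[attributes] instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook or desktop session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for four of the housing columns (values are made up).
rng = np.random.default_rng(42)
income = rng.uniform(0.5, 15.0, size=200)
toy = pd.DataFrame({
    "median_house_value": 40000 * income + rng.normal(0, 50000, size=200),
    "median_income": income,
    "total_rooms": rng.uniform(100, 5000, size=200),
    "housing_median_age": rng.uniform(1, 52, size=200),
})

# Assumed choice of attributes; plotting every column would give a very large grid.
attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]

# Returns a grid of Axes: one scatter plot per pair, histograms on the diagonal.
axes = scatter_matrix(toy[attributes], figsize=(12, 8))
# In an interactive session, call plt.show() here to display the grid.
```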


And now let’s look at the correlation matrix again after adding three new columns to the dataset: rooms per household, bedrooms per room, and population per household:

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]

# numeric_only=True skips the text column ocean_proximity
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

Data Preparation

This is the most important step before we train a machine learning model for the task of house price prediction. Now let’s perform all the necessary data transformations:
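The transformation code itself is not shown here. As a rough sketch of what this preparation might look like (an assumption, not the author's exact code), the example below separates the label from the predictors and fills in missing values with the column median using Scikit-Learn's SimpleImputer. The names housing and housing_labels match the ones used later in the article; the small strat_train_set DataFrame and the housing_filled name are made up so that the sketch runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for strat_train_set (values are made up for illustration).
strat_train_set = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0],
})

# Separate the predictors from the label we want to predict.
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

# Replace missing values with the median of each column.
imputer = SimpleImputer(strategy="median")
housing_filled = pd.DataFrame(
    imputer.fit_transform(housing),
    columns=housing.columns,
    index=housing.index,
)
print(housing_filled)
```

Using the median is only one option for the missing total_bedrooms values; dropping the affected districts or the whole attribute would be alternatives.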

As you can see, there are many data transformation steps that need to be performed in the correct order. Fortunately, Scikit-Learn provides the Pipeline class to help you with such sequences of transformations. Here is a small pipeline for numeric attributes:
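The pipeline code is missing from this page, so the following is a hedged sketch rather than the author's implementation. It assumes a numeric pipeline made of a median imputer and a standard scaler, combined with one-hot encoding of ocean_proximity through a ColumnTransformer. The names full_pipeline and housing_prepared are taken from the code used later in the article; num_attribs, cat_attribs and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the training predictors (values are made up).
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})

num_attribs = ["median_income", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numeric columns: fill missing values first, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Apply the numeric pipeline to the numeric columns and
# one-hot encode the text column.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)
```

Because the steps live in one object, the same full_pipeline.transform() call can later be applied to new rows, which is how the linear regression code below prepares its sample data.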

Linear Regression for House Price Prediction with Python

Now I will use the linear regression algorithm for the task of house price prediction with Python:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

data = housing.iloc[:5]
labels = housing_labels.iloc[:5]
data_preparation = full_pipeline.transform(data)
print("Predictions: ", lin_reg.predict(data_preparation))
Predictions: [210644.60459286 317768.80697211 210956.43331178 59218.98886849 189747.55849879]
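The code above defines labels but never compares them with the predictions. One common way to do that is the root mean squared error (RMSE). The sketch below shows the idea on synthetic data with a known linear relationship (made up for illustration); it is not an evaluation of the housing model, and with the real data you would pass housing_labels and the model's predictions instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: y = 3x + 5 plus noise with standard deviation 1 (made up).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, size=200)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

predictions = lin_reg.predict(X)

# RMSE is in the same units as the target, which makes it easy to interpret.
rmse = np.sqrt(mean_squared_error(y, predictions))

print("Predictions:", predictions[:5])
print("Labels:     ", y[:5])
print("RMSE:", rmse)
```

Here the RMSE should land close to the noise level of 1. An RMSE measured on the training data tends to be optimistic, so the test set that was put aside earlier is the fairer benchmark.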

I hope you liked this article on a machine learning project on House Price Prediction with Python. Feel free to ask your valuable questions in the comments section below.

To recap, this project uses California census data to build a machine learning model that predicts median house prices at the block group level. Each block group, or district, comes with features such as population, median income, and median house price.

The Python code presented in the article covers the following steps:

  1. Data Loading and Exploration:

    • Importing necessary libraries and loading the dataset using Pandas.
    • Using info() method to gain insights into the dataset's structure and attribute types.
    • Exploring the distribution of categorical data using value_counts().
  2. Data Visualization:

    • Generating histograms for each numerical attribute to visualize data distributions.
    • Creating scatter plots to visualize the geographical distribution of house prices in California.
  3. Data Preprocessing:

    • Stratified sampling based on income category to ensure a representative training and test set.
    • Removing the artificially added 'income_cat' attribute to restore the data.
  4. Correlation Analysis:

    • Calculating the correlation matrix to understand relationships between attributes.
    • Visualizing correlations using scatter_matrix().
    • Introducing new features such as rooms per household, bedrooms per room, and population per household.
  5. Data Preparation and Transformation:

    • Performing essential data transformations in preparation for model training.
    • Utilizing Scikit-Learn's Pipeline class for a seamless sequence of transformations.
  6. Machine Learning Model (Linear Regression):

    • Applying linear regression for house price prediction.
    • Demonstrating model training and making predictions on a sample dataset.

This comprehensive approach ensures that the dataset is thoroughly explored, preprocessed, and transformed before employing a machine learning model. The linear regression model serves as an initial predictive tool, laying the groundwork for more sophisticated models in future iterations.

Feel free to engage and pose any questions or insights you might have about this machine learning project on house price prediction with Python.
