Weather data visualization for San Francisco Bay Area – a Python Pandas and Matplotlib Tutorial


Weather data is a good input for learning data science tools: it is numeric, easy to interpret, and usually needs some cleaning. This project introduces the basics of the Pandas and Matplotlib Python libraries using weather data for San Francisco, San Mateo, Santa Clara, Mountain View and San Jose in California.

Why Pandas?

Pandas is an open source Python library for data cleaning, manipulation, transformation, and visualization. This project introduces basic concepts such as:

  • data cleaning
  • data frames
  • data manipulation
  • data transformation
  • data visualization

Weather data analysis – the code

Let’s start by importing the libraries we need and loading the weather data. You can download the weather.csv file here.

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Import San Francisco Bay Area Weather data from CSV file
data = pd.read_csv('weather.csv')
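Before changing anything, it helps to look at what was actually loaded. The sketch below is a minimal illustration that uses a tiny in-memory stand-in for weather.csv (the values are made up, and `sample` is a hypothetical name), so it runs without the real file:

```python
import io
import pandas as pd

# Tiny stand-in for weather.csv; the values are invented for illustration
sample_csv = io.StringIO(
    "Max TemperatureF,Min TemperatureF, Mean Humidity,ZIP\n"
    "74,61,75,94107\n"
    "78,,68,94063\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # note the leading space in ' Mean Humidity'
print(sample.dtypes)         # a numeric column with missing values is read as float64
print(sample.isna().sum())   # count of missing values per column
```

Running the same four calls on `data` shows the real column names, including any leading spaces, which is why the renaming step matters.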

Then we give the columns shorter, friendlier names:

# Give the columns shorter, friendlier names
old_names = ['Max TemperatureF', 'Min TemperatureF', 'Mean TemperatureF', 'Max Dew PointF', 'MeanDew PointF',
             'Min DewpointF', 'Max Humidity',
             ' Mean Humidity', ' Min Humidity', ' Max Sea Level PressureIn', ' Mean Sea Level PressureIn',
             ' Min Sea Level PressureIn', ' Max VisibilityMiles', ' Mean VisibilityMiles',
             ' Min VisibilityMiles', ' Max Wind SpeedMPH', ' Mean Wind SpeedMPH', ' Max Gust SpeedMPH', 'PrecipitationIn',
             ' CloudCover', ' WindDirDegrees', ' Events']
# new_names must follow the same order as old_names, because zip() pairs them by position
new_names = ['maxTemp', 'minTemp', 'meanTemp', 'maxDew', 'meanDew', 'minDew', 'maxHum', 'meanHum', 'minHum', 'maxPress',
             'meanPress', 'minPress', 'maxVis', 'meanVis',
             'minVis', 'maxWind', 'meanWind', 'maxGust', 'preIn', 'cloud', 'WindDir', 'events']
data.rename(columns=dict(zip(old_names, new_names)), inplace=True)
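One thing to watch with rename(): by default it silently skips any old name that is not present in the DataFrame, so a typo or a missing leading space goes unnoticed. The following is a small sketch of a check, using a stand-in frame and hypothetical variable names (`frame`, `old`, `new`, `missing`):

```python
import pandas as pd

# Stand-in frame; in the tutorial, `data` comes from weather.csv
frame = pd.DataFrame({'Max TemperatureF': [74], ' Mean Humidity': [75], 'ZIP': [94107]})

# ' Events' is deliberately absent from `frame`
old = ['Max TemperatureF', ' Mean Humidity', ' Events']
new = ['maxTemp', 'meanHum', 'events']

# Old names that do not exist in the frame; rename() would ignore them without warning
missing = [name for name in old if name not in frame.columns]
print(missing)

renamed = frame.rename(columns=dict(zip(old, new)))
print(list(renamed.columns))
```

If `missing` is not empty for your own file, compare the names against `data.columns` before going further.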

After that we can drop a column we do not use from the DataFrame and remove unrealistic samples (e.g. a reading of -100 degrees), which would otherwise distort the chart. In our example we keep only the rows whose maximum temperature is at most 110 °F and whose minimum temperature is at least 25 °F. For this we will use the following code:

# Drop the unused precipitation column from the DataFrame (the CSV file itself is not modified)
del data['preIn']

# Keep only the rows with plausible temperatures (in °F)
data = data[(data['maxTemp'] <= 110) & (data['minTemp'] >= 25)]
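To see what this boolean filter does, here is a minimal sketch on a four-row frame with made-up values (`temps`, `mask` and `clean` are hypothetical names). Note that a comparison with a missing value (NaN) is False, so rows with a missing temperature are dropped as well:

```python
import numpy as np
import pandas as pd

# Made-up readings: one good row, two implausible rows, one row with a missing value
temps = pd.DataFrame({
    'maxTemp': [74, 134, 68, np.nan],
    'minTemp': [61, 58, -40, 50],
})

# True only where both conditions hold; & combines the two masks element by element
mask = (temps['maxTemp'] <= 110) & (temps['minTemp'] >= 25)
clean = temps[mask]

print(mask.tolist())
print(f"kept {len(clean)} of {len(temps)} rows")
```

Comparing `len(data)` before and after the filter is a quick way to check how many samples were removed from the real file.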

Next we extract the unique ZIP codes from the data. This tells us how many different cities the data covers.

# List the unique values of a column using drop_duplicates() (data['ZIP'].unique() also works)
df2 = pd.DataFrame(data, columns=['ZIP'])
u = df2.drop_duplicates(['ZIP'])
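The unique() alternative mentioned in the comment can be sketched as follows, again on made-up data with hypothetical names. value_counts() is added because it also shows how many rows each ZIP code has:

```python
import pandas as pd

# Made-up ZIP column with repeated values
zips = pd.DataFrame({'ZIP': [94107, 94063, 94107, 95113, 94063, 94107]})

unique_zips = zips['ZIP'].unique()         # NumPy array, in order of first appearance
rows_per_zip = zips['ZIP'].value_counts()  # Series mapping each ZIP code to its row count

print(unique_zips)
print(rows_per_zip)
```

Both approaches return the same set of ZIP codes; drop_duplicates() returns a DataFrame, while unique() returns a plain array.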

The result is the following list of ZIP codes, which we will use in the next steps.

# Get data for cities
# 94107 -> San Francisco
# 94063 -> San Mateo
# 94301 -> Santa Clara
# 94041 -> Mountain View
# 95113 -> San Jose
zipcodes = [94107, 94063, 94301, 94041, 95113]
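If you prefer to keep the city names next to the ZIP codes in code rather than in comments, a dictionary is one option. This is only a sketch; `zip_to_city` and `city_labels` are hypothetical names, and the pairs are copied from the comments above:

```python
# ZIP-to-city pairs taken from the comments in the tutorial
zip_to_city = {
    94107: 'San Francisco',
    94063: 'San Mateo',
    94301: 'Santa Clara',
    94041: 'Mountain View',
    95113: 'San Jose',
}

zipcodes = list(zip_to_city)                      # keys, in insertion order
city_labels = [zip_to_city[z] for z in zipcodes]  # names in the same order as zipcodes
print(city_labels)
```

A list built this way can be passed to plt.legend() and stays in step with `zipcodes` if a city is added or removed.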

After that, we can plot the data, for example the mean temperature in each Bay Area city, using the following code:

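# Plot the mean temperature (in °F) for each ZIP code

plt.figure()
for zcode in zipcodes:
    # Select the rows that belong to this ZIP code
    local = data.loc[data['ZIP'] == zcode]
    df1 = pd.DataFrame(local, columns=['meanTemp'])
    # to_numpy() returns the values as a NumPy array (as_matrix() no longer exists in pandas 1.0+)
    plt.plot(df1.to_numpy(), '-', label=str(zcode))

# Without explicit tick positions and labels the x axis shows the row index.
# For month names, call plt.xticks(positions, names, ...) with values that match your date range.
plt.xticks(rotation='vertical', fontsize=12)
plt.grid(True)
plt.xlabel('Month')
plt.ylabel('Temperature in Fahrenheit scale', fontsize=15)
plt.title('Fahrenheit Mean Temperature on Bay Area Cities', fontsize=20)
# Legend entries are listed in the same order as zipcodes
plt.legend(["San Francisco", "San Mateo", "Santa Clara", "Mountain View", "San Jose"])
plt.show()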

[Figure: Pandas line plot of mean temperature]
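Since the x axis is labelled by month, one possible extension is to average the daily readings per month before plotting. The sketch below assumes the data has a date column; the name `date` is hypothetical (check what your CSV calls it), and the readings are synthetic:

```python
import pandas as pd

# Synthetic stand-in: 90 daily readings for each of two ZIP codes.
# 'date' is a hypothetical column name, not one taken from weather.csv.
dates = pd.date_range('2014-01-01', periods=90, freq='D')
daily = pd.DataFrame({
    'date': list(dates) * 2,
    'ZIP': [94107] * 90 + [95113] * 90,
    'meanTemp': [55.0] * 90 + [60.0] * 90,
})

# Group by calendar month and ZIP code, average, then move ZIP codes into columns:
# the result has one row per month and one column per ZIP code
monthly = (
    daily
    .groupby([daily['date'].dt.to_period('M'), 'ZIP'])['meanTemp']
    .mean()
    .unstack('ZIP')
)
print(monthly)
```

Calling `monthly.plot()` on a table shaped like this draws one line per ZIP code with months on the x axis.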

…and that’s it 🙂

Want to launch the project? Download the code from GitHub.

If you have any questions about the project, the libraries, or this post, please ask in the comments.
