Pandas notes

This is a tutorial I gave for the strategy consultants and developers in my company. The aim is to encourage them to use python(pandas) in their daily work of analytics, instead of Excel.

Import file

First of all, import pandas module

import pandas as pd

Import example excel file

df = pd.read_excel("job_examples.xlsx")

Import specific excel sheet(jobs)

df = pd.read_excel("job_examples.xlsx", sheet_name="jobs")

Use a column(job_id) as an index

df = pd.read_excel("job_examples.xlsx", sheet_name="jobs", index_col="job_id")

Overview

Gives a summary of dataframe

df.info()

Summary statistics of dataframe

df.describe()

Datatypes of the dataframe

df.dtypes

Get the name of all columns

df.columns

Dimensions of dataframe

df.shape

Select

Head or Tail - display first/final 5 rows

df.head()
df.tail()

Select a column

df['location']
df.loc[:,['location']]
df.iloc[:,[3]] 							# 4th column

Select some columns

df[['job_title','location','company_name']]
df.loc[:, ['job_title','location','company_name']]
df.iloc[:, [3, 4, 5]] 					# 3, 4, 5th column

Select some rows

df.loc[22779711:22779706]
df.loc[[22779711, 22465415, 22779706]]	# job id as index

df.iloc[10:20]
df.iloc[[1,3,5]]

Select some rows and columns

df.loc[[22779711, 22465415, 22779706], ['job_title','location','company_name']]
df.iloc[10:20, [3, 4, 5]]		

Filter categorical data

df[df['country_code'] == 'DE']
df[df['country_code'].isin(['DE', 'GB'])]	# multiple filter
df[~df['country_code'] == 'DE']				# negative filter

Filter numerical data

df[df['job_id']>1000]
df[(df['job_id']>1000) & (df['job_id']<10000)]

Filter string data

df[df['location'].str.contains("Essen")]

Sort value

df.sort_values("language_id")

Statistic summary

Shape of dataframe

df.shape					# #row, #column
df.size						# #cells
df.count()

Sum

df.sum(axis = 0)			# sum column
df.sum(axis = 1)			# sum row

Mean, max, min

df.mean()
df.max()
df.min()

Unique - all unique values

df['language_id'].unique()

Groupby

df.groupby('language_id').mean()
df.groupby('language_id').max()
df.groupby('language_id').min()

Lambda function

It is a very flexible way to manipulate data in a specific column.

The following two examples are the same, the query with lambda function may look very long, but you can define any function as you like.

df[df['language_id']>1]
df[df['language_id'].apply(lambda x: x > 1)]

A more complex example

## find long job titles
df[df.apply(lambda x : len(x['job_title'].split(" "))>10,axis=1)]

References