Pandas数据处理

注意:这是一篇英文文章

Pandas is a powerful python library for data processing, here I will introduce some Apis in Pandas that is frequently used.

Two basic data structures

Series

Pandas Series. Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index. Pandas Series is nothing but a column in an excel sheet.

DataFrame

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects or a table with rows and columns. The primary pandas data structure.

Basic usage

import pandas

1
import pandas as pd

Read data from file

You can get a DataFame data structure from dataset.

CSV dataset

read data
1
df = pd.read_csv("/path/to/file.csv")

More arguments please visit pandas.read_csv

write data
1
df.to_csv("/path/to/save.csv", sep=",", index=False)

documentation

Excel dataset

read data
1
df = pd.read_excel("/path/to/file.xlsx")

More arguments please visit pandas.read_excel

write data
1
df.to_excel("/path/to/save.csv", sep=",", index=False)

documentation

Create Series

1
2
3
4
names = ['sean li', 'joey liu', 'mary chan', 'john zhang', 'harry liang', 'tim huang', 'chandler tan']
series = pd.Series(data=names)

print(series)
1
2
3
4
5
6
7
8
0         sean li
1 joey liu
2 mary chan
3 john zhang
4 harry liang
5 tim huang
6 chandler tan
dtype: object

Create DataFram

from list

1
2
3
4
5
6
7
8
9
10
11
12
datas = [
("sean li", "male", 22, 99, 98),
("joey liu", "female", 22, 100, 99),
("mary chan", "female", 21, 97, 98),
("john zhang", "male", 22, 99, 96),
("harry liang", "male", 20, 89, 86),
("tim huang", "male", 21, 90, 94),
("chandler tan", "male", 20, 93, 95)
]

df = pd.DataFrame(data=datas, columns=["name", "gender", "age",
"math", "english"])

from dict

1
2
3
4
5
6
7
8
data_dict = {
"name": ['sean li', 'joey liu', 'mary chan', 'john zhang', 'harry liang', 'tim huang', 'chandler tan'],
"gender": ['male', 'female', 'female', 'male', 'male', 'male', 'male'],
"math": [99, 100, 97, 99, 89, 90, 93],
"english": [98, 99, 98, 96, 86, 94, 95]
}

df = pd.DataFrame(data_dict)

Basic information of DataFrame

1
df.info()

The function will print the DataFrame`s structure information.

1
2
3
4
5
6
7
8
9
10
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
name 7 non-null object
gender 7 non-null object
age 7 non-null int64
math 7 non-null int64
english 7 non-null int64
dtypes: int64(3), object(2)
memory usage: 352.0+ bytes
1
print(df.describe())
1
2
3
4
5
6
7
8
9
             age        math    english
count 7.000000 7.000000 7.000000
mean 21.142857 95.285714 95.142857
std 0.899735 4.572173 4.413184
min 20.000000 89.000000 86.000000
25% 20.500000 91.500000 94.500000
50% 21.000000 97.000000 96.000000
75% 22.000000 99.000000 98.000000
max 22.000000 100.000000 99.000000

list the column names

1
df.columns
1
['name', 'gender', 'age', 'math', 'english']

list the index names

1
df.index
1
RangeIndex(start=0, stop=7, step=1)

get shape

1
print(df.shape)
1
(7, 5)

Get rows

1
df.head(3)
1
2
3
4
        name  gender  age  math  english
0 sean li male 22 99 98
1 joey liu female 22 100 99
2 mary chan female 21 97 9
1
df.tail(3)
1
2
3
4
           name gender  age  math  english
4 harry liang male 20 89 86
5 tim huang male 21 90 94
6 chandler tan male 20 93 95

pandas.DataFrame.iloc

Purely integer-location based indexing for selection by position.

get single row
1
df.iloc[3]
1
2
3
4
5
6
name       john zhang
gender male
age 22
math 99
english 96
Name: 3, dtype: object
get rows in a continuous range
1
df.iloc[0:2]  #== df.iloc[range(3)]
1
2
3
4
        name  gender  age  math  english
0 sean li male 22 99 98
1 joey liu female 22 100 99
2 mary chan female 21 97 98
get rows in a discontinuous range
1
df.iloc[[0,2]]
1
2
3
        name  gender  age  math  english
0 sean li male 22 99 98
2 mary chan female 21 97 98

pandas.DataFrame.loc

Access a group of rows and columns by label(s) or a boolean array. It`s similar usage to iloc to get row(s) with default index (not specify index names), but two differences in them:

  • loc can only use positive integer to indicate number or range, so df.loc[-1] is incorrect, df.iloc[-1] is correct syntax.
  • df.iloc[:2] == df.iloc[range(2)] == df.loc[range(2)] \(\neq\) df.loc[:2], three rows(0, 1, 2) in df.loc[:2], while two rows(0, 1) in others.

Get columns

just specify the column name in square brackets,(like with a dictionary)

1
df['name']
1
2
3
4
5
6
7
8
0         sean li
1 joey liu
2 mary chan
3 john zhang
4 harry liang
5 tim huang
6 chandler tan
Name: name, dtype: object

pick column(s) by df.columns

1
df[df.columns[0:2]]
1
2
3
4
5
6
7
8
           name  gender
0 sean li male
1 joey liu female
2 mary chan female
3 john zhang male
4 harry liang male
5 tim huang male
6 chandler tan male

Get specific rows and columns

usage

1
2
3
4
# 1. selection is integer-based
df.iloc[<row selection], <column selection]
# 2. selection is label-based
df.loc[<row selection], <column selection>]

iloc

1
df.iloc[:2, [0,3,4]]
1
2
3
       name  math  english
0 sean li 99 98
1 joey liu 100 99

loc

1
df.loc[:2, ['name', 'math', 'english']]
1
2
3
4
        name  math  english
0 sean li 99 98
1 joey liu 100 99
2 mary chan 97 98

Basic functions of pandas

logical operations

1
2
3
df[df['age'] == 22]
df[(df['age'] == 22) & (df['math'] > 96)]
df[(df['age'] == 22) & (df['math'] > 96) & ~(df['english'] < 96)]

Subset the data

pandas DataFrame methods

type

Series
1
type(df['name'])
1
<class 'pandas.core.series.Series'>
DataFrame
1
type(df[['name', 'age']])
1
<class 'pandas.core.frame.DataFrame'>

head() / tail()

1
2
df['name'].head()
df['name'].tail()

get the unique entries of a column

1
df['math'].unique()

split string

1
df.name.str.split(" ", expand=True)
1
2
3
4
5
6
7
8
          0      1
0 sean li
1 joey liu
2 mary chan
3 john zhang
4 harry liang
5 tim huang
6 chandler tan

renaming a column

1
df.rename(columns = {'name': 'fullname'}, inplace=True)

calculate for numerical columns

  • pandas.DataFrame.corr() — finds the correlation between columns in a DataFrame.
  • pandas.DataFrame.count() — counts the number of non-null values in each DataFrame column.
  • pandas.DataFrame.max() — finds the highest value in each column.
  • pandas.DataFrame.min() — finds the lowest value in each column.
  • pandas.DataFrame.median() — finds the median of each column.
  • pandas.DataFrame.std() — finds the standard deviation of each column.
1
2
df.mean()
df['age'].mean()

counting occurrences

1
df['age'].value_counts()
1
2
3
4
22    3
21 2
20 2
Name: age, dtype: int64

group by and value_counts

1
df.groupby('age')["math"].value_counts()
1
2
3
4
5
6
7
8
age  math
20 89 1
93 1
21 90 1
97 1
22 99 2
100 1
Name: math, dtype: int64

map(), apply(), applymap()

  • map() : applies a function to each element of a column.
  • apply() : applies a function to columns. Use .apply(, axis=1) to do it on the rows.
  • applymap() : applies a function to all cells in the table (DataFrame).
map()
1
2
df['name_lens'] = df['name'].map(len)
print(df[['name', 'name_lens']])
1
2
3
4
5
6
7
8
           name  name_lens
0 sean li 7
1 joey liu 8
2 mary chan 9
3 john zhang 10
4 harry liang 11
5 tim huang 9
6 chandler tan 12
apply()
1
df.apply(len)
1
2
3
4
5
6
name       7
gender 7
age 7
math 7
english 7
dtype: int64
applymap()
1
df.applymap(lambda x: len(str(x)))
1
2
3
4
5
6
7
8
   name  gender  age  math  english
0 7 4 2 2 2
1 8 6 2 3 2
2 9 6 2 2 2
3 10 4 2 2 2
4 11 4 2 2 2
5 9 4 2 2 2
6 12 4 2 2 2

Advanced operations in pandas

SQL join

joining on 3 columns takes just one line

1
df.merge(other_df, on=['column_1', 'column_2', 'column_3'])

pandas.DataFrame.merge

groupby

1
2
for gid, item in df.groupby('age'):
# TODO

Group DataFrame or Series using a mapper or by a Series of columns. pandas.DataFrame.groupby

Refrence

sean lee wechat
欢迎关注我的公众号!
感谢支持!