Getting Started with pandas


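These notes cover the basics of pandas. Unlike NumPy, pandas can handle data of different types in the same table, yet both libraries follow the same array-oriented programming style.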

Before starting, make sure pandas is installed in your environment:

# !/bin/bash
pip install pandas

Throughout these notes we use the following import convention:

import numpy as np
import pandas as pd

# Series and DataFrame are used so often that importing them directly into the namespace is also recommended
from pandas import Series, DataFrame

These notes use Python 3.12.3, a version the author uses regularly. Unless stated otherwise, all examples are run interactively in IPython.

The reference book for these notes is Python for Data Analysis, a great book that makes my python spin.

pandas Data Structures

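Just as learning NumPy starts with ndarray, learning pandas starts with its two core data structures, Series and DataFrame. They are not a universal solution (extremely complex data-processing scenarios may call for more), but they embody the library's most fundamental ideas.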

Series

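A Series is a one-dimensional array-like object in pandas. It contains a sequence of values of the same type (similar to a NumPy array) and an associated array of data labels, called its index. The simplest Series is built from just an array of data: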

In [1]: import pandas as pd

In [2]: obj = pd.Series([4, 7, -5, 3])

In [3]: obj
Out[3]:
0    4
1    7
2   -5
3    3
dtype: int64

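In the interactive display, the string representation of a Series shows the index on the left and the values on the right. Since no index was specified, pandas creates a default integer index from 0 to N-1.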

The array and index attributes return the array representation and the index object of a Series:

In [4]: obj.array
Out[4]:
<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [5]: obj.index
Out[5]: RangeIndex(start=0, stop=4, step=1)

Note

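.array returns a PandasArray object (displayed as NumpyExtensionArray in recent pandas versions, as in the output above). It usually wraps a NumPy array but can also be a special extension array type.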

Often you will want to create a Series with a specific index that identifies each data point:

In [6]: obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

In [7]: obj2
Out[7]:
d    4
b    7
a   -5
c    3
dtype: int64

In [8]: obj2.index
Out[8]: Index(['d', 'b', 'a', 'c'], dtype='object')

Unlike a NumPy array, a Series lets you select single values or sets of values by label:

In [9]: obj2['a']
Out[9]: np.int64(-5)

In [10]: obj2['b']
Out[10]: np.int64(7)

In [11]: obj2[['b', 'c', 'd']]
Out[11]:
b    7
c    3
d    4
dtype: int64

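pandas supports many NumPy-like array operations, such as filtering with a Boolean array, scalar multiplication, or applying math functions, and all of them preserve the link between index and value: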

In [12]: obj2[obj2 > 0]
Out[12]:
d    4
b    7
c    3
dtype: int64

In [13]: obj2 * 2
Out[13]:
d     8
b    14
a   -10
c     6
dtype: int64

In [14]: import numpy as np

In [15]: np.exp(obj2)
Out[15]:
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

You can also think of a Series as a fixed-length, ordered dictionary, since it maps index values to data values:

In [16]: 'b' in obj2
Out[16]: True

In [17]: 'e' in obj2
Out[17]: False

Likewise, pandas can create a Series directly from a Python dictionary:

In [18]: sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [19]: obj3 = pd.Series(sdata)

In [20]: obj3
Out[20]:
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

A Series can be converted back to a dictionary with its to_dict method:

In [21]: obj3.to_dict()
Out[21]: {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

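When you pass only a dictionary, the index of the resulting Series follows the order of the dictionary's keys method, which depends on the order in which the keys were inserted. You can override this by passing an index with the dictionary keys in the order you want: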

In [22]: states = ["California", "Ohio", "Oregon", "Texas"]

In [23]: obj4 = pd.Series(sdata, index=states)

In [24]: obj4
Out[24]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

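Here "California" in states does not appear in sdata, so pandas sets the value at that label to NaN. Because "Utah" is not in the specified index, its value from the original data is dropped.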

In these notes, "missing", "NA", and "null" all refer to missing values. pandas provides the isna and notna functions to detect missing data, and Series also has the .isna() and .notna() methods:

In [25]: pd.isna(obj4)
Out[25]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [26]: pd.notna(obj4)
Out[26]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [27]: obj4.isna()
Out[27]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]: obj4.notna()
Out[28]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

A Series automatically aligns by index label in arithmetic operations:

In [29]: obj3
Out[29]:
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [30]: obj4
Out[30]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [31]: obj3 + obj4
Out[31]:
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

If you have worked with databases, this alignment is similar to a join operation.
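As a rough sketch of the join analogy, the align method makes the implicit step visible: it reindexes both operands to the union of their labels (an outer join) before the element-wise addition. The data repeats the sdata example above; the names left, right, and total are chosen only for illustration.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Reindex both operands to the union of their labels (an outer join)
left, right = obj3.align(obj4, join="outer")

# Adding the aligned operands gives the same result as obj3 + obj4
total = left + right
print(total)
```

Labels present on only one side ("California" and "Utah") end up as NaN, just as unmatched rows carry nulls in an outer join.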

Both the Series object and its index have a name attribute, which integrates with other areas of pandas functionality:

In [32]: obj4.name = "population"

In [33]: obj4.index.name = "state"

In [34]: obj4
Out[34]:
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

The index of a Series can be changed in place by assignment:

In [35]: obj
Out[35]:
0    4
1    7
2   -5
3    3
dtype: int64

In [36]: obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

In [37]: obj
Out[37]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

DataFrame

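A DataFrame is a rectangular table of data containing an ordered, named collection of columns, each of which can hold a different value type (numeric, string, Boolean, and so on). A DataFrame has both a row index and a column index, so it can be thought of as a dictionary of Series that all share the same index.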

Note

Although a DataFrame is physically two-dimensional, it can represent higher-dimensional data through hierarchical indexing.

There are many ways to construct a DataFrame. One of the most common is from a Python dictionary of equal-length lists or NumPy arrays:

In [1]: import pandas as pd

In [2]: data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
   ...:  "year": [2000, 2001, 2002, 2001, 2002, 2003],
   ...:  "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [3]: frame = pd.DataFrame(data)

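The resulting DataFrame has its row index assigned automatically (as with Series), and its columns are ordered according to the keys in the data, which here means the insertion order of the dictionary: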

In [4]: frame
Out[4]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

The head method returns only the first five rows of a DataFrame. If you know Linux, this will look familiar:

In [5]: frame.head()
Out[5]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

Similarly, the tail method returns the last five rows:

In [6]: frame.tail()
Out[6]:
    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

If you specify a sequence of column names, the DataFrame's columns are arranged in that order:

In [7]: pd.DataFrame(data, columns=["year", "state", "pop"])
Out[7]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

If you pass a column name that is not in the data, pandas fills that column with missing values:

In [8]: frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

In [9]: frame2
Out[9]:
   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN

In [10]: frame2.columns
Out[10]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column of a DataFrame can be retrieved as a Series either with dictionary-like notation or with attribute access:

In [11]: frame2['year']
Out[11]:
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

In [12]: frame2.year
Out[12]:
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Tip

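The frame[column] form works for any column name, whereas frame.column works only when the column name is a valid Python identifier and does not conflict with the name of a DataFrame method.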
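The "pop" column used in these notes happens to illustrate the conflict: DataFrame has a method named pop, so attribute access returns the method rather than the column. A minimal sketch on a two-row frame made up for this illustration:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

by_key = frame["pop"]   # bracket notation always returns the column
by_attr = frame.pop     # attribute access finds the DataFrame.pop method instead

print(type(by_key).__name__)   # Series
print(callable(by_attr))       # True: a bound method, not the data
```

The "state" column has no such conflict, so frame.state and frame["state"] return the same Series.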

Rows can be retrieved by position or by label with the special iloc and loc attributes:

In [13]: frame2.loc[1]
Out[13]:
year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

In [14]: frame2.iloc[2]
Out[14]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object

Column values can be modified by assignment:

In [15]: frame2['debt'] = 2.5

In [16]: frame2
Out[16]:
   year   state  pop  debt
0  2000    Ohio  1.5   2.5
1  2001    Ohio  1.7   2.5
2  2002    Ohio  3.6   2.5
3  2001  Nevada  2.4   2.5
4  2002  Nevada  2.9   2.5
5  2003  Nevada  3.2   2.5

In [17]: import numpy as np

In [18]: frame2['debt'] = np.arange(6)

In [19]: frame2
Out[19]:
   year   state  pop  debt
0  2000    Ohio  1.5     0
1  2001    Ohio  1.7     1
2  2002    Ohio  3.6     2
3  2001  Nevada  2.4     3
4  2002  Nevada  2.9     4
5  2003  Nevada  3.2     5

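Note that when you assign a list or an array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are aligned exactly to the DataFrame's row index, and missing values are inserted wherever a label is absent. In the example below none of the labels in val match the integer row index, so the whole column becomes NaN: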

In [20]: val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])

In [21]: frame2["debt"] = val

In [22]: frame2
Out[22]:
   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7   NaN
2  2002    Ohio  3.6   NaN
3  2001  Nevada  2.4   NaN
4  2002  Nevada  2.9   NaN
5  2003  Nevada  3.2   NaN

Assigning to a column that does not exist creates a new column.

Warning

New columns cannot be created with the frame.column attribute syntax!
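The warning above can be checked with a small sketch. The frame and the column names debt and debt2 are made up for this illustration; pandas emits a UserWarning for the attribute assignment, which is silenced here only to keep the output clean.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas warns about this pattern
    frame.debt = [0.1, 0.2]           # only sets a plain Python attribute

frame["debt2"] = [0.1, 0.2]           # bracket assignment creates a real column

print(frame.columns.tolist())
```

Only debt2 shows up among the columns; the attribute assignment leaves the table itself unchanged.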

The del keyword deletes a column, just as it deletes a key from a dictionary:

In [23]: frame2['eastern'] = frame2['state'] == 'Ohio'

In [24]: frame2
Out[24]:
   year   state  pop  debt  eastern
0  2000    Ohio  1.5   NaN     True
1  2001    Ohio  1.7   NaN     True
2  2002    Ohio  3.6   NaN     True
3  2001  Nevada  2.4   NaN    False
4  2002  Nevada  2.9   NaN    False
5  2003  Nevada  3.2   NaN    False

In [25]: del frame2['eastern']

In [26]: frame2.columns
Out[26]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

Warning

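The column returned from indexing a DataFrame is a view on the underlying data, not a copy, so any in-place modification to it is reflected in the DataFrame. To get an independent copy, call the Series's copy method explicitly.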

Another common way to build a DataFrame is from a nested dictionary:

In [27]: populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}, "Nevada": {2001: 2.4, 2002: 2.9}}

When a nested dictionary is passed to DataFrame, pandas interprets the outer keys as the columns and the inner keys as the row index.

In [28]: frame3 = pd.DataFrame(populations)

In [29]: frame3
Out[29]:
      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9

As with a NumPy array, you can transpose a DataFrame with the T attribute:

In [30]: frame3.T
Out[30]:
        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9

Warning

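If the columns of the original DataFrame do not all share the same data type, transposing discards the column data types, and pandas stores the result as object rather than a more efficient type.

Transposing back does not undo this loss. Mathematically, transposing twice gives an array equivalent to the original, but by then the DataFrame has already lost its original data types.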
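A minimal sketch of the dtype loss, using a made-up frame with one integer column and one string column:

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2], "label": ["x", "y"]})

# Transposing mixes integers and strings in every column of the result,
# so the transposed frame falls back to the generic object dtype
round_trip = df.T.T

print(df.dtypes)          # "n" starts out as int64
print(round_trip.dtypes)  # "n" is object after the round trip
```

The values themselves survive the round trip; only the efficient column type is gone.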

The keys of the inner dictionaries are combined and arranged to form the row index of the result, unless an explicit index is specified:

In [31]: pd.DataFrame(populations, index=[2001, 2002, 2003])
Out[31]:
      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9
2003   NaN     NaN

Building a DataFrame from a dictionary of Series follows similar logic:

In [32]: pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}

In [33]: pd.DataFrame(pdata)
Out[33]:
      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4

The following table lists most of the data types that can be passed to the DataFrame constructor:

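Type	Notes
2D ndarray	A matrix of data; optional row and column labels can be passed
Dictionary of arrays, lists, or tuples	Each sequence becomes a column in the DataFrame; all sequences must be the same length
NumPy structured/record array	Treated as the "dictionary of arrays" case
Dictionary of Series	Each value becomes a column; if no explicit index is passed, the indexes of the Series are unioned to form the row index of the result
Dictionary of dictionaries	Each inner dictionary becomes a column; the keys are unioned to form the row index, as in the "dictionary of Series" case
List of dictionaries or Series	Each item becomes a row in the DataFrame; the union of the dictionary keys or Series indexes becomes the DataFrame's column labels
List of lists or tuples	Treated as the "2D ndarray" case
Another DataFrame	The original DataFrame's indexes are used unless different ones are passed
NumPy MaskedArray	Like the "2D ndarray" case, except that masked values are missing in the DataFrame result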
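Two rows of the table, sketched with made-up data: a 2D ndarray with explicit labels, and a list of dictionaries whose keys are unioned into the columns.

```python
import numpy as np
import pandas as pd

# 2D ndarray: the matrix supplies the values, labels are optional
from_array = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=["r1", "r2"],
    columns=["a", "b", "c"],
)

# List of dictionaries: each dictionary becomes a row, and the union
# of the keys becomes the column labels (absent keys give NaN)
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

print(from_array)
print(from_records)
```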

The row index and the column index of a DataFrame can each be given a name:

In [34]: frame3.index.name = 'year'

In [35]: frame3.columns.name = 'state'

In [36]: frame3
Out[36]:
state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9

Unlike Series, a DataFrame has no name attribute. Its to_numpy method returns the data it contains as a two-dimensional ndarray:

In [37]: frame3.to_numpy()
Out[37]:
array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

Index Objects

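pandas's Index objects hold the axis labels (including a DataFrame's column names) and other metadata (such as the axis name). Any sequence of labels passed when constructing a Series or DataFrame is converted internally to an Index: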

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: obj = pd.Series(np.arange(3), index = ['a', 'b', 'c'])

In [4]: index = obj.index

In [5]: index
Out[5]: Index(['a', 'b', 'c'], dtype='object')

In [6]: index[1:]
Out[6]: Index(['b', 'c'], dtype='object')

Note

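Index objects are immutable, so an existing Index cannot be modified directly. This design makes it safe to share Index objects among data structures while keeping them consistent.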

In [7]: labels = pd.Index(np.arange(3))

In [8]: labels
Out[8]: Index([0, 1, 2], dtype='int64')

In [9]: obj2 = pd.Series([1.5, -2.5, 0], index = labels)

In [10]: obj2
Out[10]:
0    1.5
1   -2.5
2    0.0
dtype: float64

In [11]: obj2.index is labels
Out[11]: True
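The immutability mentioned in the note can be checked directly. In this small sketch the flag mutated and the name replaced are only for illustration; in-place assignment raises a TypeError, while methods such as delete and insert return a new Index.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

try:
    index[1] = "d"          # in-place modification is rejected
    mutated = True
except TypeError:
    mutated = False

# delete and insert leave the original untouched and return a new Index
replaced = index.delete(1).insert(1, "d")

print(mutated)
print(list(index), list(replaced))
```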

In addition to being array-like, an Index also behaves like a fixed-size set.

In [12]: populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}, "Nevada": {2001: 2.4, 2002: 2.9}}

In [13]: frame = pd.DataFrame(populations)

In [14]: frame.index.name = 'year'

In [15]: frame.columns.name = 'state'

In [16]: frame
Out[16]:
state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9

In [17]: frame.columns
Out[17]: Index(['Ohio', 'Nevada'], dtype='object', name='state')

In [18]: 'Ohio' in frame.columns
Out[18]: True

In [19]: '2003' in frame.columns
Out[19]: False

Unlike a Python set, however, an Index can contain duplicate labels.

In [20]: pd.Index(['foo', 'foo', 'bar', 'bar'])
Out[20]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Selecting with a duplicated label returns all occurrences of that label.
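A short sketch of that behavior on a made-up Series: a duplicated label returns a Series containing every match, while a unique label returns a scalar.

```python
import pandas as pd

dup = pd.Series(range(4), index=["foo", "foo", "bar", "baz"])

repeated = dup.loc["foo"]   # a Series holding both "foo" rows
single = dup.loc["baz"]     # a scalar, because "baz" occurs once

print(repeated)
print(single)
```

Because the return type depends on the data, code that expects a scalar can check dup.index.is_unique first.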

Each Index has a number of methods and properties for set logic; the commonly used ones are listed below:

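Method/Property	Description
append()	Concatenate with additional Index objects, producing a new Index
difference()	Compute the set difference as an Index
intersection()	Compute the set intersection
union()	Compute the set union
isin()	Compute a Boolean array indicating whether each value is contained in the passed collection
delete()	Compute a new Index with the element at position i deleted
drop()	Compute a new Index by deleting the passed values
insert()	Compute a new Index by inserting an element at position i
is_monotonic_decreasing	Returns True if each element is less than or equal to the previous element
is_monotonic_increasing	Returns True if each element is greater than or equal to the previous element
is_unique	Returns True if the Index has no duplicate values
unique()	Compute the array of unique values in the Index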
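A few of the set-style methods from the table, sketched on two small made-up indexes. Each call returns a new object and leaves a and b unchanged.

```python
import pandas as pd

a = pd.Index(["Ohio", "Texas", "Utah"])
b = pd.Index(["Texas", "Oregon"])

both = a.intersection(b)             # labels present in a and in b
either = a.union(b)                  # labels present in a or in b
only_a = a.difference(b)             # labels in a that are not in b
mask = a.isin(["Utah", "Oregon"])    # Boolean array aligned with a

print(both, either, only_a, mask, sep="\n")
```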

Essential Functionality

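This section shows how to interact with the data in a Series or DataFrame and covers the pandas features used most often.

For the more complex, obscure, or specialized features of pandas, consult the official online documentation: https://pandas.pydata.org/docs/.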

Reindexing

reindex is an important pandas method. It creates a new object, based on the original, whose values are aligned with a new index:

In [1]: import pandas as pd

In [2]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])

In [3]: obj
Out[3]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

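Calling reindex on the original object (here a Series) rearranges the data according to the new index and fills in missing values for any labels that were not present in the original index: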

In [4]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [5]: obj2
Out[5]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

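For ordered data such as time series, you may want to interpolate or fill in values when reindexing (much like autofill in Excel). The method parameter of reindex lets you choose how; for example, ffill forward-fills the values:

In [6]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])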

In [7]: obj3
Out[7]:
0      blue
2    purple
4    yellow
dtype: object

In [8]: import numpy as np

In [9]: obj3.reindex(np.arange(6), method='ffill')
Out[9]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

For a DataFrame, reindex can alter the row index, the column index, or both. When passed only a sequence, it reindexes the rows:

In [10]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])

In [11]: frame
Out[11]:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [12]: frame2 = frame.reindex(index = ['a', 'b', 'c', 'd'])

In [13]: frame2
Out[13]:
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

The columns can be reindexed with the columns keyword:

In [14]: states = ['Texas', 'Utah', 'California']

In [15]: frame.reindex(columns=states)
Out[15]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

Alternatively, pass the new labels positionally and use the axis keyword to specify which axis to reindex:

In [16]: frame.reindex(states, axis = 1)
Out[16]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

Note

Passing 1 or "columns" to axis is equivalent!

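As noted earlier, the loc and iloc attributes select rows of a DataFrame. Passing a second argument selects columns as well, which gives another way to reorder rows and columns: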

In [17]: frame.loc[['a', 'd', 'c'], ['California', 'Texas']]
Out[17]:
   California  Texas
a           2      1
d           8      7
c           5      4

Note

When reindexing with loc or iloc, the new labels may contain only values that already exist in the original index.
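The restriction can be seen in a small sketch that reuses the frame from this section; the flag loc_failed is only for illustration. Asking loc for a label that does not exist raises a KeyError, whereas reindex inserts a row of missing values.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

try:
    frame.loc[["a", "b"]]        # "b" is not an existing row label
    loc_failed = False
except KeyError:
    loc_failed = True

reindexed = frame.reindex(["a", "b"])   # "b" becomes a row of NaN

print(loc_failed)
print(reindexed)
```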

Dropping Entries

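If you already have a new index that leaves out the entries you want to remove, reindex or loc can do the job. pandas also provides a more convenient way: the drop method returns a new object with the specified entries removed: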

In [18]: obj = pd.Series(np.arange(5.), index = ['a', 'b', 'c', 'd', 'e'])

In [19]: obj
Out[19]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [20]: new_obj = obj.drop('c')

In [21]: new_obj
Out[21]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [22]: obj.drop(['d', 'c'])
Out[22]:
a    0.0
b    1.0
e    4.0
dtype: float64

With a DataFrame, entries can be dropped from either axis:

In [23]: data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=["Ohio", "Colorado", "Utah", "New York"], columns=["one", "two", "three", "four"])

In [24]: data
Out[24]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Calling drop with a sequence of labels, passed positionally or with the index keyword, drops the matching rows:

In [25]: data.drop(index = ['Colorado', 'Ohio'])
Out[25]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

Passing labels with the columns keyword drops the matching columns:

In [26]: data.drop(columns = ['two'])
Out[26]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

drop also accepts an axis argument to specify the axis from which the labels are dropped:

In [27]: data.drop('two', axis = 1)
Out[27]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In [28]: data.drop(['two', 'four'], axis = 'columns')
Out[28]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
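One detail worth a quick sketch: drop returns a new object and, by default, raises a KeyError when a label is missing. The errors="ignore" option skips missing labels instead. The label "z" and the flag raised are made up for this illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])

try:
    obj.drop(["c", "z"])              # "z" does not exist
    raised = False
except KeyError:
    raised = True

# Missing labels are skipped silently when errors="ignore"
trimmed = obj.drop(["c", "z"], errors="ignore")

print(raised)
print(trimmed)
```

In both cases obj itself still has all five entries.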

Indexing, Selection, and Filtering

Series indexing works like NumPy array indexing but is more powerful, because index values other than integers can be used:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: obj = pd.Series(np.arange(4.), index = ['a', 'b', 'c', 'd'])

In [4]: obj
Out[4]:
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [5]: obj['b']
Out[5]: np.float64(1.0)

In [6]: obj[1]
Out[6]: np.float64(1.0)

In [7]: obj[2:4]
Out[7]:
c    2.0
d    3.0
dtype: float64

In [8]: obj[['b', 'a', 'd']]
Out[8]:
b    1.0
a    0.0
d    3.0
dtype: float64

In [9]: obj[[1, 3]]
Out[9]:
b    1.0
d    3.0
dtype: float64

In [10]: obj[obj < 2]
Out[10]:
a    0.0
b    1.0
dtype: float64

Tip

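Treating an integer inside [] as a position (as opposed to a slice) is deprecated and will be removed in a future version of pandas. After that, integer keys inside [] will always be treated as labels rather than positions.

To select by integer position, use the iloc attribute.

In the examples above, obj[1] and obj[[1, 3]] can be replaced with obj.iloc[1] and obj.iloc[[1, 3]].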

Although you can select data by label directly, the preferred way is to use the dedicated loc attribute:

In [11]: obj.loc[['b', 'a', 'd']]
Out[11]:
b    1.0
a    0.0
d    3.0
dtype: float64

The advantage of loc is that it leaves no room for ambiguity, which matters, for example, when the index itself consists of integers:

In [12]: obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])

In [13]: obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [14]: obj1
Out[14]:
2    1
0    2
1    3
dtype: int64

In [15]: obj2
Out[15]:
a    1
b    2
c    3
dtype: int64

In [16]: obj1[[0, 1, 2]]
Out[16]:
0    2
1    3
2    1
dtype: int64

In [17]: obj2[[0, 1, 2]]
Out[17]:
a    1
b    2
c    3
dtype: int64

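As you can see, in the first case the integers were treated as index labels, while in the second they were treated as positions.

With loc, the elements of the key are always treated as labels, which avoids the ambiguity: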

In [18]: obj1.loc[[0, 1, 2]]
Out[18]:
0    2
1    3
2    1
dtype: int64

In [19]: obj2.loc[[0, 1, 2]]
--------------------------------------------------------------------
KeyError                           Traceback (most recent call last)
Cell In[19], line 1
----> 1 obj2.loc[[0, 1, 2]]

KeyError: "None of [Index([0, 1, 2], dtype='int64')] are in the [index]"

If you want to select by integer position rather than by label, use iloc. It always treats the elements of the key as positions, whatever the index contains:

In [20]: obj1.iloc[[0, 1, 2]]
Out[20]:
2    1
0    2
1    3
dtype: int64

In [21]: obj2.iloc[[0, 1, 2]]
Out[21]:
a    1
b    2
c    3
dtype: int64

Warning

Slicing with labels is allowed in pandas, but it behaves differently from normal Python slicing: both endpoints are included:

In [22]: obj2.loc['b': 'c']
Out[22]:
b    2
c    3
dtype: int64

Assigning to a selection modifies the corresponding values:

In [23]: obj2.loc['b': 'c'] = 5

In [24]: obj2
Out[24]:
a    1
b    5
c    5
dtype: int64
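The inclusive endpoint of label slices is easiest to see next to a positional slice. A minimal sketch on a fresh three-element Series (the names by_label and by_position are only for illustration):

```python
import pandas as pd

obj = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = obj.loc["a":"b"]     # label slice: endpoint "b" is included
by_position = obj.iloc[0:1]     # positional slice: endpoint 1 is excluded

print(by_label)
print(by_position)
```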

For a DataFrame, indexing with [] selects columns by default:

In [25]: data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=["Ohio", "Colorado", "Utah", "New York"], columns=["one", "two", "three", "four"])

In [26]: data
Out[26]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [27]: data["two"]
Out[27]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [28]: data[["three", "one"]]
Out[28]:
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

As a special case, slicing or indexing with a Boolean array inside [] selects rows:

In [29]: data[:2]
Out[29]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

In [30]: data[data["three"] > 5]
Out[30]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

We can also index with a Boolean DataFrame (by contrast, an expression such as data['three'] < 5 produces a Boolean Series):

In [8]: data < 5
Out[8]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [9]: data[data < 5] = 0

In [10]: data
Out[10]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

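Indexing with a Boolean DataFrame behaves differently from indexing with a Boolean Series: the former works element by element (when used for selection, entries where the mask is False become NaN), while the latter selects whole rows.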
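The difference can be checked on a fresh copy of the 4 by 4 example; this is only a small sketch, and the names row_filter and cell_mask are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Boolean Series: keeps whole rows where the condition holds
row_filter = data[data["three"] > 5]

# Boolean DataFrame: keeps the shape and blanks out individual cells
cell_mask = data[data > 5]

print(row_filter)
print(cell_mask)
```

The row filter drops "Ohio" entirely, while the cell mask keeps all four rows and replaces the six values from 0 to 5 with NaN.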

Selection with loc and iloc

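DataFrame also has the special attributes loc and iloc, for label-based and integer-based indexing respectively. Because a DataFrame is two-dimensional, you can select a subset of rows and columns with NumPy-like notation: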

In [11]: data
Out[11]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [12]: data.loc['Colorado']
Out[12]:
one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

Selecting a single row returns a Series whose index holds the DataFrame's column labels; selecting multiple rows returns a DataFrame:

In [13]: data.loc[["Colorado", "New York"]]
Out[13]:
          one  two  three  four
Colorado    0    5      6     7
New York   12   13     14    15

Passing a second argument selects columns at the same time:

In [14]: data.loc['Colorado', ['two', 'three']]
Out[14]:
two      5
three    6
Name: Colorado, dtype: int64

iloc performs the same kinds of selection by integer position:

In [15]: data.iloc[1]
Out[15]:
one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [16]: data.iloc[[2, 1]]
Out[16]:
          one  two  three  four
Utah        8    9     10    11
Colorado    0    5      6     7

In [17]: data.iloc[1, [3, 0, 1]]
Out[17]:
four    7
one     0
two     5
Name: Colorado, dtype: int64

In [18]: data.iloc[[2, 1], [3, 0, 1]]
Out[18]:
          four  one  two
Utah        11    8    9
Colorado     7    0    5

Both loc and iloc also accept slices:

In [19]: data.loc[:'Utah', 'two']
Out[19]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [20]: data.iloc[:, :3][data.three > 5]
Out[20]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

Note that loc accepts a Boolean Series, whereas iloc does not (a plain Boolean array still works with iloc):

In [21]: data.loc[data.three >= 2]
Out[21]:
          one  two  three  four
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

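pandas offers many ways to select and rearrange the data in a Series or DataFrame, and even more options are available when working with hierarchical indexes. The following table summarizes the common ones: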

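Type	Notes
df[column]	Select a single column or a sequence of columns from the DataFrame; special cases: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame (set values based on some condition)
df.loc[rows]	Select a single row or a subset of rows by label
df.loc[:, cols]	Select a single column or a subset of columns by label
df.loc[rows, cols]	Select both rows and columns by label
df.iloc[rows]	Select a single row or a subset of rows by integer position
df.iloc[:, cols]	Select a single column or a subset of columns by integer position
df.iloc[rows, cols]	Select both rows and columns by integer position
df.at[row, col]	Select a single scalar value by row and column label
df.iat[row, col]	Select a single scalar value by row and column integer position
reindex method	Select rows or columns by label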
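The at and iat accessors from the table are not demonstrated elsewhere in these notes, so here is a minimal sketch on a fresh copy of the 4 by 4 example (the names by_label and by_position are only for illustration):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

by_label = data.at["Utah", "two"]   # scalar lookup by row and column label
by_position = data.iat[2, 1]        # the same cell by integer position

data.at["Utah", "two"] = 90         # both accessors also support assignment
print(by_label, by_position, data.loc["Utah", "two"])
```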

Integer Indexing Pitfalls

It is always advisable to index with loc and iloc to avoid ambiguity. Here is a classic example:

In [22]: ser = pd.Series(np.arange(3.))

In [23]: ser
Out[23]:
0    0.0
1    1.0
2    2.0
dtype: float64

In [24]: ser[-1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/miniconda3/envs/Python_for_Data_Analysis/lib/python3.12/site-packages/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
    412 try:
--> 413     return self._range.index(new_key)
    414 except ValueError as err:

ValueError: -1 is not in range

In [25]: ser2 = pd.Series(np.arange(3.), index = ['a', 'b', 'c'])

In [26]: ser2
Out[26]:
a    0.0
b    1.0
c    2.0
dtype: float64

In [27]: ser2[-1]
Out[27]: np.float64(2.0)

Note that integer slices inside [] are always interpreted by position; a slice is label-based only when its bounds are labels such as strings.
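With iloc the ambiguity from the ser[-1] example disappears, because the key is always a position. A small sketch that rebuilds the two Series from above:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                          # default integer index
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])  # string index

# iloc interprets -1 as "last position" regardless of the index type
last_default = ser.iloc[-1]
last_labeled = ser2.iloc[-1]

print(last_default, last_labeled)
```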

Chained Indexing Pitfalls

Indexing is often combined with assignment to modify, in bulk, the data that meets a given condition:

In [28]: data.loc[:, 'one'] = 1

In [29]: data
Out[29]:
          one  two  three  four
Ohio        1    0      0     0
Colorado    1    5      6     7
Utah        1    9     10    11
New York    1   13     14    15

In [30]: data.iloc[2] = 5

In [31]: data
Out[31]:
          one  two  three  four
Ohio        1    0      0     0
Colorado    1    5      6     7
Utah        5    5      5     5
New York    1   13     14    15

In [32]: data.loc[data['four'] > 5] = 3

In [33]: data
Out[33]:
          one  two  three  four
Ohio        1    0      0     0
Colorado    3    3      3     3
Utah        5    5      5     5
New York    3    3      3     3

If we assign through chained indexing instead, we may get the following warning:

In [34]: data.loc[data.three == 5]['three'] = 6
<ipython-input-34-2aff7b9a150a>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.loc[data.three == 5]['three'] = 6

In [35]: data
Out[35]:
          one  two  three  four
Ohio        1    0      0     0
Colorado    3    3      3     3
Utah        5    5      5     5
New York    3    3      3     3

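With chained indexing, the modification may be applied to a temporary object rather than being reflected in the original data. In that situation pandas issues a SettingWithCopyWarning.

In practice, avoid chained indexing wherever possible. The example above can be rewritten as a single loc operation: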

In [36]: data.loc[data.three == 5, 'three'] = 6

In [37]: data
Out[37]:
          one  two  three  four
Ohio        1    0      0     0
Colorado    3    3      3     3
Utah        5    5      6     5
New York    3    3      3     3

Arithmetic and Data Alignment

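As mentioned earlier, pandas automatically aligns objects with different indexes in arithmetic, merging their labels and filling in missing values where they do not overlap:

In [1]: import pandas as pd

In [2]: import numpy as np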

In [3]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])

In [4]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

In [5]: s1
Out[5]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [6]: s2
Out[6]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [7]: s1 + s2
Out[7]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

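Automatic alignment gives the result an index that is the union of the two input indexes and introduces missing values at the labels that do not overlap. For a DataFrame, alignment is performed on both the rows and the columns:

In [8]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"), index=["Ohio", "Texas", "Colorado"])

In [9]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])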

In [10]: df1
Out[10]:
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

In [11]: df2
Out[11]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [12]: df1 + df2
Out[12]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

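In other words, the row and column labels of the result are the union of those in the inputs, but values survive only at the intersection. For example, columns 'c' and 'e' do not appear in both DataFrame objects, so both columns are entirely missing in the result. The same holds for rows whose labels are not common to both objects.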

Arithmetic Methods with Fill Values

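In arithmetic between differently indexed objects, labels found in only one object produce NaN in the result. Sometimes you want to specify the fill value (0, for example), and sometimes you want to insert NaN (np.nan) yourself; pandas supports both: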

In [13]: df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))

In [14]: df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))

In [15]: df1
Out[15]:
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

In [16]: df2
Out[16]:
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

In [17]: df1 + df2
Out[17]:
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

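By default pandas fills rows and columns whose labels are not shared with missing values. Using the add method with the fill_value argument lets you choose the fill value: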

In [18]: df1.add(df2, fill_value=0)
Out[18]:
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

The following table lists the arithmetic methods commonly used with Series and DataFrame:

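Method	Description
add, radd	Methods for addition (+); radd is the reversed version, used when the object is the right-hand operand
sub, rsub	Methods for subtraction (-); rsub is the reversed version
div, rdiv	Methods for division (/); rdiv is the reversed version
floordiv, rfloordiv	Methods for floor division (//); rfloordiv is the reversed version
mul, rmul	Methods for multiplication (*); rmul is the reversed version
pow, rpow	Methods for exponentiation (**); rpow is the reversed version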

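Note that every arithmetic method has a counterpart starting with the letter r that has its arguments reversed: a.rsub(b) is equivalent to b.sub(a), that is, b - a.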

In [19]: 1 / df1
Out[19]:
       a         b         c         d
0    inf  1.000000  0.500000  0.333333
1  0.250  0.200000  0.166667  0.142857
2  0.125  0.111111  0.100000  0.090909

In [20]: df1.rdiv(1)
Out[20]:
       a         b         c         d
0    inf  1.000000  0.500000  0.333333
1  0.250  0.200000  0.166667  0.142857
2  0.125  0.111111  0.100000  0.090909

Similarly, reindex accepts a fill_value argument to specify the fill value:

In [21]: df1.reindex(columns=df2.columns, fill_value=0)
Out[21]:
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0

Operations between DataFrame and Series

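As with NumPy arrays of different dimensions, pandas defines arithmetic between DataFrame and Series, two data types of different dimensions.

As a motivating example in NumPy, consider the difference between a two-dimensional array and one of its rows: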

In [22]: arr = np.arange(12.).reshape((3, 4))

In [23]: arr
Out[23]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [24]: arr - arr[0]
Out[24]:
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

When we subtract arr[0] from arr, NumPy broadcasts the operation so that the subtraction is performed once for each row. Operations between a DataFrame and a Series are similar:

In [25]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])

In [26]: series = frame.iloc[0]

In [27]: frame
Out[27]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [28]: series
Out[28]:
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [29]: frame - series
Out[29]:
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

Note

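By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows. The axis argument of the arithmetic methods selects a different axis to match on.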

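If a label is not found in both the DataFrame's columns and the Series's index, the objects are reindexed to form the union, and missing values are filled in: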

In [30]: series2 = pd.Series(np.arange(3), index=["b", "e", "f"])

In [31]: series2
Out[31]:
b    0
e    1
f    2
dtype: int64

In [32]: frame + series2
Out[32]:
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN

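To broadcast over the columns instead, you have to use one of the arithmetic methods and specify that the match is on the row index: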

In [33]: series3 = frame["d"]

In [34]: frame
Out[34]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [35]: series3
Out[35]:
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [36]: frame.sub(series3, axis=0)
Out[36]:
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

Note

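The axis you pass is the axis to match on. With the row axis selected (axis=0, or "index"), the operation is broadcast across the columns.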
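A common use of this pattern is centering each row on its own mean. The sketch below assumes the same frame as above; the names row_means, centered, and misaligned are only for illustration. It also shows what goes wrong with the plain operator, which matches the row labels of the Series against the column labels of the DataFrame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

row_means = frame.mean(axis=1)            # one value per row label

# Match on the row index, broadcast across the columns
centered = frame.sub(row_means, axis=0)

# The plain operator matches on the columns instead, so nothing lines up
misaligned = frame - row_means

print(centered)
print(misaligned.shape)
```

Every row of centered is -1, 0, 1, whereas misaligned has seven all-NaN columns: the three original column labels plus the four row labels.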

Function Application and Mapping

NumPy ufuncs (element-wise array functions) also work with pandas objects:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: frame = pd.DataFrame(np.random.standard_normal((4, 3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])

In [4]: frame
Out[4]:
               b         d         e
Utah    0.911910 -0.888903  1.202699
Ohio    0.037189 -1.045611 -1.754939
Texas   0.175772  1.990400 -0.677816
Oregon -2.042728  0.621887  0.729554

In [5]: np.abs(frame)
Out[5]:
               b         d         e
Utah    0.911910  0.888903  1.202699
Ohio    0.037189  1.045611  1.754939
Texas   0.175772  1.990400  0.677816
Oregon  2.042728  0.621887  0.729554

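Another frequent operation is applying a function that works on one-dimensional arrays to each column or row. DataFrame's apply method does exactly this: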

In [6]: frame.apply(lambda x: x.max() - x.min())
Out[6]:
b    2.954639
d    3.036011
e    2.957638
dtype: float64

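As you can see, by default the function is invoked once on each column, and the result is a Series indexed by the column labels. Passing axis=1 (or axis="columns") invokes the function once per row instead: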

In [7]: frame.apply(lambda x: x.max() - x.min(), axis=1)
Out[7]:
Utah      2.091602
Ohio      1.792128
Texas     2.668216
Oregon    2.772282
dtype: float64

Note

The function passed to apply need not return a scalar value; it can also return a Series with multiple values.
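To illustrate the note, here is a sketch in which the function returns a two-element Series for each column. The helper name min_max is hypothetical, and the frame uses fixed values instead of random ones so the result is reproducible.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

def min_max(col):
    # Return two labeled values per column instead of a single scalar
    return pd.Series([col.min(), col.max()], index=["min", "max"])

summary = frame.apply(min_max)
print(summary)
```

Because each call returns a Series, apply assembles the results into a DataFrame with one column per input column and the rows "min" and "max".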

Element-wise Python functions can be used as well, through the map method of DataFrame and Series:

In [8]: frame.map(lambda x: f"{x:.2f}")
Out[8]:
            b      d      e
Utah     0.91  -0.89   1.20
Ohio     0.04  -1.05  -1.75
Texas    0.18   1.99  -0.68
Oregon  -2.04   0.62   0.73

Sorting and Ranking

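Sorting by some criterion is another important operation. In pandas, both Series and DataFrame support the sort_index method, which returns a new object sorted lexicographically by the specified index:

In [1]: import pandas as pd

In [2]: import numpy as np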

In [3]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=["three", "one"], columns=["d", "a", "b", "c"])

In [4]: frame
Out[4]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [5]: frame.sort_index()
Out[5]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [6]: frame.sort_index(axis=1)
Out[6]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

In [7]: obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])

In [8]: obj
Out[8]:
d    0
a    1
b    2
c    3
dtype: int64

In [9]: obj.sort_index()
Out[9]:
a    1
b    2
c    3
d    0
dtype: int64

Data is sorted in ascending order by default; set ascending=False to sort in descending order:

In [10]: frame.sort_index(axis="columns", ascending=False)
Out[10]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

To sort a Series by its values, use sort_values:

In [11]: obj = pd.Series([4, 7, -3, 2])

In [12]: obj.sort_values()
Out[12]:
2   -3
3    2
0    4
1    7
dtype: int64

Missing values are sorted to the end by default; pass na_position="first" to place them at the start:

In [13]: obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

In [14]: obj.sort_values()
Out[14]:
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [15]: obj.sort_values(na_position="first")
Out[15]:
1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64

When sorting a DataFrame, you can use one or more columns as the sort keys by passing their names to sort_values:

In [16]: frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

In [17]: frame
Out[17]:
   b  a
0  4  0
1  7  1
2 -3  0
3  2  1

In [18]: frame.sort_values("b")
Out[18]:
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1

In [19]: frame.sort_values(["a", "b"])
Out[19]:
   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
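A short sketch of mixing sort directions across keys. It assumes you want the first key ascending and the second descending; sort_values accepts a list of booleans for ascending, one per key.

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" ascending, then break ties by "b" descending
result = frame.sort_values(["a", "b"], ascending=[True, False])
print(result)
```

The ascending list must have the same length as the list of sort keys.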

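Ranking assigns each element a rank from 1 up to the number of valid data points, starting from the lowest value. By default, rank breaks ties by giving every member of a tied group the mean of the ranks that group occupies: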

In [20]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [21]: obj.rank()
Out[21]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

To rank tied values by the order in which they appear in the data instead of averaging, pass method="first":

In [22]: obj.rank(method="first")
Out[22]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

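The table below lists the tie-breaking methods available for rank:

method	Description
average	Default: assign the average rank to each entry in the tied group
min	Use the minimum rank for the whole group
max	Use the maximum rank for the whole group
dense	Like min, but ranks always increase by 1 between groups, so ties do not consume later ranks
first	Assign ranks in the order the values appear in the data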
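To see the tie-breaking methods side by side, the following sketch ranks the same Series with each of them and collects the results in one DataFrame. The variable names methods and ranks are only for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method
methods = ["average", "min", "max", "dense", "first"]
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

The two 7s occupy ranks 6 and 7, so they get 6.5 with average, 6 with min, 7 with max, 5 with dense (only five distinct values exist) and 6 and 7 with first.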

Ranks are assigned in ascending order by default, but you can also rank in descending order:

In [23]: obj.rank(ascending=False)
Out[23]:
0    1.5
1    7.0
2    1.5
3    3.5
4    5.0
5    6.0
6    3.5
dtype: float64

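A DataFrame can compute ranks over the rows or over the columns. Passing axis=1 ranks the values within each row, across the columns: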

In [24]: frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1], "c": [-2, 5, 8, -2.5]})

In [25]: frame
Out[25]:
     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5

In [26]: frame.rank(axis=1)
Out[26]:
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0

Axis Indexes with Duplicate Labels

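Most of the time axis labels are unique (which is also the recommended setup), and some pandas functions require it, but duplicate labels can still turn up. Consider a small Series with a duplicated index: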

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

In [4]: obj
Out[4]:
a    0
a    1
b    2
b    3
c    4
dtype: int64

In [5]: obj.index.is_unique
Out[5]: False

Selection behaves differently when a label is duplicated: indexing a label with multiple entries returns a Series rather than a scalar:

In [6]: obj['a']
Out[6]:
a    0
a    1
dtype: int64

The same logic applies to indexing the rows of a DataFrame:

In [7]: df = pd.DataFrame(np.random.standard_normal((5, 3)), index=["a", "a", "b", "b", "c"])

In [8]: df
Out[8]:
          0         1         2
a -1.649760 -0.003162  1.030816
a -0.626376  0.693747  0.062382
b  1.906361  0.101620  0.212420
b  0.070924  0.828396 -0.926583
c  1.355518  0.728007 -1.273681

In [9]: df.loc['b']
Out[9]:
          0         1         2
b  1.906361  0.101620  0.212420
b  0.070924  0.828396 -0.926583

In [10]: df.loc['c']
Out[10]:
0    1.355518
1    0.728007
2   -1.273681
Name: c, dtype: float64
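Since the return type depends on whether a label is duplicated, it can help to detect and resolve duplicates up front. The sketch below shows two possible strategies (keep the first entry, or aggregate); which one is appropriate depends on the data, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# Boolean mask marking every repeat of a label after its first occurrence
dup_mask = obj.index.duplicated(keep="first")

# Strategy 1: keep only the first row for each label
first_only = obj[~dup_mask]

# Strategy 2: combine the duplicated rows, here by summing them
summed = obj.groupby(level=0).sum()

print(first_only)
print(summed)
```

After either step the index is unique, so label-based selection returns a scalar again.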

Summarizing and Computing Descriptive Statistics

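pandas objects come with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics, such as the mean or the sum. Unlike the equivalent NumPy array methods, most of them have built-in handling for missing data. For example: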

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=["a", "b", "c", "d"], columns=["one", "two"])

In [4]: df
Out[4]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

Calling the DataFrame sum method returns a Series containing the column sums:

In [5]: df.sum()
Out[5]:
one    9.25
two   -5.80
dtype: float64

Passing axis=1 sums across the columns instead, giving one value per row:

In [6]: df.sum(axis=1)
Out[6]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

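By default sum skips NA values, and a row or column that is entirely NA sums to 0. Passing skipna=False disables this, so any row or column that contains an NA value yields NA: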

In [7]: df.sum(skipna=False)
Out[7]:
one   NaN
two   NaN
dtype: float64

In [8]: df.sum(axis=1, skipna=False)
Out[8]:
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

Some aggregations, such as mean, need at least one non-NA value to produce a result; otherwise the result is NA:

In [9]: df.mean()
Out[9]:
one    3.083333
two   -2.900000
dtype: float64

In [10]: df.mean(axis=1)
Out[10]:
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

Tip

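In recent pandas versions the level parameter of the descriptive statistics methods has been deprecated and, as of pandas 2.0, removed. Group with groupby(level=...) and then aggregate instead.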
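A minimal sketch of the groupby replacement, assuming a Series with a two-level index. The level names key1 and key2 and the data are invented for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Older code wrote s.sum(level="key1"); the groupby form is the replacement
by_key1 = s.groupby(level="key1").sum()
print(by_key1)
```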

Some methods, such as idxmax and idxmin, return indirect statistics: the index label at which the maximum or minimum value is attained:

In [11]: df.idxmax()
Out[11]:
one    b
two    d
dtype: object

Other methods are accumulations:

In [12]: df.cumsum()
Out[12]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

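Some methods are neither reductions (mean, sum, and so on) nor accumulations (cumsum and so on). describe is one example: it produces several summary statistics in a single call: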

In [13]: df.describe()
Out[13]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

For non-numeric data, describe produces an alternative set of summary statistics:

In [14]: obj = pd.Series(["a", "a", "b", "c"] * 4)

In [15]: obj.describe()
Out[15]:
count     16
unique     3
top        a
freq       8
dtype: object

Correlation and Covariance

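Some summary statistics, such as correlation and covariance, are computed from pairs of arguments. The example below works on DataFrames of stock prices and volumes: corr and cov on a Series compare it with another Series, the DataFrame versions return the full matrix, and corrwith computes pairwise correlations against another Series or DataFrame: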

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: price = pd.read_pickle("examples/yahoo_price.pkl")

In [4]: volume = pd.read_pickle("examples/yahoo_volume.pkl")

In [5]: returns = price.pct_change()

In [6]: returns.tail()
Out[6]:
                AAPL      GOOG       IBM      MSFT
Date
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096

In [7]: returns["MSFT"].corr(returns["IBM"])
Out[7]: np.float64(0.4997636114415114)

In [8]: returns["MSFT"].cov(returns["IBM"])
Out[8]: np.float64(8.870655479703545e-05)

In [9]: returns.corr()
Out[9]:
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000

In [10]: returns.cov()
Out[10]:
          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215

In [11]: returns.corrwith(returns["IBM"])
Out[11]:
AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

In [12]: returns.corrwith(volume)
Out[12]:
AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64
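The pickle files above are not needed to see how the two statistics relate. The sketch below uses synthetic random data (an assumption, not the Yahoo data) to check that the Pearson correlation returned by corr equals the covariance divided by the product of the two standard deviations.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a returns table; column names X and Y are made up
rng = np.random.default_rng(0)
returns = pd.DataFrame(
    rng.standard_normal((250, 2)) * 0.01, columns=["X", "Y"]
)
# Make Y partly depend on X so the two columns are correlated
returns["Y"] = 0.5 * returns["X"] + returns["Y"]

cov_xy = returns["X"].cov(returns["Y"])
corr_xy = returns["X"].corr(returns["Y"])

# Pearson correlation is the covariance scaled by both standard deviations
manual = cov_xy / (returns["X"].std() * returns["Y"].std())
print(corr_xy, manual)
```

Both cov and std use the sample (n - 1) normalization by default, which is why the two results agree up to floating-point error.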

Unique Values, Value Counts, and Membership

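Another class of related methods extracts information about the values contained in a one-dimensional Series or a two-dimensional DataFrame.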

The first is unique, which returns an array of the distinct values in a Series:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

In [4]: uniques = obj.unique()

In [5]: uniques
Out[5]: array(['c', 'a', 'd', 'b'], dtype=object)

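unique returns the values in order of first appearance and does not sort them. If you need sorted output, sort the result afterwards, for example with uniques.sort(). Relatedly, value_counts returns a Series of value frequencies, sorted by count in descending order: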

In [6]: obj.value_counts()
Out[6]:
c    3
a    3
b    2
d    1
Name: count, dtype: int64

isin performs a vectorized membership check against a given set of values, which makes it handy for filtering a dataset down to a subset:

In [7]: obj
Out[7]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [8]: obj.isin(['b', 'c'])
Out[8]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [9]: obj[obj.isin(['b', 'c'])]
Out[9]:
0    c
5    b
6    b
7    c
8    c
dtype: object

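Index.get_indexer maps each value of an array to its integer position in another array of distinct values, producing an integer array of the same length. This is useful when aligning or joining data: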

In [10]: to_match = pd.Series(["c", "a", "b", "b", "c", "a"])

In [11]: unique_vals = pd.Series(["c", "b", "a"])

In [12]: indices = pd.Index(unique_vals).get_indexer(to_match)

In [13]: indices
Out[13]: array([0, 2, 1, 1, 0, 2])
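One detail worth knowing when aligning data this way: values that are absent from the index are reported as -1. The sketch below uses made-up data in which "z" has no match.

```python
import pandas as pd

unique_vals = pd.Index(["c", "b", "a"])
to_match = pd.Series(["c", "a", "z", "b"])

# Position of each value in unique_vals; -1 marks values that are not present
indices = unique_vals.get_indexer(to_match)
print(indices)

# Keep only the values that were actually found
found = to_match[indices != -1]
print(found)
```

Checking for -1 before using the positions avoids silently selecting the last element, which is what a -1 position would mean in ordinary indexing.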

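Sometimes you want a histogram across several related columns of a DataFrame. The example below counts the values of a single column first, then passes value_counts to apply to count every column, filling the combinations that never occur with 0: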

In [14]: data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4], "Qu2": [2, 3, 1, 2, 3], "Qu3": [1, 5, 2, 4, 4]})

In [15]: data
Out[15]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4

In [16]: data["Qu1"].value_counts().sort_index()
Out[16]:
Qu1
1    1
3    2
4    2
Name: count, dtype: int64

In [17]: result = data.apply(lambda x: x.value_counts()).fillna(0)

In [18]: result
Out[18]:
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0

DataFrame also has a value_counts method, but unlike the Series version it treats each row as a tuple and counts the occurrences of each distinct row:

In [19]: data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

In [20]: data
Out[20]:
   a  b
0  1  0
1  1  0
2  1  1
3  2  0
4  2  0

In [21]: data.value_counts()
Out[21]:
a  b
1  0    2
2  0    2
1  1    1
Name: count, dtype: int64
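As a small follow-up sketch, passing normalize=True to DataFrame.value_counts returns the share of rows for each distinct combination instead of the raw count. The name shares is illustrative.

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

# Fraction of rows for each distinct (a, b) combination
shares = data.value_counts(normalize=True)
print(shares)

# The result is indexed by a MultiIndex, so a row is looked up by its tuple
print(shares.loc[(1, 0)])
```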