Advanced pandas

Missing values (NaN)

>>> import pandas as pd
>>> import numpy as np
>>> d = pd.DataFrame(np.random.randn(5, 4))
>>> d
          0         1         2         3
0 -0.467109  0.155209 -0.352105 -0.145243
1 -1.000026 -0.718102  0.492405 -0.314355
2 -0.570243  0.815769  0.675293  0.059136
3 -1.209361  1.489262  0.977589 -0.538479
4 -0.980784  1.026060  0.021682 -0.824890
# scatter four NaNs at random positions (chained indexing)
>>> d[np.random.randint(4)][np.random.randint(5)] = np.nan
>>> d[np.random.randint(4)][np.random.randint(5)] = np.nan
>>> d[np.random.randint(4)][np.random.randint(5)] = np.nan
>>> d[np.random.randint(4)][np.random.randint(5)] = np.nan
>>> d
          0         1         2         3
0 -0.467109  0.155209 -0.352105       NaN
1 -1.000026 -0.718102  0.492405       NaN
2 -0.570243       NaN  0.675293  0.059136
3       NaN  1.489262  0.977589 -0.538479
4 -0.980784  1.026060  0.021682 -0.824890
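The chained indexing above (d[col][row] = ...) can trigger pandas' SettingWithCopyWarning and may silently write to a copy. A reproducible sketch that seeds the generator (seed 0 is an arbitrary choice) and assigns through .iloc instead:

```python
import numpy as np
import pandas as pd

# Seeded generator so the scattered NaNs are reproducible
rng = np.random.default_rng(0)
d = pd.DataFrame(rng.standard_normal((5, 4)))

# .iloc[row, col] writes directly into the frame, avoiding
# the chained-assignment pattern d[col][row] = ...
for _ in range(4):
    row, col = rng.integers(5), rng.integers(4)
    d.iloc[row, col] = np.nan

print(d.isnull().values.any())  # at least one NaN was inserted
```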

Checking for NaN

.isnull() returns a Boolean mask with the same index, True wherever a value is NaN

>>> d.isnull()
       0      1      2      3
0  False  False  False   True
1  False  False  False   True
2  False   True  False  False
3   True  False  False  False
4  False  False  False  False

.any(axis=0) reports whether each column contains a NaN; .any(axis=1) does the same for each row

>>> d.isnull()
       0      1      2      3
0  False  False  False   True
1  False  False  False   True
2  False   True  False  False
3   True  False  False  False
4  False  False  False  False
>>> d.isnull().any(axis=0)
0     True
1     True
2    False
3     True
dtype: bool
>>> d.isnull().any(axis=1)
0     True
1     True
2     True
3     True
4    False
dtype: bool

.isnull().values.any() checks whether any NaN exists anywhere in the DataFrame

# check whether any NaN exists in the whole frame
>>> d.isnull().values
array([[False, False, False,  True],
       [False, False, False,  True],
       [False,  True, False, False],
       [ True, False, False, False],
       [False, False, False, False]])
>>> d.isnull().values.any()
True
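Besides .any(), .isnull().sum() gives the NaN count per column, which is often more informative. A small sketch on a hand-built frame:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                  'b': [np.nan, np.nan, 6.0]})

print(d.isnull().sum())         # NaN count per column: a -> 1, b -> 2
print(d.isnull().sum().sum())   # total NaN count: 3
print(d.isnull().values.any())  # True: at least one NaN somewhere
```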

Handling NaN and other markers

  • Other placeholder markers (e.g. '?') can be replaced with .replace(to_replace, value, inplace=False)

    • to_replace: the marker to replace
    • value: the replacement value
    • inplace: whether to modify the DataFrame in place
    >>> d
               0         1          2         3
    0        NaN -0.117094   0.201809  0.093700
    1          ?  1.166718   0.833323  0.341343
    2   2.288671  0.075756   1.323798  0.703166
    3  -0.322933  1.890391        NaN  0.770076
    4  -0.324680       NaN  -0.030892         ?
    >>> d.replace('?', 1)
              0         1         2         3
    0       NaN -0.117094  0.201809  0.093700
    1  1.000000  1.166718  0.833323  0.341343
    2  2.288671  0.075756  1.323798  0.703166
    3 -0.322933  1.890391       NaN  0.770076
    4 -0.324680       NaN -0.030892  1.000000
  • For NaN itself, there are two ways to handle it

    • .dropna(inplace=False): drop rows that contain NaN

      >>> d.replace('?', 1)
                0         1         2         3
      0       NaN -0.117094  0.201809  0.093700
      1  1.000000  1.166718  0.833323  0.341343
      2  2.288671  0.075756  1.323798  0.703166
      3 -0.322933  1.890391       NaN  0.770076
      4 -0.324680       NaN -0.030892  1.000000
      # drop rows containing NaN
      >>> d.replace('?', 1).dropna()
                0         1         2         3
      1  1.000000  1.166718  0.833323  0.341343
      2  2.288671  0.075756  1.323798  0.703166
    • .fillna(value, inplace=False): fill NaN with value

      >>> d.replace('?', 1, inplace=True)
      >>> d
                0         1         2         3
      0       NaN -0.117094  0.201809  0.093700
      1  1.000000  1.166718  0.833323  0.341343
      2  2.288671  0.075756  1.323798  0.703166
      3 -0.322933  1.890391       NaN  0.770076
      4 -0.324680       NaN -0.030892  1.000000
      # fill the NaN at d[0][0] with the mean of column 0
      >>> d[0].fillna(d[0].mean(), inplace=True)
      >>> d
                0         1         2         3
      0  0.660264 -0.117094  0.201809  0.093700
      1  1.000000  1.166718  0.833323  0.341343
      2  2.288671  0.075756  1.323798  0.703166
      3 -0.322933  1.890391       NaN  0.770076
      4 -0.324680       NaN -0.030892  1.000000
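The per-column fill generalizes: passing a Series of column means to .fillna() fills each column with its own mean in one call. A minimal sketch on a hand-built frame:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                  'b': [4.0, 5.0, np.nan]})

# .mean() skips NaN by default, so each column's mean comes only
# from its observed values; .fillna(Series) matches by column label.
filled = d.fillna(d.mean())
print(filled)
#      a    b
# 0  1.0  4.0
# 1  2.0  5.0
# 2  3.0  4.5
```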

Data discretization

Data discretization divides a continuous value range into discrete intervals and represents each interval with a distinct symbol or number

Binning

  • Automatic binning: pandas.qcut(data, bins)

    >>> d
                 A          B           C           D
    c0   -5.011405  51.598027   30.736787 -155.574060
    c1  103.312304 -40.163837 -125.380597 -149.740251
    c2  -44.149836  40.952584   22.689684  -59.907136
    c3  115.680639  -7.897926   78.096531   14.463511
    c4   32.387194 -16.200097   57.731634   42.056148
    # bin d['C'] into 5 quantile-based groups
    >>> pd.qcut(d['C'], 5)
    c0      (27.518, 41.535]
    c1    (-125.382, -6.924]
    c2      (-6.924, 27.518]
    c3      (61.805, 78.097]
    c4      (41.535, 61.805]
    Name: C, dtype: category
    Categories (5, interval[float64]): [(-125.382, -6.924] < (-6.924, 27.518] < (27.518, 41.535] <
                                        (41.535, 61.805] < (61.805, 78.097]]
  • Custom binning: pandas.cut(data, bins)

    # values outside the outermost edges become NaN (c1 is -125.38 < -100)
    >>> pd.cut(d['C'], [-100, 0, 20, 30, 100])
    c0    (30.0, 100.0]
    c1              NaN
    c2     (20.0, 30.0]
    c3    (30.0, 100.0]
    c4    (30.0, 100.0]
    Name: C, dtype: category
    Categories (4, interval[int64]): [(-100, 0] < (0, 20] < (20, 30] < (30, 100]]
  • .value_counts() counts the number of values in each bin

    # automatic binning
    >>> pd.qcut(d['C'], 5).value_counts()
    (61.805, 78.097]      1
    (41.535, 61.805]      1
    (27.518, 41.535]      1
    (-6.924, 27.518]      1
    (-125.382, -6.924]    1
    Name: C, dtype: int64
    # custom binning
    >>> pd.cut(d['C'], [-100, 0, 20, 30, 100]).value_counts()
    (30, 100]    3
    (20, 30]     1
    (0, 20]      0
    (-100, 0]    0
    Name: C, dtype: int64
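On a deterministic series the difference between the two is easy to see: qcut places (roughly) equal counts in each bin, while cut splits the value range at exactly the edges you give it. A quick sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# qcut: quantile-based edges, so each of the 2 bins holds 5 values
print(pd.qcut(s, 2).value_counts())

# cut: fixed edges, so counts follow where the data happens to fall
# (bins hold 3 and 7 values here)
print(pd.cut(s, [0, 3, 10]).value_counts())
```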

Converting bins to one-hot encoding

In one-hot encoding, each category gets its own Boolean column; the value is true for samples that fall in that category and false otherwise

  • pandas.get_dummies(data, prefix=None) converts binned data into one-hot columns

    • prefix: a string prepended to each generated column name
    >>> pd.get_dummies(pd.qcut(d['C'], 5))
        (-125.382, -6.924]  (-6.924, 27.518]  (27.518, 41.535]  (41.535, 61.805]  (61.805, 78.097]
    c0                   0                 0                 1                 0                 0
    c1                   1                 0                 0                 0                 0
    c2                   0                 1                 0                 0                 0
    c3                   0                 0                 0                 0                 1
    c4                   0                 0                 0                 1                 0
    >>> pd.get_dummies(pd.qcut(d['C'], 5), prefix='section')
        section_(-125.382, -6.924]  section_(-6.924, 27.518]  ...  section_(61.805, 78.097]
    c0                           0                         0  ...                         0
    c1                           1                         0  ...                         0
    c2                           0                         1  ...                         0
    c3                           0                         0  ...                         1
    c4                           0                         0  ...                         0
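The same call works on any categorical column, not only binned intervals. A minimal sketch (note that newer pandas versions return boolean dummy columns, so astype(int) is used here to get a version-independent 0/1 display):

```python
import pandas as pd

s = pd.Series(['red', 'green', 'red'])

# one dummy column per distinct value, in sorted order
dummies = pd.get_dummies(s, prefix='color')
print(dummies.astype(int))
#    color_green  color_red
# 0            0          1
# 1            1          0
# 2            0          1
```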

Merging

Concatenating along an axis

  • pandas.concat([data1, data2], axis=0)

    >>> dd = pd.get_dummies(pd.qcut(d['C'], 5), prefix='section')
    # same rows, different columns: join side by side
    >>> pd.concat([d, dd], axis=1)
                 A          B           C  ...  section_(27.518, 41.535]  section_(41.535, 61.805]  section_(61.805, 78.097]
    c0   -5.011405  51.598027   30.736787  ...                         1                         0                         0
    c1  103.312304 -40.163837 -125.380597  ...                         0                         0                         0
    c2  -44.149836  40.952584   22.689684  ...                         0                         0                         0
    c3  115.680639  -7.897926   78.096531  ...                         0                         0                         1
    c4   32.387194 -16.200097   57.731634  ...                         0                         1                         0

    [5 rows x 9 columns]

    # stack vertically; columns the other frame lacks are filled with NaN
    >>> pd.concat([d, dd], axis=0)
                 A          B           C  ...  section_(27.518, 41.535]  section_(41.535, 61.805]  section_(61.805, 78.097]
    c0   -5.011405  51.598027   30.736787  ...                       NaN                       NaN                       NaN
    c1  103.312304 -40.163837 -125.380597  ...                       NaN                       NaN                       NaN
    c2  -44.149836  40.952584   22.689684  ...                       NaN                       NaN                       NaN
    c3  115.680639  -7.897926   78.096531  ...                       NaN                       NaN                       NaN
    c4   32.387194 -16.200097   57.731634  ...                       NaN                       NaN                       NaN
    c0         NaN        NaN         NaN  ...                       1.0                       0.0                       0.0
    c1         NaN        NaN         NaN  ...                       0.0                       0.0                       0.0
    c2         NaN        NaN         NaN  ...                       0.0                       0.0                       0.0
    c3         NaN        NaN         NaN  ...                       0.0                       0.0                       1.0
    c4         NaN        NaN         NaN  ...                       0.0                       1.0                       0.0
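The axis behaviour above can be reproduced on two tiny frames. A deterministic sketch:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['r0', 'r1'])
b = pd.DataFrame({'y': [3, 4]}, index=['r0', 'r1'])

# axis=1: same index, different columns -> joined side by side
wide = pd.concat([a, b], axis=1)
print(wide.shape)  # (2, 2)

# axis=0: stacked vertically; non-shared columns are filled with NaN
tall = pd.concat([a, b], axis=0)
print(tall.shape)                # (4, 2)
print(tall['y'].isnull().sum())  # 2: the rows that came from `a` have no 'y'
```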

Merging on keys

  • pd.merge(left, right, how='inner', on=None)
    • left: the left table
    • right: the right table
    • how: the join type ('inner' by default)
    • on: the key column(s) to join on

inner

pd.merge(left, right, on=['key1', 'key2'])

By default merge performs an inner join on key1 and key2, keeping only key combinations that appear in both tables: the left-hand columns (A, B) come from the left table, the right-hand columns (C, D) from the right table, and the key columns (key1, key2) hold the shared keys

left

pd.merge(left, right, how='left', on=['key1', 'key2'])

A left join keeps every row of the left table; where the right table has a match its values are filled in, and where it does not the cells become NaN

right

pd.merge(left, right, how='right', on=['key1', 'key2'])

A right join keeps every row of the right table; where the left table has a match its values are filled in, and where it does not the cells become NaN

outer

pd.merge(left, right, how='outer', on=['key1', 'key2'])

An outer join keeps every key that appears in either table; cells with no match on the other side are filled with NaN
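The join types can be compared side by side on a small pair of tables with a single shared key. A sketch:

```python
import pandas as pd

left = pd.DataFrame({'key': ['k0', 'k1'], 'A': [1, 2]})
right = pd.DataFrame({'key': ['k1', 'k2'], 'C': [3, 4]})

# inner (default): only keys present in both tables survive
print(pd.merge(left, right, on='key'))            # 1 row (k1)

# left: all left keys kept; missing right-side values become NaN
print(pd.merge(left, right, how='left', on='key'))   # 2 rows (k0, k1)

# outer: union of keys from both sides
print(pd.merge(left, right, how='outer', on='key'))  # 3 rows (k0, k1, k2)
```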

Crosstabs and pivot tables

  • Crosstab: counts how one column's values are distributed across another column's groups (revealing the relationship between two columns)

    • pandas.crosstab(value1, value2)
      • value1: the first column
      • value2: the second column
  • Pivot table: a more convenient alternative to the crosstab

    • DataFrame.pivot_table([], index=[])
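A quick sketch of pd.crosstab counting co-occurrences of two columns (the city/result data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'city':   ['NY', 'NY', 'LA', 'LA'],
                   'result': ['win', 'loss', 'win', 'win']})

# rows are cities, columns are results, cells are counts
ct = pd.crosstab(df['city'], df['result'])
print(ct)
# result  loss  win
# city
# LA         0    2
# NY         1    1
```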

Groupby and aggregation

Group first, then aggregate; a grouped result is only materialized once an aggregation is applied

  • Grouping
    • DataFrame.groupby(by=key, as_index=False)
      • key: the column(s) to group by; multiple keys are allowed
  • Aggregation
    • any of the statistical functions (mean, sum, count, ...)
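A minimal group-then-aggregate sketch (the team/score data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'team':  ['a', 'a', 'b', 'b'],
                   'score': [1, 2, 3, 5]})

# group by 'team', then aggregate each group with a statistic
means = df.groupby('team')['score'].mean()
print(means)  # a -> 1.5, b -> 4.0

# as_index=False keeps 'team' as an ordinary column instead of the index
flat = df.groupby('team', as_index=False)['score'].mean()
print(flat.shape)  # (2, 2)
```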
  • Copyright © 2021 lzeroyuee. Unless otherwise stated, all articles on this blog are the author's own work; please credit the source when reposting.