当前位置:首页 » 《资源分享》 » 正文

面对小白的pandas命令手册+练习题【三万字详解】_AI19970205的博客

11 人参与  2021年10月28日 08:43  分类 : 《资源分享》  评论

点击全文阅读


大家好,我是辣条。

Pandas 是Python的核心数据分析支持库,提供了快速、灵活、明确的数据结构,旨在简单、直观地处理关系型、标记型数据。Pandas常用于处理带行列标签的矩阵数据、与 SQL 或 Excel 表类似的表格数据,应用于金融、统计、社会科学、工程等领域里的数据整理与清洗、数据分析与建模、数据可视化与制表等工作。

练习题索引

习题编号内容相应数据集
练习1 - 开始了解你的数据探索Chipotle快餐数据chipotle.tsv
练习2 - 数据过滤与排序探索2012欧洲杯数据Euro2012_stats.csv
[练习3 - 数据分组]探索酒类消费数据drinks.csv
[练习4 -Apply函数]探索1960 - 2014 美国犯罪数据US_Crime_Rates_1960_2014.csv
[练习5 - 合并]探索虚拟姓名数据练习中手动内置的数据
[练习6 - 统计探索风速数据wind.data
[练习7 - 可视化]探索泰坦尼克灾难数据train.csv
[练习8 - 创建数据框探索Pokemon数据练习中手动内置的数据
[练习9 - 时间序列探索Apple公司股价数据Apple_stock.csv
[练习10 - 删除数据探索Iris纸鸢花数据iris.csv

练习1-开始了解你的数据

探索Chipotle快餐数据

image description

步骤1 导入必要的库

In [7]:

# 运行以下代码
import pandas as pd
 

步骤2 从如下地址导入数据集

In [5]:

# 运行以下代码
path1 = "./exercise_data/chipotle.tsv"    # chipotle.tsv

步骤3 将数据集存入一个名为chipo的数据框内

In [8]:

# 运行以下代码
chipo = pd.read_csv(path1, sep = '\t')

步骤4 查看前10行内容

In [9]:

# 运行以下代码
chipo.head(10)

Out[9]:

order_idquantityitem_namechoice_descriptionitem_price
011Chips and Fresh Tomato SalsaNaN$2.39
111Izze[Clementine]$3.39
211Nantucket Nectar[Apple]$3.39
311Chips and Tomatillo-Green Chili SalsaNaN$2.39
422Chicken Bowl[Tomatillo-Red Chili Salsa (Hot), [Black Beans...$16.98
531Chicken Bowl[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...$10.98
631Side of ChipsNaN$1.69
741Steak Burrito[Tomatillo Red Chili Salsa, [Fajita Vegetables...$11.75
841Steak Soft Tacos[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...$9.25
951Steak Burrito[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...$9.25

步骤6 数据集中有多少个列(columns)

In [236]:

# 运行以下代码
chipo.shape[1]

Out[236]:

5

步骤7 打印出全部的列名称

In [237]:

# 运行以下代码
chipo.columns

Out[237]:

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

步骤8 数据集的索引是怎样的

In [238]:

# 运行以下代码
chipo.index

Out[238]:

RangeIndex(start=0, stop=4622, step=1)

步骤9 被下单数最多商品(item)是什么?

In [239]:

# 运行以下代码,做了修正
c = chipo[['item_name','quantity']].groupby(['item_name'],as_index=False).agg({'quantity':sum})
c.sort_values(['quantity'],ascending=False,inplace=True)
c.head()

Out[239]:

item_namequantity
17Chicken Bowl761
18Chicken Burrito591
25Chips and Guacamole506
39Steak Burrito386
10Canned Soft Drink351

步骤10 在item_name这一列中,一共有多少种商品被下单?

In [240]:

# 运行以下代码
chipo['item_name'].nunique()

Out[240]:

50

步骤11 在choice_description中,下单次数最多的商品是什么?

In [241]:

# 运行以下代码,存在一些小问题
chipo['choice_description'].value_counts().head()

Out[241]:

[Diet Coke]                                                                          134
[Coke]                                                                               123
[Sprite]                                                                              77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]]                42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]]     40
Name: choice_description, dtype: int64

步骤12 一共有多少商品被下单?

In [242]:

# 运行以下代码
total_items_orders = chipo['quantity'].sum()
total_items_orders

Out[242]:

4972

步骤13 将item_price转换为浮点数

In [243]:

# 运行以下代码
dollarizer = lambda x: float(x[1:-1])
chipo['item_price'] = chipo['item_price'].apply(dollarizer)

步骤14 在该数据集对应的时期内,收入(revenue)是多少

In [244]:

# 运行以下代码,已经做更正
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'],2)
chipo['sub_total'].sum()

Out[244]:

39237.02

步骤15 在该数据集对应的时期内,一共有多少订单?

In [245]:

# 运行以下代码
chipo['order_id'].nunique()

Out[245]:

1834

步骤16 每一单(order)对应的平均总价是多少?

In [246]:

# 运行以下代码,已经做过更正
chipo[['order_id','sub_total']].groupby(by=['order_id']
).agg({'sub_total':'sum'})['sub_total'].mean()

Out[246]:

21.39423118865867

步骤17 一共有多少种不同的商品被售出?

In [247]:

# 运行以下代码
chipo['item_name'].nunique()

Out[247]:

练习2-数据过滤与排序

探索2012欧洲杯数据

image description

步骤1 - 导入必要的库

In [248]:

# 运行以下代码
import pandas as pd

步骤2 - 从以下地址导入数据集

In [249]:

# 运行以下代码
path2 = "./exercise_data/Euro2012_stats.csv"      # Euro2012_stats.csv

步骤3 - 将数据集命名为euro12

In [250]:

# 运行以下代码
euro12 = pd.read_csv(path2)
euro12

Out[250]:

TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)Hit WoodworkPenalty goalsPenalties not scored...Saves madeSaves-to-shots ratioFouls WonFouls ConcededOffsidesYellow CardsRed CardsSubs onSubs offPlayers Used
0Croatia4131251.9%16.0%32000...1381.3%41622909916
1Czech Republic4131841.9%12.9%39000...960.1%5373870111119
2Denmark4101050.0%20.0%27100...1066.7%25388407715
3England5111850.0%17.2%40000...2288.1%4345650111116
4France3222437.9%6.5%65100...654.6%3651560111119
5Germany10323247.8%15.6%80210...1062.6%63491240151517
6Greece581830.7%19.2%32111...1365.1%67481291121220
7Italy6344543.0%7.5%110200...2074.1%1018916160181819
8Netherlands2123625.0%4.1%60200...1270.6%35303507715
9Poland2152339.4%5.2%48000...666.7%48563717717
10Portugal6224234.3%9.3%82600...1071.5%739010120141416
11Republic of Ireland171236.8%5.2%28000...1765.4%43511161101017
12Russia593122.5%12.5%59200...1077.0%34434607716
13Spain12423355.9%16.0%100010...1593.8%1028319110171718
14Sweden5171947.2%13.8%39300...861.6%35517709918
15Ukraine272621.2%6.0%38000...1376.5%48314509918

16 rows × 35 columns

步骤4 只选取 Goals 这一列

In [251]:

# 运行以下代码
euro12.Goals

Out[251]:

0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64

步骤5 有多少球队参与了2012欧洲杯?

In [252]:

# 运行以下代码
euro12.shape[0]

Out[252]:

16

步骤6 该数据集中一共有多少列(columns)?

In [253]:

# 运行以下代码
euro12.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
Team                          16 non-null object
Goals                         16 non-null int64
Shots on target               16 non-null int64
Shots off target              16 non-null int64
Shooting Accuracy             16 non-null object
% Goals-to-shots              16 non-null object
Total shots (inc. Blocked)    16 non-null int64
Hit Woodwork                  16 non-null int64
Penalty goals                 16 non-null int64
Penalties not scored          16 non-null int64
Headed goals                  16 non-null int64
Passes                        16 non-null int64
Passes completed              16 non-null int64
Passing Accuracy              16 non-null object
Touches                       16 non-null int64
Crosses                       16 non-null int64
Dribbles                      16 non-null int64
Corners Taken                 16 non-null int64
Tackles                       16 non-null int64
Clearances                    16 non-null int64
Interceptions                 16 non-null int64
Clearances off line           15 non-null float64
Clean Sheets                  16 non-null int64
Blocks                        16 non-null int64
Goals conceded                16 non-null int64
Saves made                    16 non-null int64
Saves-to-shots ratio          16 non-null object
Fouls Won                     16 non-null int64
Fouls Conceded                16 non-null int64
Offsides                      16 non-null int64
Yellow Cards                  16 non-null int64
Red Cards                     16 non-null int64
Subs on                       16 non-null int64
Subs off                      16 non-null int64
Players Used                  16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB

步骤7 将数据集中的列Team, Yellow Cards和Red Cards单独存为一个名叫discipline的数据框

In [254]:

# 运行以下代码
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline

Out[254]:

TeamYellow CardsRed Cards
0Croatia90
1Czech Republic70
2Denmark40
3England50
4France60
5Germany40
6Greece91
7Italy160
8Netherlands50
9Poland71
10Portugal120
11Republic of Ireland61
12Russia60
13Spain110
14Sweden70
15Ukraine50

步骤8 对数据框discipline按照先Red Cards再Yellow Cards进行排序

In [255]:

# 运行以下代码
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending = False)

Out[255]:

TeamYellow CardsRed Cards
6Greece91
9Poland71
11Republic of Ireland61
7Italy160
10Portugal120
13Spain110
0Croatia90
1Czech Republic70
14Sweden70
4France60
12Russia60
3England50
8Netherlands50
15Ukraine50
2Denmark40
5Germany40

步骤9 计算每个球队拿到的黄牌数的平均值

In [256]:

# 运行以下代码
round(discipline['Yellow Cards'].mean())

Out[256]:

7.0

步骤10 找到进球数Goals超过6的球队数据

In [257]:

# 运行以下代码
euro12[euro12.Goals > 6]

Out[257]:

TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)Hit WoodworkPenalty goalsPenalties not scored...Saves madeSaves-to-shots ratioFouls WonFouls ConcededOffsidesYellow CardsRed CardsSubs onSubs offPlayers Used
5Germany10323247.8%15.6%80210...1062.6%63491240151517
13Spain12423355.9%16.0%100010...1593.8%1028319110171718

2 rows × 35 columns

步骤11 选取以字母G开头的球队数据

In [258]:

# 运行以下代码
euro12[euro12.Team.str.startswith('G')]

Out[258]:

TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)Hit WoodworkPenalty goalsPenalties not scored...Saves madeSaves-to-shots ratioFouls WonFouls ConcededOffsidesYellow CardsRed CardsSubs onSubs offPlayers Used
5Germany10323247.8%15.6%80210...1062.6%63491240151517
6Greece581830.7%19.2%32111...1365.1%67481291121220

2 rows × 35 columns

步骤12 选取前7列

In [259]:

# 运行以下代码
euro12.iloc[: , 0:7]

Out[259]:

TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)
0Croatia4131251.9%16.0%32
1Czech Republic4131841.9%12.9%39
2Denmark4101050.0%20.0%27
3England5111850.0%17.2%40
4France3222437.9%6.5%65
5Germany10323247.8%15.6%80
6Greece581830.7%19.2%32
7Italy6344543.0%7.5%110
8Netherlands2123625.0%4.1%60
9Poland2152339.4%5.2%48
10Portugal6224234.3%9.3%82
11Republic of Ireland171236.8%5.2%28
12Russia593122.5%12.5%59
13Spain12423355.9%16.0%100
14Sweden5171947.2%13.8%39
15Ukraine272621.2%6.0%38

步骤13 选取除了最后3列之外的全部列

In [260]:

# 运行以下代码
euro12.iloc[: , :-3]

Out[260]:

TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)Hit WoodworkPenalty goalsPenalties not scored...Clean SheetsBlocksGoals concededSaves madeSaves-to-shots ratioFouls WonFouls ConcededOffsidesYellow CardsRed Cards
0Croatia4131251.9%16.0%32000...01031381.3%4162290
1Czech Republic4131841.9%12.9%39000...1106960.1%5373870
2Denmark4101050.0%20.0%27100...11051066.7%2538840
3England5111850.0%17.2%40000...22932288.1%4345650
4France3222437.9%6.5%65100...175654.6%3651560
5Germany10323247.8%15.6%80210...11161062.6%63491240
6Greece581830.7%19.2%32111...12371365.1%67481291
7Italy6344543.0%7.5%110200...21872074.1%1018916160
8Netherlands2123625.0%4.1%60200...0951270.6%3530350
9Poland2152339.4%5.2%48000...083666.7%4856371
10Portugal6224234.3%9.3%82600...21141071.5%739010120
11Republic of Ireland171236.8%5.2%28000...02391765.4%43511161
12Russia593122.5%12.5%59200...0831077.0%3443460
13Spain12423355.9%16.0%100010...5811593.8%1028319110
14Sweden5171947.2%13.8%39300...1125861.6%3551770
15Ukraine272621.2%6.0%38000...0441376.5%4831450

16 rows × 32 columns

步骤14 找到英格兰(England)、意大利(Italy)和俄罗斯(Russia)的射正率(Shooting Accuracy)

In [261]:

# 运行以下代码
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]

Out[261]:

TeamShooting Accuracy
3England50.0%
7Italy43.0%
12Russia22.5%

练习3-数据分组

探索酒类消费数据

image description

步骤1 导入必要的库

In [262]:

# 运行以下代码
import pandas as pd

步骤2 从以下地址导入数据

In [10]:

# 运行以下代码
path3 ='./exercise_data/drinks.csv'    #'drinks.csv'

步骤3 将数据框命名为drinks

In [11]:

# 运行以下代码
drinks = pd.read_csv(path3)
drinks.head()

Out[11]:

countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0AS
1Albania89132544.9EU
2Algeria250140.7AF
3Andorra24513831212.4EU
4Angola21757455.9AF

步骤4 哪个大陆(continent)平均消耗的啤酒(beer)更多?

In [12]:

# 运行以下代码
drinks.groupby('continent').beer_servings.mean()

Out[12]:

continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64

步骤5 打印出每个大陆(continent)的红酒消耗(wine_servings)的描述性统计值

In [13]:

# 运行以下代码
drinks.groupby('continent').wine_servings.describe()

Out[13]:

countmeanstdmin25%50%75%max
continent
AF53.016.26415138.8464190.01.02.013.00233.0
AS44.09.06818221.6670340.00.01.08.00123.0
EU45.0142.22222297.4217380.059.0128.0195.00370.0
OC16.035.62500064.5557900.01.08.523.25212.0
SA12.062.41666788.6201891.03.012.098.50221.0

步骤6 打印出每个大陆每种酒类别的消耗平均值

In [15]:

# 运行以下代码
drinks.groupby('continent').mean()

Out[15]:

beer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcohol
continent
AF61.47169816.33962316.2641513.007547
AS37.04545560.8409099.0681822.170455
EU193.777778132.555556142.2222228.617778
OC89.68750058.43750035.6250003.381250
SA175.083333114.75000062.4166676.308333

步骤7 打印出每个大陆每种酒类别的消耗中位数

In [268]:

# 运行以下代码
drinks.groupby('continent').median()

Out[268]:

beer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcohol
continent
AF32.03.02.02.30
AS17.516.01.01.20
EU219.0122.0128.010.00
OC52.537.08.51.75
SA162.5108.512.06.85

步骤8 打印出每个大陆对spirit饮品消耗的平均值,最大值和最小值

In [269]:

# 运行以下代码
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])

Out[269]:

meanminmax
continent
AF16.3396230152
AS60.8409090326
EU132.5555560373
OC58.4375000254
SA114.75000025302

练习4-Apply函数

探索1960 - 2014 美国犯罪数据

image description

步骤1 导入必要的库

In [16]:

# 运行以下代码
import numpy as np
import pandas as pd

步骤2 从以下地址导入数据集

In [27]:

# 运行以下代码
path4 = './exercise_data/US_Crime_Rates_1960_2014.csv'    # "US_Crime_Rates_1960_2014.csv"

步骤3 将数据框命名为crime

In [28]:

# 运行以下代码
crime = pd.read_csv(path4)
crime.head()

Out[28]:

YearPopulationTotalViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
01960179323175338420028846030957009110171901078401543209121001855400328200
11961182992000348800028939031986008740172201066701567609496001913000336000
21962185771000375220030151034507008530175501108601645709943002089600366800
319631884830004109500316970379250086401765011647017421010864002297800408300
419641911410004564600364220420040093602142013039020305012132002514400472800

步骤4 每一列(column)的数据类型是什么样的?

In [29]:

# 运行以下代码
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null int64
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: int64(12)
memory usage: 5.2 KB

注意到了吗,Year的数据类型为 int64,但是pandas有一个不同的数据类型去处理时间序列(time series),我们现在来看看。

步骤5 将Year的数据类型转换为 datetime64

In [30]:

# 运行以下代码
crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null datetime64[ns]
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.2 KB

步骤6 将列Year设置为数据框的索引

In [31]:

# 运行以下代码
crime = crime.set_index('Year', drop = True)
crime.head()

Out[31]:

PopulationTotalViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
Year
1960-01-01179323175338420028846030957009110171901078401543209121001855400328200
1961-01-01182992000348800028939031986008740172201066701567609496001913000336000
1962-01-01185771000375220030151034507008530175501108601645709943002089600366800
1963-01-011884830004109500316970379250086401765011647017421010864002297800408300
1964-01-011911410004564600364220420040093602142013039020305012132002514400472800

步骤7 删除名为Total的列

In [32]:

# 运行以下代码
del crime['Total']
crime.head()

Out[32]:

PopulationViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
Year
1960-01-0117932317528846030957009110171901078401543209121001855400328200
1961-01-0118299200028939031986008740172201066701567609496001913000336000
1962-01-0118577100030151034507008530175501108601645709943002089600366800
1963-01-01188483000316970379250086401765011647017421010864002297800408300
1964-01-01191141000364220420040093602142013039020305012132002514400472800

In [33]:

crime.resample('10AS').sum()

Out[33]:

PopulationViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
Year
1960-01-0119150531754134930451609001061802367201633510215852013321100265477005292100
1970-01-0121211932989607930913838001922305545704159020470212028486000531578009739900
1980-01-0123713700691407432811704890020643986563953831097619130330734947204025311935411
1990-01-01261282525817527048119053499211664998827574893010568963267500157767936614624418
2000-01-0129479691171396805610094436916306892249942303668652124215651766797029111412834
2010-01-011570146307607201744095950728674210591749809376414210125170304016983569080

步骤8 按照Year对数据框进行分组并求和

*注意Population这一列,若直接对其求和,是不正确的**

In [34]:

# 更多关于 .resample 的介绍
# (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
# 更多关于 Offset Aliases的介绍 
# (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
# 运行以下代码
crimes = crime.resample('10AS').sum() # resample a time series per decades


# 用resample去得到“Population”列的最大值
population = crime['Population'].resample('10AS').max()

# 更新 "Population" 
crimes['Population'] = population

crimes

Out[34]:

PopulationViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
Year
1960-01-012013850004134930451609001061802367201633510215852013321100265477005292100
1970-01-012200990009607930913838001922305545704159020470212028486000531578009739900
1980-01-012482390001407432811704890020643986563953831097619130330734947204025311935411
1990-01-0127269081317527048119053499211664998827574893010568963267500157767936614624418
2000-01-013070065501396805610094436916306892249942303668652124215651766797029111412834
2010-01-01318857056607201744095950728674210591749809376414210125170304016983569080

步骤9 何时是美国历史上生存最危险的年代?

In [279]:

# 运行以下代码
crime.idxmax(0)

Out[279]:

Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]

练习5-合并

探索虚拟姓名数据

步骤1 导入必要的库

In [280]:

# 运行以下代码
import numpy as np
import pandas as pd

步骤2 按照如下的元数据内容创建数据框

In [281]:

# 运行以下代码
raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

步骤3 将上述的数据框分别命名为data1, data2, data3

In [282]:

# 运行以下代码
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])

步骤4 将data1data2两个数据框按照行的维度进行合并,命名为all_data

In [283]:

# 运行以下代码
all_data = pd.concat([data1, data2])
all_data

Out[283]:

subject_idfirst_namelast_name
01AlexAnderson
12AmyAckerman
23AllenAli
34AliceAoni
45AyoungAtiches
04BillyBonder
15BrianBlack
26BranBalwner
37BryceBrice
48BettyBtisan

步骤5 将data1data2两个数据框按照列的维度进行合并,命名为all_data_col

In [284]:

# 运行以下代码
all_data_col = pd.concat([data1, data2], axis = 1)
all_data_col

Out[284]:

subject_idfirst_namelast_namesubject_idfirst_namelast_name
01AlexAnderson4BillyBonder
12AmyAckerman5BrianBlack
23AllenAli6BranBalwner
34AliceAoni7BryceBrice
45AyoungAtiches8BettyBtisan

步骤6 打印data3

In [285]:

# 运行以下代码
data3

Out[285]:

subject_idtest_id
0151
1215
2315
3461
4516
5714
6815
791
81061
91116

步骤7 按照subject_id的值对all_datadata3作合并

In [286]:

# 运行以下代码
pd.merge(all_data, data3, on='subject_id')

Out[286]:

subject_idfirst_namelast_nametest_id
01AlexAnderson51
12AmyAckerman15
23AllenAli15
34AliceAoni61
44BillyBonder61
55AyoungAtiches16
65BrianBlack16
77BryceBrice14
88BettyBtisan15

步骤8 对data1data2按照subject_id作连接

In [287]:

# 运行以下代码
pd.merge(data1, data2, on='subject_id', how='inner')

Out[287]:

subject_idfirst_name_xlast_name_xfirst_name_ylast_name_y
04AliceAoniBillyBonder
15AyoungAtichesBrianBlack

步骤9 找到 data1data2 合并之后的所有匹配结果

In [288]:

# 运行以下代码
pd.merge(data1, data2, on='subject_id', how='outer')

Out[288]:

subject_idfirst_name_xlast_name_xfirst_name_ylast_name_y
01AlexAndersonNaNNaN
12AmyAckermanNaNNaN
23AllenAliNaNNaN
34AliceAoniBillyBonder
45AyoungAtichesBrianBlack
56NaNNaNBranBalwner
67NaNNaNBryceBrice
78NaNNaNBettyBtisan

练习6-统计

探索风速数据

image description

步骤1 导入必要的库

In [289]:

# 运行以下代码
import pandas as pd
import datetime

步骤2 从以下地址导入数据

In [290]:

import pandas as pd

In [35]:

# 运行以下代码
path6 = "./exercise_data/wind.data"  # wind.data

步骤3 将数据作存储并且设置前三列为合适的索引

In [292]:

import datetime

In [293]:

# 运行以下代码
data = pd.read_table(path6, sep = "\s+", parse_dates = [[0,1,2]]) 
data.head()

Out[293]:

Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
02061-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.04
12061-01-0214.71NaN10.836.5012.627.6711.5010.049.799.6717.5413.83
22061-01-0318.5016.8812.3310.1311.176.1711.25NaN8.507.6712.7512.71
32061-01-0410.586.6311.754.584.542.888.631.795.835.885.4610.88
42061-01-0513.3313.2511.426.1710.718.2111.926.5410.9210.3412.9211.83

步骤4 2061年?我们真的有这一年的数据?创建一个函数并用它去修复这个bug

In [294]:

# 运行以下代码
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply the function fix_century on the column and replace the values to the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)

# data.info()
data.head()

Out[294]:

Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
01961-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.04
11961-01-0214.71NaN10.836.5012.627.6711.5010.049.799.6717.5413.83
21961-01-0318.5016.8812.3310.1311.176.1711.25NaN8.507.6712.7512.71
31961-01-0410.586.6311.754.584.542.888.631.795.835.885.4610.88
41961-01-0513.3313.2511.426.1710.718.2111.926.5410.9210.3412.9211.83

步骤5 将日期设为索引,注意数据类型,应该是datetime64[ns]

In [295]:

# 运行以下代码
# transform Yr_Mo_Dy it to date type datetime64
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])

# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')

data.head()
# data.info()

Out[295]:

RPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
Yr_Mo_Dy
1961-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.04
1961-01-0214.71NaN10.836.5012.627.6711.5010.049.799.6717.5413.83
1961-01-0318.5016.8812.3310.1311.176.1711.25NaN8.507.6712.7512.71
1961-01-0410.586.6311.754.584.542.888.631.795.835.885.4610.88
1961-01-0513.3313.2511.426.1710.718.2111.926.5410.9210.3412.9211.83

步骤6 对应每一个location,一共有多少数据值缺失

In [296]:

# 运行以下代码
data.isnull().sum()

Out[296]:

RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64

步骤7 对应每一个location,一共有多少完整的数据值

In [297]:

# 运行以下代码
data.shape[0] - data.isnull().sum()

Out[297]:

RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64

步骤8 对于全体数据,计算风速的平均值

In [298]:

# 运行以下代码
data.mean().mean()

Out[298]:

10.227982360836924

步骤9 创建一个名为loc_stats的数据框去计算并存储每个location的风速最小值,最大值,平均值和标准差

In [299]:

# 运行以下代码
loc_stats = pd.DataFrame()

loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max 
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations

loc_stats

Out[299]:

minmaxmeanstd
RPT0.6735.8012.3629875.618413
VAL0.2133.3710.6443145.267356
ROS1.5033.8411.6605265.008450
KIL0.0028.466.3064683.605811
SHA0.1337.5410.4558344.936125
BIR0.0026.167.0922543.968683
DUB0.0030.379.7973434.977555
CLA0.0031.088.4950534.499449
MUL0.0025.888.4935904.166872
CLO0.0428.218.7073324.503954
BEL0.1342.3813.1210075.835037
MAL0.6742.5415.5990796.699794

步骤10 创建一个名为day_stats的数据框去计算并存储所有location的风速最小值,最大值,平均值和标准差

In [300]:

# 运行以下代码
# create the dataframe
day_stats = pd.DataFrame()

# this time we determine axis equals to one so it gets each row.
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max 
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations

day_stats.head()

Out[300]:

minmaxmeanstd
Yr_Mo_Dy
1961-01-019.2918.5013.0181822.808875
1961-01-026.5017.5411.3363643.188994
1961-01-036.1718.5011.6418183.681912
1961-01-041.7911.756.6191673.198126
1961-01-056.1713.3310.6300002.445356

步骤11 对于每一个location,计算一月份的平均风速

注意,1961年的1月和1962年的1月应该区别对待

In [301]:

# 运行以下代码
# creates a new column 'date' and gets the values from the index
data['date'] = data.index

# creates a column for each value from date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)

# gets all value from the month 1 and assign to janyary_winds
january_winds = data.query('month == 1')

# gets the mean from january_winds, using .loc to not print the mean of month, year and day
january_winds.loc[:,'RPT':"MAL"].mean()

Out[301]:

RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64

步骤12 对于数据记录按照年为频率取样

In [302]:

# 运行以下代码
data.query('month == 1 and day == 1')

Out[302]:

RPTVALROSKILSHABIRDUBCLAMULCLOBELMALdatemonthyearday
Yr_Mo_Dy
1961-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.041961-01-01119611
1962-01-019.293.4211.543.502.211.9610.412.793.545.174.387.921962-01-01119621
1963-01-0115.5913.6219.798.3812.2510.0023.4515.7113.5914.3717.5834.131963-01-01119631
1964-01-0125.8022.1318.2113.2521.2914.7914.1219.5813.2516.7528.9621.001964-01-01119641
1965-01-019.5411.929.004.386.085.2110.256.085.718.6312.0417.411965-01-01119651
1966-01-0122.0421.5017.0812.7522.1715.5921.7918.1216.6617.8328.3323.791966-01-01119661
1967-01-016.464.466.503.216.673.7911.383.837.719.0810.6720.911967-01-01119671
1968-01-0130.0417.8816.2516.2521.7912.5418.1616.6218.7517.6222.2527.291968-01-01119681
1969-01-016.131.635.411.082.541.008.502.424.586.349.1716.711969-01-01119691
1970-01-019.592.9611.793.426.134.089.004.467.293.507.3313.001970-01-01119701
1971-01-013.710.794.710.171.421.044.630.751.541.084.219.541971-01-01119711
1972-01-019.293.6314.544.256.754.4213.005.3310.048.548.7119.171972-01-01119721
1973-01-0116.5015.9214.627.418.2911.2113.547.7910.4610.7913.379.711973-01-01119731
1974-01-0123.2116.5416.089.7515.8311.469.5413.5413.8316.6617.2125.291974-01-01119741
1975-01-0114.0413.5411.295.4612.585.588.128.969.295.177.7111.631975-01-01119751
1976-01-0118.3417.6714.838.0016.6210.1313.179.0413.135.7511.3814.961976-01-01119761
1977-01-0120.0411.9220.259.139.298.0410.755.889.009.0014.8825.701977-01-01119771
1978-01-018.337.127.713.548.507.5014.7110.0011.8310.0015.0920.461978-01-01119781

步骤13 对于数据记录按照月为频率取样

In [303]:

# 运行以下代码
data.query('day == 1')

Out[303]:

RPTVALROSKILSHABIRDUBCLAMULCLOBELMALdatemonthyearday
Yr_Mo_Dy
1961-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.041961-01-01119611
1961-02-0114.2515.129.045.8812.087.1710.173.636.505.509.178.001961-02-01219611
1961-03-0112.6713.1311.796.429.798.5410.2513.29NaN12.2120.62NaN1961-03-01319611
1961-04-018.386.348.336.759.339.5411.678.2111.216.4611.967.171961-04-01419611
1961-05-0115.8713.8815.379.7913.4610.179.9614.049.759.9218.6311.121961-05-01519611
1961-06-0115.929.5912.048.7911.546.049.758.299.3310.3410.6712.121961-06-01619611
1961-07-017.216.837.714.428.464.796.716.005.797.966.968.711961-07-01719611
1961-08-019.595.095.544.638.295.254.215.255.375.418.389.081961-08-01819611
1961-09-015.581.134.963.044.252.254.632.713.676.004.795.411961-09-01919611
1961-10-0114.2512.877.878.0013.007.755.839.007.085.2911.794.041961-10-011019611
1961-11-0113.2113.1314.338.5412.1710.2113.0812.1710.9213.5420.1720.041961-11-011119611
1961-12-019.677.758.003.966.002.757.252.505.585.587.7911.171961-12-011219611
1962-01-019.293.4211.543.502.211.9610.412.793.545.174.387.921962-01-01119621
1962-02-0119.1213.9612.2110.5815.7110.6315.7111.0813.1712.6217.6722.711962-02-01219621
1962-03-018.214.839.004.836.002.217.961.874.083.924.085.411962-03-01319621
1962-04-0114.3312.2511.8710.3714.9211.0019.7911.6714.0915.4616.6223.581962-04-01419621
1962-05-019.629.543.583.338.753.752.252.581.672.377.293.251962-05-01519621
1962-06-015.886.298.675.215.004.255.915.414.799.255.2510.711962-06-01619621
1962-07-018.674.176.926.718.175.6611.179.388.7511.1210.2517.081962-07-01719621
1962-08-014.585.376.042.297.873.714.462.584.004.797.217.461962-08-01819621
1962-09-0110.0012.0810.969.259.297.627.418.757.679.6214.5811.921962-09-01919621
1962-10-0114.587.8319.2110.0811.548.3813.2910.638.2112.9218.0518.121962-10-011019621
1962-11-0116.8813.2516.008.9613.4611.4610.4610.1710.3713.2114.8315.161962-11-011119621
1962-12-0118.3815.4111.756.7912.218.048.4210.835.669.0811.5011.501962-12-011219621
1963-01-0115.5913.6219.798.3812.2510.0023.4515.7113.5914.3717.5834.131963-01-01119631
1963-02-0115.417.6224.6711.429.218.1714.047.547.5410.0810.1717.671963-02-01219631
1963-03-0116.7519.6717.678.8719.0815.3716.2114.2911.299.2119.9219.791963-03-01319631
1963-04-0110.549.5912.467.339.469.5911.7911.879.7910.7113.3718.211963-04-01419631
1963-05-0118.7914.1713.5911.6314.1711.9614.4612.4612.8713.9615.2921.621963-05-01519631
1963-06-0113.376.8712.008.5010.049.4210.9212.9611.7911.0410.9213.671963-06-01619631
...................................................
1976-07-018.501.756.582.132.752.215.372.045.884.504.9610.631976-07-01719761
1976-08-0113.008.388.635.8312.928.2513.009.4210.5811.3414.2120.251976-08-01819761
1976-09-0111.8711.007.386.877.758.3310.346.4610.179.2912.7519.551976-09-01919761
1976-10-0110.966.7110.414.637.585.045.045.546.503.926.795.001976-10-011019761
1976-11-0113.9615.6710.296.4612.799.0810.009.6710.2111.6323.0921.961976-11-011119761
1976-12-0113.4616.429.214.5410.758.6710.884.838.795.918.8313.671976-12-011219761
1977-01-0120.0411.9220.259.139.298.0410.755.889.009.0014.8825.701977-01-01119771
1977-02-0111.839.7111.004.258.588.716.175.668.297.5811.7116.501977-02-01219771
1977-03-018.6314.8310.293.756.638.795.008.127.876.4213.5413.671977-03-01319771
1977-04-0121.6716.0017.3313.5920.8315.9625.6217.6219.4120.6724.3730.091977-04-01419771
1977-05-016.427.128.673.584.584.006.756.133.334.5019.2112.381977-05-01519771
1977-06-017.085.259.712.832.213.505.291.422.000.925.215.631977-06-01619771
1977-07-0115.4116.2917.086.2511.8311.8312.2910.5810.417.2117.377.831977-07-01719771
1977-08-014.332.964.422.330.961.084.961.872.332.0410.509.831977-08-01819771
1977-09-0117.3716.3316.838.5814.4611.8315.0913.9213.2913.8823.2925.171977-09-01919771
1977-10-0116.7515.3412.259.4216.3811.3818.5013.9214.0914.4622.3429.671977-10-011019771
1977-11-0116.7111.5412.174.178.547.1711.126.468.256.2111.0415.631977-11-011119771
1977-12-0113.3710.9212.422.375.796.138.967.386.295.718.5412.421977-12-011219771
1978-01-018.337.127.713.548.507.5014.7110.0011.8310.0015.0920.461978-01-01119781
1978-02-0127.2524.2118.1617.4627.5418.0520.9625.0420.0417.5027.7121.121978-02-01219781
1978-03-0115.046.2116.047.876.426.6712.298.0010.589.335.4117.001978-03-01319781
1978-04-013.427.582.711.383.462.082.674.754.831.677.3313.671978-04-01419781
1978-05-0110.5412.219.085.2911.0010.0811.1713.7511.8711.7912.8727.161978-05-01519781
1978-06-0110.3711.426.466.0411.257.506.465.967.795.465.5010.411978-06-01619781
1978-07-0112.4610.6311.176.7512.929.0412.429.6212.088.0414.0416.171978-07-01719781
1978-08-0119.3315.0920.178.8312.6210.419.3312.339.509.9215.7518.001978-08-01819781
1978-09-018.426.139.875.253.215.717.253.507.336.507.6215.961978-09-01919781
1978-10-019.506.8310.503.886.134.584.216.506.386.5410.6314.091978-10-011019781
1978-11-0113.5916.7511.257.0811.048.338.1711.2910.7511.2523.1325.001978-11-011119781
1978-12-0121.2916.2924.0412.7918.2119.2921.5417.2116.7117.8317.7525.701978-12-011219781

216 rows × 16 columns

练习7-可视化

探索泰坦尼克灾难数据

步骤1 导入必要的库

In [304]:

# 运行以下代码
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

步骤2 从以下地址导入数据

In [36]:

# 运行以下代码
path7 = '../input/pandas_exercise/pandas_exercise/exercise_data/train.csv'  # train.csv

步骤3 将数据框命名为titanic

In [306]:

# 运行以下代码
titanic = pd.read_csv(path7)
titanic.head()

Out[306]:

PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

步骤4 将PassengerId设置为索引

In [307]:

# 运行以下代码
titanic.set_index('PassengerId').head()

Out[307]:

SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
503Allen, Mr. William Henrymale35.0003734508.0500NaNS

步骤5 绘制一个展示男女乘客比例的扇形图

In [308]:

# 运行以下代码
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()

# put them into a list called proportions
proportions = [males, females]

# Create a pie chart
plt.pie(
    # using proportions
    proportions,
    
    # with the labels being officer names
    labels = ['Males', 'Females'],
    
    # with no shadows
    shadow = False,
    
    # with colors
    colors = ['blue','red'],
    
    # with one slide exploded out
    explode = (0.15 , 0),
    
    # with the start angle at 90%
    startangle = 90,
    
    # with the percent listed as a fraction
    autopct = '%1.1f%%'
    )

# View the plot drop above
plt.axis('equal')

# Set labels
plt.title("Sex Proportion")

# View the plot
plt.tight_layout()
plt.show()

步骤6 绘制一个展示船票Fare, 与乘客年龄和性别的散点图

In [309]:

# 运行以下代码
# creates the plot using
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)

# set title
lm.set(title = 'Fare x Age')

# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)

Out[309]:

(-5, 85)

步骤7 有多少人生还?

In [310]:

# 运行以下代码
titanic.Survived.sum()

Out[310]:

342

步骤8 绘制一个展示船票价格的直方图

In [311]:

# 运行以下代码
# sort the values from the top to the least value and slice the first 5 items
df = titanic.Fare.sort_values(ascending = False)
df

# create bins interval using numpy
binsVal = np.arange(0,600,10)
binsVal

# create the plot
plt.hist(df, bins = binsVal)

# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Payed Histrogram')

# show the plot
plt.show()

行业资料:添加即可领取PPT模板、简历模板、行业经典书籍PDF。
面试题库:历年经典,热乎的大厂面试真题,持续更新中,添加获取。
学习资料:含Python、爬虫、数据分析、算法等学习视频和文档,添加获取
交流加群:大佬指点迷津,你的问题往往有人遇到过,技术互助交流。


点击全文阅读


本文链接:http://zhangshiyu.com/post/30199.html

步骤  运行  代码  
<< 上一篇 下一篇 >>

  • 评论(0)
  • 赞助本站

◎欢迎参与讨论,请在这里发表您的看法、交流您的观点。

关于我们 | 我要投稿 | 免责申明

Copyright © 2020-2022 ZhangShiYu.com Rights Reserved.豫ICP备2022013469号-1