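Column: 爱数据LoveData

中国统计网 (www.itongji.cn) is the largest data analysis portal in China. It offers data analysis industry news and an online learning platform covering statistics reference material, data analysis, business intelligence (BI), data mining techniques, and data analysis software such as Excel, SPSS, SAS, and R.

120 Data Processing Exercises | Pandas Edition

爱数据LoveData · WeChat Official Account · BI · 2021-03-08 16:30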


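salary
不限 19600.000000
大专 10000.000000
本科 19361.344538
硕士 20642.857143

Python solution

# Select the numeric salary column explicitly; pandas 2.x raises a TypeError
# when mean() is applied to non-numeric columns such as createTime
df.groupby('education')[['salary']].mean()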

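Time conversion

Task: Convert the createTime column to month-day format

Difficulty: ⭐⭐⭐

Expected output

Python solution

# Assumes a default RangeIndex and createTime (column 0) holding Timestamp values
for index, row in df.iterrows():
    df.iloc[index, 0] = df.iloc[index, 0].to_pydatetime().strftime("%m-%d")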
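
The row-by-row loop above can also be written as a single vectorized operation. The following is a minimal sketch, assuming createTime can be parsed by pd.to_datetime; the frame `demo` and its values are made up for illustration and are not the article's dataset.

```python
import pandas as pd

# Hypothetical sample data (not the article's dataset)
demo = pd.DataFrame({
    "createTime": ["2020-03-16 11:30:18", "2020-03-17 09:05:00"],
    "salary": [20000, 15000],
})

# Parse to datetime, then format every value as month-day in one step
demo["createTime"] = pd.to_datetime(demo["createTime"]).dt.strftime("%m-%d")
print(demo["createTime"].tolist())
```

The vectorized form avoids iterrows, which tends to be slow on larger frames, and it does not depend on the index being a default RangeIndex.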

Inspecting data

Task: View the index, data types, and memory information

Difficulty: ⭐

Expected output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 4 columns):
createTime 135 non-null object
education 135 non-null object
salary 135 non-null int64
categories 135 non-null category
dtypes: category(1), int64(1), object(2)
memory usage: 3.5+ KB

Python solution

df.info()

Inspecting data

Task: View summary statistics for the numeric columns

Difficulty: ⭐

Python solution

df.describe()

R solution

summary(df)

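Data wrangling

Task: Add a new column that splits the data into three groups based on salary

Difficulty: ⭐⭐⭐⭐

Input

Expected output

Python solution

bins = [0, 5000, 20000, 50000]
group_names = ['低', '中', '高']  # low, medium, high
df['categories'] = pd.cut(df['salary'], bins, labels=group_names)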
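
By default pd.cut uses right-closed intervals, so with the bins above a salary of exactly 5000 falls in (0, 5000] and a salary above 50000 receives no label. A small sketch with made-up salaries (the `demo` values are assumptions chosen to sit on the bin edges, not the article's data):

```python
import pandas as pd

bins = [0, 5000, 20000, 50000]
group_names = ['低', '中', '高']  # low, medium, high

# Hypothetical salaries chosen to sit on and beyond the bin edges
demo = pd.DataFrame({"salary": [5000, 5001, 20000, 45000, 60000]})
demo["categories"] = pd.cut(demo["salary"], bins, labels=group_names)

# Values outside (0, 50000] are left as NaN
print(demo)
```

If the dataset may contain salaries above the last edge, extending the final bin is one way to avoid unlabeled rows.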

Data wrangling

Task: Sort the data by the salary column in descending order

Difficulty: ⭐⭐

Python solution

df.sort_values('salary', ascending=False)

Data extraction

Task: Extract row 33

Difficulty: ⭐⭐

Python solution

df.iloc[32]

Data computation

Task: Compute the median of the salary column

Difficulty: ⭐⭐

Python solution

np.median(df['salary'])
# 17500.0

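Data visualization

Task: Plot a frequency histogram of salary levels

Difficulty: ⭐⭐⭐

Expected output

Python solution

import matplotlib.pyplot as plt

# In Jupyter, this magic command renders matplotlib figures inline
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese characters correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
plt.hist(df.salary)

# The built-in pandas plotting method also works
df.salary.plot(kind='hist')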

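Data visualization

Task: Plot a density curve of salary levels

Difficulty: ⭐⭐⭐

Expected output

Python solution

# kind='kde' requires SciPy to be installed
df.salary.plot(kind='kde', xlim=(0, 70000))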

Deleting data

Task: Delete the last column, categories

Difficulty: ⭐

Python solution

del df['categories']
# Equivalent alternative (run one or the other, not both)
df.drop(columns=['categories'], inplace=True)

Data processing

Task: Combine the first and second columns of df into a new column

Difficulty: ⭐⭐

Python solution

df['test'] = df['education'] + df['createTime']

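Data processing

Task: Combine the education column and the salary column into a new column

Difficulty: ⭐⭐⭐

Note: salary is an int column, so the operation differs slightly from exercise 35

Python solution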

df["test1"] = df["salary"].map(str) + df['education']
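
pandas cannot add an integer column to a string column directly, which is why the salary values are converted to strings first. A minimal sketch with made-up rows (`demo` is hypothetical), using astype(str) as an alternative to map(str):

```python
import pandas as pd

# Hypothetical sample rows (not the article's dataset)
demo = pd.DataFrame({"education": ["本科", "硕士"], "salary": [20000, 30000]})

# Convert the integer column to strings before concatenating
demo["test1"] = demo["salary"].astype(str) + demo["education"]
print(demo["test1"].tolist())
```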

Data computation

Task: Compute the difference between the maximum and minimum salary

Difficulty: ⭐⭐⭐

Python solution

df[['salary']].apply(lambda x: x.max() - x.min())
# salary 41500
# dtype: int64
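
The apply call above returns a one-element Series. When a plain number is wanted, the same range can be computed on the column itself. A sketch with made-up salaries (`demo` and `salary_range` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not the article's dataset)
demo = pd.DataFrame({"salary": [3500, 20000, 45000]})

# Series methods return a scalar instead of a one-element Series
salary_range = demo["salary"].max() - demo["salary"].min()

# np.ptp ("peak to peak") computes the same max - min on the underlying array
same_range = np.ptp(demo["salary"].to_numpy())
print(salary_range, same_range)
```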

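Data processing

Task: Concatenate the first row and the last row

Difficulty: ⭐⭐

Python solution

# Slices are zero-based: df[:1] is the first row, df[-1:] is the last row
pd.concat([df[:1], df[-1:]])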
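
Row slices are zero-based, so df[:1] is the first row and df[1:2] is the second. A short sketch on a hypothetical four-row frame, comparing the concat approach with positional selection through iloc:

```python
import pandas as pd

# Hypothetical sample data (not the article's dataset)
demo = pd.DataFrame({"salary": [1, 2, 3, 4]})

# Slices keep the DataFrame shape: first row and last row
by_concat = pd.concat([demo[:1], demo[-1:]])

# Positional selection of both rows in a single call
by_iloc = demo.iloc[[0, -1]]
print(by_concat)
```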

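Data processing

Task: Append row 8 to the end of the data

Difficulty: ⭐⭐

Python solution

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, df.iloc[[7]]])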
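
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same result is obtained with pd.concat. A sketch with a hypothetical `demo` frame:

```python
import pandas as pd

# Hypothetical sample data (not the article's dataset)
demo = pd.DataFrame({"salary": list(range(10))})

# demo.iloc[[7]] (double brackets) keeps row 8 as a one-row DataFrame
result = pd.concat([demo, demo.iloc[[7]]], ignore_index=True)
print(result.tail(2))
```

Here ignore_index=True renumbers the rows; without it the appended row keeps its original label 7.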

Inspecting data

Task: View the data type of each column

Difficulty: ⭐

Expected result

createTime object
education object
salary int64
test object
test1 object
dtype: object

Python solution

df.dtypes
# createTime object
# education object
# salary int64
# test object
# test1 object
# dtype: object

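Data processing

Task: Set the createTime column as the index

Difficulty: ⭐⭐

Python solution

# set_index returns a new DataFrame; df itself is unchanged unless reassigned
df.set_index("createTime")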
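
set_index returns a new DataFrame and leaves the original untouched, which matters for the later exercises that still rely on the default integer index. A minimal sketch with hypothetical data (`demo` and `indexed` are illustrative names):

```python
import pandas as pd

# Hypothetical sample data (not the article's dataset)
demo = pd.DataFrame({"createTime": ["03-16", "03-17"], "salary": [20000, 15000]})

indexed = demo.set_index("createTime")  # returns a new DataFrame

print(demo.columns.tolist())   # demo still has createTime as a column
print(indexed.index.name)      # the new frame is indexed by createTime
```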

Creating data

Task: Generate a DataFrame of random numbers with the same length as df

Difficulty: ⭐⭐

Python solution

df1 = pd.DataFrame(pd.Series(np.random.randint(1, 10, 135)))

Data processing

Task: Merge the DataFrame generated in the previous exercise with df

Difficulty: ⭐⭐

Python solution

df = pd.concat([df, df1], axis=1)

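Data computation

Task: Create a new column, new, equal to the salary column minus the random-number column generated earlier

Difficulty: ⭐⭐

Python solution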

df["new"] = df["salary"] - df[0]
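
The random-number frame has no column name, so after the merge it appears under the integer label 0, which is why the subtraction uses df[0]. A sketch of the same three steps on a hypothetical `demo` frame; the column name `rand` and the fixed seed are assumptions added for readability and repeatability:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not the article's dataset)
demo = pd.DataFrame({"salary": [20000, 15000, 30000]})

rng = np.random.default_rng(0)  # seeded generator (assumption, for repeatability)
# integers(1, 10) draws from 1..9, like np.random.randint(1, 10, ...)
rand = pd.DataFrame({"rand": rng.integers(1, 10, size=len(demo))})

demo = pd.concat([demo, rand], axis=1)
demo["new"] = demo["salary"] - demo["rand"]
print(demo)
```

Using len(demo) instead of a hard-coded row count keeps the two frames aligned if the data changes.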

Handling missing values

Task: Check whether the data contains any missing values

Difficulty: ⭐⭐⭐

Python solution

df.isnull().values.any()
# False
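
The check above answers only yes or no. When the result is True, a per-column count shows where the gaps are. A minimal sketch with a hypothetical frame that contains missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps (not the article's dataset)
demo = pd.DataFrame({"education": ["本科", None], "salary": [20000.0, np.nan]})

has_missing = demo.isnull().values.any()   # True if any cell is missing
missing_per_column = demo.isnull().sum()   # number of missing cells per column
print(has_missing)
print(missing_per_column)
```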

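Data conversion

Task: Convert the salary column to floating point

Difficulty: ⭐⭐⭐

Python solution

# astype returns a new Series; assign it back to df['salary'] to keep the change
df['salary'].astype(np.float64)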
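
astype does not modify the column in place, so the converted Series has to be assigned back for the change to persist. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical sample data (not the article's dataset)
demo = pd.DataFrame({"salary": [20000, 15000]})

demo["salary"].astype("float64")         # result discarded, column is unchanged
dtype_before = demo["salary"].dtype

demo["salary"] = demo["salary"].astype("float64")   # assign back to keep it
dtype_after = demo["salary"].dtype
print(dtype_before, dtype_after)
```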

Data computation

Task: Count how many salary values are greater than 10000

Difficulty: ⭐⭐

Python solution

len(df[df['salary'] > 10000])
# 119
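
The same count can be obtained without building the filtered DataFrame, by summing the boolean mask. A small sketch with made-up salaries (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical sample data (not the article's dataset)
demo = pd.DataFrame({"salary": [8000, 10000, 12000, 25000]})

# True counts as 1, so summing the boolean mask counts the matching rows
count = (demo["salary"] > 10000).sum()
print(count)
```

The comparison is strict, so a salary of exactly 10000 is not counted.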

Data statistics

Task: Count how many times each education level appears

Difficulty: ⭐⭐⭐

Expected output

本科 119
硕士 7
不限 5