新闻中心

商场购物数据分析(购物中心经营指标)

2023-09-30
浏览次数:
返回列表

项目背景介绍

无论您是从精心策划的购物清单中选择购物,还是让奇思妙想随心选择物品,我们每一个人都有独特的购物爱好。

下面是超过3百万的订单信息,使用这些数据来分析用户对产品的购买喜好,并建立预测用户再次购买产品的模型,并在用户购物时向用户推荐商品。

通过分析这些数据以及训练有效的预测算法,发现消费者购买杂货的意愿并更新自身的产品。

数据分析准备

表关系图

说明:

aisles: 产品摆放的分类

order_products__prior: 订单--产品关联表

orders.csv: 用户下单记录表。

products.csv: 产品ID分类,及其摆放位置的关系表

departments.csv: 产品分类表

预测目标

通过用户历史购买的记录信息,制定推荐策略。

导库

import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns import time %matplotlib inline plt.rcParams[font.sans-serif] = [FangSong] # 指定默认字体 plt.rcParams[axes.unicode_minus] = False # 解决保存图像是负号-显示为方块的问题

加载数据

# 订单表 data_orders=pd.read_csv(data/orders.csv) # 订单——产品表 data_order_products=pd.read_csv(data/order_products_prior.csv) # sample表 data_samples = pd.read_csv(data/sample_submission.csv) # 产品表 data_products = pd.read_csv(./data/products.csv) # 类别表 data_aisles = pd.read_csv(./data/aisles.csv) # 部门表 data_departments = pd.read_csv(./data/departments.csv)

单表初步分析

data_orders

查看表结构

#查看表单字段 data_orders.head(5)

order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order 0 2539329 1 prior 1 2 8 NaN 1 2398795 1 prior 2 3 7 15.0 2 473747 1 prior 3 3 12 21.0 3 2254736 1 prior 4 4 7 29.0 4 431534 1 prior 5 4 15 28.0

字段含义:

order_id 订单id

user_id 用户id

eval_set ------

order_number 订单量

order_dow 一周中星期几

order_hour_of_day 一天中小时

days_since_prior_order 多次购买间隔天数

#查看数据维度 data_orders.shape (3421083, 7) #查看数据缺失字段 print(data_orders.isnull().any()) #计算缺失比例 data_orders[data_orders[days_since_prior_order].isnull()].shape[0]/data_orders.shape[0]order_id False user_id False eval_set False order_number False order_dow False order_hour_of_day False days_since_prior_order True dtype: bool 0.06027594185817766

单表初步分析

#正太分布的检验函数 import numpy as np def norm(x): mu=np.mean(x) var=np.sum((x-mu)**2) sigma=var**(1/2) s=-(x-mu)**2/2*var up=np.exp(s) down=(2*np.pi)**(1/2)*sigma distribution=up/down return distribution

购买欲分析

#那些客户的购买欲望最强 buy_times=data_orders.groupby([user_id]).size().sort_values(ascending=False).sort_values() print(buy_times.max()) print(buy_times.min()) 100 4 #提取出每个人的购买次数,对购买次数进行统计 buy_times.values array([ 4, 4, 4, ..., 100, 100, 100], dtype=int64) def count_times(num): dic={} for i in num: if i in dic: dic[i]+=1 else: dic[i]=1 return dic dic=count_times(buy_times.values) print(dic) keys=[] values=[] for key, value in dic.items(): keys.append(key) values.append(value) plt.plot(keys,values) plt.show(){4: 23986, 5: 19590, 6: 16165, 7: 13850, 8: 11700, 9: 10190, 10: 9032, 11: 7815, 12: 6952, 13: 6236, 14: 5613, 15: 5191, 16: 4593, 17: 4139, 18: 3836, 19: 3390, 20: 3200, 21: 2921, 22: 2706, 23: 2605, 24: 2288, 25: 2186, 26: 2060, 27: 1908, 28: 1779, 29: 1601, 30: 1494, 31: 1474, 32: 1386, 33: 1273, 34: 1186, 35: 1157, 36: 1015, 37: 1048, 38: 960, 39: 904, 40: 878, 41: 885, 42: 814, 43: 743, 44: 724, 45: 707, 46: 657, 47: 675, 48: 654, 49: 579, 50: 554, 51: 582, 52: 527, 53: 524, 54: 499, 55: 425, 56: 390, 57: 396, 58: 343, 59: 332, 60: 303, 61: 324, 62: 276, 63: 267, 64: 254, 65: 239, 66: 238, 67: 204, 68: 189, 69: 185, 70: 178, 71: 154, 72: 166, 73: 138, 74: 162, 75: 162, 76: 134, 77: 120, 78: 127, 79: 127, 80: 116, 81: 108, 82: 119, 83: 86, 84: 99, 85: 100, 86: 93, 87: 78, 88: 79, 89: 77, 90: 85, 91: 56, 92: 69, 93: 67, 94: 57, 95: 64, 96: 67, 97: 54, 98: 50, 99: 47, 100: 1374}#找出下单次数80以上的id,认为购买欲强 print(buy_times[buy_times.values>=80].index)Int64Index([ 34117, 21167, 20320, 1378, 165479, 34227, 38279, 183151, 195371, 99272, ... 165005, 93608, 7279, 159550, 130153, 119028, 162140, 33218, 159591, 26086], dtype=int64, name=user_id, length=2945)

购买时间分析

#一周中按天分析 purchase_day = data_orders.groupby([order_dow]).size().sort_values(ascending=False) purchase_day.plot(kind=bar) plt.show() #一天中按小时分析 purchase_hour=data_orders.groupby([order_hour_of_day]).size().sort_values(ascending=False) purchase_hour.plot(kind=bar) plt.show() #一周的那个时间购买欲最旺盛 purchase_time=data_orders.groupby([order_dow,order_hour_of_day]).size().sort_values(ascending=False) #取前20看情况 purchase_time[:20].plot(kind=bar) plt.show() #第二次购买的间隔天数 print(data_orders[days_since_prior_order].dropna().max()) print(data_orders[days_since_prior_order].dropna().min()) 30.0 0.0 #看一下分布 days_prior = data_orders[days_since_prior_order].dropna().values dic_prior=count_times(days_prior) dic_final = sorted(dic_prior.items()) print(dic_final) keys=[] values=[] for i in dic_final: keys.append(i[0]) values.append(i[1]) plt.bar(keys,values) plt.show()[(0.0, 67755), (1.0, 145247), (2.0, 193206), (3.0, 217005), (4.0, 221696), (5.0, 214503), (6.0, 240013), (7.0, 320608), (8.0, 181717), (9.0, 118188), (10.0, 95186), (11.0, 80970), (12.0, 76146), (13.0, 83214), (14.0, 100230), (15.0, 66579), (16.0, 46941), (17.0, 39245), (18.0, 35881), (19.0, 34384), (20.0, 38527), (21.0, 45470), (22.0, 32012), (23.0, 23885), (24.0, 20712), (25.0, 19234), (26.0, 19016), (27.0, 22013), (28.0, 26777), (29.0, 19191), (30.0, 369323)]

data_order_products

查看表结构

data_order_products[:10]

order_id product_id add_to_cart_order reordered 0 2 33120 1 1 1 2 28985 2 1 2 2 9327 3 0 3 2 45918 4 1 4 2 30035 5 0 5 2 17794 6 1 6 2 40141 7 1 7 2 1819 8 1 8 2 43668 9 0 9 3 33754 1 1

字段含义:

order_id 订单号

product_id 产品号

add_to_cart_order 加入到购物车的顺序

reorderd 是否再次购买该产品

data_order_products.shape (32434489, 4) #总共有多少种产品 len(data_order_products.groupby([product_id]).size().index)49677

订单中的产品数量

df = data_order_products.groupby([order_id])[add_to_cart_order].aggregate(max).reset_index() df_final = df[add_to_cart_order].value_counts() #画图 plt.figure(figsize=(12,8)) sns.barplot(df_final.index, df_final.values, alpha=0.8) plt.ylabel(订单数, fontsize=12) plt.xlabel(订单中的产品数量, fontsize=12) plt.xticks(rotation=vertical) plt.show()

最大值为5的右尾分布

产品畅销程度

data_order_products[product_id].value_counts()[:10].plot(kind=bar) plt.xlabel(产品编号) plt.ylabel(购买数量) plt.show()

data_samples

data_samples.head(5)

order_id products 0 17 39276 29259 1 34 39276 29259 2 137 39276 29259 3 182 39276 29259 4 257 39276 29259

data_samples.info()<class pandas.core.frame.DataFrame> RangeIndex: 75000 entries, 0 to 74999 Data columns (total 2 columns): order_id 75000 non-null int64 products 75000 non-null object dtypes: int64(1), object(1) memory usage: 1.1+ MB

data_products

data_products.head(5)

product_id product_name aisle_id department_id 0 1 Chocolate Sandwich Cookies 61 19 1 2 All-Seasons Salt 104 13 2 3 Robust Golden Unsweetened Oolong Tea 94 7 3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 4 5 Green Chile Anytime Sauce 5 13

data_products.info()<class pandas.core.frame.DataFrame> RangeIndex: 49688 entries, 0 to 49687 Data columns (total 4 columns): product_id 49688 non-null int64 product_name 49688 non-null object aisle_id 49688 non-null int64 department_id 49688 non-null int64 dtypes: int64(3), object(1) memory usage: 1.5+ MB

data_asiles

data_aisles.head(5)

aisle_id aisle 0 1 prepared soups salads 1 2 specialty cheeses 2 3 energy granola bars 3 4 instant foods 4 5 marinades meat preparation

data_aisles.info()<class pandas.core.frame.DataFrame> RangeIndex: 134 entries, 0 to 133 Data columns (total 2 columns): aisle_id 134 non-null int64 aisle 134 non-null object dtypes: int64(1), object(1) memory usage: 2.2+ KB

data_departments

data_departments[department].valuesarray([frozen, other, bakery, produce, alcohol, international, beverages, pets, dry goods pasta, bulk, personal care, meat seafood, pantry, breakfast, canned goods, dairy eggs, household, babies, snacks, deli, missing], dtype=object)data_departments.info()<class pandas.core.frame.DataFrame> RangeIndex: 21 entries, 0 to 20 Data columns (total 2 columns): department_id 21 non-null int64 department 21 non-null object dtypes: int64(1), object(1) memory usage: 416.0+ bytes

多表联合分析

不同类别的产品订单量

# 订单——产品表 data_order_products_part = data_order_products[[order_id,product_id]] # 产品表 data_products_aisle = data_products[[product_id, aisle_id]] # 合并表 data_first = pd.merge(data_order_products_part, data_products_aisle, on=product_id, how=left) data_final = pd.merge(data_first, data_aisles, on=aisle_id, how=left) #订单量前25的产品类别 data_front25 = data_final[aisle].value_counts()[:25] #作图 plt.figure(figsize=(12,8)) sns.barplot(data_front25.index, data_front25.values, alpha=0.8) plt.ylabel(Number of Occurrences, fontsize=12) plt.xlabel(Aisle, fontsize=12) plt.xticks(rotation=vertical) plt.show()

不同部门的产品订单量

# 产品表 data_products_department = data_products[[product_id, department_id]] #合并表 data_first = pd.merge(data_order_products_part,data_products_department, on=product_id, how=left) data_final = pd.merge(data_first, data_departments, on= department_id, how=left) #按部门划分 department_group = data_final[department].value_counts()[:15] #份额展示 plt.figure(figsize=(10,10)) labels = (np.array(department_group.index)) sizes = (np.array((department_group / department_group.sum())*100)) plt.pie(sizes, labels=labels, autopct=%1.1f%%, startangle=200) plt.title("Departments distribution", fontsize=15) plt.show()

再次下单相关分析

再次下单产品和比例

#找出再次下单的产品 indexes = data_order_products[data_order_products[reordered]==1][product_id].value_counts().index indexesInt64Index([24852, 13176, 21137, 21903, 47209, 47766, 27845, 47626, 27966, 16797, ... 23467, 7068, 35129, 25118, 12153, 7061, 28538, 23454, 8734, 24481], dtype=int64, length=45305)indexes.shape (45305,) #计算比例 indexes.shape[0]/data_products.shape[0]0.9117895668974401

再次下单的产品种类

# 找出再次下单的产品 data_part = data_order_products[data_order_products[reordered]==1][[order_id,product_id, reordered]] # 合并表 data_first = pd.merge(data_part, data_products_aisle, on=product_id,how=left) data_final = pd.merge(data_first, data_aisles, on=aisle_id, how=left) top20 = data_final[aisle].value_counts()[:20] top20 #作图 plt.figure(figsize=(12,8)) top20.plot(kind=bar) plt.xticks(rotation=vertical) plt.show()

产品再次下单的对应部门

#合并表 data_first = pd.merge(data_part, data_products_department, on=product_id,how=left) data_final = pd.merge(data_first, data_departments, on= department_id, how=left) #按部门划分计数 department_again = data_final[department].value_counts() #作图 plt.figure(figsize=(12,8)) department_again.plot(kind=bar) plt.show()

添加顺序与再次购买

plt.figure(figsize=(12,8)) data_order_products[data_order_products[reordered]==1][add_to_cart_order].value_counts().plot() plt.show()

添加到购物车的顺序基本与会不会再次购买成正相关

基于时间,产品再次购买的分析

# 再次购买信息 data_reordered = data_order_products[data_order_products[reordered]==1] #时间信息 data_time = data_orders[[order_id,order_dow,order_hour_of_day,days_since_prior_order]].dropna() #合并表 merge_time = pd.merge(data_reordered,data_time ,on =order_id,how=left).dropna() #按星期作图 plt.figure(figsize=(10,8)) merge_time[order_dow].value_counts().plot(kind=bar) plt.xlabel(星期) plt.ylabel(再次购买产品量) plt.show() #小时影响 plt.figure(figsize=(10,8)) merge_time[order_hour_of_day].value_counts().plot(kind=bar) plt.xlabel(小时) plt.ylabel(再次购买产品量) plt.show() #间隔天的分析 plt.figure(figsize=(10,8)) merge_time[days_since_prior_order].value_counts().plot(kind=bar) plt.xlabel(间隔天数) plt.ylabel(再次购买产品量) plt.show() #星期和小时 grouped_df = merge_time.groupby(["order_dow", "order_hour_of_day"])["reordered"].aggregate("sum").reset_index() grouped_df = grouped_df.pivot(order_dow, order_hour_of_day, reordered) #作图 plt.figure(figsize=(12,6)) sns.heatmap(grouped_df,cmap=YlGnBu) plt.title("Reorder ratio of Day of week Vs Hour of day") plt.show() #注意aggregate的用法

多因素分析

产品量——摆放类别——部门

# 合并表 data_first = pd.merge(data_products, data_aisles, on= aisle_id, how=left) data_final = pd.merge(data_first, data_departments,on=department_id,how=left) # 找出所有department的值 department_index = data_final[department].value_counts().index # 作图(只画一张) for i in department_index[:1]: data_final.groupby([department,aisle]).size()[i].plot(bar) time.sleep(3) plt.show()

订单——摆放类别——部门

# 合并表 data_first = pd.merge(data_products[[product_id,aisle_id,department_id]], data_order_products[[order_id,product_id]], on=product_id,how=left) data_middle = pd.merge(data_first, data_aisles,on=aisle_id,how=left) data_final = pd.merge(data_middle,data_departments,on=department_id,how=left) data_final = data_final[[order_id,product_id,aisle,department]] grouped = data_final.groupby([department,aisle])[order_id].aggregate({Total_orders: count}).reset_index() grouped.sort_values(by=Total_orders, ascending=False, inplace=True)grouped.groupby([department]) fig, axes = plt.subplots(7,3, figsize=(20,45), gridspec_kw = dict(hspace=1.4)) for (aisle, group), ax in zip(grouped.groupby(["department"]), axes.flatten()): g = sns.barplot(group.aisle, group.Total_orders , ax=ax) ax.set(xlabel = "Aisles", ylabel=" Number of Orders") g.set_xticklabels(labels = group.aisle,rotation=90, fontsize=12) ax.set_title(aisle, fontsize=15)

搜索