新闻中心
商场购物数据分析(购物中心经营指标)
项目背景介绍
无论您是从精心策划的购物清单中选择购物,还是让奇思妙想随心选择物品,我们每一个人都有独特的购物爱好。
下面是超过3百万的订单信息,使用这些数据来分析用户对产品的购买喜好,并建立预测用户再次购买产品的模型,并在用户购物时向用户推荐商品。
通过分析这些数据以及训练有效的预测算法,发现消费者购买杂货的意愿并更新自身的产品。
数据分析准备
表关系图

说明:
aisles: 产品摆放的分类
order_products__prior: 订单--产品关联表
orders.csv: 用户下单记录表。
products.csv: 产品ID分类,及其摆放位置的关系表
departments.csv: 产品分类表
预测目标
通过用户历史购买的记录信息,制定推荐策略。
导库
import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns import time %matplotlib inline plt.rcParams[font.sans-serif] = [FangSong] # 指定默认字体 plt.rcParams[axes.unicode_minus] = False # 解决保存图像是负号-显示为方块的问题加载数据
# 订单表 data_orders=pd.read_csv(data/orders.csv) # 订单——产品表 data_order_products=pd.read_csv(data/order_products_prior.csv) # sample表 data_samples = pd.read_csv(data/sample_submission.csv) # 产品表 data_products = pd.read_csv(./data/products.csv) # 类别表 data_aisles = pd.read_csv(./data/aisles.csv) # 部门表 data_departments = pd.read_csv(./data/departments.csv)单表初步分析
data_orders
查看表结构
#查看表单字段 data_orders.head(5)order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order 0 2539329 1 prior 1 2 8 NaN 1 2398795 1 prior 2 3 7 15.0 2 473747 1 prior 3 3 12 21.0 3 2254736 1 prior 4 4 7 29.0 4 431534 1 prior 5 4 15 28.0
字段含义:
order_id 订单id
user_id 用户id
eval_set ------
order_number 订单量
order_dow 一周中星期几
order_hour_of_day 一天中小时
days_since_prior_order 多次购买间隔天数
#查看数据维度 data_orders.shape (3421083, 7) #查看数据缺失字段 print(data_orders.isnull().any()) #计算缺失比例 data_orders[data_orders[days_since_prior_order].isnull()].shape[0]/data_orders.shape[0]order_id False user_id False eval_set False order_number False order_dow False order_hour_of_day False days_since_prior_order True dtype: bool 0.06027594185817766单表初步分析
#正太分布的检验函数 import numpy as np def norm(x): mu=np.mean(x) var=np.sum((x-mu)**2) sigma=var**(1/2) s=-(x-mu)**2/2*var up=np.exp(s) down=(2*np.pi)**(1/2)*sigma distribution=up/down return distribution购买欲分析
#那些客户的购买欲望最强 buy_times=data_orders.groupby([user_id]).size().sort_values(ascending=False).sort_values() print(buy_times.max()) print(buy_times.min()) 100 4 #提取出每个人的购买次数,对购买次数进行统计 buy_times.values array([ 4, 4, 4, ..., 100, 100, 100], dtype=int64) def count_times(num): dic={} for i in num: if i in dic: dic[i]+=1 else: dic[i]=1 return dic dic=count_times(buy_times.values) print(dic) keys=[] values=[] for key, value in dic.items(): keys.append(key) values.append(value) plt.plot(keys,values) plt.show(){4: 23986, 5: 19590, 6: 16165, 7: 13850, 8: 11700, 9: 10190, 10: 9032, 11: 7815, 12: 6952, 13: 6236, 14: 5613, 15: 5191, 16: 4593, 17: 4139, 18: 3836, 19: 3390, 20: 3200, 21: 2921, 22: 2706, 23: 2605, 24: 2288, 25: 2186, 26: 2060, 27: 1908, 28: 1779, 29: 1601, 30: 1494, 31: 1474, 32: 1386, 33: 1273, 34: 1186, 35: 1157, 36: 1015, 37: 1048, 38: 960, 39: 904, 40: 878, 41: 885, 42: 814, 43: 743, 44: 724, 45: 707, 46: 657, 47: 675, 48: 654, 49: 579, 50: 554, 51: 582, 52: 527, 53: 524, 54: 499, 55: 425, 56: 390, 57: 396, 58: 343, 59: 332, 60: 303, 61: 324, 62: 276, 63: 267, 64: 254, 65: 239, 66: 238, 67: 204, 68: 189, 69: 185, 70: 178, 71: 154, 72: 166, 73: 138, 74: 162, 75: 162, 76: 134, 77: 120, 78: 127, 79: 127, 80: 116, 81: 108, 82: 119, 83: 86, 84: 99, 85: 100, 86: 93, 87: 78, 88: 79, 89: 77, 90: 85, 91: 56, 92: 69, 93: 67, 94: 57, 95: 64, 96: 67, 97: 54, 98: 50, 99: 47, 100: 1374}
购买时间分析
#一周中按天分析 purchase_day = data_orders.groupby([order_dow]).size().sort_values(ascending=False) purchase_day.plot(kind=bar) plt.show()


data_order_products
查看表结构
data_order_products[:10]order_id product_id add_to_cart_order reordered 0 2 33120 1 1 1 2 28985 2 1 2 2 9327 3 0 3 2 45918 4 1 4 2 30035 5 0 5 2 17794 6 1 6 2 40141 7 1 7 2 1819 8 1 8 2 43668 9 0 9 3 33754 1 1
字段含义:
order_id 订单号
product_id 产品号
add_to_cart_order 加入到购物车的顺序
reorderd 是否再次购买该产品
data_order_products.shape (32434489, 4) #总共有多少种产品 len(data_order_products.groupby([product_id]).size().index)49677订单中的产品数量
df = data_order_products.groupby([order_id])[add_to_cart_order].aggregate(max).reset_index() df_final = df[add_to_cart_order].value_counts() #画图 plt.figure(figsize=(12,8)) sns.barplot(df_final.index, df_final.values, alpha=0.8) plt.ylabel(订单数, fontsize=12) plt.xlabel(订单中的产品数量, fontsize=12) plt.xticks(rotation=vertical) plt.show()最大值为5的右尾分布
产品畅销程度
data_order_products[product_id].value_counts()[:10].plot(kind=bar) plt.xlabel(产品编号) plt.ylabel(购买数量) plt.show()data_samples
data_samples.head(5)order_id products 0 17 39276 29259 1 34 39276 29259 2 137 39276 29259 3 182 39276 29259 4 257 39276 29259
data_samples.info()<class pandas.core.frame.DataFrame> RangeIndex: 75000 entries, 0 to 74999 Data columns (total 2 columns): order_id 75000 non-null int64 products 75000 non-null object dtypes: int64(1), object(1) memory usage: 1.1+ MBdata_products
data_products.head(5)product_id product_name aisle_id department_id 0 1 Chocolate Sandwich Cookies 61 19 1 2 All-Seasons Salt 104 13 2 3 Robust Golden Unsweetened Oolong Tea 94 7 3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 4 5 Green Chile Anytime Sauce 5 13
data_products.info()<class pandas.core.frame.DataFrame> RangeIndex: 49688 entries, 0 to 49687 Data columns (total 4 columns): product_id 49688 non-null int64 product_name 49688 non-null object aisle_id 49688 non-null int64 department_id 49688 non-null int64 dtypes: int64(3), object(1) memory usage: 1.5+ MBdata_asiles
data_aisles.head(5)aisle_id aisle 0 1 prepared soups salads 1 2 specialty cheeses 2 3 energy granola bars 3 4 instant foods 4 5 marinades meat preparation
data_aisles.info()<class pandas.core.frame.DataFrame> RangeIndex: 134 entries, 0 to 133 Data columns (total 2 columns): aisle_id 134 non-null int64 aisle 134 non-null object dtypes: int64(1), object(1) memory usage: 2.2+ KBdata_departments
data_departments[department].valuesarray([frozen, other, bakery, produce, alcohol, international, beverages, pets, dry goods pasta, bulk, personal care, meat seafood, pantry, breakfast, canned goods, dairy eggs, household, babies, snacks, deli, missing], dtype=object)data_departments.info()<class pandas.core.frame.DataFrame> RangeIndex: 21 entries, 0 to 20 Data columns (total 2 columns): department_id 21 non-null int64 department 21 non-null object dtypes: int64(1), object(1) memory usage: 416.0+ bytes多表联合分析
不同类别的产品订单量
# 订单——产品表 data_order_products_part = data_order_products[[order_id,product_id]] # 产品表 data_products_aisle = data_products[[product_id, aisle_id]] # 合并表 data_first = pd.merge(data_order_products_part, data_products_aisle, on=product_id, how=left) data_final = pd.merge(data_first, data_aisles, on=aisle_id, how=left) #订单量前25的产品类别 data_front25 = data_final[aisle].value_counts()[:25] #作图 plt.figure(figsize=(12,8)) sns.barplot(data_front25.index, data_front25.values, alpha=0.8) plt.ylabel(Number of Occurrences, fontsize=12) plt.xlabel(Aisle, fontsize=12) plt.xticks(rotation=vertical) plt.show()不同部门的产品订单量
# 产品表 data_products_department = data_products[[product_id, department_id]] #合并表 data_first = pd.merge(data_order_products_part,data_products_department, on=product_id, how=left) data_final = pd.merge(data_first, data_departments, on= department_id, how=left) #按部门划分 department_group = data_final[department].value_counts()[:15] #份额展示 plt.figure(figsize=(10,10)) labels = (np.array(department_group.index)) sizes = (np.array((department_group / department_group.sum())*100)) plt.pie(sizes, labels=labels, autopct=%1.1f%%, startangle=200) plt.title("Departments distribution", fontsize=15) plt.show()再次下单相关分析
再次下单产品和比例
#找出再次下单的产品 indexes = data_order_products[data_order_products[reordered]==1][product_id].value_counts().index indexesInt64Index([24852, 13176, 21137, 21903, 47209, 47766, 27845, 47626, 27966, 16797, ... 23467, 7068, 35129, 25118, 12153, 7061, 28538, 23454, 8734, 24481], dtype=int64, length=45305)indexes.shape (45305,) #计算比例 indexes.shape[0]/data_products.shape[0]0.9117895668974401再次下单的产品种类
# 找出再次下单的产品 data_part = data_order_products[data_order_products[reordered]==1][[order_id,product_id, reordered]] # 合并表 data_first = pd.merge(data_part, data_products_aisle, on=product_id,how=left) data_final = pd.merge(data_first, data_aisles, on=aisle_id, how=left) top20 = data_final[aisle].value_counts()[:20] top20 #作图 plt.figure(figsize=(12,8)) top20.plot(kind=bar) plt.xticks(rotation=vertical) plt.show()产品再次下单的对应部门
#合并表 data_first = pd.merge(data_part, data_products_department, on=product_id,how=left) data_final = pd.merge(data_first, data_departments, on= department_id, how=left) #按部门划分计数 department_again = data_final[department].value_counts() #作图 plt.figure(figsize=(12,8)) department_again.plot(kind=bar) plt.show()添加顺序与再次购买
plt.figure(figsize=(12,8)) data_order_products[data_order_products[reordered]==1][add_to_cart_order].value_counts().plot() plt.show()添加到购物车的顺序基本与会不会再次购买成正相关