Introduction¶

「お弁当の需要予測」は実際にお昼に販売されたお弁当の売り上げ数を予測する問題で、回帰問題として分類されます。
データサイズはとても小さいですが、生のデータであり、データ加工等の工夫が検討できる良いデータセットです。
まずはデータを読み込み、売り上げ数と説明変数との関係を簡単な基礎分析で見ていきます。

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(font="IPAexGothic",style="white")

train = pd.read_csv("./original/train.csv")
test = pd.read_csv("./original/test.csv")
sample = pd.read_csv("./original/sample.csv",header=None)
print("Data Shapes")
print("Train:",train.shape, "Test:",test.shape, "Sample:",sample.shape)

Data Shapes
Train: (207, 12) Test: (40, 11) Sample: (40, 2)

train.index = pd.to_datetime(train["datetime"])
train.head()

train.describe()

train.describe(include="O")

簡単な欠損値補間と月情報の特徴量の作成

train["payday"] = train["payday"].fillna(0)
train["precipitation"] = train["precipitation"].apply(lambda x : -1 if x == "--" else float(x))
train["event"] = train["event"].fillna("なし")
train["remarks"] = train["remarks"].fillna("なし")
train["month"] = train["datetime"].apply(lambda x : int(x.split("-")[1]))

お弁当の売り上げ分布の確認
日数が経過するにつれて売り上げが落ちて行っている
落ちて行っているのにも関わらず、スパイクしている為、売り上げ数に寄与している何かしらのファクターがある？

train["y"].plot(figsize=(15,4))

目的変数と説明変数（数値変数）との散布図の確認
月及び気温は売上数と相関が高そうだが、日数が経過するにつれて減衰している性質がある為、代替変数である可能性が高い

fig, ax = plt.subplots(2,3,figsize=(9,6))
train.plot.scatter(x="soldout", y="y", ax=ax[0][0])
train.plot.scatter(x="kcal", y="y", ax=ax[0][1])
train.plot.scatter(x="precipitation", y="y", ax=ax[0][2])
train.plot.scatter(x="payday", y="y", ax=ax[1][0])
train.plot.scatter(x="temperature", y="y", ax=ax[1][1])
train.plot.scatter(x="month", y="y", ax=ax[1][2])
plt.tight_layout()

目的変数と説明変数（カテゴリー変数）との箱ひげ図の確認
顕著に売上数に反応しているのはremarksの値がお楽しみメニューである時になしの場合と変化あり

fig, ax = plt.subplots(2,2,figsize=(12,7))
sns.boxplot(x="week",y="y",data=train,ax=ax[0][0])
sns.boxplot(x="weather",y="y",data=train,ax=ax[0][1])
sns.boxplot(x="remarks",y="y",data=train,ax=ax[1][0])
ax[1][0].set_xticklabels(ax[1][0].get_xticklabels(),rotation=30)
sns.boxplot(x="event",y="y",data=train,ax=ax[1][1])
plt.tight_layout()

お楽しみメニューを除いた場合に、後半の売り上げ数のスパイクが減少=スパイクの原因はお楽しみメニューであるかもしれない
お楽しみメニューがあるかないかで比較しても、中央値検定によれば有意に差がある

train[train["remarks"]!="お楽しみメニュー"]["y"].plot(figsize=(15,4))

train["fun"] = train["remarks"].apply(lambda x: 1 if x=="お楽しみメニュー" else 0)
sns.boxplot(x="fun",y="y",data=train)

メディアン検定

from scipy.stats import median_test
stat,p,med,tbl = median_test(train[train["fun"]==1]["y"],train[train["fun"]==0]["y"])
print("p",p,"stat",stat)

p 0.007057960766247775 stat 7.2581589841730345

お楽しみメニューの時の売り上げを見ると、この中でも凸凹がみられる

train[train["remarks"]=="お楽しみメニュー"]["y"].plot(figsize=(15,4))

train[train["remarks"]=="お楽しみメニュー"]

どうやらカレーか否かが重要そうである

train["curry"] = train["name"].apply(lambda x : 1 if x.find("カレー")>=0 else 0)
sns.boxplot(x="curry",y="y",data=train)

メディアン検定

stat,p,med,tbl = median_test(train[train["curry"]==1]["y"],train[train["curry"]==0]["y"])
print("p:",p,"stat",stat)

p: 0.010699742900892427 stat 6.514516162828877

Method¶

方針としては日数が経過するにつれて減衰している為、売上数と日数の単回帰モデルを軸に検討する
- 但し、2014-05以前はやや傾向が異なる為、学習データから除く
他、お楽しみメニューやカレー等は大きく寄与はしていそうだが、非線形な関係であることも考慮し、Random Forestを用いて単回帰モデルの結果を修正するモデルも作成し、予測結果を導出することにする

train = pd.read_csv("./original/train.csv")
test = pd.read_csv("./original/test.csv")
sample = pd.read_csv("./original/sample.csv",header=None)

train["t"] = 1
test["t"] = 0
dat = pd.concat([train,test],sort=True).reset_index(drop=True)

dat.index = pd.to_datetime(dat["datetime"])
dat = dat["2014-05-01":]
dat = dat.reset_index(drop=True)

dat["days"] = dat.index
dat["precipitation"] = dat["precipitation"].apply(lambda x : -1 if x=="--" else x).astype(np.float)
dat["fun"] = dat["remarks"].apply(lambda x: 1 if x=="お楽しみメニュー" else 0)
dat["curry"] = dat["name"].apply(lambda x : 1 if x.find("カレー")>=0 else 0)

cols = ["precipitation","weather","days","fun","curry","y"]

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error as MSE
from sklearn.linear_model import LinearRegression as LR
from sklearn.ensemble import RandomForestRegressor as RF

def learning(trainX,y_train):
    model1 = LR()
    model2 = RF(n_estimators=100,max_depth=4,random_state=777)
    model1.fit(trainX["days"].values.reshape(-1,1),y_train)
    pred = model1.predict(trainX["days"].values.reshape(-1,1))
    
    pred_sub = y_train - pred
    model2.fit(trainX.iloc[:, ~trainX.columns.str.match("y")],pred_sub)
    return model1, model2

Evaluation¶

cross validationでの検証からRMSEは大体11前後程度のモデルが作成できていることがわかる（大きく外れているのを除いた場合）
売上数がスパイクする部分についてはそれなりに予測ができているように見える
一方で、小さな凸凹についてはあまり予測ができておらず、説明変数の検討が更に必要
sklearnではRMSEは実装されていない為、MSEにルートをとり、導出する（累乗は**演算子を利用する）
ダミー変数化について
- 天気情報の値がtrainにはあるがtestにはないものが存在する為、別々でダミー変数化をするとカラムの数が合わなくなる
- その為、trainとtestを識別する為の適当な変数を作成してから一度結合し、ダミー変数化した後に戻すという手順を実施している

kf = KFold(n_splits=5,random_state=777)
tr = dat[dat["t"]==1][cols]

trains = []
tests = []
for train_index, test_index in kf.split(tr):
    tr.loc[train_index,"tt"] = 1
    tr.loc[test_index,"tt"] = 0
    tr["tt"] = tr["tt"].astype(np.int)
    tmp = pd.get_dummies(tr)
    
    trainX = tmp[tmp["tt"]==1]
    del trainX["tt"]
    testX = tmp[tmp["tt"]==0]
    del testX["tt"]
    y_train = tmp[tmp["tt"]==1]["y"]
    y_test = tmp[tmp["tt"]==0]["y"]
    
    model1, model2 = learning(trainX, y_train)
    
    pred_train = model1.predict(trainX["days"].values.reshape(-1,1)) + model2.predict(trainX.iloc[:, ~trainX.columns.str.match("y")])
    pred_test = model1.predict(testX["days"].values.reshape(-1,1)) + model2.predict(testX.iloc[:, ~testX.columns.str.match("y")])
    
    print("TRAIN:",MSE(y_train,pred_train)**0.5, "VARIDATE",MSE(y_test, pred_test)**0.5)
    trains.append(MSE(y_train,pred_train)**0.5)
    tests.append(MSE(y_test, pred_test)**0.5)
print("AVG")
print(np.array(trains).mean(), np.array(tests).mean())

TRAIN: 7.567045890582926 VARIDATE 12.108512732864117
TRAIN: 7.898879809389376 VARIDATE 10.596024908707413
TRAIN: 8.266274249088164 VARIDATE 8.843075664182068
TRAIN: 7.954599685874008 VARIDATE 19.93205551309561
TRAIN: 7.913918883780418 VARIDATE 9.443466823946043
AVG
7.920143703742978 12.184627128559049

cols = ["precipitation","weather","days","fun","curry","y","t"]
tmp = pd.get_dummies(dat[cols])
trainX = tmp[tmp["t"]==1]
del trainX["t"]
testX = tmp[tmp["t"]==0]
del testX["t"]
y_train = tmp[tmp["t"]==1]["y"]
y_test = tmp[tmp["t"]==0]["y"]

model1, model2 = learning(trainX,y_train)
pred = model1.predict(trainX["days"].values.reshape(-1,1)) + model2.predict(trainX.iloc[:,~trainX.columns.str.match("y")])

p = pd.DataFrame({"actual":y_train,"pred":pred})
p.plot(figsize=(15,4))
print("RMSE",MSE(y_train,pred)**0.5)

RMSE 7.986229908417227

model1, model2 = learning(trainX,y_train)
pred = model1.predict(testX["days"].values.reshape(-1,1)) + model2.predict(testX.iloc[:,~testX.columns.str.match("y")])
plt.figure(figsize=(15,4))
plt.plot(pred)

[]

sample[1] = pred
sample.to_csv("submit01.csv",index=None,header=None)

Conclution¶

お弁当の需要予測を題材とし、基礎分析の結果から2種類のモデルを利用した。
1つ目は日数が経過することに減少傾向だったため、日数と需要量との単回帰モデルである。
2つ目は上記予測結果を更に細かく補正する為の非線形なモデルとしてRandomForestを採用した。
- スパイク要因に注目したところ、お楽しみメニューやカレーかどうかが重要であり、これらを表す特徴量を加えた結果、スパイクが予測できていた
課題としては、小さな凸凹に対しては予測が不十分と考えられる為、これらに有効な特徴量を検討することである。

	datetime	y	week	soldout	name	kcal	remarks	event	payday	weather	precipitation	temperature
datetime
2013-11-18	2013-11-18	90	月	0	厚切りイカフライ	NaN	NaN	NaN	NaN	快晴	--	19.8
2013-11-19	2013-11-19	101	火	1	手作りヒレカツ	NaN	NaN	NaN	NaN	快晴	--	17.0
2013-11-20	2013-11-20	118	水	0	白身魚唐揚げ野菜あん	NaN	NaN	NaN	NaN	快晴	--	15.5
2013-11-21	2013-11-21	120	木	1	若鶏ピリ辛焼	NaN	NaN	NaN	NaN	快晴	--	15.2
2013-11-22	2013-11-22	130	金	1	ビッグメンチカツ	NaN	NaN	NaN	NaN	快晴	--	16.1

	y	soldout	kcal	payday	temperature
count	207.000000	207.000000	166.000000	10.0	207.000000
mean	86.623188	0.449275	404.409639	1.0	19.252174
std	32.882448	0.498626	29.884641	0.0	8.611365
min	29.000000	0.000000	315.000000	1.0	1.200000
25%	57.000000	0.000000	386.000000	1.0	11.550000
50%	78.000000	0.000000	408.500000	1.0	19.800000
75%	113.000000	1.000000	426.000000	1.0	26.100000
max	171.000000	1.000000	462.000000	1.0	34.600000

	datetime	y	week	soldout	name	kcal	remarks	event	payday	weather	precipitation	temperature	month	fun
datetime
2014-03-28	2014-3-28	106	金	0	キーマカレー	NaN	お楽しみメニュー	なし	0.0	快晴	-1.0	18.5	3	1
2014-04-11	2014-4-11	128	金	1	チキンカレー	NaN	お楽しみメニュー	なし	0.0	快晴	-1.0	16.5	4	1
2014-04-25	2014-4-25	80	金	0	中華丼	NaN	お楽しみメニュー	なし	0.0	晴れ	-1.0	20.8	4	1
2014-05-16	2014-5-16	126	金	0	ポークカレー	NaN	お楽しみメニュー	ママの会	0.0	快晴	-1.0	23.8	5	1
2014-05-30	2014-5-30	119	金	0	チキンカレー	NaN	お楽しみメニュー	なし	0.0	薄曇	-1.0	26.9	5	1
2014-06-13	2014-6-13	121	金	0	キーマカレー	NaN	お楽しみメニュー	なし	0.0	晴れ	-1.0	29.5	6	1
2014-06-27	2014-6-27	74	金	0	牛丼	NaN	お楽しみメニュー	なし	0.0	雨	0.0	25.4	6	1
2014-07-11	2014-7-11	124	金	0	ポークカレー	NaN	お楽しみメニュー	なし	0.0	晴れ	-1.0	33.9	7	1
2014-07-25	2014-7-25	83	金	0	ひやしたぬきうどん・炊き込みご飯	NaN	お楽しみメニュー	なし	0.0	晴れ	-1.0	33.6	7	1
2014-08-08	2014-8-8	129	金	0	チキンカレー	NaN	お楽しみメニュー	なし	1.0	曇	-1.0	31.1	8	1
2014-08-22	2014-8-22	100	金	1	ロコモコ丼	NaN	お楽しみメニュー	なし	0.0	晴れ	-1.0	33.1	8	1
2014-09-12	2014-9-12	115	金	0	ポークカレー	NaN	お楽しみメニュー	なし	0.0	晴れ	-1.0	27.3	9	1

お弁当の需要予測チュートリアル

Introduction¶

Method¶

Evaluation¶

Conclution¶

Refference¶

Now or Never¶

Article 1. Definitions

Article 2. Competition

Article 3 Reward and Vesting of Rights

Article 4 Confidentiality

Article 5 Prohibited Acts of Participants

Article 6. Change, Discontinuation or Termination of Provision of Services under These Terms

Article 7 Modification of Terms

This is a forum used by SIGNATE members to exchange thoughts and ideas on data science and competitions. As your membership here is conditional, please keep in mind to familiarize yourself before joining in on discussion.

	datetime	week	name	remarks	event	weather	precipitation
count	207	207	207	21	14	207	207
unique	207	5	156	6	2	7	8
top	2014-7-28	木	タンドリーチキン	お楽しみメニュー	ママの会	快晴	--
freq	1	43	6	12	9	53	169

お弁当の需要予測チュートリアル

Introduction¶

Method¶

Evaluation¶

Conclution¶

Refference¶

Now or Never¶

SIGNATE Competition

Article 1. Definitions

Article 2. Competition

Article 3 Reward and Vesting of Rights

Article 4 Confidentiality

Article 5 Prohibited Acts of Participants

Article 6. Change, Discontinuation or Termination of Provision of Services under These Terms

Article 7 Modification of Terms

General posting guidelines

This is a forum used by SIGNATE members to exchange thoughts and ideas on data science and competitions. As your membership here is conditional, please keep in mind to familiarize yourself before joining in on discussion.

Please sign in

本コンペに参加するには下記項目への入力が必須です

本コンペに参加するには下記項目への入力が必須です

Must update your profile to join the competition

Must update your profile to join the competition

本コンペに参加するには下記項目への入力が必須です

本コンペに参加するには下記項目への入力が必須です

本コンペに参加するには下記項目への入力が必須です

Must update your profile to join the competition

Error details

Preparing to download the contents.

Must update your profile to join the competition