Student Cup 2021秋予測部門チュートリアル

create date : Oct. 1, 2021 at 11:11:53

データの理解を深めることを目的として、探索的分析や簡易的なモデリングを中心に作成したチュートリアルです。

概要¶

こちらのチュートリアルでは、シンプルなモデルを使って、データの学習を行い、予測値を算出し、提出物を作成する方法を説明します。
説明は以下の流れで構成されています。

1.データの読み込みと確認¶

2.特徴量生成¶

3.モデリング¶

4.提出物の作成¶

1.データの読み込みと確認¶

本チュートリアルで使用するライブラリをインポートし、データを読み込みます。

In [1]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import category_encoders as ce
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:

status = pd.read_csv('./path/to/status.csv')
station = pd.read_csv('./path/to/station.csv')
weather = pd.read_csv('./path/to/weather.csv')

status.csvの中身を見ていきましょう。
全70ステーションについて、2013/09/01~2015/08/31の730日分、それぞれの日について1時間おき、つまり70x730x24=1,226,400行のデータが存在します。

In [3]:

print(status.shape)
status.head()

(1226400, 8)

Out[3]:

	id	year	month	day	hour	bikes_available
0	0	2013	9	1	0	11.0
1	1	2013	9	1	1	11.0
2	2	2013	9	1	2	11.0
3	3	2013	9	1	3	11.0
4	4	2013	9	1	4	11.0

予測対象日はpredict列に1のフラグがついており、2014/09/01以降となっています。
予測対象日については、0時時点のbikes_availableは与えられており、1時以降を予測対象とします。
予測対象日の翌日については、0~23時まで丸一日bikes_available列の値は削除されています。

In [4]:

status[status['predict'] == 1].head()

Out[4]:

	id	year	month	day	hour	bikes_available	predict
8761	8761	2014	9	1	1	NaN	1
8762	8762	2014	9	1	2	NaN	1
8763	8763	2014	9	1	3	NaN	1
8764	8764	2014	9	1	4	NaN	1
8765	8765	2014	9	1	5	NaN	1

status.csvのyear, month, dayを結合してdate列を作り、これをdatetime型に変換します。

In [5]:

#statusのyear, month, dayを結合してdatetime型に
status['date'] = status['year'].astype(str) + '/' + status['month'].astype(str).str.zfill(2).astype(str) + '/' + status['day'].astype(str).str.zfill(2).astype(str)
status['date'] = pd.to_datetime(status['date'])
status.head()

Out[5]:

	id	year	month	day	hour	bikes_available	date
0	0	2013	9	1	0	11.0	2013-09-01
1	1	2013	9	1	1	11.0	2013-09-01
2	2	2013	9	1	2	11.0	2013-09-01
3	3	2013	9	1	3	11.0	2013-09-01
4	4	2013	9	1	4	11.0	2013-09-01

予測対象日については、2014/09/01以降の120日分となります。

In [6]:

status[status['predict'] == 1]['date'].nunique()

Out[6]:

続いてstation.csvを見ていきましょう。
全70箇所のステーションについて、それぞれの緯度経度、ドック数、市の名前、また設置された日付が分かります。

In [7]:

station.head()

Out[7]:

	station_id	lat	long	dock_count	city	installation_date
0	0	37.32973	-121.90178	27	city1	8/6/2013
1	1	37.33070	-121.88898	15	city1	8/5/2013
2	2	37.33399	-121.89490	11	city1	8/6/2013
3	3	37.33141	-121.89320	19	city1	8/5/2013
4	4	37.33672	-121.89407	15	city1	8/7/2013

設置された日付installation_date列をdatetime型に変換し、最も古い日付と最新の日付を確認します。
一番早いものは2013/08/05に、一番遅いものは2014/04/09に設置されたことがわかります。

In [8]:

station['installation_date'] = pd.to_datetime(station['installation_date'])
print(station['installation_date'].min())
print(station['installation_date'].max())

2013-08-05 00:00:00
2014-04-09 00:00:00

cityごとに、ステーションの位置をプロットしてみましょう。
自転車の使用は市内が多いなど、市ごとに特徴がありそうです。

In [9]:

fig, ax = plt.subplots(figsize=(8,6))
sns.scatterplot(x='long', y='lat', hue='city', data=station) 
plt.show()

最後にweather.csvを確認します。730日分、一日の0時時点での予測値が保存されています。

In [10]:

print(weather.shape)
weather.head()

(730, 22)

Out[10]:

	date	max_temperature	mean_temperature	min_temperature	max_dew_point	mean_dew_point	min_dew_point	max_humidity	mean_humidity	min_humidity	...	min_sea_level_pressure	max_visibility	mean_visibility	min_visibility	max_wind_Speed	mean_wind_speed	precipitation	cloud_cover	events	wind_dir_degrees
0	2013-09-01	81	70	61	62	58	54	80	67	47	...	29.85	10	10	10	14	4	0.00	1	NaN	354
1	2013-09-02	80	71	66	64	61	58	80	70	58	...	29.86	10	10	10	14	4	0.00	5	NaN	337
2	2013-09-03	81	69	58	60	56	52	82	65	44	...	29.93	10	10	10	19	2	1.71	6	Rain	341
3	2013-09-04	82	68	56	61	55	49	81	64	43	...	29.94	10	10	10	15	0	0.00	0	NaN	324
4	2013-09-05	81	68	56	59	54	50	81	63	41	...	29.95	10	10	10	16	1	0.00	0	NaN	335

5 rows × 22 columns

date列をdatetime型に変換します。

In [11]:

weather['date'] = pd.to_datetime(weather['date'])
print(weather['date'].nunique())

event列について、それぞれのカテゴリの数を算出します。
RainやFogといった悪天候の日が100日ほどありますが、全体的に温暖な気候であることが伺えます。

In [12]:

weather['events'].value_counts()

Out[12]:

Rain        82
Fog         10
Fog-Rain     2
Name: events, dtype: int64

2.特徴量の生成¶

続いてモデリングに用いる特徴量の生成を行います。
まず、市ごとの特徴をとらえるため、statusにstationのcity列をマージします。

In [13]:

#statusにstationのstation_idをマージ
status = pd.merge(status, station[['station_id', 'city']], how = 'left')

In [14]:

status.head()

Out[14]:

	id	year	month	day	hour	bikes_available	date	city
0	0	2013	9	1	0	11.0	2013-09-01	city1
1	1	2013	9	1	1	11.0	2013-09-01	city1
2	2	2013	9	1	2	11.0	2013-09-01	city1
3	3	2013	9	1	3	11.0	2013-09-01	city1
4	4	2013	9	1	4	11.0	2013-09-01	city1

statusにweatherのprecipitation列を、date列をキーとしてマージします。

In [15]:

#weatherのprecipitationをマージ
status = pd.merge(status, weather[['date', 'precipitation']], how = 'left')
status.head()

Out[15]:

	id	year	month	day	hour	bikes_available	date	city
0	0	2013	9	1	0	11.0	2013-09-01	city1
1	1	2013	9	1	1	11.0	2013-09-01	city1
2	2	2013	9	1	2	11.0	2013-09-01	city1
3	3	2013	9	1	3	11.0	2013-09-01	city1
4	4	2013	9	1	4	11.0	2013-09-01	city1

statusのdate列の曜日を数値化したものをweek_num列として追加します。

In [16]:

#statusのdate列の曜日を数値化
status['week_num'] = status['date'].dt.weekday
status.head()

Out[16]:

	id	year	month	day	hour	bikes_available	date	city	week_num
0	0	2013	9	1	0	11.0	2013-09-01	city1	6
1	1	2013	9	1	1	11.0	2013-09-01	city1	6
2	2	2013	9	1	2	11.0	2013-09-01	city1	6
3	3	2013	9	1	3	11.0	2013-09-01	city1	6
4	4	2013	9	1	4	11.0	2013-09-01	city1	6

特徴量に、同一のステーション・同一の日における0時時点のbikes_availableの数を特徴量として加えます。

In [17]:

#同じステーション、同じ日において、0時時点の台数を特徴量に追加
#station_id, dateでグルーピングしたときの一番初めの値を取得
t = status.groupby(['station_id', 'date']).first()['bikes_available'].reset_index()
#24回リピートすることでデータのサイズを合わせる
t = pd.DataFrame(np.repeat(t.values, 24, axis=0))
t.columns = ['station_id', 'date', 'bikes_available_at0']
t.head(10)

Out[17]:

	date	bikes_available_at0
0	2013-09-01	11.0
1	2013-09-01	11.0
2	2013-09-01	11.0
3	2013-09-01	11.0
4	2013-09-01	11.0
5	2013-09-01	11.0
6	2013-09-01	11.0
7	2013-09-01	11.0
8	2013-09-01	11.0
9	2013-09-01	11.0

statusにbikes_available_at0列として追加します。

In [18]:

status['bikes_available_at0'] = t['bikes_available_at0']

最後に、city列をcategorical encoderを用いて数値化します。

In [19]:

#city列をcategorical encoderを用いて数値化
cols = ['city']

#encoder = ce.CountEncoder()
#temp_ = encoder.fit_transform(status[cols]).add_prefix("CE_")

# OneHotEncodeしたい列を指定
encoder = ce.OneHotEncoder(cols=cols, handle_unknown='impute')
temp_ = encoder.fit_transform(status[cols]).add_prefix("CE_")

status = pd.concat([status, temp_], axis=1)

3.モデリング¶

statusを学習用、テスト用に分割します。
2014/09/01以前を学習用、以後をテスト用とします(ただし、テスト用はpredict = 1のものに絞ります)。

In [20]:

#2014/09/01以前をtrain、以後をtestに分割(testはpredict = 1のものに絞る)
train = status[status['date'] < '2014-09-01']
test = status[(status['date'] >= '2014-09-01') & (status['predict'] == 1)]

学習用データtrainはbikes_availableがnanではないものに絞ります。

In [21]:

#trainはbikes_availableがnanではないものに絞る
train = train[train['bikes_available'].notna()]

使用カラムを限定し、学習用データの3割を評価用に分割します。

In [22]:

train.columns

Out[22]:

Index(['id', 'year', 'month', 'day', 'hour', 'station_id', 'bikes_available',
       'predict', 'date', 'city', 'precipitation', 'week_num',
       'bikes_available_at0', 'CE_city_1', 'CE_city_2', 'CE_city_3',
       'CE_city_4', 'CE_city_5'],
      dtype='object')

In [23]:

#使用カラムを限定
train = train[['hour', 'station_id',  'CE_city_1', 'CE_city_2', 'CE_city_3', 'CE_city_4', 'CE_city_5', 'precipitation', 'week_num', 'bikes_available_at0', 'bikes_available']]
test_X = test[['hour', 'station_id',  'CE_city_1', 'CE_city_2', 'CE_city_3', 'CE_city_4', 'CE_city_5', 'precipitation', 'week_num', 'bikes_available_at0']]

In [24]:

#trainをtrain, validに分割
train_, valid_ = train_test_split(train, test_size=0.3)
#説明変数と目的変数に分離
train_X = train_.drop('bikes_available', axis = 1)
train_y = train_['bikes_available']
valid_X = valid_.drop('bikes_available', axis = 1)
valid_y = valid_['bikes_available']

In [25]:

train_X.head()

Out[25]:

	hour	station_id	CE_city_1	CE_city_3	CE_city_5	week_num	bikes_available_at0
456985	1	26	0	1	0	4	11.0
1217002	10	69	1	0	0	1	10.0
1163530	10	66	1	0	0	5	5.0
369277	13	21	0	1	0	6	9.0
1001502	6	57	0	0	1	6	10.0

評価用データについて、bikes_available_at0の値をそのまま予測に用いた場合（つまり、0時時点のbikes_availableの数がその後23時間一定であるような場合）、RMSEはどの程度になるのでしょうか。

In [26]:

#validデータのrmse
mean_squared_error(valid_y, valid_X['bikes_available_at0'], squared=False)

Out[26]:

3.103763549429869

約3.1台となりました。
続いて、分割した学習用データに対し、RandomForestを用いてモデルの作成した場合、RMSEはどの程度になるか確認します。

In [27]:

#モデリング
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(max_depth=15, random_state=0)
regr.fit(train_X, train_y)

Out[27]:

RandomForestRegressor(max_depth=15, random_state=0)

In [28]:

#validデータに対して予測
val_predict = regr.predict(valid_X)
val_predict

Out[28]:

array([6.96365895, 5.25292085, 4.58023995, ..., 8.065614  , 8.76594212,
       7.84500984])

特徴量を作成し、RandomForestを用いたことで、RMSEは約2.1台まで減少しました。

In [29]:

#validデータのrmse
mean_squared_error(valid_y, val_predict, squared=False)

Out[29]:

2.1189571101598856

4.提出物の作成¶

最後に、提出できる形にデータを整形していきます。
まず初めに、テスト用データに対する予測を行います。

In [30]:

#testデータに対する予測
test_predict = regr.predict(test_X)
test_predict

Out[30]:

array([14.97149339, 14.97342517, 14.97225555, ...,  8.84647008,
        9.00931135,  9.22941448])

提出用にidを付与します。

In [31]:

sub_index = status[status['predict'] == 1].index
sub_df = pd.DataFrame(list(zip(sub_index, test_predict)))
sub_df.head()

Out[31]:

	0	1
0	8761	14.971493
1	8762	14.973425
2	8763	14.972256
3	8764	14.959053
4	8765	14.980478

ヘッダ無しで、1列目にid、2列目に予想した値を出力してください。

In [32]:

#提出用csvの作成
sub_df.to_csv("submission.csv",index=False, header=False)

create date : Oct. 1, 2021 at 11:11:53

Student Cup 2021秋 予測部門 チュートリアル

概要¶

1.データの読み込みと確認¶

2.特徴量生成¶

3.モデリング¶

4.提出物の作成¶

1.データの読み込みと確認¶

2.特徴量の生成¶

3.モデリング¶

4.提出物の作成¶

SIGNATE Competition

General posting guidelines

This is a forum used by SIGNATE members to exchange thoughts and ideas on data science and competitions. As your membership here is conditional, please keep in mind to familiarize yourself before joining in on discussion.

Please sign in

本コンペに参加するには下記項目への入力が必須です

本コンペに参加するには下記項目への入力が必須です

Must update your profile to join the competition

Must update your profile to join the competition

本コンペに参加するには下記項目への入力が必須です

本コンペに参加するには下記項目への入力が必須です

本コンペに参加するには下記項目への入力が必須です

Must update your profile to join the competition

Error details

Preparing to download the contents.

Must update your profile to join the competition

Student Cup 2021秋予測部門チュートリアル