J.League Attendance Prediction: Example Implementation

The following analysis environment is assumed.

numpy==1.23.1
pandas==1.4.4
matplotlib==3.6.1
scikit-learn==1.1.1

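Table of contents

1. Importing libraries
2. Loading the data
3. Merging the data
4. Preprocessing the data
5. Visualizing the data
6. Training and evaluation
7. Prediction and submission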

1. Importing libraries

In [1]:

import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import japanize_matplotlib
%matplotlib inline

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split

2. Loading the data

In [2]:

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train_add = pd.read_csv('train_add.csv')
add_2014 = pd.read_csv('2014_add.csv')
stadium = pd.read_csv('stadium.csv')
condition = pd.read_csv('condition.csv')
condition_add = pd.read_csv('condition_add.csv')

First, let's check the contents and the shape of each dataset.

In [3]:

print('train shape: ', train.shape)
print('train_add shape: ', train_add.shape)
print('test shape: ',test.shape)
print('condition shape: ', condition.shape)
print('condition_add shape: ', condition_add.shape)
print('stadium shape:', stadium.shape)

train shape:  (1721, 11)
train_add shape:  (232, 11)
test shape:  (313, 10)
condition shape:  (2034, 31)
condition_add shape:  (270, 31)
stadium shape: (59, 3)

In [4]:

# Inspect train and train_add
display(train.head(), train_add.head())

	id	y	year	stage	match	gameday	time	home	away	stadium	tv
0	13994	18250	2012	Ｊ１	第１節第１日	03/10(土)	14:04	ベガルタ仙台	鹿島アントラーズ	ユアテックスタジアム仙台	スカパー／ｅ２／スカパー光／ＮＨＫ総合
1	13995	24316	2012	Ｊ１	第１節第１日	03/10(土)	14:04	名古屋グランパス	清水エスパルス	豊田スタジアム	スカパー／ｅ２／スカパー光（Ｊ　ＳＰＯＲＴＳ　４）／ＮＨＫ名古屋
2	13996	17066	2012	Ｊ１	第１節第１日	03/10(土)	14:04	ガンバ大阪	ヴィッセル神戸	万博記念競技場	スカパー／ｅ２／スカパー光（Ｊ　ＳＰＯＲＴＳ　１）／ＮＨＫ大阪
3	13997	29603	2012	Ｊ１	第１節第１日	03/10(土)	14:06	サンフレッチェ広島	浦和レッズ	エディオンスタジアム広島	スカパー／ｅ２／スカパー光／ＮＨＫ広島
4	13998	25353	2012	Ｊ１	第１節第１日	03/10(土)	14:04	コンサドーレ札幌	ジュビロ磐田	札幌ドーム	スカパー／ｅ２／スカパー光（スカイ・Ａ　ｓｐｏｒｔｓ＋）／ＮＨＫ札幌

	id	y	year	stage	match	gameday	time	home	away	stadium	tv
0	14003	19010	2012	Ｊ１	第２節第１日	03/17(土)	14:04	鹿島アントラーズ	川崎フロンターレ	県立カシマサッカースタジアム	スカパー／ｅ２／スカパー光／ＮＨＫ水戸
1	14020	15072	2012	Ｊ１	第３節第２日	03/25(日)	19:03	ガンバ大阪	ジュビロ磐田	万博記念競技場	スカパー／ｅ２／スカパー光
2	14023	25743	2012	Ｊ１	第４節第１日	03/31(土)	15:03	浦和レッズ	川崎フロンターレ	埼玉スタジアム２００２	スカパー／ｅ２／スカパー光／テレ玉
3	14076	24183	2012	Ｊ１	第１０節第１日	05/06(日)	13:03	横浜Ｆ・マリノス	コンサドーレ札幌	日産スタジアム	スカパー／ｅ２／スカパー光
4	14081	20512	2012	Ｊ１	第１０節第１日	05/06(日)	17:03	名古屋グランパス	川崎フロンターレ	豊田スタジアム	スカパー／ｅ２／スカパー光（Ｊ　ＳＰＯＲＴＳ　４）／名古屋テレビ（録）

In [5]:

# Inspect condition and condition_add
display(condition.head(), condition_add.head())

	id	home_score	away_score	weather	temperature	humidity	referee	home_team	home_01	home_02	...	away_02	away_03	away_04	away_05	away_06	away_07	away_08	away_09	away_10	away_11
0	13994	1	0	雨	3.8	66%	木村　博之	ベガルタ仙台	林　卓人	菅井　直樹	...	新井場　徹	岩政　大樹	中田　浩二	アレックス	青木　剛	増田　誓志	小笠原　満男	本山　雅志	大迫　勇也	ジュニーニョ
1	13995	1	0	屋内	12.4	43%	西村　雄一	名古屋グランパス	楢﨑　正剛	田中　隼磨	...	吉田　豊	岩下　敬輔	カルフィン　ヨン　ア　ピン	李　記帝	村松　大輔	河井　陽介	枝村　匠馬	高木　俊幸	アレックス	大前　元紀
2	13996	2	3	晴一時雨	11.3	41%	高山　啓義	ガンバ大阪	藤ヶ谷　陽介	加地　亮	...	近藤　岳登	北本　久仁衛	伊野波　雅彦	相馬　崇人	三原　雅俊	田中　英雄	野沢　拓也	橋本　英郎	森岡　亮太	大久保　嘉人
3	13997	1	0	曇一時雨のち晴	11.4	52%	松尾　一	サンフレッチェ広島	西川　周作	森脇　良太	...	濱田　水輝	阿部　勇樹	槙野　智章	平川　忠亮	鈴木　啓太	山田　直輝	梅崎　司	柏木　陽介	原口　元気	田中　達也
4	13998	0	0	屋内	22.5	32%	廣瀬　格	コンサドーレ札幌	李　昊乗	高木　純平	...	駒野　友一	チョ　ビョングク	藤田　義明	山本　脩斗	小林　裕紀	山本　康裕	山田　大記	松浦　拓弥	菅沼　実	前田　遼一

5 rows × 31 columns

	id	home_score	away_score	weather	temperature	humidity	referee	home_team	home_01	home_02	...	away_02	away_03	away_04	away_05	away_06	away_07	away_08	away_09	away_10	away_11
0	14003	0	1	雨	13.3	86%	西村　雄一	鹿島アントラーズ	曽ヶ端　準	新井場　徹	...	實藤　友紀	ジェシ	森下　俊	小宮山　尊信	中村　憲剛	柴崎　晃誠	田坂　祐介	山瀬　功治	レナト	小松　塁
1	14020	1	2	曇	4.6	56%	家本　政明	ガンバ大阪	藤ヶ谷　陽介	加地　亮	...	駒野　友一	チョ　ビョングク	藤田　義明	金沢　浄	小林　裕紀	山本　康裕	山田　大記	松浦　拓弥	菅沼　実	前田　遼一
2	14023	1	1	雨	10.0	65%	家本　政明	浦和レッズ	加藤　順大	坪井　慶介	...	田中　裕介	ジェシ	森下　俊	小宮山　尊信	中村　憲剛	柴崎　晃誠	田坂　祐介	山瀬　功治	レナト	小松　塁
3	14076	2	1	晴	27.9	47%	今村　義朗	横浜Ｆ・マリノス	飯倉　大樹	小林　祐三	...	日高　拓磨	ジェイド　ノース	櫛引　一紀	岩沼　俊介	河合　竜二	宮澤　裕樹	古田　寛幸	近藤　祐介	高木　純平	前田　俊介
4	14081	2	3	晴	19.0	48%	吉田　寿光	名古屋グランパス	楢﨑　正剛	石櫃　洋祐	...	田中　裕介	井川　祐輔	森下　俊	登里　享平	稲本　潤一	中村　憲剛	大島　僚太	田坂　祐介	楠神　順平	矢島　卓郎

5 rows × 31 columns

In [6]:

# Inspect stadium
stadium.head()

Out[6]:

	name	address	capa
0	名古屋市瑞穂陸上競技場	愛知県名古屋市瑞穂区山下通5-1	20000
1	豊田スタジアム	愛知県豊田市千石町7-2	40000
2	フクダ電子アリーナ	千葉県千葉市中央区川崎町1-20	18500
3	日立柏サッカー場	千葉県柏市日立台1-2-50	15349
4	ニンジニアスタジアム	愛媛県松山市上野町乙46	15576

In [7]:

# Inspect test
test.head()

Out[7]:

	id	year	stage	match	gameday	time	home	away	stadium	tv
0	15822	2014	Ｊ１	第１８節第１日	08/02(土)	19:04	ベガルタ仙台	大宮アルディージャ	ユアテックスタジアム仙台	スカパー！／スカパー！プレミアムサービス
1	15823	2014	Ｊ１	第１８節第１日	08/02(土)	18:34	鹿島アントラーズ	サンフレッチェ広島	県立カシマサッカースタジアム	スカパー！／スカパー！プレミアムサービス
2	15824	2014	Ｊ１	第１８節第１日	08/02(土)	19:04	浦和レッズ	ヴィッセル神戸	埼玉スタジアム２００２	スカパー！／スカパー！プレミアムサービス／ＮＨＫ　ＢＳ１／テレ玉
3	15825	2014	Ｊ１	第１８節第１日	08/02(土)	19:03	柏レイソル	川崎フロンターレ	日立柏サッカー場	スカパー！／スカパー！プレミアムサービス
4	15827	2014	Ｊ１	第１８節第１日	08/02(土)	19:03	アルビレックス新潟	セレッソ大阪	デンカビッグスワンスタジアム	スカパー！／スカパー！プレミアムサービス

The information in each dataset is summarized below.

train, train_add columns

id : match ID
y : attendance (target variable)
year : year the match was held
stage : league (J1, J2)
match : match schedule information (round and day)
gameday : match date
time : kickoff time
home : home team (the team based at the host stadium)
away : away team
stadium : name of the host stadium
tv : services broadcasting the match live

condition, condition_add columns

id : match ID
home_score : home team's score
away_score : away team's score
weather : weather
temperature : temperature
humidity : humidity
referee : name of the main referee
home_team : home team name
home_01~home_11 : home team's starting lineup
away_01~away_11 : away team's starting lineup

stadium columns

name : stadium name
address : stadium address
capa : stadium capacity

3. Merging the data

First, we concatenate the additional data (train_add, condition_add).

In [8]:

# Concatenate train with train_add, and condition with condition_add
full_train = pd.concat([train, train_add], axis=0)
full_condition = pd.concat([condition, condition_add], axis=0)

In [9]:

# Check that the shapes before and after concatenation are consistent
print('train concat')
print('before: ', train.shape, train_add.shape)
print('after: ', full_train.shape)

train concat
before:  (1721, 11) (232, 11)
after:  (1953, 11)

In [10]:

print('condition concat')
print('before: ', condition.shape, condition_add.shape)
print('after: ', full_condition.shape)

condition concat
before:  (2034, 31) (270, 31)
after:  (2304, 31)

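full_train and full_condition can be merged on id, and full_train and stadium can be merged on the stadium name.
Players and referees are not used this time; only features that look easy to use for analysis are merged.
Let's finish building the train and test datasets.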

In [11]:

# Select the features to merge
stadium = stadium.rename(columns={'name': 'stadium'})
stadium = stadium[['stadium', 'capa']]
full_condition = full_condition[['id' ,'weather' ,'temperature' ,'humidity']]

In [12]:

# Merge
full_train = pd.merge(full_train, stadium, on='stadium',  how='left')
full_train = pd.merge(full_train, full_condition, on='id',  how='left')
full_test = pd.merge(test, stadium, on='stadium',  how='left')
full_test = pd.merge(full_test, full_condition, on='id',  how='left')

In [13]:

# Check the result of the merge
display(full_train.head(), full_test.head())
print('full_train shape: ', full_train.shape)
print('full_test shape: ', full_test.shape)

	id	y	year	stage	match	gameday	time	home	away	stadium	tv	capa	weather	temperature	humidity
0	13994	18250	2012	Ｊ１	第１節第１日	03/10(土)	14:04	ベガルタ仙台	鹿島アントラーズ	ユアテックスタジアム仙台	スカパー／ｅ２／スカパー光／ＮＨＫ総合	19694	雨	3.8	66%
1	13995	24316	2012	Ｊ１	第１節第１日	03/10(土)	14:04	名古屋グランパス	清水エスパルス	豊田スタジアム	スカパー／ｅ２／スカパー光（Ｊ　ＳＰＯＲＴＳ　４）／ＮＨＫ名古屋	40000	屋内	12.4	43%
2	13996	17066	2012	Ｊ１	第１節第１日	03/10(土)	14:04	ガンバ大阪	ヴィッセル神戸	万博記念競技場	スカパー／ｅ２／スカパー光（Ｊ　ＳＰＯＲＴＳ　１）／ＮＨＫ大阪	21000	晴一時雨	11.3	41%
3	13997	29603	2012	Ｊ１	第１節第１日	03/10(土)	14:06	サンフレッチェ広島	浦和レッズ	エディオンスタジアム広島	スカパー／ｅ２／スカパー光／ＮＨＫ広島	50000	曇一時雨のち晴	11.4	52%
4	13998	25353	2012	Ｊ１	第１節第１日	03/10(土)	14:04	コンサドーレ札幌	ジュビロ磐田	札幌ドーム	スカパー／ｅ２／スカパー光（スカイ・Ａ　ｓｐｏｒｔｓ＋）／ＮＨＫ札幌	39232	屋内	22.5	32%

	id	year	stage	match	gameday	time	home	away	stadium	tv	capa	weather	temperature	humidity
0	15822	2014	Ｊ１	第１８節第１日	08/02(土)	19:04	ベガルタ仙台	大宮アルディージャ	ユアテックスタジアム仙台	スカパー！／スカパー！プレミアムサービス	19694	晴	27.4	70%
1	15823	2014	Ｊ１	第１８節第１日	08/02(土)	18:34	鹿島アントラーズ	サンフレッチェ広島	県立カシマサッカースタジアム	スカパー！／スカパー！プレミアムサービス	40728	晴	30.8	65%
2	15824	2014	Ｊ１	第１８節第１日	08/02(土)	19:04	浦和レッズ	ヴィッセル神戸	埼玉スタジアム２００２	スカパー！／スカパー！プレミアムサービス／ＮＨＫ　ＢＳ１／テレ玉	63700	晴	31.7	58%
3	15825	2014	Ｊ１	第１８節第１日	08/02(土)	19:03	柏レイソル	川崎フロンターレ	日立柏サッカー場	スカパー！／スカパー！プレミアムサービス	15349	晴	29.3	76%
4	15827	2014	Ｊ１	第１８節第１日	08/02(土)	19:03	アルビレックス新潟	セレッソ大阪	デンカビッグスワンスタジアム	スカパー！／スカパー！プレミアムサービス	42300	晴	30.4	68%

full_train shape:  (1953, 15)
full_test shape:  (313, 14)
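
A left merge keeps every row on the left side and fills in NaN when a key has no match, so it can be worth checking the join explicitly. The following is a minimal sketch on toy data, not part of the original notebook: the frames `games` and `venues` and the stadium name `Unknown Park` are made up for illustration, and the use of the `validate` and `indicator` options of `pd.merge` is one possible way to verify the join.

```python
import pandas as pd

# Toy data; the frames and the name 'Unknown Park' are hypothetical.
games = pd.DataFrame({'id': [1, 2, 3],
                      'stadium': ['豊田スタジアム', '札幌ドーム', 'Unknown Park']})
venues = pd.DataFrame({'stadium': ['豊田スタジアム', '札幌ドーム'],
                       'capa': [40000, 39232]})

# validate='many_to_one' raises an error if a stadium appears twice in `venues`.
# indicator=True adds a `_merge` column showing whether each row found a match.
merged = pd.merge(games, venues, on='stadium', how='left',
                  validate='many_to_one', indicator=True)

# Stadium names on the left side that had no counterpart on the right.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'stadium'].tolist()
print(unmatched)
```

If `unmatched` is empty and the row count is unchanged, the merge neither dropped nor duplicated matches.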

In [14]:

# Check for missing values
full_train.isnull().sum()

Out[14]:

id             0
y              0
year           0
stage          0
match          0
gameday        0
time           0
home           0
away           0
stadium        0
tv             0
capa           0
weather        0
temperature    0
humidity       0
dtype: int64

In [15]:

full_test.isnull().sum()

Out[15]:

id             0
year           0
stage          0
match          0
gameday        0
time           0
home           0
away           0
stadium        0
tv             0
capa           0
weather        0
temperature    0
humidity       0
dtype: int64

In [16]:

full_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1953 entries, 0 to 1952
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           1953 non-null   int64  
 1   y            1953 non-null   int64  
 2   year         1953 non-null   int64  
 3   stage        1953 non-null   object 
 4   match        1953 non-null   object 
 5   gameday      1953 non-null   object 
 6   time         1953 non-null   object 
 7   home         1953 non-null   object 
 8   away         1953 non-null   object 
 9   stadium      1953 non-null   object 
 10  tv           1953 non-null   object 
 11  capa         1953 non-null   int64  
 12  weather      1953 non-null   object 
 13  temperature  1953 non-null   float64
 14  humidity     1953 non-null   object 
dtypes: float64(1), int64(4), object(10)
memory usage: 244.1+ KB

In [17]:

full_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 313 entries, 0 to 312
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           313 non-null    int64  
 1   year         313 non-null    int64  
 2   stage        313 non-null    object 
 3   match        313 non-null    object 
 4   gameday      313 non-null    object 
 5   time         313 non-null    object 
 6   home         313 non-null    object 
 7   away         313 non-null    object 
 8   stadium      313 non-null    object 
 9   tv           313 non-null    object 
 10  capa         313 non-null    int64  
 11  weather      313 non-null    object 
 12  temperature  313 non-null    float64
 13  humidity     313 non-null    object 
dtypes: float64(1), int64(3), object(10)
memory usage: 36.7+ KB

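4. Preprocessing the data

Features of object dtype are hard to use directly in the analysis, so we convert them to numeric types (int, float, etc.).
First, let's check which features have object dtype.
We then convert match, gameday, time, tv, and humidity.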

In [18]:

# Inspect the object-dtype columns
full_train.select_dtypes(include=object).head()

Out[18]:

	stage	match	gameday	time	home	away	stadium	tv	weather	humidity
0	Ｊ１	第１節第１日	03/10(土)	14:04	ベガルタ仙台	鹿島アントラーズ	ユアテックスタジアム仙台	スカパー／ｅ２／スカパー光／ＮＨＫ総合	雨	66%
1	Ｊ１	第１節第１日	03/10(土)	14:04	名古屋グランパス	清水エスパルス	豊田スタジアム	スカパー／ｅ２／スカパー光（Ｊ　ＳＰＯＲＴＳ　４）／ＮＨＫ名古屋	屋内	43%
2	Ｊ１	第１節第１日	03/10(土)	14:04	ガンバ大阪	ヴィッセル神戸	万博記念競技場	スカパー／ｅ２／スカパー光（Ｊ　ＳＰＯＲＴＳ　１）／ＮＨＫ大阪	晴一時雨	41%
3	Ｊ１	第１節第１日	03/10(土)	14:06	サンフレッチェ広島	浦和レッズ	エディオンスタジアム広島	スカパー／ｅ２／スカパー光／ＮＨＫ広島	曇一時雨のち晴	52%
4	Ｊ１	第１節第１日	03/10(土)	14:04	コンサドーレ札幌	ジュビロ磐田	札幌ドーム	スカパー／ｅ２／スカパー光（スカイ・Ａ　ｓｐｏｒｔｓ＋）／ＮＨＫ札幌	屋内	32%

In [19]:

# Extract the round number in 第○節 as section
full_train['section'] = full_train['match'].apply(lambda x: x.split('節')[0][1:]).astype(int)
full_test['section'] = full_test['match'].apply(lambda x: x.split('節')[0][1:]).astype(int)

In [20]:

# Extract the month and the day of the week from gameday
full_train['month'] = full_train['gameday'].apply(lambda x: x[:2]).astype(int)
full_train['weekday'] = full_train['gameday'].apply(lambda x: x[6])
full_test['month'] = full_test['gameday'].apply(lambda x: x[:2]).astype(int)
full_test['weekday'] = full_test['gameday'].apply(lambda x: x[6])

In [21]:

# Extract the hour from time
full_train['hour'] = full_train['time'].apply(lambda x: x.split(':')[0]).astype(int)
full_test['hour'] = full_test['time'].apply(lambda x: x.split(':')[0]).astype(int)

In [22]:

# Count the number of broadcast services in tv
full_train['num_tv'] = full_train['tv'].apply(lambda x: len(x.split('／')))
full_test['num_tv'] = full_test['tv'].apply(lambda x: len(x.split('／')))

In [23]:

# Strip the % sign from humidity and convert it to a number
full_train['humidity'] = full_train['humidity'].apply(lambda x: x.rstrip('%')).astype(int)
full_test['humidity'] = full_test['humidity'].apply(lambda x: x.rstrip('%')).astype(int)
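
The same conversions can also be written with the pandas `.str` accessor instead of `apply` with a lambda. The sketch below is an alternative formulation on a two-row toy frame (the frame `sample` is hypothetical; its values are copied from the tables above), not the notebook's own method. Note that the round numbers use full-width digits such as `１８`; Python's `int()` accepts them, which is why the conversion works.

```python
import pandas as pd

# Two rows with values taken from the tables above; `sample` is a hypothetical name.
sample = pd.DataFrame({
    'match': ['第１節第１日', '第１８節第１日'],
    'gameday': ['03/10(土)', '08/02(土)'],
    'time': ['14:04', '19:04'],
    'tv': ['スカパー／ｅ２／スカパー光／ＮＨＫ総合',
           'スカパー！／スカパー！プレミアムサービス'],
    'humidity': ['66%', '70%'],
})

# \d also matches full-width digits, and int() converts them to ordinary integers.
sample['section'] = sample['match'].str.extract(r'第(\d+)節', expand=False).map(int)
sample['month'] = sample['gameday'].str[:2].map(int)
sample['weekday'] = sample['gameday'].str[6]          # character inside the parentheses
sample['hour'] = sample['time'].str.split(':').str[0].map(int)
sample['num_tv'] = sample['tv'].str.count('／') + 1    # separators + 1 = number of services
sample['humidity'] = sample['humidity'].str.rstrip('%').map(int)

print(sample[['section', 'month', 'weekday', 'hour', 'num_tv', 'humidity']])
```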

In [24]:

# Inspect the finished data
display(full_train.head(3), full_test.head(3))

	id	y	year	stage	match	gameday	time	home	away	stadium	tv	capa	weather	temperature	humidity	section	month	weekday	hour	num_tv
0	13994	18250	2012	Ｊ１	第１節第１日	03/10(土)	14:04	ベガルタ仙台	鹿島アントラーズ	ユアテックスタジアム仙台	スカパー／ｅ２／スカパー光／ＮＨＫ総合	19694	雨	3.8	66	1	3	土	14	4
1	13995	24316	2012	Ｊ１	第１節第１日	03/10(土)	14:04	名古屋グランパス	清水エスパルス	豊田スタジアム	スカパー／ｅ２／スカパー光（Ｊ　ＳＰＯＲＴＳ　４）／ＮＨＫ名古屋	40000	屋内	12.4	43	1	3	土	14	4
2	13996	17066	2012	Ｊ１	第１節第１日	03/10(土)	14:04	ガンバ大阪	ヴィッセル神戸	万博記念競技場	スカパー／ｅ２／スカパー光（Ｊ　ＳＰＯＲＴＳ　１）／ＮＨＫ大阪	21000	晴一時雨	11.3	41	1	3	土	14	4

	id	year	stage	match	gameday	time	home	away	stadium	tv	capa	weather	temperature	humidity	section	month	weekday	hour	num_tv
0	15822	2014	Ｊ１	第１８節第１日	08/02(土)	19:04	ベガルタ仙台	大宮アルディージャ	ユアテックスタジアム仙台	スカパー！／スカパー！プレミアムサービス	19694	晴	27.4	70	18	8	土	19	2
1	15823	2014	Ｊ１	第１８節第１日	08/02(土)	18:34	鹿島アントラーズ	サンフレッチェ広島	県立カシマサッカースタジアム	スカパー！／スカパー！プレミアムサービス	40728	晴	30.8	65	18	8	土	18	2
2	15824	2014	Ｊ１	第１８節第１日	08/02(土)	19:04	浦和レッズ	ヴィッセル神戸	埼玉スタジアム２００２	スカパー！／スカパー！プレミアムサービス／ＮＨＫ　ＢＳ１／テレ玉	63700	晴	31.7	58	18	8	土	19	4

In [25]:

full_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1953 entries, 0 to 1952
Data columns (total 20 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           1953 non-null   int64  
 1   y            1953 non-null   int64  
 2   year         1953 non-null   int64  
 3   stage        1953 non-null   object 
 4   match        1953 non-null   object 
 5   gameday      1953 non-null   object 
 6   time         1953 non-null   object 
 7   home         1953 non-null   object 
 8   away         1953 non-null   object 
 9   stadium      1953 non-null   object 
 10  tv           1953 non-null   object 
 11  capa         1953 non-null   int64  
 12  weather      1953 non-null   object 
 13  temperature  1953 non-null   float64
 14  humidity     1953 non-null   int64  
 15  section      1953 non-null   int64  
 16  month        1953 non-null   int64  
 17  weekday      1953 non-null   object 
 18  hour         1953 non-null   int64  
 19  num_tv       1953 non-null   int64  
dtypes: float64(1), int64(9), object(10)
memory usage: 320.4+ KB

In [26]:

full_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 313 entries, 0 to 312
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           313 non-null    int64  
 1   year         313 non-null    int64  
 2   stage        313 non-null    object 
 3   match        313 non-null    object 
 4   gameday      313 non-null    object 
 5   time         313 non-null    object 
 6   home         313 non-null    object 
 7   away         313 non-null    object 
 8   stadium      313 non-null    object 
 9   tv           313 non-null    object 
 10  capa         313 non-null    int64  
 11  weather      313 non-null    object 
 12  temperature  313 non-null    float64
 13  humidity     313 non-null    int64  
 14  section      313 non-null    int64  
 15  month        313 non-null    int64  
 16  weekday      313 non-null    object 
 17  hour         313 non-null    int64  
 18  num_tv       313 non-null    int64  
dtypes: float64(1), int64(8), object(10)
memory usage: 48.9+ KB

In [27]:

# Compute summary statistics for the numeric columns
full_train.describe()

Out[27]:

	id	y	year	capa	temperature	humidity	section	month	hour	num_tv
count	1953.000000	1953.000000	1953.000000	1953.000000	1953.000000	1953.000000	1953.000000	1953.000000	1953.000000	1953.000000
mean	15049.442396	10629.558116	2012.820276	25688.549411	20.438914	60.220174	18.050691	6.316948	16.310804	2.656938
std	646.260483	8102.315189	0.758124	14016.934408	6.438737	19.953138	11.153364	2.500493	2.310252	0.715085
min	13994.000000	0.000000	2012.000000	3560.000000	1.400000	12.000000	1.000000	3.000000	12.000000	1.000000
25%	14482.000000	4687.000000	2012.000000	15589.000000	15.800000	44.000000	9.000000	4.000000	14.000000	2.000000
50%	15044.000000	8594.000000	2013.000000	20246.000000	21.400000	63.000000	17.000000	6.000000	16.000000	3.000000
75%	15532.000000	13471.000000	2013.000000	30132.000000	25.600000	77.000000	27.000000	8.000000	19.000000	3.000000
max	16238.000000	62632.000000	2014.000000	72327.000000	34.200000	99.000000	42.000000	12.000000	20.000000	5.000000

In [28]:

full_test.describe()

Out[28]:

	id	year	capa	temperature	humidity	section	month	hour	num_tv
count	313.000000	313.0	313.000000	313.000000	313.000000	313.000000	313.000000	313.000000	313.000000
mean	16142.252396	2014.0	26493.354633	21.666773	64.894569	30.162939	9.418530	16.664537	2.380192
std	224.441223	0.0	14237.841998	5.073073	17.748557	6.249792	1.179652	2.300903	0.609111
min	15822.000000	2014.0	7258.000000	1.300000	19.000000	18.000000	8.000000	12.000000	2.000000
25%	15907.000000	2014.0	15600.000000	19.000000	51.000000	26.000000	8.000000	14.000000	2.000000
50%	16261.000000	2014.0	20396.000000	22.600000	66.000000	30.000000	9.000000	18.000000	2.000000
75%	16346.000000	2014.0	32000.000000	25.200000	79.000000	34.000000	10.000000	19.000000	3.000000
max	16436.000000	2014.0	72327.000000	31.700000	99.000000	42.000000	12.000000	19.000000	5.000000

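The minimum of y (attendance) in the training data is 0.
According to past match records found online, this was a match played behind closed doors, which was exceptional at the time.
Since it is an outlier, this row is not used for training.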

In [29]:

# Inspect the match played behind closed doors
full_train[full_train['y']==0]

Out[29]:

	id	y	year	stage	match	gameday	time	home	away	stadium	tv	capa	weather	temperature	humidity	section	month	weekday	hour	num_tv
1385	15699	0	2014	Ｊ１	第４節第１日	03/23(日)	15:04	浦和レッズ	清水エスパルス	埼玉スタジアム２００２	スカパー！／スカパー！プレミアムサービス／テレ玉	63700	晴	16.2	23	4	3	日	15	3

In [30]:

# Drop the match played behind closed doors
full_train.drop(index=1385, inplace=True)
full_train[full_train['y']==0]

Out[30]:

	id	y	year	stage	match	gameday	time	home	away	stadium	tv	capa	weather	temperature	humidity	section	month	weekday	hour	num_tv
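
Dropping by a hard-coded index label works only as long as the row order stays the same. As a hedged alternative, the sketch below filters by condition instead; the toy frame `df` and its values are hypothetical, and the `y > 0` rule is an assumption that every zero-attendance row should be excluded.

```python
import pandas as pd

# Toy data; ids and attendance values are hypothetical.
df = pd.DataFrame({'id': [101, 102, 103], 'y': [18250, 0, 24316]})

# Keep only rows with positive attendance, then renumber the index from 0.
filtered = df[df['y'] > 0].reset_index(drop=True)
print(filtered)
```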

5. Visualizing the data

First, let's look at how each numeric feature correlates with the target variable y.

In [31]:

full_train.corr()['y']

Out[31]:

id            -0.176920
y              1.000000
year           0.003211
capa           0.688290
temperature   -0.028072
humidity      -0.100557
section       -0.044138
month          0.105861
hour           0.029106
num_tv         0.142387
Name: y, dtype: float64

In [32]:

plt.figure(figsize=(7, 7))
plt.scatter(x=full_train['capa'], y=full_train['y'])
plt.grid()
plt.xlabel('capa')
plt.ylabel('y', rotation=0)
plt.title('Correlation between y and capa: {}'.format(round(full_train.corr()['y']['capa'], 4)))
plt.show()

In [33]:

plt.figure(figsize=(7, 7))
plt.scatter(x=full_train['temperature'], y=full_train['y'])
plt.grid()
plt.xlabel('temperature')
plt.ylabel('y', rotation=0)
plt.title('Correlation between y and temperature: {}'.format(round(full_train.corr()['y']['temperature'], 4)))
plt.show()

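Matches at large stadiums draw correspondingly large crowds (the correlation between capa and y is about 0.69).
Perhaps the expected attendance is anticipated in advance from other factors, such as a popular team or a J1 match.
In contrast, temperature shows almost no linear correlation with y (about -0.03) and humidity only a weak one (about -0.10); weather is categorical and does not appear in this correlation table. Use the code above to check the other variables as well.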
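
To compare all numeric features at once, the correlations with y can be sorted. The sketch below uses a toy frame (all values are hypothetical) and selects numeric columns explicitly with `select_dtypes`, which is an assumption made here so that the same code runs regardless of how a given pandas version treats non-numeric columns in `corr()`.

```python
import pandas as pd

# Toy data; all values are hypothetical.
df = pd.DataFrame({
    'y': [10000, 20000, 30000, 40000],
    'capa': [15000, 25000, 35000, 45000],
    'temperature': [20.0, 10.0, 20.0, 10.0],
    'weather': ['晴', '雨', '曇', '晴'],   # non-numeric, excluded below
})

# Keep numeric columns only, then rank features by their correlation with y.
numeric = df.select_dtypes(include='number')
corr_with_y = numeric.corr()['y'].drop('y').sort_values(ascending=False)
print(corr_with_y)
```

Pearson correlation only captures linear relationships, so a value near 0 does not rule out a non-linear effect.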

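Next, let's look at how the data is distributed.
On the hypothesis that the later part of the season (just before the title is decided) is more exciting and therefore has larger y (attendance), we look at y by section.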

In [34]:

plt.figure(figsize=(10, 7))
full_train.groupby('section')['y'].mean().plot.bar()
plt.grid()
plt.xlabel('section')
plt.xticks(rotation=0)
plt.ylabel('y', rotation=0)
plt.title('Mean y by section')
plt.show()

y (attendance) varies with little apparent relation to section,
but it drops unnaturally around sections 35 to 40, so let's look at the data.

In [35]:

print('Sections 35-40 (6 sections), matches: ', len(full_train.query('35<=section<=40')), 'mean attendance: ', full_train.query('35<=section<=40')['y'].mean())
print('Sections 29-34 (6 sections), matches: ', len(full_train.query('29<=section<=34')), 'mean attendance: ', full_train.query('29<=section<=34')['y'].mean())

Sections 35-40 (6 sections), matches:  132 mean attendance:  6373.037878787879
Sections 29-34 (6 sections), matches:  240 mean attendance:  12023.55

In [36]:

full_train.query('35<=section<=40').stage.value_counts()

Out[36]:

Ｊ２    132
Name: stage, dtype: int64

In [37]:

full_train.query('29<=section<=34').stage.value_counts()

Out[37]:

Ｊ２    132
Ｊ１    108
Name: stage, dtype: int64

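The number of matches differs between the two section ranges, and the mean y (attendance) differs greatly as well.
Comparing stage shows that the change in y appears to be related to the number of J1 matches in each range.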

In [38]:

plt.figure(figsize=(7, 7))
full_train.groupby('stage')['y'].mean().plot.bar()
plt.grid()
plt.xticks(rotation=0)
plt.ylabel('y', rotation=0)
plt.title('Mean attendance per match by stage')
plt.show()

J1 matches draw roughly 2.5 times the y (attendance) of J2 matches.

In [39]:

plt.figure(figsize=(7, 7))
J1_group = full_train[full_train['stage'] == 'Ｊ１'].groupby('section')['y'].mean()
J2_group = full_train[full_train['stage'] == 'Ｊ２'].groupby('section')['y'].mean()
plt.bar(J1_group.index, J1_group, label='J1')
plt.bar(J2_group.index, J2_group, label='J2', align='edge')
plt.legend()
plt.grid()
plt.xlabel('section')
plt.ylabel('y', rotation=0)
plt.title('Mean y by section for each stage')
plt.show()

The chart above shows that J1 matches end at section 34.
As a result, y drops sharply from section 35 onward.
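
The same finding can be read from a table that breaks down mean attendance and match count by section and stage. The following sketch uses a toy frame with hypothetical values; the column names follow the dataset, but the numbers do not.

```python
import pandas as pd

# Toy data; values are hypothetical.
df = pd.DataFrame({
    'stage': ['Ｊ１', 'Ｊ１', 'Ｊ２', 'Ｊ２', 'Ｊ２'],
    'section': [34, 34, 34, 35, 35],
    'y': [20000, 30000, 6000, 5000, 7000],
})

# One row per section, one column group per statistic, split by stage.
# Sections with no matches in a stage show up as NaN.
summary = (df.groupby(['section', 'stage'])['y']
             .agg(['mean', 'count'])
             .unstack('stage'))
print(summary)
```

A NaN in the J1 columns for a section means that no J1 match was played in that section, which is the situation described above for section 35 onward.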

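6. Training and evaluation

Now that we understand the dataset, we select the features to train on and split the dataset so that the model can be trained and validated.
We then build a model with RandomForest.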

In [40]:

# Select the features used for training
use_columns = ['capa', 'section', 'stage', 'month', 'hour', 'num_tv']
y = full_train['y']
train = full_train[use_columns]
test = full_test[use_columns]

In [41]:

# Convert the categorical variable to dummy variables
train = pd.get_dummies(train, drop_first=True)
test = pd.get_dummies(test, drop_first=True)
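
Encoding train and test separately gives matching columns only when both contain the same categories. The sketch below shows one possible safeguard, aligning the test columns to the training columns with `reindex`; the frames `train_x` and `test_x` and their values are hypothetical, and this step is an addition for illustration rather than something the notebook does.

```python
import pandas as pd

# Toy data; values are hypothetical. The test set contains no Ｊ２ rows.
train_x = pd.DataFrame({'capa': [20000, 40000, 15000],
                        'stage': ['Ｊ１', 'Ｊ２', 'Ｊ１']})
test_x = pd.DataFrame({'capa': [30000, 25000],
                       'stage': ['Ｊ１', 'Ｊ１']})

train_d = pd.get_dummies(train_x, drop_first=True)
test_d = pd.get_dummies(test_x, drop_first=True)   # lacks the stage_Ｊ２ column

# Give the test set the training columns, in the same order; missing dummies become 0.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))
```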

In [42]:

# Split into training and validation data
X_train, X_valid, y_train, y_valid = train_test_split(train, y, random_state = 82)
print(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

(1464, 6) (488, 6) (1464,) (488,)

In [43]:

# Train the model
rfr = RandomForestRegressor(random_state=82)
rfr.fit(X_train, y_train)

Out[43]:

RandomForestRegressor(random_state=82)
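
Once a random forest is fitted, its `feature_importances_` attribute gives a rough view of which features the trees relied on. The sketch below uses synthetic data (hypothetical: y is generated mostly from capa) and smaller settings than the notebook, so it illustrates the attribute rather than reproducing the tutorial's model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical): y depends on capa, hour is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'capa': rng.integers(5000, 60000, size=300),
    'hour': rng.integers(12, 21, size=300),
})
y = 0.5 * X['capa'] + rng.normal(0, 1000, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```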

We now evaluate the performance of the trained RandomForest model.

In [44]:

# Predict and evaluate (RMSE)
y_pred_train = rfr.predict(X_train)
rmse_train = np.sqrt(MSE(y_train, y_pred_train))

y_pred_valid = rfr.predict(X_valid)
rmse_valid = np.sqrt(MSE(y_valid, y_pred_valid))

print('Training RMSE:', rmse_train)
print('Validation RMSE:', rmse_valid)

Training RMSE: 1445.4804231253313
Validation RMSE: 3899.170841540232
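
The two numbers above are root mean squared errors (RMSE), i.e. the square root of the mean of the squared differences between actual and predicted attendance, so they are in units of spectators and lower is better. A minimal sketch of the calculation with NumPy only, on hypothetical values:

```python
import numpy as np

# Toy values; hypothetical actual and predicted attendance.
y_true = np.array([10000.0, 20000.0, 30000.0])
y_hat = np.array([11000.0, 18000.0, 30000.0])

# RMSE = sqrt(mean((actual - predicted)^2))
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
print(rmse)
```

Because the errors are squared before averaging, a few badly predicted high-attendance matches can dominate the score.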

In [45]:

# Visualize the predictions
plt.figure(figsize=(7, 5))
plt.scatter(y_valid, y_pred_valid)

min_value = min(y_valid.min(), y_pred_valid.min())
max_value = max(y_valid.max(), y_pred_valid.max())

plt.xlim([min_value,max_value])
plt.ylim([min_value,max_value])
plt.plot([min_value,max_value],[min_value,max_value])

plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

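The training RMSE (about 1,445) is much lower than the validation RMSE (about 3,899), which indicates that the model is overfitting.
Matches with large y (attendance) also tend to be harder to predict.
Improving these points is the natural next step.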
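
One common way to follow up on overfitting is to estimate the error with cross-validation and to constrain the trees. The sketch below is a hedged illustration on synthetic data: the data, `n_estimators=50`, and `min_samples_leaf=5` are arbitrary assumptions, not tuned values or settings from the tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical): three features, y driven by the first one.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 10000 * X[:, 0] + rng.normal(0, 500, size=200)

# min_samples_leaf > 1 stops leaves from fitting single samples; 5 is arbitrary.
model = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=0)

# scikit-learn reports negated RMSE so that higher is better; flip the sign back.
scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse_per_fold = -scores
print(rmse_per_fold.mean())
```

Averaging over five folds gives a steadier estimate than a single train/validation split, which makes it easier to judge whether a change actually helps.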

7. Prediction and submission

Let's submit the test-set predictions from the current model.

In [46]:

# Predict
predict = rfr.predict(test)

In [47]:

# Load sample_submit
submit = pd.read_csv('sample_submit.csv', header=None)

# Apply the predictions and write the file without a header row or index
submit[1] = predict
submit.to_csv('submission_tutorial.csv', header=False, index=False)

In [48]:

submit.head()

Out[48]:

	0	1
0	15822	13942.49
1	15823	16623.51
2	15824	28615.25
3	15825	9444.06
4	15827	21961.54
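
Before uploading, it can help to confirm that the file has the same header-less two-column layout as the sample file read above. The sketch below does a round trip through an in-memory buffer instead of a real file; the ids and predictions are the first two rows shown above, and the checks themselves are an assumption about what is worth verifying.

```python
import io
import pandas as pd

# First two rows from the output above, used as toy data.
submit = pd.DataFrame({0: [15822, 15823], 1: [13942.49, 16623.51]})

# Write without header and index, as in the tutorial, but to memory.
buffer = io.StringIO()
submit.to_csv(buffer, header=False, index=False)

# Read it back the same way sample_submit.csv is read.
buffer.seek(0)
check = pd.read_csv(buffer, header=None)
print(check.shape, check[1].isna().sum())
```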

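Conclusion

The submission scored 4291.039...

We built a model that predicts J.League attendance using RandomForest.
The approach taken in this tutorial is summarized below.

Merge external data to increase the available features and the amount of data.
Use five numeric variables and one categorical variable.
Convert the categorical variable to dummy variables.
Train and predict with RandomForest using these features.

To build on this tutorial, start by increasing the number of explanatory variables used for training.
When doing so, check the relationship between the target variable and the explanatory variables with plots, and form hypotheses about what lies behind the data; this is the shortest path to better accuracy.
(For example, months with many public holidays may have higher attendance.)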
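
As one possible starting point for the holiday hypothesis, year and gameday can be combined into a proper date, from which weekend and holiday flags follow. In the sketch below the two rows are copied from the tables above, while the `holidays` list is a placeholder that would need to be replaced with a real calendar; the column names `is_weekend` and `is_holiday` are hypothetical.

```python
import pandas as pd

# Two rows copied from the tables above.
df = pd.DataFrame({'year': [2012, 2014],
                   'gameday': ['03/10(土)', '08/02(土)']})

# Build a datetime from the year and the MM/DD part of gameday.
df['date'] = pd.to_datetime(df['year'].astype(str) + '/' + df['gameday'].str[:5],
                            format='%Y/%m/%d')

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
df['is_weekend'] = df['date'].dt.dayofweek >= 5

# Placeholder holiday list (assumption); replace with a full holiday calendar.
holidays = pd.to_datetime(['2012-03-20'])
df['is_holiday'] = df['date'].isin(holidays)

print(df[['date', 'is_weekend', 'is_holiday']])
```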

[Practice Problem] J.League Attendance Prediction: Tutorial
