必要なライブラリ

日付の処理にdatetime、ファイルの処理にpandas、基本的な数値計算にnumpy、モデルの構築にscikit-learnを使用します。

必要なデータの用意

https://signate.jp/competitions/48/data へアクセスし, 以下のファイルをダウンロードします。
データは、data フォルダへ纏めておきます。

発電量データ(train_kwh.tsv)
気象予報データ(forecast.zip)

実装

まず必要なライブラリをインポートします。

import pandas as pd
import numpy as np
import datetime as dt
from sklearn import linear_model

1. データの前処理

学習期間と評価期間の日付の間隔が異なっており、また時刻の表記が特殊（0030-2400）なため、タイムスタンプ型で統一した上で間隔を合わせます。

# 各データ期間のタイムスタンプを作成
train_rng = pd.date_range("201201010000", "201512312355", freq="10T") # 学習期間：10分間隔
test_rng = pd.date_range("201601010000", "201703312330", freq="30T")  # 評価期間：30分間隔
total_rng = pd.date_range("201201010000", "201703312330", freq="30T") # 学習＋評価期間：30分間隔

# 学習データの読み込み
train = pd.read_table("data/train_kwh.tsv", sep="\t", index_col=0)

# 処理しやすいように、学習データのインデックスをタイムスタンプ型に置き換える
train.index = train_rng

# 評価期間の間隔30分に合わせて、発電量を30分値に集計
train = train.groupby(pd.Grouper(freq="30T")).sum()

train.head()

発電量データのインデックスがタイムスタンプ型となり、発電量も30分単位の集計値となりました。
次に特徴量を作成します。

# 学習＋評価期間の日時情報から月と時間の情報を抜き出しリストに格納
months = [x.month for x in total_rng]
hours = [x.hour for x in total_rng]

# 気象予報データを読み込み
ka_forecast = pd.read_table("data/forecast_kanagawa.tsv", sep="\t", index_col=0, parse_dates=True)  # 神奈川
ya_forecast = pd.read_table("data/forecast_yamanashi.tsv", sep="\t", index_col=0, parse_dates=True) # 山梨

# 気象予報データから3時間毎天気予報("we_"で始まるカラム)のみを抜き出す
ka_weather = ka_forecast.loc[:, ka_forecast.columns.str.startswith("we_")]
ya_weather = ya_forecast.loc[:, ya_forecast.columns.str.startswith("we_")]

# 横並びの3時間毎天気予報を1次元配列に変換
ka_weather = ka_weather.values.flatten()
ya_weather = ya_weather.values.flatten()

# 予測は30分単位ですが3時間は全て同じ天気と仮定し、間隔を合わせるために3時間毎天気予報に6を乗じる（3時間=30分×6）
ka_weather = np.repeat(ka_weather, 6)
ya_weather = np.repeat(ya_weather, 6)

# 月と時間、天気情報をデータフレームに変換
ka_df = pd.DataFrame({"m":months, "h":hours, "weather":ka_weather}, dtype="str", index=total_rng)
ya_df = pd.DataFrame({"m":months, "h":hours, "weather":ya_weather}, dtype="str", index=total_rng)

# データフレームのカテゴリカル変数をダミー化して特徴量を作成 
ka_X = pd.get_dummies(ka_df, drop_first = True)
ya_X = pd.get_dummies(ya_df, drop_first = True)

ka_X.head()

これで、月、時間、天気予報を使った特徴量が作成できました。
次に、学習用の目的変数と説明変数、評価用の説明変数を発電所毎に作成します。

# 発電所毎に、欠損レコードを除いた発電量を抜き出し、それを学習用の目的変数とする
train01_index = train[train["SOLA01"].notnull()].index # 浮島発電所
train02_index = train[train["SOLA02"].notnull()].index # 扇島発電所
train03_index = train[train["SOLA03"].notnull()].index # 米倉山発電所
train01_y = train.loc[train01_index]["SOLA01"]
train02_y = train.loc[train02_index]["SOLA02"]
train03_y = train.loc[train03_index]["SOLA03"]

# 発電所毎に、欠損レコードを除いた特徴量を抜き出し、それを学習用の説明変数とする（浮島と扇島は神奈川の天気予報、米倉山は山梨の天気予報を適用）
train01_X = ka_X.loc[train01_index]
train02_X = ka_X.loc[train02_index]
train03_X = ya_X.loc[train03_index]

# 発電所毎に、評価用の説明変数を切り出す
test01_X = ka_X["2016":]
test02_X = ka_X["2016":]
test03_X = ya_X["2016":]

2. モデルの構築

scikit-learnを用いて、線形回帰モデルを構築します。

# 線形回帰モデルのインスタンス作成
reg01 = linear_model.LinearRegression()
reg02 = linear_model.LinearRegression()
reg03 = linear_model.LinearRegression()

# 学習データをフィッティング
reg01.fit(train01_X, train01_y)
reg02.fit(train02_X, train02_y)
reg03.fit(train03_X, train03_y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

3. 予測値の出力

学習したモデルを用いて、評価期間に対する発電量の予測値を出力し、応募用ファイルに加工・出力します。

# 評価期間の発電量の予測
pred01 = reg01.predict(test01_X)
pred02 = reg02.predict(test02_X)
pred03 = reg03.predict(test03_X)

# 発電量は0以上のため、負の値は0となるよう予測値を補正
pred01[pred01 < 0] = 0
pred02[pred02 < 0] = 0
pred03[pred03 < 0] = 0

# タイムスタンプ型の日付を、応募ファイルのフォーマットに合わせて、yyyymmddhhmmの数値且つ時刻が0030-2400で表現されるよう変換
test_rng = test_rng.strftime("%Y%m%d%H%M").astype(np.int64)
test_rng = [x+30 if x%100==0 else x+70 for x in test_rng]

# 応募ファイルを作成し、ファイル出力
submit = pd.DataFrame({"SOLA01":pred01, "SOLA02":pred02, "SOLA03":pred03}, index=test_rng)
submit.to_csv("submit_file.tsv", sep="\t", header=None)

結果、評価データに対するスコアは312となりました。

まとめ

データの前処理→モデル構築→予測までの実装を一通り説明しました。
特徴量の選択・設計や、モデル選択、パラメータチューニング等、工夫の余地はまだまだありますので、精度改善に挑戦してみてください！
皆さんのご応募をお待ちしております.。

	SOLA01	SOLA02	SOLA03
2012-01-01 00:00:00	0	0.0	NaN
2012-01-01 00:30:00	0	0.0	NaN
2012-01-01 01:00:00	0	0.0	NaN
2012-01-01 01:30:00	0	0.0	NaN
2012-01-01 02:00:00	0	0.0	NaN

	h_1	h_2	...
2012-01-01 00:00:00	0	0	...
2012-01-01 00:30:00	0	0	...
2012-01-01 01:00:00	1	0	...
2012-01-01 01:30:00	1	0	...
2012-01-01 02:00:00	0	1	...

3時間毎の天気予報を用いたシンプルな線形回帰モデルの構築

必要なライブラリ

必要なデータの用意

実装

1. データの前処理

2. モデルの構築

3. 予測値の出力

まとめ

Article 1. Definitions

Article 2. Competition

Article 3 Reward and Vesting of Rights

Article 4 Confidentiality

Article 5 Prohibited Acts of Participants

Article 6. Change, Discontinuation or Termination of Provision of Services under These Terms

Article 7 Modification of Terms

This is a forum used by SIGNATE members to exchange thoughts and ideas on data science and competitions. As your membership here is conditional, please keep in mind to familiarize yourself before joining in on discussion.

3時間毎の天気予報を用いたシンプルな線形回帰モデルの構築

必要なライブラリ

必要なデータの用意

実装

1. データの前処理

2. モデルの構築

3. 予測値の出力

まとめ

SIGNATE Competition

Article 1. Definitions

Article 2. Competition

Article 3 Reward and Vesting of Rights

Article 4 Confidentiality

Article 5 Prohibited Acts of Participants

Article 6. Change, Discontinuation or Termination of Provision of Services under These Terms

Article 7 Modification of Terms

General posting guidelines

This is a forum used by SIGNATE members to exchange thoughts and ideas on data science and competitions. As your membership here is conditional, please keep in mind to familiarize yourself before joining in on discussion.

Please sign in

本コンペに参加するには下記項目への入力が必須です

本コンペに参加するには下記項目への入力が必須です

Must update your profile to join the competition

Must update your profile to join the competition

本コンペに参加するには下記項目への入力が必須です

本コンペに参加するには下記項目への入力が必須です

本コンペに参加するには下記項目への入力が必須です

Must update your profile to join the competition

Error details

Preparing to download the contents.

Must update your profile to join the competition