Spam email classification

The analysis assumes the following environment.

numpy==1.23.1
pandas==1.4.4
seaborn==0.12.0
matplotlib==3.6.1
scikit-learn==1.1.1
nltk==3.7
wordcloud==1.8.2.2

Contents

1. Importing libraries
2. Loading the data
3. Visualizing the data
4. Preprocessing the data
5. Training and evaluation
6. Prediction and submission

1. Importing libraries

In [1]:

import warnings
warnings.simplefilter('ignore')

import re
import glob

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import japanize_matplotlib
%matplotlib inline

from collections import defaultdict
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from wordcloud import WordCloud

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import f1_score

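2. Loading the data

The labels are in train_master.tsv, and the email texts are in the train2 and test2 folders.
We use glob to read the files in each folder.
glob returns the names of all files matching a pattern as a list. The order of that list is not guaranteed, so we sort it before use.
We then loop over the list to build the dataset.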
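The loading approach described above can be sketched as follows. This is a minimal illustration, not the tutorial's exact code: the helper name `read_texts` is hypothetical, the UTF-8 encoding is an assumption about the files, and the demo writes two small files to a temporary folder instead of using train2. Keying each text by its file name avoids relying on the file order matching the label file.

```python
import glob
import os
import tempfile


def read_texts(folder, pattern):
    """Return {file_name: text} for the files in folder that match pattern."""
    texts = {}
    # glob gives no ordering guarantee, so sort for a reproducible order.
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        # Assumption: the files are UTF-8; undecodable bytes are dropped.
        with open(path, encoding='utf-8', errors='ignore') as f:
            texts[os.path.basename(path)] = f.read()
    return texts


# Demo on a temporary folder standing in for train2.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [('train_0001.txt', 'Subject: b'),
                       ('train_0000.txt', 'Subject: a')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(body)
    demo_texts = read_texts(tmp, 'train_*.txt')

print(list(demo_texts))
```

With the tutorial's data, something like `train['text'] = train.index.map(read_texts('train2', 'train_*.txt'))` would then attach each text to its label by file name rather than by position.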

In [2]:

# Load the master (label) data
train = pd.read_csv('train_master.tsv', sep='\t', index_col=0)
train.head()

Out[2]:

	label
file_name
train_0000.txt	0
train_0001.txt	0
train_0002.txt	1
train_0003.txt	1
train_0004.txt	0

In [3]:

# Check the data sizes
print(train.shape)
print('train files', len(glob.glob('train2/train_*.txt')))
print('test files', len(glob.glob('test2/test_*.txt')))

(2586, 1)
train files 2586
test files 2586

In [4]:

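# Build the train data
# The files are read in sorted order, which assumes train_master.tsv lists them in file-name order
text_train = []
for file_name in sorted(glob.glob('train2/train_*.txt')):
    with open(file_name) as f:
        text = f.read()
    text_train.append(text)
train['text'] = text_train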

In [5]:

train.head()

Out[5]:

	label	text
file_name
train_0000.txt	0	Subject: re : buyback / deficiency deals works...
train_0001.txt	0	Subject: fw : stress relief\n- - - - - origina...
train_0002.txt	1	Subject: from mrs . juliana\ndear friend ,\npl...
train_0003.txt	1	Subject: [ wrenches ] 68 % off dreamweaver mx ...
train_0004.txt	0	Subject: y 2 k - texas log\nname home pager\ng...

In [6]:

# Build the test data
text_test = []
for file_name in sorted(glob.glob('test2/test_*.txt')):
    with open(file_name) as f:
        text = f.read()
    text_test.append(text)
test = pd.DataFrame(index=[f'test_{str(x).zfill(4)}.txt' for x in range(len(text_test))], data={'text':text_test})

In [7]:

test.head()

Out[7]:

	text
test_0000.txt	Subject: join the thousands who are now sp @ m...
test_0001.txt	Subject: potential list fo 9 / 00\ndaren :\npe...
test_0002.txt	Subject: bounce skel @ iit . demokritos . gr :...
test_0003.txt	Subject: hpl meter # 981488 paris tenaska hpl\...
test_0004.txt	Subject: hpl nom for august 3 , 2000\n( see at...

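3. Visualizing the data

A word cloud is a well-known visualization technique for natural language processing tasks.
It draws frequent words larger and rare words smaller.
We draw one for spam and one for ham.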
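A word cloud is driven by word frequencies, so it helps to see how those counts come about. The sketch below uses two made-up emails (not taken from the dataset) and the standard library's `Counter`. It also shows why the separator matters when many emails are concatenated into one string before counting.

```python
from collections import Counter

# Toy emails for illustration only.
emails = ['free prize free entry', 'meeting moved to friday']

# Joining with an empty string fuses the last word of one email
# with the first word of the next ("entry" + "meeting").
fused = ''.join(emails).split()

# Joining with a space keeps the boundary between emails intact.
separated = ' '.join(emails).split()

# Word frequencies: the larger the count, the larger the word is drawn.
word_counts = Counter(separated)
top_words = word_counts.most_common(3)

print(fused)
print(top_words)
```

Joining with a space is the safer choice whenever the individual texts are not guaranteed to end with whitespace.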

In [8]:

# Join the texts with a space so that words from adjacent emails are not fused together
spam = WordCloud(background_color='white', collocations=False).generate(' '.join(train[train['label']==1]['text']))
ham = WordCloud(background_color='white', collocations=False).generate(' '.join(train[train['label']==0]['text']))
plt.figure(figsize=(10, 10))
plt.subplot(2, 1, 1)
plt.imshow(spam, interpolation='bilinear')
plt.axis('off')
plt.title('spam')

plt.subplot(2, 1, 2)
plt.imshow(ham, interpolation='bilinear')
plt.axis('off')
plt.title('ham')

plt.show()

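The frequent words appear to differ between spam and ham,
but symbols and single letters also show up, so we will clean these up later during preprocessing.

Next, we check the label distribution.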

In [9]:

plt.figure(figsize=(7, 5))
sns.countplot(x='label', data=train)
plt.title('Label distribution')
plt.grid()
plt.show()

In [10]:

train['label'].value_counts()

Out[10]:

0    1839
1     747
Name: label, dtype: int64
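The counts above show that the classes are imbalanced. As a small worked example, the arithmetic below computes the spam ratio and the F1 score of a trivial baseline that labels every email as spam. The baseline is an illustration added here, not part of the original tutorial, and it treats spam (label 1) as the positive class.

```python
# Counts taken from the value_counts() output above.
n_ham, n_spam = 1839, 747
n_total = n_ham + n_spam

spam_ratio = n_spam / n_total

# Baseline that predicts "spam" for every email:
# every spam email is caught (recall = 1.0), and the precision
# equals the share of spam in the data.
precision = n_spam / n_total
recall = 1.0
f1_all_spam = 2 * precision * recall / (precision + recall)

print(f'spam ratio: {spam_ratio:.3f}')
print(f'F1 of the all-spam baseline: {f1_all_spam:.3f}')
```

Any model worth keeping should score well above this baseline on the validation data.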

Next, we compare the number of characters and the number of words in text.

In [11]:

train['text_length'] = train['text'].apply(lambda x: len(x))
train['num_words'] = train['text'].apply(lambda x: len(x.split()))

test['text_length'] = test['text'].apply(lambda x: len(x))
test['num_words'] = test['text'].apply(lambda x: len(x.split()))

In [12]:

train.head()

Out[12]:

	label	text	text_length	num_words
file_name
train_0000.txt	0	Subject: re : buyback / deficiency deals works...	925	207
train_0001.txt	0	Subject: fw : stress relief\n- - - - - origina...	569	138
train_0002.txt	1	Subject: from mrs . juliana\ndear friend ,\npl...	2831	538
train_0003.txt	1	Subject: [ wrenches ] 68 % off dreamweaver mx ...	565	108
train_0004.txt	0	Subject: y 2 k - texas log\nname home pager\ng...	520	130

In [13]:

# Set shared bins so that both histograms use the same bin edges
n_bin = 15
x_max = train['text_length'].max()
x_min = train['text_length'].min()
bins = np.linspace(x_min, x_max, n_bin)

plt.figure(figsize=(7, 5))
plt.hist(train[train['label']==0]['text_length'], label='ham', bins=bins)
plt.hist(train[train['label']==1]['text_length'], label='spam', align='right', bins=bins)
plt.title('Number of characters by label')
plt.xlabel('Number of characters')
plt.legend()
plt.grid()
plt.show()

In [14]:

n_bin = 15
x_max = train['num_words'].max()
x_min = train['num_words'].min()
bins = np.linspace(x_min, x_max, n_bin)

plt.figure(figsize=(7, 5))
plt.hist(train[train['label']==0]['num_words'], label='ham', bins=bins)
plt.hist(train[train['label']==1]['num_words'], label='spam', align='right', bins=bins)
plt.title('Number of words by label')
plt.xlabel('Number of words')
plt.legend()
plt.grid()
plt.show()

In [15]:

# Visualize the most frequent words in ham

# Collect every word that appears into corpus
corpus = []
for x in train[train['label']==0]['text'].str.split():
    for i in x:
        corpus.append(i)

# Count the occurrences of each word in a dictionary
plt.figure(figsize=(12, 5))
dic = defaultdict(int)
for word in corpus:
    dic[word] += 1

top = sorted(dic.items(), key=lambda x: x[1], reverse=True)[:15]
x, y = zip(*top)
plt.bar(x, y)
plt.title('Top 15 most frequent words in ham emails')
plt.grid()
plt.show()

In [16]:

# Visualize the most frequent words in spam
corpus=[]
for x in train[train['label']==1]['text'].str.split():
    for i in x:
        corpus.append(i)

plt.figure(figsize=(12, 5))
dic = defaultdict(int)
for word in corpus:
    dic[word]+=1

top=sorted(dic.items(), key=lambda x: x[1], reverse=True)[:15] 
x,y=zip(*top)
plt.bar(x,y)
plt.title('Top 15 most frequent words in spam emails')
plt.grid()
plt.show()

When we split on whitespace and count the tokens, many of the most frequent ones are symbols.
This makes it hard to compare spam and ham, so we apply standard NLP preprocessing such as stopword removal.

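4. Preprocessing the data

Here we lowercase the text and remove stopwords and non-alphabetic characters (symbols and digits).
We lowercase because a capital letter in a sentence essentially signals only one of two things: the word starts a sentence, or it is a proper noun.
Neither is likely to be a useful feature for spam classification.
Stopwords are words such as the, a, and in that carry little information about the content of a text, so we remove them.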
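The preprocessing steps described above can be sketched on a single string as follows. The stopword set `DEMO_STOPWORDS` and the helper name `clean_text` are hypothetical and exist only for this illustration; the tutorial itself uses NLTK's English stopword list. Unlike the tutorial's function, this sketch also drops tokens that become empty after stripping, which avoids runs of spaces in the result.

```python
import re

# Hypothetical mini stopword list for the demo only.
DEMO_STOPWORDS = {'the', 'a', 'in', 'off', 'of'}


def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, drop stopwords, and strip non-alphabetic characters."""
    tokens = []
    for word in text.lower().split():
        if word in stop_words:
            continue
        # Remove everything that is not a letter (symbols and digits).
        word = re.sub('[^a-z]+', '', word)
        if word:  # skip tokens that consisted only of symbols or digits
            tokens.append(word)
    return ' '.join(tokens)


cleaned = clean_text('Subject: [ wrenches ] 68 % off the DREAMWEAVER mx')
print(cleaned)
```

Storing the stopwords in a set rather than a list also makes each membership check a constant-time lookup, which matters when the function runs over thousands of emails.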

In [17]:

# Lowercase
train['text'] = train['text'].str.lower()
test['text'] = test['text'].str.lower()

In [18]:

# Check part of the stopword list
stop_words = stopwords.words('english')
print('First few stopwords:', stop_words[:5])

First few stopwords: ['i', 'me', 'my', 'myself', 'we']

In [19]:

# For one text: split into words -> drop stopwords -> strip all non-alphabetic characters (symbols and digits) from the remaining words
def remove_stopwords(text):
    words = ' '.join([re.sub('[^a-zA-Z]+', '', word) for word in text.split() if word not in stop_words])
    return words

In [20]:

# Apply the function above to the datasets
train['text_remove'] = train['text'].apply(lambda x: remove_stopwords(x))
test['text_remove'] = test['text'].apply(lambda x: remove_stopwords(x))
train.head()

Out[20]:

	label	text	text_length	num_words	text_remove
file_name
train_0000.txt	0	subject: re : buyback / deficiency deals works...	925	207	subject buyback deficiency deals worksheet e...
train_0001.txt	0	subject: fw : stress relief\n- - - - - origina...	569	138	subject fw stress relief original messag...
train_0002.txt	1	subject: from mrs . juliana\ndear friend ,\npl...	2831	538	subject mrs juliana dear friend please surp...
train_0003.txt	1	subject: [ wrenches ] 68 % off dreamweaver mx ...	565	108	subject wrenches dreamweaver mx flier ali...
train_0004.txt	0	subject: y 2 k - texas log\nname home pager\ng...	520	130	subject k texas log name home pager george g...

text_remove now holds the text with stopwords, symbols, and digits removed.
Let us visualize the word frequencies again.

In [21]:

# Visualize the most frequent words in ham
corpus=[]
for x in train[train['label']==0]['text_remove'].str.split():
    for i in x:
        corpus.append(i)

plt.figure(figsize=(12, 5))
dic = defaultdict(int)
for word in corpus:
    dic[word]+=1

top=sorted(dic.items(), key=lambda x: x[1], reverse=True)[:15] 
x,y=zip(*top)
plt.bar(x,y)
plt.title('Top 15 most frequent words in ham emails')
plt.grid()
plt.show()

In [22]:

# Visualize the most frequent words in spam
corpus=[]
for x in train[train['label']==1]['text_remove'].str.split():
    for i in x:
        corpus.append(i)

plt.figure(figsize=(12, 5))
dic = defaultdict(int)
for word in corpus:
    dic[word]+=1

top=sorted(dic.items(), key=lambda x: x[1], reverse=True)[:15] 
x,y=zip(*top)
plt.bar(x,y)
plt.title('Top 15 most frequent words in spam emails')
plt.grid()
plt.show()

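We can now see some differences in the words that appear.

Before training, we need to convert the text into numbers,
because the models cannot learn from raw text. We use CountVectorizer for this.
CountVectorizer is a scikit-learn class that counts how often each word occurs in each text and returns the counts as a SciPy sparse matrix.
We first split the train data into training and validation sets, then apply CountVectorizer's fit_transform to the training set and transform to the validation set, so that the vocabulary is learned from the training set only.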
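To make the fit/transform distinction concrete, here is a minimal pure-Python bag-of-words sketch. It is a simplification and not how CountVectorizer is implemented: the helper names `fit_vocabulary` and `transform` are hypothetical, the documents are made up, and tokens are split on whitespace, whereas CountVectorizer's default token pattern also ignores single-character tokens.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Learn a word -> column index mapping from the training documents."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(vocab)}


def transform(docs, vocab):
    """Turn documents into count vectors using an already fitted vocabulary."""
    rows = []
    for doc in docs:
        # Words that were not seen during fitting are ignored.
        counts = Counter(word for word in doc.split() if word in vocab)
        row = [0] * len(vocab)
        for word, count in counts.items():
            row[vocab[word]] = count
        rows.append(row)
    return rows


train_docs = ['free money free', 'meeting schedule']
valid_docs = ['free meeting lunch']

vocab = fit_vocabulary(train_docs)          # learned from training data only
X_train_demo = transform(train_docs, vocab)
X_valid_demo = transform(valid_docs, vocab)  # 'lunch' is unseen, so it is dropped

print(vocab)
print(X_train_demo)
print(X_valid_demo)
```

Fitting on the training set only keeps the validation score honest: the validation documents cannot add columns the model never saw during training.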

In [23]:

# Split into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(train, train['label'], test_size=0.20, random_state=82, stratify=train['label'])
print(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

(2068, 5) (518, 5) (2068,) (518,)

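GaussianNB, one of the two algorithms used below, does not accept sparse input, so we convert the explanatory variables to dense arrays with toarray().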

In [24]:

# Apply CountVectorizer
count_vectorizer = CountVectorizer()

# train
X_train_array = count_vectorizer.fit_transform(X_train['text_remove']).toarray()
# valid
X_valid_array = count_vectorizer.transform(X_valid['text_remove']).toarray()
print(X_train_array.shape, X_valid_array.shape)

# test
test_array = count_vectorizer.transform(test['text_remove']).toarray()

(2068, 26996) (518, 26996)
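Converting to a dense array has a memory cost worth estimating before scaling this approach up. The arithmetic below is a rough estimate, assuming each count is stored as a 64-bit integer (8 bytes per value).

```python
# Shape of X_train_array printed above.
n_rows, n_cols = 2068, 26996

# Assumption: each count is stored as a 64-bit integer.
bytes_per_value = 8

dense_bytes = n_rows * n_cols * bytes_per_value
dense_mib = dense_bytes / 2**20

print(f'dense training matrix: about {dense_mib:.0f} MiB')
```

Most entries are zero, which is why the vectorizer returns a sparse matrix in the first place. MultinomialNB can be trained on the sparse matrix directly, so the dense copy is only needed for GaussianNB.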

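5. Training and evaluation

We build the model with a naive Bayes classifier.
Naive Bayes simplifies the problem so that it can be processed quickly: it assumes that, given the class, each word occurs independently of the others.
Natural language contains many relationships between sentences and words (for example, the words "signate" and "competition" are likely to appear in the same text),
and taking this kind of human intuition into account is very difficult in natural language processing.
For spam classification, naive Bayes is therefore a common choice: it sets these complex relationships aside, is cheap to compute, and gives results that are easy to interpret.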
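The idea behind the multinomial variant can be sketched in a few lines of pure Python. This is an illustrative implementation with add-one (Laplace) smoothing on made-up documents, not scikit-learn's code; the function names and the toy data are assumptions for the demo.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = {word for doc in docs for word in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + alpha) / total)
            for word in vocab
        }
    return log_prior, log_likelihood


def predict_multinomial_nb(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][word]
            for word in doc.split()
            if word in log_likelihood[c]
        )
    return max(scores, key=scores.get)


docs = ['free money now', 'win free prize',
        'meeting schedule today', 'project meeting notes']
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
pred_spam = predict_multinomial_nb('free prize money', log_prior, log_likelihood)
pred_ham = predict_multinomial_nb('meeting today', log_prior, log_likelihood)
print(pred_spam, pred_ham)
```

Because the word probabilities are simply multiplied (added in log space), training amounts to counting, which is what makes the method fast. GaussianNB instead models each feature with a normal distribution, which is a less natural fit for word counts.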

In [25]:

# Try two naive Bayes variants
gnb = GaussianNB()
mnb = MultinomialNB()

In [26]:

gnb.fit(X_train_array, y_train)
mnb.fit(X_train_array, y_train)

Out[26]:

MultinomialNB()

In [27]:

gnb_train_predict = gnb.predict(X_train_array)
gnb_valid_predict = gnb.predict(X_valid_array)

mnb_train_predict = mnb.predict(X_train_array)
mnb_valid_predict = mnb.predict(X_valid_array)

In [28]:

# F1 scores
print('gnb train_F1 : ', f1_score(y_train, gnb_train_predict))
print('gnb valid_F1 : ', f1_score(y_valid, gnb_valid_predict))

print('mnb train_F1 : ', f1_score(y_train, mnb_train_predict))
print('mnb valid_F1 : ', f1_score(y_valid, mnb_valid_predict))

gnb train_F1 :  0.9812286689419796
gnb valid_F1 :  0.9160839160839161
mnb train_F1 :  0.9839391377852917
mnb valid_F1 :  0.9342560553633218

We use MultinomialNB, which has the higher F1 score on the validation data.
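For reference, the F1 score used in this comparison is the harmonic mean of precision and recall for the positive class (spam, label 1). The sketch below computes it by hand on made-up labels; the helper name `f1_binary` is hypothetical, and in the tutorial the score comes from scikit-learn's f1_score.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 score for the positive class, computed from TP, FP, and FN."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of predicted spam that is really spam
    recall = tp / (tp + fn)     # share of real spam that was caught
    return 2 * precision * recall / (precision + recall)


# Toy labels: 3 spam and 5 ham emails.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]

f1_demo = f1_binary(y_true_demo, y_pred_demo)
print(f1_demo)
```

Here two of the three spam emails are caught and one ham email is flagged by mistake, so precision and recall are both 2/3. Unlike accuracy, F1 does not reward a model for the many ham emails it classifies correctly, which suits imbalanced data.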

6. Prediction and submission

In [29]:

# Predict on the test data
mnb_pred = mnb.predict(test_array)

In [30]:

# Load the sample submission file and fill in the predictions
submit = pd.read_csv('sample_submit.csv', header=None)
submit[1] = mnb_pred
submit.to_csv('submission_tutorial.csv',header=None,index=False)

In [31]:

submit.head()

Out[31]:

	0	1
0	test_0000.txt	1
1	test_0001.txt	0
2	test_0002.txt	1
3	test_0003.txt	0
4	test_0004.txt	0

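Closing remarks

The submitted result scored 0.9447. (For reference, GaussianNB scored 0.9219.)
To build on the spam classification model from this tutorial, consider techniques such as:

Merging words that differ only in inflection but share the same meaning (stemming)
Rescaling the features (tf-idf)

The data is also imbalanced, so outside of NLP-specific techniques, oversampling the spam class may be worth trying.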
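As a starting point for the tf-idf suggestion above, the sketch below computes tf-idf weights in pure Python on made-up documents. It assumes the smoothed idf formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document; other tf-idf variants exist, and the helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log((1 + n_docs) / (1 + count)) + 1
           for word, count in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {word: count * idf[word] for word, count in tf.items()}
        norm = math.sqrt(sum(value * value for value in weights.values()))
        rows.append({w: v / norm for w, v in weights.items()} if norm else {})
    return rows


demo = tfidf(['subject free money',
              'subject meeting notes',
              'subject free prize'])
print(demo[0])
```

In the first document, "subject" appears in every email and gets the lowest weight, while "money" appears only once in the corpus and gets the highest. This is the rescaling effect: words shared by all emails contribute less than words that distinguish them.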

[Practice problem] Spam email classification tutorial
