[Neural Networks] Binary Classification of IMDB Movie Reviews (Positive/Negative)
Loading the Dataset
- Dataset: IMDB (25,000 training reviews, 25,000 test reviews)
from keras.datasets import imdb

# num_words=10000 keeps only the 10,000 most frequently occurring words.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 0s 0us/step
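Before decoding anything, it helps to confirm what the loaded arrays actually contain. A minimal sketch (the printed values below are what the standard split yields):

# Each review is a list of word indices; each label is 0 (negative) or 1 (positive).
print(train_data[0][:10])  # e.g. [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
print(train_labels[0])     # 1, i.e. a positive review
# num_words=10000 guarantees no index exceeds 9999.
print(max(max(sequence) for sequence in train_data))  # 9999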
Checking what a movie review looks like
# word_index is a {word: index} dictionary.
word_index = imdb.get_word_index()

# reverse_word_index is an {index: word} dictionary.
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# Indices are offset by 3 because 0, 1, and 2 are reserved for
# "padding", "start of sequence", and "unknown".
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
decoded_review
"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
Data Preprocessing
- Convert each list of word indices into a vector of 0s and 1s using one-hot (multi-hot) encoding.
That is, for [2, 5], put a 1 at indices 2 and 5 and a 0 everywhere else.
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension).
    result = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the positions listed in this sequence to 1.
        result[i, sequence] = 1.
    return result

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
(x_train.shape, x_test.shape)
((25000, 10000), (25000, 10000))
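As a quick sanity check of the encoding described above, the function can be applied to the toy example [2, 5] with a small dimension (purely illustrative input):

demo = vectorize_sequences([[2, 5]], dimension=8)
print(demo)  # [[0. 0. 1. 0. 0. 1. 0. 0.]]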
The Neural Network Model
- Two hidden layers, each with 16 units.
- Output layer: a single scalar prediction (the probability the review is positive).
# Define the model.
from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
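To double-check the architecture, model.summary() prints the layer shapes and parameter counts; the counts below follow directly from the layer sizes:

model.summary()
# Dense(16) on 10,000 inputs: 10000 * 16 + 16 = 160,016 parameters
# Dense(16):                  16 * 16 + 16    = 272 parameters
# Dense(1):                   16 * 1 + 1      = 17 parameters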
Compiling the Model
- Loss function: binary_crossentropy
- Optimizer: rmsprop
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
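The strings work because Keras resolves them to default objects by name. If you need explicit control, for example over the learning rate, the equivalent call can pass the optimizer object directly (a sketch, assuming a TF2-era Keras where RMSprop accepts learning_rate):

from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),  # 0.001 is the default
              loss='binary_crossentropy',
              metrics=['accuracy'])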
Training and Validation
# Set aside the first 10,000 training samples as a validation set.
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val))
Epoch 1/20
30/30 [==============================] - 3s 45ms/step - loss: 0.5159 - accuracy: 0.7860 - val_loss: 0.3811 - val_accuracy: 0.8716
Epoch 2/20
30/30 [==============================] - 1s 20ms/step - loss: 0.2983 - accuracy: 0.9031 - val_loss: 0.3216 - val_accuracy: 0.8742
Epoch 3/20
30/30 [==============================] - 1s 19ms/step - loss: 0.2160 - accuracy: 0.9301 - val_loss: 0.2852 - val_accuracy: 0.8869
Epoch 4/20
30/30 [==============================] - 1s 21ms/step - loss: 0.1682 - accuracy: 0.9442 - val_loss: 0.2748 - val_accuracy: 0.8888
Epoch 5/20
30/30 [==============================] - 1s 20ms/step - loss: 0.1351 - accuracy: 0.9554 - val_loss: 0.2846 - val_accuracy: 0.8882
Epoch 6/20
30/30 [==============================] - 1s 20ms/step - loss: 0.1104 - accuracy: 0.9659 - val_loss: 0.3119 - val_accuracy: 0.8808
Epoch 7/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0866 - accuracy: 0.9743 - val_loss: 0.3408 - val_accuracy: 0.8761
Epoch 8/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0708 - accuracy: 0.9808 - val_loss: 0.3466 - val_accuracy: 0.8787
Epoch 9/20
30/30 [==============================] - 1s 21ms/step - loss: 0.0588 - accuracy: 0.9837 - val_loss: 0.3757 - val_accuracy: 0.8777
Epoch 10/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0464 - accuracy: 0.9890 - val_loss: 0.3953 - val_accuracy: 0.8761
Epoch 11/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0382 - accuracy: 0.9908 - val_loss: 0.4288 - val_accuracy: 0.8764
Epoch 12/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0282 - accuracy: 0.9939 - val_loss: 0.4555 - val_accuracy: 0.8747
Epoch 13/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0204 - accuracy: 0.9969 - val_loss: 0.4956 - val_accuracy: 0.8727
Epoch 14/20
30/30 [==============================] - 1s 25ms/step - loss: 0.0173 - accuracy: 0.9972 - val_loss: 0.5291 - val_accuracy: 0.8707
Epoch 15/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0124 - accuracy: 0.9986 - val_loss: 0.5727 - val_accuracy: 0.8704
Epoch 16/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0139 - accuracy: 0.9969 - val_loss: 0.6016 - val_accuracy: 0.8695
Epoch 17/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0050 - accuracy: 0.9998 - val_loss: 0.6357 - val_accuracy: 0.8666
Epoch 18/20
30/30 [==============================] - 1s 24ms/step - loss: 0.0081 - accuracy: 0.9979 - val_loss: 0.6703 - val_accuracy: 0.8667
Epoch 19/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0028 - accuracy: 0.9999 - val_loss: 0.7055 - val_accuracy: 0.8683
Epoch 20/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0063 - accuracy: 0.9987 - val_loss: 0.7432 - val_accuracy: 0.8656
import matplotlib.pyplot as plt

history_dict = history.history
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()  # clear the previous figure

history_dict = history.history
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
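Both plots tell the same story: training loss keeps falling toward zero while validation loss bottoms out around epoch 4 and then climbs steadily, so the network is overfitting. One standard remedy is to stop training when validation loss stops improving. A minimal sketch using Keras's EarlyStopping callback (not part of the original run; it assumes a freshly defined and compiled model):

from keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 2 consecutive epochs,
# and roll back to the weights from the best epoch.
early_stop = EarlyStopping(monitor='val_loss', patience=2,
                           restore_best_weights=True)
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop])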
results = model.evaluate(x_test, test_labels)
results
782/782 [==============================] - 3s 3ms/step - loss: 0.8120 - accuracy: 0.8531
[0.8120064735412598, 0.8531200289726257]
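The overfit 20-epoch model still reaches about 85% test accuracy. Since the output layer is a sigmoid, model.predict returns the probability that each review is positive; a quick sketch (exact values vary between runs):

predictions = model.predict(x_test[:3])
print(predictions)  # one probability per review; values near 0 or 1 are confident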