[Neural Networks] Binary Classification of IMDB Movie Reviews (Positive/Negative)
Loading the Dataset
- Dataset: IMDB (25,000 training reviews, 25,000 test reviews)
from keras.datasets import imdb

# num_words=10000 keeps only the 10,000 most frequently occurring words.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 0s 0us/step
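Before decoding anything, it helps to confirm what the loaded arrays actually contain. A minimal sketch (the printed values below are what the standard split yields):

# Each review is a list of word indices; each label is 0 (negative) or 1 (positive).
print(train_data[0][:10])  # e.g. [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
print(train_labels[0])     # 1, i.e. a positive review
# num_words=10000 guarantees no index exceeds 9999.
print(max(max(sequence) for sequence in train_data))  # 9999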
Checking what a movie review looks like
# word_index is a {word: index} dictionary.
word_index = imdb.get_word_index()

# reverse_word_index is an {index: word} dictionary.
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# Indices are offset by 3 because 0, 1, and 2 are reserved for
# "padding", "start of sequence", and "unknown".
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
decoded_review
"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
Data Preprocessing
- Convert each list of word indices into a vector of 0s and 1s using one-hot (multi-hot) encoding.
That is, for [2, 5], put a 1 at indices 2 and 5 and a 0 everywhere else.
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension).
    result = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the positions listed in this sequence to 1.
        result[i, sequence] = 1.
    return result

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
(x_train.shape, x_test.shape)
((25000, 10000), (25000, 10000))
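As a quick sanity check of the encoding described above, the function can be applied to the toy example [2, 5] with a small dimension (purely illustrative input):

demo = vectorize_sequences([[2, 5]], dimension=8)
print(demo)  # [[0. 0. 1. 0. 0. 1. 0. 0.]]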
The Neural Network Model
- Two hidden layers, each with 16 units.
- Output layer: a single scalar prediction (the probability the review is positive).
# Define the model.
from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
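To double-check the architecture, model.summary() prints the layer shapes and parameter counts; the counts below follow directly from the layer sizes:

model.summary()
# Dense(16) on 10,000 inputs: 10000 * 16 + 16 = 160,016 parameters
# Dense(16):                  16 * 16 + 16    = 272 parameters
# Dense(1):                   16 * 1 + 1      = 17 parameters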
Compiling the Model
- Loss function: binary_crossentropy
- Optimizer: rmsprop
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
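The strings work because Keras resolves them to default objects by name. If you need explicit control, for example over the learning rate, the equivalent call can pass the optimizer object directly (a sketch, assuming a TF2-era Keras where RMSprop accepts learning_rate):

from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),  # 0.001 is the default
              loss='binary_crossentropy',
              metrics=['accuracy'])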
Training and Validation
# Set aside the first 10,000 training samples as a validation set.
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val))
Epoch 1/20
30/30 [==============================] - 3s 45ms/step - loss: 0.5159 - accuracy: 0.7860 - val_loss: 0.3811 - val_accuracy: 0.8716
Epoch 2/20
30/30 [==============================] - 1s 20ms/step - loss: 0.2983 - accuracy: 0.9031 - val_loss: 0.3216 - val_accuracy: 0.8742
Epoch 3/20
30/30 [==============================] - 1s 19ms/step - loss: 0.2160 - accuracy: 0.9301 - val_loss: 0.2852 - val_accuracy: 0.8869
Epoch 4/20
30/30 [==============================] - 1s 21ms/step - loss: 0.1682 - accuracy: 0.9442 - val_loss: 0.2748 - val_accuracy: 0.8888
Epoch 5/20
30/30 [==============================] - 1s 20ms/step - loss: 0.1351 - accuracy: 0.9554 - val_loss: 0.2846 - val_accuracy: 0.8882
Epoch 6/20
30/30 [==============================] - 1s 20ms/step - loss: 0.1104 - accuracy: 0.9659 - val_loss: 0.3119 - val_accuracy: 0.8808
Epoch 7/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0866 - accuracy: 0.9743 - val_loss: 0.3408 - val_accuracy: 0.8761
Epoch 8/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0708 - accuracy: 0.9808 - val_loss: 0.3466 - val_accuracy: 0.8787
Epoch 9/20
30/30 [==============================] - 1s 21ms/step - loss: 0.0588 - accuracy: 0.9837 - val_loss: 0.3757 - val_accuracy: 0.8777
Epoch 10/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0464 - accuracy: 0.9890 - val_loss: 0.3953 - val_accuracy: 0.8761
Epoch 11/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0382 - accuracy: 0.9908 - val_loss: 0.4288 - val_accuracy: 0.8764
Epoch 12/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0282 - accuracy: 0.9939 - val_loss: 0.4555 - val_accuracy: 0.8747
Epoch 13/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0204 - accuracy: 0.9969 - val_loss: 0.4956 - val_accuracy: 0.8727
Epoch 14/20
30/30 [==============================] - 1s 25ms/step - loss: 0.0173 - accuracy: 0.9972 - val_loss: 0.5291 - val_accuracy: 0.8707
Epoch 15/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0124 - accuracy: 0.9986 - val_loss: 0.5727 - val_accuracy: 0.8704
Epoch 16/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0139 - accuracy: 0.9969 - val_loss: 0.6016 - val_accuracy: 0.8695
Epoch 17/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0050 - accuracy: 0.9998 - val_loss: 0.6357 - val_accuracy: 0.8666
Epoch 18/20
30/30 [==============================] - 1s 24ms/step - loss: 0.0081 - accuracy: 0.9979 - val_loss: 0.6703 - val_accuracy: 0.8667
Epoch 19/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0028 - accuracy: 0.9999 - val_loss: 0.7055 - val_accuracy: 0.8683
Epoch 20/20
30/30 [==============================] - 1s 20ms/step - loss: 0.0063 - accuracy: 0.9987 - val_loss: 0.7432 - val_accuracy: 0.8656
import matplotlib.pyplot as plt

history_dict = history.history
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()  # clear the previous figure

history_dict = history.history
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
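Both plots tell the same story: training loss keeps falling toward zero while validation loss bottoms out around epoch 4 and then climbs steadily, so the network is overfitting. One standard remedy is to stop training when validation loss stops improving. A minimal sketch using Keras's EarlyStopping callback (not part of the original run; it assumes a freshly defined and compiled model):

from keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 2 consecutive epochs,
# and roll back to the weights from the best epoch.
early_stop = EarlyStopping(monitor='val_loss', patience=2,
                           restore_best_weights=True)
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop])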
results = model.evaluate(x_test, test_labels)
results
782/782 [==============================] - 3s 3ms/step - loss: 0.8120 - accuracy: 0.8531
[0.8120064735412598, 0.8531200289726257]
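The overfit 20-epoch model still reaches about 85% test accuracy. Since the output layer is a sigmoid, model.predict returns the probability that each review is positive; a quick sketch (exact values vary between runs):

predictions = model.predict(x_test[:3])
print(predictions)  # one probability per review; values near 0 or 1 are confident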