[Text] Kaggle - Mercari Price Suggestion
Machine Learning & Deep Learning | 2021. 12. 6. 23:01
Mercari Price Suggestion
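- Build a regression model that predicts a product's price from its name, condition, category, brand name, and description.
- Vectorize the text columns and combine them with the remaining features to build the train/test feature dataset.
- Prediction performance can improve depending on how effectively the text is extracted and vectorized.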
from google.colab import auth
auth.authenticate_user()
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
!cd "/content/drive/My Drive/Colab Notebooks/data"; ls -l
total 163086
-rw------- 1 root root      4651 Sep 17 01:21 규모별_미분양현황_20210917101949.csv
-rw------- 1 root root      1000 Sep 17 01:39 규모별_미분양현황_20210917103934.csv
-rw------- 1 root root    125204 Oct  1 01:52 breast-cancer-wisconsin-data.csv
-rw------- 1 root root 150828752 Sep 19  2019 creditcard.csv
-rw------- 1 root root       220 Sep 13 05:23 n113_stock.csv
-rw------- 1 root root   1740630 Sep 23 01:01 n123_서울시_기간별_시간평균_대기환경_정보_2020.03.csv
-rw------- 1 root root    203378 Sep 27 00:02 sc12x_dataset.csv
-rw------- 1 root root  14094055 Nov  1 01:35 weatherAUS.csv
!cat /proc/meminfo;cat /proc/cpuinfo
MemTotal:       13302924 kB
MemFree:        10545288 kB
MemAvailable:   12459224 kB
Buffers:          153436 kB
Cached:          1891820 kB
SwapTotal:             0 kB
SwapFree:              0 kB
...
processor       : 0
vendor_id       : GenuineIntel
model name      : Intel(R) Xeon(R) CPU @ 2.20GHz
cpu MHz         : 2199.998
cache size      : 56320 KB
siblings        : 2
cpu cores       : 1
...
processor       : 1
vendor_id       : GenuineIntel
model name      : Intel(R) Xeon(R) CPU @ 2.20GHz
cpu MHz         : 2199.998
cache size      : 56320 KB
siblings        : 2
cpu cores       : 1
...
Data preprocessing
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

mercari_df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/mercari_train.tsv', sep='\t')
print(mercari_df.shape)
mercari_df.head(3)
(1482535, 8)
   train_id                                 name  item_condition_id                                      category_name  brand_name  price  shipping                                   item_description
0         0  MLB Cincinnati Reds T Shirt Size XL                  3                                  Men/Tops/T-shirts         NaN   10.0         1                                 No description yet
1         1     Razer BlackWidow Chroma Keyboard                  3  Electronics/Computers & Tablets/Components & P...       Razer   52.0         0  This keyboard is in great condition and works ...
2         2                       AVA-VIV Blouse                  1                        Women/Tops & Blouses/Blouse      Target   10.0         1  Adorable top with a hint of lace and a key hol...

print(mercari_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
 #   Column             Non-Null Count    Dtype
---  ------             --------------    -----
 0   train_id           1482535 non-null  int64
 1   name               1482535 non-null  object
 2   item_condition_id  1482535 non-null  int64
 3   category_name      1476208 non-null  object
 4   brand_name         849853 non-null   object
 5   price              1482535 non-null  float64
 6   shipping           1482535 non-null  int64
 7   item_description   1482531 non-null  object
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB
None
Checking the target distribution
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

y_train_df = mercari_df['price']
plt.figure(figsize=(6,4))
# histplot replaces distplot, which seaborn has deprecated
sns.histplot(y_train_df, kde=False)
Checking the distribution after log-transforming the target
import numpy as np

y_train_df = np.log1p(y_train_df)
# histplot replaces distplot, which seaborn has deprecated
sns.histplot(y_train_df, kde=False)
mercari_df['price'] = np.log1p(mercari_df['price'])
mercari_df['price'].head(3)

0    2.397895
1    3.970292
2    2.397895
Name: price, dtype: float64
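The log transform applied here is undone later with expm1 when the model is scored. A minimal standard-library sketch of that round trip, assuming the first three raw prices from the head(3) output (10.0, 52.0, 10.0); the variable names are illustrative only:

```python
import math

# First three raw prices from the head(3) output earlier in the post.
raw_prices = [10.0, 52.0, 10.0]

# log1p(x) = log(1 + x): defined at x = 0, where log(x) is not.
log_prices = [math.log1p(p) for p in raw_prices]

# expm1(y) = exp(y) - 1 is the inverse of log1p.
restored = [math.expm1(v) for v in log_prices]

print([round(v, 6) for v in log_prices])
print([round(v, 6) for v in restored])
```

The rounded log values match the head(3) output of the transformed price column, and the restored values return to the original prices up to floating-point error.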
Looking at the value types of each feature
print('shipping value counts:\n', mercari_df['shipping'].value_counts())
print('item_condition_id value counts:\n', mercari_df['item_condition_id'].value_counts())

shipping value counts:
0    819435
1    663100
Name: shipping, dtype: int64
item_condition_id value counts:
1    640549
3    432161
2    375479
4     31962
5      2384
Name: item_condition_id, dtype: int64
boolean_cond = mercari_df['item_description'] == 'No description yet'
mercari_df[boolean_cond]['item_description'].count()
82489
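category_name stores the top-, middle-, and bottom-level categories as a single string separated by '/'. Split it into separate columns: cat_dae (top level), cat_jung (middle level), and cat_so (bottom level).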
# Split function called from apply(): returns the top/middle/bottom categories as a list.
def split_cat(category_name):
    try:
        return category_name.split('/')
    except AttributeError:
        # NaN is a float and has no split(), so missing categories land here.
        return ['Other_Null', 'Other_Null', 'Other_Null']

# Call split_cat() through apply() to create the three category columns in mercari_df.
mercari_df['cat_dae'], mercari_df['cat_jung'], mercari_df['cat_so'] = \
    zip(*mercari_df['category_name'].apply(lambda x: split_cat(x)))

# Show value counts for the top level only; the middle and bottom levels have many
# distinct values, so print just how many there are.
print('Top-level category counts:\n', mercari_df['cat_dae'].value_counts())
print('Number of middle-level categories:', mercari_df['cat_jung'].nunique())
print('Number of bottom-level categories:', mercari_df['cat_so'].nunique())

Top-level category counts:
Women                     664385
Beauty                    207828
Kids                      171689
Electronics               122690
Men                        93680
Home                       67871
Vintage & Collectibles     46530
Other                      45351
Handmade                   30842
Sports & Outdoors          25342
Other_Null                  6327
Name: cat_dae, dtype: int64
Number of middle-level categories: 114
Number of bottom-level categories: 871
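Two details of the split are easy to miss: a missing category is NaN (a float), so calling split on it fails and the fallback list is used, and zip(*...) stops at the shortest row, so any level beyond the third is silently dropped. A small sketch with made-up inputs (the four-level string 'A/B/C/D' is hypothetical, not taken from the dataset):

```python
def split_cat(category_name):
    # Mirrors split_cat() from the post: NaN is a float, so .split raises AttributeError.
    try:
        return category_name.split('/')
    except AttributeError:
        return ['Other_Null', 'Other_Null', 'Other_Null']

# Hypothetical sample values for illustration.
samples = ['Men/Tops/T-shirts', float('nan'), 'A/B/C/D']
rows = [split_cat(s) for s in samples]

# zip(*rows) transposes rows into columns and stops at the shortest row,
# so only the first three levels survive.
cat_dae, cat_jung, cat_so = zip(*rows)
print(cat_dae)
print(cat_jung)
print(cat_so)
```

The same truncation means a row with fewer than three levels would break the three-way unpacking. Since the cell ran successfully in the post, no such row appears to exist in this data.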
# Alternative: keep the split result in a temporary list column first,
# reusing the split_cat() function defined in the previous cell.
mercari_df['category_list'] = mercari_df['category_name'].apply(lambda x: split_cat(x))
mercari_df['category_list'].head()

0                                [Men, Tops, T-shirts]
1    [Electronics, Computers & Tablets, Components ...
2                      [Women, Tops & Blouses, Blouse]
3               [Home, Home Décor, Home Décor Accents]
4                          [Women, Jewelry, Necklaces]
Name: category_list, dtype: object
mercari_df['cat_dae'] = mercari_df['category_list'].apply(lambda x: x[0])
mercari_df['cat_jung'] = mercari_df['category_list'].apply(lambda x: x[1])
mercari_df['cat_so'] = mercari_df['category_list'].apply(lambda x: x[2])
mercari_df.drop('category_list', axis=1, inplace=True)
mercari_df[['cat_dae','cat_jung','cat_so']].head()
       cat_dae             cat_jung              cat_so
0          Men                 Tops            T-shirts
1  Electronics  Computers & Tablets  Components & Parts
2        Women       Tops & Blouses              Blouse
3         Home           Home Décor  Home Décor Accents
4        Women              Jewelry           Necklaces

Handling null values in bulk
mercari_df['brand_name'] = mercari_df['brand_name'].fillna(value='Other_Null')
mercari_df['category_name'] = mercari_df['category_name'].fillna(value='Other_Null')
mercari_df['item_description'] = mercari_df['item_description'].fillna(value='Other_Null')

# Check the null count of every column; all of them should be 0.
mercari_df.isnull().sum()

train_id             0
name                 0
item_condition_id    0
category_name        0
brand_name           0
price                0
shipping             0
item_description     0
cat_dae              0
cat_jung             0
cat_so               0
dtype: int64
mercari_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 11 columns):
 #   Column             Non-Null Count    Dtype
---  ------             --------------    -----
 0   train_id           1482535 non-null  int64
 1   name               1482535 non-null  object
 2   item_condition_id  1482535 non-null  int64
 3   category_name      1482535 non-null  object
 4   brand_name         1482535 non-null  object
 5   price              1482535 non-null  float64
 6   shipping           1482535 non-null  int64
 7   item_description   1482535 non-null  object
 8   cat_dae            1482535 non-null  object
 9   cat_jung           1482535 non-null  object
 10  cat_so             1482535 non-null  object
dtypes: float64(1), int64(3), object(7)
memory usage: 124.4+ MB
Feature encoding and feature vectorization
Checking how many distinct values brand_name and name have
print('Number of distinct brand_name values:', mercari_df['brand_name'].nunique())
print('Top 5 brand_name values:\n', mercari_df['brand_name'].value_counts()[:5])

Number of distinct brand_name values: 4810
Top 5 brand_name values:
Other_Null           632682
PINK                  54088
Nike                  54043
Victoria's Secret     48036
LuLaRoe               31024
Name: brand_name, dtype: int64
print('Number of distinct name values:', mercari_df['name'].nunique())
print('First 7 name values:\n', mercari_df['name'][:7])

Number of distinct name values: 1225273
First 7 name values:
0    MLB Cincinnati Reds T Shirt Size XL
1       Razer BlackWidow Chroma Keyboard
2                         AVA-VIV Blouse
3                  Leather Horse Statues
4                   24K GOLD plated rose
5       Bundled items requested for Ruie
6     Acacia pacific tides santorini top
Name: name, dtype: object
Checking the character length of item_description

pd.set_option('display.max_colwidth', 200)

# Average number of characters in item_description
print('Average item_description length (characters):', mercari_df['item_description'].str.len().mean())
mercari_df['item_description'][:2]

Average item_description length (characters): 145.7113889385411

0    No description yet
1    This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.
Name: item_description, dtype: object
import gc
gc.collect()
88
Vectorize name with CountVectorizer and item_description with TF-IDF

# Feature vectorization of the name column
cnt_vec = CountVectorizer(max_features=30000)
X_name = cnt_vec.fit_transform(mercari_df.name)

# Feature vectorization of the item_description column
tfidf_descp = TfidfVectorizer(max_features=50000, ngram_range=(1,3), stop_words='english')
X_descp = tfidf_descp.fit_transform(mercari_df['item_description'])

print('name vectorization shape:', X_name.shape)
print('item_description vectorization shape:', X_descp.shape)

name vectorization shape: (1482535, 30000)
item_description vectorization shape: (1482535, 50000)
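The ngram_range=(1,3) setting means each description contributes single words, adjacent word pairs, and adjacent word triples as candidate features, which is why capping the vocabulary with max_features matters. A plain-Python sketch of that expansion; word_ngrams is a hypothetical helper written for illustration, not scikit-learn's tokenizer, and the token list is assumed to be roughly what remains of the start of the sample keyboard description after stop-word removal:

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

# Assumed tokens after lowercasing and stop-word removal (illustrative only).
tokens = 'keyboard great condition works'.split()
grams = word_ngrams(tokens)
print(len(grams))
print(grams)
```

Four tokens already yield nine candidate features (4 unigrams, 3 bigrams, 2 trigrams), so the vocabulary grows much faster than the number of distinct words.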
One-hot encode with scikit-learn's LabelBinarizer and keep the result as a sparse matrix

from sklearn.preprocessing import LabelBinarizer

# One-hot encode brand_name, item_condition_id, and shipping as sparse matrices
lb_brand_name = LabelBinarizer(sparse_output=True)
X_brand = lb_brand_name.fit_transform(mercari_df['brand_name'])

lb_item_cond_id = LabelBinarizer(sparse_output=True)
X_item_cond_id = lb_item_cond_id.fit_transform(mercari_df['item_condition_id'])

lb_shipping = LabelBinarizer(sparse_output=True)
X_shipping = lb_shipping.fit_transform(mercari_df['shipping'])

# One-hot encode cat_dae, cat_jung, and cat_so as sparse matrices
lb_cat_dae = LabelBinarizer(sparse_output=True)
X_cat_dae = lb_cat_dae.fit_transform(mercari_df['cat_dae'])

lb_cat_jung = LabelBinarizer(sparse_output=True)
X_cat_jung = lb_cat_jung.fit_transform(mercari_df['cat_jung'])

lb_cat_so = LabelBinarizer(sparse_output=True)
X_cat_so = lb_cat_so.fit_transform(mercari_df['cat_so'])
print(type(X_brand), type(X_item_cond_id), type(X_shipping))
print('X_brand_shape:{0}, X_item_cond_id shape:{1}'.format(X_brand.shape, X_item_cond_id.shape))
print('X_shipping shape:{0}, X_cat_dae shape:{1}'.format(X_shipping.shape, X_cat_dae.shape))
print('X_cat_jung shape:{0}, X_cat_so shape:{1}'.format(X_cat_jung.shape, X_cat_so.shape))

<class 'scipy.sparse.csr.csr_matrix'> <class 'scipy.sparse.csr.csr_matrix'> <class 'scipy.sparse.csr.csr_matrix'>
X_brand_shape:(1482535, 4810), X_item_cond_id shape:(1482535, 5)
X_shipping shape:(1482535, 1), X_cat_dae shape:(1482535, 11)
X_cat_jung shape:(1482535, 114), X_cat_so shape:(1482535, 871)
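Note that X_shipping has a single column even though shipping has two values, while item_condition_id with five values gets five columns. LabelBinarizer encodes a two-class input as one 0/1 column. The sketch below imitates only that shape behaviour in plain Python; one_hot is a hypothetical helper, not LabelBinarizer's implementation, and it returns dense lists rather than a sparse matrix:

```python
def one_hot(values):
    """Dense one-hot rows; a two-class input collapses to a single 0/1 column."""
    classes = sorted(set(values))
    if len(classes) == 2:
        # The larger class is treated as the positive label.
        return classes, [[int(v == classes[1])] for v in values]
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for v in values:
        row = [0] * len(classes)
        row[index[v]] = 1
        rows.append(row)
    return classes, rows

# Made-up sample values using the same value sets as shipping and item_condition_id.
_, shipping_rows = one_hot([1, 0, 1])
classes, cond_rows = one_hot([3, 3, 1, 2, 5, 4])
print(len(shipping_rows[0]), len(cond_rows[0]))
print(cond_rows[0])
```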
import gc
gc.collect()
290
Combine the vectorized sparse matrices and the one-hot encoded sparse matrices with the hstack() function from scipy.sparse.

from scipy.sparse import hstack
import gc

sparse_matrix_list = (X_name, X_descp, X_brand, X_item_cond_id,
                      X_shipping, X_cat_dae, X_cat_jung, X_cat_so)

# Use hstack from scipy.sparse to join all the encoded and vectorized datasets column-wise.
X_features_sparse = hstack(sparse_matrix_list).tocsr()
print(type(X_features_sparse), X_features_sparse.shape)

# The combined dataset takes a lot of memory, so delete it as soon as it is no longer needed.
del X_features_sparse
gc.collect()
<class 'scipy.sparse.csr.csr_matrix'> (1482535, 85812)
88
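The combined width of 85812 is simply the sum of the individual matrix widths, since hstack concatenates columns. The check below also estimates how large the same matrix would be if stored densely as float64, compared with the MemTotal value reported by /proc/meminfo earlier; all numbers are copied from outputs in this post:

```python
# Column counts copied from the shapes printed in the earlier cells.
widths = {
    'X_name': 30000,
    'X_descp': 50000,
    'X_brand': 4810,
    'X_item_cond_id': 5,
    'X_shipping': 1,
    'X_cat_dae': 11,
    'X_cat_jung': 114,
    'X_cat_so': 871,
}
n_rows = 1482535
total_cols = sum(widths.values())
print(total_cols)

# Size of the same matrix stored densely as float64 (8 bytes per cell),
# relative to MemTotal from /proc/meminfo (13302924 kB).
dense_bytes = n_rows * total_cols * 8
mem_total_bytes = 13302924 * 1024
print(round(dense_bytes / mem_total_bytes, 1))
```

A dense copy would need many times the available RAM, which is why every intermediate result is kept sparse and deleted as soon as possible.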
Building and evaluating a Ridge regression model
Defining rmsle

def rmsle(y, y_pred):
    # Use log1p instead of log so that zero values do not produce -inf (log(0)).
    return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y_pred), 2)))

def evaluate_org_price(y_test, preds):
    # The target was transformed with log1p, so restore the original scale with expm1.
    preds_exmpm = np.expm1(preds)
    y_test_exmpm = np.expm1(y_test)

    # Compute the RMSLE on the restored prices.
    rmsle_result = rmsle(y_test_exmpm, preds_exmpm)
    return rmsle_result
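Because the targets are already on the log1p scale, restoring them with expm1 and then taking RMSLE gives the same number as a plain RMSE on the log-scale values. A standard-library sketch of that equivalence, using made-up log-scale values (not taken from the dataset); the rmse helper is added here for comparison only:

```python
import math

def rmsle(y, y_pred):
    # Root mean squared error between log1p(y) and log1p(y_pred).
    sq = [(math.log1p(a) - math.log1p(b)) ** 2 for a, b in zip(y, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

def rmse(y, y_pred):
    # Plain root mean squared error.
    sq = [(a - b) ** 2 for a, b in zip(y, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical log-scale targets and predictions.
y_log = [2.4, 3.9, 2.4]
pred_log = [2.6, 3.5, 2.1]

# Path used in the post: back to prices with expm1, then RMSLE.
via_prices = rmsle([math.expm1(v) for v in y_log],
                   [math.expm1(v) for v in pred_log])
# Shortcut: RMSE directly on the log-scale values.
direct = rmse(y_log, pred_log)
print(via_prices, direct)
```

The two values agree up to floating-point error, so a model trained on the log1p target is effectively being optimised for the competition metric.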
To train and predict with several models, define a separate function, model_train_predict().
The function combines the given sparse matrices with hstack(), splits the result into training and test sets, then fits the model and predicts.

import gc
from scipy.sparse import hstack

def model_train_predict(model, matrix_list):
    # Combine the sparse matrices with hstack from scipy.sparse
    X = hstack(matrix_list).tocsr()

    X_train, X_test, y_train, y_test = train_test_split(X, mercari_df['price'],
                                                        test_size=0.2, random_state=156)

    # Fit the model and predict
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Free the large intermediate matrices before returning
    del X, X_train, X_test, y_train
    gc.collect()

    return preds, y_test
Train, predict, and evaluate with Ridge linear regression. To gauge how much the item_description feature contributes, run the model both without and with it.

linear_model = Ridge(solver="lsqr", fit_intercept=False)

sparse_matrix_list = (X_name, X_brand, X_item_cond_id,
                      X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
linear_preds, y_test = model_train_predict(model=linear_model, matrix_list=sparse_matrix_list)
print('rmsle without Item Description:', evaluate_org_price(y_test, linear_preds))

sparse_matrix_list = (X_descp, X_name, X_brand, X_item_cond_id,
                      X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
linear_preds, y_test = model_train_predict(model=linear_model, matrix_list=sparse_matrix_list)
print('rmsle with Item Description:', evaluate_org_price(y_test, linear_preds))

rmsle without Item Description: 0.503396904091685
rmsle with Item Description: 0.47114342682343463
import gc
gc.collect()
140
Building a LightGBM regression model and evaluating the final ensemble prediction

from lightgbm import LGBMRegressor

sparse_matrix_list = (X_descp, X_name, X_brand, X_item_cond_id,
                      X_shipping, X_cat_dae, X_cat_jung, X_cat_so)

lgbm_model = LGBMRegressor(n_estimators=200, learning_rate=0.5, num_leaves=125, random_state=156)
lgbm_preds, y_test = model_train_predict(model=lgbm_model, matrix_list=sparse_matrix_list)
print('LightGBM rmsle:', evaluate_org_price(y_test, lgbm_preds))

LightGBM rmsle: 0.45668225613337415
# Weighted average of the two models' log-scale predictions
preds = lgbm_preds * 0.45 + linear_preds * 0.55
print('Final rmsle of the LightGBM + Ridge ensemble:', evaluate_org_price(y_test, preds))

Final rmsle of the LightGBM + Ridge ensemble: 0.4506274879031209
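The ensemble is a weighted average of the two models' log-scale predictions, and it can beat both models when their errors partly cancel. The post does not say how the 0.45 / 0.55 weights were chosen. The sketch below uses made-up numbers and hypothetical helpers (rmse, blend) to show the effect of the weight; in practice the weight would normally be selected on a validation split rather than on the test split.

```python
import math

def rmse(y, y_pred):
    # Plain root mean squared error on log-scale values.
    sq = [(a - b) ** 2 for a, b in zip(y, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

def blend(preds_a, preds_b, weight_a):
    # Weighted average: weight_a for model A, (1 - weight_a) for model B.
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(preds_a, preds_b)]

# Hypothetical log-scale values, made up for illustration.
y_true = [2.0, 3.0, 4.0]
model_a = [2.3, 3.3, 4.3]   # consistently too high
model_b = [1.7, 2.7, 3.7]   # consistently too low

for w in (0.0, 0.45, 0.5, 1.0):
    print(w, round(rmse(y_true, blend(model_a, model_b, w)), 4))
```

With these toy values each model alone has an error of 0.3, while the 0.45 / 0.55 blend brings it down to about 0.03, because the two models err in opposite directions.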