  • [Text] Kaggle - Mercari Price Suggestion
    Machine Learning & Deep Learning 2021. 12. 6. 23:01

    Mercari Price Suggestion

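    • Build a regression model that predicts a product's price from its name, condition, category, brand name, and description.
    • Vectorize the text columns into features, then combine them with the remaining features to build the training/test feature data sets.
    • Prediction performance can be improved depending on how effectively the text is extracted and vectorized.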
    from google.colab import auth
    auth.authenticate_user()
    
    from google.colab import drive
    drive.mount('/content/drive')
    Mounted at /content/drive
    
    !cd "/content/drive/My Drive/Colab Notebooks/data"; ls -l
    total 163086
    -rw------- 1 root root      4651 Sep 17 01:21 규모별_미분양현황_20210917101949.csv
    -rw------- 1 root root      1000 Sep 17 01:39 규모별_미분양현황_20210917103934.csv
    -rw------- 1 root root    125204 Oct  1 01:52 breast-cancer-wisconsin-data.csv
    -rw------- 1 root root 150828752 Sep 19  2019 creditcard.csv
    -rw------- 1 root root       220 Sep 13 05:23 n113_stock.csv
    -rw------- 1 root root   1740630 Sep 23 01:01 n123_서울시_기간별_시간평균_대기환경_정보_2020.03.csv
    -rw------- 1 root root    203378 Sep 27 00:02 sc12x_dataset.csv
    -rw------- 1 root root  14094055 Nov  1 01:35 weatherAUS.csv
    
    !cat /proc/meminfo;cat /proc/cpuinfo
    MemTotal:       13302924 kB
    MemFree:        10545288 kB
    MemAvailable:   12459224 kB
    Buffers:          153436 kB
    Cached:          1891820 kB
    SwapCached:            0 kB
    Active:          1026244 kB
    Inactive:        1500588 kB
    Active(anon):     432672 kB
    Inactive(anon):      504 kB
    Active(file):     593572 kB
    Inactive(file):  1500084 kB
    Unevictable:           0 kB
    Mlocked:               0 kB
    SwapTotal:             0 kB
    SwapFree:              0 kB
    Dirty:              3564 kB
    Writeback:             0 kB
    AnonPages:        481444 kB
    Mapped:           272532 kB
    Shmem:              1180 kB
    KReclaimable:     123508 kB
    Slab:             166488 kB
    SReclaimable:     123508 kB
    SUnreclaim:        42980 kB
    KernelStack:        5848 kB
    PageTables:         6856 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:     6651460 kB
    Committed_AS:    3619984 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:        8384 kB
    VmallocChunk:          0 kB
    Percpu:             1424 kB
    AnonHugePages:         0 kB
    ShmemHugePages:        0 kB
    ShmemPmdMapped:        0 kB
    FileHugePages:         0 kB
    FilePmdMapped:         0 kB
    CmaTotal:              0 kB
    CmaFree:               0 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    Hugetlb:               0 kB
    DirectMap4k:      113472 kB
    DirectMap2M:     6174720 kB
    DirectMap1G:     9437184 kB
    processor    : 0
    vendor_id    : GenuineIntel
    cpu family    : 6
    model        : 79
    model name    : Intel(R) Xeon(R) CPU @ 2.20GHz
    stepping    : 0
    microcode    : 0x1
    cpu MHz        : 2199.998
    cache size    : 56320 KB
    physical id    : 0
    siblings    : 2
    core id        : 0
    cpu cores    : 1
    apicid        : 0
    initial apicid    : 0
    fpu        : yes
    fpu_exception    : yes
    cpuid level    : 13
    wp        : yes
    flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
    bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
    bogomips    : 4399.99
    clflush size    : 64
    cache_alignment    : 64
    address sizes    : 46 bits physical, 48 bits virtual
    power management:
    
    processor    : 1
    vendor_id    : GenuineIntel
    cpu family    : 6
    model        : 79
    model name    : Intel(R) Xeon(R) CPU @ 2.20GHz
    stepping    : 0
    microcode    : 0x1
    cpu MHz        : 2199.998
    cache size    : 56320 KB
    physical id    : 0
    siblings    : 2
    core id        : 0
    cpu cores    : 1
    apicid        : 1
    initial apicid    : 1
    fpu        : yes
    fpu_exception    : yes
    cpuid level    : 13
    wp        : yes
    flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
    bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
    bogomips    : 4399.99
    clflush size    : 64
    cache_alignment    : 64
    address sizes    : 46 bits physical, 48 bits virtual
    power management:
    
    

     

    Data preprocessing

    from sklearn.linear_model import Ridge , LogisticRegression
    from sklearn.model_selection import train_test_split , cross_val_score
    from sklearn.feature_extraction.text import CountVectorizer , TfidfVectorizer
    import pandas as pd
    
    mercari_df= pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/mercari_train.tsv',sep='\t')
    print(mercari_df.shape)
    mercari_df.head(3)
    (1482535, 8)
    
       train_id  name                                 item_condition_id  category_name                                      brand_name  price  shipping  item_description
    0  0         MLB Cincinnati Reds T Shirt Size XL  3                  Men/Tops/T-shirts                                  NaN         10.0   1         No description yet
    1  1         Razer BlackWidow Chroma Keyboard     3                  Electronics/Computers & Tablets/Components & P...  Razer       52.0   0         This keyboard is in great condition and works ...
    2  2         AVA-VIV Blouse                       1                  Women/Tops & Blouses/Blouse                        Target      10.0   1         Adorable top with a hint of lace and a key hol...
    print(mercari_df.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1482535 entries, 0 to 1482534
    Data columns (total 8 columns):
     #   Column             Non-Null Count    Dtype  
    ---  ------             --------------    -----  
     0   train_id           1482535 non-null  int64  
     1   name               1482535 non-null  object 
     2   item_condition_id  1482535 non-null  int64  
     3   category_name      1476208 non-null  object 
     4   brand_name         849853 non-null   object 
     5   price              1482535 non-null  float64
     6   shipping           1482535 non-null  int64  
     7   item_description   1482531 non-null  object 
    dtypes: float64(1), int64(3), object(4)
    memory usage: 90.5+ MB
    None
    

     

    Checking the target distribution

    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    
    y_train_df = mercari_df['price']
    plt.figure(figsize=(6,4))
    # histplot replaces distplot, which is deprecated in recent seaborn releases;
    # bins=50 keeps the bin count bounded on roughly 1.48M rows
    sns.histplot(y_train_df, bins=50)
    
    
    

     

    Checking the distribution after log-transforming the target

    import numpy as np
    
    y_train_df = np.log1p(y_train_df)
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(y_train_df, bins=50)
    

    mercari_df['price'] = np.log1p(mercari_df['price'])
    mercari_df['price'].head(3)
    0    2.397895
    1    3.970292
    2    2.397895
    Name: price, dtype: float64
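
The effect of the transform can be checked on a handful of values. The sketch below uses made-up prices (an assumption for illustration, not rows from the Mercari data) to show that np.log1p is defined at 0, compresses large prices, and is undone exactly by np.expm1.

```python
import numpy as np

# Hypothetical prices, chosen only for illustration (not taken from the data set).
prices = np.array([0.0, 3.0, 10.0, 52.0, 2000.0])

# log1p(x) = log(1 + x): defined at x = 0 and compresses the long right tail.
log_prices = np.log1p(prices)

# expm1(x) = exp(x) - 1 is the inverse of log1p, so the original scale comes back.
restored = np.expm1(log_prices)

print(log_prices.round(6))
print(np.allclose(restored, prices))
```

A price of 10.0 maps to log(11), which is the 2.397895 shown for rows 0 and 2 above.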
    

     

    Examining the values of each feature

    print('Shipping value counts:\n',mercari_df['shipping'].value_counts())
    print('item_condition_id value counts:\n',mercari_df['item_condition_id'].value_counts())
    Shipping value counts:
     0    819435
    1    663100
    Name: shipping, dtype: int64
    item_condition_id value counts:
     1    640549
    3    432161
    2    375479
    4     31962
    5      2384
    Name: item_condition_id, dtype: int64
    
    boolean_cond= mercari_df['item_description']=='No description yet'
    mercari_df[boolean_cond]['item_description'].count()
    82489
    

     

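    category_name is a '/'-delimited string of large/medium/small categories (for example Men/Tops/T-shirts). Split it into separate columns: cat_dae (large), cat_jung (medium), and cat_so (small).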

    # Helper called from apply(): split category_name into [large, medium, small] and return the list
    def split_cat(category_name):
        try:
            return category_name.split('/')
        except AttributeError:
            # Missing categories are NaN (a float), which has no split() method
            return ['Other_Null' , 'Other_Null' , 'Other_Null']
    
    # Call split_cat() through apply() to create the large/medium/small category columns in mercari_df.
    mercari_df['cat_dae'], mercari_df['cat_jung'], mercari_df['cat_so'] = \
                            zip(*mercari_df['category_name'].apply(lambda x : split_cat(x)))
    
    # Show value counts for the large category only; the medium and small categories
    # have many distinct values, so print just the number of distinct values
    print('Large category value counts:\n', mercari_df['cat_dae'].value_counts())
    print('Number of medium categories:', mercari_df['cat_jung'].nunique())
    print('Number of small categories:', mercari_df['cat_so'].nunique())
    Large category value counts:
     Women                     664385
    Beauty                    207828
    Kids                      171689
    Electronics               122690
    Men                        93680
    Home                       67871
    Vintage & Collectibles     46530
    Other                      45351
    Handmade                   30842
    Sports & Outdoors          25342
    Other_Null                  6327
    Name: cat_dae, dtype: int64
    Number of medium categories: 114
    Number of small categories: 871
    
    # Alternative way to build the same columns: reuse split_cat() defined above,
    # store the split list in a temporary column, then pick each level by index.
    mercari_df['category_list'] = mercari_df['category_name'].apply(split_cat)
    mercari_df['category_list'].head()
    0                                [Men, Tops, T-shirts]
    1    [Electronics, Computers & Tablets, Components ...
    2                      [Women, Tops & Blouses, Blouse]
    3               [Home, Home Décor, Home Décor Accents]
    4                          [Women, Jewelry, Necklaces]
    Name: category_list, dtype: object
    
    mercari_df['cat_dae'] = mercari_df['category_list'].apply(lambda x:x[0])
    mercari_df['cat_jung'] = mercari_df['category_list'].apply(lambda x:x[1])
    mercari_df['cat_so'] = mercari_df['category_list'].apply(lambda x:x[2])
    
    mercari_df.drop('category_list', axis=1, inplace=True) 
    mercari_df[['cat_dae','cat_jung','cat_so']].head()
       cat_dae      cat_jung             cat_so
    0  Men          Tops                 T-shirts
    1  Electronics  Computers & Tablets  Components & Parts
    2  Women        Tops & Blouses       Blouse
    3  Home         Home Décor           Home Décor Accents
    4  Women        Jewelry              Necklaces
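
The same split can also be sketched with pandas' vectorized string methods instead of apply(). The toy frame and the fill value below are assumptions for illustration. Note one difference from the code above: with n=2, a category with more than three levels keeps the remainder inside cat_so, whereas the indexing approach above keeps only the third level.

```python
import pandas as pd

# Toy frame that mimics category_name, including a missing value.
toy = pd.DataFrame({'category_name': ['Men/Tops/T-shirts',
                                      'Women/Jewelry/Necklaces',
                                      None]})

# Fill missing categories first so every row splits into three parts.
filled = toy['category_name'].fillna('Other_Null/Other_Null/Other_Null')

# n=2 caps the number of splits at two, so exactly three columns come back.
parts = filled.str.split('/', n=2, expand=True)
parts.columns = ['cat_dae', 'cat_jung', 'cat_so']

print(parts)
```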

     

    Handling null values in bulk

    mercari_df['brand_name'] = mercari_df['brand_name'].fillna(value='Other_Null')
    mercari_df['category_name'] = mercari_df['category_name'].fillna(value='Other_Null')
    mercari_df['item_description'] = mercari_df['item_description'].fillna(value='Other_Null')
    
    # Check the null count of each column; every count should be 0.
    mercari_df.isnull().sum()
    train_id             0
    name                 0
    item_condition_id    0
    category_name        0
    brand_name           0
    price                0
    shipping             0
    item_description     0
    cat_dae              0
    cat_jung             0
    cat_so               0
    dtype: int64
    
    mercari_df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1482535 entries, 0 to 1482534
    Data columns (total 11 columns):
     #   Column             Non-Null Count    Dtype  
    ---  ------             --------------    -----  
     0   train_id           1482535 non-null  int64  
     1   name               1482535 non-null  object 
     2   item_condition_id  1482535 non-null  int64  
     3   category_name      1482535 non-null  object 
     4   brand_name         1482535 non-null  object 
     5   price              1482535 non-null  float64
     6   shipping           1482535 non-null  int64  
     7   item_description   1482535 non-null  object 
     8   cat_dae            1482535 non-null  object 
     9   cat_jung           1482535 non-null  object 
     10  cat_so             1482535 non-null  object 
    dtypes: float64(1), int64(3), object(7)
    memory usage: 124.4+ MB
    

     

    Feature encoding and feature vectorization

    Checking the distinct values of brand_name and name

    print('Number of distinct brand names:', mercari_df['brand_name'].nunique())
    print('Top 5 brand names by count:\n', mercari_df['brand_name'].value_counts()[:5])
    Number of distinct brand names: 4810
    Top 5 brand names by count:
     Other_Null           632682
    PINK                  54088
    Nike                  54043
    Victoria's Secret     48036
    LuLaRoe               31024
    Name: brand_name, dtype: int64
    
    print('Number of distinct names:', mercari_df['name'].nunique())
    print('First 7 names:\n', mercari_df['name'][:7])
    Number of distinct names: 1225273
    First 7 names:
     0    MLB Cincinnati Reds T Shirt Size XL
    1       Razer BlackWidow Chroma Keyboard
    2                         AVA-VIV Blouse
    3                  Leather Horse Statues
    4                   24K GOLD plated rose
    5       Bundled items requested for Ruie
    6     Acacia pacific tides santorini top
    Name: name, dtype: object
    

     

    Checking the string length of item_description

    pd.set_option('display.max_colwidth', 200)
    
    # Average character length of item_description
    print('Average item_description length (characters):',mercari_df['item_description'].str.len().mean())
    
    mercari_df['item_description'][:2]
    Average item_description length (characters): 145.7113889385411
    
    0                                                                                                                                                                              No description yet
    1    This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.
    Name: item_description, dtype: object
    
    import gc
    gc.collect()
    88
    

     

    Vectorize name with counts and item_description with TF-IDF

    # Vectorize the name column with CountVectorizer
    cnt_vec = CountVectorizer(max_features=30000)
    X_name = cnt_vec.fit_transform(mercari_df.name)
    
    # Vectorize the item_description column with TfidfVectorizer
    tfidf_descp = TfidfVectorizer(max_features = 50000, ngram_range= (1,3) , stop_words='english')
    X_descp = tfidf_descp.fit_transform(mercari_df['item_description'])
    
    print('name vectorization shape:',X_name.shape)
    print('item_description vectorization shape:',X_descp.shape)
    name vectorization shape: (1482535, 30000)
    item_description vectorization shape: (1482535, 50000)
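
To make the difference between the two vectorizers concrete, here is a minimal sketch on three made-up documents (the sentences are assumptions, not rows from the data set). CountVectorizer stores raw term counts, while TfidfVectorizer drops English stop words, adds n-grams, and L2-normalizes each row by default.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three hypothetical documents used only for illustration.
docs = ['razer keyboard great condition',
        'keyboard works great',
        'adorable blouse with lace']

# Raw term counts, one column per distinct token.
cnt = CountVectorizer()
X_cnt = cnt.fit_transform(docs)

# TF-IDF weights over unigrams and bigrams, with English stop words removed.
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
X_tfidf = tfidf.fit_transform(docs)

print(X_cnt.shape, X_tfidf.shape)
print(sorted(cnt.vocabulary_))
```

The count matrix has one column for each of the nine distinct tokens. The TF-IDF matrix has a different width because "with" is removed as a stop word and bigrams such as "razer keyboard" become columns of their own.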
    

     

    One-hot encode with scikit-learn's LabelBinarizer and store the results as sparse matrices

    from sklearn.preprocessing import LabelBinarizer
    
    # One-hot encode brand_name, item_condition_id, and shipping into sparse matrices
    lb_brand_name= LabelBinarizer(sparse_output=True)
    X_brand = lb_brand_name.fit_transform(mercari_df['brand_name'])
    
    lb_item_cond_id = LabelBinarizer(sparse_output=True)
    X_item_cond_id = lb_item_cond_id.fit_transform(mercari_df['item_condition_id'])
    
    lb_shipping= LabelBinarizer(sparse_output=True)
    X_shipping = lb_shipping.fit_transform(mercari_df['shipping'])
    
    # One-hot encode cat_dae, cat_jung, and cat_so into sparse matrices
    lb_cat_dae = LabelBinarizer(sparse_output=True)
    X_cat_dae= lb_cat_dae.fit_transform(mercari_df['cat_dae'])
    
    lb_cat_jung = LabelBinarizer(sparse_output=True)
    X_cat_jung = lb_cat_jung.fit_transform(mercari_df['cat_jung'])
    
    lb_cat_so = LabelBinarizer(sparse_output=True)
    X_cat_so = lb_cat_so.fit_transform(mercari_df['cat_so'])
    print(type(X_brand), type(X_item_cond_id), type(X_shipping))
    print('X_brand_shape:{0}, X_item_cond_id shape:{1}'.format(X_brand.shape, X_item_cond_id.shape))
    print('X_shipping shape:{0}, X_cat_dae shape:{1}'.format(X_shipping.shape, X_cat_dae.shape))
    print('X_cat_jung shape:{0}, X_cat_so shape:{1}'.format(X_cat_jung.shape, X_cat_so.shape))
    <class 'scipy.sparse.csr.csr_matrix'> <class 'scipy.sparse.csr.csr_matrix'> <class 'scipy.sparse.csr.csr_matrix'>
    X_brand_shape:(1482535, 4810), X_item_cond_id shape:(1482535, 5)
    X_shipping shape:(1482535, 1), X_cat_dae shape:(1482535, 11)
    X_cat_jung shape:(1482535, 114), X_cat_so shape:(1482535, 871)
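
The shapes above follow from how LabelBinarizer encodes labels: one column per class, except that a two-class feature collapses to a single 0/1 column, which is why X_shipping has one column. A small sketch with made-up values (assumed for illustration):

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical values standing in for brand_name and shipping.
brands = ['Nike', 'PINK', 'Other_Null', 'Nike']
shipping = [1, 0, 0, 1]

# Three distinct brands give three one-hot columns.
X_b = LabelBinarizer(sparse_output=True).fit_transform(brands)

# Two distinct values give a single 0/1 column rather than two.
X_s = LabelBinarizer(sparse_output=True).fit_transform(shipping)

print(X_b.shape, X_s.shape)
print(X_b.toarray())
```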
    
    import gc
    gc.collect()
    290
    

     

    Combine the vectorized sparse matrices and the one-hot encoded sparse matrices with the hstack() function from the scipy package

    from  scipy.sparse import hstack
    import gc
    
    sparse_matrix_list = (X_name, X_descp, X_brand, X_item_cond_id,
                X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
    
    # Use hstack from scipy.sparse to combine all the encoded and vectorized data sets built above.
    X_features_sparse= hstack(sparse_matrix_list).tocsr()
    print(type(X_features_sparse), X_features_sparse.shape)
    
    # The combined data set takes a lot of memory, so delete it as soon as it is no longer needed.
    del X_features_sparse
    gc.collect()
    <class 'scipy.sparse.csr.csr_matrix'> (1482535, 85812)
    
    88
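
hstack() places the blocks side by side, so the row count stays the same and the column counts add up. The sketch below uses two tiny made-up matrices (assumptions for illustration) and also checks that the column counts reported above sum to 85812.

```python
from scipy.sparse import csr_matrix, hstack

# Two small hypothetical sparse blocks with the same number of rows.
A = csr_matrix([[1, 0], [0, 2], [3, 0]])
B = csr_matrix([[0, 0, 4], [5, 0, 0], [0, 6, 0]])

# hstack returns COO format by default; tocsr() converts it for fast row slicing.
X = hstack([A, B]).tocsr()
print(X.shape, X.nnz)

# Column counts of the eight blocks in this post, in the order they were stacked.
block_widths = [30000, 50000, 4810, 5, 1, 11, 114, 871]
print(sum(block_widths))
```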
    

     

    Building and evaluating a Ridge regression model

    Defining RMSLE

    def rmsle(y , y_pred):
        # Compute RMSLE with log1p rather than log to avoid underflow/overflow (log(0) is -inf)
        return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y_pred), 2)))
    
    def evaluate_org_price(y_test , preds): 
    
        # The target was log1p-transformed, so restore the original scale with expm1.
        preds_exmpm = np.expm1(preds)
        y_test_exmpm = np.expm1(y_test)
    
        # Compute the RMSLE on the original price scale
        rmsle_result = rmsle(y_test_exmpm, preds_exmpm)
        return rmsle_result
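
Because the model is trained on log1p(price), the RMSLE on the original scale equals the plain RMSE between the log-scale targets and predictions. The sketch below checks this on a few made-up numbers (assumptions for illustration), re-declaring rmsle so the block runs on its own.

```python
import numpy as np

def rmsle(y, y_pred):
    # Same definition as in the post: RMSE between log1p-transformed values.
    return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y_pred), 2)))

# Hypothetical log-scale targets and predictions.
y_log = np.array([2.4, 3.9, 2.4, 3.6])
p_log = np.array([2.6, 3.5, 2.5, 3.6])

# Route 1: go back to prices with expm1, then compute RMSLE.
via_original = rmsle(np.expm1(y_log), np.expm1(p_log))

# Route 2: compute RMSE directly on the log-scale values.
rmse_log = np.sqrt(np.mean((y_log - p_log) ** 2))

print(via_original, rmse_log)
```

The two numbers agree up to floating-point error, since log1p(expm1(x)) = x.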

     

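    To train and predict with several models, define a separate helper function, model_train_predict().

    It combines the given sparse matrices with hstack(), splits the result into training and test sets, and then fits the model and predicts.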

    import gc 
    from  scipy.sparse import hstack
    
    def model_train_predict(model,matrix_list):
        # Combine the sparse matrices with hstack from scipy.sparse
        X= hstack(matrix_list).tocsr()     
    
        X_train, X_test, y_train, y_test=train_test_split(X, mercari_df['price'], 
                                                          test_size=0.2, random_state=156)
    
        # Fit the model and predict
        model.fit(X_train , y_train)
        preds = model.predict(X_test)
    
        del X , X_train , X_test , y_train 
        gc.collect()
    
        return preds , y_test

     

    Train, predict, and evaluate with Ridge linear regression, and also test how much the item_description feature contributes

    linear_model = Ridge(solver = "lsqr", fit_intercept=False)
    
    sparse_matrix_list = (X_name, X_brand, X_item_cond_id,
                          X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
    linear_preds , y_test = model_train_predict(model=linear_model ,matrix_list=sparse_matrix_list)
    print('RMSLE without Item Description:', evaluate_org_price(y_test , linear_preds))
    
    sparse_matrix_list = (X_descp, X_name, X_brand, X_item_cond_id,
                          X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
    linear_preds , y_test = model_train_predict(model=linear_model , matrix_list=sparse_matrix_list)
    print('RMSLE with Item Description:',  evaluate_org_price(y_test ,linear_preds))
    RMSLE without Item Description: 0.503396904091685
    RMSLE with Item Description: 0.47114342682343463
    
    import gc
    gc.collect()
    140
    

     

    Building a LightGBM regression model and evaluating the final ensemble prediction

    from lightgbm import LGBMRegressor
    
    sparse_matrix_list = (X_descp, X_name, X_brand, X_item_cond_id,
                          X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
    
    lgbm_model = LGBMRegressor(n_estimators=200, learning_rate=0.5, num_leaves=125, random_state=156)
    lgbm_preds , y_test = model_train_predict(model = lgbm_model , matrix_list=sparse_matrix_list)
    print('LightGBM RMSLE:',  evaluate_org_price(y_test , lgbm_preds))
    LightGBM RMSLE: 0.45668225613337415
    
    preds = lgbm_preds * 0.45 + linear_preds * 0.55
    print('Final RMSLE of the LightGBM + Ridge ensemble:',  evaluate_org_price(y_test , preds))
    Final RMSLE of the LightGBM + Ridge ensemble: 0.4506274879031209
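
The 0.45/0.55 weights above are fixed by hand. One way to choose such weights is to scan a grid and keep the blend with the lowest error. The sketch below does this on synthetic predictions (every array here is an assumption, not output from the models in this post). In practice the scan should be run on a separate validation split, since tuning weights on the test set makes the reported score optimistic.

```python
import numpy as np

def rmse(a, b):
    # Root mean squared error between two arrays on the same (log) scale.
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic stand-ins for log-scale targets and two models' predictions.
rng = np.random.default_rng(0)
y = rng.normal(3.0, 0.7, size=1000)
pred_a = y + rng.normal(0.0, 0.45, size=1000)
pred_b = y + rng.normal(0.0, 0.47, size=1000)

# Try blend weights 0.00, 0.05, ..., 1.00 for model A (model B gets 1 - w).
weights = np.linspace(0.0, 1.0, 21)
scores = [rmse(y, w * pred_a + (1 - w) * pred_b) for w in weights]
best_w = float(weights[int(np.argmin(scores))])

print(best_w, min(scores))
```

Because the grid includes w = 0 and w = 1, the best blend can never score worse than either model alone on the data used for the scan.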
    
