[n113] Data Manipulation (concat, merge, melt, pivot, conditioning)

AI Bootcamp 2021. 9. 13. 09:25

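Learning objectives

Concat and merge data.
Understand the concept of tidy data.
Use the melt and pivot / pivot_table functions.

Datasets often arrive split across several tables and have to be combined. This post covers how to combine them with pandas.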

Concat

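Concatenation originally means joining two strings end to end. DataFrames can be joined in the same way with concat.

Examples of other functions that join strings

tostring
join

Examples of functions that split strings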

split
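To make the string analogy concrete, here is a minimal sketch using plain Python strings. The variable names and sample words are made up for illustration.

```python
# Joining strings (hypothetical sample values)
first, second = "data", "frame"
combined = first + second            # plain concatenation with +
joined = "-".join([first, second])   # join with a separator

# Splitting a string back into its parts
parts = joined.split("-")

print(combined)
print(joined)
print(parts)
```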

Running the code

pd.concat([x, y]) # set axis=1 to join along the columns.

df_stock_combined = pd.concat([df_stock, df_theme], axis=1)
df_stock_combined

With axis=1 (the default is 0), the datasets are joined side by side along the columns, aligned on the row index.
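The difference between the two axis values is easier to see on a tiny example. This is only a sketch: the contents of df_stock and df_theme below are made-up sample data, not the bootcamp dataset.

```python
import pandas as pd

# Hypothetical sample data; values are invented for illustration.
df_stock = pd.DataFrame({"종목": ["A사", "B사"], "매출액": [100, 200]})
df_theme = pd.DataFrame({"테마": ["반도체", "바이오"]})

# axis=0 (default): stack rows; columns missing on one side are filled with NaN.
stacked = pd.concat([df_stock, df_theme])

# axis=1: place the frames side by side, matching rows by index label.
side_by_side = pd.concat([df_stock, df_theme], axis=1)

print(stacked.shape)       # (4, 3)
print(side_by_side.shape)  # (2, 3)
```

Because axis=1 aligns on the index, two frames whose indexes do not match will produce NaN-filled rows rather than a clean side-by-side table.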

Merge

Merge combines two datasets based on a key they have in common.

pandas.DataFrame.merge — pandas 1.3.3 documentation

# Merge df and df2, keeping only the rows whose '종목' value appears in both.
df.merge(df2, how = 'inner', on = '종목')
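A small sketch of what an inner merge keeps, compared with an outer merge. The two frames are made-up sample data; indicator=True is the pandas option that adds a _merge column showing where each row came from.

```python
import pandas as pd

# Hypothetical sample data; values are invented for illustration.
df = pd.DataFrame({"종목": ["A사", "B사", "C사"], "매출액": [100, 200, 300]})
df2 = pd.DataFrame({"종목": ["B사", "C사", "D사"], "테마": ["바이오", "게임", "금융"]})

# inner: only keys present in both frames survive.
inner = df.merge(df2, how="inner", on="종목")

# outer: every key survives; _merge records the source of each row.
outer = df.merge(df2, how="outer", on="종목", indicator=True)

print(inner)
print(outer[["종목", "_merge"]])
```

Here the inner result contains only B사 and C사, while the outer result also keeps A사 (left_only) and D사 (right_only).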

Condition

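# Only rows that satisfy both conditions end up in df_subset.
# Each comparison needs its own parentheses because & binds tighter than >.

condition = (df['순이익률'] > 0) & (df['분기이익률'] > 10)
df_subset = df[condition]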
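A runnable sketch of filtering on two conditions. The rows are made-up sample data. In Python, & binds tighter than comparison operators such as >, so each comparison is wrapped in its own parentheses before the two masks are combined.

```python
import pandas as pd

# Hypothetical sample data; values are invented for illustration.
df = pd.DataFrame({
    "종목": ["A사", "B사", "C사"],
    "순이익률": [5.0, -2.0, 3.0],
    "분기이익률": [12.0, 15.0, 8.0],
})

# Build one boolean mask per comparison, then combine them with &.
condition = (df["순이익률"] > 0) & (df["분기이익률"] > 10)
df_subset = df[condition]

print(df_subset)  # only A사 satisfies both conditions
```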

Groupby

pandas.DataFrame.groupby — pandas 1.3.3 documentation

Tidy Data

Tidy format: each row holds exactly one observation.

.melt() : wide -> tidy

tidy1 = tidy1.melt(id_vars = 'index', value_vars = ['A', 'B'])
tidy1

.pivot_table() : tidy -> wide

# melt names its output columns 'variable' and 'value' by default.
wide = tidy1.pivot_table(index = 'index', columns = 'variable', values = 'value')
wide
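The round trip between the two shapes can be sketched as follows. The wide table is made-up sample data, and I am assuming the 'index' column comes from calling reset_index() on it first. By default melt names its new columns 'variable' and 'value', and pivot_table aggregates duplicates with the mean.

```python
import pandas as pd

# Hypothetical wide table: one row per label, one column per measurement.
wide_src = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# reset_index() turns the row labels into a regular column named 'index'.
tidy = wide_src.reset_index().melt(id_vars="index", value_vars=["A", "B"])
print(tidy)  # columns: index, variable, value (one observation per row)

# pivot_table reverses the reshaping (values pass through the default mean).
wide_again = tidy.pivot_table(index="index", columns="variable", values="value")
print(wide_again)
```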

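When you don't want a DataFrame column to become the index

Without as_index=False

# Select several columns with a list (double brackets); the bare-tuple form is not supported in recent pandas.
stock_tidy = df_stock_combined.groupby(["테마"])[['매출액', '자본총계', 'EPS']].mean().round()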
stock_tidy

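In other words, the '테마' column becomes the index, so it is no longer available as a regular column. This causes problems when drawing graphs.

So '테마' has to be kept out of the index.

stock_tidy = df_stock_combined.groupby(["테마"], as_index=False)[['매출액', '자본총계', 'EPS']].mean().round()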
stock_tidy

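The difference is clear: compare the '테마' column in the two tables and you can see it is the index in the first and a regular column in the second. After that you can draw graphs such as a bar chart. It looks trivial, but it took me an hour to figure out.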
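Here is a minimal sketch of the two results next to each other. The contents of df_stock_combined are made-up sample data with fewer columns than the original.

```python
import pandas as pd

# Hypothetical sample data; values are invented for illustration.
df_stock_combined = pd.DataFrame({
    "테마": ["반도체", "반도체", "바이오"],
    "매출액": [100.0, 200.0, 50.0],
    "EPS": [10.0, 20.0, 5.0],
})

# Default: the grouping key becomes the index of the result.
indexed = df_stock_combined.groupby("테마")[["매출액", "EPS"]].mean()

# as_index=False: the grouping key stays a regular column.
flat = df_stock_combined.groupby("테마", as_index=False)[["매출액", "EPS"]].mean()

print(indexed.columns.tolist())  # ['매출액', 'EPS']
print(flat.columns.tolist())     # ['테마', '매출액', 'EPS']
```

Calling reset_index() on the first result gives the same layout as the second, which is handy when the groupby has already been run.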

ABOUT ME

The journal of engineer 한다운