Published 2022. 6. 12. 20:17

pandas function summary [in progress]


Handling missing data

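df['col'].value_counts() : counts how often each value occurs in a column ('col' is a placeholder column name; default : dropna=True, so NaN is left out)

df['col'].value_counts(dropna=False) : includes NaN in the counts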

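df.isnull() : finds missing data (True where a value is missing)

df.notnull() : finds values that are not null (True where a value is present)

df.isnull().sum(axis=0) : number of missing values (NaN) in each column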
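A minimal sketch of the calls above on a toy DataFrame. The data and the column names "deck" and "age" are made up for illustration; they are not from the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
df = pd.DataFrame({
    "deck": ["A", np.nan, "C", np.nan, "C"],
    "age": [22.0, 38.0, np.nan, 35.0, 29.0],
})

# value_counts on one column; NaN is skipped unless dropna=False.
counts_without_nan = df["deck"].value_counts()
counts_with_nan = df["deck"].value_counts(dropna=False)

# isnull() and notnull() return boolean frames with the same shape as df.
missing_mask = df.isnull()
present_mask = df.notnull()

# Summing the boolean mask down each column counts the NaN values.
nan_per_column = df.isnull().sum(axis=0)
print(nan_per_column)
```

Here "deck" has 2 missing values and "age" has 1, so the last line reports those two counts per column.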

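Counting the NaN values in each column with a for loop

# Count the NaN values in each column with a for loop (a neat trick)
null_df = df.isnull()

for col in null_df.columns:
    null_count = null_df[col].value_counts()  # counts of True (NaN) and False in this column

    try:
        print(col, ':', null_count[True])  # the column has NaN values: print how many
    except KeyError:
        print(col, ':', 0)  # no True key means no NaN values: print 0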

df.dropna() : removes missing data

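df.dropna(axis=1, thresh=500) : drops every column that has fewer than 500 non-NaN values (thresh is the minimum number of valid values needed to keep a column, not a NaN count)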

df.dropna(subset=[ ], how='any', axis=0) : drops every row (axis=0) that has NaN in at least one of the columns listed in subset

(how = 'all' : drops a row only when all of those values are NaN)
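A small sketch of the three dropna variants, assuming a toy DataFrame with hypothetical columns c1, c2, c3. A threshold of 2 stands in for 500 so the effect is visible on four rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "c1": [1.0, np.nan, 3.0, 4.0],           # 3 non-NaN values
    "c2": [10.0, np.nan, np.nan, np.nan],    # 1 non-NaN value
    "c3": ["a", "b", "c", "d"],              # 4 non-NaN values
})

# thresh counts NON-missing values: keep columns with at least 2 of them.
# c1 and c3 stay, c2 is dropped.
by_thresh = df.dropna(axis=1, thresh=2)

# Drop a row if c1 OR c2 is NaN; only row 0 survives.
any_missing = df.dropna(subset=["c1", "c2"], how="any", axis=0)

# Drop a row only if c1 AND c2 are both NaN; only row 1 is removed.
all_missing = df.dropna(subset=["c1", "c2"], how="all", axis=0)

print(by_thresh.columns.tolist())
print(any_missing.index.tolist())
print(all_missing.index.tolist())
```

None of these calls modify df; each returns a new DataFrame unless inplace=True is passed.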

df.fillna() : replaces missing data

mean_age = df['age'].mean(axis=0)

df['age'] = df['age'].fillna(mean_age) : replaces NaN with the column mean (assuming an 'age' column, as the variable name suggests)

most_freq = df['col'].value_counts(dropna=True).idxmax()

df['col'] = df['col'].fillna(most_freq) : replaces NaN with the most frequent value in the column ('col' is a placeholder column name)

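df.fillna(method='ffill', inplace=True) : replaces each NaN with the value from the row directly above it (pandas 2.1 and later deprecate the method argument; df.ffill(inplace=True) does the same thing)

(method = 'bfill' : uses the value from the row directly below; df.bfill() is the equivalent)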
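A rough sketch of the four fill strategies on one toy DataFrame. The columns "age" and "town" and their values are assumptions for illustration. It uses Series.ffill() and Series.bfill() instead of the method argument so that it also runs on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, np.nan],
    "town": ["Seoul", "Busan", np.nan, "Seoul"],
})

# Mean of the non-missing ages: (20 + 40) / 2 = 30.
mean_age = df["age"].mean()
age_filled = df["age"].fillna(mean_age)

# Most frequent non-missing value in the column.
most_freq = df["town"].value_counts(dropna=True).idxmax()
town_filled = df["town"].fillna(most_freq)

# Forward fill copies the previous row's value down into each NaN.
forward = df["age"].ffill()

# Backward fill copies the next row's value up; a trailing NaN stays NaN.
backward = df["age"].bfill()

print(age_filled.tolist())
print(town_filled.tolist())
print(forward.tolist())
print(backward.tolist())
```

Note that bfill leaves the last "age" value missing because there is no later row to copy from; ffill has the same limitation for a NaN in the first row.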

Handling duplicate data

df.duplicated() : checks whether each row repeats an earlier row (duplicate = True, not a duplicate = False)

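df.drop_duplicates() : removes duplicate rows

df.drop_duplicates(subset=['c2','c3']) : only the c2 and c3 columns are compared when deciding whether a row is a duplicate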
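A short sketch of the duplicate checks, assuming a toy DataFrame with hypothetical columns c1, c2, c3 (the c2 and c3 names follow the example above; the data is made up).

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "c1": ["a", "a", "b", "a", "b"],
    "c2": [1, 1, 1, 2, 2],
    "c3": [1, 1, 2, 2, 2],
})

# True marks a row that repeats an earlier row; the first occurrence is False.
dup_mask = df.duplicated()

# Compare whole rows: row 1 is identical to row 0 and is removed.
deduped = df.drop_duplicates()

# Compare only c2 and c3: rows 1 and 4 repeat earlier (c2, c3) pairs.
deduped_subset = df.drop_duplicates(subset=["c2", "c3"])

print(dup_mask.tolist())
print(deduped.index.tolist())
print(deduped_subset.index.tolist())
```

By default the first occurrence is kept; the keep parameter ('first', 'last', or False) changes which rows count as the duplicates.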

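Handling categorical data

# Binning
# Use np.histogram to get the list of boundary values that split the data into 3 bins
import numpy as np
import pandas as pd

np.histogram(df['horsepower'], bins=3)

# count holds the number of values in each bin, bin_dividers the 4 bin edges
count, bin_dividers = np.histogram(df['horsepower'], bins=3)

# Give the 3 bins names
bin_names = ['low output', 'normal output', 'high output']

# Use pd.cut to assign each value to one of the 3 bins
# include_lowest=True puts the minimum value into the first bin
df['hp_bin'] = pd.cut(x=df['horsepower'],
                      bins=bin_dividers,
                      labels=bin_names,
                      include_lowest=True)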
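To see the binning end to end, here is a self-contained sketch with six made-up horsepower values and shortened label names (both are assumptions, not the original data). np.histogram needs a numeric column with no NaN values, so missing values would have to be dropped or filled first.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values for illustration only.
df = pd.DataFrame({"horsepower": [46.0, 70.0, 95.0, 120.0, 150.0, 230.0]})

# Three equal-width bins between the minimum (46) and the maximum (230).
count, bin_dividers = np.histogram(df["horsepower"], bins=3)

bin_names = ["low", "normal", "high"]

# Assign each value to a bin; include_lowest=True keeps the minimum (46)
# inside the first bin instead of turning it into NaN.
df["hp_bin"] = pd.cut(x=df["horsepower"],
                      bins=bin_dividers,
                      labels=bin_names,
                      include_lowest=True)

print(count)
print(df)
```

With this data the bin width is (230 - 46) / 3, so 46, 70 and 95 fall in the first bin, 120 and 150 in the second, and 230 in the third.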

pd.get_dummies() : converts a categorical column into dummy variables (one indicator column per category)
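A minimal sketch of pd.get_dummies on a small categorical Series. The Series name "hp_bin" and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical categorical data for illustration only.
hp_bin = pd.Series(["low", "high", "normal", "low"], name="hp_bin")

# One indicator column per category; each row has exactly one column set.
dummies = pd.get_dummies(hp_bin)

print(dummies)
```

The indicator columns come out in sorted category order (high, low, normal). Depending on the pandas version the values are shown as True/False or as 1/0.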
