Apartment Auction Price Prediction Project 6: Model Optimization


qordnswnd123 2025. 1. 8. 15:39

1. Loading the data

import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
result = pd.read_csv('Auction_result.csv')
submission = pd.read_csv('sample_submission.csv')

2. Preprocessing the training (train) data

# Create the year feature and drop the date features
# errors='coerce' turns unparseable dates into NaT so that .dt.year below does not fail
train['Final_auction_date'] = pd.to_datetime(train['Final_auction_date'], errors = 'coerce')
train['year'] = train['Final_auction_date'].dt.year
date_col = ['Appraisal_date', 'First_auction_date', 'Final_auction_date', 'Preserve_regist_date', 'Close_date']
train = train.drop(date_col, axis= 1)

# Fill missing values with the most frequent value
addr_freq = train['addr_bunji1'].value_counts().index[0]
road_freq = train['road_bunji1'].value_counts().index[0]
train['addr_bunji1'] = train['addr_bunji1'].fillna(addr_freq)
train['road_bunji1'] = train['road_bunji1'].fillna(road_freq)

# Drop features with many missing values
much_null = ['addr_li', 'addr_bunji2', 'Specific', 'road_bunji2']
train = train.drop(much_null, axis = 1)

# Drop features that were highly correlated with each other (the target excluded)
highcorr_col = ['Total_land_real_area', 'Total_land_auction_area', 'Total_building_area', 'Total_building_auction_area']
train = train.drop(highcorr_col, axis =1)

# Drop the features that the earlier data analysis decided to remove
drop_list = ['Close_result', 'Final_result', 'addr_dong', 'addr_etc', 'road_name', 'Appraisal_company', 'Creditor', 'addr_si']
train = train.drop(drop_list, axis = 1)

3. Preprocessing the test (test) data

# Create the year feature and drop the date features
# errors='coerce' turns unparseable dates into NaT so that .dt.year below does not fail
test['Final_auction_date'] = pd.to_datetime(test['Final_auction_date'], errors = 'coerce')
test['year'] = test['Final_auction_date'].dt.year
date_col = ['Appraisal_date', 'First_auction_date', 'Final_auction_date', 'Preserve_regist_date', 'Close_date']
test = test.drop(date_col, axis= 1)

# Fill missing values with the most frequent value
# (computed on train, so nothing is learned from the test set)
addr_freq = train['addr_bunji1'].value_counts().index[0]
road_freq = train['road_bunji1'].value_counts().index[0]
test['addr_bunji1'] = test['addr_bunji1'].fillna(addr_freq)
test['road_bunji1'] = test['road_bunji1'].fillna(road_freq)

# Drop features with many missing values
much_null = ['addr_li', 'addr_bunji2', 'Specific', 'road_bunji2']
test = test.drop(much_null, axis = 1)

# Drop features that were highly correlated with each other (the target excluded)
highcorr_col = ['Total_land_real_area', 'Total_land_auction_area', 'Total_building_area', 'Total_building_auction_area']
test = test.drop(highcorr_col, axis =1)

# Drop the features that the earlier data analysis decided to remove
drop_list = ['Close_result', 'Final_result', 'addr_dong', 'addr_etc', 'road_name', 'Appraisal_company', 'Creditor', 'addr_si']
test = test.drop(drop_list, axis = 1)
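Sections 2 and 3 apply the same steps to train and test, so one way to keep them in sync is a small shared function. The sketch below is only an illustration on made-up toy frames: `preprocess`, `fill_values`, and `drop_cols` are hypothetical names that do not appear in the original code, and the fill value is learned from the training frame only.

```python
import pandas as pd

def preprocess(df, fill_values, drop_cols):
    """Apply the same cleaning steps to any split (hypothetical helper)."""
    out = df.copy()
    # Parse the date; unparseable values become NaT instead of raising.
    out['Final_auction_date'] = pd.to_datetime(out['Final_auction_date'], errors='coerce')
    out['year'] = out['Final_auction_date'].dt.year
    # Fill missing values with statistics learned from the training split.
    for col, value in fill_values.items():
        out[col] = out[col].fillna(value)
    # Drop only the columns that are actually present.
    return out.drop(columns=[c for c in drop_cols if c in out.columns])

# Toy frames standing in for train.csv / test.csv (made-up values).
toy_train = pd.DataFrame({
    'Final_auction_date': ['2017-03-01', '2016-11-15', '2017-07-20'],
    'addr_bunji1': [10.0, 10.0, None],
    'addr_li': [None, None, None],
})
toy_test = pd.DataFrame({
    'Final_auction_date': ['2018-01-05', 'not a date'],
    'addr_bunji1': [None, 7.0],
    'addr_li': [None, None],
})

# The most frequent value comes from the training frame only.
fill_values = {'addr_bunji1': toy_train['addr_bunji1'].mode()[0]}
drop_cols = ['Final_auction_date', 'addr_li']

clean_train = preprocess(toy_train, fill_values, drop_cols)
clean_test = preprocess(toy_test, fill_values, drop_cols)
print(clean_test)
```

Calling the same function on both splits means a change to the cleaning rules only has to be made once.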

4. Building derived features from both datasets and merging

# Create result_year and keep only records from 2014 onward
result['result_year'] = pd.to_datetime(result['Auction_date'], errors = 'coerce').dt.year
result = result[result['result_year'] >= 2014]

# Keep the last record for each auction
need_merge = result[['Auction_key', 'Auction_results']].drop_duplicates(subset = 'Auction_key', keep = 'last')
need_merge = need_merge.reset_index(drop = True)

# Attach the minimum appraisal price
# (mapped by Auction_key: groupby output is sorted by key, while need_merge keeps the row order of result)
appraisal_min = result.groupby('Auction_key')['Appraisal_price'].min()
need_merge['appraisal_min'] = need_merge['Auction_key'].map(appraisal_min)

# Attach the lowest minimum sales price
sales_min = result.groupby('Auction_key')['Minimum_sales_price'].min()
need_merge['sales_min'] = need_merge['Auction_key'].map(sales_min)

# Attach the highest auction round number
max_seq = result.groupby('Auction_key')['Auction_seq'].max()
need_merge['max_seq'] = need_merge['Auction_key'].map(max_seq)

# Count how many rounds ended in '유찰' (a failed auction with no sale)
failed_auction = result[result['Auction_results'] == '유찰']
auction_count = failed_auction.groupby('Auction_key')['Auction_results'].count()
auction_result = need_merge.join(auction_count, on = 'Auction_key', how = 'left', lsuffix = '_left', rsuffix = '_right')

# Auctions with no failed round have no count; fill Auction_results_right with 0
auction_result['Auction_results_right'] = auction_result['Auction_results_right'].fillna(0)

# Rate of change (%) from the highest to the lowest minimum sales price
last_price = auction_result['Auction_key'].map(result.groupby('Auction_key')['Minimum_sales_price'].min())
first_price = auction_result['Auction_key'].map(result.groupby('Auction_key')['Minimum_sales_price'].max())
auction_result['change_rate'] = (last_price - first_price) / first_price * 100

# Merge with train and test
train_result = train.merge(auction_result, on = 'Auction_key', how = 'left')
test_result = test.merge(auction_result, on = 'Auction_key', how = 'left')
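When per-key aggregates are attached to another table, aligning on Auction_key rather than on row position keeps each value with the right auction even if the two tables are ordered differently. The sketch below shows one possible way to do this with a single `groupby(...).agg(...)` followed by a merge. The data is made up, the keys are deliberately unsorted, `'other'` is a placeholder result label, and names such as `per_key` and `failed_count` are illustrative rather than taken from the original code.

```python
import pandas as pd

# Toy auction history (made-up values); note that the keys are not sorted.
history = pd.DataFrame({
    'Auction_key': [20, 20, 10, 10, 10],
    'Auction_seq': [1, 2, 1, 2, 3],
    'Minimum_sales_price': [500, 400, 1000, 800, 640],
    'Auction_results': ['유찰', 'other', '유찰', '유찰', 'other'],
})

# One row per key, with all aggregates computed in a single pass.
per_key = history.groupby('Auction_key').agg(
    sales_min=('Minimum_sales_price', 'min'),
    sales_max=('Minimum_sales_price', 'max'),
    max_seq=('Auction_seq', 'max'),
    failed_count=('Auction_results', lambda s: (s == '유찰').sum()),
).reset_index()
per_key['change_rate'] = (per_key['sales_min'] - per_key['sales_max']) / per_key['sales_max'] * 100

# Last recorded result per key, in order of appearance (key 20 comes first).
last_result = history.drop_duplicates(subset='Auction_key', keep='last')[['Auction_key', 'Auction_results']]

# Merging on the key keeps every aggregate attached to the right auction.
features = last_result.merge(per_key, on='Auction_key', how='left')
print(features)
```

Here `last_result` lists key 20 before key 10 while the grouped table is sorted by key, which is exactly the situation where a positional assignment would silently swap values.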

5. One-hot encoding the training (train_result) data

from sklearn.preprocessing import OneHotEncoder

# sparse_output requires scikit-learn 1.2 or later
ohe = OneHotEncoder(sparse_output=False, handle_unknown = 'ignore')
onehot_train = ohe.fit_transform(train_result[['Bid_class']])
onehot_frame = pd.DataFrame(onehot_train, columns = ohe.categories_[0], index = train_result.index)
train_result = pd.concat([train_result, onehot_frame], axis = 1)

# Drop the original column and one of the dummy columns ('일괄'), which the remaining dummies already imply
train_result = train_result.drop(['Bid_class', '일괄'], axis = 1)

6. One-hot encoding the test (test_result) data

# Reuse the encoder fitted on train; categories unseen in train become all-zero rows
test_onehot = ohe.transform(test_result[['Bid_class']])
onehot_frame = pd.DataFrame(test_onehot, columns = ohe.categories_[0], index = test_result.index)
test_result = pd.concat([test_result, onehot_frame], axis = 1)

test_result = test_result.drop(['Bid_class', '일괄'], axis = 1)

7. Label encoding

from sklearn.preprocessing import LabelEncoder
import numpy as np

label_col = ['Auction_class', 'addr_do', 'addr_san', 'Share_auction_YorN', 'Auction_results_left', 'Apartment_usage']

for col in label_col:
    le = LabelEncoder()
    train_result[col] = le.fit_transform(train_result[col])

    # Append labels that appear only in the test set so that transform() does not raise on them
    for label in np.unique(test_result[col]):
        if label not in le.classes_:
            le.classes_ = np.append(le.classes_, label)
    test_result[col] = le.transform(test_result[col])
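An alternative to extending `classes_` by hand is scikit-learn's `OrdinalEncoder`, which can map categories it has not seen during fitting to a fixed code. This is a swapped-in technique, not what the code above does. The sketch uses made-up placeholder categories ('A', 'B', 'C') and assumes scikit-learn 0.24 or later, where `handle_unknown='use_encoded_value'` is available.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy columns with placeholder categories; 'C' never appears in training.
toy_train = pd.DataFrame({'addr_do': ['B', 'A', 'B']})
toy_test = pd.DataFrame({'addr_do': ['A', 'C']})

# Unseen categories are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = enc.fit_transform(toy_train[['addr_do']]).ravel()
test_codes = enc.transform(toy_test[['addr_do']]).ravel()

print(train_codes)
print(test_codes)
```

Giving every unseen category the same code makes it explicit to the model that the value was not present in training, whereas appended labels each receive a new integer the model has never been trained on.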

8. train_test_split

from sklearn.model_selection import train_test_split

train_x = train_result.drop(['Hammer_price', 'Auction_key'], axis = 1)
train_y = train_result['Hammer_price']

x_train, x_valid, y_train, y_valid = train_test_split(train_x, train_y, test_size=0.3, random_state=42)

9. Training XGBoost

from xgboost import XGBRegressor

xgb = XGBRegressor(random_state = 42)
xgb.fit(x_train, y_train)
valid_pred = xgb.predict(x_valid)

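10. Evaluating XGBoost

from sklearn.metrics import mean_squared_error

# RMSE on the validation set
# (np.sqrt is used because the squared argument is deprecated in recent scikit-learn releases)
np.sqrt(mean_squared_error(y_valid, valid_pred))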

436683542.07750994
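The value above is a root mean squared error (RMSE), so it is in the same unit as Hammer_price. As a reminder of what the metric computes, here is a minimal NumPy sketch on made-up prices. The helper name `rmse` is an assumption for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up hammer prices and predictions.
y_true = [300_000_000, 500_000_000, 800_000_000]
y_pred = [310_000_000, 470_000_000, 840_000_000]
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a few very expensive apartments with large misses can dominate the score.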

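11. XGBoost grid search

from sklearn.model_selection import GridSearchCV

xgb_grid = XGBRegressor(random_state = 42)

# Candidate values: 2 x 2 = 4 combinations
params = {'n_estimators': [15, 30],
          'max_depth': [3, 8]}

# 2-fold cross-validation; with no scoring argument, candidates are ranked by the estimator's default score (R^2)
greedy_CV = GridSearchCV(xgb_grid, param_grid=params, cv = 2, n_jobs = -1)
greedy_CV.fit(x_train, y_train)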

12. Evaluating the grid-searched XGBoost

# Best model found by the grid search, refit on all of x_train
xgb_best_model = greedy_CV.best_estimator_
valid_pred = xgb_best_model.predict(x_valid)
np.sqrt(mean_squared_error(y_valid, valid_pred))

437016248.3772861
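The grid-searched RMSE (437016248.38) is slightly higher than the default model's (436683542.08). Two things may contribute, though neither is confirmed here: the grid does not include the XGBRegressor defaults (n_estimators=100, max_depth=6), and GridSearchCV ranks candidates by R^2 unless a scoring argument is given. The sketch below shows how the search could be ranked by RMSE instead. It runs on synthetic data and uses scikit-learn's GradientBoostingRegressor as a stand-in so that it runs without XGBoost; the grid values are illustrative assumptions, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (stand-in for the auction features).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=42)

# Illustrative grid that also covers larger values of n_estimators.
param_grid = {'n_estimators': [30, 100], 'max_depth': [3, 6]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',  # rank candidates by RMSE
    cv=3,
)
search.fit(X_tr, y_tr)

# scikit-learn maximizes scores, so the RMSE is stored with a negative sign.
best_cv_rmse = -search.best_score_
valid_rmse = float(np.sqrt(np.mean((y_va - search.predict(X_va)) ** 2)))
print(search.best_params_, best_cv_rmse, valid_rmse)
```

With the scoring set this way, the cross-validation score and the validation score are measured on the same metric, which makes it easier to compare the tuned model with the default one.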
