[e-commerce] final-project

어떤 데이터를 분석하나?

https://www.kaggle.com/datasets/bytadit/transactional-ecommerce/data

E-commerce App Transactional Dataset

Cuctomer360 transactional dataset. For study only please, not for reserach use!

www.kaggle.com

테이블 구성
- customer
- click_stream
- transactions
- product
Customer Table (고객 정보):
- 고객의 고유 ID, 이름, 성별, 이메일, 기기 정보, 위치 정보, 앱 가입 날짜를 포함.
Product Table (제품 정보):
- 패션 제품의 카테고리, 색상, 사용 목적, 시즌별 타겟 등 제품 세부 정보를 포함.
Transaction Table (거래 정보):
- 고객이 수행한 거래 기록으로, 제품 메타데이터, 결제 방법, 배송 정보 등을 포함.
Click Stream Table (사용자 활동 기록):
- 앱 사용 중 발생한 이벤트(활동) 기록으로 세션 ID, 이벤트 이름 및 메타데이터를 포함.

분석 프로세스

Process diagram: Conversation Ratio Development

프로젝트 목표 설정

https://www.notion.so/teamsparta/1732dc3ef5148197bc3ce24f35ec9823?pvs=25

데이터 분석 프로젝트 기획서 | Notion

Made with Notion, the all-in-one connected workspace with publishing capabilities.

teamsparta.notion.site

- [제목]
  - 이커머스 플랫폼 거래 전환율 상승 프로젝트
- [주제]
  - 고객 관리 방안 수립
  - 이탈 방지 전략 수립
- [목표]
  - 초개인화 CRM 고객 관리 설계
- [문제]
  - 장바구니 -> 구매 전환율이 낮다(현재 44.01%)
  - 거래량이 줄었다(2021년 -> 2022년 6만 건 감소)

프로젝트 일정 계획

1주차 : 결과물 초안 작성(ppt 스크립트 & 대시보드 스케치)
2주차 : EDA 및 시각화 총정리 // Key Questions(문제 정의 및 인사이트) 정리
3주차 : 분석(통계, 가설검증, 머신러닝, 대시보드 설계)
4주차 : 비즈니스 성과 측정(예측 및 검증)
5주차 : 결과물 수정 보완
6주차 : 발표

PPT 흐름 초안 작성

page	질문(목차)		대답(내용)
1	표지
2	프로젝트 개요	[Background] 어떤 기업?	우리 기업을 소개합니다 [주요 사업] - 온라인 쇼핑 플랫폼 - 플랫폼 이름: 1) BeliLah - 말레이시아어로 'Beli'는 '구매하다' + 'Lah'는 친근한 강조어, 고객 맞춤형 쇼핑 플랫폼 느낌? 2) Shopendra - 'Shop'과 'Endra(인도네시아에서 흔한 이름 접미사)'
3		[5W1H]	- What? - Who? - Where? - When? - How?
4		[As-Is - To-Be] 현재 상황은? 현재 문제점은? 뭘 할 수 있어?	- As-Is(현재 상황): 장바구니 -> 결제 전환율 44%(목표 수치보다 낮은 상태) - To-Be(우리의 과업): 분석 방법 등 요약
5		[Objective] 이 프로젝트의 목표는?	- 장바구니 전환율 60% 목표 - 이 프로젝트의 필요성
6		[Limitations] 프로젝트의 한계점은? 그에 대한 대응책은?
7		[Process] 전반적인 프로젝트 흐름 요약	- 분석 프로세스 다이어그램
8	EDA	[Dataset Overview] 어떤 데이터를 사용하는가?	- 익명의 패션 산업 데이터(모바일 사용자) / 무신사와 비슷하다고 생각
9		[Analysis Metrics] 성과를 측정할 주요 지표는?	- 전환율(conversation ratio) - 이탈률(churn rate) - 잔존율(retension) - CLV(customer lifetime value)
10		[Takeaways] 그래서 분석의 핵심이 뭐야?	- 핵심 1: 전환율이 낮은 집단(주요 관리가 필요한 군집)은 어디? - RFM 고객 segments 및 Funnel 분석을 통해 집중 관리 집단 데이터로 확인 - sankey 차트 시각화
11		[Takeaways] 그래서 분석의 핵심이 뭐야?	- 핵심 2: 거래량이 떨어진 집단은 어디? - 지역 및 상품 segments 및 거래량 감소 집단 데이터로 확인 - Ex) 2021년 1월과 2022년 1월의 상품 비교 - 라인 차트 or 지도 or 상품 시각화
12		[Assumtions] 설정한 가설은? 전환율이 떨어진 원인 분석?	- 문제의 원인 가정 - 대표 그래프 시각화
13	분석 및 검증	[Solution] 전환율 어떻게 높여?	- 데이터 전처리(핵심 독립 변수 및 종속 변수 설정) - 핵심 그래프 하나 - 결과 요약
14		[Solution] 이탈 고객 어떻게 잡아?	- 데이터 전처리(핵심 독립 변수 및 종속 변수 설정) - 핵심 그래프 하나 - 결과 요약
15		[Solution] CLV 어떻게 높여?	- 데이터 전처리(핵심 독립 변수 및 종속 변수 설정) - 핵심 그래프 하나 - 결과 요약
16		[Management] 어떻게 지속 관리하나?	- 잘 만든 대시보드 하나
17	인사이트	분석적 해석
18	인사이트	비즈니스적 해석
19	Action Plan	결론

Table of Contents in the notebook

* [Analysis. 이탈 고객을 잡으면 얼마나 이익을 가져올까](#chapter1)
    * [이탈 고객 정의](#section_1_1)
    * [지난 60일 수익(AOV, LTV)](#section_1_2)
    * [이탈 고객이 매출에 미치는 영향](#section_1_3)
    * [이탈 가능성 고객 정의](#section_1_4)
    * [이탈 관련 핵심 feature 선정](#section_1_5)
    * [이탈 고객 예측](#section_1_6)
    * [프로모션과 이탈 고객과의 연관성 혹은 프로모션이 기업에 주는 이익](#section_1_7)
    * [CRM 관리 방안](#section_1_8)

총 고객 1년간 AOV 및 CLTV

# 총 고객 1년간 CLTV

def current(value):
    return f'{value:,.0f} 루피아'


df = merged_df.copy()

present_day = df['transaction_date'].max() + dt.timedelta(days = 2)
yearago = present_day - pd.DateOffset(years = 1)
mask = df['transaction_date'] >= yearago
df = df[mask]

grouped = df.groupby('customer_id').agg(
    # total_revenue = ('amount', 'sum'), # 토탈 거래 금액
    total_transaction = ('session_id', 'nunique'), # 거래 횟수
    lifespan = ('first_join_date', lambda x: (present_day - x.max()).days // 365 if not x.isnull().all() else None) # 가입 기간
).reset_index()

# CLTV 계산
aov = (df['amount'].sum() / len(df)).round(2) # Step 1: 평균 구매 금액(aov)
purchase_frequency = grouped['total_transaction'].mean() # Step 2: 평균 구매 빈도율
customer_lifespan_years = grouped['lifespan'].mean() # Step 3: 고객 생애 기간
cltv = aov * purchase_frequency * customer_lifespan_years # Step 4: CLTV 계산

# 결과 출력
print(f"평균 구매 금액(AOV): {current(aov)}")
print(f"평균 구매 빈도율: {purchase_frequency:.2f}회")
print(f"고객 생애 기간: {customer_lifespan_years:.2f}년")
print(f"CLTV: {current(cltv)}")

>>>
평균 구매 금액(AOV): 374,021 루피아
평균 구매 빈도율: 8.25회
고객 생애 기간: 1.94년
CLTV: 5,977,484 루피아

rfm 세그먼트

# rfm_df 함수 생성
# 지난 1년간 기록을 바탕으로 점수 산정

# rfm dataframe 생성
def rfm(df):

    present_day = df['transaction_date'].max() + dt.timedelta(days = 2)  # Timestamp('2022-08-02 00:00:00')

    rfm = df.groupby('customer_id').agg({'transaction_date': lambda x: (present_day - x.max()).days,
                                        'session_id': lambda x: x.nunique(),
                                        'amount': lambda x: x.sum(),
                                        'first_join_date': lambda x: (present_day - x.max()).days})

    rfm.columns = ['recency', 'frequency', 'monetary', 'join_period']
    rfm = rfm.reset_index()

    return rfm

# rfm score 
def get_rfm_scores(df) -> pd.core.frame.DataFrame:

    # recency 구분
    recency_quantiles = [0.0, 0.25, 0.45, 0.60, 0.75, 1.0]
    recency_bins = df['recency'].quantile(recency_quantiles).tolist()
    df["recency_score"] = pd.cut(df["recency"], bins = recency_bins, labels=[5, 4, 3, 2, 1], include_lowest = True)
    
    # frequency 구분
    df["frequency_score"] = pd.qcut(
        df["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]
    )
    
    # monetary 구분
    df["monetary_score"] = pd.qcut(df["monetary"], 5, labels=[1, 2, 3, 4, 5])
    
    # 가입 기간 구분
    df["period_score"] = pd.cut(df["join_period"], 
                                bins=[0, 90, 180, 270, 360, float('inf')],
                                labels=[5, 4, 3, 2, 1],
                                right = True)
    
    df['RF_SCORE'] = df["recency_score"].astype(str) + df["frequency_score"].astype(str)
    df["RFM_SCORE"] = df["recency_score"].astype(str) + df["frequency_score"].astype(str) + \
        df["monetary_score"].astype(str)

    return df

# rfm segmentation
def get_segment(df):
    seg_map = {
    r'[1-2][1-2]': 'hibernating', # Hibernating: Recency와 Frequency가 모두 낮은 고객 (오랫동안 거래가 없고 활동이 적음)
    r'[1-2][3-4]': 'at_Risk', # At Risk: Recency가 낮고 Frequency가 중간 수준인 고객 (과거에는 활발했으나 최근 거래가 줄어든 고객)
    r'[1-2]5': 'cant_loose', # Cannot Lose Them: Recency가 낮고 Frequency가 높아 중요한 고객 (이탈 방지가 필요한 고객)
    r'3[1-2]': 'about_to_sleep', # About To Sleep: Recency는 중간 수준이지만 Frequency가 낮은 고객 (거래가 줄어들 가능성이 있는 고객)
    r'33': 'need_attention', # Need Attention: Recency와 Frequency가 모두 중간 수준인 고객 (추가적인 관심이 필요한 고객)
    r'[3-4][4-5]': 'loyal_customers', # Loyal Customers: Recency와 Frequency가 모두 높은 고객 (충성도가 높은 고객)
    r'41|51': 'promising', # Promising: Recency가 높지만 Frequency는 낮은 고객 (잠재적으로 성장 가능성이 있는 신규 또는 초기 고객)
    r'[4-5][2-3]': 'potential_loyalists', # Potential Loyalists: Recency와 Frequency가 중간에서 높은 수준인 고객 (충성도가 높아질 가능성이 있는 고객)
    r'5[4-5]': 'champions' # Champions: Recency와 Frequency가 모두 최고 수준인 VIP 고객 (가장 가치 있는 고객)
}
      
    df['segment'] = df['RF_SCORE'].replace(seg_map, regex = True)
    df.loc[df['period_score'] == 5, 'segment'] = 'new_customers' # New Customers: 가입 기간 점수가 5점인 고객
    
    return df
    
    
 >>>
rfm_df = rfm(merged_df)
rfm_df = get_rfm_scores(rfm_df)
rfm_df = get_segment(rfm_df)
rfm_df

평균 구매주기 계산 및 시각화

# 평균 구매 주기 계산
group = merged_df.groupby('customer_id')

def purchase_cycle(group):
    if len(group) > 1:
        return group['transaction_date'].diff().mean().days
    else:
        return None
    
df_avg_cycle = group.apply(lambda group: purchase_cycle(group)).reset_index(name = 'avg_pur_cycle').fillna(0)


# 데이터 준비
df = df_avg_cycle[df_avg_cycle['avg_pur_cycle'] != 0].copy()

# 평균 계산
mean_value = df['avg_pur_cycle'].mean()

# Seaborn 스타일 설정
sns.set_theme(style="whitegrid")  # 배경 스타일 설정
plt.figure(figsize=(10, 6))  # 그래프 크기 설정

# 히스토그램 생성 (KDE 제외)
sns.histplot(
    df['avg_pur_cycle'],  # 데이터
    color="#F0EDCC",  # 히스토그램 색상
    bins=30,  # 구간 개수 설정
    edgecolor="black",  # 막대 테두리 색상
    stat="density",  # 밀도로 정규화
    kde=False  # KDE 제외
)

# KDE 플롯 추가 (별도 생성)
sns.kdeplot(
    df['avg_pur_cycle'],  # 데이터
    color="#02343F",  # KDE 선 색상
    linewidth=2  # 선 두께 설정
)

# 평균선 추가
plt.axvline(mean_value, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_value:.2f}')

# 그래프 제목 및 축 레이블 추가
plt.title("Average Purchase Cycle Distribution", fontsize=16, fontweight="bold")  # 제목 설정
plt.xlabel("Average Purchase Cycle (days)", fontsize=12)  # x축 레이블 설정
plt.ylabel("Density", fontsize=12)  # y축 레이블 설정

# 축 눈금 스타일 조정
plt.xticks(fontsize=10)  # x축 눈금 폰트 크기 조정
plt.yticks(fontsize=10)  # y축 눈금 폰트 크기 조정

# 그래프 테두리 제거
sns.despine()  # 상단과 오른쪽 테두리 제거

# 범례 추가
plt.legend(fontsize=12)

# 그래프 표시
plt.show()

Customer_id 기준 변수 생성

# customer_id 기준 파생 변수 생성

def get_customer_info(merged_df, customer_df, click_stream_df):

    # 평균 구매 주기 계산
    group = merged_df.groupby('customer_id')

    def purchase_cycle(group):
        if len(group) > 1:
            return group['transaction_date'].diff().mean().days
        else:
            return None
        
    df_avg_cycle = group.apply(lambda group: purchase_cycle(group)).reset_index(name = 'avg_pur_cycle').fillna(0)

    # 고객별 AOV 계산
    df_aov = group['amount'].mean().reset_index(name = 'aov').round(2)

    # 고객별 전환율(total -> booking)
    data = merged_df[['session_id', 'customer_id']]
    df = pd.merge(click_stream_df, data, on='session_id')
    total_session = df.groupby('customer_id')['session_id'].count()
    booking_session = df[df['event_name'] == 'BOOKING'].groupby('customer_id')['session_id'].count()
    conversion_rate = (booking_session / total_session * 100).fillna(0).round(2).reset_index(name = 'conversion_rate')

    # rfm_df
    rfm_df2 = rfm_df[['customer_id', 'recency', 'frequency', 'join_period']]

    # session별 행동 분석
    data = merged_df[['session_id', 'customer_id', 'amount']]
    session_pivot = pd.pivot_table(data = click_stream_df, index='session_id', columns='event_name', values='datetime', aggfunc='count').reset_index().fillna(0)
    session_pivot = session_pivot.merge(data, how='inner', on='session_id')
    session_df = session_pivot.groupby('customer_id').agg\
        (
        CLICK= ('CLICK', 'sum'),
        SCROLL= ('SCROLL', 'sum'),
        SEARCH= ('SEARCH', 'sum'),
        ADD_TO_CART = ('ADD_TO_CART', 'sum'),
        ADD_PROMO = ('ADD_PROMO', 'sum'),
        ITEM_DETAIL = ('ITEM_DETAIL', 'sum'),
        BOOKING = ('BOOKING', 'sum'),
        amount= ('amount', 'sum')
        ).reset_index()
        
    # 고객 특성(성별, 나이, 지역)
    current_date = datetime.now()
    customer = customer_df[['customer_id', 'gender_encoding', 'birthdate', 'home_location_lat', 'home_location_long']].copy()
    customer.loc[:, 'age'] = pd.to_datetime(customer['birthdate']).apply(
        lambda x: current_date.year - x.year - ((current_date.month, current_date.day) < (x.month, x.day))
    )
    customer.drop(columns=['birthdate'], inplace=True)

    # RFM 고객 세그먼트
    seg = rfm_df[['customer_id', 'segment']].copy()

    # 이탈 유저 여부
    df = df_avg_cycle[df_avg_cycle['avg_pur_cycle'] != 0].copy()
    Q1 = df['avg_pur_cycle'].quantile(0.25)  # IQR 계산
    Q3 = df['avg_pur_cycle'].quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR # 상한선 계산 (이상치 기준)
    print(f"IQR 상한선: {upper_bound}일")
    last_pur = merged_df.groupby('customer_id')[['transaction_date']].max().reset_index()
    last_pur.columns = ['customer_id', 'last_purchase']
    present_day = merged_df['transaction_date'].max() + dt.timedelta(days = 2)  # Timestamp('2022-08-02 00:00:00')
    churn_threshold = dt.timedelta(days = upper_bound) # 현재 데이터 상으로는 141일
    last_pur['is_churned'] = (present_day - last_pur['last_purchase']) > churn_threshold
    last_pur.drop(columns='last_purchase', inplace=True)

    # 결과 데이터프레임 생성
    customer_info = pd.merge(df_avg_cycle, df_aov, on='customer_id')
    customer_info = customer_info.merge(rfm_df2, on='customer_id')
    customer_info = customer_info.merge(session_df, on='customer_id')
    customer_info = customer_info.merge(customer, on='customer_id')
    customer_info = customer_info.merge(seg, on='customer_id')
    customer_info = customer_info.merge(last_pur, on='customer_id')
    customer_info = customer_info.merge(conversion_rate, on='customer_id')
    
    return customer_info

customer_info = get_customer_info(merged_df, customer_df, click_stream_df)
customer_info

성별 및 이탈 여부 파이 차트 생성

# 성별 및 이탈 여부 파이 차트 생성

# 데이터프레임 복사
df = customer_info.copy()

# 레이블 설정
g_labels = ['Female', 'Male']
c_labels = ['No', 'Yes']

# 서브플롯 생성: 'domain' 타입을 사용하여 파이 차트 추가
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])

# # 성별 파이 차트 추가
fig.add_trace(go.Pie(labels=g_labels, 
                     values=df['gender_encoding'].value_counts(), 
                     name="Gender",
                     marker=dict(colors=['#F0EDCC', '#02343F'])),  # 색상 변경
              1, 1)

# 이탈 여부 파이 차트 추가
fig.add_trace(go.Pie(labels=c_labels, 
                     values=df['is_churned'].value_counts(), 
                     name="Churn",
                     marker=dict(colors=['#F0EDCC', '#02343F'])),  # 색상 변경
              1, 2)

# 도넛 형태로 변경 및 정보 표시 설정
fig.update_traces(hole=.4, hoverinfo="label+percent+name", textfont_size=16)

# 레이아웃 업데이트: 제목 및 중앙 텍스트 추가
fig.update_layout(
    title_text="Gender and Churn Distributions",
    annotations=[dict(text='Gender', x=0.16, y=0.5, font_size=20, showarrow=False),
                 dict(text='Churn', x=0.84, y=0.5, font_size=20, showarrow=False)]
)

# 그래프 출력
fig.show()

고객 군집별 이탈 고객 수

# 고객 군집별 이탈 고객 수
df = customer_info.copy()
fig = px.histogram(df, x="is_churned", color="segment", barmode="group", title="<b>Customer segment distribution<b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

코호트 잔존율 확인

# 최근 1년 이내 고객 필터링 후 코호트 시각화

def CohortAnalysis(dataframe):
    
   
    data = dataframe.copy()
    data = data[["customer_id", "session_id", "transaction_date"]].drop_duplicates()
    data["order_month"] = data["transaction_date"].dt.to_period("M")
    data["cohort"] = (
        data.groupby("customer_id")["transaction_date"].transform("min").dt.to_period("M")
    )
    cohort_data = (
        data.groupby(["cohort", "order_month"])
        .agg(n_customers=("customer_id", "nunique"))
        .reset_index(drop=False)
    )
    cohort_data["period_number"] = (cohort_data.order_month - cohort_data.cohort).apply(
        attrgetter("n")
    )
    cohort_pivot = cohort_data.pivot_table(
        index="cohort", columns="period_number", values="n_customers"
    )
    cohort_size = cohort_pivot.iloc[:, 0]
    retention_matrix = cohort_pivot.divide(cohort_size, axis=0)
    with sns.axes_style("white"):
        fig, ax = plt.subplots(
            1, 2, figsize=(12, 8), sharey=True, gridspec_kw={"width_ratios": [1, 11]}
        )
        sns.heatmap(
            retention_matrix,
            mask=retention_matrix.isnull(),
            annot=True,
            cbar=False,
            fmt=".0%",
            cmap="coolwarm",
            ax=ax[1],
        )
        ax[1].set_title("Monthly Cohorts: User Retention", fontsize=14)
        ax[1].set(xlabel="# of periods", ylabel="")
        white_cmap = mcolors.ListedColormap(["white"])
        sns.heatmap(
            pd.DataFrame(cohort_size).rename(columns={0: "cohort_size"}),
            annot=True,
            cbar=False,
            fmt="g",
            cmap=white_cmap,
            ax=ax[0],
        )
        fig.tight_layout()
    
CohortAnalysis(merged_df)

이탈 고객을 잡으면 우리에게 얼마나 이익을 가져올까?

EDA 결과 고객의 평균 거래 주기는 46일
이탈 고객 정의: iqr 상한선 기준 141일 이상 거래하지 않은 고객으로 정의

# 평균 구매 주기 계산
group = merged_df.groupby('customer_id')

def purchase_cycle(group):
    if len(group) > 1:
        return group['transaction_date'].diff().mean().days
    else:
        return None
    
df_avg_cycle = group.apply(lambda group: purchase_cycle(group)).reset_index(name = 'avg_pur_cycle').fillna(0)

# 이탈 유저 여부
df = df_avg_cycle[df_avg_cycle['avg_pur_cycle'] != 0].copy()
Q1 = df['avg_pur_cycle'].quantile(0.25)  # IQR 계산
Q3 = df['avg_pur_cycle'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR # 상한선 계산 (이상치 기준)
print(f"IQR 상한선: {upper_bound}일")
last_pur = merged_df.groupby('customer_id')[['transaction_date']].max().reset_index()
last_pur.columns = ['customer_id', 'last_purchase']
present_day = merged_df['transaction_date'].max() + dt.timedelta(days = 2)  # Timestamp('2022-08-02 00:00:00')
churn_threshold = dt.timedelta(days = upper_bound) # 현재 데이터 상으로는 141일
last_pur['is_churned'] = (present_day - last_pur['last_purchase']) > churn_threshold

print('이탈/활성화 유저 수: True - 이탈, False - 활성 \n', last_pur['is_churned'].value_counts())
print('*'*60)
print('이탈/활성화 유저 비율:\n', (last_pur['is_churned'].value_counts() / len(last_pur)) * 100)

>>>
IQR 상한선: 141.5일
이탈/활성화 유저 수: True - 이탈, False - 활성 
 False    28133
True     10041
Name: is_churned, dtype: int64
************************************************************
이탈/활성화 유저 비율:
 False    73.696757
True     26.303243
Name: is_churned, dtype: float64

지난 매출 vs 과거 활성 고객 수익 비교(AOV & LTV)

단순 매출 비교
82일 이전 이탈 고객 특성은?
이탈 고객의 AOV와 CLTV는?

# 2022년 8월 기준(현재): 지난 141일 매출

churn_days = upper_bound

present_day = merged_df['transaction_date'].max() + dt.timedelta(days = 2)  # Timestamp('2022-08-02 00:00:00')
last_churn_days = present_day - dt.timedelta(days = churn_days)
last_churn2_days = present_day - dt.timedelta(days = (churn_days * 2))

mask1 = merged_df['transaction_date'] >= last_churn_days
mask2 = (merged_df['transaction_date'] >= last_churn2_days) & (merged_df['transaction_date'] < last_churn_days)
last_churn_revenue = merged_df[mask1]['amount'].sum()
last_churn2_revenue = merged_df[mask2]['amount'].sum()
difference = last_churn_revenue - last_churn2_revenue

def current(value):
    return f'{value:,.0f} 루피아'

print(f'지난 {churn_days}일 매출:', current(last_churn_revenue)) 
print(f'지난 {churn_days*2}~{churn_days}일 사이 매출:', current(last_churn2_revenue))
print('차이:', current(difference)) 

categories = [f'Last {churn_days} Days', f'{churn_days*2}~{churn_days} Days Ago', 'Revenue Difference']
revenues = [last_churn_revenue, last_churn2_revenue, difference]
colors = ['#F0EDCC', '#02343F', '#F28D35']  # Color palette

plt.figure(figsize=(10, 6))
bars = plt.bar(categories, revenues, color=colors, alpha=0.8, edgecolor='black')

for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2.0, height / 2, f'{height:,.0f}',
             ha='center', va='center', fontsize=14, color='black', fontweight='bold')

plt.ylabel('Revenue', fontsize=12)
plt.xlabel('Transaction Date', fontsize=12)
plt.tight_layout()
plt.show()

>>>
지난 141.5일 매출: 71,482,646,044 루피아
지난 283.0~141.5일 사이 매출: 67,367,969,897 루피아
차이: 4,114,676,147 루피아

# 지난 141일 전에는 활성 고객이었는데, 최근 이탈한 고객은?

# 이탈 유저 여부
last_churn_days = present_day - dt.timedelta(days = churn_days)
last_pur['is_churned_last'] = (last_churn_days - last_pur['last_purchase']) > churn_threshold

# 'is_churned_last'가 False이고 'is_churned'가 True인 경우 계산
changed_to_churned = ((last_pur['is_churned'] == True) & (last_pur['is_churned_last'] == False)).sum()
print(f"{churn_days}일 기준 이탈 고객으로 바뀐 경우의 수: {changed_to_churned}명")

>>>
141.5일 기준 이탈 고객으로 바뀐 경우의 수: 7568명

# 7568명 지난 거래 금액

mask = ((last_pur['is_churned'] == True) & (last_pur['is_churned_last'] == False))
churn_customer = last_pur[mask]['customer_id']
mask2 = (merged_df['transaction_date'] >= last_churn2_days) & (merged_df['transaction_date'] < last_churn_days)
df = merged_df[mask2]
revenue = df[df['customer_id'].isin(churn_customer)]['amount'].sum()
print(f'이탈 고객의 지난 {churn_days*2}~{churn_days}일 사이 매출: ', current(revenue))

>>>
이탈 고객의 지난 283.0~141.5일 사이 매출:  4,295,338,692 루피아

# 7568명 AOV
 
mask = ((last_pur['is_churned'] == True) & (last_pur['is_churned_last'] == False))
churn_customer = last_pur[mask]['customer_id']
df = merged_df[merged_df['customer_id'].isin(churn_customer)]
churn_aov = df.groupby('customer_id')['amount'].mean().reset_index(name = 'aov').round(2)
print('이탈 고객 AOV 평균:', current(churn_aov['aov'].mean()))

>>>
이탈 고객 AOV 평균: 378,358 루피아

# 7568명 CLTV

df = merged_df.copy()
df = df[df['customer_id'].isin(churn_customer)]
grouped = df.groupby('customer_id').agg(
    total_revenue = ('amount', 'sum'), # 토탈 거래 금액
    total_transaction = ('session_id', 'nunique'), # 거래 횟수
    lifespan = ('first_join_date', lambda x: (present_day - x.max()).days // 365 if not x.isnull().all() else None) # 가입 기간
).reset_index()

# CLTV 계산
aov = churn_aov['aov'].mean()
average_purchase_value = grouped['total_revenue'].mean() # Step 1: 평균 구매 금액
purchase_frequency = grouped['total_transaction'].mean() # Step 2: 평균 구매 빈도율
customer_lifespan_years = grouped['lifespan'].mean() # Step 3: 고객 생애 기간
cltv = aov * purchase_frequency * customer_lifespan_years # Step 4: CLTV 계산

# 결과 출력
print(f"평균 구매 금액: {current(aov)}")
print(f"평균 구매 빈도율: {purchase_frequency:.2f}회")
print(f"고객 생애 기간: {customer_lifespan_years:.2f}년")
print(f"CLTV: {current(cltv)}")

>>>
평균 구매 금액: 737,707 루피아
평균 구매 빈도율: 1.35회
고객 생애 기간: 2.20년
CLTV: 2,191,972 루피아

이탈 고객이 매출에 미치는 영향

- 이탈 고객과 활성 고객 매출 t-test 비교
- 이탈 고객의 매출 변화 추세 시각화
- 이탈 여부가 매출에 미치는 영향 회귀분석
- 이탈 여부와 amount 간의 상관관계 계산(피어슨 상관 계수, p-value 확인하여 두 변수가 관계가 있음을 평가)

통계 검정

# 이탈 고객과 활성 고객 매출 비교

# 이탈/활성 매출 분리
churn_revenue = result[result['is_churned'] == True]['amount'].sum()
non_churn_revenue = result[result['is_churned'] == False]['amount'].sum()

# 전체 매출 대비 이탈 고객 매출 비율
churn_ratio = churn_revenue / (churn_revenue + non_churn_revenue)

print(f"이탈 고객 매출: {current(churn_revenue)}")
print(f"비이탈 고객 매출: {current(non_churn_revenue)}")
print(f"전체 매출 중 이탈 고객 매출 비율: {churn_ratio:.2%}")

>>>
이탈 고객 매출: 47,442,834,676 루피아
비이탈 고객 매출: 296,066,656,496 루피아
전체 매출 중 이탈 고객 매출 비율: 13.81%

# 왜도 첨도 계산

from scipy.stats import skew, kurtosis

data1 = result[result['is_churned'] == True]['amount']
data2 = result[result['is_churned'] == False]['amount']

# 데이터의 왜도와 첨도 계산
skewness1 = skew(data1)
kurt1 = kurtosis(data1)
skewness2 = skew(data2)
kurt2 = kurtosis(data2)

print(f"Skewness_True: {skewness1}")
print(f"Kurtosis_True: {kurt1}")
print(f"Skewness_False: {skewness2}")
print(f"Kurtosis_False: {kurt2}")

>>>
Skewness_True: 15.760818356824009
Kurtosis_True: 383.8635853163137
Skewness_False: 2.582421136704925
Kurtosis_False: 10.925171810245763

# 로그 변환

log_transformed_data1 = np.log(data1 + 1)
log_transformed_data2 = np.log(data2 + 1)

# 시각화
plt.figure(figsize=(10, 6))
sns.histplot(log_transformed_data1, kde=True, bins=30, color='green')
plt.title('Log-Transformed Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# 시각화
plt.figure(figsize=(10, 6))
sns.histplot(log_transformed_data2, kde=True, bins=30, color='green')
plt.title('Log-Transformed Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# 이탈 고객과 활성 고객 통계 비교

# shaipro-walk 정규성 확인
def get_shapiro(segment1, segment2):
    
    from scipy.stats import shapiro

    segment_data1 = result[result['is_churned'] == segment1]['amount']
    segment_data2 = result[result['is_churned'] == segment2]['amount']
    
    segment_data1 = np.log(segment_data1 + 1) # 왜도 심한 경우 로그 변환
    segment_data2 = np.log(segment_data2 + 1) # 왜도 심한 경우 로그 변환

    shapiro1 = shapiro(segment_data1)
    shapiro2 = shapiro(segment_data2)

    print(f"{segment1}: statistic={shapiro1.statistic}, p-value={shapiro1.pvalue}")
    print(f"{segment2}: statistic={shapiro2.statistic}, p-value={shapiro2.pvalue}")

    if shapiro1.pvalue > 0.05 and shapiro2.pvalue > 0.05:
        print("두 그룹 모두 정규성을 만족합니다.")
    else:
        print("정규성을 만족하지 않는 그룹이 있습니다.")
        
    return segment_data1, segment_data2

# 정규성 시각화 확인
def get_normal(segment1, segment2):
    from scipy.stats import probplot

    fig, ax = plt.subplots(1, 2, figsize=(12, 6))

    probplot(segment1, dist = 'norm', rvalue=True, plot=ax[0])
    ax[0].set_title('rating_Probplot for Normal Distribution')

    probplot(segment2, dist = 'norm', rvalue=True, plot=ax[1])
    ax[1].set_title('rating_Probplot for Normal Distribution')

    plt.tight_layout()
    plt.show()
    
# 분포 검정
def get_mann(segment1, segment2):
    from scipy.stats import mannwhitneyu

    statistic, pvalue = mannwhitneyu(segment1, segment2, alternative = 'less', method = 'auto')

    print(f'mann-whitney statistic: {statistic}')
    print(f'p-value: {pvalue}')

    if pvalue > 0.05:
        print('귀무가설 채택: 두 샘플의 분포는 동일합니다')
    else:
        print('대립가설 채택: 두 샘플의 분포는 다릅니다')

# T-test 함수
def get_ttest(segment1, segment2):
    from scipy.stats import ttest_ind

    # T-test 수행 (equal_var=False는 Welch's t-test를 수행)
    statistic, pvalue = ttest_ind(segment1, segment2, equal_var=False, alternative='less')

    print(f'T-test statistic: {statistic}')
    print(f'p-value: {pvalue}')

    if pvalue > 0.05:
        print('귀무가설 채택: 두 샘플의 평균은 동일합니다')
    else:
        print('대립가설 채택: 두 샘플의 평균은 다릅니다')        
        
# 시각화로 확인
def get_kde(segment1, segment2):
    mean1 = np.mean(segment1)
    mean2 = np.mean(segment2)
    plt.figure(figsize=(12, 6))
    sns.histplot(segment1, color="blue", label='segment1', kde=True)
    sns.histplot(segment2, color="orange", label='segment2', kde=True)
    plt.axvline(mean1, color='blue', linestyle='--', label=f'Segment 1 Mean: {mean1:.2f}')
    plt.axvline(mean2, color='orange', linestyle='--', label=f'Segment 2 Mean: {mean2:.2f}')
    plt.legend()
    plt.title("Conversion Ratings by Segment")
    plt.show()

True_data, False_data = get_shapiro(True, False)
get_normal(True_data, False_data)
get_mann(True_data, False_data)
get_ttest(True_data, False_data)
get_kde(True_data, False_data)

>>>
True: statistic=0.9876415783180748, p-value=9.552277734597878e-41
False: statistic=0.9643150423383146, p-value=2.685188592182786e-57
정규성을 만족하지 않는 그룹이 있습니다.

>>>
mann-whitney statistic: 54921185.0
p-value: 0.0
대립가설 채택: 두 샘플의 분포는 다릅니다
T-test statistic: -191.12615572805595
p-value: 0.0
대립가설 채택: 두 샘플의 평균은 다릅니다

최근 이탈 고객 8129명의 매출 변화 추세

# 최근 이탈 고객 8129명의 매출 변화 추세

df = merged_df[merged_df['customer_id'].isin(churn_customer)]
churned_trend = df.groupby(merged_df['transaction_date'].dt.to_period('M')).sum()
plt.figure(figsize=(10, 6))
plt.plot(churned_trend.index.astype(str), churned_trend['amount'], marker='o', label='Churned Revenue')
plt.title('Churned Customer Revenue Trend')
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()

이탈 여부가 매출에 미치는 영향 회귀 분석

# 이탈 여부가 매출에 미치는 영향 회귀 분석

import statsmodels.api as sm

df = result.copy()
df['is_churned_numeric'] = df['is_churned'].astype(int)
df = df[['is_churned_numeric', 'amount']]
df = np.log(df + 1)
X = df[['is_churned_numeric']]
y = df['amount']
X = sm.add_constant(X)  # 상수항 추가

# 회귀 분석 수행
model = sm.OLS(y, X).fit()
print(model.summary())

# 회귀선 계산
y_pred = model.predict(X)  # OLS 모델로 예측된 값

# 시각화
plt.figure(figsize=(10, 6))
plt.scatter(df['is_churned_numeric'], df['amount'], alpha=0.5, label='Actual Data')  # 실제 데이터 산점도
plt.plot(df['is_churned_numeric'], y_pred, color='red', label='Regression Line')  # 회귀선
plt.title('Impact of Churn on Revenue')
plt.xlabel('Is Churned (Numeric)')
plt.ylabel('Amount')
plt.xticks([0, 1], ['Not Churned', 'Churned'])  
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

OLS Regression Results                            
==============================================================================
Dep. Variable:                 amount   R-squared:                       0.443
Model:                            OLS   Adj. R-squared:                  0.443
Method:                 Least Squares   F-statistic:                 3.658e+04
Date:                Fri, 17 Jan 2025   Prob (F-statistic):               0.00
Time:                        23:24:35   Log-Likelihood:                -72177.
No. Observations:               46061   AIC:                         1.444e+05
Df Residuals:                   46059   BIC:                         1.444e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 15.8999      0.008   2018.536      0.000      15.884      15.915
is_churned_numeric    -2.9868      0.016   -191.250      0.000      -3.017      -2.956
==============================================================================
Omnibus:                     1003.985   Durbin-Watson:                   1.992
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1070.556
Skew:                          -0.368   Prob(JB):                    3.40e-233
Kurtosis:                       3.123   Cond. No.                         3.32
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

이분 변수와 연속형 변수 간 선형 관계 측정

# 이분 변수와 연속형 변수 간의 선형 관계 측정

from scipy.stats import pointbiserialr

df = result.copy()
df['is_churned_numeric'] = df['is_churned'].astype(int)
df = df[['is_churned_numeric', 'amount']]
df = np.log(df + 1)
X = df['is_churned_numeric']
y = df['amount']
correlation, p_value = pointbiserialr(X, y)

print(f"Point-Biserial Correlation: {correlation}")
print(f"P-value: {p_value}")

>>>
Point-Biserial Correlation: -0.6653009869972599
P-value: 0.0

데이터 분석 조각모음