Complete Guide to Python Data Processing and Pandas

Why is Python the dominant language in data science and AI? A large part of the answer is its powerful data processing libraries, Pandas and NumPy.

This guide covers the data processing patterns used daily in practice, from reading CSV files to statistical analysis and outlier removal.

1. NumPy: Foundation of High-Performance Array Operations

NumPy is the core library for numerical computing in Python. For element-wise numeric work on large arrays it is often 10-100x faster than looping over Python's built-in lists, and it handles multi-dimensional arrays efficiently.

    import numpy as np

    # Array creation
    arr = np.array([1, 2, 3, 4, 5])
    zeros = np.zeros(5)              # [0. 0. 0. 0. 0.]
    ones = np.ones((3, 3))           # 3x3 matrix filled with 1.0
    range_arr = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]

    # Vectorized operations (no explicit Python loop)
    arr * 2   # [2, 4, 6, 8, 10]
    arr + 10  # [11, 12, 13, 14, 15]
    arr ** 2  # [1, 4, 9, 16, 25]

    # Statistical functions
    arr.mean()  # Mean: 3.0
    arr.std()   # Population standard deviation (ddof=0): about 1.414
    arr.sum()   # Sum: 15
    arr.min()   # Min: 1
    arr.max()   # Max: 5

💡 NumPy vs Python List

With a Python list you write [x * 2 for x in items]; with NumPy, arr * 2 processes the whole array in one call. The loop runs in compiled C code rather than in the Python interpreter, which is why it is much faster.
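The comparison above can be tried directly. The following is a minimal sketch, assuming a hypothetical list of 100,000 integers; it checks that both approaches give the same numbers and prints timings, which will vary by machine and are not guaranteed to match any particular ratio.

```python
import timeit

import numpy as np

values = list(range(100_000))  # hypothetical sample data
arr = np.array(values)

doubled_list = [x * 2 for x in values]  # Python-level loop over each element
doubled_arr = arr * 2                   # one vectorized call on the whole array

# Both approaches produce the same numbers
same_result = doubled_arr.tolist() == doubled_list

# Time 20 repetitions of each approach
list_time = timeit.timeit(lambda: [x * 2 for x in values], number=20)
numpy_time = timeit.timeit(lambda: arr * 2, number=20)

print(f"same result: {same_result}")
print(f"list: {list_time:.4f}s, numpy: {numpy_time:.4f}s")
```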

2. Pandas DataFrame: Excel for Data

The core of Pandas is the DataFrame: a 2D table of rows and columns, like an Excel sheet, that you can query and reshape much as you would with SQL.

Creating DataFrames

    import pandas as pd

    # Create from a dictionary (keys become column names)
    data = {
        'name': ['Kim', 'Lee', 'Park'],
        'age': [25, 30, 35],
        'city': ['Seoul', 'Busan', 'Incheon'],
    }
    df = pd.DataFrame(data)

    # Or read from a CSV file instead (this rebinds df; assumes data.csv exists)
    df = pd.read_csv('data.csv')

    # Basic info
    df.head()      # First 5 rows
    df.tail(3)     # Last 3 rows
    df.info()      # Data types, non-null counts
    df.describe()  # Statistical summary of numeric columns
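To try read_csv without a file on disk, the CSV text can be wrapped in an in-memory buffer. This is a small sketch, assuming a hypothetical three-row CSV that mirrors the dictionary example; the column names are taken from the header line.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """name,age,city
Kim,25,Seoul
Lee,30,Busan
Park,35,Incheon
"""

# read_csv accepts any file-like object, not only a path
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.dtypes)  # age is parsed as an integer column
print(df.shape)   # (rows, columns)
```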

Data Selection & Filtering

    # Column selection
    df['name']           # Single column (Series)
    df[['name', 'age']]  # Multiple columns (DataFrame)

    # Conditional filtering
    df[df['age'] > 28]         # Rows where age > 28
    df[df['city'] == 'Seoul']  # Seoul residents

    # Multiple conditions (parenthesize each one; use & and |, not and/or)
    df[(df['age'] >= 25) & (df['age'] <= 35)]
    df[(df['city'] == 'Seoul') | (df['city'] == 'Busan')]

    # Position-based selection
    df.iloc[0]    # First row
    df.iloc[0:3]  # Rows at positions 0-2

    # Label-based selection (here with a boolean mask)
    df.loc[df['name'] == 'Kim']
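The difference between iloc and loc is easiest to see when the index labels are not 0, 1, 2. The sketch below assumes a hypothetical index of 10, 20, 30 to show that iloc counts positions while loc looks up labels, and that their slices treat the end point differently.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Kim', 'Lee', 'Park'], 'age': [25, 30, 35]},
    index=[10, 20, 30],  # hypothetical non-default labels
)

by_position = df.iloc[0]  # first row, whatever its label is
by_label = df.loc[10]     # row whose index label is 10

# iloc slices exclude the end position; loc slices include the end label
pos_slice = df.iloc[0:2]     # positions 0 and 1
label_slice = df.loc[10:20]  # labels 10 and 20

print(by_position['name'], by_label['name'])
print(len(pos_slice), len(label_slice))
```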

3. Data Cleaning & Transformation

Handling Missing Values

    # Check missing values
    df.isnull().sum()  # Null count per column
    df.isnull().any()  # True for each column that contains a null

    # Remove missing values (each call returns a new DataFrame)
    df.dropna()        # Drop rows with nulls
    df.dropna(axis=1)  # Drop columns with nulls

    # Fill missing values (assign the result to keep it)
    df.fillna(0)                                      # Fill with 0
    df.fillna(df.mean(numeric_only=True))             # Fill numeric columns with their mean
    df['age'] = df['age'].fillna(df['age'].median())  # Specific column
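Here is a small worked example of the calls above, assuming a hypothetical four-row table with one missing age and one missing city. It also shows that fillna and dropna return new objects, so the original df is unchanged unless the result is assigned.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing city
df = pd.DataFrame({
    'name': ['Kim', 'Lee', 'Park', 'Choi'],
    'age': [25.0, np.nan, 35.0, 30.0],
    'city': ['Seoul', 'Busan', None, 'Seoul'],
})

null_counts = df.isnull().sum()  # name: 0, age: 1, city: 1
complete_rows = df.dropna()      # keeps only the Kim and Choi rows

filled = df.copy()
# The median of the known ages (25, 30, 35) is 30
filled['age'] = filled['age'].fillna(filled['age'].median())
filled['city'] = filled['city'].fillna('Unknown')

print(null_counts)
print(filled)
```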

Data Transformation

    # Add new column
    df['age_group'] = df['age'] // 10 * 10  # 20, 30, ... (20s, 30s, etc.)

    # Apply a function to each value (df['name'].str.upper() is the vectorized equivalent)
    df['name_upper'] = df['name'].apply(lambda x: x.upper())

    # Assign based on condition
    df['category'] = df['age'].apply(
        lambda x: 'young' if x < 30 else 'old'
    )

    # Rename columns (returns a new DataFrame; df itself is unchanged)
    df.rename(columns={'name': 'Name', 'age': 'Age'})
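The apply calls above run a Python function once per value. As a sketch of an alternative, and not the only reasonable approach, the same columns can be computed with vectorized operations: np.where for the condition and the .str accessor for the string method. The three-row table is hypothetical sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Kim', 'Lee', 'Park'], 'age': [25, 30, 35]})

# apply-based version, as in the text
via_apply = df['age'].apply(lambda x: 'young' if x < 30 else 'old')

# Vectorized equivalents
via_where = np.where(df['age'] < 30, 'young', 'old')  # returns a NumPy array
upper = df['name'].str.upper()
age_group = df['age'] // 10 * 10

print(via_apply.tolist())
print(via_where.tolist())
```

On a table this small the difference is negligible; the vectorized forms tend to matter once the column has many rows.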

4. Grouping & Aggregation

Like SQL's GROUP BY, groupby splits the rows into groups and computes an aggregate for each group.

    # Average age by city
    df.groupby('city')['age'].mean()

    # Multiple aggregations at once
    df.groupby('city')['age'].agg(['mean', 'median', 'std'])

    # Group by multiple columns
    df.groupby(['city', 'age_group']).size()  # Row count per group

    # Custom aggregation
    df.groupby('city')['age'].apply(
        lambda x: x.max() - x.min()  # Range
    )
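To make the grouping concrete, here is a minimal sketch on a hypothetical five-row table. It also uses pandas named aggregation, which gives each output column an explicit name; the output names mean_age, age_range, and count are chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'city': ['Seoul', 'Seoul', 'Busan', 'Busan', 'Incheon'],
    'age': [25, 35, 30, 40, 28],
})

mean_age = df.groupby('city')['age'].mean()

# Named aggregation: output_column=(input_column, function)
summary = df.groupby('city').agg(
    mean_age=('age', 'mean'),
    age_range=('age', lambda x: x.max() - x.min()),
    count=('age', 'size'),
)

print(mean_age)
print(summary)
```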

5. Practical Patterns: Data Analysis & Outlier Removal

Let's look at the patterns used most often in real projects.

Outlier Removal (IQR Method)

    def filter_outliers(df: pd.DataFrame, column: str) -> pd.DataFrame:
        """Remove outliers using the IQR method."""
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        return df[(df[column] >= lower) & (df[column] <= upper)]

    # Usage example (assumes df has a numeric 'value' column)
    clean_df = filter_outliers(df, 'value')
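A worked example helps show what the bounds look like. The sketch below repeats the function so it runs on its own and applies it to six hypothetical values, one of which (100) is far from the rest. With pandas' default linear interpolation, Q1 is 11.25 and Q3 is 12.75, so the IQR is 1.5 and the accepted range is 9.0 to 15.0.

```python
import pandas as pd


def filter_outliers(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Keep rows whose value lies within 1.5 * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]


# Hypothetical measurements with one obvious outlier
df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 100]})

q1 = df['value'].quantile(0.25)  # 11.25
q3 = df['value'].quantile(0.75)  # 12.75

clean_df = filter_outliers(df, 'value')
print(clean_df['value'].tolist())  # the row with 100 is dropped
```

Note that the function returns a filtered view of the rows and leaves the original df untouched.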

Data Analysis Summary Function

    def analyze_data(df: pd.DataFrame) -> dict:
        """Statistical summary of the 'value' column."""
        return {
            'mean': df['value'].mean(),
            'median': df['value'].median(),
            'std': df['value'].std(),
            'min': df['value'].min(),
            'max': df['value'].max(),
            'count': len(df)
        }

    # Usage (assumes df has a numeric 'value' column)
    stats = analyze_data(df)
    print(f"Mean: {stats['mean']}, Median: {stats['median']}")
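One detail worth knowing when reading the 'std' entry: pandas' Series.std() defaults to the sample standard deviation (ddof=1), while NumPy's ndarray.std() defaults to the population standard deviation (ddof=0). The sketch below, on the hypothetical values 1 through 5, shows the two defaults side by side, plus Series.agg as a compact way to compute several of the same statistics.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 5]  # hypothetical sample data
df = pd.DataFrame({'value': values})

pandas_std = df['value'].std()      # sample std (ddof=1): sqrt(2.5)
numpy_std = np.array(values).std()  # population std (ddof=0): sqrt(2)
matched = df['value'].std(ddof=0)   # pandas with NumPy's default

# Several statistics in one call, returned as a Series
summary = df['value'].agg(['mean', 'median', 'std', 'min', 'max', 'count'])

print(f"pandas: {pandas_std:.4f}, numpy: {numpy_std:.4f}")
print(summary)
```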

6. Saving & Exporting Data

    # Save as CSV
    df.to_csv('output.csv', index=False)

    # Save as Excel (requires an Excel engine such as openpyxl)
    df.to_excel('output.xlsx', sheet_name='Sheet1')

    # Save as JSON
    df.to_json('output.json', orient='records')

    # Convert to dictionary
    df.to_dict('records')  # [{}, {}, ...]
    df.to_dict('list')     # {'col': [], ...}
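To see what these exports actually produce, here is a small sketch on a hypothetical two-row table. It writes the CSV to an in-memory buffer instead of a file so the text can be inspected and read back.

```python
import io

import pandas as pd

df = pd.DataFrame({'name': ['Kim', 'Lee'], 'age': [25, 30]})

records = df.to_dict('records')  # one dict per row
columns = df.to_dict('list')     # one list per column

# Round trip through CSV text without touching the disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
restored = pd.read_csv(io.StringIO(csv_text))

print(records)
print(columns)
print(csv_text)
```

Passing index=False keeps the row index out of the file, so reading it back yields the same two columns rather than an extra unnamed one.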

7. Conclusion: Learning Data Processing Through Typing

Data processing is a core skill in modern development. Pandas and NumPy show up in a wide range of work, including web applications, AI, and data analysis.

✨ Patterns Included in Typing Practice

import pandas as pd, import numpy as np
Aggregation functions like df['column'].mean(), df.groupby()
Conditional filtering like df[df['age'] > 30]
Missing value handling with df.fillna(), df.dropna()
List comprehensions and lambda functions

Typing these patterns out yourself builds the fluency to write this kind of code naturally in real data analysis projects.