Whisper Speech Recognition Guide (Python Version)


행복한 수지아빠 2025. 2. 11. 12:34

This guide shows how to implement speech recognition (speech-to-text, STT) with OpenAI's Whisper model.

Installation

1. Basic installation

pip install openai-whisper

2. Installing dependencies

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download a build from the official FFmpeg site and add it to PATH
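
Whisper calls the ffmpeg executable to decode audio files, so it can be worth checking that it is reachable before loading a model. The following is a minimal sketch using only the standard library; `find_ffmpeg` is a hypothetical helper name, not part of Whisper.

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


path = find_ffmpeg()
print(f"ffmpeg found at {path}" if path else "ffmpeg not found; install it first")
```

Running this check at startup lets a script fail early with a clear message instead of failing on the first transcription.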

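Model types

Whisper provides models in several sizes. Speed is relative to the large model, and all figures are approximate:

Model	Parameters	Relative speed	Memory usage	Suitable for
tiny	39M	~32x	~1 GB	Quick tests, simple speech
base	74M	~16x	~1 GB	General speech recognition
small	244M	~6x	~2 GB	When higher accuracy is needed
medium	769M	~2x	~5 GB	Professional use
large	1550M	1x	~10 GB	When the highest accuracy is needed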
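
The memory column of the table above can be turned into a simple selection rule. This is only a sketch: `MODEL_VRAM_GB` and `pick_model` are hypothetical names, and the numbers are the approximate figures from the table, not measured values.

```python
# Approximate memory requirements in GB, copied from the table above.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model(available_gb: float) -> str:
    """Return the largest model whose approximate memory requirement fits."""
    fitting = [name for name, need in MODEL_VRAM_GB.items() if need <= available_gb]
    if not fitting:
        raise ValueError("not enough memory for any model")
    # The dict is ordered from smallest to largest, so the last fit is the largest.
    return fitting[-1]


print(pick_model(4))
```

The returned name can be passed straight to `whisper.load_model(...)`.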

Basic usage

1. Simple speech recognition

import whisper

# Load the model
model = whisper.load_model("base")

# Transcribe the audio file
result = model.transcribe("audio.mp3")

# Print the result
print(result["text"])

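2. Detailed options

# Set various options
result = model.transcribe(
    "audio.mp3",
    language="en",          # language spoken in the audio
    task="transcribe",      # "transcribe" or "translate"
    temperature=0.2,        # sampling temperature (0-1); 0 is deterministic
    word_timestamps=True,   # word-level timestamps
    fp16=False,             # use FP32 (required on CPU); fp16=True reduces GPU memory use
)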

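3. Asynchronous implementation for real-time processing

import asyncio
import whisper

class WhisperService:
    def __init__(self):
        # Load the model once and reuse it for every request
        self.model = whisper.load_model("base")

    async def transcribe_audio(self, audio_path: str):
        # transcribe() blocks, so run it in a worker thread (Python 3.9+)
        return await asyncio.to_thread(
            self.model.transcribe,
            audio_path
        )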

Advanced features

1. Extracting word-level timestamps

result = model.transcribe(
    "audio.mp3",
    word_timestamps=True
)

for segment in result["segments"]:
    for word in segment["words"]:
        # Each word entry stores its text under the "word" key
        print(f"Word: {word['word']}")
        print(f"Start: {word['start']:.2f}s")
        print(f"End: {word['end']:.2f}s")
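
Word-level timestamps make it possible to locate where a given word is spoken. The sketch below assumes a result produced with `word_timestamps=True`, where each word entry keeps its text under the `word` key; `find_word` is a hypothetical helper, and the punctuation stripping is a simplification.

```python
from typing import Any, Dict, List, Tuple


def find_word(result: Dict[str, Any], target: str) -> List[Tuple[float, float]]:
    """Return (start, end) times for every occurrence of `target` in the result."""
    hits = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            # Word strings usually carry a leading space and may carry punctuation.
            spoken = word["word"].strip().strip(".,!?").lower()
            if spoken == target.lower():
                hits.append((word["start"], word["end"]))
    return hits


sample = {
    "segments": [
        {"words": [
            {"word": " Hello", "start": 0.0, "end": 0.4},
            {"word": " world.", "start": 0.5, "end": 0.9},
        ]}
    ]
}
print(find_word(sample, "world"))
```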

2. Language detection

# Automatic language detection
result = model.transcribe("audio.mp3")
detected_language = result["language"]

# Force a specific language
result = model.transcribe("audio.mp3", language="ko")
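
When only the language is needed, a full transcription can be avoided by using Whisper's lower-level functions on the first 30 seconds of audio. The sketch below follows the pattern in the Whisper README and assumes a recent `openai-whisper` release (one where `log_mel_spectrogram` accepts `n_mels`); `top_languages` and this `detect_language` wrapper are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def top_languages(probs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (language code, probability) pairs."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


def detect_language(model, audio_path: str, k: int = 3) -> List[Tuple[str, float]]:
    """Detect the language from the first 30 seconds of an audio file."""
    import whisper  # imported lazily so top_languages works without Whisper installed

    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)
    return top_languages(probs, k)


print(top_languages({"en": 0.1, "ko": 0.8, "ja": 0.05}, k=2))
```

Looking at the top few candidates rather than a single label helps spot ambiguous audio before forcing a language.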

3. Translation

# Translate into English
result = model.transcribe(
    "audio.mp3",
    task="translate"  # the output is always English
)

Performance optimization

1. GPU acceleration

# Use CUDA (NVIDIA GPU)
model = whisper.load_model("base", device="cuda")

# Force CPU
model = whisper.load_model("base", device="cpu")
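
`load_model` already picks CUDA when it is available and no device is given, but choosing the device explicitly makes it easy to set `fp16` to match. A minimal sketch; `pick_device` is a hypothetical helper that falls back to CPU when PyTorch is missing or sees no GPU.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# Usage with Whisper (not executed here):
# model = whisper.load_model("base", device=device)
# result = model.transcribe("audio.mp3", fp16=(device == "cuda"))
```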

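2. Batch processing

import asyncio

# Process multiple audio files one after another without blocking the event loop
async def process_multiple_files(files):
    results = []
    for file in files:
        # Each blocking transcribe() call runs in a worker thread
        result = await asyncio.to_thread(
            model.transcribe,
            file
        )
        results.append(result)
    return results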
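
The loop above handles one file at a time. If several transcriptions should overlap, a semaphore can cap how many run at once. This is a sketch under assumptions: `transcribe_many` is a hypothetical helper, it takes the blocking function as a parameter (for example `model.transcribe`), and whether concurrent calls on a single model are safe or faster depends on the setup, so `max_concurrent=1` is the conservative choice.

```python
import asyncio
from typing import Any, Callable, Dict, List


async def transcribe_many(
    transcribe: Callable[[str], Dict[str, Any]],
    files: List[str],
    max_concurrent: int = 2,
) -> List[Dict[str, Any]]:
    """Run a blocking transcribe function over many files with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path: str) -> Dict[str, Any]:
        async with semaphore:
            # Run the blocking call in a worker thread so the event loop stays free.
            return await asyncio.to_thread(transcribe, path)

    # gather() returns results in the same order as the input files.
    return await asyncio.gather(*(worker(path) for path in files))


def fake_transcribe(path: str) -> Dict[str, Any]:
    """Stand-in for model.transcribe so the sketch runs without Whisper."""
    return {"text": path.upper()}


results = asyncio.run(transcribe_many(fake_transcribe, ["a.mp3", "b.mp3", "c.mp3"]))
print([r["text"] for r in results])
```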

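3. Memory management

import torch

# Release cached GPU memory that PyTorch no longer needs
torch.cuda.empty_cache()

# Half precision (FP16) is an option of transcribe() and requires a GPU
model = whisper.load_model("base", device="cuda")
result = model.transcribe("audio.mp3", fp16=True)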

Result format

1. Basic output structure

{
    "text": "full text",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 3.0,
            "text": "segment text",
            "tokens": [...],
            "temperature": 0.0,
            "avg_logprob": -0.5,
            "compression_ratio": 1.1,
            "no_speech_prob": 0.1,
            "words": [
                {
                    "word": "word",
                    "start": 0.5,
                    "end": 1.0,
                    "probability": 0.9
                }
            ]
        }
    ],
    "language": "detected language"
}
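
Because every segment carries `start`, `end`, and `text`, the structure above maps directly onto subtitle formats. The sketch below builds SRT text by hand to show how the fields fit together; `format_timestamp` and `to_srt` are hypothetical helper names defined here, not the library's own writers.

```python
from typing import Any, Dict


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(result: Dict[str, Any]) -> str:
    """Build SRT subtitle text from the segments of a transcription result."""
    blocks = []
    for index, segment in enumerate(result["segments"], start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


sample = {"segments": [{"start": 0.0, "end": 3.0, "text": " Hello"}]}
print(to_srt(sample))
```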

Error handling

1. Common errors

try:
    result = model.transcribe("audio.mp3")
except Exception as e:
    if "CUDA out of memory" in str(e):
        # Out of GPU memory
        torch.cuda.empty_cache()
        # Fall back to a smaller model
        model = whisper.load_model("tiny")
    elif "No such file" in str(e):
        # File not found
        print("Audio file not found")
    else:
        # Any other error
        print(f"Error: {e}")

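2. Troubleshooting performance problems

Out of memory
- Use a smaller model
- Pass fp16=True to transcribe() (GPU only)
- Reduce the batch size
Processing speed
- Check that the GPU is being used
- Choose a smaller model
- Limit the audio length
Accuracy
- Use a larger model
- Remove noise
- Improve the audio quality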
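
Limiting audio length can be done with ffmpeg before transcription. The sketch below only builds the command line; `build_ffmpeg_command` is a hypothetical helper, and the 16 kHz mono target matches what Whisper resamples to internally, so the conversion mainly helps when trimming or inspecting files.

```python
from typing import List, Optional


def build_ffmpeg_command(src: str, dst: str, max_seconds: Optional[float] = None) -> List[str]:
    """Build an ffmpeg command that converts audio to 16 kHz mono."""
    command = ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1"]
    if max_seconds is not None:
        # As an output option, -t limits the duration of the written file.
        command += ["-t", str(max_seconds)]
    command.append(dst)
    return command


print(" ".join(build_ffmpeg_command("audio.mp3", "audio_16k.wav", max_seconds=30)))
```

The list can be passed to `subprocess.run(command, check=True)` to perform the conversion.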

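Tips and best practices

Model selection
- Development/testing: tiny or base
- Production: small or medium
- Best quality: large
Preprocessing
- Remove audio noise
- Set an appropriate sampling rate
- Remove silent sections
Postprocessing
- Normalize punctuation
- Clean up the text
- Filter out low-confidence parts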
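
Filtering low-confidence parts can be sketched with the per-segment `avg_logprob` and `no_speech_prob` fields. `filter_segments` is a hypothetical helper; the default thresholds mirror the defaults of Whisper's `logprob_threshold` and `no_speech_threshold` options, but applying them as independent cutoffs is an assumption made here and the values likely need tuning per dataset.

```python
from typing import Any, Dict, List


def filter_segments(
    result: Dict[str, Any],
    min_avg_logprob: float = -1.0,
    max_no_speech_prob: float = 0.6,
) -> List[Dict[str, Any]]:
    """Keep only the segments that look like confidently recognized speech."""
    kept = []
    for segment in result["segments"]:
        if segment["avg_logprob"] < min_avg_logprob:
            continue  # the model was unsure about these tokens
        if segment["no_speech_prob"] > max_no_speech_prob:
            continue  # probably silence or noise
        kept.append(segment)
    return kept


sample = {
    "segments": [
        {"text": "clear speech", "avg_logprob": -0.3, "no_speech_prob": 0.1},
        {"text": "unsure", "avg_logprob": -1.5, "no_speech_prob": 0.1},
        {"text": "noise", "avg_logprob": -0.3, "no_speech_prob": 0.9},
    ]
}
print([s["text"] for s in filter_segments(sample)])
```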