[Pandasai] Pandasai Tutorial 2

Goals

Load an LLM in several different ways and have it analyze pandas DataFrames.
Use SmartDatalake
Use Agent

In Tutorial 1 we loaded an LLM and tried out PandasAI's SmartDataframe.

This time we use SmartDatalake and Agent.

Link for issuing an OpenAI API key

In Settings (the gear icon) > Billing, pay at least 5 dollars, then set a usage cap under Limits.

Under Your profile, open User API Keys and click the Create new secret key button.

Once the key is issued, put the OpenAI API key in your .env file.

import os
from dotenv import load_dotenv

# Load the API key from the .env file
load_dotenv()

True
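
The True above only says that load_dotenv() found something to load; it does not confirm that the OpenAI key itself is set. A minimal sketch of a sanity check using only the standard library follows. The helper name mask_api_key is hypothetical, and the variable name OPENAI_API_KEY is assumed to match the one used in the .env file.

```python
import os


def mask_api_key(env_var="OPENAI_API_KEY"):
    """Return a masked preview of the key, or None if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        return None
    # Show only the first 3 and last 4 characters so the key never lands in logs.
    if len(key) <= 8:
        return "***"
    return f"{key[:3]}...{key[-4:]}"


print(mask_api_key())
```

If this prints None, the .env file was probably not picked up or uses a different variable name.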

PandasAI also supports queries that span multiple dataframes.

For such queries, use SmartDatalake instead of SmartDataframe.

PandasAI SmartDatalake documentation: Getting started with the Library - PandasAI (docs.pandas-ai.com)

import pandas as pd
from pandasai import SmartDatalake

Choose a model.

from pandasai.llm import OpenAI
from pandasai.llm.local_llm import LocalLLM
from langchain_community.llms import Ollama
from langchain_community.chat_models import ChatOllama


# llm = OpenAI(api_token = os.environ['OPENAI_API_KEY'])

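llm = OpenAI(
    temperature=0.2,  # sampling temperature (0.0 ~ 2.0); higher is more creative
    model="gpt-3.5-turbo",  # model name (PandasAI's OpenAI wrapper reads `model`, not `model_name`)
)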

# llm = LocalLLM(
#             api_base="http://localhost:11434/v1", 
#             model="llama3.1"
#             )

# llm = Ollama(model="llama3.1", temperature=0.3)

Create some sample data.

employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance']
}

salaries_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Salary': [5000, 6000, 4500, 7000, 5500]
}

employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)

employees_df.head()

	EmployeeID	Name	Department
0	1	John	HR
1	2	Emma	Sales
2	3	Liam	IT
3	4	Olivia	Marketing
4	5	William	Finance

salaries_df.head()

	EmployeeID	Salary
0	1	5000
1	2	6000
2	3	4500
3	4	7000
4	5	5500

lake = SmartDatalake(
    [employees_df, salaries_df],
    config={"llm": llm},
)
lake.chat("Salary가 가장 높은 사람은 누구인가요?")  # Korean prompt: "Who has the highest salary?"

{'type': 'string', 'value': 'The highest salary is 7000 and it belongs to Olivia.'}





'The highest salary is 7000 and it belongs to Olivia.'
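
To sanity-check the answer, the same question can be answered directly in pandas. This is only a sketch of the kind of code the LLM might generate behind the scenes; the code PandasAI actually produces may differ.

```python
import pandas as pd

employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
})
salaries_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
})

# Join the two tables on their shared key, then pick the row with the top salary.
merged = employees_df.merge(salaries_df, on="EmployeeID")
top = merged.loc[merged["Salary"].idxmax()]
print(top["Name"], top["Salary"])  # Olivia 7000
```

The join on EmployeeID is the step that a single SmartDataframe cannot do, which is why the two frames are handed to SmartDatalake together.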

lake = SmartDatalake(
    [employees_df, salaries_df],
    config={"llm": llm, "verbose": True},
)
lake.chat("Salary가 가장 높은 사람은 누구인가요?")  # Korean prompt: "Who has the highest salary?"

{'type': 'string', 'value': 'The highest salary is 7000 and it belongs to Olivia.'}





'The highest salary is 7000 and it belongs to Olivia.'

lake.chat("Return a dataframe of name against salaries")

{'type': 'dataframe', 'value':       Name  Salary
0     John    5000
1     Emma    6000
2     Liam    4500
3   Olivia    7000
4  William    5500}

	Name	Salary
0	John	5000
1	Emma	6000
2	Liam	4500
3	Olivia	7000
4	William	5500

lake.chat("Salary 대비 Name 데이터 프레임을 반환해주세요.")  # Korean prompt: "Return a dataframe of name against salaries"

{'type': 'dataframe', 'value':       Name  Salary
0     John    5000
1     Emma    6000
2     Liam    4500
3   Olivia    7000
4  William    5500}

	Name	Salary
0	John	5000
1	Emma	6000
2	Liam	4500
3	Olivia	7000
4	William	5500

Custom Response

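In the examples above, results sometimes come back as a pandas DataFrame and sometimes as printed text.

Is there a way to always get the output in the format we want?

PandasAI gives you the flexibility to handle chat responses in a custom way.

By default, PandasAI includes a ResponseParser class that can be extended to modify the response output as needed.

You can also supply a custom parser, such as StreamlitResponse, through the config object, as follows.

Custom Response documentation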

import os
import pandas as pd
from pandasai import SmartDatalake
from pandasai.responses.response_parser import ResponseParser

class PandasDataFrame(ResponseParser):

    def __init__(self, context) -> None:
        super().__init__(context)

    def format_dataframe(self, result):
        # Returns Pandas Dataframe instead of SmartDataFrame
        return result["value"]

agent = SmartDatalake(
    [employees_df, salaries_df],
    config={"llm": llm, "response_parser": PandasDataFrame},
)

response = agent.chat("Return a dataframe of name against salaries")

{'type': 'dataframe', 'value':       Name  Salary
0     John    5000
1     Emma    6000
2     Liam    4500
3   Olivia    7000
4  William    5500}

response

	Name	Salary
0	John	5000
1	Emma	6000
2	Liam	4500
3	Olivia	7000
4	William	5500

response = agent.chat("Salary 대비 Name 데이터 프레임을 반환해주세요.")  # Korean prompt: "Return a dataframe of name against salaries"

response

{'type': 'dataframe', 'value':       Name  Salary
0     John    5000
1     Emma    6000
2     Liam    4500
3   Olivia    7000
4  William    5500}

	Name	Salary
0	John	5000
1	Emma	6000
2	Liam	4500
3	Olivia	7000
4	William	5500
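
The printed results in this post all have the shape {'type': ..., 'value': ...}, and a custom parser is essentially a dispatch on that type field. The idea can be sketched without PandasAI as a plain function. The name unwrap_response is hypothetical and not part of the library; only the 'string' and 'dataframe' types appear in this post, and the 'number' branch is an assumption.

```python
import pandas as pd


def unwrap_response(result):
    """Return the payload of a {'type': ..., 'value': ...} dict in a predictable Python type."""
    kind, value = result["type"], result["value"]
    if kind == "dataframe":
        return pd.DataFrame(value)
    if kind == "number":  # assumed type name, not shown in this post
        return float(value)
    # Fall back to text for 'string' and any type not handled here.
    return str(value)


demo = {"type": "string", "value": "The highest salary is 7000 and it belongs to Olivia."}
print(unwrap_response(demo))
```

Overriding one method per type on ResponseParser, as the PandasDataFrame class above does for dataframes, keeps the same dispatch inside the library instead of in calling code.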

Agent

A SmartDataframe or SmartDatalake answers a single query and is meant for a single session and exploratory data analysis (EDA), whereas an Agent can be used for multi-turn conversations.

import os
from pandasai import Agent
import pandas as pd

# Sample DataFrames
sales_by_country = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "sales": [5000, 3200, 2900, 4100, 2300, 2100, 2500, 2600, 4500, 7000],
    "deals_opened": [142, 80, 70, 90, 60, 50, 40, 30, 110, 120],
    "deals_closed": [120, 70, 60, 80, 50, 40, 30, 20, 100, 110]
})


sales_by_country

	country	sales	deals_opened	deals_closed
0	United States	5000	142	120
1	United Kingdom	3200	80	70
2	France	2900	70	60
3	Germany	4100	90	80
4	Italy	2300	60	50
5	Spain	2100	50	40
6	Canada	2500	40	30
7	Australia	2600	30	20
8	Japan	4500	110	100
9	China	7000	120	110


agent = Agent(sales_by_country, config={"llm": llm})
agent.chat('sales 기준 탑 5 country는 어디입니까?')  # Korean prompt: "Which are the top 5 countries by sales?"

          country  sales  deals_opened  deals_closed
9           China   7000           120           110
0   United States   5000           142           120
8           Japan   4500           110           100
3         Germany   4100            90            80
1  United Kingdom   3200            80            70

	country	sales	deals_opened	deals_closed
9	China	7000	120	110
0	United States	5000	142	120
8	Japan	4500	110	100
3	Germany	4100	90	80
1	United Kingdom	3200	80	70
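
The Agent's table can be cross-checked with plain pandas. A minimal sketch, assuming the same sample data as above; nlargest is one of several ways to get the top rows, and the code the Agent generated is not shown here.

```python
import pandas as pd

sales_by_country = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "sales": [5000, 3200, 2900, 4100, 2300, 2100, 2500, 2600, 4500, 7000],
    "deals_opened": [142, 80, 70, 90, 60, 50, 40, 30, 110, 120],
    "deals_closed": [120, 70, 60, 80, 50, 40, 30, 20, 100, 110],
})

# Keep the 5 rows with the largest sales, sorted in descending order.
top5 = sales_by_country.nlargest(5, "sales")
print(top5["country"].tolist())
# ['China', 'United States', 'Japan', 'Germany', 'United Kingdom']
```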

Unlike a SmartDataframe or SmartDatalake, an Agent keeps track of the conversation state and can respond to multi-turn conversations.

For example:

agent.chat('And which one has the most deals?')

{'type': 'string', 'value': 'The country with the most deals opened is United States with 142 deals.'}





'The country with the most deals opened is United States with 142 deals.'
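
The follow-up question only makes sense because "which one" can be resolved against the previous turn. As a rough illustration of that idea, here is a minimal sketch of a conversation memory that prepends earlier turns to the next prompt. The ConversationMemory class is hypothetical and is not how PandasAI implements its Agent; it only shows why keeping history matters.

```python
class ConversationMemory:
    """Keep the last few (question, answer) pairs and render them as prompt context."""

    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.turns = []

    def add(self, question, answer):
        self.turns.append((question, answer))
        # Drop the oldest turns once the window is full.
        self.turns = self.turns[-self.max_turns:]

    def build_prompt(self, new_question):
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns)
        return f"{history}\nQ: {new_question}".lstrip()


memory = ConversationMemory(max_turns=2)
memory.add("Which are the top 5 countries by sales?",
           "China, United States, Japan, Germany, United Kingdom")
print(memory.build_prompt("And which one has the most deals?"))
```

A stateless object would send only the last line of that prompt, leaving the model with nothing for "which one" to refer to.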

Clarification questions

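An Agent can also ask clarification questions when it does not have enough information to answer a query.

The call returns up to 3 clarification questions that the Agent can put to the user to gather the information it needs to answer.

For example: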

agent.clarification_questions('What is the GDP of the United States?')

['Are you looking for the current GDP or historical GDP data?',
 'Do you want the GDP in nominal terms or adjusted for purchasing power parity (PPP)?',
 'Should I include projections for future GDP growth or just the latest available data?']

Explanation

An Agent can also explain the answer it gave to the user.

For example:

response = agent.chat('What is the GDP of the United States?')
explanation = agent.explain()

print("The answer is", response)
print("The explanation is", explanation)

The answer is 21000000000000
The explanation is In our previous conversation, we discussed sales data for different countries and identified the United States as having the most deals opened. We also noted that the GDP of the United States is 21 trillion dollars.

To create a summary of this information, I organized it into a simple format. First, I listed some other countries along with their sales figures and the number of deals they opened and closed. This gives a clearer picture of how the United States compares to others in terms of sales and deals.

Finally, I highlighted the GDP of the United States as a separate piece of information, indicating its significance. The result is presented in a straightforward way, making it easy to understand the key figures without diving into complex details.

Rephrase Question

Rephrase a question to get a more accurate and comprehensive response from the model.

For example:

rephrased_query = agent.rephrase_query('What is the GDP of the United States?')

print("The rephrased query is", rephrased_query)

The rephrased query is What is the GDP of the United States in dollars?

Config

Throughout this tutorial we customized PandasAI a little through the config argument.

The available settings are described in more detail below.

llm: The LLM to use. You can pass an LLM instance or the name of an LLM.
llm_options: Options to use for the LLM (for example, the API token).
save_logs: Whether to save the LLM logs. Defaults to True. The logs are written to the pandasai.log file in the project root.
verbose: Whether to print logs to the console while PandasAI runs. Defaults to False.
enforce_privacy: Whether to enforce privacy. Defaults to False. When set to True, PandasAI sends only metadata and no data to the LLM. By default, PandasAI sends 5 anonymized samples to improve accuracy.
save_charts: Whether to save the charts PandasAI generates. Defaults to False. Charts are saved in the project root or in the path set by save_charts_path.
save_charts_path: The path where charts are saved. Defaults to exports/charts/. Use this setting to change the default path.
open_charts: Whether to open charts while parsing the LLM response. Defaults to True. Set it to False to disable chart display entirely.
enable_cache: Whether to enable caching. Defaults to True. When True, PandasAI caches LLM results to improve response time. When False, PandasAI always calls the LLM.
use_error_correction_framework: Whether to use the error correction framework. Defaults to True. When True, PandasAI tries to fix errors in the LLM-generated code with additional LLM calls. When False, PandasAI does not attempt to fix code errors.
max_retries: The maximum number of retries when the error correction framework is used. Defaults to 3. Use this setting to change the default retry count.
custom_whitelisted_dependencies: Additional dependencies to whitelist. Defaults to {}. Use this setting to extend the default whitelist. For details on custom whitelisted dependencies, see the PandasAI documentation.
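
As a quick reference, the defaults listed above can be written out as a plain dictionary and then overridden selectively. This is a sketch only: the values are copied from the list above, and whether a given PandasAI version accepts every key is an assumption worth checking against the installed release. The name headless is just an illustrative variable.

```python
# Defaults as described in the list above (the llm entry is left out on purpose).
config = {
    # "llm": llm,  # add the LLM instance created earlier
    "save_logs": True,
    "verbose": False,
    "enforce_privacy": False,
    "save_charts": False,
    "save_charts_path": "exports/charts/",
    "open_charts": True,
    "enable_cache": True,
    "use_error_correction_framework": True,
    "max_retries": 3,
}

# Override only what differs from the defaults, e.g. for a run without a display.
headless = {**config, "open_charts": False, "verbose": True}
print(headless["open_charts"], headless["max_retries"])  # False 3
```

The resulting dictionary would be passed the same way as in the earlier cells, for example as the config argument of SmartDatalake or Agent together with the llm entry.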
