To get started with OpenVLA, it is worth looking at the OpenVLA GitHub repo, the embodied-agents GitHub repo, and SimplerEnv.
https://github.com/openvla/openvla
OpenVLA: An open-source vision-language-action model for robotic manipulation.
https://github.com/mbodiai/embodied-agents?tab=readme-ov-file
Seamlessly integrate state-of-the-art transformer models into robotics stacks.
https://github.com/simpler-env/SimplerEnv
Evaluating and reproducing real-world robot manipulation policies (e.g., RT-1, RT-1-X, Octo) in simulation under common setups (e.g., Google Robot, WidowX+Bridge) (CoRL 2024).
https://colab.research.google.com/drive/15mElfn43Ge5Uj_OeSs47OthB3c8mmGQL?usp=sharing
OpenVLA_tutorial_01.ipynb (Colab notebook)
I put together a tutorial that can be run in Colab (linked above).
Next, I wrote the code so that it can also run locally on a machine with a GPU.
Naturally, the first step is to set up the virtual environment and install the dependencies exactly as instructed in the OpenVLA GitHub repo.
Then create 01.test.ipynb inside the openvla folder.
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from PIL import Image
import requests
import time
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device
Import the required libraries.
Even when running locally, a GPU is still required.
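As a quick sanity check before going further (this is my own addition, assuming a local NVIDIA setup, not part of the original script), you can confirm that CUDA is visible and see how much VRAM is available:
# Optional sanity check for a local NVIDIA GPU (assumed setup, not in the original script)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA device found -- the quantized loading below will not work on CPU.")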
# === Verification Arguments
UNNORM_KEY = 'bridge_orig'
MODEL_PATH = "openvla/openvla-7b"
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
INSTRUCTION = "Pick up the remote"
UNNORM_KEY is bridge_orig, the dataset the pretrained model was trained on; predict_action uses it to pick the statistics for un-normalizing the output actions.
MODEL_PATH is the openvla/openvla-7b checkpoint on Hugging Face.
SYSTEM_PROMPT is the system (AI) prompt.
INSTRUCTION is the command to give the robot.
def get_openvla_prompt(instruction: str) -> str:
    if "v01" in MODEL_PATH:
        return f"{SYSTEM_PROMPT} USER: What action should the robot take to {instruction.lower()}? ASSISTANT:"
    else:
        return f"In: What action should the robot take to {instruction.lower()}?\nOut:"
I referred to vla-scripts/extern/verify_openvla.py; the function above is taken from there and used as-is.
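Since MODEL_PATH is "openvla/openvla-7b" (no "v01" in the name), the else branch is taken, so the prompt will look like this (just a quick preview; the same call is made again later in the notebook):
print(get_openvla_prompt(INSTRUCTION))
# In: What action should the robot take to pick up the remote?
# Out: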
# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
Multimodal models need two preprocessing tools at once: one for images and one for text. AutoProcessor bundles both, so we load it and assign it to processor.
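To see what the processor actually produces, here is a small sketch on a dummy image (the exact keys and shapes depend on the processor version, so treat this as illustrative only):
# Peek at the processor output on a dummy image (keys/shapes may vary by version)
dummy_image = Image.new("RGB", (256, 256))
dummy_inputs = processor("In: What action should the robot take to pick up the remote?\nOut:", dummy_image)
for key, value in dummy_inputs.items():
    print(key, getattr(value, "shape", type(value)))  # e.g. tokenized text plus preprocessed image tensors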
# === BFLOAT16 MODE ===
# vla = AutoModelForVision2Seq.from_pretrained(
#     "openvla/openvla-7b",
#     attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
#     torch_dtype=torch.bfloat16,
#     low_cpu_mem_usage=True,
#     trust_remote_code=True,
# ).to("cuda:0")
# === 8-BIT QUANTIZATION MODE (`pip install bitsandbytes`) :: [~9GB of VRAM Passive || 10GB of VRAM Active] ===
# print("[*] Loading in 8-Bit Quantization Mode")
# vla = AutoModelForVision2Seq.from_pretrained(
#     MODEL_PATH,
#     attn_implementation="flash_attention_2",
#     torch_dtype=torch.float16,
#     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
#     low_cpu_mem_usage=True,
#     trust_remote_code=True,
# )
# === 4-BIT QUANTIZATION MODE (`pip install bitsandbytes`) :: [~6GB of VRAM Passive || 7GB of VRAM Active] ===
print("[*] Loading in 4-Bit Quantization Mode")
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_PATH,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
Looking at vla-scripts/extern/verify_openvla.py, there are three loading variants: full bfloat16, 8-bit, and 4-bit. Thankfully, this means the model can be tried locally even on a fairly modest GPU.
To use the 4-bit or 8-bit quantization mode, install the libraries below:
pip install bitsandbytes
pip install accelerate
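One more local-setup note: flash_attention_2 requires the flash_attn package, which can be awkward to build on some machines. As an assumption about such setups (not something the OpenVLA repo prescribes, and whether it works depends on the model's remote modeling code), you can try falling back to PyTorch's built-in SDPA attention when loading:
# Fallback sketch if `flash_attn` is not installed (assumed local workaround)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_PATH,
    attn_implementation="sdpa",  # instead of "flash_attention_2"
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)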
Now let's build the prompt and load an image.
prompt = get_openvla_prompt(INSTRUCTION)
print(prompt)
# Load the image either from a local file or from a URL
# image_path = "./images/bridge_orig.jpeg"
# image = Image.open(image_path).convert("RGB")
DEFAULT_IMAGE_URL = (
    "https://api.mbodi.ai/community-models/file=/tmp/gradio/c213d531d13cdcd19391acfd08b14e629b1118063fd303d9da1f4b5e065857e4/example.jpeg"
)
image = Image.open(requests.get(DEFAULT_IMAGE_URL, stream=True).raw).convert("RGB")
import matplotlib.pyplot as plt
plt.imshow(image)
plt.axis("off") # Hide axis for better visualization
plt.show()
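If you want to use the commented-out local-file branch above on later runs, you can save the downloaded image once (the ./images path is just the one assumed by that commented code):
# Save the downloaded example image so the local-file branch can be used next time
import os
os.makedirs("./images", exist_ok=True)
image.save("./images/bridge_orig.jpeg")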
# === BFLOAT16 MODE ===
# inputs = processor(prompt, image).to(device, dtype=torch.bfloat16)
# === 8-BIT/4-BIT QUANTIZATION MODE ===
inputs = processor(prompt, image).to(device, dtype=torch.float16)
Don't forget that the input dtype must match the loading mode (bfloat16 for the full-precision mode, float16 for the quantized modes).
Then run OpenVLA inference.
# Run OpenVLA Inference
start_time = time.time()
action = vla.predict_action(**inputs, unnorm_key=UNNORM_KEY, do_sample=False)
print(f"=>> Time: {time.time() - start_time:.4f} \n Action: \n {action}")
=>> Time: 1.1822
Action:
[ 2.69897781e-03 -7.47882333e-04 8.33299030e-03 8.16284417e-03 -2.75751861e-02 -1.94082882e-02 9.96078431e-01]
The output is X, Y, Z, Roll, Pitch, Yaw, and Gripper.
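To make the 7-dimensional action easier to read, here is a small sketch that simply pairs each value with its name (the names follow the X/Y/Z/Roll/Pitch/Yaw/Gripper order above):
# Label each of the 7 action dimensions for readability
action_names = ["x", "y", "z", "roll", "pitch", "yaw", "gripper"]
for name, value in zip(action_names, action):
    print(f"{name:>8}: {value:+.5f}")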