This is a document about installing VLLM on a Mac. VLLM is usually installed on systems with an Nvidia GPU, but it can also be used on a Mac. The key is to install it so that it runs the LLM on the CPU.
Installing Brew
Install Brew on the Mac. Brew plays the same role as apt or dnf on Linux: various programs ported to macOS can be installed with the brew command. It does not just install them; it also tracks program updates and handles package removal.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Next, install Python 3.12 using brew.
Installing Python 3.12
Install python3.12 with brew.
/opt/homebrew/bin/brew install python@3.12
==> Fetching downloads for: python@3.12
==> Downloading https://ghcr.io/v2/homebrew/core/python/3.12/manifests/3.12.12
######################################################################## 100.0%
==> Fetching python@3.12
==> Downloading https://ghcr.io/v2/homebrew/core/python/3.12/blobs/sha256:a000fd4856f62bf895f00834763827b03270f1c9fdcb27326a32edac66b7fe05
######################################################################## 100.0%
==> Pouring python@3.12--3.12.12.arm64_tahoe.bottle.tar.gz
==> Caveats
Python is installed as
  /opt/homebrew/bin/python3.12

Unversioned and major-versioned symlinks `python`, `python3`, `python-config`, `python3-config`, `pip`, `pip3`, etc.
pointing to `python3.12`, `python3.12-config`, `pip3.12` etc., respectively, are installed into
  /opt/homebrew/opt/python@3.12/libexec/bin

If you do not need a specific version of Python, and always want Homebrew's `python3` in your PATH:
  brew install python3

`idle3.12` requires tkinter, which is available separately:
  brew install python-tk@3.12

See: https://docs.brew.sh/Homebrew-and-Python
==> Summary
🍺  /opt/homebrew/Cellar/python@3.12/3.12.12: 3,611 files, 66.8MB
==> Running `brew cleanup python@3.12`...
The installation is complete.
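If you want to make sure the interpreter actually works before moving on, a quick version check is enough (the output below assumes the 3.12.12 bottle installed above):

$ /opt/homebrew/bin/python3.12 --version
Python 3.12.12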
Creating a VirtualEnv
Python 3 provides virtual environments. Create a Python virtual environment for the VLLM installation as follows.
$ pwd
/Users/systemv
$ /opt/homebrew/bin/python3.12 -m venv python3.12
Activate the virtual environment.
$ source python3.12/bin/activate
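Once activated, commands such as python3 and pip should resolve into the venv's bin directory. A quick way to confirm (the path assumes the /Users/systemv layout used here):

$ which python3
/Users/systemv/python3.12/bin/python3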
Installing VLLM
To install VLLM, first download the source.
$ pwd
/Users/systemv/python3.12
$ git clone https://github.com/vllm-project/vllm.git
Once the VLLM source has been downloaded successfully, install it as a Python package. The following commands are run from inside the cloned vllm directory.
$ pip install -r requirements/cpu.txt   # install dependency packages
$ export VLLM_TARGET_DEVICE=cpu
$ export VLLM_BUILD_WITH_CUDA=0
$ pip install -e .
Building wheels for collected packages: vllm
  Building editable for vllm (pyproject.toml) ... done
  Created wheel for vllm: filename=vllm-0.11.1rc2.dev191+g80e945298-0.editable-cp312-cp312-macosx_26_0_arm64.whl size=14144 sha256=bc26664471cc7220ad02b629c4db2060b4e16cf5b150f65203b8f2f649a6b0cb
  Stored in directory: /private/var/folders/zl/btg3q3c167d57jph52vg688c0000gn/T/pip-ephem-wheel-cache-02cmvmxd/wheels/ac/31/60/d6c757e5b6aacd66f820349f9c37c76eec071553670e52274f
Successfully built vllm
Installing collected packages: vllm
Successfully installed vllm-0.11.1rc2.dev191+g80e945298
Check that VLLM was installed correctly as follows.
$ vllm --version
INFO 10-21 19:04:17 [__init__.py:225] Automatically detected platform cpu.
INFO 10-21 19:04:17 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
0.11.1rc2.dev191+g80e945298
$ which vllm
/Users/systemv/python3.12/bin/vllm
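You can also check that the package imports cleanly from Python inside the venv. This is a minimal sanity check; the same "detected platform cpu" INFO lines may be printed before the version string.

$ python3 -c "import vllm; print(vllm.__version__)"
0.11.1rc2.dev191+g80e945298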
Testing
Now run a test using the following code.
from vllm import LLM, SamplingParams

# Define the list of prompts to run inference on.
prompts = [
    "Hello, my name is",
    "The future of AI is",
    "It's a beautiful day, isn't it?",
]

# Define the sampling parameters.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

# Create the LLM object with the model to use.
# The first run downloads the model, so it may take a while.
llm = LLM(model="facebook/opt-125m")

# Generate text for the prompts.
outputs = llm.generate(prompts, sampling_params)

# Print the results.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Running it, however, produced the following error.
$ python3.12 vllm_test.py
INFO 10-21 19:33:59 [__init__.py:225] Automatically detected platform cpu.
INFO 10-21 19:33:59 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 10-21 19:34:00 [utils.py:243] non-default args: {'disable_log_stats': True, 'model': 'facebook/opt-125m'}
INFO 10-21 19:34:01 [model.py:663] Resolved architecture: OPTForCausalLM
INFO 10-21 19:34:01 [model.py:1751] Using max model len 2048
INFO 10-21 19:34:02 [arg_utils.py:1314] Chunked prefill is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
INFO 10-21 19:34:02 [arg_utils.py:1320] Prefix caching is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
INFO 10-21 19:34:03 [__init__.py:225] Automatically detected platform cpu.
INFO 10-21 19:34:03 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 10-21 19:34:04 [utils.py:243] non-default args: {'disable_log_stats': True, 'model': 'facebook/opt-125m'}
INFO 10-21 19:34:05 [model.py:663] Resolved architecture: OPTForCausalLM
INFO 10-21 19:34:05 [model.py:1751] Using max model len 2048
INFO 10-21 19:34:06 [arg_utils.py:1314] Chunked prefill is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
INFO 10-21 19:34:06 [arg_utils.py:1320] Prefix caching is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/spawn.py", line 131, in _main
    prepare(preparation_data)
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "<frozen runpy>", line 287, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/systemv/python3.12/vllm_examples/vllm_test.py", line 15, in <module>
    llm = LLM(model="facebook/opt-125m")
  File "/Users/systemv/python3.12/vllm/vllm/entrypoints/llm.py", line 324, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/llm_engine.py", line 188, in from_engine_args
    return cls(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/llm_engine.py", line 122, in __init__
    self.engine_core = EngineCoreClient.make_client(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/core_client.py", line 93, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/core_client.py", line 639, in __init__
    super().__init__(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/core_client.py", line 468, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/utils.py", line 862, in launch_core_engines
    local_engine_manager = CoreEngineProcManager(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/utils.py", line 142, in __init__
    proc.start()
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
    return Popen(process_obj)
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/spawn.py", line 164, in get_preparation_data
    _check_not_importing_main()
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/spawn.py", line 140, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

Traceback (most recent call last):
  File "/Users/systemv/python3.12/vllm_examples/vllm_test.py", line 15, in <module>
    llm = LLM(model="facebook/opt-125m")
  File "/Users/systemv/python3.12/vllm/vllm/entrypoints/llm.py", line 324, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/llm_engine.py", line 188, in from_engine_args
    return cls(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/llm_engine.py", line 122, in __init__
    self.engine_core = EngineCoreClient.make_client(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/core_client.py", line 93, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/core_client.py", line 639, in __init__
    super().__init__(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/core_client.py", line 468, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
  File "/opt/homebrew/Cellar/python@3.12/3.12.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/utils.py", line 880, in launch_core_engines
    wait_for_engine_startup(
  File "/Users/systemv/python3.12/vllm/vllm/v1/engine/utils.py", line 937, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
It does not work as-is.
Troubleshooting
The problem is that vLLM starts its engine core in a separate process using multiprocessing's spawn start method, which re-imports the main module; the script therefore needs its entry point wrapped in an if __name__ == "__main__": guard (a main() function). Modify the source code as follows.
from vllm import LLM, SamplingParams

# Define the list of prompts to run inference on.
prompts = [
    "Hello, my name is",
    "The future of AI is",
    "It's a beautiful day, isn't it?",
]

# Define the sampling parameters.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

def main():
    # Create the LLM object with the model to use.
    # The first run downloads the model, so it may take a while.
    llm = LLM(
        model="facebook/opt-125m",
        tensor_parallel_size=1,
        enable_prefix_caching=False,
        trust_remote_code=True,
        gpu_memory_utilization=0.75,
        max_model_len=2048
    )

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
After changing the code as above, it runs fine.
API Server
VLLM can also be served through an API server: the client sends a request in JSON format and gets the result back. The command to start the API server is as follows.
#!/bin/zsh

export VLLM_USE_CUDA=0

python3.12 -m vllm.entrypoints.api_server \
    --port 8000 \
    --model facebook/opt-125m \
    --tensor_parallel_size 1 \
    --max-model-len 2048 \
    --enable-prefix-caching \
    --trust-remote-code \
    --gpu-memory-utilization 0.75
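Once the server is running, it can be exercised with a plain HTTP request. The sketch below is a minimal example assuming the demo /generate endpoint exposed by vllm.entrypoints.api_server, with the prompt and sampling parameters in the JSON body; the exact field names and response shape can differ between vLLM versions, so treat it as a starting point rather than a fixed contract.

#!/bin/zsh
# Minimal client call against the API server started above.
# Assumes the server listens on localhost:8000 and exposes POST /generate.
curl -s http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{
          "prompt": "The future of AI is",
          "max_tokens": 100,
          "temperature": 0.8,
          "top_p": 0.95
        }'

The response should come back as JSON containing the text generated for the prompt.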