2025-06-12T15:18:41.718561091Z INFO 06-12 15:18:41 [__init__.py:243] Automatically detected platform cuda.
2025-06-12T15:18:44.783595731Z INFO 06-12 15:18:44 [__init__.py:31] Available plugins for group vllm.general_plugins:
2025-06-12T15:18:44.783627483Z INFO 06-12 15:18:44 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
2025-06-12T15:18:44.783632686Z INFO 06-12 15:18:44 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2025-06-12T15:18:44.784140833Z engine.py :27 2025-06-12 15:18:44,783 Engine args: AsyncEngineArgs(model='meta-llama/Llama-3.1-8B-Instruct', served_model_name=None, tokenizer=None, hf_config_path=None, task='auto', skip_tokenizer_init=False, enable_prompt_embeds=False, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path='', download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', seed=0, max_model_len=None, cuda_graph_sizes=[512], distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, data_parallel_size_local=None, data_parallel_address=None, data_parallel_rpc_port=None, enable_expert_parallel=False, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, prefix_caching_hash_algo='builtin', disable_sliding_window=False, disable_cascade_attn=False, use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling={}, rope_theta=None, hf_token=None, hf_overrides={}, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, fully_sharded_loras=False, max_cpu_loras=None, lora_dtype='auto', lora_extra_vocab_size=256, long_lora_scaling_factors=None, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config={}, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, disable_chunked_mm_input=False, guided_decoding_backend='outlines', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, logits_processor_pattern=None, speculative_config=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config={}, override_pooler_config=None, compilation_config=None, worker_cls='auto', worker_extension_cls='', kv_transfer_config=None, kv_events_config=None, generation_config='auto', enable_sleep_mode=False, override_generation_config={}, model_impl='auto', calculate_kv_scales=False, additional_config=None, enable_reasoning=None, reasoning_parser='', use_tqdm_on_load=True, pt_load_map_location='cpu', disable_log_requests=False)
2025-06-12T15:18:57.895630392Z tokenizer_name_or_path: meta-llama/Llama-3.1-8B-Instruct, tokenizer_revision: None, trust_remote_code: False
2025-06-12T15:18:57.895671966Z INFO 06-12 15:18:57 [config.py:793] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
2025-06-12T15:18:57.895845028Z WARNING 06-12 15:18:57 [arg_utils.py:1583] --guided-decoding-backend=outlines is not supported by the V1 Engine. Falling back to V0.
2025-06-12T15:18:57.895896295Z WARNING 06-12 15:18:57 [arg_utils.py:1420] Chunked prefill is enabled by default for models with max_model_len > 32K. Chunked prefill might not work with some features or models. If you encounter any issues, please disable by launching with --enable-chunked-prefill=False.
2025-06-12T15:18:57.896422636Z INFO 06-12 15:18:57 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=2048.
2025-06-12T15:18:57.899404060Z INFO 06-12 15:18:57 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.0.1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='outlines', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=False,
2025-06-12T15:18:59.194045264Z INFO 06-12 15:18:59 [cuda.py:292] Using Flash Attention backend.
2025-06-12T15:18:59.775721159Z INFO 06-12 15:18:59 [parallel_state.py:1064] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
2025-06-12T15:18:59.778065862Z INFO 06-12 15:18:59 [model_runner.py:1170] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
2025-06-12T15:19:00.246102244Z INFO 06-12 15:19:00 [weight_utils.py:291] Using model weights format ['*.safetensors']
2025-06-12T15:19:17.266335310Z INFO 06-12 15:19:17 [weight_utils.py:307] Time spent downloading weights for meta-llama/Llama-3.1-8B-Instruct: 17.019623 seconds
2025-06-12T15:19:17.566182854Z Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00
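
The "Engine args:" entry above records the full AsyncEngineArgs this deployment started with. Below is a minimal sketch of how roughly the same configuration could be built programmatically with vLLM's public AsyncLLMEngine API; this assumes the standard entrypoint (the actual engine.py wrapper behind this log is not shown), passes only a handful of the logged values explicitly, and leaves everything else at its defaults.

import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Values below are copied from the "Engine args:" log entry; all other
# fields keep their vLLM defaults. This is an illustrative reconstruction,
# not the deployment's actual launch code.
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.95,          # logged: gpu_memory_utilization=0.95
    max_num_seqs=256,                     # logged: max_num_seqs=256
    max_logprobs=20,                      # logged: max_logprobs=20
    guided_decoding_backend="outlines",   # logged; triggers the V1 -> V0 fallback warning
    enable_prefix_caching=False,          # logged: enable_prefix_caching=False
    swap_space=4,                         # logged: swap_space=4 (GiB of CPU swap)
    seed=0,
)

async def main() -> None:
    # Constructing the engine downloads and loads the safetensors shards,
    # the ~17 s step visible at the end of the log.
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    params = SamplingParams(max_tokens=32)
    final = None
    # generate() yields incremental RequestOutput objects; keep the last one.
    async for output in engine.generate("Hello", params, request_id="demo-0"):
        final = output
    print(final.outputs[0].text)

if __name__ == "__main__":
    asyncio.run(main())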