usage: koboldcpp.exe [-h] [--model [filenames] [[filenames] ...]] [--port [portnumber]] [--host [ipaddr]]
                     [--launch] [--config [filename]] [--threads [threads]]
                     [--usecuda [[lowvram|normal] [main GPU ID] [mmq|nommq] [rowsplit] ...]]
                     [--usevulkan [[Device IDs] [[Device IDs] ...]]]
                     [--useclblast {0,1,2,3,4,5,6,7,8} {0,1,2,3,4,5,6,7,8}] [--usecpu]
                     [--contextsize [256 to 262144]] [--gpulayers [[GPU layers]]]
                     [--tensor_split [Ratios] [[Ratios] ...]] [--version] [--analyze [filename]]
                     [--maingpu [Device ID]] [--ropeconfig [rope-freq-scale] [[rope-freq-base] ...]]
                     [--blasbatchsize {-1,16,32,64,128,256,512,1024,2048}] [--blasthreads [threads]]
                     [--lora [lora_filename] [[lora_filename] ...]] [--loramult [amount]] [--noshift]
                     [--nofastforward] [--useswa] [--usemmap] [--usemlock] [--noavx2] [--failsafe]
                     [--debugmode [DEBUGMODE]] [--onready [shell command]] [--benchmark [[filename]]]
                     [--prompt [prompt]] [--cli] [--promptlimit [token limit]] [--multiuser [limit]]
                     [--multiplayer] [--websearch] [--remotetunnel] [--highpriority] [--foreground]
                     [--preloadstory [savefile]] [--savedatafile [savefile]] [--quiet]
                     [--ssl [cert_pem] [[key_pem] ...]] [--nocertify] [--mmproj [filename]] [--mmprojcpu]
                     [--visionmaxres [max px]] [--draftmodel [filename]] [--draftamount [tokens]]
                     [--draftgpulayers [layers]] [--draftgpusplit [Ratios] [[Ratios] ...]]
                     [--password [API key]] [--ignoremissing] [--chatcompletionsadapter [filename]]
                     [--flashattention] [--quantkv [quantization level 0/1/2]] [--forceversion [version]]
                     [--smartcontext] [--unpack destination] [--exportconfig [filename]]
                     [--exporttemplate [filename]] [--nomodel] [--moeexperts [num of experts]]
                     [--defaultgenamt DEFAULTGENAMT] [--nobostoken] [--enableguidance]
                     [--maxrequestsize [size in MB]] [--overridekv [name=type:value]]
                     [--overridetensors [tensor name pattern=buffer type]]
                     [--developerMode DEVELOPERMODE] [--showgui | --skiplauncher] [--singleinstance]
                     [--hordemodelname [name]] [--hordeworkername [name]] [--hordekey [apikey]]
                     [--hordemaxctx [amount]] [--hordegenlen [amount]] [--sdmodel [filename]]
                     [--sdthreads [threads]] [--sdclamped [[maxres]]] [--sdclampedsoft [maxres]]
                     [--sdt5xxl [filename]] [--sdclipl [filename]] [--sdclipg [filename]]
                     [--sdphotomaker [filename]] [--sdvae [filename] | --sdvaeauto]
                     [--sdquant | --sdlora [filename]] [--sdloramult [amount]] [--sdtiledvae [maxres]]
                     [--whispermodel [filename]] [--ttsmodel [filename]] [--ttswavtokenizer [filename]]
                     [--ttsgpu] [--ttsmaxlen TTSMAXLEN] [--ttsthreads [threads]]
                     [--embeddingsmodel [filename]] [--embeddingsmaxctx [amount]] [--embeddingsgpu]
                     [--admin] [--adminpassword [password]] [--admindir [directory]]
                     [--admintextmodelsdir [directory]] [--admindatadir [directory]] [--adminallowhf]
                     [model_param] [port_param]

KoboldCpp Server - Version 1.96

positional arguments:
  model_param           Model file to load (positional)
  port_param            Port to listen on (positional)

optional arguments:
  -h, --help            Show this help message and exit.
  --model [filenames] [[filenames] ...]
                        Model file to load. Accepts multiple values if they are URLs.
  --port [portnumber]   Port to listen on. (Defaults to 5001)
  --host [ipaddr]       Host IP to listen on. If this flag is not set, all routable interfaces are
                        accepted.
  --launch              Launches a web browser when loading is complete.
  --config [filename]   Load settings from a .kcpps file. Other arguments will be ignored.
  --threads [threads]   Use a custom number of threads if specified. Otherwise, uses an amount
                        based on CPU cores.
  --usecuda, --usecublas, --usehipblas [[lowvram|normal] [main GPU ID] [mmq|nommq] [rowsplit] ...]
                        Use CUDA for GPU acceleration. Requires CUDA. Enter a number afterwards to
                        select and use 1 GPU; leaving no number will use all GPUs.
  --usevulkan [[Device IDs] [[Device IDs] ...]]
                        Use Vulkan for GPU acceleration. Can optionally specify one or more GPU
                        Device IDs (e.g. --usevulkan 0); leave blank to autodetect.
  --useclblast {0,1,2,3,4,5,6,7,8} {0,1,2,3,4,5,6,7,8}
                        Use CLBlast for GPU acceleration. Must specify exactly 2 arguments: platform
                        ID and device ID (e.g. --useclblast 1 0).
  --usecpu              Do not use any GPU acceleration (CPU only).
  --contextsize [256 to 262144]
                        Controls the memory allocated for maximum context size; only change if you
                        need more RAM for big contexts. (default 8192)
  --gpulayers [[GPU layers]]
                        Set the number of layers to offload to GPU. Requires a GPU. Set to -1 to
                        try autodetect, or 0 to disable GPU offload.
  --tensor_split [Ratios] [[Ratios] ...]
                        For CUDA and Vulkan only: the ratio to split tensors across multiple GPUs,
                        as a space-separated list of proportions, e.g. 7 3.
  --showgui             Always show the GUI instead of launching the model right away when loading
                        settings from a .kcpps file.
  --skiplauncher        Doesn't display or use the GUI launcher. Overrides --showgui.

Advanced Commands:
  --version             Prints version and exits.
  --analyze [filename]  Reads the metadata, weight types and tensor names in any GGUF file.
  --maingpu [Device ID]
                        Only used in a multi-GPU setup. Sets the index of the main GPU to be used.
  --ropeconfig [rope-freq-scale] [[rope-freq-base] ...]
                        If set, uses customized RoPE scaling from the configured frequency scale
                        and frequency base (e.g. --ropeconfig 0.25 10000). Otherwise, uses NTK-aware
                        scaling set automatically based on context size. For linear RoPE, simply
                        set the freq-scale and ignore the freq-base.
  --blasbatchsize {-1,16,32,64,128,256,512,1024,2048}
                        Sets the batch size used in BLAS processing (default 512). Setting it to -1
                        disables BLAS mode but keeps other benefits like GPU offload.
  --blasthreads [threads]
                        Use a different number of threads during BLAS if specified. Otherwise, has
                        the same value as --threads.
  --lora [lora_filename] [[lora_filename] ...]
                        GGUF models only; applies a LoRA file on top of the model.
  --loramult [amount]   Multiplier applied to the text LoRA model.
  --noshift             If set, do not attempt to Trim and Shift the GGUF context.
  --nofastforward       If set, do not attempt to fast-forward the GGUF context (always reprocess).
                        Also enables --noshift.
  --useswa              If set, allows Sliding Window Attention (SWA) KV cache, which saves memory
                        but cannot be used with context shifting.
  --usemmap             If set, uses mmap to load the model.
  --usemlock            Enables mlock, preventing the RAM used to load the model from being paged
                        out. Not usually recommended.
  --noavx2              Do not use AVX2 instructions; a slower compatibility mode for older devices.
  --failsafe            Use failsafe mode, an extremely slow CPU-only compatibility mode that
                        should work on all devices. Can be combined with --useclblast if your
                        device supports OpenCL.
  --debugmode [DEBUGMODE]
                        Shows additional debug info in the terminal.
  --onready [shell command]
                        An optional shell command to execute after the model has been loaded.
  --benchmark [[filename]]
                        Do not start the server; instead run benchmarks. If a filename is provided,
                        appends results to that file.
  --prompt [prompt]     Passing a prompt string triggers a direct inference: the model is loaded,
                        the response is printed to stdout, and the program exits. Can be used alone
                        or with --benchmark.
  --cli                 Does not launch the KoboldCpp HTTP server. Instead, runs KoboldCpp from the
                        command line, accepting interactive console input and displaying responses
                        in the terminal.
  --promptlimit [token limit]
                        Sets the maximum number of generated tokens; usable only with --prompt or
                        --benchmark.
  --multiuser [limit]   Runs in multiuser mode, which queues incoming requests instead of blocking
                        them.
  --multiplayer         Hosts a shared multiplayer session that others can join.
  --websearch           Enables the local search engine proxy so web searches can be done.
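A typical launch combines a model file, an acceleration backend, and a context size. The sketches below only use flags documented above; the model and output filenames are hypothetical placeholders.

```shell
# Load a local GGUF model with CUDA acceleration, autodetect how many layers
# fit on the GPU, and open the web UI once loading completes:
koboldcpp.exe --model mymodel.gguf --usecuda normal 0 mmq --gpulayers -1 \
  --contextsize 16384 --launch

# Run benchmarks instead of starting the server, appending results to a file:
koboldcpp.exe --model mymodel.gguf --benchmark results.csv

# One-shot inference to stdout, capped at 128 generated tokens:
koboldcpp.exe --model mymodel.gguf --prompt "Hello, world" --promptlimit 128
```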
  --remotetunnel        Uses Cloudflare to create a remote tunnel, allowing you to access koboldcpp
                        remotely over the internet, even behind a firewall.
  --highpriority        Experimental flag. If set, increases the process CPU priority, potentially
                        speeding up generation. Use with caution.
  --foreground          Windows only. Sends the terminal to the foreground every time a new prompt
                        is generated. This helps avoid some idle slowdown issues.
  --preloadstory [savefile]
                        Configures a prepared story JSON save file to be hosted on the server,
                        which frontends (such as KoboldAI Lite) can access over the API.
  --savedatafile [savefile]
                        If enabled, creates or opens a persistent database file on the server that
                        allows users to save and load their data remotely. A new file is created if
                        it does not exist.
  --quiet               Enables quiet mode, which hides generation inputs and outputs in the
                        terminal. Quiet mode is automatically enabled when running a horde worker.
  --ssl [cert_pem] [[key_pem] ...]
                        Allows all content to be served over SSL instead. Valid unencrypted SSL
                        cert and key .pem files must be provided.
  --nocertify           Allows insecure SSL connections. Use this if you have cert errors and need
                        to bypass certificate restrictions.
  --mmproj [filename]   Select a multimodal projector file for vision models like LLaVA.
  --mmprojcpu           Force CLIP for the vision mmproj to always run on CPU.
  --visionmaxres [max px]
                        Clamp the maximum allowed MMProj vision resolution. Allowed values are
                        between 512 and 2048 px (default 1024).
  --draftmodel [filename]
                        Load a small draft model for speculative decoding. It will be fully
                        offloaded. Its vocab must match the main model.
  --draftamount [tokens]
                        How many tokens to draft per chunk before verifying results.
  --draftgpulayers [layers]
                        How many layers to offload to GPU for the draft model (default=full
                        offload).
  --draftgpusplit [Ratios] [[Ratios] ...]
                        GPU layer distribution ratio for the draft model (default=same as main).
                        Only works if multiple GPUs are selected for the MAIN model and
                        --tensor_split is set!
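The draft-model and vision flags above are typically combined with a main model like so (illustrative only; all filenames are placeholders):

```shell
# Speculative decoding: a small, fully offloaded draft model proposes tokens in
# chunks of 8, which the main model then verifies:
koboldcpp.exe --model big-model.gguf --draftmodel small-model.gguf --draftamount 8

# Vision: attach a multimodal projector, clamping image resolution to 1024 px:
koboldcpp.exe --model mymodel.gguf --mmproj mmproj.gguf --visionmaxres 1024
```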
  --password [API key]  Enter a password required to use this instance. This key will be required
                        for all text endpoints. Image endpoints are not secured.
  --ignoremissing       Ignores all missing non-essential files, just skipping them instead.
  --chatcompletionsadapter [filename]
                        Select an optional ChatCompletions Adapter JSON file to force custom
                        instruct tags.
  --flashattention      Enables flash attention.
  --quantkv [quantization level 0/1/2]
                        Sets the KV cache data type quantization: 0=f16, 1=q8, 2=q4. Requires
                        flash attention for full effect; otherwise only the K cache is quantized.
  --forceversion [version]
                        If model file format detection fails (e.g. a rogue modified model), you can
                        set this to override the detected format (enter the desired version, e.g.
                        401 for GPTNeoX-Type2).
  --smartcontext        Reserves a portion of context to try to process the prompt less frequently.
                        Outdated. Not recommended.
  --unpack destination  Extracts the file contents of the KoboldCpp binary into a target directory.
  --exportconfig [filename]
                        Exports the currently selected arguments as a .kcpps settings file.
  --exporttemplate [filename]
                        Exports the currently selected arguments as a .kcppt template file.
  --nomodel             Allows you to launch the GUI alone, without selecting any model.
  --moeexperts [num of experts]
                        How many experts to use for MoE models (default=follow GGUF).
  --defaultgenamt DEFAULTGENAMT
                        How many tokens to generate by default, if not specified. Must be smaller
                        than the context size. Usually, your frontend GUI will override this.
  --nobostoken          Prevents the BOS token from being added at the start of any prompt. Usually
                        NOT recommended for most models.
  --enableguidance      Enables Classifier-Free Guidance, which allows the use of negative prompts.
                        Has performance and memory impact.
  --maxrequestsize [size in MB]
                        Specify a max request payload size. Any requests to the server larger than
                        this size will be dropped. Do not change if unsure.
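For a publicly reachable instance, the security and memory flags above are commonly combined. A sketch (the key, cert, and model filenames are placeholders):

```shell
# Require an API key on text endpoints, serve over SSL, and enable flash
# attention with q8 KV cache quantization to reduce memory use:
koboldcpp.exe --model mymodel.gguf --password mysecretkey \
  --ssl cert.pem key.pem --flashattention --quantkv 1
```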
  --overridekv [name=type:value]
                        Advanced option to override model metadata by key, same as in llama.cpp.
                        Mainly for debugging, not intended for general use. Types: int, float,
                        bool, str.
  --overridetensors [tensor name pattern=buffer type]
                        Advanced option to override tensor backend selection, same as in llama.cpp.
  --developerMode DEVELOPERMODE
                        Enables developer utilities, such as hot reloading of Kobold Lite.
  --singleinstance      Allows this KoboldCpp instance to be shut down by any new instance
                        requesting the same port, preventing duplicate servers from clashing on a
                        port.

Horde Worker Commands:
  --hordemodelname [name]
                        Sets your AI Horde display model name.
  --hordeworkername [name]
                        Sets your AI Horde worker name.
  --hordekey [apikey]   Sets your AI Horde API key.
  --hordemaxctx [amount]
                        Sets the maximum context length your worker will accept from an AI Horde
                        job. If 0, matches the main context limit.
  --hordegenlen [amount]
                        Sets the maximum number of tokens your worker will generate for an AI
                        Horde job.

Image Generation Commands:
  --sdmodel [filename]  Specify an image generation safetensors or GGUF model to enable image
                        generation.
  --sdthreads [threads]
                        Use a different number of threads for image generation if specified.
                        Otherwise, has the same value as --threads.
  --sdclamped [[maxres]]
                        If specified, limit generation steps and image size for shared use. Accepts
                        an extra optional parameter indicating the maximum resolution (e.g. 768
                        clamps to 768x768; min 512 px; disabled if 0).
  --sdclampedsoft [maxres]
                        If specified, limit max image size to curb memory usage. Similar to
                        --sdclamped but less strict: allows trade-offs between width and height
                        (e.g. 640 would allow 640x640, 512x768 and 768x512 images). Total
                        resolution cannot exceed 1MP.
  --sdt5xxl [filename]  Specify a T5-XXL safetensors model for use in SD3 or Flux. Leave blank if
                        prebaked or unused.
  --sdclipl [filename]  Specify a Clip-L safetensors model for use in SD3 or Flux. Leave blank if
                        prebaked or unused.
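The Horde flags above are normally used together, as are the SD3/Flux image-generation flags. Two illustrative sketches (worker name, key, and filenames are placeholders):

```shell
# Share a loaded model as an AI Horde worker; quiet mode is then enabled
# automatically:
koboldcpp.exe --model mymodel.gguf --hordemodelname koboldcpp/MyModel \
  --hordeworkername MyWorker --hordekey 0000000000 \
  --hordemaxctx 4096 --hordegenlen 256

# Flux-style image generation: the image model plus its separate text encoders,
# clamped to 768x768 for shared use:
koboldcpp.exe --sdmodel flux-model.gguf --sdt5xxl t5xxl.safetensors \
  --sdclipl clip_l.safetensors --sdclamped 768
```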
  --sdclipg [filename]  Specify a Clip-G safetensors model for use in SD3. Leave blank if prebaked
                        or unused.
  --sdphotomaker [filename]
                        PhotoMaker is a model that allows face cloning. Specify a PhotoMaker
                        safetensors model, which will be applied in place of img2img. SDXL models
                        only. Leave blank if unused.
  --sdvae [filename]    Specify an image generation safetensors VAE which replaces the one in the
                        model.
  --sdvaeauto           Uses a built-in VAE via TAE SD, which is very fast and fixes bad VAEs.
  --sdquant             If specified, loads the model quantized to save memory.
  --sdlora [filename]   Specify an image generation LoRA safetensors model to be applied.
  --sdloramult [amount]
                        Multiplier applied to the image LoRA model.
  --sdtiledvae [maxres]
                        Adjust the automatic VAE tiling trigger for images above this size. 0
                        disables VAE tiling.

Whisper Transcription Commands:
  --whispermodel [filename]
                        Specify a Whisper .bin model to enable speech-to-text transcription.

TTS Narration Commands:
  --ttsmodel [filename]
                        Specify the OuteTTS text-to-speech GGUF model.
  --ttswavtokenizer [filename]
                        Specify the WavTokenizer GGUF model.
  --ttsgpu              Use the GPU for TTS.
  --ttsmaxlen TTSMAXLEN
                        Limit the number of audio tokens generated with TTS.
  --ttsthreads [threads]
                        Use a different number of threads for TTS if specified. Otherwise, has the
                        same value as --threads.

Embeddings Model Commands:
  --embeddingsmodel [filename]
                        Specify an embeddings model to be loaded for generating embedding vectors.
  --embeddingsmaxctx [amount]
                        Overrides the default maximum supported context of an embeddings model
                        (defaults to the trained context).
  --embeddingsgpu       Attempts to offload layers of the embeddings model to GPU. Usually not
                        needed.

Administration Commands:
  --admin               Enables admin mode, allowing you to unload and reload different
                        configurations or models.
  --adminpassword [password]
                        Require a password to access admin functions. You are strongly advised to
                        use one for publicly accessible instances!
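The speech and embeddings models above can all be loaded alongside the main text model in one invocation. A sketch (all filenames are placeholders):

```shell
# Text generation plus speech-to-text (Whisper), text-to-speech (OuteTTS with
# its WavTokenizer), and an embeddings model in a single server:
koboldcpp.exe --model mymodel.gguf --whispermodel whisper.bin \
  --ttsmodel outetts.gguf --ttswavtokenizer wavtok.gguf \
  --embeddingsmodel embed.gguf
```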
  --admindir [directory]
                        Specify a directory to look for .kcpps configs in, which can be used to
                        swap models.
  --admintextmodelsdir [directory]
                        Used with remote-control config switching. By passing in this argument,
                        models in the directory will be available for restart operations.
  --admindatadir [directory]
                        Specify a directory to store user data in. By passing in this argument,
                        users with the admin password will be able to save and load data from the
                        server database.
  --adminallowhf        Enables downloading of HuggingFace models through the Lite UI.
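Putting the administration flags together, a remotely manageable instance might be launched as follows (paths, password, and model filename are placeholders):

```shell
# Admin mode with a password, pointing at a directory of .kcpps configs that
# can be swapped at runtime and a directory for the user data database:
koboldcpp.exe --model mymodel.gguf --admin --adminpassword changeme \
  --admindir ./configs --admindatadir ./userdata
```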