usage: koboldcpp.exe [-h] [--model [filenames] [[filenames] ...]] [--port [portnumber]] [--host [ipaddr]]
                     [--launch] [--config [filename]] [--threads [threads]]
                     [--usecuda [[lowvram|normal] [main GPU ID] [mmq|nommq] [rowsplit] ...]]
                     [--usevulkan [[Device IDs] [[Device IDs] ...]]]
                     [--useclblast {0,1,2,3,4,5,6,7,8} {0,1,2,3,4,5,6,7,8}] [--usecpu]
                     [--contextsize [256 to 262144]] [--gpulayers [[GPU layers]]]
                     [--tensor_split [Ratios] [[Ratios] ...]] [--version] [--analyze [filename]]
                     [--maingpu [Device ID]] [--ropeconfig [rope-freq-scale] [[rope-freq-base] ...]]
                     [--blasbatchsize {-1,16,32,64,128,256,512,1024,2048}] [--blasthreads [threads]]
                     [--lora [lora_filename] [[lora_filename] ...]] [--loramult [amount]] [--noshift]
                     [--nofastforward] [--useswa] [--usemmap] [--usemlock] [--noavx2] [--failsafe]
                     [--debugmode [DEBUGMODE]] [--onready [shell command]] [--benchmark [[filename]]]
                     [--prompt [prompt]] [--cli] [--promptlimit [token limit]] [--multiuser [limit]]
                     [--multiplayer] [--websearch] [--remotetunnel] [--highpriority] [--foreground]
                     [--preloadstory [savefile]] [--savedatafile [savefile]] [--quiet]
                     [--ssl [cert_pem] [[key_pem] ...]] [--nocertify] [--mmproj [filename]] [--mmprojcpu]
                     [--visionmaxres [max px]] [--draftmodel [filename]] [--draftamount [tokens]]
                     [--draftgpulayers [layers]] [--draftgpusplit [Ratios] [[Ratios] ...]]
                     [--password [API key]] [--ignoremissing] [--chatcompletionsadapter [filename]]
                     [--flashattention] [--quantkv [quantization level 0/1/2]] [--forceversion [version]]
                     [--smartcontext] [--unpack destination] [--exportconfig [filename]]
                     [--exporttemplate [filename]] [--nomodel] [--moeexperts [num of experts]]
                     [--defaultgenamt DEFAULTGENAMT] [--nobostoken] [--enableguidance]
                     [--maxrequestsize [size in MB]] [--overridekv [name=type:value]]
                     [--overridetensors [tensor name pattern=buffer type]]
                     [--developerMode DEVELOPERMODE] [--showgui | --skiplauncher] [--singleinstance]
                     [--hordemodelname [name]] [--hordeworkername [name]] [--hordekey [apikey]]
                     [--hordemaxctx [amount]] [--hordegenlen [amount]] [--sdmodel [filename]]
                     [--sdthreads [threads]] [--sdclamped [[maxres]]] [--sdclampedsoft [maxres]]
                     [--sdt5xxl [filename]] [--sdclipl [filename]] [--sdclipg [filename]]
                     [--sdphotomaker [filename]] [--sdvae [filename] | --sdvaeauto]
                     [--sdquant | --sdlora [filename]] [--sdloramult [amount]] [--sdtiledvae [maxres]]
                     [--whispermodel [filename]] [--ttsmodel [filename]] [--ttswavtokenizer [filename]]
                     [--ttsgpu] [--ttsmaxlen TTSMAXLEN] [--ttsthreads [threads]]
                     [--embeddingsmodel [filename]] [--embeddingsmaxctx [amount]] [--embeddingsgpu]
                     [--admin] [--adminpassword [password]] [--admindir [directory]]
                     [--admintextmodelsdir [directory]] [--admindatadir [directory]] [--adminallowhf]
                     [model_param] [port_param]

KoboldCpp Server - Version 1.96

positional arguments:
  model_param           Model file to load (positional)
  port_param            Port to listen on (positional)

optional arguments:
  -h, --help            Show this help message and exit.
  --model [filenames] [[filenames] ...]
                        Model file to load. Accepts multiple values if they are URLs.
  --port [portnumber]   Port to listen on. (Defaults to 5001)
  --host [ipaddr]       Host IP to listen on. If this flag is not set, all routable interfaces are
                        accepted.
  --launch              Launches a web browser when loading is complete.
  --config [filename]   Load settings from a .kcpps file. Other arguments will be ignored.
  --threads [threads]   Use a custom number of threads if specified. Otherwise, uses an amount
                        based on CPU cores.
  --usecuda, --usecublas, --usehipblas [[lowvram|normal] [main GPU ID] [mmq|nommq] [rowsplit] ...]
                        Use CUDA for GPU acceleration. Requires CUDA. Enter a number afterwards to
                        select and use 1 GPU; leaving no number will use all GPUs.
  --usevulkan [[Device IDs] [[Device IDs] ...]]
                        Use Vulkan for GPU acceleration. Can optionally specify one or more GPU
                        Device IDs (e.g. --usevulkan 0); leave blank to autodetect.
  --useclblast {0,1,2,3,4,5,6,7,8} {0,1,2,3,4,5,6,7,8}
                        Use CLBlast for GPU acceleration. Must specify exactly 2 arguments: platform
                        ID and device ID (e.g. --useclblast 1 0).
  --usecpu              Do not use any GPU acceleration (CPU only).
  --contextsize [256 to 262144]
                        Controls the memory allocated for maximum context size; only change if you
                        need more RAM for big contexts. (default 8192)
  --gpulayers [[GPU layers]]
                        Set the number of layers to offload to GPU. Requires a GPU. Set to -1 to
                        try autodetect, or 0 to disable GPU offload.
  --tensor_split [Ratios] [[Ratios] ...]
                        For CUDA and Vulkan only: the ratio to split tensors across multiple GPUs,
                        as a space-separated list of proportions, e.g. 7 3.
  --showgui             Always show the GUI instead of launching the model right away when loading
                        settings from a .kcpps file.
  --skiplauncher        Doesn't display or use the GUI launcher. Overrides --showgui.

Advanced Commands:
  --version             Prints version and exits.
  --analyze [filename]  Reads the metadata, weight types and tensor names in any GGUF file.
  --maingpu [Device ID]
                        Only used in a multi-GPU setup. Sets the index of the main GPU to be used.
  --ropeconfig [rope-freq-scale] [[rope-freq-base] ...]
                        If set, uses customized RoPE scaling from the configured frequency scale
                        and frequency base (e.g. --ropeconfig 0.25 10000). Otherwise, uses NTK-aware
                        scaling set automatically based on context size. For linear RoPE, simply
                        set the freq-scale and ignore the freq-base.
  --blasbatchsize {-1,16,32,64,128,256,512,1024,2048}
                        Sets the batch size used in BLAS processing (default 512). Setting it to -1
                        disables BLAS mode but keeps other benefits like GPU offload.
  --blasthreads [threads]
                        Use a different number of threads during BLAS if specified. Otherwise, has
                        the same value as --threads.
  --lora [lora_filename] [[lora_filename] ...]
                        GGUF models only; applies a LoRA file on top of the model.
  --loramult [amount]   Multiplier applied to the text LoRA model.
  --noshift             If set, do not attempt to Trim and Shift the GGUF context.
  --nofastforward       If set, do not attempt to fast-forward the GGUF context (always reprocess).
                        Also enables --noshift.
  --useswa              If set, allows Sliding Window Attention (SWA) KV cache, which saves memory
                        but cannot be used with context shifting.
  --usemmap             If set, uses mmap to load the model.
  --usemlock            Enables mlock, preventing the RAM used to load the model from being paged
                        out. Not usually recommended.
  --noavx2              Do not use AVX2 instructions; a slower compatibility mode for older devices.
  --failsafe            Use failsafe mode, an extremely slow CPU-only compatibility mode that
                        should work on all devices. Can be combined with --useclblast if your
                        device supports OpenCL.
  --debugmode [DEBUGMODE]
                        Shows additional debug info in the terminal.
  --onready [shell command]
                        An optional shell command to execute after the model has been loaded.
  --benchmark [[filename]]
                        Do not start the server; instead run benchmarks. If a filename is provided,
                        appends results to that file.
  --prompt [prompt]     Passing a prompt string triggers a direct inference: the model is loaded,
                        the response is printed to stdout, and the program exits. Can be used alone
                        or with --benchmark.
  --cli                 Does not launch the KoboldCpp HTTP server. Instead, runs KoboldCpp from the
                        command line, accepting interactive console input and displaying responses
                        in the terminal.
  --promptlimit [token limit]
                        Sets the maximum number of generated tokens; usable only with --prompt or
                        --benchmark.
  --multiuser [limit]   Runs in multiuser mode, which queues incoming requests instead of blocking
                        them.
  --multiplayer         Hosts a shared multiplayer session that others can join.
  --websearch           Enables the local search engine proxy so web searches can be done.
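A typical launch combines a model file, an acceleration backend, and a context size. The sketches below only use flags documented above; the model and output filenames are hypothetical placeholders.

```shell
# Load a local GGUF model with CUDA acceleration, autodetect how many layers
# fit on the GPU, and open the web UI once loading completes:
koboldcpp.exe --model mymodel.gguf --usecuda normal 0 mmq --gpulayers -1 \
  --contextsize 16384 --launch

# Run benchmarks instead of starting the server, appending results to a file:
koboldcpp.exe --model mymodel.gguf --benchmark results.csv

# One-shot inference to stdout, capped at 128 generated tokens:
koboldcpp.exe --model mymodel.gguf --prompt "Hello, world" --promptlimit 128
```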
  --remotetunnel        Uses Cloudflare to create a remote tunnel, allowing you to access koboldcpp
                        remotely over the internet, even behind a firewall.
  --highpriority        Experimental flag. If set, increases the process CPU priority, potentially
                        speeding up generation. Use with caution.
  --foreground          Windows only. Sends the terminal to the foreground every time a new prompt
                        is generated. This helps avoid some idle slowdown issues.
  --preloadstory [savefile]
                        Configures a prepared story JSON save file to be hosted on the server,
                        which frontends (such as KoboldAI Lite) can access over the API.
  --savedatafile [savefile]
                        If enabled, creates or opens a persistent database file on the server that
                        allows users to save and load their data remotely. A new file is created if
                        it does not exist.
  --quiet               Enables quiet mode, which hides generation inputs and outputs in the
                        terminal. Quiet mode is automatically enabled when running a horde worker.
  --ssl [cert_pem] [[key_pem] ...]
                        Allows all content to be served over SSL instead. Valid unencrypted SSL
                        cert and key .pem files must be provided.
  --nocertify           Allows insecure SSL connections. Use this if you have cert errors and need
                        to bypass certificate restrictions.
  --mmproj [filename]   Select a multimodal projector file for vision models like LLaVA.
  --mmprojcpu           Force CLIP for the vision mmproj to always run on CPU.
  --visionmaxres [max px]
                        Clamp the maximum allowed MMProj vision resolution. Allowed values are
                        between 512 and 2048 px (default 1024).
  --draftmodel [filename]
                        Load a small draft model for speculative decoding. It will be fully
                        offloaded. Its vocab must match the main model.
  --draftamount [tokens]
                        How many tokens to draft per chunk before verifying results.
  --draftgpulayers [layers]
                        How many layers to offload to GPU for the draft model (default=full
                        offload).
  --draftgpusplit [Ratios] [[Ratios] ...]
                        GPU layer distribution ratio for the draft model (default=same as main).
                        Only works if multiple GPUs are selected for the MAIN model and
                        --tensor_split is set!
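The draft-model and vision flags above are typically combined with a main model like so (illustrative only; all filenames are placeholders):

```shell
# Speculative decoding: a small, fully offloaded draft model proposes tokens in
# chunks of 8, which the main model then verifies:
koboldcpp.exe --model big-model.gguf --draftmodel small-model.gguf --draftamount 8

# Vision: attach a multimodal projector, clamping image resolution to 1024 px:
koboldcpp.exe --model mymodel.gguf --mmproj mmproj.gguf --visionmaxres 1024
```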
  --password [API key]  Enter a password required to use this instance. This key will be required
                        for all text endpoints. Image endpoints are not secured.
  --ignoremissing       Ignores all missing non-essential files, just skipping them instead.
  --chatcompletionsadapter [filename]
                        Select an optional ChatCompletions Adapter JSON file to force custom
                        instruct tags.
  --flashattention      Enables flash attention.
  --quantkv [quantization level 0/1/2]
                        Sets the KV cache data type quantization: 0=f16, 1=q8, 2=q4. Requires
                        flash attention for full effect; otherwise only the K cache is quantized.
  --forceversion [version]
                        If model file format detection fails (e.g. a rogue modified model), you can
                        set this to override the detected format (enter the desired version, e.g.
                        401 for GPTNeoX-Type2).
  --smartcontext        Reserves a portion of context to try to process the prompt less frequently.
                        Outdated. Not recommended.
  --unpack destination  Extracts the file contents of the KoboldCpp binary into a target directory.
  --exportconfig [filename]
                        Exports the currently selected arguments as a .kcpps settings file.
  --exporttemplate [filename]
                        Exports the currently selected arguments as a .kcppt template file.
  --nomodel             Allows you to launch the GUI alone, without selecting any model.
  --moeexperts [num of experts]
                        How many experts to use for MoE models (default=follow GGUF).
  --defaultgenamt DEFAULTGENAMT
                        How many tokens to generate by default, if not specified. Must be smaller
                        than the context size. Usually, your frontend GUI will override this.
  --nobostoken          Prevents the BOS token from being added at the start of any prompt. Usually
                        NOT recommended for most models.
  --enableguidance      Enables Classifier-Free Guidance, which allows the use of negative prompts.
                        Has performance and memory impact.
  --maxrequestsize [size in MB]
                        Specify a max request payload size. Any requests to the server larger than
                        this size will be dropped. Do not change if unsure.
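For a publicly reachable instance, the security and memory flags above are commonly combined. A sketch (the key, cert, and model filenames are placeholders):

```shell
# Require an API key on text endpoints, serve over SSL, and enable flash
# attention with q8 KV cache quantization to reduce memory use:
koboldcpp.exe --model mymodel.gguf --password mysecretkey \
  --ssl cert.pem key.pem --flashattention --quantkv 1
```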
  --overridekv [name=type:value]
                        Advanced option to override model metadata by key, same as in llama.cpp.
                        Mainly for debugging, not intended for general use. Types: int, float,
                        bool, str.
  --overridetensors [tensor name pattern=buffer type]
                        Advanced option to override tensor backend selection, same as in llama.cpp.
  --developerMode DEVELOPERMODE
                        Enables developer utilities, such as hot reloading of Kobold Lite.
  --singleinstance      Allows this KoboldCpp instance to be shut down by any new instance
                        requesting the same port, preventing duplicate servers from clashing on a
                        port.

Horde Worker Commands:
  --hordemodelname [name]
                        Sets your AI Horde display model name.
  --hordeworkername [name]
                        Sets your AI Horde worker name.
  --hordekey [apikey]   Sets your AI Horde API key.
  --hordemaxctx [amount]
                        Sets the maximum context length your worker will accept from an AI Horde
                        job. If 0, matches the main context limit.
  --hordegenlen [amount]
                        Sets the maximum number of tokens your worker will generate for an AI
                        Horde job.

Image Generation Commands:
  --sdmodel [filename]  Specify an image generation safetensors or GGUF model to enable image
                        generation.
  --sdthreads [threads]
                        Use a different number of threads for image generation if specified.
                        Otherwise, has the same value as --threads.
  --sdclamped [[maxres]]
                        If specified, limit generation steps and image size for shared use. Accepts
                        an extra optional parameter indicating the maximum resolution (e.g. 768
                        clamps to 768x768; min 512 px; disabled if 0).
  --sdclampedsoft [maxres]
                        If specified, limit max image size to curb memory usage. Similar to
                        --sdclamped but less strict: allows trade-offs between width and height
                        (e.g. 640 would allow 640x640, 512x768 and 768x512 images). Total
                        resolution cannot exceed 1MP.
  --sdt5xxl [filename]  Specify a T5-XXL safetensors model for use in SD3 or Flux. Leave blank if
                        prebaked or unused.
  --sdclipl [filename]  Specify a Clip-L safetensors model for use in SD3 or Flux. Leave blank if
                        prebaked or unused.
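The Horde flags above are normally used together, as are the SD3/Flux image-generation flags. Two illustrative sketches (worker name, key, and filenames are placeholders):

```shell
# Share a loaded model as an AI Horde worker; quiet mode is then enabled
# automatically:
koboldcpp.exe --model mymodel.gguf --hordemodelname koboldcpp/MyModel \
  --hordeworkername MyWorker --hordekey 0000000000 \
  --hordemaxctx 4096 --hordegenlen 256

# Flux-style image generation: the image model plus its separate text encoders,
# clamped to 768x768 for shared use:
koboldcpp.exe --sdmodel flux-model.gguf --sdt5xxl t5xxl.safetensors \
  --sdclipl clip_l.safetensors --sdclamped 768
```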
  --sdclipg [filename]  Specify a Clip-G safetensors model for use in SD3. Leave blank if prebaked
                        or unused.
  --sdphotomaker [filename]
                        PhotoMaker is a model that allows face cloning. Specify a PhotoMaker
                        safetensors model, which will be applied in place of img2img. SDXL models
                        only. Leave blank if unused.
  --sdvae [filename]    Specify an image generation safetensors VAE which replaces the one in the
                        model.
  --sdvaeauto           Uses a built-in VAE via TAE SD, which is very fast and fixes bad VAEs.
  --sdquant             If specified, loads the model quantized to save memory.
  --sdlora [filename]   Specify an image generation LoRA safetensors model to be applied.
  --sdloramult [amount]
                        Multiplier applied to the image LoRA model.
  --sdtiledvae [maxres]
                        Adjust the automatic VAE tiling trigger for images above this size. 0
                        disables VAE tiling.

Whisper Transcription Commands:
  --whispermodel [filename]
                        Specify a Whisper .bin model to enable speech-to-text transcription.

TTS Narration Commands:
  --ttsmodel [filename]
                        Specify the OuteTTS text-to-speech GGUF model.
  --ttswavtokenizer [filename]
                        Specify the WavTokenizer GGUF model.
  --ttsgpu              Use the GPU for TTS.
  --ttsmaxlen TTSMAXLEN
                        Limit the number of audio tokens generated with TTS.
  --ttsthreads [threads]
                        Use a different number of threads for TTS if specified. Otherwise, has the
                        same value as --threads.

Embeddings Model Commands:
  --embeddingsmodel [filename]
                        Specify an embeddings model to be loaded for generating embedding vectors.
  --embeddingsmaxctx [amount]
                        Overrides the default maximum supported context of an embeddings model
                        (defaults to the trained context).
  --embeddingsgpu       Attempts to offload layers of the embeddings model to GPU. Usually not
                        needed.

Administration Commands:
  --admin               Enables admin mode, allowing you to unload and reload different
                        configurations or models.
  --adminpassword [password]
                        Require a password to access admin functions. You are strongly advised to
                        use one for publicly accessible instances!
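The speech and embeddings models above can all be loaded alongside the main text model in one invocation. A sketch (all filenames are placeholders):

```shell
# Text generation plus speech-to-text (Whisper), text-to-speech (OuteTTS with
# its WavTokenizer), and an embeddings model in a single server:
koboldcpp.exe --model mymodel.gguf --whispermodel whisper.bin \
  --ttsmodel outetts.gguf --ttswavtokenizer wavtok.gguf \
  --embeddingsmodel embed.gguf
```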
  --admindir [directory]
                        Specify a directory to look for .kcpps configs in, which can be used to
                        swap models.
  --admintextmodelsdir [directory]
                        Used with remote-control config switching. By passing in this argument,
                        models in the directory will be available for restart operations.
  --admindatadir [directory]
                        Specify a directory to store user data in. By passing in this argument,
                        users with the admin password will be able to save and load data from the
                        server database.
  --adminallowhf        Enables downloading of HuggingFace models through the Lite UI.
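Putting the administration flags together, a remotely manageable instance might be launched as follows (paths, password, and model filename are placeholders):

```shell
# Admin mode with a password, pointing at a directory of .kcpps configs that
# can be swapped at runtime and a directory for the user data database:
koboldcpp.exe --model mymodel.gguf --admin --adminpassword changeme \
  --admindir ./configs --admindatadir ./userdata
```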