2024-07-11
Title:
Streaming local large language models using FastAPI, Llama.cpp, and Langchain
I have set up FastAPI with Llama.cpp and LangChain. Now I want to enable streaming in the FastAPI responses. Streaming works with Llama.cpp in my terminal, but I wasn't able to get it working with a FastAPI response.
Most tutorials focus on enabling streaming with an OpenAI model, but I am using a local LLM (a quantized Mistral) with llama.cpp. I think I have to modify the callback handler, but none of the tutorials I found worked. Here is my code:
from fastapi import FastAPI, Request, Response
from langchain_community.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
import copy
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

model_path = "../modelle/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf"

prompt = """
