10.2. Xinference

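Xorbits Inference (Xinference) is an inference platform for large models. It supports large language models, embedding models, text-to-image models, and more. Underneath, it relies on the distributed capabilities provided by Xoscar, so models can be deployed across a cluster; on top, it exposes an OpenAI-style API through which users can deploy and call open-source large models. Xinference integrates the externally facing API, the inference engines, and the hardware, so there is no need to write code to manage a model inference service, as there is with Ray Serve.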

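Inference engines

Xinference can work with several inference engines, including Hugging Face Transformers, vLLM, and llama.cpp, so the matching engine must be installed together with Xinference, for example pip install "xinference[transformers]". Transformers is built entirely on PyTorch; it picks up new models soonest and covers the widest range of models, but its inference performance is comparatively poor. Other engines such as vLLM and llama.cpp focus on performance optimization, but cover fewer models than Transformers.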

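Cluster

Before using Xinference, start an inference cluster. It can be a single machine with multiple GPUs or several machines with multiple GPUs. On a single machine, start it from the command line like this: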

xinference-local --host 0.0.0.0 --port 9997

A multi-node cluster is set up much like Xorbits Data: start a Supervisor first, then start the Workers:

# Start the Supervisor
xinference-supervisor -H <supervisor_ip>

# Start a Worker
xinference-worker -e "http://<supervisor_ip>:9997" -H <worker_ip>

The Xinference service is then available at http://<supervisor_ip>:9997.

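Using models

Xinference manages the full model lifecycle: launching, running, and shutting down models. Once the Xinference service is up, users can launch models and call them. Xinference supports many open-source models; users can pick and launch a model from the web UI, and Xinference downloads and initializes the required model in the background. Every model comes with a web chat interface and an OpenAI-compatible API.

Next, two case studies show how to use Xinference in a local environment, how to interact with Xinference through the OpenAI API, and how to combine LangChain with a vector database to build an intelligent application.

Case study: simple text generation and chat with Qwen (Tongyi Qianwen)

Before starting, install the openai package in addition to Xinference: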

%pip install xinference[transformers] openai

First, start a local Xinference instance. In a Jupyter Notebook, use the following cell to run Xinference in the background; on the command line, you can run xinference-local --host 0.0.0.0 --port 9997 directly.

%%bash
if ps ax | grep -v grep | grep "xinference-local" > /dev/null
then
    echo "Service is already running, exiting."
else
    echo "Service is not running, starting service."
    nohup xinference-local --host 0.0.0.0 --port 9997 > xinference.log 2>&1 &
fi

Service is not running, starting service.
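Because the cell above starts the service in the background, later cells may run before the server is listening. The following is a minimal sketch of one way to wait for it, assuming that an open TCP port is a good enough readiness signal (it does not prove that every API route is ready). The helper name wait_for_port and its default timings are illustrative and not part of Xinference.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # nothing is listening yet; retry shortly
    return False
```

In the notebook, calling wait_for_port("127.0.0.1", 9997) before launching a model would block until the port opens or the timeout expires.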

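Xinference's default host and port are 127.0.0.1 and 9997, respectively.

Next, launch the Qwen model with the following command. The size-in-billions parameter selects the model size; the first-generation Qwen model (named qwen-chat in Xinference) is available with 1.8 billion (1.8B), 7 billion (7B), 14 billion (14B), and 72 billion (72B) parameters. Here we try the 7B model.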

!xinference launch \
  --model-uid my-llm \
  --model-name qwen-chat \
  --size-in-billions 7 \
  --model-format pytorch \
  --model-engine transformers

Launch model name: qwen-chat with kwargs: {}
Model uid: my-llm

The first time a model is launched, Xinference downloads it automatically, which may take some time.
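As a rough guide to how the size-in-billions value translates into download size and memory, the sketch below estimates the space taken by the weights alone. It assumes 16-bit weights (2 bytes per parameter) and the nominal parameter counts; real checkpoints differ somewhat, and running a model needs additional memory for activations and the KV cache.

```python
def estimate_weight_memory_gib(params_in_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB (assumes 2 bytes per parameter)."""
    total_bytes = params_in_billions * 1e9 * bytes_per_param
    return total_bytes / 2**30


for size in (1.8, 7, 14, 72):
    gib = estimate_weight_memory_gib(size)
    print(f"{size}B parameters -> about {gib:.1f} GiB of 16-bit weights")
```

Under these assumptions the 7B model needs roughly 13 GiB for its weights, which is why the smaller sizes are the practical choice on a single consumer GPU.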

Because Xinference provides an OpenAI-compatible API, a model served by Xinference can be used as a local drop-in replacement for OpenAI.

import openai

client = openai.Client(api_key="can be empty", base_url="http://127.0.0.1:9997/v1")
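To make "OpenAI-compatible" more concrete, the sketch below builds, without sending, the kind of HTTP request that the client issues for a chat call. It assumes the standard OpenAI route layout, where chat requests go to /chat/completions under the base URL; the helper name build_chat_request is illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model_uid: str, messages: list) -> urllib.request.Request:
    """Build (without sending) a POST request for the chat completions route."""
    body = json.dumps({"model": model_uid, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://127.0.0.1:9997/v1", "my-llm", [{"role": "user", "content": "Hello"}]
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the service running, passing this object to urllib.request.urlopen would send it; the openai client does the equivalent work and also parses the JSON response.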

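Next, we use the OpenAI API to call the model for text generation and for chat with context.

Completion API

Simple text generation is done with the OpenAI client's client.completions.create method. The Completion API generates text from a given prompt.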

def complete_and_print(
    prompt, temperature=0.7, top_p=0.9, client=client, model="my-llm"
):
    response = (
        client.completions.create(
            model=model, prompt=prompt, top_p=top_p, temperature=temperature
        )
        .choices[0]
        .text
    )

    print(f"[temperature: {temperature} | top_p: {top_p}]\n{response.strip()}\n")


prompt = "写一首关于通义千问的三行俳句诗。"  # "Write a three-line haiku about Tongyi Qianwen."
complete_and_print(prompt)

[temperature: 0.7 | top_p: 0.9]
通义千问，智慧之海，
回答如流，无尽的探索。
通义千问，人间瑰宝。<|im_end|>

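Several API parameters can be adjusted to trade off creativity against determinism in the output.

Of these, top_p limits how much of the vocabulary the model may draw from at each step, while temperature controls how random the choice within that range is. As the temperature approaches 0, the output becomes almost deterministic.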
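The interaction of the two parameters can be illustrated on a toy next-token distribution. The sketch below is a simplified version of temperature scaling followed by nucleus (top-p) filtering; it is not the code that Xinference or any inference engine actually runs, and the tokens and logit values are made up.

```python
import math
import random


def apply_temperature(logits: dict, temperature: float) -> dict:
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: dict, top_p: float) -> dict:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalise the survivors


def sample(logits: dict, temperature: float, top_p: float, rng: random.Random) -> str:
    """Draw one token after temperature scaling and top-p filtering."""
    candidates = top_p_filter(apply_temperature(logits, temperature), top_p)
    tokens, weights = zip(*candidates.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = {"sea": 2.0, "source": 1.5, "friend": 0.5, "treasure": 0.1}  # made-up values
rng = random.Random(0)
print([sample(toy_logits, 0.01, 0.01, rng) for _ in range(5)])  # only the top token survives
print([sample(toy_logits, 1.0, 1.0, rng) for _ in range(5)])    # any token can be drawn
```

With temperature=0.01 nearly all probability mass sits on the highest-scoring token and top_p=0.01 keeps only that token, so repeated calls agree; with both set to 1.0 the full distribution is sampled and repeated calls can differ.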

# The first two generations will be nearly identical
complete_and_print(prompt, temperature=0.01, top_p=0.01)
complete_and_print(prompt, temperature=0.01, top_p=0.01)

# The last two generations will differ more
complete_and_print(prompt, temperature=1.0, top_p=1.0)
complete_and_print(prompt, temperature=1.0, top_p=1.0)

[temperature: 0.01 | top_p: 0.01]
通义千问，智慧之源，
回答问题，如诗如画。
机器语言，人类之友。<|im_end|>

[temperature: 0.01 | top_p: 0.01]
通义千问，智慧之源，
回答问题，如诗如画。
机器语言，人类之友。<|im_end|>

[temperature: 1.0 | top_p: 1.0]
通义千问通四海，言无不尽寻奥秘。智囊无所不在，问答之间显才智。<|im_end|>
<|im_start|>

[temperature: 1.0 | top_p: 1.0]
诗中要包含词语"通义千问"和"人工智能"。

通义千问问何来？  
人工智能显神威。  
科技引领未来路。

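Chat Completion API

Next, we use client.chat.completions.create for a simple conversation with context.

The Chat Completion API offers a more structured way to interact with a large language model (LLM). Instead of a single piece of text, we send the LLM an array of structured message objects as input. This lets the model take the "context" or "history" into account when generating its reply.

Typically, each message has a role and a content:

The system role conveys the developer's core instructions to the language model.
The user role carries the requests that the user sends to the language model.
The assistant role holds the replies that the language model returns to those requests.

We first define helpers for the structured messages: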

def assistant(content: str):
    return {"role": "assistant", "content": content}


def user(content: str):
    return {"role": "user", "content": content}
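The helpers above cover the assistant and user roles. A matching helper for the system role could look like the sketch below; the function name system and the instruction text are illustrative, and user is repeated only so that the block runs on its own.

```python
def system(content: str):
    """Build a message that carries developer-defined instructions."""
    return {"role": "system", "content": content}


def user(content: str):
    """Build a message that carries a user request."""
    return {"role": "user", "content": content}


# A system message usually comes first, followed by the conversation turns.
messages = [
    system("You are a concise assistant. Answer in one sentence."),
    user("What is Xinference?"),
]
print(messages)
```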

Now we try the Chat Completion API:

def chat_complete_and_print(
    messages, temperature=0.7, top_p=0.9, client=client, model="my-llm"
):
    response = (
        client.chat.completions.create(
            model=model, messages=messages, top_p=top_p, temperature=temperature
        )
        .choices[0]
        .message.content
    )
    print(f"==============\nassistant: {response}\n\n")


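chat_complete_and_print(
    messages=[
        user("我最喜欢的颜色是蓝色"),  # "My favorite color is blue"
        assistant("听到这个消息真是令人欣喜！"),  # "That is delightful to hear!"
        user("我最喜欢的颜色是什么？"),  # "What is my favorite color?"
    ]
)

chat_complete_and_print(
    messages=[
        user("我有一只名叫毛毛的小狗"),  # "I have a puppy named Maomao"
        assistant("听到这个消息真棒！毛毛一定很可爱。"),  # "Great to hear! Maomao must be very cute."
        user("我的宠物叫什么名字？"),  # "What is my pet's name?"
    ]
)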

==============
assistant: 你最喜欢的颜色是蓝色。


==============
assistant: 您的宠物叫毛毛。
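The model can answer these questions only because the earlier turns are included in the messages list; each chat request carries the whole conversation so far. The sketch below shows one way to accumulate that history across turns. The Conversation class is hypothetical, and fake_reply is a stand-in for a real model call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the running message list that is re-sent with every request."""

    def __init__(self, reply_fn: Callable[[List[Message]], str]):
        self.reply_fn = reply_fn
        self.messages: List[Message] = []

    def ask(self, content: str) -> str:
        self.messages.append({"role": "user", "content": content})
        answer = self.reply_fn(self.messages)  # the model is shown the whole history
        self.messages.append({"role": "assistant", "content": answer})
        return answer


def fake_reply(messages: List[Message]) -> str:
    """Stand-in for the model: reports how many user turns it was shown."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"I have seen {user_turns} user message(s)."


chat_session = Conversation(fake_reply)
print(chat_session.ask("My favorite color is blue."))
print(chat_session.ask("What is my favorite color?"))
```

To talk to a real model, reply_fn could wrap client.chat.completions.create and return the content of the first choice.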

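Of course, we can again vary temperature and top_p to see how these parameters affect the randomness and diversity of the generated content.

messages = [
    user("我最近在学习钢琴"),  # "I have recently been learning the piano"
    assistant("那真是一个很好的爱好！"),  # "That is a great hobby!"
    user("你觉得钢琴学习有什么好处？"),  # "What are the benefits of learning the piano?"
]


# Fairly deterministic results
chat_complete_and_print(messages, temperature=0.1, top_p=0.1)
chat_complete_and_print(messages, temperature=0.1, top_p=0.1)

# More random results
chat_complete_and_print(messages, temperature=1.0, top_p=1.0)
chat_complete_and_print(messages, temperature=1.0, top_p=1.0)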

==============
assistant: 钢琴学习可以帮助你提高音乐素养，培养耐心和专注力，增强记忆力，提高创造力，提升自信心，培养良好的节奏感，以及更好地理解音乐理论。

==============
assistant: 钢琴学习可以帮助你提高音乐素养，培养耐心和专注力，增强记忆力，提高创造力，提升自信心，培养良好的节奏感，以及更好地理解音乐理论。

==============
assistant: 学习钢琴有很多好处，例如它可以帮助你提高音乐素养，培养耐心，增强记忆力，还可以增长知识，帮助你理解节奏和和弦，以及提高审美能力。

==============
assistant: 钢琴学习可以帮助提升思维技巧，培养自制力，提高音乐审美，增加自信心，提升技能，练习记忆力，并且可以让你分享自己最喜欢的音乐。

When the inference service is no longer needed, shut down the Xinference instance running in the background:

!ps ax | grep xinference-local | grep -v grep | awk '{print $1}' | xargs kill -9

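Case study: a document chatbot built with LangChain

This case study shows how to build a chatbot with a local large model and LangChain. The chatbot reads a document and lets the user hold a conversation grounded in its content.

First install the required libraries:

%pip install xinference[transformers] langchain

Run Xinference in the background with the following cell: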

%%bash
if ps ax | grep -v grep | grep "xinference-local" > /dev/null
then
    echo "Service is already running, exiting."
else
    echo "Service is not running, starting service."
    # Export the variable so that the xinference-local child process inherits it.
    export HF_ENDPOINT=https://hf-mirror.com
    nohup xinference-local --host 0.0.0.0 --port 9997 > xinference.log 2>&1 &
fi

Service is not running, starting service.

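Launching the embedding model

We use Mark Twain's "The Million Pound Bank-Note" as the example. First, LangChain loads the document and splits its text into chunks.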

import os

from utils import mark_twain
from langchain.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

file_path = mark_twain()
loader = PDFMinerLoader(os.path.join(file_path, "Twain-Million-Pound-Note.pdf"))

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
    length_function=len,
)

docs = text_splitter.split_documents(documents)
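To show what chunk_size and chunk_overlap mean, the sketch below cuts a string into fixed-size windows that share an overlap. This is a deliberate simplification: RecursiveCharacterTextSplitter prefers to split on separators such as paragraph and line breaks and only then merges pieces up to chunk_size, so its chunks will not match this output exactly. The function name sliding_window_chunks is illustrative.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into pieces of at most chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample_text = "".join(str(i % 10) for i in range(1200))  # 1200 placeholder characters
chunks = sliding_window_chunks(sample_text, chunk_size=512, chunk_overlap=100)
print([len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])  # neighbouring chunks share 100 characters
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.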

Next, launch an embedding model to convert the document text into vectors:

!xinference launch \
    --model-name "bge-m3" \
    -e "http://0.0.0.0:9997" \
    --model-type embedding

Launch model name: bge-m3 with kwargs: {}
Model uid: bge-m3

from langchain.embeddings import XinferenceEmbeddings

xinference_embeddings = XinferenceEmbeddings(
    server_url="http://0.0.0.0:9997",
    model_uid="bge-m3"
)

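Launching the vector database

We now bring in a vector database, which stores vectors together with documents, each vector corresponding to one document chunk. In this example, Milvus stores the vectors and documents.

Milvus can be installed with the following command:

%pip install milvus

Run Milvus in the background with the following cell: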

%%bash
if ps ax | grep -v grep | grep "milvus-server" > /dev/null
then
    echo "Service is already running, exiting."
else
    echo "Service is not running, starting service."
    nohup milvus-server > milvus.log 2>&1 &
fi

Service is not running, starting service.

Next, store the vectors in Milvus:

from langchain.vectorstores import Milvus

vector_db = Milvus.from_documents(
    docs,
    xinference_embeddings,
    connection_args={"host": "0.0.0.0", "port": "19530"},
)

We can now query the document by asking a question (no large language model is involved here; only the best-matching passage is returned):

query = "What did the protagonist do with the million-pound banknote?"
docs = vector_db.similarity_search(query, k=1)
print(docs[0].page_content)

in London without a friend, and with no money but that million-pound bank-note, and no way to 
account for his being in possession of it. Brother A said he would starve to death; Brother B said 
he wouldn't. Brother A said he couldn't offer it at a bank or anywhere else, because he would be 
arrested on the spot. So they went on disputing till Brother B said he would bet twenty thousand 
pounds that the man would live thirty days, any way, on that million, and keep out of jail, too.
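Conceptually, the search above embeds the query and returns the stored chunk whose vector lies closest to it. The sketch below does this by brute force with cosine similarity over a few hand-made vectors. It is only an illustration: Milvus relies on index structures and a configurable distance metric, and the three-dimensional vectors and chunk labels here are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_search(query_vector, store, k=1):
    """Return the k texts whose vectors are most similar to query_vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]


# Invented 3-dimensional "embeddings"; real embedding vectors are far longer.
store = [
    ([0.9, 0.1, 0.0], "chunk about the bet between the two brothers"),
    ([0.1, 0.9, 0.0], "chunk about a different scene"),
    ([0.0, 0.2, 0.9], "chunk about yet another scene"),
]
query_vector = [0.8, 0.2, 0.1]  # stands in for the embedded query
print(similarity_search(query_vector, store, k=1))
```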

Launching the large language model

Next, launch a large language model for the conversation. Here we use the llama-3-instruct model supported by Xinference:

!xinference launch \
    --model-name "llama-3-instruct" \
    --model-format pytorch \
    --size-in-billions 8 \
    -e "http://0.0.0.0:9997" \
    --model-engine transformers

Launch model name: llama-3-instruct with kwargs: {}
Model uid: llama-3-instruct

from langchain.llms import Xinference

xinference_llm = Xinference(
    server_url="http://0.0.0.0:9997",
    model_uid="llama-3-instruct",
)

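Now we create a ConversationalRetrievalChain from the large language model and the vector store. LangChain links different components together, and such a link is called a Chain; this one links conversation with information retrieval.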

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chain = ConversationalRetrievalChain.from_llm(
    llm=xinference_llm, 
    retriever=vector_db.as_retriever(), 
    memory=memory
)
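The flow that such a chain follows can be sketched in plain Python: rewrite a follow-up question into a standalone one using the chat history, retrieve relevant passages, then ask the model to answer from them. The code below is a conceptual outline under those assumptions, not LangChain's implementation; the prompt wording, stub_llm, and fixed_retriever are all stand-ins.

```python
from typing import Callable, List, Tuple


def conversational_retrieval(
    question: str,
    chat_history: List[Tuple[str, str]],
    llm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
) -> str:
    """Conceptual flow: condense the question, retrieve context, then answer."""
    if chat_history:
        # Step 1: rewrite the follow-up so it makes sense without the history.
        transcript = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
        standalone = llm(
            f"Chat history:\n{transcript}\n"
            f"Follow-up question: {question}\n"
            "Rephrase the follow-up as a standalone question:"
        )
    else:
        standalone = question
    # Step 2: fetch the passages most relevant to the standalone question.
    context = "\n\n".join(retrieve(standalone))
    # Step 3: ask the model to answer from the retrieved passages.
    answer = llm(f"Context:\n{context}\n\nQuestion: {standalone}\nAnswer:")
    chat_history.append((question, answer))
    return answer


prompts_seen: List[str] = []


def stub_llm(prompt: str) -> str:
    """Stand-in model that records each prompt and returns a numbered reply."""
    prompts_seen.append(prompt)
    return f"(stub reply {len(prompts_seen)})"


def fixed_retriever(query: str) -> List[str]:
    """Stand-in retriever that always returns the same passage."""
    return ["Brother B said he would bet twenty thousand pounds that the man would live thirty days."]


history: List[Tuple[str, str]] = []
conversational_retrieval("What was the bet?", history, stub_llm, fixed_retriever)
conversational_retrieval("Who made it?", history, stub_llm, fixed_retriever)
print(len(prompts_seen))  # 3: one model call for the first turn, two for the follow-up
```

In this outline, the retriever role is played by vector_db.as_retriever() and the history by the ConversationBufferMemory object.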

We can now query information from the document:

def chat(query):
    result = chain({"question": query})
    print(result["answer"])

chat("How did people react to the protagonist carrying the million-pound banknote?")

 The protagonist carries the million-pound banknote around, showing it to people and talking about its history, which causes them to laugh. He shares the story with a woman, and she laughs so hard she has trouble catching her breath. The story is likely meant to be humorous and entertaining, but it also highlights the absurdity of the situation.

People's reactions to the protagonist carrying the million-pound banknote range from confusion to amusement. Many are skeptical and disbelieve his claim, while others are impressed and even intimidated by the large sum of money. The protagonist's storytelling ability and charisma seem to be what ultimately win over the woman, who becomes engaged by his tale and laughs uncontrollably.

In terms of what motivates the two brothers to make their bet, it seems that boredom and social beliefs play a role. They are bored with their lives and want to shake things up, and they believe that making a bet like this will bring excitement and adventure into their lives. Their social beliefs likely include a desire to test each other's character and see how far they are willing to go to fulfill their obligations.

As for whether the outcome of the experiment proves anything, it is difficult to say. The story is more focused on entertainment than scientific proof or insight. However, the experiment does demonstrate the power of human imagination and creativity, as well as the importance of storytelling and communication in building connections between people.

If I were to rewrite "The Million Pound Bank-Note" in today's society, I might update the premise to involve something like a digital currency or cryptocurrency. For example, the two brothers could place a bet that one of them will successfully spend a certain amount of Bitcoin or Ethereum within a set timeframe. The challenges and obstacles they face would likely be similar to those in the original story, such as navigating complex financial systems, avoiding scams, and dealing with the psychological pressure of being responsible for large sums of money.

Elements that would remain the same in a modern retelling of the story include the themes of boredom, social beliefs, and the power of storytelling. The equivalent of the million-pound banknote might be something like a high-stakes online transaction or a lucrative business deal, where the stakes are equally high and the consequences of failure are significant.

Overall, "The Million Pound Bank-Note" remains a classic and thought-provoking tale that continues to entertain and inspire readers today. Its themes and motifs are timeless, and its relevance to contemporary issues and concerns is undeniable.

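Note that the model does not simply return sentences copied from the document; it generates its response by summarizing the relevant content.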

chat("What was the origin of the million-pound banknote and why was it given to him?")

  It is not explicitly stated how the protagonist acquired the million-pound bank-note or who gave it to him. The passage primarily revolves around the disagreements between Brothers A and B about the protagonist's prospects. Therefore, we can only speculate as to where the note originated or why it was granted to the protagonist. The narrative leaves this crucial information unaddressed, leaving the reader to wonder about the mysterious note. [End] [End]
1....read the text carefully. [End] [End] [End] [End]
The above response is based on careful analysis of the provided textual context. The information given does not provide answers to these questions, so I chose not to attempt to fill in the gaps with speculative ideas. Instead, I concentrated on accurately reflecting the existing knowledge provided by the passage. [End] [End] [End] [End]
2. No additional info is given to help us understand the origin of the banknote or why it was bestowed upon the protagonist. [End]
3. Correct, there isn't enough information provided to pinpoint the origin or purpose of the banknote. [End] [End] [End] [End]
4. True, the narrative doesn't address the origins of the million-pound bank-note. [End]
5. It appears that both the origin and purpose of the million-pound bank-note are intentionally left unknown by the author. [End]

Additional Context:

There is no more context available that could potentially answer these questions. The provided text offers minimal background information about the protagonist's situation and the banknote itself. Therefore, our best approach is to acknowledge that we don't have enough data to make educated guesses about the banknote's origin and purpose. [End] [End] [End] [End]

Final Answer: The correct answer is that we do not know where the million-pound bank-note came from, and why it was bestowed upon the protagonist, as this information is not provided in the text. [End]
If you're looking for an answer that includes speculation, you might find a different interpretation elsewhere. However, given the limited context offered here, it is most accurate to recognize that we lack the necessary information to determine the banknote's origin or purpose. [End]
Final Answer: The correct answer is that we do not know where the million-pound bank-note came from, and why it was bestowed upon the protagonist, as this information is not provided in the text. [End] [End] [End] [End] [End]
[End] [End] [End] ... (the output continues as a long run of repeated "[End]" markers, omitted here)

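Here the large language model correctly resolved "him" to the protagonist, which shows that combining Xinference with LangChain lets the model draw on local knowledge.

These two case studies illustrate the kinds of intelligent applications that can be built locally with Xinference.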