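How to Configure GraphRAG to Process CSV Files

Last modified: September 4, 2024

Author: 还就可 | 深入LLM Agent应用开发

Original article: https://mp.weixin.qq.com/s/qhQeYfdX...

Readers in the group often ask how GraphRAG handles CSV files. If you only follow the generated settings.yaml template and switch the file type to csv, it will not work. For example, a configuration like this:

Code block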

input:
  type: file # or blob
  file_type: csv # or text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.csv"

Why is that? Let's find out.


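1. Configuring CSV file input

GraphRAG's index input code lives in graphrag/index/config/input.py, and it currently supports loading CSV files and plain-text (txt) files. So if you want to add something like PDF loading, this is where the corresponding code would need to be implemented. Back to the topic: let's look at the code in csv.py.

Code block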

async def load_file(path: str, group: dict | None) -> pd.DataFrame:
    ...
    # Generate an id for each row if the CSV has no "id" column
    if "id" not in data.columns:
        data["id"] = data.apply(lambda x: gen_md5_hash(x, x.keys()), axis=1)
    # Take the configured source column and store it as the "source" column
    if csv_config.source_column is not None and "source" not in data.columns:
        ...
        else:
            data["source"] = data.apply(
                lambda x: x[csv_config.source_column], axis=1
            )
    # Take the configured text column and store it as the "text" column
    if csv_config.text_column is not None and "text" not in data.columns:
        ...
        else:
            data["text"] = data.apply(lambda x: x[csv_config.text_column], axis=1)
    # Take the configured title column and store it as the "title" column
    if csv_config.title_column is not None and "title" not in data.columns:
        ...
        else:
            data["title"] = data.apply(lambda x: x[csv_config.title_column], axis=1)
    # Take the configured timestamp column and parse it into "timestamp"
    if csv_config.timestamp_column is not None:
        ...
        else:
            data["timestamp"] = pd.to_datetime(
                data[csv_config.timestamp_column], format=fmt
            )
    return data
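To make the effect of these branches concrete, here is a minimal sketch of the same column mapping using only the Python standard library instead of pandas. It is not GraphRAG's code: the function normalize_rows, the sample column names (headline, content, author), and the simplified MD5 hashing are all assumptions made for illustration.

```python
import csv
import hashlib
import io


def normalize_rows(csv_text, text_column=None, title_column=None, source_column=None):
    """Mimic the column mapping from the excerpt above for a CSV given as a string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if "id" not in row:
            # Simplified stand-in for gen_md5_hash: hash the row's original values.
            joined = "".join(str(value) for value in row.values())
            row["id"] = hashlib.md5(joined.encode("utf-8")).hexdigest()
        # Copy each configured column into its fixed name, unless that name already exists.
        mapping = (("source", source_column), ("text", text_column), ("title", title_column))
        for target, configured in mapping:
            if configured is not None and target not in row and configured in row:
                row[target] = row[configured]
    return rows


# Hypothetical CSV whose columns do not use the fixed names.
sample = "headline,content,author\nHello,GraphRAG builds a knowledge graph.,alice\n"

mapped = normalize_rows(
    sample, text_column="content", title_column="headline", source_column="author"
)
unmapped = normalize_rows(sample)

print(sorted(mapped[0].keys()))    # includes "text", "title" and "source"
print(sorted(unmapped[0].keys()))  # only the original columns plus "id"
```

In this sketch, calling the function without any column settings leaves the rows without a text column, which is consistent with the failure described at the start of the article.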

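So to process a CSV file, we need to tell GraphRAG through the configuration which columns hold the text, title, source, and timestamp. Alternatively, you can edit the CSV file itself so that it already contains columns with these names. If we go the configuration route, which options are available?

Code block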

type: The type of input to use. Options are file or blob.
file_type: The file type field discriminates between the different input types. Options are csv and text.
base_dir: The base directory to read the input files from. This is relative to the config file.
file_pattern: A regex to match the input files. The regex must have named groups for each of the fields in the file_filter.
post_process: A DataShaper workflow definition to apply to the input before executing the primary workflow.
source_column (type: csv only): The column containing the source/author of the data
text_column (type: csv only): The column containing the text of the data
timestamp_column (type: csv only): The column containing the timestamp of the data
timestamp_format (type: csv only): The format of the timestamp
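Before running the indexer, it can help to check the chosen column settings against the actual CSV header. The sketch below is one way to do that, not part of GraphRAG: the dictionary input_settings mirrors a few of the options listed above with hypothetical column names (author, content, date), find_problems is a made-up helper, and datetime.strptime is used as a stand-in for the pd.to_datetime call shown in the excerpt, so edge cases may differ.

```python
from datetime import datetime

# Hypothetical values mirroring some of the input options listed above.
input_settings = {
    "file_type": "csv",
    "source_column": "author",
    "text_column": "content",
    "timestamp_column": "date",
    "timestamp_format": "%Y-%m-%d",
}


def find_problems(header, first_row, settings):
    """Return human-readable mismatches between the settings and the CSV."""
    problems = []
    # Every configured column must exist in the CSV header.
    for key in ("source_column", "text_column", "timestamp_column"):
        column = settings.get(key)
        if column is not None and column not in header:
            problems.append(f"{key}={column!r} is not a column in the CSV")
    # If a timestamp format is given, the first value should parse with it.
    ts_column = settings.get("timestamp_column")
    ts_format = settings.get("timestamp_format")
    if ts_column in first_row and ts_format is not None:
        try:
            datetime.strptime(first_row[ts_column], ts_format)
        except ValueError:
            problems.append(
                f"{first_row[ts_column]!r} does not match timestamp_format {ts_format!r}"
            )
    return problems


# Hypothetical header and first data row of the CSV to be indexed.
header = ["headline", "content", "author", "date"]
first_row = {
    "headline": "Hello",
    "content": "Some text",
    "author": "alice",
    "date": "2024/09/04",
}

for problem in find_problems(header, first_row, input_settings):
    print(problem)
```

Here the columns all exist, but the sample date uses slashes while the format expects dashes, so the check reports a timestamp mismatch.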
