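TXT Document Splitter

This Python script splits a large TXT document into separate files based on the titles listed in its table of contents (TOC). It is particularly useful for breaking a single text file that contains multiple chapters, articles, or sections into independent, easier-to-manage files.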

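Core Features

TOC detection: the script locates the table of contents enclosed between two ***** markers and extracts every section title from it.
Title-based splitting: it finds each title as a standalone line in the body of the document and cuts the file at those positions.
Automatic naming: each output file is named after its TOC title and prefixed with a sequential number, which keeps the files ordered and easy to find.
Tolerant matching: titles are matched case-insensitively, and characters that are unsafe in filenames are stripped when the output files are named.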
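
The TOC detection step described above can be illustrated in isolation. The following is a minimal sketch, not the script's own code: the helper name extract_toc_titles and the sample text are assumptions made for illustration, and it simply returns the non-empty lines found between the first two markers.

```python
def extract_toc_titles(text, marker="*****"):
    """Return the non-empty lines between the first two marker occurrences.

    Hypothetical helper for illustration; returns an empty list when
    either marker is missing.
    """
    start = text.find(marker)
    if start == -1:
        return []
    # Search for the closing marker only after the opening one ends.
    end = text.find(marker, start + len(marker))
    if end == -1:
        return []
    block = text[start + len(marker):end]
    return [line.strip() for line in block.splitlines() if line.strip()]


sample = "preface\n*****\nTitle1\nTitle2\n*****\n\nTitle1\nbody\n"
titles = extract_toc_titles(sample)
print(titles)
```

Anything before the opening marker (here, "preface") is ignored, which matches how unrelated leading content is treated in the example further down.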

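Usage

Save your document as a UTF-8 encoded .txt file and make sure the table of contents is enclosed between two ***** markers.
Pass the path of your document to the split_txt_by_toc function in the script.
Run the script.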
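
Before running the script, it can help to confirm that the document really contains both markers. The check below is a hedged sketch rather than part of the script: count_marker_lines is a hypothetical helper, and it assumes each marker sits on a line of its own, which is stricter than the script itself (the script looks for the marker anywhere in the text).

```python
import os
import tempfile

MARKER = "*****"  # the same marker string the script searches for


def count_marker_lines(path, marker=MARKER):
    """Count the lines that consist solely of the marker (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip() == marker)


# Demo on a throwaway file; point `path` at your own document instead.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("intro\n*****\nTitle1\nTitle2\n*****\n\nTitle1\nbody\n")
    marker_lines = count_marker_lines(path)
    print("marker lines:", marker_lines)
```

A count of exactly two suggests the file has one opening and one closing marker; any other count is worth checking by hand before splitting.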

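Input and Output Example

Input file (document.txt):

Unrelated content
*****
Title1
Title2
Title3
*****

Title1
This is the content of Title1.
It contains all the information for the first part.

Title2
This is the content of Title2.
This part may cover a different topic.

Title3
This is the content of Title3.
This is the last part of the document.

Output files generated after running the script:

The script creates a new folder named document_split in the current working directory, containing the following files:

01_Title1.txt
02_Title2.txt
03_Title3.txt

Contents of 01_Title1.txt:

Title1

This is the content of Title1.
It contains all the information for the first part.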
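
The output names shown above follow a simple pattern: a two-digit sequence number, an underscore, and the sanitized title. The sketch below reproduces that pattern on its own. The helper name build_output_name and the sample titles are assumptions for illustration; the regular expression is the same one the script uses for sanitizing.

```python
import re


def build_output_name(index, title):
    """Build '<NN>_<sanitized title>.txt' (hypothetical helper mirroring the script)."""
    # Drop everything except word characters, whitespace, and hyphens,
    # then turn the remaining spaces into underscores.
    sanitized = re.sub(r"[^\w\s-]", "", title).strip().replace(" ", "_")
    return f"{index:02d}_{sanitized}.txt"


sample_titles = ["Title1", "Q&A: part two", "标题3"]
names = [build_output_name(i, t) for i, t in enumerate(sample_titles, start=1)]
print(names)
```

Because \w matches Unicode word characters in Python 3, non-ASCII titles such as Chinese headings are kept as-is, while punctuation like "&" and ":" is removed.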

import os
import re

def split_txt_by_toc(file_path):
    """
    Splits a TXT file into multiple files based on a table of contents (TOC).

    Args:
        file_path (str): The path to the input TXT file.
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return
    except Exception as e:
        print(f"An error occurred while reading the file: {e}")
        return

    # Find the table of contents block
    toc_start = content.find('*****')
    if toc_start == -1:
        print("Error: Table of contents start marker '*****' not found.")
        return

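    # Start searching after the full opening marker so that a longer run of
    # asterisks is not mistaken for the closing marker
    toc_end = content.find('*****', toc_start + 5)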
    if toc_end == -1:
        print("Error: Table of contents end marker '*****' not found.")
        return

    # Extract the TOC titles and their order
    toc_block = content[toc_start + 5:toc_end].strip()
    toc_titles = [line.strip() for line in toc_block.split('\n') if line.strip()]
    
    # Create the output directory, named after the input file and placed
    # in the current working directory
    output_dir = os.path.splitext(os.path.basename(file_path))[0] + "_split"
    os.makedirs(output_dir, exist_ok=True)
    print(f"Output directory: {output_dir}")

    # Split the main content
    main_content = content[toc_end + 5:].strip()

    # Function to sanitize filenames
    def sanitize_filename(name):
        # Remove characters that are not letters, numbers, hyphens, or underscores
        # Replace spaces with underscores
        sanitized_name = re.sub(r'[^\w\s-]', '', name).strip().replace(' ', '_')
        return sanitized_name

    for i, title in enumerate(toc_titles):
        # Match the title as a standalone line, with optional leading/trailing whitespace.
        # re.escape() keeps regex metacharacters in the title from being interpreted.
        pattern = re.compile(rf'^\s*{re.escape(title)}\s*$', re.IGNORECASE | re.MULTILINE)
        
        # Find the starting position of the current title in the main content
        match = pattern.search(main_content)
        if not match:
            print(f"Warning: Title '{title}' not found in the main content. Skipping.")
            continue
            
        start_pos = match.end()

        # Find the end position for the current section
        end_pos = -1
        if i + 1 < len(toc_titles):
            next_title = toc_titles[i + 1]
            next_pattern = re.compile(rf'^\s*{re.escape(next_title)}\s*$', re.IGNORECASE | re.MULTILINE)
            next_match = next_pattern.search(main_content, start_pos)
            if next_match:
                end_pos = next_match.start()
        
        # Extract the content for the current section
        if end_pos != -1:
            section_content = main_content[start_pos:end_pos].strip()
        else:
            section_content = main_content[start_pos:].strip()

        # Create the new file name and path
        sanitized_title = sanitize_filename(title)
        file_name = f"{i+1:02d}_{sanitized_title}.txt"
        # Use a separate name so the input file_path argument is not overwritten
        out_path = os.path.join(output_dir, file_name)

        # Write the content to the new file
        try:
            with open(out_path, 'w', encoding='utf-8') as f:
                # Add the title back to the beginning of each file
                f.write(f"{title}\n\n{section_content}\n")
            print(f"Created file: {out_path}")
        except Exception as e:
            print(f"An error occurred while writing file '{file_name}': {e}")
            
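# Example usage: replace 'document.txt' with the path to your own file.
# The guard keeps the split from running when this module is imported.
if __name__ == "__main__":
    split_txt_by_toc('document.txt')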