From Fractal Reader Development and Operation Diary

I want to achieve appropriate text segmentation.

  • For example, if the topic changes partway through the text, I would like the summary to be divided at that point.
    • Materials structured into chapters, such as academic papers, are currently quite difficult to read.
      • For a paper in my own field, I am familiar with the template structure of the chapters, so I think using chapter information would make it easier to read.

I see (nishio).

  • If we limit this to the Plurality Book, parsing the Markdown data seems like a good approach.

  • In the general case, the appearance of “short lines that are not sentences” seems to be a hint.

    • However, there are many pitfalls such as bullet points, tables, and figure captions.
  • If cost is not a concern, I would like to hand the text to an LLM (Large Language Model) with a large context window and let it do the segmentation.

    • I think this will work (blu3mo).
      • Either way, the entire text is already being fed in repeatedly, so the cost doesn’t change much.
  • Design under consideration (blu3mo)

    • Feed the text into the LLM once and have it divide the text sensibly at a macro level.
      • If there are chapters, I want it to follow them; if not, I want it to split sensibly based on context.
    • At summary levels more detailed than that division, I want chunks to break at these division points.
    • At summary levels less detailed than that division, continue as before.
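
A rough sketch of the ideas above, not how Fractal Reader actually does it. It assumes each summary level cuts the text into fixed-size chunks of lines, which the notes do not confirm. The macro division is meant to come from the LLM; here the “short lines that are not sentences” heuristic stands in for it, with guards for the bullet, table, and caption pitfalls. All names (`looks_like_heading`, `macro_boundaries`, `chunk`) and the length threshold are hypothetical.

```python
import re

# Pitfalls mentioned in the notes: bullets and figure/table captions
# are short non-sentences too, so they are excluded explicitly.
_BULLET = re.compile(r"^\s*([-*•・]|\d+[.)])\s+")
_CAPTION = re.compile(r"^\s*(fig(ure)?|table)\s*\.?\s*\d+", re.IGNORECASE)


def looks_like_heading(line, max_len=60):
    """Heuristic: a short line that does not end like a sentence."""
    text = line.strip()
    if not text or len(text) > max_len:
        return False
    if _BULLET.match(text) or _CAPTION.match(text) or "|" in text:
        return False  # bullet, caption, or table row
    return not text.endswith((".", "!", "?", "。", ":", ";", ","))


def macro_boundaries(lines):
    """Indices of lines that start a macro section (index 0 always does).

    Stand-in for the one-off LLM pass described in the notes.
    """
    return sorted({0} | {i for i, l in enumerate(lines) if looks_like_heading(l)})


def chunk(lines, size, boundaries=None):
    """Split lines into chunks of at most `size` lines.

    With `boundaries`, a new chunk is forced at each boundary index,
    so no chunk crosses a macro division. Without it, this is plain
    fixed-size chunking ("continue as before").
    """
    starts = set(boundaries or ())
    chunks, current = [], []
    for i, line in enumerate(lines):
        if current and (len(current) >= size or i in starts):
            chunks.append(current)
            current = []
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```

For a detailed level, one would call `chunk(lines, size, macro_boundaries(lines))`; for a level coarser than the macro division, `chunk(lines, size)` with no boundaries. Deciding which levels count as “more detailed” than the division is left open here, as it is in the notes.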