Pemo

RAG-Anything

<p>RAG-Anything是香港大学数据智能实验室推出的开源多模态RAG系统。系统支持处理包含文本、图像、表格和公式的复杂文档，提供从文档摄取到智能查询的端到端解决方案。系统基于多模态知识图谱、灵活的解析架构和混合检索机制，显著提升复杂文档处理能力，支持多种文档格式，如PDF、Office文档、图像和文本文件等。RAG-Anything核心优势包括端到端多模态流水线、多格式文档支持、多模态内容分析引擎、知识图谱索引、灵活的处理架构和跨模态检索机制等。</p> <h2 style="font-size: 20px;">RAG-Anything的项目地址</h2> <ul> <li>GitHub仓库：https://github.com/HKUDS/RAG-Anything</li> <li>arXiv技术论文：https://arxiv.org/pdf/2410.05779</li> </ul>

Agentic Document Extraction

<div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">概述</h2> <a id="user-content-overview" class="anchor" href="https://github.com/landing-ai/agentic-doc#overview" aria-label="永久链接：概述"></a></div> <p>LandingAI Agentic 文档提取API 从视觉复杂的文档（如表格、图片和图表）中提取结构化数据，并返回具有精确元素位置的分层 JSON。</p> <p>这个 Python 库包装了该 API 以提供：</p> <ul dir="auto"> <li>长文档支持——一次调用即可处理 100 多页 PDF</li> <li>自动重试/分页——处理并发、超时和速率限制</li> <li>辅助实用程序——边界框代码片段、可视化调试器等</li> </ul> <div class="markdown-heading" dir="auto"> <h3 class="heading-element" dir="auto" tabindex="-1">特征</h3> <a id="user-content-features" class="anchor" href="https://github.com/landing-ai/agentic-doc#features" aria-label="永久链接：功能"></a></div> <ul dir="auto"> <li>📦包含电池的安装： <code>pip install agentic-doc</code> – 无需其他任何操作 → 请参阅 <a href="https://github.com/landing-ai/agentic-doc#installation">安装</a></li> <li>🗂️所有文件类型：解析任意长度的 PDF、单个图像或 URL → 请参阅 <a href="https://github.com/landing-ai/agentic-doc#supported-files">支持的文件</a></li> <li>📚长文档就绪：自动拆分和并行处理 1000 多页 PDF，然后拼接结果 → 参见 <a href="https://github.com/landing-ai/agentic-doc#parse-large-pdf-files">解析大型 PDF 文件</a></li> <li>🧩结构化输出：返回分层 JSON 以及可渲染的 Markdown → 参见 <a href="https://github.com/landing-ai/agentic-doc#result-schema">结果架构</a></li> <li>👁️真实视觉效果：可选的边界框片段和整页可视化 → 请参阅 <a href="https://github.com/landing-ai/agentic-doc#save-groundings-as-images">将 Groundings 另存为图像</a></li> <li>🏃批处理和并行：提供列表；库管理线程和速率限制（<code>BATCH_SIZE</code>，<code>MAX_WORKERS</code>）→参见 <a href="https://github.com/landing-ai/agentic-doc#parse-multiple-files-in-a-batch">批量解析多个文件</a></li> <li>🔄弹性：针对 408/429/502/503/504 和速率限制命中的指数退避重试 → 请参阅 <a href="https://github.com/landing-ai/agentic-doc#automatically-handle-api-errors-and-rate-limits-with-retries">使用重试自动处理 API 错误和速率限制</a></li> <li>🛠️嵌入式助手： <code>parse_documents</code>，，<code>parse_and_save_documents</code>→<code>parse_and_save_document</code>参见 <a href="https://github.com/landing-ai/agentic-doc#main-functions">主要功能</a></li> <li>⚙️通过 env / .env 配置：调整并行度、日志记录样式、重试上限 — 无需更改代码 → 请参阅 <a href="https://github.com/landing-ai/agentic-doc#configuration-options">配置选项</a></li> <li>🌐原始 API 就绪：高级用户仍然可以直接访问 REST 端点 → 请参阅 <a href="https://support.landing.ai/docs/document-extraction" rel="nofollow">API 文档</a></li> </ul> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">快速入门</h2> <a id="user-content-quick-start" class="anchor" href="https://github.com/landing-ai/agentic-doc#quick-start" aria-label="永久链接：快速入门"></a></div> <div class="markdown-heading" dir="auto"> <h3 class="heading-element" dir="auto" tabindex="-1">安装</h3> <a id="user-content-installation" class="anchor" href="https://github.com/landing-ai/agentic-doc#installation" aria-label="永久链接：安装"></a></div> <div class="highlight highlight-source-shell notranslate position-relative overflow-auto" dir="auto"> <pre>pip install agentic-doc</pre> </div>

Dolphin

<p>Dolphin 是字节跳动开源的轻量级、高效的文档解析大模型。基于先解析结构后解析内容的两阶段方法，第一阶段生成文档布局元素序列，第二阶段用元素作为锚点并行解析内容。Dolphin在多种文档解析任务上表现出色，性能超越GPT-4.1、Mistral-OCR等模型。Dolphin 具有322M参数，体积小、速度快，支持多种文档元素解析，包括文本、表格、公式等。Dolphin的代码和预训练模型已公开，方便开发者使用和研究。</p> <h2 style="font-size: 20px;">Dolphin的主要功能</h2> <ul> <li>布局分析：识别文档中的各种元素（如标题、图表、表格、脚注等），按照自然阅读顺序生成元素序列。</li> <li>内容提取：将整个文档页面解析为结构化的JSON格式或Markdown格式，便于后续处理和展示。</li> <li>文本段落解析：准确识别和提取文档中的文本内容，支持多语言（如中文和英文）。</li> <li>公式识别：支持复杂公式的识别，包括行内公式和块级公式，输出LaTeX格式。</li> <li>表格解析：支持解析复杂的表格结构，提取单元格内容并生成HTML格式的表格。</li> <li>轻量级架构：模型参数量为322M，体积小，运行速度快，适合在资源受限的环境中使用。</li> <li>支持多种输入格式：支持处理多种类型的文档图像，包括学术论文、商业报告、技术文档等。</li> <li>多样化的输出格式：支持将解析结果输出为JSON、Markdown、HTML等多种格式，便于与不同系统集成。</li> </ul> <h2 style="font-size: 20px;">Dolphin的技术原理</h2> <ul> <li>页面级布局分析：用Swin Transformer对输入的文档图像进行编码，提取视觉特征。基于解码器生成文档元素序列，每个元素包含其类别（如标题、表格、图表等）和坐标位置。这一阶段的目标是按照自然阅读顺序生成结构化的布局信息。</li> <li>元素级内容解析：根据第一阶段生成的布局信息，从原始图像中裁剪出每个元素的局部视图。用特定的提示词（prompts），对每个元素进行并行内容解析。例如，表格用专门的提示词解析HTML格式，公式和文本段落共享提示词解析LaTeX格式。解码器根据裁剪后的元素图像和提示词，生成最终的解析内容。</li> </ul> <h2 style="font-size: 20px;">Dolphin的项目地址</h2> <ul> <li>GitHub仓库：<a class="external" href="https://github.com/bytedance/Dolphin" target="_blank" rel="noopener">https://github.com/bytedance/Dolphin</a></li> <li>HuggingFace模型库：<a class="external" href="https://huggingface.co/ByteDance/Dolphin" target="_blank" rel="noopener nofollow">https://huggingface.co/ByteDance/Dolphin</a></li> <li>arXiv技术论文：<a class="external" href="https://arxiv.org/pdf/2505.14059" target="_blank" rel="noopener nofollow">https://arxiv.org/pdf/2505.14059</a></li> <li>在线体验Demo：<a class="external" href="http://115.190.42.15:8888/dolphin/" target="_blank" rel="noopener nofollow">http://115.190.42.15:8888/dolphin/</a></li> </ul>

ContextGem

<p dir="auto"><a href="https://camo.githubusercontent.com/dcc762e8d3dc538b9e7dffbc07f3a0b3bfae2e4b56c8d5670075d156cd5d53b6/68747470733a2f2f636f6e7465787467656d2e6465762f5f7374617469632f636f6e7465787467656d5f726561646d655f6865616465722e706e67" target="_blank" rel="noopener noreferrer nofollow"><img title="ContextGem - 轻松从文档中提取 LLM" src="https://camo.githubusercontent.com/dcc762e8d3dc538b9e7dffbc07f3a0b3bfae2e4b56c8d5670075d156cd5d53b6/68747470733a2f2f636f6e7465787467656d2e6465762f5f7374617469632f636f6e7465787467656d5f726561646d655f6865616465722e706e67" alt="ContextGem" data-canonical-src="https://contextgem.dev/_static/contextgem_readme_header.png"></a></p> <div class="markdown-heading" dir="auto"> <h1 class="heading-element" dir="auto" tabindex="-1">ContextGem：轻松从文档中提取 LLM</h1> <a id="user-content-contextgem-effortless-llm-extraction-from-documents" class="anchor" href="https://github.com/shcherbak-ai/contextgem#contextgem-effortless-llm-extraction-from-documents" aria-label="永久链接：ContextGem：轻松从文档中提取 LLM"></a></div> <p dir="auto"><a href="https://github.com/shcherbak-ai/contextgem/actions/workflows/ci-tests.yml"><img src="https://github.com/shcherbak-ai/contextgem/actions/workflows/ci-tests.yml/badge.svg?branch=main" alt="测试"></a>&nbsp;<a href="https://github.com/shcherbak-ai/contextgem/actions"><img src="https://camo.githubusercontent.com/347c395f771dc077b9f35a3e297c18ae2bdec42178f6a8b86f301bbd237109a9/68747470733a2f2f696d672e736869656c64732e696f2f656e64706f696e743f75726c3d68747470733a2f2f676973742e67697468756275736572636f6e74656e742e636f6d2f53657267696953686368657262616b2f64616165653030653164666666376132396361313061393232656333626563642f7261772f636f7665726167652e6a736f6e" alt="覆盖范围" data-canonical-src="https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/SergiiShcherbak/daaee00e1dfff7a29ca10a922ec3becd/raw/coverage.json"></a>&nbsp;<a href="https://github.com/shcherbak-ai/contextgem/actions/workflows/docs.yml"><img src="https://github.com/shcherbak-ai/contextgem/actions/workflows/docs.yml/badge.svg?branch=main" alt="文档"></a>&nbsp;<a href="https://shcherbak-ai.github.io/contextgem/" rel="nofollow"><img src="https://camo.githubusercontent.com/c21574baf34c81b3651a5274c8b074471bc1e142a9516c0cfa75dea9223d93fe/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f646f63732d6c61746573742d626c75652e737667" alt="文档" data-canonical-src="https://img.shields.io/badge/docs-latest-blue.svg"></a>&nbsp;<a href="https://opensource.org/licenses/Apache-2.0" rel="nofollow"><img src="https://camo.githubusercontent.com/8c0b445c03bb9f023baece6d7b7062fbc1c09274e7adac502b2e3d97c8f3f4f8/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4170616368655f322e302d6272696768742e737667" alt="执照" data-canonical-src="https://img.shields.io/badge/License-Apache_2.0-bright.svg"></a>&nbsp;<a href="https://camo.githubusercontent.com/667c17247b46abb2e3bb36a43080282a98b383bfda63a45a0d85d76cdaaf554e/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f636f6e7465787467656d" target="_blank" rel="noopener noreferrer nofollow"><img src="https://camo.githubusercontent.com/667c17247b46abb2e3bb36a43080282a98b383bfda63a45a0d85d76cdaaf554e/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f636f6e7465787467656d" alt="PyPI" data-canonical-src="https://img.shields.io/pypi/v/contextgem"></a>&nbsp;<a href="https://www.python.org/downloads/" rel="nofollow"><img src="https://camo.githubusercontent.com/7ff91fc79dec5b71b1dfa1c53d99c5688a036ec8a95fe7b366a0644d662f45cf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e3130253230253743253230332e3131253230253743253230332e3132253230253743253230332e31332d626c7565" alt="Python 版本" data-canonical-src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue"></a>&nbsp;<a href="https://github.com/shcherbak-ai/contextgem/actions/workflows/codeql.yml"><img src="https://github.com/shcherbak-ai/contextgem/actions/workflows/codeql.yml/badge.svg?branch=main" alt="代码安全"></a>&nbsp;<a href="https://github.com/psf/black"><img src="https://camo.githubusercontent.com/5bf9e9fa18966df7cb5fac7715bef6b72df15e01a6efa9d616c83f9fcb527fe2/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f636f64652532307374796c652d626c61636b2d3030303030302e737667" alt="代码样式：黑色" data-canonical-src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>&nbsp;<a href="https://pycqa.github.io/isort/" rel="nofollow"><img src="https://camo.githubusercontent.com/67699ff1c668c9b011f6854466a11c31c6551c1055736bc3e26536c1c52d089f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f253230696d706f7274732d69736f72742d2532333136373462313f7374796c653d666c6174" alt="进口：isort" data-canonical-src="https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat"></a>&nbsp;<a href="https://pydantic.dev/" rel="nofollow"><img src="https://camo.githubusercontent.com/1ec3b5f774c66556456b4b855a73c1706f5454fa0ac3d2e4bcdabda9153b6b45/68747470733a2f2f696d672e736869656c64732e696f2f656e64706f696e743f75726c3d68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f707964616e7469632f707964616e7469632f6d61696e2f646f63732f62616467652f76322e6a736f6e" alt="Pydantic v2" data-canonical-src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json"></a>&nbsp;<a href="https://python-poetry.org/" rel="nofollow"><img src="https://camo.githubusercontent.com/e9de59b7d2a7896f05d977ca76c28c69c6ff163840e5526baeb18e56c532ad5f/68747470733a2f2f696d672e736869656c64732e696f2f656e64706f696e743f75726c3d68747470733a2f2f707974686f6e2d706f657472792e6f72672f62616467652f76302e6a736f6e" alt="诗" data-canonical-src="https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json"></a>&nbsp;<a href="https://github.com/pre-commit/pre-commit"><img src="https://camo.githubusercontent.com/3f29c595a2e15caa8e0729b41d0451353076f480eaeefb9b07ba68e20cccb7b2/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7072652d2d636f6d6d69742d656e61626c65642d626c75653f6c6f676f3d7072652d636f6d6d6974266c6f676f436f6c6f723d7768697465" alt="预先提交" data-canonical-src="https://img.shields.io/badge/pre--commit-enabled-blue?logo=pre-commit&amp;logoColor=white"></a>&nbsp;<a href="https://github.com/shcherbak-ai/contextgem/blob/main/CODE_OF_CONDUCT.md"><img src="https://camo.githubusercontent.com/71217453f48cd1f12ba5a720412bb92743010653a5cc69654e627fd99e2e9104/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436f6e7472696275746f72253230436f76656e616e742d322e312d3462616161612e737667" alt="贡献者契约" data-canonical-src="https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg"></a>&nbsp;<a href="https://deepwiki.com/shcherbak-ai/contextgem" rel="nofollow"><img src="https://camo.githubusercontent.com/f8d782705bcb9ce83b5e04fc86504e194e5226b10cee5e6e9794cfc8b6101cba/68747470733a2f2f696d672e736869656c64732e696f2f7374617469632f76313f6c6162656c3d4465657057696b69266d6573736167653d4368617425323077697468253230436f6465266c6162656c436f6c6f723d25323332383335393326636f6c6f723d253233374535374332267374796c653d666c61742d737175617265" alt="深度维基" data-canonical-src="https://img.shields.io/static/v1?label=DeepWiki&amp;message=Chat%20with%20Code&amp;labelColor=%23283593&amp;color=%237E57C2&amp;style=flat-square"></a></p> <p dir="auto">&nbsp;</p> <p dir="auto">ContextGem 是一个免费的开源 LLM 框架，它可以让您以最少的代码更轻松地从文档中提取结构化数据和见解。</p> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">💎 为什么选择 ContextGem？</h2> <a id="user-content--why-contextgem" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-why-contextgem" aria-label="永久链接：💎 为什么选择 ContextGem？"></a></div> <p dir="auto">大多数流行的 LLM 框架用于从文档中提取结构化数据，即使是提取基本信息，也需要大量的样板代码。这大大增加了开发时间和复杂性。</p> <p dir="auto">ContextGem 通过提供灵活直观的框架来应对这一挑战，该框架能够以最小的投入从文档中提取结构化数据和洞察。复杂且耗时的部分由<strong>强大的抽象功能</strong>处理，从而消除了样板代码并降低了开发成本。</p> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">⭐ 主要特点</h2> <a id="user-content--key-features" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-key-features" aria-label="永久链接：⭐ 主要特点"></a></div> <table> <thead> <tr> <th>内置抽象</th> <th><strong>ContextGem</strong></th> <th>其他 LLM 框架*</th> </tr> </thead> <tbody> <tr> <td>自动动态提示</td> <td>🟢</td> <td>◯</td> </tr> <tr> <td>自动化数据建模和验证器</td> <td>🟢</td> <td>◯</td> </tr> <tr> <td>精确的粒度参考映射（段落和句子）</td> <td>🟢</td> <td>◯</td> </tr> <tr> <td>理由（提取背后的推理）</td> <td>🟢</td> <td>◯</td> </tr> <tr> <td>神经分割（SaT）</td> <td>🟢</td> <td>◯</td> </tr> <tr> <td>多语言支持（无提示输入/输出）</td> <td>🟢</td> <td>◯</td> </tr> <tr> <td>单一、统一的提取管道（声明式、可重用、完全可序列化）</td> <td>🟢</td> <td>🟡</td> </tr> <tr> <td>分组法学硕士课程，包含特定角色的任务</td> <td>🟢</td> <td>🟡</td> </tr> <tr> <td>嵌套上下文提取</td> <td>🟢</td> <td>🟡</td> </tr> <tr> <td>统一的、完全可序列化的结果存储模型（文档）</td> <td>🟢</td> <td>🟡</td> </tr> <tr> <td>提取任务校准示例</td> <td>🟢</td> <td>🟡</td> </tr> <tr> <td>内置并发 I/O 处理</td> <td>🟢</td> <td>🟡</td> </tr> <tr> <td>自动使用和成本跟踪</td> <td>🟢</td> <td>🟡</td> </tr> <tr> <td>回退和重试逻辑</td> <td>🟢</td> <td>🟢</td> </tr> <tr> <td>多家 LLM 提供商</td> <td>🟢</td> <td>🟢</td> </tr> </tbody> </table> <p dir="auto">🟢 - 完全支持 - 无需额外设置<br>🟡 - 部分支持 - 需要额外设置<br>◯ - 不支持 - 需要自定义逻辑</p> <p dir="auto">* 查看ContextGem 抽象的<a href="https://contextgem.dev/motivation.html#the-contextgem-solution" rel="nofollow">描述</a>以及使用 ContextGem 和其他流行的开源 LLM 框架的具体实现示例的<a href="https://contextgem.dev/vs_other_frameworks.html" rel="nofollow">比较。</a></p> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">💡 使用<strong>最少的代码</strong>，您可以：</h2> <a id="user-content--with-minimal-code-you-can" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-with-minimal-code-you-can" aria-label="永久链接：💡 使用最少的代码，您可以："></a></div> <ul dir="auto"> <li>从文档（文本、图像）中<strong>提取结构化数据</strong></li> <li><strong>识别并分析</strong>文档中的关键方面（主题、主题、类别）</li> <li><strong>从文档中提取特定概念</strong>（实体、事实、结论、评估）</li> <li>通过简单、直观的 API<strong>构建复杂的提取工作流程</strong></li> <li><strong>创建多级提取管道</strong>（包含概念的方面、分层方面）</li> </ul> <p>&nbsp;</p> <p dir="auto"><a href="https://camo.githubusercontent.com/84c9fdd0aa6c0023582ec31ee75d304e1fc63abc15882a8092514ad4190ea616/68747470733a2f2f636f6e7465787467656d2e6465762f5f7374617469632f726561646d655f636f64655f736e69707065742e706e67" target="_blank" rel="noopener noreferrer nofollow"><img title="ContextGem 提取示例" src="https://camo.githubusercontent.com/84c9fdd0aa6c0023582ec31ee75d304e1fc63abc15882a8092514ad4190ea616/68747470733a2f2f636f6e7465787467656d2e6465762f5f7374617469632f726561646d655f636f64655f736e69707065742e706e67" alt="ContextGem 提取示例" data-canonical-src="https://contextgem.dev/_static/readme_code_snippet.png"></a></p> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">📦安装</h2> <a id="user-content--installation" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-installation" aria-label="固定链接：📦安装"></a></div> <div class="highlight highlight-source-shell notranslate position-relative overflow-auto" dir="auto"> <pre>pip install -U contextgem</pre> <div class="zeroclipboard-container">&nbsp;</div> </div> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">🚀 快速入门</h2> <a id="user-content--quick-start" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-quick-start" aria-label="永久链接：🚀 快速入门"></a></div> <div class="highlight highlight-source-python notranslate position-relative overflow-auto" dir="auto"> <pre><span class="pl-c"># Quick Start Example - Extracting anomalies from a document, with source references and justifications</span> <span class="pl-k">import</span> <span class="pl-s1">os</span> <span class="pl-k">from</span> <span class="pl-s1">contextgem</span> <span class="pl-k">import</span> <span class="pl-v">Document</span>, <span class="pl-v">DocumentLLM</span>, <span class="pl-v">StringConcept</span> <span class="pl-c"># Sample document text (shortened for brevity)</span> <span class="pl-s1">doc</span> <span class="pl-c1">=</span> <span class="pl-en">Document</span>( <span class="pl-s1">raw_text</span><span class="pl-c1">=</span>( <span class="pl-s">"Consultancy Agreement<span class="pl-cce">\n</span>"</span> <span class="pl-s">"This agreement between Company A (Supplier) and Company B (Customer)...<span class="pl-cce">\n</span>"</span> <span class="pl-s">"The term of the agreement is 1 year from the Effective Date...<span class="pl-cce">\n</span>"</span> <span class="pl-s">"The Supplier shall provide consultancy services as described in Annex 2...<span class="pl-cce">\n</span>"</span> <span class="pl-s">"The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...<span class="pl-cce">\n</span>"</span> <span class="pl-s">"The purple elephant danced gracefully on the moon while eating ice cream.<span class="pl-cce">\n</span>"</span> <span class="pl-c"># 💎 anomaly</span> <span class="pl-s">"This agreement is governed by the laws of Norway...<span class="pl-cce">\n</span>"</span> ), ) <span class="pl-c"># Attach a document-level concept</span> <span class="pl-s1">doc</span>.<span class="pl-c1">concepts</span> <span class="pl-c1">=</span> [ <span class="pl-en">StringConcept</span>( <span class="pl-s1">name</span><span class="pl-c1">=</span><span class="pl-s">"Anomalies"</span>, <span class="pl-c"># in longer contexts, this concept is hard to capture with RAG</span> <span class="pl-s1">description</span><span class="pl-c1">=</span><span class="pl-s">"Anomalies in the document"</span>, <span class="pl-s1">add_references</span><span class="pl-c1">=</span><span class="pl-c1">True</span>, <span class="pl-s1">reference_depth</span><span class="pl-c1">=</span><span class="pl-s">"sentences"</span>, <span class="pl-s1">add_justifications</span><span class="pl-c1">=</span><span class="pl-c1">True</span>, <span class="pl-s1">justification_depth</span><span class="pl-c1">=</span><span class="pl-s">"brief"</span>, <span class="pl-c"># see the docs for more configuration options</span> ) <span class="pl-c"># add more concepts to the document, if needed</span> <span class="pl-c"># see the docs for available concepts: StringConcept, JsonObjectConcept, etc.</span> ] <span class="pl-c"># Or use `doc.add_concepts([...])`</span> <span class="pl-c"># Define an LLM for extracting information from the document</span> <span class="pl-s1">llm</span> <span class="pl-c1">=</span> <span class="pl-en">DocumentLLM</span>( <span class="pl-s1">model</span><span class="pl-c1">=</span><span class="pl-s">"openai/gpt-4o-mini"</span>, <span class="pl-c"># or another provider/LLM</span> <span class="pl-s1">api_key</span><span class="pl-c1">=</span><span class="pl-s1">os</span>.<span class="pl-c1">environ</span>.<span class="pl-c1">get</span>( <span class="pl-s">"CONTEXTGEM_OPENAI_API_KEY"</span> ), <span class="pl-c"># your API key for the LLM provider</span> <span class="pl-c"># see the docs for more configuration options</span> ) <span class="pl-c"># Extract information from the document</span> <span class="pl-s1">doc</span> <span class="pl-c1">=</span> <span class="pl-s1">llm</span>.<span class="pl-c1">extract_all</span>(<span class="pl-s1">doc</span>) <span class="pl-c"># or use async version `await llm.extract_all_async(doc)`</span> <span class="pl-c"># Access extracted information in the document object</span> <span class="pl-en">print</span>( <span class="pl-s1">doc</span>.<span class="pl-c1">concepts</span>[<span class="pl-c1">0</span>].<span class="pl-c1">extracted_items</span> ) <span class="pl-c"># extracted items with references &amp; justifications</span> <span class="pl-c"># or `doc.get_concept_by_name("Anomalies").extracted_items`</span></pre> <div class="zeroclipboard-container">&nbsp;</div> </div> <p dir="auto"><a href="https://colab.research.google.com/github/shcherbak-ai/contextgem/blob/main/dev/notebooks/readme/quickstart_concept.ipynb" rel="nofollow"><img src="https://camo.githubusercontent.com/96889048f8a9014fdeba2a891f97150c6aac6e723f5190236b10215a97ed41f3/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="在 Colab 中打开" data-canonical-src="https://colab.research.google.com/assets/colab-badge.svg"></a></p> <hr> <p dir="auto">请参阅文档中的更多示例：</p> <div class="markdown-heading" dir="auto"> <h3 class="heading-element" dir="auto" tabindex="-1">基本使用示例</h3> <a id="user-content-basic-usage-examples" class="anchor" href="https://github.com/shcherbak-ai/contextgem#basic-usage-examples" aria-label="永久链接：基本用法示例"></a></div> <ul dir="auto"> <li><a href="https://contextgem.dev/quickstart.html#aspect-extraction-from-document" rel="nofollow">从文档中提取方面</a></li> <li><a href="https://contextgem.dev/quickstart.html#extracting-aspect-with-sub-aspects" rel="nofollow">使用子方面提取方面</a></li> <li><a href="https://contextgem.dev/quickstart.html#concept-extraction-from-aspect" rel="nofollow">从方面提取概念</a></li> <li><a href="https://contextgem.dev/quickstart.html#concept-extraction-from-document-text" rel="nofollow">从文档（文本）中提取概念</a></li> <li><a href="https://contextgem.dev/quickstart.html#concept-extraction-from-document-vision" rel="nofollow">从文档中提取概念（视觉）</a></li> <li><a href="https://contextgem.dev/quickstart.html#lightweight-llm-chat-interface" rel="nofollow">LLM聊天界面</a></li> </ul> <div class="markdown-heading" dir="auto"> <h3 class="heading-element" dir="auto" tabindex="-1">高级用法示例</h3> <a id="user-content-advanced-usage-examples" class="anchor" href="https://github.com/shcherbak-ai/contextgem#advanced-usage-examples" aria-label="永久链接：高级用法示例"></a></div> <ul dir="auto"> <li><a href="https://contextgem.dev/advanced_usage.html#extracting-aspects-with-concepts" rel="nofollow">提取包含概念的方面</a></li> <li><a href="https://contextgem.dev/advanced_usage.html#extracting-aspects-and-concepts-from-a-document" rel="nofollow">从文档中提取方面和概念</a></li> <li><a href="https://contextgem.dev/advanced_usage.html#using-a-multi-llm-pipeline-to-extract-data-from-several-documents" rel="nofollow">使用多 LLM 管道从多个文档中提取数据</a></li> </ul> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">🔄 文档转换器</h2> <a id="user-content--document-converters" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-document-converters" aria-label="永久链接：🔄 文档转换器"></a></div> <p dir="auto">要创建用于 LLM 分析的 ContextGem 文档，您可以直接传递原始文本，也可以使用处理各种文件格式的内置转换器。</p> <div class="markdown-heading" dir="auto"> <h3 class="heading-element" dir="auto" tabindex="-1">📄 DOCX 转换器</h3> <a id="user-content--docx-converter" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-docx-converter" aria-label="永久链接：📄 DOCX 转换器"></a></div> <p dir="auto">ContextGem 提供内置转换器，可轻松将 DOCX 文件转换为 LLM 就绪数据。</p> <ul dir="auto"> <li>提取其他开源工具通常无法捕获的信息：未对齐的表格、注释、脚注、文本框、页眉/页脚和嵌入图像</li> <li>保留具有丰富元数据的文档结构，以改进 LLM 分析</li> </ul> <div class="highlight highlight-source-python notranslate position-relative overflow-auto" dir="auto"> <pre><span class="pl-c"># Using ContextGem's DocxConverter</span> <span class="pl-k">from</span> <span class="pl-s1">contextgem</span> <span class="pl-k">import</span> <span class="pl-v">DocxConverter</span> <span class="pl-s1">converter</span> <span class="pl-c1">=</span> <span class="pl-en">DocxConverter</span>() <span class="pl-c"># Convert a DOCX file to an LLM-ready ContextGem Document</span> <span class="pl-c"># from path</span> <span class="pl-s1">document</span> <span class="pl-c1">=</span> <span class="pl-s1">converter</span>.<span class="pl-c1">convert</span>(<span class="pl-s">"path/to/document.docx"</span>) <span class="pl-c"># or from file object</span> <span class="pl-k">with</span> <span class="pl-en">open</span>(<span class="pl-s">"path/to/document.docx"</span>, <span class="pl-s">"rb"</span>) <span class="pl-k">as</span> <span class="pl-s1">docx_file_object</span>: <span class="pl-s1">document</span> <span class="pl-c1">=</span> <span class="pl-s1">converter</span>.<span class="pl-c1">convert</span>(<span class="pl-s1">docx_file_object</span>) <span class="pl-c"># You can also use it as a standalone text extractor</span> <span class="pl-s1">docx_text</span> <span class="pl-c1">=</span> <span class="pl-s1">converter</span>.<span class="pl-c1">convert_to_text_format</span>( <span class="pl-s">"path/to/document.docx"</span>, <span class="pl-s1">output_format</span><span class="pl-c1">=</span><span class="pl-s">"markdown"</span>, <span class="pl-c"># or "raw"</span> )</pre> <div class="zeroclipboard-container">&nbsp;</div> </div> <p dir="auto">在文档中了解有关<a href="https://contextgem.dev/converters/docx.html" rel="nofollow">DOCX 转换器功能的更多信息。</a></p> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">🎯 重点文档分析</h2> <a id="user-content--focused-document-analysis" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-focused-document-analysis" aria-label="永久链接：🎯 重点文档分析"></a></div> <p dir="auto">ContextGem 利用 LLM 的长上下文窗口，从单个文档中提取出卓越的准确率。与 RAG 方法（通常<a href="https://www.linkedin.com/pulse/raging-contracts-pitfalls-rag-contract-review-shcherbak-ai-ptg3f" rel="nofollow">难以处理复杂概念和细微洞察）</a>不同，ContextGem 充分利用了<a href="https://arxiv.org/abs/2502.12962" rel="nofollow">持续扩展的上下文容量</a>、不断改进的 LLM 功能以及降低的成本。这种专注的方法能够直接从完整文档中提取信息，消除检索不一致，同时针对深入的单文档分析进行优化。虽然这可以提高单个文档的准确率，但 ContextGem 目前不支持跨文档查询或全语料库检索&mdash;&mdash;对于这些用例，现代 RAG 系统（例如 LlamaIndex、Haystack）仍然更为合适。</p> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">🤖 支持</h2> <a id="user-content--supported-llms" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-supported-llms" aria-label="永久链接：🤖 支持的法学硕士"></a></div> <p dir="auto"><a href="https://github.com/BerriAI/litellm">ContextGem 通过LiteLLM</a>集成支持基于云和本地的 LLM ：</p> <ul dir="auto"> <li><strong>云端法学硕士</strong>：OpenAI、Anthropic、Google、Azure OpenAI 等</li> <li><strong>本地 LLM</strong>：使用 Ollama、LM Studio 等提供商在本地运行模型。</li> <li><strong>模型架构</strong>：适用于推理/CoT 功能（例如 o4-mini）和非推理模型（例如 gpt-4.1）</li> <li><strong>简单的 API</strong>：所有 LLM 的统一接口，可轻松切换提供商</li> </ul> <p dir="auto">在文档中了解<a href="https://contextgem.dev/llms/supported_llms.html" rel="nofollow">有关支持的 LLM 提供程序和模型</a>以及如何<a href="https://contextgem.dev/llms/llm_config.html" rel="nofollow">配置 LLM 的更多信息。</a></p> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">⚡ 优化</h2> <a id="user-content--optimizations" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-optimizations" aria-label="永久链接：⚡ 优化"></a></div> <p dir="auto">ContextGem 文档提供了有关优化策略的指导，以最大限度地提高性能、最大限度地降低成本并提高提取准确性：</p> <ul dir="auto"> <li><a href="https://contextgem.dev/optimizations/optimization_accuracy.html" rel="nofollow">优化准确性</a></li> <li><a href="https://contextgem.dev/optimizations/optimization_speed.html" rel="nofollow">优化速度</a></li> <li><a href="https://contextgem.dev/optimizations/optimization_cost.html" rel="nofollow">优化成本</a></li> <li><a href="https://contextgem.dev/optimizations/optimization_long_docs.html" rel="nofollow">处理长文档</a></li> <li><a href="https://contextgem.dev/optimizations/optimization_choosing_llm.html" rel="nofollow">选择合适的法学硕士</a></li> </ul> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">💾 序列化结果</h2> <a id="user-content--serializing-results" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-serializing-results" aria-label="永久链接：💾 序列化结果"></a></div> <p dir="auto">ContextGem 允许您使用内置序列化方法保存和加载 Document 对象、管道和 LLM 配置：</p> <ul dir="auto"> <li>保存已处理的文档以避免重复昂贵的 LLM 调用</li> <li>在系统之间传输提取结果</li> <li>保留管道和 LLM 配置以供以后重用</li> </ul> <p dir="auto">在文档中了解有关<a href="https://contextgem.dev/serialization.html" rel="nofollow">序列化选项的更多信息。</a></p> <div class="markdown-heading" dir="auto"> <h2 class="heading-element" dir="auto" tabindex="-1">📚 文档</h2> <a id="user-content--documentation" class="anchor" href="https://github.com/shcherbak-ai/contextgem#-documentation" aria-label="永久链接：📚 文档"></a></div> <p dir="auto">完整文档可在<a href="https://contextgem.dev/" rel="nofollow">contextgem.dev</a>上找到。</p> <p dir="auto">完整文档的原始文本版本可在处获取<a href="https://github.com/shcherbak-ai/contextgem/blob/main/docs/docs-raw-for-llm.txt"><code>docs/docs-raw-for-llm.txt</code></a>。此文件自动生成，包含所有文档，其格式已针对 LLM 导入进行了优化（例如，用于问答）。</p>

暴躁的教授读论文（mad-professor）

"暴躁教授读论文"是一个学术论文阅读伴侣应用程序，旨在通过富有个性的AI助手提高论文阅读效率。它集成了PDF处理、AI翻译、RAG检索、AI问答和语音交互等多种功能，为学术研究者提供一站式的论文阅读解决方案。主要特性论文自动处理：导入PDF后自动提取、翻译和结构化论文内容双语显示：支持中英文对照阅读论文 AI智能问答：与论文内容结合，提供专业的解释和分析个性化AI教授：AI以"暴躁教授"的个性回答问题，增加趣味性语音交互：支持语音提问和TTS语音回答 RAG增强检索：基于论文内容的精准检索和定位分屏界面：左侧论文内容，右侧AI问答，高效交互技术架构前端界面：PyQt6构建的现代化桌面应用核心引擎： AI问答模块：基于LLM的学术问答系统 RAG检索系统：向量检索增强的问答精准度论文处理管线：PDF转MD、自动翻译、结构化解析交互系统：语音识别：实时语音输入识别 TTS语音合成：AI回答实时播报情感识别：根据问题内容调整回答情绪安装指南环境要求 Python 3.10或更高版本 CUDA支持 6GB 以上显存

edrawmax.com

Online diagram maker for professional visuals

分类导航

Pemo的主要功能