diff --git a/examples/parsing_instructions/expense_report_document.pdf b/examples/parsing_instructions/expense_report_document.pdf
new file mode 100644
index 0000000..eec74d9
Binary files /dev/null and b/examples/parsing_instructions/expense_report_document.pdf differ
diff --git a/examples/parsing_instructions/expense_report_document.png b/examples/parsing_instructions/expense_report_document.png
new file mode 100644
index 0000000..454d08f
Binary files /dev/null and b/examples/parsing_instructions/expense_report_document.png differ
diff --git a/examples/parsing_instructions/export_report_document.md b/examples/parsing_instructions/export_report_document.md
deleted file mode 100644
index f359d9d..0000000
--- a/examples/parsing_instructions/export_report_document.md
+++ /dev/null
@@ -1,146 +0,0 @@
-QUANTUM DYNAMICS CORPORATION
-EMPLOYEE EXPENSE REPORT
-FISCAL YEAR 2024
-
-EMPLOYEE INFORMATION:
-Name: Dr. Alexandra Chen-Martinez, PhD
-Employee ID: QD-2022-1457
-Department: Advanced Research & Development
-Cost Center: CC-ARD-NA-003
-Project Codes: QD-QUANTUM-2024-01, QD-AI-2024-03
-Position: Principal Research Scientist
-Reporting Manager: Dr. James Thompson
-
-TRIP/EXPENSE PERIOD:
-Start Date: November 15, 2024
-End Date: December 10, 2024
-Purpose: International Conference Attendance & Client Meetings
-Locations: Tokyo, Japan → Singapore → Sydney, Australia
-
-CURRENCY CONVERSION RATES APPLIED:
-JPY (¥) → USD: 0.0068 (as of 11/15/2024)
-SGD (S$) → USD: 0.74 (as of 11/28/2024)
-AUD (A$) → USD: 0.65 (as of 12/03/2024)
-
-## ITEMIZED EXPENSES:
-
-## Date | Category | Description | Original | Currency | USD
-
-11/15/2024 | Transportation | JFK → NRT Business Class | 4,250.00 | USD | 4,250.00
-| Booking Ref: QF78956 - Corporate Rate Applied
-| Project Code: QD-QUANTUM-2024-01
-
----
-
-11/16/2024 | Accommodation | Hilton Tokyo - 5 nights | 225,000 | JPY | 1,530.00
-| Confirmation: HTK-2024-78956
-| Room Type: Business Executive
-| Includes breakfast & WiFi
-
----
-
-11/17/2024 | Meals | Client Dinner - Sushi Zen | 45,600 | JPY | 310.08
-| Attendees: 4 (See attached list)
-| Project Code: QD-AI-2024-03
-
----
-
-11/18/2024 | Conference | Quantum Computing Summit | 2,500.00 | USD | 2,500.00
-to | Registration | Early Bird Rate
-11/20/2024 | Receipt #: QCS-2024-1234
-| Includes workshop materials
-
----
-
-11/21/2024 | Transportation | NRT → SIN Economy Premium | 875.00 | USD | 875.00
-| Booking Ref: SQ45678
-
----
-
-11/22/2024 | Accommodation | Marina Bay Sands - 4 nights | 1,780 | SGD | 1,317.20
-to | Confirmation: MBS-11224
-11/25/2024 | Room Type: Club Room
-| Includes airport transfer
-
----
-
-11/23/2024 | Client Meeting | Meeting Room Rental | 450 | SGD | 333.00
-| Venue: Business Center
-| Duration: Full Day
-| Project: QD-AI-2024-03
-
----
-
-11/26/2024 | Transportation | SIN → SYD Business Class | 2,150.00 | USD | 2,150.00
-| Booking Ref: QF98765
-
----
-
-11/27/2024 | Accommodation | Shangri-La Sydney - 6 nights | 2,880 | AUD | 1,872.00
-to | Confirmation: SLS-45678
-12/02/2024 | Harbor View Suite
-
----
-
-11/29/2024 | Entertainment | Client Event - Opera House | 1,200 | AUD | 780.00
-| 5 attendees (See attached list)
-| Project: QD-QUANTUM-2024-01
-
----
-
-12/01/2024 | Transportation | Local Transportation | 245 | AUD | 159.25
-| Various Uber/Taxi receipts
-| (Itemized in attachment A)
-
----
-
-12/03/2024 | Transportation | SYD → JFK Business Class | 3,875.00 | USD | 3,875.00
-| Booking Ref: QF11223
-
----
-
-EXPENSE SUMMARY BY CATEGORY:
-Transportation: $11,309.25
-Accommodation: $4,719.20
-Meals: $310.08
-Conference: $2,500.00
-Client Meetings: $333.00
-Entertainment: $780.00
-
----
-
-Subtotal: $19,951.53
-VAT Recoverable: -$1,245.80
-
----
-
-Total: $18,705.73
-
-COST ALLOCATION:
-Project QD-QUANTUM-2024-01: 65% ($12,158.72)
-Project QD-AI-2024-03: 35% ($6,547.01)
-
-APPROVAL STATUS:
-Submitted by: Dr. Alexandra Chen-Martinez Date: 12/11/2024
-Line Manager Approval: Dr. James Thompson Date: 12/11/2024
-Finance Review: Pending
-Payment Status: Pending
-
-ATTACHMENTS:
-
-1. Original Receipts (24 pages)
-2. Client Meeting Attendee Lists
-3. Conference Participation Certificate
-4. Local Transportation Details (Attachment A)
-5. Corporate Credit Card Statements
-
-COMPLIANCE NOTES:
-
-- All expenses comply with QDC-Travel-Policy-2024
-- Business class approved for flights >6 hours
-- Entertainment expenses pre-approved by Regional Director
-- Per diem rates not claimed (actual expenses submitted)
-
-Internal Use Only
-Document ID: EXP-2024-Q4-1457-089
-Generated by: Quantum Dynamics Expense Management System v3.5
diff --git a/examples/parsing_instructions/parsing_instructions.ipynb b/examples/parsing_instructions/parsing_instructions.ipynb
index f979523..96f8ff7 100644
--- a/examples/parsing_instructions/parsing_instructions.ipynb
+++ b/examples/parsing_instructions/parsing_instructions.ipynb
@@ -12,13 +12,24 @@
 "\n",
 "These instructions can be useful for improving the parser's performance on complex document layouts, extracting data in a specific format, or transforming the document in other ways.\n",
 "\n",
+ "### Why This Matters:\n",
+ "Traditional document parsing can be rigid and error-prone, often missing crucial context and nuances in complex layouts. Our instruction-based parsing allows you to:\n",
+ "\n",
+ "1. Extract specific information with pinpoint accuracy\n",
+ "2. Handle complex document layouts with ease\n",
+ "3. Transform unstructured data into structured formats effortlessly\n",
+ "4. Save hours of manual data entry and verification\n",
+ "5. Reduce errors in document processing workflows\n",
+ "\n",
 "In this demonstration, we showcase how parsing instructions can be used to extract specific information from unstructured documents. Below are the documents we use for testing:\n",
 "\n",
 "1. McDonald's Receipt - Extracting the price of each order and the final amount to be paid.\n",
 "\n",
 "2. Expense Report Document - Extracting employee name, employee ID, position, department, date ranges, individual expense items with dates, categories, and amounts.\n",
 "\n",
- "3. Purchase Order Document - Identifying the PO number, vendor details, shipping terms, and an itemized list of products with quantities and unit prices.\n"
+ "3. Purchase Order Document - Identifying the PO number, vendor details, shipping terms, and an itemized list of products with quantities and unit prices.\n",
+ "\n",
+ "Let's jump into these real-world examples and see how parsing instructions can help us extract specific information."
 ]
 },
 {
@@ -68,6 +79,13 @@
 "Here we extract the price of each order and the final amount to be paid."
 ]
 },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\"Alt"
+ ]
+ },
 {
 "cell_type": "code",
 "execution_count": null,
@@ -77,7 +95,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "Started parsing the file under job_id 0ffbb3d4-5148-47e7-a6a0-6ea0c47be0df\n"
+ "Started parsing the file under job_id 66643b81-e2f4-408b-890b-8e116472210b\n"
 ]
 }
 ],
@@ -155,7 +173,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "Started parsing the file under job_id af9c7ef4-e842-47f2-9a22-e99b959e8028\n"
+ "Started parsing the file under job_id 1a04fdbb-5415-4a36-a1bd-26bfb5d618fa\n"
 ]
 }
 ],
@@ -202,6 +220,13 @@
 "Here we extract employee name, employee ID, position, department, date ranges, individual expense items with dates, categories, and amounts."
 ]
 },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\"Alt"
+ ]
+ },
 {
 "cell_type": "code",
 "execution_count": null,
@@ -211,14 +236,13 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "Started parsing the file under job_id 63ff0728-bb93-421d-a093-6560050e6c22\n",
- "..."
+ "Started parsing the file under job_id b6bcc6e1-7d30-4522-9abd-ace196781a70\n"
 ]
 }
 ],
 "source": [
 "vanilaParsing = LlamaParse(result_type=\"markdown\").load_data(\n",
- " \"./export_report_document.md\"\n",
+ " \"./expense_report_document.pdf\"\n",
 ")"
 ]
 },
@@ -273,14 +297,13 @@
 "\n",
 "# ITEMIZED EXPENSES:\n",
 "\n",
- "|Date|Category|Description|Original Currency|USD|\n",
- "|---|---|---|---|---|\n",
- "|11/15/2024|Transportation|JFK → NRT Business Class Booking Ref: QF78956 - Corporate Rate Applied Project Code: QD-QUANTUM-2024-01|4,250.00 USD|4,250.00|\n",
- "|11/16/2024|Accommodation|Hilton Tokyo - 5 nights Confirmation: HTK-2024-78956 Room Type: Business Executive Includes breakfast & WiFi|225,000 JPY|1,530.00|\n",
- "|11/17/2024|Meals|Client Dinner - Sushi Zen Attendees: 4 (See attached list) Project Code: QD-AI-2024-03|45,600 JPY|310.08|\n",
- "|11/18/2024|Conference|Quantum Computing Summit Registration Early Bird Rate Receipt #: QCS-2024-1234 Includes workshop materials|2,500.00 USD|2,500.00|\n",
- "|11/21/2024|Transportation|NRT → SIN Economy Premium Booking Ref: SQ45678|875.00 USD|875.00|\n",
- "|11/22/2024|Accommodation|Marina Bay Sands - 4 nights|1,780 SGD|1,317.20|\n"
+ "|Date|Category|Description|Original|Currency|USD|\n",
+ "|---|---|---|---|---|---|\n",
+ "|11/15/2024|Transportation|JFK → NRT Business Class|4,250.00|USD|4,250.00|\n",
+ "|Booking Ref: QF78956 - Corporate Rate Applied|Booking Ref: QF78956 - Corporate Rate Applied|Booking Ref: QF78956 - Corporate Rate Applied|Booking Ref: QF78956 - Corporate Rate Applied|Booking Ref: QF78956 - Corporate Rate Applied|Booking Ref: QF78956 - Corporate Rate Applied|\n",
+ "|Project Code: QD-QUANTUM-2024-01|Project Code: QD-QUANTUM-2024-01|Project Code: QD-QUANTUM-2024-01|Project Code: QD-QUANTUM-2024-01|Project Code: QD-QUANTUM-2024-01|Project Code: QD-QUANTUM-2024-01|\n",
+ "|11/16/2024|Accommodation|Hilton Tokyo - 5 nights|225,000|JPY|1,530.00|\n",
+ "|Confirmation: HTK-2024-78956|Confirmation: HTK-2024-78956|Confirmation: HTK-2024-78956|Confirmation: HTK-2024-78956|Confirmation: HTK-2024-78956|Confirmation: HTK-2024-78956|\n"
 ]
 }
 ],
@@ -297,7 +320,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "Started parsing the file under job_id 8c953105-3684-4239-bd27-946ddb3a1943\n"
+ "Started parsing the file under job_id 7b0d05bb-947b-4475-8d0f-f10386f7446e\n"
 ]
 }
 ],
@@ -307,7 +330,7 @@
 "\n",
 "withInstructionParsing = LlamaParse(\n",
 " result_type=\"markdown\", parsing_instruction=parsingInstruction\n",
- ").load_data(\"./export_report_document.md\")"
+ ").load_data(\"./expense_report_document.pdf\")"
 ]
 },
 {
@@ -329,50 +352,30 @@
 "- **Start Date:** November 15, 2024\n",
 "- **End Date:** December 10, 2024\n",
 "\n",
- "**Itemized Expenses:**\n",
- "\n",
+ "**Expense Items:**\n",
 "1. **Date:** 11/15/2024\n",
 "- **Category:** Transportation\n",
 "- **Description:** JFK → NRT Business Class\n",
- "- **Original Currency:** USD\n",
- "- **Amount:** $4,250.00\n",
+ "- **Original Amount:** $4,250.00\n",
+ "- **Currency:** USD\n",
+ "- **USD Amount:** $4,250.00\n",
+ "- **Booking Reference:** QF78956 - Corporate Rate Applied\n",
+ "- **Project Code:** QD-QUANTUM-2024-01\n",
 "\n",
 "2. **Date:** 11/16/2024\n",
 "- **Category:** Accommodation\n",
 "- **Description:** Hilton Tokyo - 5 nights\n",
- "- **Original Currency:** JPY\n",
- "- **Amount:** ¥225,000 (Converted Amount: $1,530.00)\n",
- "\n",
- "3. **Date:** 11/17/2024\n",
- "- **Category:** Meals\n",
- "- **Description:** Client Dinner - Sushi Zen\n",
- "- **Original Currency:** JPY\n",
- "- **Amount:** ¥45,600 (Converted Amount: $310.08)\n",
- "\n",
- "4. **Date:** 11/18/2024 to 11/20/2024\n",
- "- **Category:** Conference Registration\n",
- "- **Description:** Quantum Computing Summit\n",
- "- **Original Currency:** USD\n",
- "- **Amount:** $2,500.00\n",
- "\n",
- "5. **Date:** 11/21/2024\n",
- "- **Category:** Transportation\n",
- "- **Description:** NRT → SIN Economy Premium\n",
- "- **Original Currency:** USD\n",
- "- **Amount:** $875.00\n",
- "\n",
- "6. **Date:** 11/22/2024\n",
- "- **Category:** Accommodation\n",
- "- **Description:** Marina Bay Sands - 4 nights\n",
- "- **Original Currency:** SGD\n",
- "- **Amount:** S$1,780 (Converted Amount: $1,317.20)\n",
+ "- **Original Amount:** ¥225,000\n",
+ "- **Currency:** JPY\n",
+ "- **USD Amount:** $1,530.00\n",
+ "- **Confirmation:** HTK-2024-78956\n",
 "\n",
 "**Locations:**\n",
 "- Tokyo, Japan\n",
 "- Singapore\n",
 "- Sydney, Australia\n",
 "\n",
- "**Currency Conversion Rates:**\n",
+ "**Currency Conversion Rates Applied:**\n",
 "- JPY (¥) → USD: 0.0068 (as of 11/15/2024)\n",
 "- SGD (S$) → USD: 0.74 (as of 11/28/2024)\n",
 "- AUD (A$) → USD: 0.65 (as of 12/03/2024)\n"
 ]
 },
@@ -392,6 +395,13 @@
 "Here we identify the PO number, vendor details, shipping terms, and an itemized list of products with quantities and unit prices."
 ]
 },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\"Alt"
+ ]
+ },
 {
 "cell_type": "code",
 "execution_count": null,
@@ -401,13 +411,13 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "Started parsing the file under job_id e7e389d1-bdaa-4d12-8283-b736e21ffe6b\n"
+ "Started parsing the file under job_id b8cb11c3-7dce-4e6a-94bb-1a4e50e45e55\n"
 ]
 }
 ],
 "source": [
 "vanilaParsing = LlamaParse(result_type=\"markdown\").load_data(\n",
- " \"./purchase_order_document.md\"\n",
+ " \"./purchase_order_document.pdf\"\n",
 ")"
 ]
 },
@@ -486,10 +496,7 @@
 "\n",
 "|Line|Part Number|Description|Qty|UOM|Unit Price|Total|\n",
 "|---|---|---|---|---|---|---|\n",
- "|1|QE-MCU-5590|Microcontroller Unit<br/>Rev. B<br/>32-bit, 120MHz, LQFP-144<br/>*Temp Range: -40°C to +85°C<br/>Lot tracking required|500|EA|$12.50|$6,250.00|\n",
- "|2|QE-SENS-789|Temperature Sensor Module<br/>v2.1<br/>Digital Output, I2C Interface<br/>Calibration certificates required|750|EA|$8.75|$6,562.50|\n",
- "|3|QE-CAP-1123|Ceramic Capacitors<br/>Batch: X7R<br/>0.1μF ±5%, 50V<br/>Each box contains 100 units|2000|BOX|$5.25|$10,500.00|\n",
- "|4|QE-PCB-4567|Printed Circuit Board<br/>Ver: 3.0<br/>8-layer, ENIG finish<br/>IPC Class 3 Required|300|EA|$45.00|$13,500.00|\n"
+ "|1|QE-MCU-5590|Microcontroller Unit|500|EA|$12.50|$6,250.00|\n"
 ]
 }
 ],
@@ -506,7 +513,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "Started parsing the file under job_id c336a4aa-1711-4648-8612-f82e63bf4127\n"
+ "Started parsing the file under job_id d2731305-984d-4633-8a52-0493748cf10b\n"
 ]
 }
 ],
@@ -516,7 +523,7 @@
 "\n",
 "withInstructionParsing = LlamaParse(\n",
 " result_type=\"markdown\", parsing_instruction=parsingInstruction\n",
- ").load_data(\"./purchase_order_document.md\")"
+ ").load_data(\"./purchase_order_document.pdf\")"
 ]
 },
 {
@@ -528,56 +535,35 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "**Purchase Order Details:**\n",
+ "Here are the details extracted from the purchase order:\n",
+ "\n",
+ "**PO Number:** PO-2024-GT-9876/REV.2\n",
 "\n",
- "- **PO Number:** PO-2024-GT-9876/REV.2\n",
- "- **Vendor Details:**\n",
+ "**Vendor Details:**\n",
 "- **Vendor Name:** Quantum Electronics Manufacturing\n",
 "- **DUNS:** 78-456-7890\n",
 "- **Tax ID:** EU8976543210\n",
 "- **Address:** Hoofdorp, Netherlands\n",
- "- **Vendor #:** QEM-EU-2024-001\n",
+ "- **Vendor Number:** QEM-EU-2024-001\n",
+ "- **Contact Person:** Sarah Martinez, Receiving Manager\n",
+ "- **Phone:** +1 (512) 555-0123\n",
 "\n",
- "- **Shipping Terms:**\n",
+ "**Shipping Terms:**\n",
 "- **Terms:** DDP (Delivered Duty Paid) - Incoterms 2020\n",
 "- **Insurance Required:** Yes\n",
 "- **Preferred Carrier:** DHL/FedEx\n",
 "- **Required Delivery Date:** 01/15/2025\n",
 "\n",
- "- **Itemized List of Products:**\n",
- "1. **Item 1:**\n",
- "- **Part Number:** QE-MCU-5590\n",
- "- **Description:** Microcontroller Unit, Rev. B, 32-bit, 120MHz, LQFP-144\n",
+ "**Itemized List of Products:**\n",
+ "1. **Part Number:** QE-MCU-5590\n",
+ "- **Description:** Microcontroller Unit\n",
 "- **Quantity:** 500 EA\n",
 "- **Unit Price:** $12.50\n",
- "- **Total Price:** $6,250.00\n",
- "- **Notes:** Temp Range: -40°C to +85°C, Lot tracking required\n",
- "\n",
- "2. **Item 2:**\n",
- "- **Part Number:** QE-SENS-789\n",
- "- **Description:** Temperature Sensor Module, v2.1, Digital Output, I2C Interface\n",
- "- **Quantity:** 750 EA\n",
- "- **Unit Price:** $8.75\n",
- "- **Total Price:** $6,562.50\n",
- "- **Notes:** Calibration certificates required\n",
- "\n",
- "3. **Item 3:**\n",
- "- **Part Number:** QE-CAP-1123\n",
- "- **Description:** Ceramic Capacitors, Batch: X7R, 0.1μF ±5%, 50V\n",
- "- **Quantity:** 2000 BOX\n",
- "- **Unit Price:** $5.25\n",
- "- **Total Price:** $10,500.00\n",
- "- **Notes:** Each box contains 100 units\n",
- "\n",
- "4. **Item 4:**\n",
- "- **Part Number:** QE-PCB-4567\n",
- "- **Description:** Printed Circuit Board, Ver: 3.0, 8-layer, ENIG finish\n",
- "- **Quantity:** 300 EA\n",
- "- **Unit Price:** $45.00\n",
- "- **Total Price:** $13,500.00\n",
- "- **Notes:** IPC Class 3 Required\n",
- "\n",
- "**Payment Terms:** Net 45, 2% discount if paid within 15 days\n",
+ "- **Total:** $6,250.00\n",
+ "\n",
+ "**Payment Terms:**\n",
+ "- Net 45\n",
+ "- 2% discount if paid within 15 days\n",
 "\n",
 "**Special Instructions:**\n",
 "1. All shipments must include Certificate of Conformance\n",
diff --git a/examples/parsing_instructions/purchase_order_document.md b/examples/parsing_instructions/purchase_order_document.md
deleted file mode 100644
index e98e291..0000000
--- a/examples/parsing_instructions/purchase_order_document.md
+++ /dev/null
@@ -1,85 +0,0 @@
-GLOBAL TECH SOLUTIONS, INC.
-PURCHASE ORDER
-Document Reference: PO-2024-GT-9876/REV.2
-[Original: PO-2024-GT-9876]
-Amendment Date: 12/10/2024
-
-VENDOR INFORMATION: SHIP TO:
-Quantum Electronics Manufacturing Global Tech Solutions, Inc.
-DUNS: 78-456-7890 Building 7A, Innovation Park
-Tax ID: EU8976543210 2100 Technology Drive
-Hoofdorp, Netherlands Austin, TX 78701
-Vendor #: QEM-EU-2024-001 USA
-Attn: Sarah Martinez, Receiving Manager
-Tel: +1 (512) 555-0123
-
-PAYMENT TERMS: SHIPPING TERMS:
-Net 45 DDP (Delivered Duty Paid) - Incoterms 2020
-2% discount if paid within 15 days Insurance Required: Yes
-Preferred Carrier: DHL/FedEx
-Required Delivery Date: 01/15/2025
-
-SPECIAL INSTRUCTIONS:
-
-1. All shipments must include Certificate of Conformance
-2. ESD-sensitive items must be properly packaged
-3. Temperature logging required for items marked with \*
-4. Partial shipments accepted with prior approval
-5. Quote PO number on all correspondence
-
----
-
-## ITEM DETAILS:
-
-## Line | Part Number | Description | Qty | UOM | Unit Price | Total
-
-1 | QE-MCU-5590 | Microcontroller Unit | 500 | EA | $12.50 | $6,250.00
-| Rev. B | 32-bit, 120MHz, LQFP-144
-| \*Temp Range: -40°C to +85°C
-| Lot tracking required
-
----
-
-2 | QE-SENS-789 | Temperature Sensor Module | 750 | EA | $8.75 | $6,562.50
-| v2.1 | Digital Output, I2C Interface
-| Calibration certificates required
-
----
-
-3 | QE-CAP-1123 | Ceramic Capacitors | 2000| BOX | $5.25 | $10,500.00
-| Batch: X7R | 0.1µF ±5%, 50V
-| Each box contains 100 units
-
----
-
-4 | QE-PCB-4567 | Printed Circuit Board | 300 | EA | $45.00 | $13,500.00
-| Ver: 3.0 | 8-layer, ENIG finish
-| IPC Class 3 Required
-
----
-
- Subtotal: $36,812.50
- Volume Discount: -$1,840.63 (5%)
- Shipping Est.: $2,500.00
- Insurance: $775.00
- VAT (21%): $8,001.91
- ----------------------------------
- TOTAL: $46,248.78
-
-QUALITY REQUIREMENTS:
-
-- All components must comply with RoHS 3 (EU 2015/863)
-- REACH compliance required
-- ISO 9001:2015 certification required
-- Components must be manufactured within last 12 months
-- Moisture Sensitivity Level (MSL) must be clearly marked
-
-APPROVAL CHAIN:
-Requested by: Michael Chen, Senior Engineer Date: 12/05/2024
-Technical Review: Dr. James Wilson Date: 12/07/2024
-Quality Approval: Maria Garcia Date: 12/08/2024
-Financial Approval: Robert Johnson, CFO Date: 12/09/2024
-
-Document Control: GT-PUR-2024-Q4-0456
-Security Level: Confidential - Level 2
-Generated by: SAP ERP v7.8
diff --git a/examples/parsing_instructions/purchase_order_document.pdf b/examples/parsing_instructions/purchase_order_document.pdf
new file mode 100644
index 0000000..87db840
Binary files /dev/null and b/examples/parsing_instructions/purchase_order_document.pdf differ
diff --git a/examples/parsing_instructions/purchase_order_document.png b/examples/parsing_instructions/purchase_order_document.png
new file mode 100644
index 0000000..b2eb0d7
Binary files /dev/null and b/examples/parsing_instructions/purchase_order_document.png differ