{"id":24805,"date":"2025-07-02T15:39:00","date_gmt":"2025-07-02T19:39:00","guid":{"rendered":"https:\/\/enterprise-knowledge.com\/?p=24805"},"modified":"2025-11-17T17:19:56","modified_gmt":"2025-11-17T22:19:56","slug":"optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup","status":"publish","type":"post","link":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/","title":{"rendered":"Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup"},"content":{"rendered":"<p>&nbsp;<\/p>\r\n\r\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile is-vertically-aligned-top\" style=\"grid-template-columns: 15% auto;\">\r\n<figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"771\" height=\"771\" class=\"wp-image-11017 size-full\" src=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png\" alt=\"\" srcset=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png 771w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01-336x336.png 336w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01-140x140.png 140w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01-768x768.png 768w\" sizes=\"auto, (max-width: 771px) 100vw, 771px\" \/><\/figure>\r\n<div class=\"wp-block-media-text__content\">\r\n\r\n\r\n<h2 class=\"wp-block-heading\">The Challenge<\/h2>\r\n\r\n\r\n\r\n<p>Enterprise Knowledge (EK) recently worked with a Federally Funded Research and Development Center (FFRDC) that was having difficulty retrieving relevant content in a large volume of archival scientific papers. Researchers were burdened with excessive search times and the potential for knowledge loss when target documents could not be found at all. 
To learn more about the client\u2019s use case and EK\u2019s initial strategy, please see the first blog in the <strong>Optimizing Historical Knowledge Retrieval<\/strong> series: <a href=\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-standardizing-metadata-for-enhanced-research-access\/\">Standardizing Metadata for Enhanced Research Access.<\/a><\/p>\r\n<p>To make these research papers more discoverable, part of EK\u2019s solution was to add \u201cabout-ness\u201d tags to the document metadata through a classification process. Many of the files in this document management system (DMS) were lower-quality PDF scans of older documents, such as typewritten papers and pre-digital technical reports that often included handwritten annotations. To begin classifying the content, the team first needed to transform the scanned PDFs into machine-readable text. EK used an Optical Character Recognition (OCR) tool, which can \u201cread\u201d non-text file formats for recognizable language and convert it into digital text. Even the most advanced OCR tools introduced a significant amount of noise into the text extracted from the archival documents. This frequently manifested as:<\/p>\r\n<ul>\r\n<li>Tables, figures, or handwriting in the document read in as random symbols and white space.<\/li>\r\n<li>Random punctuation inserted where a spot or pen mark appeared on the page, breaking up words and sentences.<\/li>\r\n<li>Excessive or misplaced line breaks separating related content.<\/li>\r\n<li>Other miscellaneous irregularities in the text that make the document less comprehensible.<\/li>\r\n<\/ul>\r\n<p><span style=\"font-weight: 400;\">The first round of text extraction using out-of-the-box OCR capabilities resulted in many of the above issues across the output text files. This starter batch of text extracts was sent to the classification model to be tagged. 
The results were assessed by examining the classifier\u2019s evidence within the document for tagging (or failing to tag) a concept. Through this inspection, the team found that there was enough clutter or inconsistency within the text extracts that some irrelevant concepts were misapplied and other, applicable concepts were being missed entirely. It was clear from the negative impact on classification performance that document comprehension needed to be enhanced.<\/span><\/p>\r\n<table class=\" aligncenter\" style=\"width: 100%; border-collapse: collapse; border-style: solid; border-color: #000000;\">\r\n<tbody>\r\n<tr>\r\n<td style=\"width: 100%;\">\r\n<p><strong>Auto-Classification<\/strong> <br \/>Auto-Classification (also referred to as auto-tagging) is an advanced process that automatically applies relevant terms or labels (tags) from a defined information model (such as a taxonomy) to your data. Read more about Enterprise Knowledge\u2019s auto-tagging solutions here:<\/p>\r\n<ul>\r\n<li><a href=\"https:\/\/enterprise-knowledge.com\/4-steps-content-auto-classification-high-accuracy\/\">4 Steps to Content Auto-Classification with High Accuracy<\/a><\/li>\r\n<li><a href=\"https:\/\/enterprise-knowledge.com\/when-should-my-organization-use-auto-tagging\/\">Expert Analysis: When should my organization use auto-tagging? 
Part One<\/a><\/li>\r\n<li><a href=\"https:\/\/enterprise-knowledge.com\/knowledge-ai-content-recommender-and-chatbot-powered-by-auto-tagging-and-an-enterprise-knowledge-graph\/\">Knowledge AI: Content Recommender and Chatbot Powered by Auto-Tagging and an Enterprise Knowledge Graph<\/a><\/li>\r\n<li><a href=\"https:\/\/enterprise-knowledge.com\/a-guide-to-selecting-the-right-auto-tagging-approach\/\">A Guide to Selecting the Right Auto-Tagging Approach<\/a><\/li>\r\n<\/ul>\r\n<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n\r\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile is-vertically-aligned-top\" style=\"grid-template-columns: 15% auto;\">\r\n<figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"771\" height=\"771\" class=\"wp-image-11018 size-full\" src=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Solution-01.png\" alt=\"\" srcset=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Solution-01.png 771w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Solution-01-336x336.png 336w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Solution-01-140x140.png 140w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Solution-01-768x768.png 768w\" sizes=\"auto, (max-width: 771px) 100vw, 771px\" \/><\/figure>\r\n<div class=\"wp-block-media-text__content\">\r\n<h2 class=\"wp-block-heading\">The Solution<\/h2>\r\n\r\n\r\n\r\n<p><span style=\"font-weight: 400;\">To address this challenge, the team explored several potential solutions for cleaning up the text extracts. However, the team was concerned that direct text manipulation, if applied wholesale to the entire corpus, might lead to the loss of critical information. Rather than modifying the raw text directly, the team decided to leverage a client-side Large Language Model (LLM) to generate additional text based on the extracts. 
The idea was that the LLM might better recognize the noise from OCR processing as irrelevant and produce a refined summary of the text that could be used to improve classification.<\/span><\/p>\r\n<p><span style=\"font-weight: 400;\">The team tested various summarization strategies via careful prompt engineering to generate different kinds of summaries (such as abstractive vs. extractive) of varying lengths and levels of detail. The team performed a human-in-the-loop grading process to manually assess the effectiveness of these different approaches. To determine the prompt to be used in the application, graders evaluated the quality of summaries generated per trial prompt over a sample set of documents with particularly low-quality source PDFs. Evaluation metrics included the complexity of the prompt, summary generation time, human readability, errors, hallucinations, and, of course, the precision of auto-classification results.<\/span><\/p>\r\n<p>\r\n\r\n<\/p>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n\r\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile is-vertically-aligned-top\" style=\"grid-template-columns: 15% auto;\">\r\n<figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"771\" height=\"771\" class=\"wp-image-11019 size-full\" src=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/EK-Difference-01.png\" alt=\"\" srcset=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/EK-Difference-01.png 771w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/EK-Difference-01-336x336.png 336w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/EK-Difference-01-140x140.png 140w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/EK-Difference-01-768x768.png 768w\" sizes=\"auto, (max-width: 771px) 100vw, 771px\" \/><\/figure>\r\n<div class=\"wp-block-media-text__content\">\r\n<h2 class=\"wp-block-heading\">The EK 
Difference<\/h2>\r\n\r\n\r\n\r\n<p><span style=\"font-weight: 400;\">Through this iterative process, the team determined that the most effective summaries for this use case were abstractive summaries (summaries that paraphrase content) of around four complete sentences in length. The selected prompt generated summaries with a sufficient level of detail (for both human readers and the classifier) while maintaining brevity. To improve classification, the LLM-generated summaries are meant to supplement the full text extract, not to replace it. The team incorporated the new summaries into the classification pipeline by creating a new metadata field for the source document. The new \u2018summary\u2019 metadata field was added to the auto-classification submission along with the full text extracts to provide additional clarity and context. This required adjusting classification model configurations, such as the weights (or priority) for the new and existing fields.<\/span><\/p>\r\n<table style=\"width: 100%; border-collapse: collapse; border-style: solid; border-color: #000000;\">\r\n<tbody>\r\n<tr>\r\n<td style=\"width: 100%;\">\r\n<p><strong>Large Language Models (LLMs)<\/strong> <br \/>A Large Language Model is an advanced AI model designed to perform Natural Language Processing (NLP) tasks, including interpreting, translating, predicting, and generating coherent, contextually relevant text. Read more about how Enterprise Knowledge is leveraging LLMs in client solutions here:<\/p>\r\n<ul>\r\n<li><a href=\"https:\/\/enterprise-knowledge.com\/what-is-a-large-language-model-llm\/\">What is a Large Language Model (LLM)?<\/a><\/li>\r\n<li><a href=\"https:\/\/enterprise-knowledge.com\/choosing-the-right-approach-llms-vs-traditional-machine-learning-for-text-summarization\/\">Choosing the Right Approach: LLMs vs. 
Traditional Machine Learning for Text Summarization<\/a><\/li>\r\n<li><a href=\"https:\/\/enterprise-knowledge.com\/the-role-of-semantic-layers-with-llms\/\">The Role of Semantic Layers with LLMs<\/a><\/li>\r\n<\/ul>\r\n<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<p><\/p>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n\r\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile is-vertically-aligned-top is-style-default\" style=\"grid-template-columns: 15% auto;\">\r\n<figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"771\" height=\"771\" class=\"wp-image-11020 size-full\" src=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Results-01.png\" alt=\"\" srcset=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Results-01.png 771w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Results-01-336x336.png 336w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Results-01-140x140.png 140w, https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Results-01-768x768.png 768w\" sizes=\"auto, (max-width: 771px) 100vw, 771px\" \/><\/figure>\r\n<div class=\"wp-block-media-text__content\">\r\n<h2 class=\"wp-block-heading\">The Results<\/h2>\r\n\r\n\r\n\r\n<p>By including the LLM-generated summaries in the classification request, the team was able to provide more context and structure to the existing text. This additional information filled in previous gaps and allowed the classifier to better interpret the content, leading to more precise subject tags than the original OCR text alone had produced. As a bonus, the LLM-generated summaries were also added to the document metadata in the DMS, further improving the discoverability of the archived documents.<\/p>\r\n<p>By leveraging the power of LLMs, the team was able to compensate for noisy OCR output, improving auto-tagging capabilities and further enriching document metadata with content descriptions. 
If your organization is facing similar challenges managing and archiving older or difficult to parse documents, consider how <a href=\"https:\/\/enterprise-knowledge.com\/\">Enterprise Knowledge<\/a> can assist in optimizing your content findability with advanced AI techniques.<\/p>\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n<div class=\"wp-block-stackable-button-group stk-block-button-group stk-block stk-ad7b9e7\" data-block-id=\"ad7b9e7\">\r\n<div class=\"stk-row stk-inner-blocks has-text-align-center stk-block-content stk-button-group\">\r\n<div class=\"wp-block-stackable-button stk-block-button has-text-align-left stk-block stk-9409a5c\" data-block-id=\"9409a5c\"><style>.stk-9409a5c .stk-button{padding-top:6px !important;padding-right:40px !important;padding-bottom:6px !important;padding-left:40px !important;background:#409F67 !important;border-top-left-radius:50px !important;border-top-right-radius:50px !important;border-bottom-right-radius:50px !important;border-bottom-left-radius:50px !important;}<\/style><a class=\"stk-link stk-button stk--hover-effect-darken\" href=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2025\/07\/Optimizing-Historical-Knowledge-Retrieval-Leveraging-an-LLM-for-Content-Cleanup.pdf\" target=\"_blank\" rel=\"noopener\"><span class=\"stk-button__inner-text\">Download Flyer<\/span><\/a><\/div>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n\r\n<div class=\"wp-block-spacer\" style=\"height: 48px;\" aria-hidden=\"true\">\u00a0<\/div>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n\r\n<div class=\"wp-block-stackable-call-to-action alignfull stk-block-call-to-action stk-block stk-771a887 stk-block-background is-style-default\" data-v=\"2\" data-block-id=\"771a887\"><style>.stk-771a887 {background-color:#5A2C85 !important;}.stk-771a887:before{background-color:#5A2C85 !important;}.stk-771a887-container{background-color:#5A2C85 !important;}.stk-771a887-container:before{background-color:#5A2C85 !important;}<\/style>\r\n<div class=\"stk-block-call-to-action__content 
stk-content-align stk-771a887-column alignfull stk-container stk-771a887-container stk-hover-parent\">\r\n<div class=\"has-text-align-center stk-block-content stk-inner-blocks stk-771a887-inner-blocks\">\r\n<div id=\"span-style-color-ffffff-class-stk-highlight-ready-to-get-started-span\" class=\"wp-block-stackable-heading stk-block-heading stk-block-heading--v2 stk-block stk-0e3ba4b\" data-block-id=\"0e3ba4b\"><style>.stk-0e3ba4b {margin-bottom:16px !important;}.stk-0e3ba4b .stk-block-heading__text{font-size:30px !important;color:#ffffff !important;}@media screen and (max-width: 1023px){.stk-0e3ba4b .stk-block-heading__text{font-size:30px !important;}}<\/style>\r\n<h3 class=\"stk-block-heading__text has-text-color has-white-color\"><span class=\"stk-highlight\" style=\"color: #ffffff;\">Ready to Get Started?<\/span><\/h3>\r\n<\/div>\r\n\r\n\r\n\r\n<div class=\"wp-block-stackable-button-group stk-block-button-group stk-block stk-820aa2f\" data-block-id=\"820aa2f\">\r\n<div class=\"stk-row stk-inner-blocks stk-block-content stk-button-group\">\r\n<div class=\"wp-block-stackable-button stk-block-button stk-block stk-d4dce62\" data-block-id=\"d4dce62\"><style>.stk-d4dce62 .stk-button{padding-top:6px !important;padding-right:40px !important;padding-bottom:6px !important;padding-left:40px !important;background:#E7E0F3 !important;border-radius:50px !important;}.stk-d4dce62 .stk-button__inner-text{color:#000000 !important;}<\/style><a class=\"stk-link stk-button stk--hover-effect-darken\" href=\"https:\/\/enterprise-knowledge.com\/contact-us\/\" target=\"_blank\" rel=\"noreferrer noopener\"><span class=\"has-text-color stk-button__inner-text\">Get in Touch<\/span><\/a><\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n","protected":false},"excerpt":{"rendered":"<p>Enterprise Knowledge (EK) recently worked with a Federally Funded Research and Development Center (FFRDC) that was having difficulty retrieving relevant content in a large volume of 
archival scientific papers. Researchers were burdened with excessive search times and the potential for knowledge loss &#8230; <a href=\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/\"  class=\"with-arrow\">Continue reading<\/a><\/p>\n","protected":false},"author":17,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"_uag_custom_page_level_css":"","footnotes":""},"categories":[1282,186],"tags":[310,388,305,304,143,1239,1513,799],"article-type":[721],"solution":[1092],"ppma_author":[1391],"class_list":["post-24805","post","type-post","status-publish","format-standard","hentry","category-ai","category-software-development","tag-ai","tag-artificial-intelligence","tag-auto-classification","tag-auto-tagging","tag-km","tag-llm","tag-prompt-engineering","tag-text-extraction","article-type-case-study","solution-enterprise-ai"],"acf":[],"featured_image_urls_v2":{"full":"","thumbnail":"","medium":"","medium_large":"","large":"","1536x1536":"","2048x2048":"","slideshow":"","slideshow-2x":"","banner":"","home-large":"","home-medium":"","home-small":"","gform-image-choice-sm":"","gform-image-choice-md":"","gform-image-choice-lg":""},"post_excerpt_stackable_v2":"<p>Enterprise Knowledge (EK) recently worked with a Federally Funded Research and Development Center (FFRDC) that was having difficulty retrieving relevant content in a large volume of archival scientific papers. 
Researchers were burdened with excessive search times and the potential for knowledge loss &#8230;<\/p>\n","category_list_v2":"<a href=\"https:\/\/enterprise-knowledge.com\/category\/ai\/\" rel=\"category tag\">Artificial Intelligence<\/a>, <a href=\"https:\/\/enterprise-knowledge.com\/category\/software-development\/\" rel=\"category tag\">Technology Solutions<\/a>","author_info_v2":{"name":"EK Team","url":"https:\/\/enterprise-knowledge.com\/author\/enterprise-knowledge\/"},"comments_num_v2":"0 comments","yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup - Enterprise Knowledge<\/title>\n<meta name=\"description\" content=\"Continuing the Optimizing Historical Knowledge Retrieval series, a Federally Funded Research and Development Center (FFRDC) was having difficulty retrieving relevant content in a large volume of archival scientific papers. EK decided to leverage a client-side Large Language Model (LLM) to clean up the text-extracts.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup - Enterprise Knowledge\" \/>\n<meta property=\"og:description\" content=\"Continuing the Optimizing Historical Knowledge Retrieval series, a Federally Funded Research and Development Center (FFRDC) was having difficulty retrieving relevant content in a large volume of archival scientific papers. 
EK decided to leverage a client-side Large Language Model (LLM) to clean up the text-extracts.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/\" \/>\n<meta property=\"og:site_name\" content=\"Enterprise Knowledge\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Enterprise-Knowledge-359618484181651\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-07-02T19:39:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-17T22:19:56+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png\" \/>\n\t<meta property=\"og:image:width\" content=\"771\" \/>\n\t<meta property=\"og:image:height\" content=\"771\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"EK Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2024\/05\/Marketing_Sales-Strategy-Entity-e1716562850203.png\" \/>\n<meta name=\"twitter:creator\" content=\"@EKConsulting\" \/>\n<meta name=\"twitter:site\" content=\"@EKConsulting\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"EK Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/\"},\"author\":{\"name\":\"EK Team\",\"@id\":\"https:\/\/enterprise-knowledge.com\/#\/schema\/person\/fe4c950023b0a2d4ea9057f16c70a16c\"},\"headline\":\"Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup\",\"datePublished\":\"2025-07-02T19:39:00+00:00\",\"dateModified\":\"2025-11-17T22:19:56+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/\"},\"wordCount\":1009,\"publisher\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png\",\"keywords\":[\"AI\",\"artificial intelligence\",\"auto-classification\",\"auto-tagging\",\"KM\",\"LLM\",\"prompt engineering\",\"text extraction\"],\"articleSection\":[\"Artificial Intelligence\",\"Technology Solutions\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/\",\"url\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/\",\"name\":\"Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup - Enterprise 
Knowledge\",\"isPartOf\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png\",\"datePublished\":\"2025-07-02T19:39:00+00:00\",\"dateModified\":\"2025-11-17T22:19:56+00:00\",\"description\":\"Continuing the Optimizing Historical Knowledge Retrieval series, a Federally Funded Research and Development Center (FFRDC) was having difficulty retrieving relevant content in a large volume of archival scientific papers. EK decided to leverage a client-side Large Language Model (LLM) to clean up the text-extracts.\",\"breadcrumb\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#primaryimage\",\"url\":\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png\",\"contentUrl\":\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png\",\"width\":771,\"height\":771},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"htt
ps:\/\/enterprise-knowledge.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/enterprise-knowledge.com\/#website\",\"url\":\"https:\/\/enterprise-knowledge.com\/\",\"name\":\"Enterprise Knowledge\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/enterprise-knowledge.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/enterprise-knowledge.com\/#organization\",\"name\":\"Enterprise Knowledge\",\"url\":\"https:\/\/enterprise-knowledge.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/enterprise-knowledge.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2013\/09\/favicon.jpg\",\"contentUrl\":\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2013\/09\/favicon.jpg\",\"width\":69,\"height\":69,\"caption\":\"Enterprise Knowledge\"},\"image\":{\"@id\":\"https:\/\/enterprise-knowledge.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Enterprise-Knowledge-359618484181651\/\",\"https:\/\/x.com\/EKConsulting\",\"https:\/\/www.linkedin.com\/company\/enterprise-knowledge-llc\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/enterprise-knowledge.com\/#\/schema\/person\/fe4c950023b0a2d4ea9057f16c70a16c\",\"name\":\"EK 
Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/enterprise-knowledge.com\/#\/schema\/person\/image\/11955f4cea9ef25d7e2fbc5bf76ce329\",\"url\":\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2025\/06\/avatar_user_17_1749066222-96x96.png\",\"contentUrl\":\"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2025\/06\/avatar_user_17_1749066222-96x96.png\",\"caption\":\"EK Team\"},\"description\":\"A services firm that integrates Knowledge Management, Information Management, Information Technology, and Agile Approaches to deliver comprehensive solutions. Our mission is to form true partnerships with our clients, listening and collaborating to create tailored, practical, and results-oriented solutions that enable them to thrive and adapt to changing needs.\",\"url\":\"https:\/\/enterprise-knowledge.com\/author\/enterprise-knowledge\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup - Enterprise Knowledge","description":"Continuing the Optimizing Historical Knowledge Retrieval series, a Federally Funded Research and Development Center (FFRDC) was having difficulty retrieving relevant content in a large volume of archival scientific papers. 
EK decided to leverage a client-side Large Language Model (LLM) to clean up the text-extracts.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/","og_locale":"en_US","og_type":"article","og_title":"Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup - Enterprise Knowledge","og_description":"Continuing the Optimizing Historical Knowledge Retrieval series, a Federally Funded Research and Development Center (FFRDC) was having difficulty retrieving relevant content in a large volume of archival scientific papers. EK decided to leverage a client-side Large Language Model (LLM) to clean up the text-extracts.","og_url":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/","og_site_name":"Enterprise Knowledge","article_publisher":"https:\/\/www.facebook.com\/Enterprise-Knowledge-359618484181651\/","article_published_time":"2025-07-02T19:39:00+00:00","article_modified_time":"2025-11-17T22:19:56+00:00","og_image":[{"width":771,"height":771,"url":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png","type":"image\/png"}],"author":"EK Team","twitter_card":"summary_large_image","twitter_image":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2024\/05\/Marketing_Sales-Strategy-Entity-e1716562850203.png","twitter_creator":"@EKConsulting","twitter_site":"@EKConsulting","twitter_misc":{"Written by":"EK Team","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#article","isPartOf":{"@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/"},"author":{"name":"EK Team","@id":"https:\/\/enterprise-knowledge.com\/#\/schema\/person\/fe4c950023b0a2d4ea9057f16c70a16c"},"headline":"Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup","datePublished":"2025-07-02T19:39:00+00:00","dateModified":"2025-11-17T22:19:56+00:00","mainEntityOfPage":{"@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/"},"wordCount":1009,"publisher":{"@id":"https:\/\/enterprise-knowledge.com\/#organization"},"image":{"@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#primaryimage"},"thumbnailUrl":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png","keywords":["AI","artificial intelligence","auto-classification","auto-tagging","KM","LLM","prompt engineering","text extraction"],"articleSection":["Artificial Intelligence","Technology Solutions"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/","url":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/","name":"Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup - Enterprise 
Knowledge","isPartOf":{"@id":"https:\/\/enterprise-knowledge.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#primaryimage"},"image":{"@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#primaryimage"},"thumbnailUrl":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png","datePublished":"2025-07-02T19:39:00+00:00","dateModified":"2025-11-17T22:19:56+00:00","description":"Continuing the Optimizing Historical Knowledge Retrieval series, a Federally Funded Research and Development Center (FFRDC) was having difficulty retrieving relevant content in a large volume of archival scientific papers. EK decided to leverage a client-side Large Language Model (LLM) to clean up the text-extracts.","breadcrumb":{"@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#primaryimage","url":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png","contentUrl":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2020\/04\/Challenge-01.png","width":771,"height":771},{"@type":"BreadcrumbList","@id":"https:\/\/enterprise-knowledge.com\/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/enterprise-knowledge.com\/"},{"@type":"ListItem","position":2,"name":"Optimizing Historical 
Knowledge Retrieval: Leveraging an LLM for Content Cleanup"}]},{"@type":"WebSite","@id":"https:\/\/enterprise-knowledge.com\/#website","url":"https:\/\/enterprise-knowledge.com\/","name":"Enterprise Knowledge","description":"","publisher":{"@id":"https:\/\/enterprise-knowledge.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/enterprise-knowledge.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/enterprise-knowledge.com\/#organization","name":"Enterprise Knowledge","url":"https:\/\/enterprise-knowledge.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/enterprise-knowledge.com\/#\/schema\/logo\/image\/","url":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2013\/09\/favicon.jpg","contentUrl":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2013\/09\/favicon.jpg","width":69,"height":69,"caption":"Enterprise Knowledge"},"image":{"@id":"https:\/\/enterprise-knowledge.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Enterprise-Knowledge-359618484181651\/","https:\/\/x.com\/EKConsulting","https:\/\/www.linkedin.com\/company\/enterprise-knowledge-llc"]},{"@type":"Person","@id":"https:\/\/enterprise-knowledge.com\/#\/schema\/person\/fe4c950023b0a2d4ea9057f16c70a16c","name":"EK Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/enterprise-knowledge.com\/#\/schema\/person\/image\/11955f4cea9ef25d7e2fbc5bf76ce329","url":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2025\/06\/avatar_user_17_1749066222-96x96.png","contentUrl":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2025\/06\/avatar_user_17_1749066222-96x96.png","caption":"EK Team"},"description":"A services firm that integrates Knowledge Management, Information Management, Information 
Technology, and Agile Approaches to deliver comprehensive solutions. Our mission is to form true partnerships with our clients, listening and collaborating to create tailored, practical, and results-oriented solutions that enable them to thrive and adapt to changing needs.","url":"https:\/\/enterprise-knowledge.com\/author\/enterprise-knowledge\/"}]}},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"slideshow":false,"slideshow-2x":false,"banner":false,"home-large":false,"home-medium":false,"home-small":false,"gform-image-choice-sm":false,"gform-image-choice-md":false,"gform-image-choice-lg":false},"uagb_author_info":{"display_name":"EK Team","author_link":"https:\/\/enterprise-knowledge.com\/author\/enterprise-knowledge\/"},"uagb_comment_info":0,"uagb_excerpt":"Enterprise Knowledge (EK) recently worked with a Federally Funded Research and Development Center (FFRDC) that was having difficulty retrieving relevant content in a large volume of archival scientific papers. Researchers were burdened with excessive search times and the potential for knowledge loss ... Continue reading","authors":[{"term_id":1391,"user_id":17,"is_guest":0,"slug":"enterprise-knowledge","display_name":"EK Team","avatar_url":"https:\/\/enterprise-knowledge.com\/wp-content\/uploads\/2025\/06\/avatar_user_17_1749066222-96x96.png","first_name":"EK","last_name":"Team","user_url":"","job_title":"","description":"A services firm that integrates Knowledge Management, Information Management, Information Technology, and Agile Approaches to deliver comprehensive solutions. 
Our mission is to form true partnerships with our clients, listening and collaborating to create tailored, practical, and results-oriented solutions that enable them to thrive and adapt to changing needs."}],"_links":{"self":[{"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/posts\/24805","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/comments?post=24805"}],"version-history":[{"count":7,"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/posts\/24805\/revisions"}],"predecessor-version":[{"id":26058,"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/posts\/24805\/revisions\/26058"}],"wp:attachment":[{"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/media?parent=24805"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/categories?post=24805"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/tags?post=24805"},{"taxonomy":"article-type","embeddable":true,"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/article-type?post=24805"},{"taxonomy":"solution","embeddable":true,"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/solution?post=24805"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/enterprise-knowledge.com\/wp-json\/wp\/v2\/ppma_author?post=24805"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}