News
Archives Text Discovery Platform prototype launched: Search our Archives’ online collections through AI-transcribed full texts
15 October 2025
The Hunt Institute is excited to share a public prototype of our Archives Text Discovery Platform, which allows users to perform full-text searches of our digitized Archives collections, including typed, printed and, most notably, handwritten documents. Taking advantage of recently developed natural language processing tools and advances in artificial intelligence, we have transcribed page images for nearly all the digital items currently catalogued in our Archives Collections database and made that text discoverable through a simple search interface, now available to the public.
At the time of writing, well over 1,400 folders and items, amounting to more than 77,000 pages, are indexed. A simple keyword search will return results from across the collections, including materials that would not otherwise have been found through their titles or metadata alone.
While OCR (Optical Character Recognition) transcription for printed texts has become common, many archives are still struggling to use it to its full potential, and including handwritten pages at this breadth remains a rare undertaking. Within the botanical community, most automation efforts have so far focused on specimen labels or on single, well-defined collections. Even many modern commercial HTR (Handwritten Text Recognition) systems depend on a large set of already transcribed pages to train their models on the handwriting found within an organization's collections. By contrast, this prototype applies a general-purpose VLM (Vision-Language Model) across all our digitized archival materials, including letters, field notebooks and journals, and makes that text fully searchable.
Searching this data will produce a list of results that contain the keywords or phrases provided. Each result will link to a detail page that includes the digital object's PDF, initially set to the page on which the search terms were found, along with various pieces of metadata about the object and the collection to which it belongs. Links on that page will take users to the corresponding levels of description in our Archives Collections database. The page transcript itself, with search terms highlighted, appears below the PDF frame and can help users locate their keywords in the PDF's page image.
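For readers curious how page-level search of this kind can work, the following is a minimal illustrative sketch in Python. It is not the Institute's implementation: the names Page, search, highlight and pdf_link, the sample transcripts, the example.org address and the "#page=" link convention are all assumptions made for illustration. It shows three ideas described above: matching keywords against per-page transcripts, marking the matched terms in the transcript, and linking to the PDF at the page where the terms were found.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    item_id: str      # hypothetical identifier for a folder or item
    page_number: int  # 1-based page within the item's PDF
    transcript: str   # machine-generated text for that page


def search(pages, query):
    """Return pages whose transcript contains every query term.

    Matching is a simple case-insensitive substring test; a real
    system would more likely use a dedicated full-text index.
    """
    terms = [t.lower() for t in query.split()]
    if not terms:
        return []
    return [p for p in pages if all(t in p.transcript.lower() for t in terms)]


def highlight(transcript, query, open_tag="<mark>", close_tag="</mark>"):
    """Wrap each occurrence of a query term in the given tags."""
    # Longest terms first so that "ferns" is preferred over "fern".
    terms = sorted(set(query.split()), key=len, reverse=True)
    if not terms:
        return transcript
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", transcript)


def pdf_link(base_url, page):
    """Build a link that asks a PDF viewer to open at the matching page.

    The "#page=N" fragment is honored by many browser PDF viewers,
    but its use here is an assumption, not a description of the platform.
    """
    return f"{base_url}/{page.item_id}.pdf#page={page.page_number}"


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Page("item-001", 1, "Dear Sir, the specimens arrived safely."),
        Page("item-001", 2, "Collected ferns near the river."),
        Page("item-002", 5, "Ferns and mosses from the valley."),
    ]
    for hit in search(sample, "ferns"):
        print(pdf_link("https://example.org/pdfs", hit))
        print(highlight(hit.transcript, "ferns"))
```

Storing one record per page, rather than one per document, is what lets a result open the PDF directly at the relevant page instead of at the first page of a long folder.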
The current platform is only a prototype. There may be bugs, the interface may change, and we will likely add more features soon. We are looking into other enhancements, but we did not want development time for additional capabilities to keep the public from using what is already a powerful discovery tool. We have therefore released the prototype in its current state, imperfections and all, and hope to keep it evolving in public.
We would love your feedback as you explore. If you find something particularly interesting with this tool, please let us know about it. You may uncover things that nobody at the Institute even knows about yet! If it leads you to something that becomes integral to your research, we would be delighted to hear about it. We would also like to hear if anything is not working as expected. Please use the Contact Us form and mention "Text Discovery Platform" or "ArchSearch" in your message. We invite you to try out the prototype today and let us know what you find!
About the Hunt Institute for Botanical Documentation
The Hunt Institute for Botanical Documentation, a research division of Carnegie Mellon University, specializes in the history of botany and all aspects of plant science and serves the international scientific community through research and documentation. To this end, the Institute acquires and maintains authoritative collections of books, plant images, manuscripts, portraits and data files, and provides publications and other modes of information service. The Institute meets the reference needs of botanists, biologists, historians, conservationists, librarians, bibliographers and the public at large, especially those concerned with any aspect of the North American flora.
Media Contact:
Scarlett T. Townsend
412-268-7304
st19@andrew.cmu.edu