Turn unstructured clinical health records into structured data for analytics and trials.
A patient's electronic health record often runs to dozens, if not hundreds, of pages of PDFs, with image quality that varies over time.
Extracting named entities, such as disease/diagnosis, signs/symptoms, medications/allergies, lab results, and more, can assist clinical providers and CROs by giving them rapid insight into each patient's individualized medical history, while maintaining privacy and audit requirements.
I added a state-of-the-art transformer model (BERT) and its variants for named entity recognition, significantly improving result quality over the existing Bi-LSTM model that relied on custom ontological features.
I created and evaluated voting mechanisms, using a mixture-of-experts approach, to identify the best way to combine results from multiple models.
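As a rough illustration of what combining model outputs can look like, here is a minimal sketch of weighted per-token voting over BIO tags. It is plain weighted majority voting, not a learned gating network, and not necessarily the mechanism used in this project. The function name `majority_vote`, the tag names, and the three example model outputs are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions, weights=None, default="O"):
    """Combine per-token labels from several models by (weighted) vote.

    predictions: list of label sequences, one per model, all the same length.
    weights:     optional per-model weights (hypothetical; e.g. validation F1).
    default:     label used when the top score is tied.
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same token sequence")
    if weights is None:
        weights = [1.0] * len(predictions)

    combined = []
    for i in range(n_tokens):
        # Accumulate each model's weight on the label it chose for token i.
        scores = Counter()
        for labels, w in zip(predictions, weights):
            scores[labels[i]] += w
        best = max(scores.values())
        winners = [label for label, s in scores.items() if s == best]
        # Fall back to the default label when no single label wins outright.
        combined.append(winners[0] if len(winners) == 1 else default)
    return combined


# Toy outputs from three hypothetical models over the same four tokens.
bert = ["B-MED", "O", "B-DX", "I-DX"]
bilstm = ["B-MED", "O", "O", "I-DX"]
lexicon = ["O", "O", "B-DX", "I-DX"]

print(majority_vote([bert, bilstm, lexicon]))
```

Per-model weights let a stronger model outvote two weaker ones on a given token. Voting per token can also produce invalid BIO sequences (an `I-` tag without a preceding `B-`), so a real pipeline would likely vote on spans or repair the sequence afterwards.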
I examined the effects of the OCR model used for transcribing the scanned PDF content, which often contains structured formatting (multiple columns) that renders standard OCR inapplicable for the clinical use case.
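To show why multi-column layouts matter, here is a minimal sketch of column-aware reading order. It assumes an OCR engine has already produced word-level bounding boxes. The `Word` type, the `reading_order` function, the `min_gap` threshold, and the sample coordinates are all hypothetical, and this is not necessarily the layout-recovery method used in this project.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def reading_order(words, min_gap=20.0):
    """Group words into columns by horizontal gaps, then read each column top-down."""
    if not words:
        return []
    columns = []
    right_edge = None
    for w in sorted(words, key=lambda w: w.x0):
        # Start a new column when this word begins well past everything seen so far.
        if right_edge is None or w.x0 - right_edge > min_gap:
            columns.append([])
            right_edge = w.x1
        else:
            right_edge = max(right_edge, w.x1)
        columns[-1].append(w)

    ordered = []
    for col in columns:
        # Within a column: top-to-bottom, then left-to-right.
        ordered.extend(sorted(col, key=lambda w: (w.y0, w.x0)))
    return [w.text for w in ordered]


# Toy two-column page: medications on the left, allergies on the right.
page = [
    Word("Allergies:", 320, 100, 380, 112),
    Word("Metformin", 105, 100, 160, 112),
    Word("Medications:", 50, 100, 100, 112),
    Word("penicillin", 385, 100, 440, 112),
    Word("500", 50, 120, 75, 132),
    Word("mg", 80, 120, 100, 132),
    Word("rash", 320, 120, 350, 132),
]

print(" ".join(reading_order(page)))
```

A naive top-to-bottom, left-to-right sort of the same boxes would interleave the two columns ("Medications: Metformin Allergies: penicillin 500 mg rash"), attaching the dose to the wrong entity. The sketch only handles clean vertical gutters. Full-width headers, skewed scans, and words whose tops differ slightly within a line would need more robust handling.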
• BERT‑style NER + domain lexicons.
• OCR normalization, layout recovery, and cost-efficiency gains via new methods.
• Curation tooling + drift monitoring.
• Higher precision/recall on key entities.
• Model retraining and deployment in days, not weeks.