
Automating the extraction of biomedical concepts from patient records for a biotech start-up

With nearly 80% of its data stored as unstructured text, the medical sector is sitting on a gold mine. Written patient records, prescriptions, doctors' notes, and pathology reports hold the potential to accelerate drug development, improve public health policies, and sustain scientific research. Whereas a human previously needed one hour to screen five documents, it is often claimed that NLP (Natural Language Processing) algorithms can now analyze thousands of documents and extract their key insights in a matter of seconds.

For a biotech start-up, we designed a custom NLP pipeline that extracts biomedical concepts from unstructured patient records. In a fraction of a second, the pipeline identifies relevant keywords in a patient record and maps them to a universal medical database, effectively transforming unstructured records into actionable structured data. The ultimate goal was to significantly improve the matchmaking between patients and clinical trials.



In practice, however, the implementation of NLP techniques is never a straightforward task. Developing a production-ready NLP solution involves addressing recurring issues:

  • Ensuring data sources are available and building robust data pipelines for parsing, cleaning, and enhancing data.

  • Investing significant time and energy in labeling datasets correctly, to effectively train supervised algorithms.

  • Assimilating application-specific vocabulary by training custom machine learning models to enhance classification accuracy and avoid semantic ambiguities.

Beyond the biomedical industry, the effectiveness of NLP in extracting data can also be leveraged in many other sectors.



Starting from the specifications of our client, we crafted the different stages of this custom NLP pipeline:

  • We identified language models suited to processing biomedical texts (such as BioBERT or SciBERT) and integrated them into the well-known spaCy framework.

  • Based on the output of the language model, we designed custom heuristics to identify and extract relevant biomedical concepts.

  • Finally, the processed and structured data is automatically inserted into the client's back-end database.
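
To make the last two stages more concrete, here is a minimal sketch of concept extraction followed by database insertion. It is an illustration under stated assumptions, not the pipeline delivered to the client: the language-model stage (BioBERT or SciBERT in spaCy) is swapped for a plain dictionary lookup so the example runs with the Python standard library alone, the vocabulary and concept IDs are made-up placeholders rather than codes from a real medical terminology, and the table layout and the names `extract_concepts` and `store_concepts` are hypothetical.

```python
import re
import sqlite3

# Hypothetical mini-vocabulary: surface form -> (concept ID, preferred name).
# The IDs are placeholders, not codes from a real medical terminology.
CONCEPTS = {
    "type 2 diabetes": ("C-0001", "Type 2 diabetes mellitus"),
    "t2dm": ("C-0001", "Type 2 diabetes mellitus"),
    "metformin": ("C-0002", "Metformin"),
    "hypertension": ("C-0003", "Hypertension"),
    "high blood pressure": ("C-0003", "Hypertension"),
}

# Longest surface forms first, so multi-word terms win over shorter ones.
_TERMS = sorted(CONCEPTS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in _TERMS) + r")\b",
    re.IGNORECASE,
)


def extract_concepts(text):
    """Return (start, end, concept_id, preferred_name) for each matched term."""
    results = []
    for match in _PATTERN.finditer(text):
        concept_id, name = CONCEPTS[match.group(1).lower()]
        results.append((match.start(), match.end(), concept_id, name))
    return results


def store_concepts(conn, record_id, concepts):
    """Insert the extracted concepts into a (hypothetical) back-end table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record_concepts ("
        "record_id TEXT, start_char INTEGER, end_char INTEGER, "
        "concept_id TEXT, preferred_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO record_concepts VALUES (?, ?, ?, ?, ?)",
        [(record_id, *concept) for concept in concepts],
    )
    conn.commit()


record = "Patient with T2DM and high blood pressure, currently on metformin."
conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
found = extract_concepts(record)
store_concepts(conn, "rec-001", found)
rows = conn.execute(
    "SELECT concept_id FROM record_concepts ORDER BY start_char"
).fetchall()
print(rows)
```

Note that "T2DM" and "high blood pressure" resolve to the same concepts as their longer or more formal synonyms. That normalization step is what turns free text into structured data that can be queried, for instance against the inclusion criteria of a clinical trial.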



Our NLP pipeline achieved good classification accuracy, empowering our client to build a first Minimum Viable Product (MVP) of its biomedical application and improve the matching of patients with the right trials. This pivotal step allowed them to showcase their concept and seek further funding.

However, it became evident that a more extensive labeled dataset was needed to further refine the machine learning model and reach the classification accuracy required for a full-scale production rollout. When designing AI solutions, one should not forget that machine learning models can only be as good as the data with which they are trained. The importance of gathering a sufficiently large training dataset should not be underestimated.
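
One common way to track whether additional labeled data is paying off is to score the extracted concepts against a hand-labeled reference set. The sketch below computes precision, recall, and F1 over (record, concept) pairs. It is a generic illustration, not the evaluation used in this project: the record IDs, concept IDs, and the function name `precision_recall_f1` are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted (record_id, concept_id) pairs against gold labels."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical hand-labeled annotations and pipeline output.
gold = {
    ("rec-001", "C-0001"),
    ("rec-001", "C-0002"),
    ("rec-001", "C-0003"),
    ("rec-002", "C-0003"),
}
predicted = {
    ("rec-001", "C-0001"),
    ("rec-001", "C-0002"),
    ("rec-002", "C-0001"),  # wrong concept for this record
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Re-running such a score on a held-out labeled set each time the training data grows gives a simple indication of whether further labeling effort is still improving the model.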

Intrigued by the potential of NLP to revolutionize your operations?

Reach out, and let's unlock the hidden gems within your data together.
