Medical-Knowledge-Graph-Construction-Method-Based-on-EMR

When it comes to the construction of medical knowledge gragh, the step of Named Entity Recgonition is necessary. This artical is my own understanding of the whole process on how to construct a complete and useful medical kg based on the research of our research team.

Research and Implementation of Medical Knowledge Graph Construction Method Based on Electronic Medical Records

Summary
Named Entity Recogonition
Learning Relationships between Entities
Store and Application of Medical KG

Summary

The demand for medical clinical decision support and auxiliary diagnostic systems has increased significantly. Existing platforms or systems rely on knowledge bases that are manually edited by a large number of professionals or are derived using simple statistics. Constructing medical knowledge graph can extract medical knowledge from massive data and manage, share, and apply it reasonably and efficiently.

Electronic medical records generally consist of course records, examination results, medical records, nursing records, etc.. Combining recurrent neural network and conditional random field models for Chinese named entity recognition from tens of thousands of processed electronic medical records to extract medical concepts. based on the result use the maximum likelihood estimation of two probability models to automatically construct knowledge graphs: naive Bayes and a Bayesian network using noisy OR gates. The constructed medical knowledgegraphs were derived from the learned parameters and stored using the Neo4j graph database, and were evaluated based on the content extracted from the Baidu Encyclopedia and cleaned manually.

Named Entities Recogonition

1.Basic Concepts

In recent years, deep learning methods based on neural networks have made considerable progress in the field of natural language processing. As the basic task in the field of natural language processing - Named Entity Recognition (NER) is no exception, the neural network structure has also achieved good results in NER. Named entity recognition is to find the relevant entity from a natural language text and mark its location and type. It is the basis for some complex tasks in the field of natural language processing (such as relationship extraction, information retrieval, etc.).

In traditional machine learning, the Conditional Random Field (CRF) is the current mainstream model of NER. Its objective function not only considers the state feature function of the input, but also includes the tag transfer feature function. The model parameters can be learned using a stochastic gradient descent method during training. When the model is known, it is a dynamic programming problem to obtain the predicted output sequence of the input sequence, that is, to optimize the objective function. It can be decoded using the Viterbi algorithm. Next, we will show how to use neural networks for named entity recognition.

2.Design of Combined CRF and RNN for NER
CRF provides a flexible and globally optimal annotation framework for named entity recognition, but at the same time there is a problem of slow convergence and long training time. In addition, the application of the CRF model requires a lot of features to complete the learning, but these features need to be manually extracted according to different scenarios. For example, the feature of extracting a person’s name may often look at whether the first word of the word is a family name or the like. The key question is how to extract features, how to know which features are extracted, how many features to extract, and so on. Therefore, feature engineering is a cumbersome problem. The extraction of features is mostly based on experience, and the whole process is very complicated. Fortunately, deep learning is now developing rapidly, in addition to its excellent performance, it also eliminates the hassle of manually extracting features.

2.1 Word2Vec Training
Word2vec word vector training process:
a) Wiki Chinese data preprocessing
First, download the wiki Chinese data: zhwiki-latest-pages-articles.xml.bz2. Because the zhwiki data contains many traditional characters, so we want to get the simplified corpus, then we need to do the following two things:
Obtain raw text data from bz2 using WikiCorpus in the gensim module;
Use OpenCC to convert traditional characters to simplified characters.
b) Text data segmentation
Prepare a stop word dictionary and remove the interference from stop words during training.
c) Gensim word2vec training
Using the word2vec training of the gensim toolkit, it is simple and fast, and the effect is better than Google’s word2vec. The size of the word vector in this training is 300, the training window is 5, the minimum word frequency is 5, and multi-threading is used. After the word2vec training is completed, the desired model and word vector are obtained and saved to the local.

2.2 Solution
The algorithm implements the application of bidirectional cyclic neural network (Bi-RNN) + conditional random field (CRF) on NER. Anyone who knows deep learning knows that putting data into a neural network and outputting the results we want, whether the problem we want to deal with is classification, regression, or sequence problems, is essentially the learning of features through neural networks and then The result of the feature calculation, which is exactly what this algorithm needs. As long as the neural network selects the features and then puts the features into the CRF, this is the essence of the entire algorithm.

Therefore, the combined model of the cyclic neural network and the conditional random field used in this study essentially converts the text into the word2vec word vector that holds the semantic information as input, and then uses the cyclic neural network to extract the semantic features of the word vector. The output layer of the network combined with the CRF probability discriminant model uses the extracted features to predict the label of each entity word. The neural network structure of the algorithm mainly includes the input layer, the word vector embedding layer, the RNN hidden layer and the CRF layer.

2.3 Whole Process
a) Wikipedia article data processing
b) Word2vec word vector training and saving
c) Electronic medical record text pre-processing and data preparation
d) Word2vec word vector mapping and obtaining training data information
e) Build and configure neural networks and training networks
f) Test data is predicted and outputted using a trained neural network

Learning Relationships between Entities

1.Basic Concepts

The naive Bayes classifier is a weak classifier based on Bayes’ theorem. All naive Bayes classifiers assume that each feature of the sample is not related to other features.

For example, if a fruit has characteristics such as red, round, and a diameter of about 3 inches, the fruit can be judged to be an apple. Although these features are interdependent or some are determined by other features, the Naive Bayes classifier considers these attributes to be independent in determining the probability distribution of whether the fruit is apple. The naive Bayes classifier is easy to set up and is especially well suited for large data sets. As we all know, this is an efficient classification method that outperforms many complex algorithms.

2.Solution
Learning medical relationships and forming a knowledge map consists of three main steps. First, the positively mentioned diseases, symptoms, and examinations are extracted from the Chinese electronic medical record text identified by the named entity. Second, learn statistical models of relationships between diseases and symptoms, diseases, and examinations. Third, the statistical model of learning is translated into a knowledge map. The map is composed of medical concepts (diseases, symptoms, and examinations) as nodes, the relationship between disease and symptoms, and the relationship between disease and examination as edges.

3.Whole Process
a) Entity concept extraction and preparation
b) Learning the weight of entity relationships
c) Forming a knowledge graph

4.Entity relationship learning based on naive Bayesian model
a) Entity concepts such as diseases, symptoms, and examinations that have been identified in each medical record in a non-negative range are extracted.
b) The extracted entities form a co-occurrence matrix of patient-disease, patient-symptom and patient-examination.
c) First set the threshold of the number of co-occurrences of the disease entity and the symptom or check entity as a test for denoising measures, and then calculate all the results according to the co-occurrence matrix and calculate the value of the conditional probability in each pair of disease-symptoms or diseases-checked IMPT, thereby calculating A value corresponding to the measure of importance of each pair of disease-symptoms or diseases-examination.
d) A knowledge graph of the relationship between each disease formation and disease-symptoms and disease-examination.

5.Entity relationship learning based on NoisyOR model
a) Entity concepts such as diseases, symptoms, and examinations that have been identified in each medical record in a non-negative range are extracted.
b) The extracted entities form a co-occurrence matrix of patient-disease, patient-symptom and patient-examination.
c) All results were counted according to the co-occurrence matrix and the values of the conditional probabilities in each pair of disease-symptoms or diseases-tested IMPT were calculated to calculate values corresponding to the importance measure for each pair of disease-symptoms or disease-examinations.
d) A knowledge graph of the relationship between each disease formation and disease-symptoms and disease-examination.

Store and Application of Medical KG

1.Store of constructed knowledge graph
Although the storage of knowledge graph can be stored in the simplest files or general relational databases, in order to make the maps have high interpretability and intuitive understanding, and to facilitate subsequent research, the most appropriate is of course the use of special The graph database is stored. This article selects the most popular NoSQL graph database: Neo4j.

2.Application of constructed graph
The simple application of knowledge inference based on constructed medical knowledge maps is based on the diagnosis of multiple diseases for a given number of symptoms, based on the constructed disease-symptom relationship map to find the most probable or probabilistic row from the set of possible diseases. In the first few disease sets, because the disease is the cause and the symptoms are fruitful, the map-based disease diagnosis is actually caused by the fruit. This is essentially in line with Bayesian statistical theory.