Feature #1544
Changes to match a hyphenated token/word with a header variant that is identical except for the missing hyphen
100%
Description
This is a new feature we would like to add to our pipeline. We have observed a case where a token in the Whirlwind text file is "control-flow" and there is an exactly matching header "control flow", but the two are not matched because one contains a hyphen and the other does not.
We need to make changes to ensure that they are matched with high confidence.
The changes are to be made in the "get_candidates_for_variant" function, inside the nested loops. We will have to add a condition that checks whether the current token contains a hyphen between two words, i.e. follows the format "word1-word2", and whether "word1" is the same as the first word of the header variant.
E.g.
The token is "word1-word2" and the header variant is "word3 word4 word5".
We will check whether "word1 == word3". If it is, we will process the KP so that it becomes "word1 word2" and call the "candidate_variant_distance_calculation" function for it.
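A minimal sketch of the check described above, for illustration only. The helper name "dehyphenate_if_header_match" is hypothetical and not part of the pipeline, and the exact, case-sensitive comparison of "word1" with the first header word is an assumption taken from the "word1 == word3" wording.

```python
import re
from typing import Optional

# Matches tokens of the exact form "word1-word2" (a single hyphen between two words).
_HYPHENATED = re.compile(r"^(\w+)-(\w+)$")


def dehyphenate_if_header_match(token: str, header_variant: str) -> Optional[str]:
    """Return "word1 word2" if the token is "word1-word2" and word1 equals
    the first word of the header variant; otherwise return None.

    Hypothetical helper: the real check would live inside the nested loops
    of get_candidates_for_variant.
    """
    match = _HYPHENATED.match(token)
    if match is None:
        return None  # token does not follow the "word1-word2" format

    header_words = header_variant.split()
    if not header_words:
        return None  # empty header variant, nothing to compare against

    word1, word2 = match.groups()
    if word1 != header_words[0]:
        return None  # first words differ, so leave the token untouched

    return f"{word1} {word2}"
```

For the case from the description, dehyphenate_if_header_match("control-flow", "control flow") gives "control flow", which is the form that would then be passed on for distance calculation.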
Please note that we will also have to change the candidate_variant_distance_calculation function so that the distance calculation uses the "word1 word2" form, while the saved candidates keep the original "word1-word2" format.
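One possible way to keep the two forms separate is sketched below. This is not the actual candidate_variant_distance_calculation: the function name "score_and_save_candidate", its parameters, the candidate dictionary layout, and the use of difflib as the distance measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def score_and_save_candidate(
    token: str,
    header_variant: str,
    candidates: List[Dict[str, object]],
    text_for_distance: Optional[str] = None,
) -> float:
    """Compute a distance against the header variant and save the candidate.

    Hypothetical stand-in for candidate_variant_distance_calculation.
    text_for_distance is the de-hyphenated form ("word1 word2"); when it is
    not given, the token itself is compared.
    """
    compared = text_for_distance if text_for_distance is not None else token

    # Assumed distance measure: 1 - similarity ratio (0.0 means identical).
    distance = 1.0 - SequenceMatcher(None, compared, header_variant).ratio()

    # Save the candidate in its original "word1-word2" form.
    candidates.append(
        {"candidate": token, "variant": header_variant, "distance": distance}
    )
    return distance


candidates: List[Dict[str, object]] = []
score_and_save_candidate(
    "control-flow", "control flow", candidates, text_for_distance="control flow"
)
print(candidates[0]["candidate"])  # the saved candidate keeps its hyphen
```

Passing the comparison text as an optional argument keeps existing callers unchanged: tokens without a hyphen are compared and saved exactly as before.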
Files