I want to find specific strings, and combinations of strings, in one column. Can you help me?
Input:
benign,likely_pathogenic
benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
uncertain_significance,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
uncertain_significance,conflicting_interpretations_of_pathogenicity,likely_benign
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
pathogenic
Output:
benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
benign,likely_pathogenic
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
pathogenic
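I want to pull out every row whose column contains pathogenic or likely_pathogenic. The problem is that pathogenic is also a substring of conflicting_interpretations_of_pathogenicity. I tried: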
awk -F'\t' -v OFS="\t" '{if($14=="pathogenic") print FILENAME,$0; else if($14=="likely_pathogenic") print FILENAME,$0}'
but that only matches when the column is exactly that string.
If I try:
awk -F'\t' -v OFS="\t" '{if($14~"pathogenic") print FILENAME,$0}'
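I get every row that has pathogenic, likely_pathogenic, or conflicting_interpretations_of_pathogenicity. A single row can combine conflicting_interpretations_of_pathogenicity with pathogenic or likely_pathogenic.
Best answer
Maybe something like this: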
awk '{
    split($0,a,/,/)                        # split on commas (use the NEEDED field instead of $0)
    for(i in a)                            # check each comma-separated item
        if(a[i]~/^(likely_)?pathogenic$/) {    # item is exactly pathogenic or likely_pathogenic
            print                          # output the whole record
            break                          # one match is enough
        }
}' file
Some output:
benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
...
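Obviously, you need to set FS and so on, and split $14 instead of $0, since your sample code works on the 14th field.
Edit:
I guess this also works for the posted sample data: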
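As a minimal sketch of that adaptation, assuming tab-separated input with the annotation in field 14 as in the question's own commands, the loop can be pointed at $14. The function name filter_col14 is hypothetical, and this version prints only the record; prepend FILENAME as in the question if you need it.

```shell
# Hypothetical helper: keep rows whose 14th tab-separated field lists
# pathogenic or likely_pathogenic as a whole comma-separated item.
filter_col14() {
  awk -F'\t' '{
    n = split($14, a, /,/)              # split only field 14 on commas
    for (i = 1; i <= n; i++)            # walk the items in order
      if (a[i] ~ /^(likely_)?pathogenic$/) {
        print                           # print the whole row once
        break                           # stop at the first matching item
      }
  }' "$@"
}
```

Call it as filter_col14 file, or pipe data into it. Because each item is tested with ^ and $, conflicting_interpretations_of_pathogenicity on its own never matches.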
$ awk '/(^|,)(likely_)?pathogenic(,|$)/' file
or, for your presumed 14-column data:
$ awk '$14~/(^|,)(likely_)?pathogenic(,|$)/' file
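To see what the (^|,) and (,|$) anchors do, here is a quick check on a few of the posted values. It is only a sketch: the wrapper name has_pathogenic is hypothetical, and the input is a here-document instead of a real file.

```shell
# Hypothetical wrapper around the anchored one-liner: pathogenic or
# likely_pathogenic must start at the line start or right after a comma,
# and end at a comma or the line end.
has_pathogenic() {
  awk '/(^|,)(likely_)?pathogenic(,|$)/' "$@"
}

has_pathogenic <<'EOF'
benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign
likely_benign,conflicting_interpretations_of_pathogenicity
pathogenic
EOF
```

This prints the first, third, and fifth lines. The two conflicting_interpretations_of_pathogenicity lines are dropped because there pathogenic is preceded by an underscore and followed by "ity", so neither anchor is satisfied.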
https://stackoverflow.com/questions/73053519/