Evaluation of the state of the if statement in awk using a second file

微信扫一扫,分享到朋友圈

Evaluation of the state of the if statement in awk using a second file

I am trying to write a bash script that manipulates two files:

file1

Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
2q33.2  2:204611195..204817281  rs3096851   204763882   A>C Alopecia Areata 3.58E-08    1.32    802
2q33.2  2:204611195..204817281  rs1024161   204721752   G>A Alopecia Areata 3.55E-13    1.44    802
2q33.2  2:204611195..204817281  rs231775    204732714   A>G Alopecia Areata 2.20E-20    1.39    802
4q27    4:122982314..123605528  rs7682241   123523875   C>A Alopecia Areata 4.27E-08    1.34    803
4q27    4:122982314..123605528  rs7682481   123524026   G>C Alopecia Areata 4.80E-09    1.23    803
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872

file2

CHR_A         BP_A       SNP_A  CHR_B         BP_B       SNP_B           R2
 2    204721752   rs1024161      2    204732714    rs231775     0.849535
 2    204721752   rs1024161      2    204763882   rs3096851      0.68029
 2    204732714    rs231775      2    204763882   rs3096851     0.739633
 4    123523875   rs7682241      4    123524026   rs7682481            1

I want to read file1, and if column 3 (RsId) value is not present in either column 3 (SNP_A) or column 4 (SNP_B) of file2, write that row to output. I tried the following:

#this executable file is called filter_file.sh

#!/bin/bash
file1=$1
file2=$2
outfile=$3

while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do
cat $file2 | awk "($3!~/$SNP_A|$SNP_B/) {print}"
done < $file1 > $outfile

./filter_file.sh file1 file2 out

The awk statement worked when I tested it on its own but when I added it to the bash while loop it printed all of file1, including the header, four times. What is wrong with my code for this step?

Once this is working, if file1 column 3 (RsId) value is present in column 3 (SNP_A) or column 4 (SNP_B) of file2, I want to write the row to output that has the lowest value for file1 column 7 (PValue).

I am not sure how to start this second part of the task. From reading other awk questions I was thinking I could try an if statement set up something like this:

#!/bin/bash
file=$1
file2=$2
outfile=$3

while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do
cat $file2 | awk "{
if (($3!~/$SNP_A|$SNP_B/))
    print $0;
else
    #Statement that prints only the row with the lowest value for column 7
}"
done < $file1 > $outfile

What are some approaches I can use to do this step?

If people can point to some tutorials that may be helpful for these types of problems, that is greatly appreciated.

The desired outputfile
will look like this (order does not matter):

Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
2q33.2  2:204611195..204817281  rs231775    204732714   A>G Alopecia Areata 2.20E-20    1.39    802
4q27    4:122982314..123605528  rs7682481   123524026   G>C Alopecia Areata 4.80E-09    1.23    803
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872

There’s no need to loop through the file using bash, you can do the whole thing in awk:

$ awk 'NR == FNR && NR > 1 {snp_a[$3]; snp_b[$6]; next}
       FNR > 1 && !($3 in snp_a || $3 in snp_b)' file2 file1
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872

The first block applies to the first file ( file2
) and sets keys in the arrays corresponding to the two columns of interest. next
skips any further commands, so the remainder of the script applies only to the second file ( file1
). Lines are printed when the condition is true, i.e. the line number in the file ( FNR
) is greater than 1 and the key cannot be found in either of the two arrays.

For the second part of your problem, things get a little bit more complicated…hopefully the comments explain things:

$ cat script.awk
# first file, save individual columns and pairs
NR == FNR && NR > 1 {snp_a[$3]; snp_b[$6]; pair[$3,$6]; next}

# second file, print first line
NR != FNR && FNR == 1

# second file, rest of lines
FNR > 1 {
    # print lines which aren't in either array
    if (!($3 in snp_a || $3 in snp_b)) {print}
    # save other lines and corresponding p-value
    else {s[$3] = $0; p[$3] = $8}
}
END {
    # loop through all lines
    for (i in s) {
        # empty the array f
        split("", f)
        # set initial line and min
        line = s[i]
        min = p[i]

        # locate associated lines
        if (i in snp_a) {
            for (j in snp_b) {
                # SUBSEP is a special variable used when combining keys
                # as in first block pair[$3,$6]
                if (i SUBSEP j in pair && j in s && p[j] < min) {
                    min = p[j]
                    line = s[j]
                }
            }
        }
        else if (i in snp_b) {
            for (j in snp_a) {
                if (j SUBSEP i in pair && j in s && p[j] < min) {
                    min = p[j]
                    line = s[j]
                }
            }
        }
        out[line]
    }
    for (i in out) print i
}
$ awk -f script.awk file2 file1
Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872
2q33.2  2:204611195..204817281  rs231775    204732714   A>G Alopecia Areata 2.20E-20    1.39    802
4q27    4:122982314..123605528  rs7682481   123524026   G>C Alopecia Areata 4.80E-09    1.23    803

I’m pretty certain that the logic in the END
block can be simplified but I couldn’t think of a better way to determine the "pairs". Either way, it achieves your desired output.

微信扫一扫,分享到朋友圈

Evaluation of the state of the if statement in awk using a second file

Running ActiveRecord Validations without Backup

上一篇

西安报业传媒集团 数字报刊

下一篇

你也可能喜欢

Evaluation of the state of the if statement in awk using a second file

长按储存图像,分享给朋友