Extracting names from file_a using information from 2 columns in file_b
Building on: Extract names from File_B having overlapping intervals with File_A
I want to extract gene names (usually column 10, the text after "Name=") from file_a. For each row of file_b, its first column should be matched against the first column of file_a, and the gene name should be reported if the second column of file_b lies within the gene interval delineated by columns 4 and 5 of file_a.
There are a few problems with the code I have now.
(1) As set up below, the last row of file_b is lost from the output, although this changes if that row (groupVII 17978350) is moved up the list. I would like it to work with the file as it is set up.
(2) The names are truncated if they contain special characters (e.g. colons and hyphens). I would like to have the entire name after the equals sign.
(3) I would like to match each entry/row of file_b to its gene hits in the output, such that the first two columns are the entry and the third column is the gene hit.
file_a.tsv
MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1
file_b.tsv
MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350
code:
while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 { if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); }' <file_a.tsv; done < file_b.tsv > output.tsv
output.tsv
mt
si
zgc
bnip4
Desired_output
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
EXTRA (IF POSSIBLE): Some of the entries in file_b will not land within a gene but may be close to one, say within 100 units on either side. It would be nice to have separate code that lets you specify this proximity, as was attempted here: Extract names from File_B having overlapping intervals with File_A
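For what it's worth, all three points and the proximity idea can probably be handled in one pass by letting awk read both files itself instead of looping in the shell. The sketch below is only a starting point under several assumptions: `find_genes` is a made-up function name, `pad` is a hypothetical proximity parameter (default 0), the interval bounds are treated as inclusive, and a name is assumed to end at the first space, tab, or semicolon after "Name=". Dropping the `while read` loop also sidesteps one possible cause of the lost last row: if file_b.tsv has no trailing newline, `read` returns non-zero on the final line and the loop body is skipped for it.

```shell
# find_genes GENES QUERIES [PAD]
# GENES:   GFF-like file (sequence id in column 1, start/end in columns 4 and 5)
# QUERIES: two-column file (sequence id, position)
# PAD:     optional distance allowed on either side of a gene (assumed default: 0)
find_genes() {
    awk -v pad="${3:-0}" '
        # First file: remember start, end and name of every gene, grouped by sequence id.
        NR == FNR {
            if (match($0, /Name=[^ \t;]+/)) {
                k = ++count[$1]
                start[$1, k] = $4 + 0
                stop[$1, k]  = $5 + 0
                # Skip the 5 characters of "Name=" and keep the rest of the match.
                name[$1, k]  = substr($0, RSTART + 5, RLENGTH - 5)
            }
            next
        }
        # Second file: print the query row plus each gene whose padded interval contains it.
        {
            pos = $2 + 0
            for (k = 1; k <= count[$1]; k++)
                if (pos >= start[$1, k] - pad && pos <= stop[$1, k] + pad)
                    print $1, $2, name[$1, k]
        }
    ' "$1" "$2"
}
```

With the function defined, `find_genes file_a.tsv file_b.tsv > output.tsv` would report positions inside a gene, and `find_genes file_a.tsv file_b.tsv 100` would also accept positions up to 100 units outside either end. Output rows follow the order of file_b, and a position that hits several genes yields one row per gene.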
Any help is VERY much appreciated!
awk sed
asked 3 mins ago by Age87