The mwtab Tutorial

The mwtab package provides classes and other facilities for downloading, parsing, accessing, and manipulating data stored in either the mwTab or JSON representation of mwTab files.

The mwtab package also provides a simple command-line interface to convert between mwTab and JSON representations, download entries from Metabolomics Workbench, access the MW REST interface, validate the consistency of mwTab files, and extract metadata and metabolites from these files.

Brief mwTab Format Overview

Note

For the full official specification, see the following link (mwTab file specification): http://www.metabolomicsworkbench.org/data/tutorials.php

The mwTab formatted files consist of multiple blocks. Each new block starts with #.

  • Some of the blocks contain only “key-value”-like pairs.
#METABOLOMICS WORKBENCH STUDY_ID:ST000001 ANALYSIS_ID:AN000001
VERSION              1
CREATED_ON           2016-09-17
#PROJECT
PR:PROJECT_TITLE                     FatB Gene Project
PR:PROJECT_TYPE                      Genotype treatment
PR:PROJECT_SUMMARY                   Experiment to test the consequence of a mutation at the FatB gene (At1g08510)
PR:PROJECT_SUMMARY                   the wound-response of Arabidopsis

Note

*_SUMMARY “key-value”-like pairs typically span multiple lines.
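
How repeated keys such as PR:PROJECT_SUMMARY can collapse into a single value is easiest to see in a short sketch. This is not the mwtab parser: parse_key_value_lines is a hypothetical helper that assumes each line holds a prefixed key, a run of whitespace, and then the value, and that continuation lines are joined with a single space.

```python
import re
from collections import OrderedDict


def parse_key_value_lines(lines):
    """Parse "PREFIX:KEY   value" lines; repeated keys are joined with a space."""
    block = OrderedDict()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and block headers
        # Split on the first run of whitespace: key on the left, value on the right.
        parts = re.split(r"\s+", line, maxsplit=1)
        key = parts[0].split(":", 1)[-1]  # drop a "PR:"-style prefix if present
        value = parts[1] if len(parts) > 1 else ""
        if key in block:
            block[key] = block[key] + " " + value  # continuation of an earlier line
        else:
            block[key] = value
    return block


example = [
    "#PROJECT",
    "PR:PROJECT_TITLE                     FatB Gene Project",
    "PR:PROJECT_TYPE                      Genotype treatment",
    "PR:PROJECT_SUMMARY                   Experiment to test the consequence of a mutation at the FatB gene (At1g08510)",
    "PR:PROJECT_SUMMARY                   the wound-response of Arabidopsis",
]
project = parse_key_value_lines(example)
print(project["PROJECT_SUMMARY"])
```

The two PR:PROJECT_SUMMARY lines end up as one PROJECT_SUMMARY entry, which mirrors how the single STUDY_SUMMARY string appears in the JSON output later in this tutorial.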

  • #SUBJECT_SAMPLE_FACTORS block is specially formatted, i.e. it contains header specification and tab-separated values.
#SUBJECT_SAMPLE_FACTORS:             SUBJECT(optional)[tab]SAMPLE[tab]FACTORS(NAME:VALUE pairs separated by |)[tab]Additional sample data
SUBJECT_SAMPLE_FACTORS               -       LabF_115873     Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded
SUBJECT_SAMPLE_FACTORS               -       LabF_115878     Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded
SUBJECT_SAMPLE_FACTORS               -       LabF_115883     Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded
SUBJECT_SAMPLE_FACTORS               -       LabF_115888     Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded
SUBJECT_SAMPLE_FACTORS               -       LabF_115893     Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded
SUBJECT_SAMPLE_FACTORS               -       LabF_115898     Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded
  • #MS_METABOLITE_DATA (results) block contains Samples identifiers, Factors identifiers as well as tab-separated data between *_START and *_END.
#MS_METABOLITE_DATA
MS_METABOLITE_DATA:UNITS     Peak height
MS_METABOLITE_DATA_START
Samples      LabF_115904     LabF_115909     LabF_115914     LabF_115919     LabF_115924     LabF_115929     LabF_115842     LabF_115847     LabF_115852     LabF_115857     LabF_115862     LabF_115867     LabF_115873     LabF_115878     LabF_115883     LabF_115888     LabF_115893     LabF_115898     LabF_115811     LabF_115816     LabF_115821     LabF_115826     LabF_115831     LabF_115836
Factors      Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded     Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded     Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded     Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded     Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded     Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded     Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded   Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded   Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded   Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded   Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded   Arabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded   Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded        Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded        Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded        Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded        Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded        Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded        Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded      Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded      Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded      Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded      Arabidopsis Genotype:Wassilewskija 
(Ws) | Plant Wounding Treatment:Wounded      Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded
1_2_4-benzenetriol   1874.0000       3566.0000       1945.0000       1456.0000       2004.0000       1995.0000       4040.0000       2432.0000       2189.0000       1931.0000       1307.0000       2880.0000       2218.0000       1754.0000       1369.0000       1201.0000       3324.0000       1355.0000       2257.0000       1718.0000       1740.0000       3472.0000       2054.0000       1367.0000
1-monostearin        987.0000        450.0000        1910.0000       549.0000        1032.0000       902.0000        393.0000        705.0000        100.0000        481.0000        265.0000        120.0000        1185.0000       867.0000        676.0000        569.0000        579.0000        387.0000        1035.0000       789.0000        875.0000        224.0000        641.0000        693.0000
...
MS_METABOLITE_DATA_END
  • #METABOLITES metadata block contains a header specifying fields and tab-separated data between *_START and *_END.
#METABOLITES
METABOLITES_START
metabolite_name      moverz_quant    ri      ri_type pubchem_id      inchi_key       kegg_id other_id        other_id_type
1,2,4-benzenetriol   239     522741  Fiehn   10787           C02814  205673  BinBase
1-monostearin        399     959625  Fiehn   107036          D01947  202835  BinBase
2-hydroxyvaleric acid        131     310750  Fiehn   98009                   218773  BinBase
3-phosphoglycerate   299     611619  Fiehn   724             C00597  217821  BinBase
...
METABOLITES_END
  • #NMR_BINNED_DATA data block contains a header specifying the sample columns and tab-separated data between *_START and *_END.
#NMR_BINNED_DATA
NMR_BINNED_DATA_START
Bin range(ppm)       CDC029  CDC030  CDC032  CPL101  CPL102  CPL103  CPL201  CPL202  CPL203  CDS039  CDS052  CDS054
0.50...0.56  0.00058149      1.6592  0.039301        0       0       0       0.034018        0.0028746       0.0021478       0.013387        0       0
0.56...0.58  0       0.74267 0       0.007206        0       0       0       0       0       0       0       0.0069721
0.58...0.60  0.051165        0.8258  0.089149        0.060972        0.026307        0.045697        0.069541        0       0       0.14516 0.057489        0.042255
...
NMR_BINNED_DATA_END
  • Order of metadata and data blocks (MS)
#METABOLOMICS WORKBENCH
VERSION              1
CREATED_ON           2016-09-17
...
#PROJECT
...
#STUDY
...
#SUBJECT
...
#SUBJECT_SAMPLE_FACTORS:             SUBJECT(optional)[tab]SAMPLE[tab]FACTORS(NAME:VALUE pairs separated by |)[tab]Additional sample data
...
#COLLECTION
...
#TREATMENT
...
#SAMPLEPREP
...
#CHROMATOGRAPHY
...
#ANALYSIS
...
#MS
...
#MS_METABOLITE_DATA
MS_METABOLITE_DATA:UNITS     peak area
MS_METABOLITE_DATA_START
...
MS_METABOLITE_DATA_END
#METABOLITES
METABOLITES_START
...
METABOLITES_END
#END
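
The block layout described in this overview can be sketched in a few lines of plain Python. The helpers below (block_name, split_blocks, table_rows, parse_factors) are hypothetical and only illustrate the format; they are not how the mwtab package is implemented, and they assume real tab characters between fields.

```python
from collections import OrderedDict


def block_name(header_line):
    """Return the block name from a "#..." header line."""
    header = header_line.lstrip("#").strip()
    if header.startswith("METABOLOMICS WORKBENCH"):
        return "METABOLOMICS WORKBENCH"  # this header also carries STUDY_ID/ANALYSIS_ID
    return header.split(":", 1)[0].strip()


def split_blocks(text):
    """Group non-empty lines under the most recent block header; stop at #END."""
    blocks = OrderedDict()
    current = None
    for line in text.splitlines():
        if line.startswith("#"):
            name = block_name(line)
            if name == "END":
                break
            current = blocks.setdefault(name, [])
        elif current is not None and line.strip():
            current.append(line)
    return blocks


def table_rows(lines, section):
    """Return the tab-split rows between SECTION_START and SECTION_END."""
    start = lines.index(section + "_START") + 1
    end = lines.index(section + "_END")
    return [line.split("\t") for line in lines[start:end]]


def parse_factors(factors):
    """Turn "NAME:VALUE | NAME:VALUE" into a dict."""
    pairs = (item.split(":", 1) for item in factors.split("|"))
    return {name.strip(): value.strip() for name, value in pairs}


sample = "\n".join([
    "#METABOLOMICS WORKBENCH STUDY_ID:ST000001 ANALYSIS_ID:AN000001",
    "VERSION\t1",
    "#PROJECT",
    "PR:PROJECT_TITLE\tFatB Gene Project",
    "#MS_METABOLITE_DATA",
    "MS_METABOLITE_DATA:UNITS\tPeak height",
    "MS_METABOLITE_DATA_START",
    "Samples\tLabF_115904\tLabF_115909",
    "1-monostearin\t987.0000\t450.0000",
    "MS_METABOLITE_DATA_END",
    "#END",
])

blocks = split_blocks(sample)
rows = table_rows(blocks["MS_METABOLITE_DATA"], "MS_METABOLITE_DATA")
factors = parse_factors(
    "Arabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded"
)
print(list(blocks))
print(rows)
print(factors)
```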

Using mwtab as a Library

Importing mwtab Package

If the mwtab package is installed on the system, it can be imported:

[1]:
import mwtab

Constructing MWTabFile Generator

The fileio module provides the read_files() generator function, which yields MWTabFile instances. To construct a MWTabFile generator, pass one or more sources: the path to a local mwTab file, a directory of files, an archive of files, an ANALYSIS_ID, or a REST URL:

[2]:
import mwtab

mwfile_gen = mwtab.read_files("ST000017_AN000035.txt")  # single mwTab file
mwfiles_gen = mwtab.read_files("ST000017_AN000035.txt", "ST000040_AN000060.json")  # several mwTab files
mwdir_gen = mwtab.read_files("mwfiles_dir_mwtab")  # directory of mwTab files
mwzip_gen = mwtab.read_files("mwfiles_mwtab.zip")  # archive of mwTab files
mwanalysis_gen = mwtab.read_files("35", "60")       # ANALYSIS_ID of mwTab files
# REST callable url of mwTab file
mwurl_gen = mwtab.read_files("https://www.metabolomicsworkbench.org/rest/study/analysis_id/AN000035/mwtab/txt")
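
The cell above suggests that a bare number such as "35" refers to the same file as the AN000035 REST URL. A minimal sketch of that relationship follows; analysis_url is a hypothetical helper, and the six-digit zero padding is an assumption inferred from the identifiers shown in this tutorial, not a documented rule of the package.

```python
MW_REST = "https://www.metabolomicsworkbench.org/rest/"


def analysis_url(analysis_id, base_url=MW_REST):
    """Build the mwTab REST URL for an analysis given "35" or "AN000035"."""
    text = str(analysis_id).strip().upper()
    if text.startswith("AN"):
        text = text[2:]  # accept both the short and the full form
    if not text.isdigit():
        raise ValueError("not an analysis ID: {!r}".format(analysis_id))
    # Assumption: analysis IDs are "AN" followed by six zero-padded digits.
    return "{}study/analysis_id/AN{:06d}/mwtab/txt".format(base_url, int(text))


url_short = analysis_url("35")
url_full = analysis_url("AN000035")
print(url_short)
```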

Processing MWTabFile Generator

The MWTabFile generator can be processed in several ways:

  • Feed it to a for-loop and process one file at a time:
[3]:
for mwfile in mwtab.read_files("35", "60"):
    print("STUDY_ID:", mwfile.study_id)       # print STUDY_ID
    print("ANALYSIS_ID", mwfile.analysis_id)  # print ANALYSIS_ID
    print("SOURCE", mwfile.source)            # print source
    for block_name in mwfile:                 # print names of blocks
        print("\t", block_name)
STUDY_ID: ST000017
ANALYSIS_ID AN000035
SOURCE https://www.metabolomicsworkbench.org/rest/study/analysis_id/AN000035/mwtab/txt
         METABOLOMICS WORKBENCH
         PROJECT
         STUDY
         SUBJECT
         SUBJECT_SAMPLE_FACTORS
         COLLECTION
         TREATMENT
         SAMPLEPREP
         CHROMATOGRAPHY
         ANALYSIS
         MS
         MS_METABOLITE_DATA
STUDY_ID: ST000040
ANALYSIS_ID AN000060
SOURCE https://www.metabolomicsworkbench.org/rest/study/analysis_id/AN000060/mwtab/txt
         METABOLOMICS WORKBENCH
         PROJECT
         STUDY
         SUBJECT
         SUBJECT_SAMPLE_FACTORS
         COLLECTION
         TREATMENT
         SAMPLEPREP
         CHROMATOGRAPHY
         ANALYSIS
         MS
         MS_METABOLITE_DATA

Note

Once the generator is consumed, it becomes empty and needs to be created again.

  • Since the MWTabFile generator behaves like an iterator, we can call the next() built-in function:
[4]:
mwfiles_generator = mwtab.read_files("35", "60")

mwfile1 = next(mwfiles_generator)
mwfile2 = next(mwfiles_generator)

Note

Once the generator is consumed, StopIteration will be raised.

  • Convert the MWTabFile generator into a list of MWTabFile objects:
[5]:
mwfiles_generator = mwtab.read_files("35", "60")
mwfiles_list = list(mwfiles_generator)

Warning

Do not convert the MWTabFile generator into a list if the generator can yield a large number of files, e.g. several thousand, otherwise it can consume all available memory.
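
The generator behaviour described in the notes above is ordinary Python and can be tried without any mwTab files. In this sketch fake_read_files is a hypothetical stand-in for read_files() that yields placeholder records, and in_batches is one possible way to keep memory bounded instead of calling list() on a very large generator.

```python
from itertools import islice


def fake_read_files(*sources):
    """Stand-in generator: yields one placeholder record per source."""
    for source in sources:
        yield {"source": source}


gen = fake_read_files("35", "60")
first = next(gen)
second = next(gen)
exhausted = next(gen, None)  # a default value avoids the StopIteration exception


def in_batches(iterable, size):
    """Yield lists of at most `size` items so only one batch is held in memory."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder source names, used only to show the batching.
batch_sizes = [len(batch) for batch in in_batches(fake_read_files("a", "b", "c", "d", "e"), 2)]
print(batch_sizes)
```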

Accessing Data From a Single MWTabFile

Since a MWTabFile is a Python collections.OrderedDict, data can be accessed and manipulated as with any regular Python dict object using bracket accessors.

  • Accessing top-level “keys” in MWTabFile:
[7]:
mwfile = next(mwtab.read_files("ST000017_AN000035.txt"))

# list MWTabFile-level keys, i.e. block names
list(mwfile.keys())
[7]:
['METABOLOMICS WORKBENCH',
 'PROJECT',
 'STUDY',
 'SUBJECT',
 'SUBJECT_SAMPLE_FACTORS',
 'COLLECTION',
 'TREATMENT',
 'SAMPLEPREP',
 'CHROMATOGRAPHY',
 'ANALYSIS',
 'MS',
 'MS_METABOLITE_DATA']
[8]:
# access "PROJECT" block
mwfile["PROJECT"]
[8]:
OrderedDict([('PROJECT_TITLE', 'Rat Stamina Studies'),
             ('PROJECT_TYPE', 'Feeding'),
             ('PROJECT_SUMMARY', 'Stamina in rats'),
             ('INSTITUTE', 'University of Michigan'),
             ('DEPARTMENT', 'Internal Medicine'),
             ('LABORATORY', 'Burant Lab'),
             ('LAST_NAME', 'Beecher'),
             ('FIRST_NAME', 'Chris'),
             ('ADDRESS', '-'),
             ('EMAIL', 'chrisbee@med.umich.edu'),
             ('PHONE', '734-232-0815'),
             ('FUNDING_SOURCE', 'NIH: R01 DK077200')])
  • Accessing individual “key-value” pairs within blocks:
[9]:
# access "INSTITUTE" field within "PROJECT" block
mwfile["PROJECT"]["INSTITUTE"]
[9]:
'University of Michigan'
  • Accessing data in #SUBJECT_SAMPLE_FACTORS block:
[10]:
# access "SUBJECT_SAMPLE_FACTORS" block and print first three
mwfile["SUBJECT_SAMPLE_FACTORS"][:3]
[10]:
[OrderedDict([('Subject ID', '-'),
              ('Sample ID', 'S00009477'),
              ('Factors',
               {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})]),
 OrderedDict([('Subject ID', '-'),
              ('Sample ID', 'S00009478'),
              ('Factors',
               {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})]),
 OrderedDict([('Subject ID', '-'),
              ('Sample ID', 'S00009479'),
              ('Factors',
               {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})])]
[11]:
# access individual factors (by index)
mwfile["SUBJECT_SAMPLE_FACTORS"][0]
[11]:
OrderedDict([('Subject ID', '-'),
             ('Sample ID', 'S00009477'),
             ('Factors', {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})])
[12]:
# access individual fields within factors
mwfile["SUBJECT_SAMPLE_FACTORS"][0]["Sample ID"]
[12]:
'S00009477'
  • Accessing data in #MS_METABOLITE_DATA block:
[13]:
# access data block keys
list(mwfile["MS_METABOLITE_DATA"].keys())
[13]:
['Units', 'Data', 'Metabolites']
[14]:
# access units field
mwfile["MS_METABOLITE_DATA"]["Units"]
[14]:
'peak area'
[15]:
# access samples field (by index)
mwfile["MS_METABOLITE_DATA"]["Data"][0].keys()
[15]:
odict_keys(['Metabolite', 'S00009477', 'S00009478', 'S00009479', 'S00009480', 'S00009481', 'S00009500', 'S00009501', 'S00009502', 'S00009503', 'S00009470', 'S00009471', 'S00009472', 'S00009473', 'S00009474', 'S00009475', 'S00009494', 'S00009495', 'S00009496', 'S00009497', 'S00009498', 'S00009499', 'S00009488', 'S00009489', 'S00009490', 'S00009491', 'S00009492', 'S00009493', 'S00009509', 'S00009510', 'S00009511', 'S00009512', 'S00009513', 'S00009514', 'S00009482', 'S00009483', 'S00009484', 'S00009486', 'S00009504', 'S00009505', 'S00009506', 'S00009507', 'S00009508'])
[16]:
# access metabolite data and print first three
mwfile["MS_METABOLITE_DATA"]["Metabolites"][:3]
[16]:
[OrderedDict([('Metabolite', '11BETA,21-DIHYDROXY-5BETA-PREGNANE-3,20-DIONE'),
              ('moverz_quant', ''),
              ('ri', ''),
              ('ri_type', ''),
              ('pubchem_id', '44263339'),
              ('inchi_key', ''),
              ('kegg_id', 'C05475'),
              ('other_id', '775216_UNIQUE'),
              ('other_id_type', 'UM_Target_ID')]),
 OrderedDict([('Metabolite', '11-BETA-HYDROXYANDROST-4-ENE-3,17-DIONE'),
              ('moverz_quant', ''),
              ('ri', ''),
              ('ri_type', ''),
              ('pubchem_id', '94141'),
              ('inchi_key', ''),
              ('kegg_id', 'C05284'),
              ('other_id', '771312_PRIMARY'),
              ('other_id_type', 'UM_Target_ID')]),
 OrderedDict([('Metabolite', '13(S)-HPODE'),
              ('moverz_quant', ''),
              ('ri', ''),
              ('ri_type', ''),
              ('pubchem_id', '1426'),
              ('inchi_key', ''),
              ('kegg_id', 'C04717'),
              ('other_id', '775541_UNIQUE'),
              ('other_id_type', 'UM_Target_ID')])]
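
Because these structures are plain dicts and lists, ordinary Python is enough to reorganize them. The sketch below groups sample IDs by one factor; the entries are copied from the SUBJECT_SAMPLE_FACTORS output shown earlier (including the 'Feeeding' key exactly as it appears in the file), and samples_by_factor is a hypothetical helper, not part of mwtab.

```python
from collections import OrderedDict, defaultdict

# First three entries of mwfile["SUBJECT_SAMPLE_FACTORS"], as printed above.
subject_sample_factors = [
    OrderedDict([("Subject ID", "-"), ("Sample ID", "S00009477"),
                 ("Factors", {"Feeeding": "Ad lib", "Running Capacity": "High"})]),
    OrderedDict([("Subject ID", "-"), ("Sample ID", "S00009478"),
                 ("Factors", {"Feeeding": "Ad lib", "Running Capacity": "High"})]),
    OrderedDict([("Subject ID", "-"), ("Sample ID", "S00009479"),
                 ("Factors", {"Feeeding": "Ad lib", "Running Capacity": "High"})]),
]


def samples_by_factor(entries, factor_name):
    """Map each value of `factor_name` to the list of sample IDs that have it."""
    groups = defaultdict(list)
    for entry in entries:
        value = entry["Factors"].get(factor_name)  # None if the factor is missing
        groups[value].append(entry["Sample ID"])
    return dict(groups)


groups = samples_by_factor(subject_sample_factors, "Running Capacity")
print(groups)
```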

Manipulating Data From a Single MWTabFile

In order to change values within MWTabFile, descend into the appropriate level using square bracket accessors and set a new value.

  • Change regular “key-value” pairs:
[17]:
# access phone number information
mwfile["PROJECT"]["PHONE"]
[17]:
'734-232-0815'
[18]:
# change phone number information
mwfile["PROJECT"]["PHONE"] = "1-530-754-8258"
[19]:
# check that it has been modified
mwfile["PROJECT"]["PHONE"]
[19]:
'1-530-754-8258'
  • Change #SUBJECT_SAMPLE_FACTORS values:
[20]:
# access the first subject sample factor by index
mwfile["SUBJECT_SAMPLE_FACTORS"][0]
[20]:
OrderedDict([('Subject ID', '-'),
             ('Sample ID', 'S00009477'),
             ('Factors', {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})])
[21]:
# provide additional details to the first subject sample factor
mwfile["SUBJECT_SAMPLE_FACTORS"][0]["Additional sample data"] = {"Additional detail key": "Additional detail value"}
[22]:
# check that it has been modified
mwfile["SUBJECT_SAMPLE_FACTORS"][0]
[22]:
OrderedDict([('Subject ID', '-'),
             ('Sample ID', 'S00009477'),
             ('Factors', {'Feeeding': 'Ad lib', 'Running Capacity': 'High'}),
             ('Additional sample data',
              {'Additional detail key': 'Additional detail value'})])

Printing a MWTabFile and its Components

MWTabFile objects provide the print_file() method, which outputs the file in either mwTab or JSON format. The method takes a file_format keyword argument that specifies the output format.

The MWTabFile can be printed to output in mwTab format in its entirety using:

  • mwfile.print_file(file_format="mwtab")
  • Print the first 20 lines in mwTab format.
[23]:
from io import StringIO
mwtab_file_str = StringIO()
mwfile.print_file(file_format="mwtab", f=mwtab_file_str)

# print out first 20 lines
print("\n".join(mwtab_file_str.getvalue().split("\n")[:20]))
#METABOLOMICS WORKBENCH STUDY_ID:ST000017 ANALYSIS_ID:AN000035 PROJECT_ID:PR000016
VERSION                 1
CREATED_ON              2016-09-17
#PROJECT
PR:PROJECT_TITLE                        Rat Stamina Studies
PR:PROJECT_TYPE                         Feeding
PR:PROJECT_SUMMARY                      Stamina in rats
PR:INSTITUTE                            University of Michigan
PR:DEPARTMENT                           Internal Medicine
PR:LABORATORY                           Burant Lab
PR:LAST_NAME                            Beecher
PR:FIRST_NAME                           Chris
PR:ADDRESS                              -
PR:EMAIL                                chrisbee@med.umich.edu
PR:PHONE                                1-530-754-8258
PR:FUNDING_SOURCE                       NIH: R01 DK077200
#STUDY
ST:STUDY_TITLE                          Rat HCR/LCR Stamina Study
ST:STUDY_TYPE                           LC-MS analysis
ST:STUDY_SUMMARY                        To determine the basis of running capacity and health differences in outbread

The MWTabFile can be printed to output in JSON format in its entirety using:

  • mwfile.print_file(file_format="json")
  • Print the first 20 lines in JSON format.
[24]:
from io import StringIO
mwtab_file_str = StringIO()
mwfile.print_file(file_format="json", f=mwtab_file_str)

# print out first 20 lines
print("\n".join(mwtab_file_str.getvalue().split("\n")[:20]))
{
    "METABOLOMICS WORKBENCH": {
        "STUDY_ID": "ST000017",
        "ANALYSIS_ID": "AN000035",
        "PROJECT_ID": "PR000016",
        "VERSION": "1",
        "CREATED_ON": "2016-09-17"
    },
    "PROJECT": {
        "PROJECT_TITLE": "Rat Stamina Studies",
        "PROJECT_TYPE": "Feeding",
        "PROJECT_SUMMARY": "Stamina in rats",
        "INSTITUTE": "University of Michigan",
        "DEPARTMENT": "Internal Medicine",
        "LABORATORY": "Burant Lab",
        "LAST_NAME": "Beecher",
        "FIRST_NAME": "Chris",
        "ADDRESS": "-",
        "EMAIL": "chrisbee@med.umich.edu",
        "PHONE": "1-530-754-8258",
  • Print single block in mwTab format.
[25]:
mwfile.print_block("STUDY", file_format="mwtab")
ST:STUDY_TITLE                          Rat HCR/LCR Stamina Study
ST:STUDY_TYPE                           LC-MS analysis
ST:STUDY_SUMMARY                        To determine the basis of running capacity and health differences in outbread
ST:STUDY_SUMMARY                        N/NIH rats selected for high capacity (HCR) and low capacity (LCR) running (a for
ST:STUDY_SUMMARY                        VO2max) (see:Science. 2005 Jan 21;307(5708):418-20). Plasma collected at 12 of
ST:STUDY_SUMMARY                        age in generation 28 rats after ad lib feeding or 40% caloric restriction at week
ST:STUDY_SUMMARY                        8 of age. All animals fasted 4 hours prior to collection between 5-8
ST:INSTITUTE                            University of Michigan
ST:DEPARTMENT                           Internal Medicine
ST:LABORATORY                           Burant Lab (MMOC)
ST:LAST_NAME                            Qi
ST:FIRST_NAME                           Nathan
ST:ADDRESS                              -
ST:EMAIL                                nathanqi@med.umich.edu
ST:PHONE                                734-232-0815
ST:NUM_GROUPS                           2
ST:TOTAL_SUBJECTS                       42
  • Print single block in JSON format.
[26]:
mwfile.print_block("STUDY", file_format="json")
{
    "STUDY_TITLE": "Rat HCR/LCR Stamina Study",
    "STUDY_TYPE": "LC-MS analysis",
    "STUDY_SUMMARY": "To determine the basis of running capacity and health differences in outbread N/NIH rats selected for high capacity (HCR) and low capacity (LCR) running (a for VO2max) (see:Science. 2005 Jan 21;307(5708):418-20). Plasma collected at 12 of age in generation 28 rats after ad lib feeding or 40% caloric restriction at week 8 of age. All animals fasted 4 hours prior to collection between 5-8",
    "INSTITUTE": "University of Michigan",
    "DEPARTMENT": "Internal Medicine",
    "LABORATORY": "Burant Lab (MMOC)",
    "LAST_NAME": "Qi",
    "FIRST_NAME": "Nathan",
    "ADDRESS": "-",
    "EMAIL": "nathanqi@med.umich.edu",
    "PHONE": "734-232-0815",
    "NUM_GROUPS": "2",
    "TOTAL_SUBJECTS": "42"
}

Writing data from a MWTabFile object into a file

Data from a MWTabFile can be written to a file in the original mwTab format or in the equivalent JSON format using write():

  • Writing into a mwTab formatted file:
[27]:
with open("out/ST000017_AN000035_modified.txt", "w") as outfile:
    mwfile.write(outfile, file_format="mwtab")
  • Writing into a JSON file:
[28]:
with open("out/ST000017_AN000035_modified.json", "w") as outfile:
    mwfile.write(outfile, file_format="json")
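
Both cells above write into an out/ directory, and open() fails if that directory does not exist. A small sketch of one way to prepare the path first is shown below. open_for_writing is a hypothetical helper, and the dictionary is a stand-in for a MWTabFile so that the example runs without the package installed.

```python
import json
import tempfile
from collections import OrderedDict
from pathlib import Path


def open_for_writing(path):
    """Create missing parent directories, then open `path` for text writing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.open("w", encoding="utf-8")


# Stand-in data; with mwtab installed you would call mwfile.write(outfile, ...) instead.
block = OrderedDict([("STUDY_ID", "ST000017"), ("ANALYSIS_ID", "AN000035")])

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out" / "example.json"
    with open_for_writing(target) as outfile:
        json.dump(block, outfile, indent=4)
    written = json.loads(target.read_text(encoding="utf-8"))

print(written)
```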

Extracting Metadata and Metabolites from mwTab Files

The mwtab.mwextract module can be used to extract metadata and metabolites from mwTab files. The module contains two main functions: 1) extract_metadata(), which parses metadata values from a mwTab file, and 2) extract_metabolites(), which gathers a list of metabolites, and the samples containing them, from multiple mwTab files that contain a given metadata key-value pair.

Extracting Metadata Values

  • Extracting metadata values from a given mwTab file:
[29]:
from mwtab.mwextract import extract_metadata

extract_metadata(mwfile, ["STUDY_TYPE", "SUBJECT_TYPE"])
[29]:
{'STUDY_TYPE': {'LC-MS analysis'}, 'SUBJECT_TYPE': {'Animal'}}
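
The shape of this result (one set of values per requested key) can be reproduced with a short recursive search over nested dictionaries. The sketch below is only an illustration of the idea under that assumption; collect_values is a hypothetical helper and may differ from what extract_metadata() does internally.

```python
def collect_values(nested, keys):
    """Collect string values stored under any of `keys`, at any nesting depth."""
    found = {key: set() for key in keys}

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in found and isinstance(value, str):
                    found[key].add(value)
                walk(value)  # keep descending into nested blocks
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(nested)
    return found


# Stand-in for a parsed file, using values shown elsewhere in this tutorial.
mwfile_like = {
    "STUDY": {"STUDY_TITLE": "Rat HCR/LCR Stamina Study", "STUDY_TYPE": "LC-MS analysis"},
    "SUBJECT": {"SUBJECT_TYPE": "Animal"},
}
result = collect_values(mwfile_like, ["STUDY_TYPE", "SUBJECT_TYPE"])
print(result)
```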

Extracting Metabolite Values

  • Extracting metabolite information from multiple mwTab files and outputting the first three metabolites:
[30]:
from mwtab.mwextract import extract_metabolites, generate_matchers
from mwtab import read_files

mwtab_gen = read_files(
    "ST000017_AN000035.txt",
    "ST000040_AN000060.txt"
)

matchers = generate_matchers([
    ("ST:STUDY_TYPE",
    "LC-MS analysis")
])
list(extract_metabolites(mwtab_gen, matchers).keys())[:3]
[30]:
['11BETA_21-DIHYDROXY-5BETA-PREGNANE-3_20-DIONE',
 '11-BETA-HYDROXYANDROST-4-ENE-3_17-DIONE',
 '13(S)-HPODE']
  • Extracting metabolite information from multiple mwTab files using regular expressions and outputting the first three metabolites:
[31]:
from mwtab.mwextract import extract_metabolites, generate_matchers
from mwtab import read_files
import re  # imported as a module so the built-in compile() is not shadowed

mwtab_gen = read_files(
    "ST000017_AN000035.txt",
    "ST000040_AN000060.txt"
)

matchers = generate_matchers([
    ("ST:STUDY_TYPE",
    re.compile("(LC-MS)"))
])
list(extract_metabolites(mwtab_gen, matchers).keys())[:3]
[31]:
['11BETA_21-DIHYDROXY-5BETA-PREGNANE-3_20-DIONE',
 '11-BETA-HYDROXYANDROST-4-ENE-3_17-DIONE',
 '13(S)-HPODE']
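
The two cells differ only in whether the expected value is a plain string or a compiled pattern. A minimal sketch of that distinction is given below. value_matches is a hypothetical helper, and using search() rather than a full match for patterns is an assumption, so check the mwtab.mwextract documentation for the exact semantics.

```python
import re


def value_matches(value, expected):
    """Compare a metadata value with either a plain string or a compiled regex."""
    if hasattr(expected, "search"):  # compiled pattern objects expose search()
        return expected.search(value) is not None
    return value == expected  # plain strings must be equal


exact = value_matches("LC-MS analysis", "LC-MS analysis")
partial_string = value_matches("LC-MS analysis", "LC-MS")
pattern_hit = value_matches("LC-MS analysis", re.compile("(LC-MS)"))
pattern_miss = value_matches("GC-MS analysis", re.compile("(LC-MS)"))
print(exact, partial_string, pattern_hit, pattern_miss)
```

Under these assumptions a plain string only matches the whole value, while a pattern can match any part of it, which is why a regular expression is convenient when study types are worded inconsistently.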

Converting mwTab Files

mwTab files can be converted between the mwTab file format and their JSON representation using the mwtab.converter module.

One-to-one file conversions

  • Converting from the mwTab file format into its equivalent JSON file format:
[32]:
from mwtab.converter import Converter

# Use a valid ANALYSIS_ID to fetch the file by URL, here from_path="35"
converter = Converter(from_path="35", to_path="out/ST000017_AN000035.json",
                      from_format="mwtab", to_format="json")
converter.convert()
  • Converting from JSON file format back to mwTab file format:
[33]:
from mwtab.converter import Converter

converter = Converter(from_path="out/ST000017_AN000035.json", to_path="out/ST000017_AN000035.txt",
                      from_format="json", to_format="mwtab")
converter.convert()

Many-to-many file conversions

  • Converting a directory of mwTab formatted files into their equivalent JSON formatted files:
[34]:
from mwtab.converter import Converter

converter = Converter(from_path="mwfiles_dir_mwtab",
                      to_path="out/mwfiles_dir_json",
                      from_format="mwtab",
                      to_format="json")
converter.convert()
  • Converting a directory of JSON formatted files into their equivalent mwTab formatted files:
[35]:
from mwtab.converter import Converter

converter = Converter(from_path="out/mwfiles_dir_json",
                      to_path="out/mwfiles_dir_mwtab",
                      from_format="json",
                      to_format="mwtab")
converter.convert()

Note

Both many-to-many and one-to-one file conversions are available. See mwtab.converter for the full list of available conversions.

Command-Line Interface

The mwtab Command-Line Interface provides the following functionality:
  • Convert from the mwTab file format into its equivalent JSON file format and vice versa.
  • Download files through Metabolomics Workbench’s REST API.
  • Validate mwTab formatted files.
  • Extract metadata and metabolite information from downloaded files.
[36]:
! mwtab --help
The mwtab command-line interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Usage:
    mwtab -h | --help
    mwtab --version
    mwtab convert (<from-path> <to-path>) [--from-format=<format>] [--to-format=<format>] [--validate] [--mw-rest=<url>] [--verbose]
    mwtab validate <from-path> [--mw-rest=<url>] [--verbose]
    mwtab download url <url> [--to-path=<path>] [--verbose]
    mwtab download study all [--to-path=<path>] [--input-item=<item>] [--output-format=<format>] [--mw-rest=<url>] [--validate] [--verbose]
    mwtab download study <input-value> [--to-path=<path>] [--input-item=<item>] [--output-item=<item>] [--output-format=<format>] [--mw-rest=<url>] [--validate] [--verbose]
    mwtab download (study | compound | refmet | gene | protein) <input-item> <input-value> <output-item> [--output-format=<format>] [--to-path=<path>] [--mw-rest=<url>] [--verbose]
    mwtab download moverz <input-item> <m/z-value> <ion-type-value> <m/z-tolerance-value> [--to-path=<path>] [--mw-rest=<url>] [--verbose]
    mwtab download exactmass <LIPID-abbreviation> <ion-type-value> [--to-path=<path>] [--mw-rest=<url>] [--verbose]
    mwtab extract metadata <from-path> <to-path> <key> ... [--to-format=<format>] [--no-header]
    mwtab extract metabolites <from-path> <to-path> (<key> <value>) ... [--to-format=<format>] [--no-header]

Options:
    -h, --help                      Show this screen.
    --version                       Show version.
    --verbose                       Print what files are processing.
    --validate                      Validate the mwTab file.
    --from-format=<format>          Input file format, available formats: mwtab, json [default: mwtab].
    --to-format=<format>            Output file format [default: json].
                                    Available formats for convert:
                                        mwtab, json.
                                    Available formats for extract:
                                        json, csv.
    --mw-rest=<url>                 URL to MW REST interface
                                    [default: https://www.metabolomicsworkbench.org/rest/].
    --context=<context>             Type of resource to access from MW REST interface, available contexts: study,
                                    compound, refmet, gene, protein, moverz, exactmass [default: study].
    --input-item=<item>             Item to search Metabolomics Workbench with.
    --output-item=<item>            Item to be retrieved from Metabolomics Workbench.
    --output-format=<format>        Format for item to be retrieved in, available formats: mwtab, json.
    --no-header                     Include header at the top of csv formatted files.

    For extraction <to-path> can take a "-" which will use stdout.

Converting mwTab files in bulk

CLI one-to-one file conversions

  • Convert from a local file in mwTab format to a local file in JSON format:
[37]:
! mwtab convert ST000017_AN000035.txt out/ST000017_AN000035.json \
          --from-format=mwtab --to-format=json
  • Convert from a local file in JSON format to a local file in mwTab format:
[38]:
! mwtab convert ST000017_AN000035.json out/ST000017_AN000035.txt \
          --from-format=json --to-format=mwtab
  • Convert from a compressed local file in mwTab format to a compressed local file in JSON format:
[39]:
! mwtab convert ST000017_AN000035.txt.gz out/ST000017_AN000035.json.gz \
          --from-format=mwtab --to-format=json
  • Convert from a compressed local file in JSON format to a compressed local file in mwTab format:
[40]:
! mwtab convert ST000017_AN000035.json.gz out/ST000017_AN000035.txt.gz \
          --from-format=json --to-format=mwtab
  • Convert from an uncompressed URL file in mwTab format to a compressed local file in JSON format:
[41]:
! mwtab convert 35 out/ST000017_AN000035.json.bz2 \
          --from-format=mwtab --to-format=json

Note

See mwtab.converter for the full list of available conversions.

CLI many-to-many file conversions

  • Convert from a directory of files in mwTab format to a directory of files in JSON format:
[42]:
! mwtab convert mwfiles_dir_mwtab out/mwfiles_dir_json \
          --from-format=mwtab --to-format=json
  • Convert from a directory of files in JSON format to a directory of files in mwTab format:
[43]:
! mwtab convert mwfiles_dir_json out/mwfiles_dir_mwtab \
          --from-format=json --to-format=mwtab
  • Convert from a directory of files in mwTab format to a zip archive of files in JSON format:
[44]:
! mwtab convert mwfiles_dir_mwtab out/mwfiles_json.zip \
          --from-format=mwtab --to-format=json
  • Convert from a compressed tar archive of files in JSON format to a directory of files in mwTab format:
[45]:
! mwtab convert mwfiles_json.tar.gz out/mwfiles_dir_mwtab \
          --from-format=json --to-format=mwtab
  • Convert from a zip archive of files in mwTab format to a compressed tar archive of files in JSON format:
[46]:
! mwtab convert mwfiles_mwtab.zip out/mwfiles_json.tar.bz2 \
          --from-format=mwtab --to-format=json

Note

See mwtab.converter for the full list of available conversions.
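
When conversions are scripted from Python rather than typed in a shell, it can help to build the argument list explicitly. The sketch below only assembles a command that follows the `mwtab convert` usage line from the help text; convert_command is a hypothetical helper and nothing is executed.

```python
import shlex


def convert_command(from_path, to_path, from_format="mwtab", to_format="json", validate=False):
    """Assemble an argument list for `mwtab convert` using flags from the help text."""
    cmd = [
        "mwtab", "convert", from_path, to_path,
        "--from-format=" + from_format,
        "--to-format=" + to_format,
    ]
    if validate:
        cmd.append("--validate")
    return cmd


cmd = convert_command("mwfiles_dir_mwtab", "out/mwfiles_dir_json")
printable = " ".join(shlex.quote(part) for part in cmd)
print(printable)
```

With the mwtab command installed, the list could be passed to subprocess.run(cmd, check=True) to run the conversion.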

Download files through Metabolomics Workbench's REST API

The mwtab package provides the mwtab.mwrest module, which contains a number of functions and classes for working with Metabolomics Workbench's REST API.

Note

For the full official REST API specification, see the following link (MW REST API (v1.0, 5/7/2019)): https://www.metabolomicsworkbench.org/tools/MWRestAPIv1.0.pdf

Download by URL

  • To download a file from a given URL, call the download url command with the desired URL and provide an output path:
[47]:
! mwtab download url "https://www.metabolomicsworkbench.org/rest/study/analysis_id/AN000035/mwtab/txt" --to-path=out/ST000017_AN000035.txt
  • To download a single analysis mwTab file, call download study and specify the analysis ID:
[48]:
! mwtab download study AN000035 --to-path=out/ST000017_AN000035.txt
  • To download an entire study mwTab file, call download study and specify the study ID:
[49]:
! mwtab download study ST000017 --to-path=out/ST000017_AN000035.txt

Note

It is possible to validate downloaded files by adding the --validate option to the command line.

Download study, compound, refmet, gene, and protein files

  • To download study, compound, refmet, gene, and protein context files, call the download command and specify the context, input item, input value, and output item (optionally, specify the output format).
  • Download a study:
[50]:
! mwtab download study analysis_id AN000035 mwtab --output-format=txt --to-path=out/ST000017_AN000035.txt
  • Download compound:
[51]:
! mwtab download compound regno 11 name --to-path=out/tmp.txt
  • Download refmet:
[52]:
! mwtab download refmet name Cholesterol all --to-path=out/tmp.txt
  • Download gene:
[53]:
! mwtab download gene gene_symbol acaca all --to-path=out/tmp.txt
  • Download protein:
[54]:
! mwtab download protein uniprot_id Q13085 all --to-path=out/tmp.txt
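
The context, input item, input value, output item, and optional output format appear in the same order as the path components of the REST URL shown in the "Download by URL" example. The following sketch makes that correspondence explicit. build_rest_url is a hypothetical helper written for illustration, not a function from mwtab.mwrest, and it assumes the components need no URL escaping.

```python
BASE_URL = "https://www.metabolomicsworkbench.org/rest"


def build_rest_url(context, input_item, input_value, output_item, output_format=None):
    """Join the REST path components into a URL (hypothetical helper, not part of mwtab)."""
    parts = [BASE_URL, context, input_item, input_value, output_item]
    if output_format:
        # The output format is optional and, when given, is the last path component.
        parts.append(output_format)
    return "/".join(str(part) for part in parts)


print(build_rest_url("study", "analysis_id", "AN000035", "mwtab", "txt"))
print(build_rest_url("compound", "regno", 11, "name"))
```

The first call reproduces the URL passed to download url earlier in this section.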

Download all mwTab formatted files

The mwtab package provides a number of command-line functions for downloading mwTab formatted files through Metabolomics Workbench's REST API.

  • To download all available analysis files, simply call the download study all command:

! mwtab download study all

  • It is also possible to download all study files by calling the download study all command and providing an input item and output path:

! mwtab download study all --input-item=study_id

Download moverz and exactmass

  • To download moverz files, call the download moverz command and specify the input value (LIPIDS, MB, or REFMET), m/z value, ion type value, and m/z tolerance value.
[55]:
! mwtab download moverz MB 635.52 M+H 0.5 --to-path=out/tmp.txt
  • To download exactmass files, call the download exactmass command and specify the LIPID abbreviation and ion type value.
[56]:
! mwtab download exactmass "PC(34:1)" M+H --to-path=out/tmp.txt

Note

It is not necessary to specify an output format for exactmass files.
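
To make the role of the m/z tolerance concrete, the sketch below computes the search window for the moverz example above. It assumes the tolerance is applied symmetrically (plus or minus) around the target m/z value; mz_window is a hypothetical helper, not part of mwtab.

```python
def mz_window(mz, tolerance):
    """Return the (low, high) m/z bounds, assuming a symmetric +/- tolerance."""
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    return (mz - tolerance, mz + tolerance)


# Values taken from the moverz command above: m/z 635.52 with a tolerance of 0.5.
low, high = mz_window(635.52, 0.5)
print(f"{low:.2f} to {high:.2f}")  # 635.02 to 636.02
```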

Extracting metabolite data and metadata from mwTab files

The mwtab package provides the extract_metabolites() and extract_metadata() functions, which parse mwTab formatted files. The extract_metabolites() function takes a source (a list of mwTab files) and a list of metadata key-value pairs, which are used to select the mwTab files that contain the given pairs. The extract_metadata() function takes a source (a list of mwTab files) and a list of metadata keys, which are used to search the mwTab files for the values associated with those keys.

  • To extract metabolites from mwTab files in a directory, call the extract metabolites command and provide a list of metadata key-value pairs along with an output path and output format:
[57]:
! mwtab extract metabolites mwfiles_dir_mwtab out/output_file.csv SU:SUBJECT_TYPE Plant --to-format=csv

Note

It is possible to use regular expressions to match the metadata value (e.g. … SU:SUBJECT_TYPE "r'(Plant)'").

  • To extract metadata from mwTab files in a directory, call the extract metadata command and provide a list of metadata keys along with an output path and output format:
[58]:
! mwtab extract metadata mwfiles_dir_json out/output_file.json SUBJECT_TYPE --to-format=json
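
The idea behind metadata extraction, collecting every value that the requested keys take across a set of files, can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the package's implementation: collect_values is a hypothetical helper, and the toy dictionaries only imitate the section/key nesting of the JSON representation.

```python
from collections import defaultdict


def collect_values(files, keys):
    """Map each requested key to the set of values found in any section of any file."""
    found = defaultdict(set)
    for mwtab_dict in files:
        for section in mwtab_dict.values():
            if not isinstance(section, dict):
                continue  # skip anything that is not a key-value section
            for key in keys:
                if key in section:
                    found[key].add(section[key])
    return dict(found)


# Toy stand-ins for parsed files (invented for this example).
files = [
    {"SUBJECT": {"SUBJECT_TYPE": "Plant"}},
    {"SUBJECT": {"SUBJECT_TYPE": "Human"}},
    {"SUBJECT": {"SUBJECT_TYPE": "Plant"}},
]
print(collect_values(files, ["SUBJECT_TYPE"]))
```

Using a set per key means that a value shared by several files is reported only once.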

Validating mwTab files

The mwtab package provides the validate_file() function, which validates files against a JSON schema definition. The mwtab.mwschema module contains schema definitions for every block of an mwTab formatted file, i.e. it lists the types of attributes (e.g. str) and specifies which keys are optional and which are required.

  • To validate file(s), call the validate command and provide the path to the file(s):
[59]:
! mwtab validate 35
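
As a rough illustration of what a per-block schema check involves (attribute types plus required versus optional keys), consider the sketch below. The schema layout, the choice of which keys are required, and the check_section helper are all assumptions made for this example; they do not reproduce the definitions in mwtab.mwschema.

```python
# Hypothetical schema for this sketch: key -> (expected type, required?)
PROJECT_SCHEMA = {
    "PROJECT_TITLE": (str, True),
    "PROJECT_SUMMARY": (str, True),
    "PROJECT_TYPE": (str, False),
}


def check_section(section, schema):
    """Return a list of problems; an empty list means the section passed."""
    problems = []
    for key, (expected_type, required) in schema.items():
        if key not in section:
            if required:
                problems.append(f"missing required key: {key}")
        elif not isinstance(section[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    return problems


# A deliberately broken block: the summary is missing and the type is not a string.
project = {"PROJECT_TITLE": "FatB Gene Project", "PROJECT_TYPE": 42}
print(check_section(project, PROJECT_SCHEMA))
```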

Using the mwtab Python Package to Find Analyses Involving a Specific Disease or Condition

The Metabolomics Workbench data repository stores mass spectrometry and nuclear magnetic resonance experimental data and metadata in mwTab formatted files. Metabolomics Workbench also provides a number of tools for searching or analyzing mwTab files. The mwtab Python package can perform similar functions through both a programmatic API and a command-line interface, with more search flexibility.

To search the repository of mwTab files for analyses associated with a specific disease, Metabolomics Workbench provides a web-based interface.

The mwtab Python package can be used in a number of ways to similar effect. The package provides the extract_metabolites() method to extract and organize metabolites from multiple mwTab files through both Python scripts and a command-line interface. This method has more search flexibility, since it can take either a search string or a regular expression.

Using mwtab package API to extract study IDs, analysis IDs, and metabolites

The extract_metabolites() method takes two parameters: 1) an iterable of MWTabFile instances and 2) an iterable of ItemMatcher or ReGeXMatcher instances. The iterable of MWTabFile instances can be created by passing mwTab file sources (filenames, analysis IDs, etc.) to the read_files() method. The iterable of matcher instances can be created using the generate_matchers() method.

  • An example of using the mwtab package API to extract data from analyses associated with diabetes and output the first three metabolites:
[60]:
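from mwtab.mwextract import extract_metabolites, generate_matchers
from mwtab import read_files
import re

# Read the mwTab files found in the "diabetes/" directory.
mwtab_gen = read_files("diabetes/")

# Select files whose study summary contains "diabetes" (case-sensitive).
matchers = generate_matchers([
    ("ST:STUDY_SUMMARY", re.compile("(diabetes)"))
])

# Show the names of the first three extracted metabolites.
list(extract_metabolites(mwtab_gen, matchers).keys())[:3]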
[60]:
['1_5-anhydroglucitol', '1-monopalmitin', '1-monostearin']

Using mwtab CLI to extract study IDs, analysis IDs, and metabolites

The mwtab command-line interface includes an extract metabolites command, which takes a directory of mwTab files, an output path in which to save the extracted data, and a series of mwTab section item keys and values to be matched (either string values or regular expressions). Additionally, an output format can be specified.

mwtab extract metabolites <from-path> <to-path> (<key> <value>) ... [--to-format=<format>] [--no-header]
  • An example of using the mwtab CLI to extract data from analyses associated with diabetes:
[61]:
! mwtab extract metabolites diabetes/ out/output_file.json ST:STUDY_SUMMARY "r'(?i)(diabetes)'" --to-format=json
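
The API example uses the pattern "(diabetes)", whereas the CLI example uses "(?i)(diabetes)". The inline (?i) flag makes the match case-insensitive, so summaries that spell "Diabetes" with a capital letter are also selected. The sketch below shows the difference on a toy dictionary. item_matches and the prefix table are hypothetical stand-ins written for this example, not mwtab's matcher classes.

```python
import re

# Assumed prefix-to-section table for this sketch only; mwtab's own table may differ.
SECTION_NAMES = {"PR": "PROJECT", "ST": "STUDY", "SU": "SUBJECT"}


def item_matches(mwtab_dict, full_key, expected):
    """Return True if the item equals a string or is matched by a compiled regex."""
    prefix, key = full_key.split(":", 1)
    section = mwtab_dict.get(SECTION_NAMES.get(prefix, prefix), {})
    actual = section.get(key)
    if actual is None:
        return False
    if isinstance(expected, re.Pattern):
        return expected.search(actual) is not None
    return actual == expected


# Toy stand-in for a parsed file (invented for this example).
toy_file = {"STUDY": {"STUDY_SUMMARY": "Diabetes alters lipid metabolism."}}

print(item_matches(toy_file, "ST:STUDY_SUMMARY", re.compile("(diabetes)")))      # False
print(item_matches(toy_file, "ST:STUDY_SUMMARY", re.compile("(?i)(diabetes)")))  # True
```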