{ "cells": [ { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The mwtab Tutorial\n", "==================\n", "\n", "The :mod:`mwtab` package provides classes and other facilities for downloading,\n", "parsing, accessing, and manipulating data stored in either the ``mwTab`` or\n", "``JSON`` representation of ``mwTab`` files.\n", "\n", "The :mod:`mwtab` package also provides a simple command-line interface to convert\n", "between ``mwTab`` and ``JSON`` representations, download entries from\n", "Metabolomics Workbench, access the MW REST interface, validate the consistency\n", "of ``mwTab`` files, and extract metadata and metabolites from these files." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Brief mwTab Format Overview\n", "~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", "\n", "\n", ".. note::\n", "\n", " For the full official specification, see the following link (``mwTab file specification``):\n", " http://www.metabolomicsworkbench.org/data/tutorials.php\n", "\n", "\n", "``mwTab`` formatted files consist of multiple blocks. Each new block starts with ``#``.\n", "\n", "* Some of the blocks contain only \"key-value\"-like pairs.\n", "\n", ".. code-block:: none\n", "\n", " #METABOLOMICS WORKBENCH STUDY_ID:ST000001 ANALYSIS_ID:AN000001\n", " VERSION \t1\n", " CREATED_ON \t2016-09-17\n", " #PROJECT\n", " PR:PROJECT_TITLE \tFatB Gene Project\n", " PR:PROJECT_TYPE \tGenotype treatment\n", " PR:PROJECT_SUMMARY \tExperiment to test the consequence of a mutation at the FatB gene (At1g08510)\n", " PR:PROJECT_SUMMARY \tthe wound-response of Arabidopsis\n", "\n", ".. note::\n", "\n", " ``*_SUMMARY`` \"key-value\"-like pairs typically span multiple lines.\n", "\n", "\n", "* The ``#SUBJECT_SAMPLE_FACTORS`` block is specially formatted, i.e. it contains a header\n", " specification and tab-separated values.\n", "\n", ".. 
code-block:: none\n", "\n", " #SUBJECT_SAMPLE_FACTORS: \tSUBJECT(optional)[tab]SAMPLE[tab]FACTORS(NAME:VALUE pairs separated by |)[tab]Additional sample data\n", " SUBJECT_SAMPLE_FACTORS \t-\tLabF_115873\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\n", " SUBJECT_SAMPLE_FACTORS \t-\tLabF_115878\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\n", " SUBJECT_SAMPLE_FACTORS \t-\tLabF_115883\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\n", " SUBJECT_SAMPLE_FACTORS \t-\tLabF_115888\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\n", " SUBJECT_SAMPLE_FACTORS \t-\tLabF_115893\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\n", " SUBJECT_SAMPLE_FACTORS \t-\tLabF_115898\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\n", "\n", "\n", "* ``#MS_METABOLITE_DATA`` (results) block contains ``Samples`` identifiers, ``Factors`` identifiers\n", " as well as tab-separated data between ``*_START`` and ``*_END``.\n", "\n", ".. 
code-block:: none\n", "\n", " #MS_METABOLITE_DATA\n", " MS_METABOLITE_DATA:UNITS\tPeak height\n", " MS_METABOLITE_DATA_START\n", " Samples\tLabF_115904\tLabF_115909\tLabF_115914\tLabF_115919\tLabF_115924\tLabF_115929\tLabF_115842\tLabF_115847\tLabF_115852\tLabF_115857\tLabF_115862\tLabF_115867\tLabF_115873\tLabF_115878\tLabF_115883\tLabF_115888\tLabF_115893\tLabF_115898\tLabF_115811\tLabF_115816\tLabF_115821\tLabF_115826\tLabF_115831\tLabF_115836\n", " Factors\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:fatb-ko KD; At1g08510 | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant 
Wounding Treatment:Control - Non-Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded\tArabidopsis Genotype:Wassilewskija (Ws) | Plant Wounding Treatment:Wounded\n", " 1_2_4-benzenetriol\t1874.0000\t3566.0000\t1945.0000\t1456.0000\t2004.0000\t1995.0000\t4040.0000\t2432.0000\t2189.0000\t1931.0000\t1307.0000\t2880.0000\t2218.0000\t1754.0000\t1369.0000\t1201.0000\t3324.0000\t1355.0000\t2257.0000\t1718.0000\t1740.0000\t3472.0000\t2054.0000\t1367.0000\n", " 1-monostearin\t987.0000\t450.0000\t1910.0000\t549.0000\t1032.0000\t902.0000\t393.0000\t705.0000\t100.0000\t481.0000\t265.0000\t120.0000\t1185.0000\t867.0000\t676.0000\t569.0000\t579.0000\t387.0000\t1035.0000\t789.0000\t875.0000\t224.0000\t641.0000\t693.0000\n", " ...\n", " MS_METABOLITE_DATA_END\n", "\n", "* ``#METABOLITES`` metadata block contains a header specifying fields and\n", " tab-separated data between ``*_START`` and ``*_END``.\n", "\n", ".. code-block:: none\n", "\n", " #METABOLITES\n", " METABOLITES_START\n", " metabolite_name\tmoverz_quant\tri\tri_type\tpubchem_id\tinchi_key\tkegg_id\tother_id\tother_id_type\n", " 1,2,4-benzenetriol\t239\t522741\tFiehn\t10787\t\tC02814\t205673\tBinBase\n", " 1-monostearin\t399\t959625\tFiehn\t107036\t\tD01947\t202835\tBinBase\n", " 2-hydroxyvaleric acid\t131\t310750\tFiehn\t98009\t\t\t218773\tBinBase\n", " 3-phosphoglycerate\t299\t611619\tFiehn\t724\t\tC00597\t217821\tBinBase\n", " ...\n", " METABOLITES_END\n", "\n", "* ``#NMR_BINNED_DATA`` metadata block contains a header specifying fields and\n", " tab-separated data between ``*_START`` and ``*_END``.\n", "\n", ".. 
code-block:: none\n", "\n", " #NMR_BINNED_DATA\n", " NMR_BINNED_DATA_START\n", " Bin range(ppm)\tCDC029\tCDC030\tCDC032\tCPL101\tCPL102\tCPL103\tCPL201\tCPL202\tCPL203\tCDS039\tCDS052\tCDS054\n", " 0.50...0.56\t0.00058149\t1.6592\t0.039301\t0\t0\t0\t0.034018\t0.0028746\t0.0021478\t0.013387\t0\t0\n", " 0.56...0.58\t0\t0.74267\t0\t0.007206\t0\t0\t0\t0\t0\t0\t0\t0.0069721\n", " 0.58...0.60\t0.051165\t0.8258\t0.089149\t0.060972\t0.026307\t0.045697\t0.069541\t0\t0\t0.14516\t0.057489\t0.042255\n", " ...\n", " NMR_BINNED_DATA_END\n", "\n", "* Order of metadata and data blocks (MS)\n", "\n", ".. code-block:: none\n", "\n", " #METABOLOMICS WORKBENCH\n", " VERSION \t1\n", " CREATED_ON \t2016-09-17\n", " ...\n", " #PROJECT\n", " ...\n", " #STUDY\n", " ...\n", " #SUBJECT\n", " ...\n", " #SUBJECT_SAMPLE_FACTORS: \tSUBJECT(optional)[tab]SAMPLE[tab]FACTORS(NAME:VALUE pairs separated by |)[tab]Additional sample data\n", " ...\n", " #COLLECTION\n", " ...\n", " #TREATMENT\n", " ...\n", " #SAMPLEPREP\n", " ...\n", " #CHROMATOGRAPHY\n", " ...\n", " #ANALYSIS\n", " ...\n", " #MS\n", " ...\n", " #MS_METABOLITE_DATA\n", " MS_METABOLITE_DATA:UNITS\tpeak area\n", " MS_METABOLITE_DATA_START\n", " ...\n", " MS_METABOLITE_DATA_END\n", " #METABOLITES\n", " METABOLITES_START\n", " ...\n", " METABOLITES_END\n", " #END" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Using mwtab as a Library\n", "~~~~~~~~~~~~~~~~~~~~~~~~\n", "\n", "\n", "Importing mwtab Package\n", "-----------------------\n", "\n", "If the :mod:`mwtab` package is installed on the system, it can be imported:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import mwtab" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Constructing MWTabFile Generator\n", "--------------------------------\n", "\n", "The :mod:`~mwtab.fileio` module provides the :func:`~mwtab.fileio.read_files`\n", "generator 
function that yields :class:`~mwtab.mwtab.MWTabFile` instances. Constructing a\n", ":class:`~mwtab.mwtab.MWTabFile` generator is straightforward: specify the path to a local ``mwTab`` file,\n", "a directory of files, an archive of files, an ``ANALYSIS_ID``, or a URL:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import mwtab\n", "\n", "mwfile_gen = mwtab.read_files(\"ST000017_AN000035.txt\") # single mwTab file\n", "mwfiles_gen = mwtab.read_files(\"ST000017_AN000035.txt\", \"ST000040_AN000060.json\") # several files (mwTab and JSON)\n", "mwdir_gen = mwtab.read_files(\"mwfiles_dir_mwtab\") # directory of mwTab files\n", "mwzip_gen = mwtab.read_files(\"mwfiles_mwtab.zip\") # archive of mwTab files\n", "mwanalysis_gen = mwtab.read_files(\"35\", \"60\") # ANALYSIS_IDs of mwTab files\n", "# REST-callable URL of an mwTab file\n", "mwurl_gen = mwtab.read_files(\"https://www.metabolomicsworkbench.org/rest/study/analysis_id/AN000035/mwtab/txt\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Processing MWTabFile Generator\n", "------------------------------\n", "\n", "The :class:`~mwtab.mwtab.MWTabFile` generator can be processed in several ways:\n", "\n", "* Feed it to a for-loop and process one file at a time:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "STUDY_ID: ST000017\n", "ANALYSIS_ID AN000035\n", "SOURCE https://www.metabolomicsworkbench.org/rest/study/analysis_id/AN000035/mwtab/txt\n", "\t METABOLOMICS WORKBENCH\n", "\t PROJECT\n", "\t STUDY\n", "\t SUBJECT\n", "\t SUBJECT_SAMPLE_FACTORS\n", "\t COLLECTION\n", "\t TREATMENT\n", "\t SAMPLEPREP\n", "\t CHROMATOGRAPHY\n", "\t ANALYSIS\n", "\t MS\n", "\t MS_METABOLITE_DATA\n", "STUDY_ID: ST000040\n", "ANALYSIS_ID AN000060\n", "SOURCE https://www.metabolomicsworkbench.org/rest/study/analysis_id/AN000060/mwtab/txt\n", "\t METABOLOMICS WORKBENCH\n", "\t PROJECT\n", "\t STUDY\n", "\t 
SUBJECT\n", "\t SUBJECT_SAMPLE_FACTORS\n", "\t COLLECTION\n", "\t TREATMENT\n", "\t SAMPLEPREP\n", "\t CHROMATOGRAPHY\n", "\t ANALYSIS\n", "\t MS\n", "\t MS_METABOLITE_DATA\n" ] } ], "source": [ "for mwfile in mwtab.read_files(\"35\", \"60\"):\n", " print(\"STUDY_ID:\", mwfile.study_id) # print STUDY_ID\n", " print(\"ANALYSIS_ID\", mwfile.analysis_id) # print ANALYSIS_ID\n", " print(\"SOURCE\", mwfile.source) # print source\n", " for block_name in mwfile: # print names of blocks\n", " print(\"\\t\", block_name)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note:: Once the generator is consumed, it becomes empty and needs to be created again." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Since the :class:`~mwtab.mwtab.MWTabFile` generator behaves like an iterator,\n", " we can call the :py:func:`next` built-in function:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "mwfiles_generator = mwtab.read_files(\"35\", \"60\")\n", "\n", "mwfile1 = next(mwfiles_generator)\n", "mwfile2 = next(mwfiles_generator)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note:: Once the generator is consumed, :py:class:`StopIteration` will be raised." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Convert the :class:`~mwtab.mwtab.MWTabFile` generator into a :py:class:`list` of\n", " :class:`~mwtab.mwtab.MWTabFile` objects:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "mwfiles_generator = mwtab.read_files(\"35\", \"60\")\n", "mwfiles_list = list(mwfiles_generator)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. 
warning:: Do not convert the :class:`~mwtab.mwtab.MWTabFile` generator into a\n", " :py:class:`list` if the generator can yield a large number of files (e.g.\n", " several thousand); otherwise it can consume all available memory." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Accessing Data From a Single MWTabFile\n", "--------------------------------------\n", "\n", "Since a :class:`~mwtab.mwtab.MWTabFile` is a Python :py:class:`collections.OrderedDict`,\n", "data can be accessed and manipulated as with any regular Python :py:class:`dict` object\n", "using bracket accessors.\n", "\n", "* Accessing top-level \"keys\" in :class:`~mwtab.mwtab.MWTabFile`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import os\n", "os.chdir('_static/mwfiles')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['METABOLOMICS WORKBENCH',\n", " 'PROJECT',\n", " 'STUDY',\n", " 'SUBJECT',\n", " 'SUBJECT_SAMPLE_FACTORS',\n", " 'COLLECTION',\n", " 'TREATMENT',\n", " 'SAMPLEPREP',\n", " 'CHROMATOGRAPHY',\n", " 'ANALYSIS',\n", " 'MS',\n", " 'MS_METABOLITE_DATA']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mwfile = next(mwtab.read_files(\"ST000017_AN000035.txt\"))\n", "\n", "# list MWTabFile-level keys, i.e. 
block names\n", "list(mwfile.keys())" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Accessing individual blocks in :class:`~mwtab.mwtab.MWTabFile`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OrderedDict([('PROJECT_TITLE', 'Rat Stamina Studies'),\n", " ('PROJECT_TYPE', 'Feeding'),\n", " ('PROJECT_SUMMARY', 'Stamina in rats'),\n", " ('INSTITUTE', 'University of Michigan'),\n", " ('DEPARTMENT', 'Internal Medicine'),\n", " ('LABORATORY', 'Burant Lab'),\n", " ('LAST_NAME', 'Beecher'),\n", " ('FIRST_NAME', 'Chris'),\n", " ('ADDRESS', '-'),\n", " ('EMAIL', 'chrisbee@med.umich.edu'),\n", " ('PHONE', '734-232-0815'),\n", " ('FUNDING_SOURCE', 'NIH: R01 DK077200')])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access \"PROJECT\" block\n", "mwfile[\"PROJECT\"]" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Accessing individual \"key-value\" pairs within blocks:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'University of Michigan'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access \"INSTITUTE\" field within \"PROJECT\" block\n", "mwfile[\"PROJECT\"][\"INSTITUTE\"]" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Accessing data in the ``#SUBJECT_SAMPLE_FACTORS`` block:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[OrderedDict([('Subject ID', '-'),\n", " ('Sample ID', 'S00009477'),\n", " ('Factors',\n", " {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})]),\n", " OrderedDict([('Subject ID', '-'),\n", " ('Sample ID', 'S00009478'),\n", " ('Factors',\n", " {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})]),\n", " 
OrderedDict([('Subject ID', '-'),\n", " ('Sample ID', 'S00009479'),\n", " ('Factors',\n", " {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})])]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access \"SUBJECT_SAMPLE_FACTORS\" block and print first three\n", "mwfile[\"SUBJECT_SAMPLE_FACTORS\"][:3]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OrderedDict([('Subject ID', '-'),\n", " ('Sample ID', 'S00009477'),\n", " ('Factors', {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access individual factors (by index)\n", "mwfile[\"SUBJECT_SAMPLE_FACTORS\"][0]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'S00009477'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access individual fields within factors\n", "mwfile[\"SUBJECT_SAMPLE_FACTORS\"][0][\"Sample ID\"]" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Accessing data in ``#MS_METABOLITE_DATA`` block:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Units', 'Data', 'Metabolites']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access data block keys\n", "list(mwfile[\"MS_METABOLITE_DATA\"].keys())" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'peak area'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access units field\n", "mwfile[\"MS_METABOLITE_DATA\"][\"Units\"]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "odict_keys(['Metabolite', 'S00009477', 
'S00009478', 'S00009479', 'S00009480', 'S00009481', 'S00009500', 'S00009501', 'S00009502', 'S00009503', 'S00009470', 'S00009471', 'S00009472', 'S00009473', 'S00009474', 'S00009475', 'S00009494', 'S00009495', 'S00009496', 'S00009497', 'S00009498', 'S00009499', 'S00009488', 'S00009489', 'S00009490', 'S00009491', 'S00009492', 'S00009493', 'S00009509', 'S00009510', 'S00009511', 'S00009512', 'S00009513', 'S00009514', 'S00009482', 'S00009483', 'S00009484', 'S00009486', 'S00009504', 'S00009505', 'S00009506', 'S00009507', 'S00009508'])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access samples field (by index)\n", "mwfile[\"MS_METABOLITE_DATA\"][\"Data\"][0].keys()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[OrderedDict([('Metabolite', '11BETA,21-DIHYDROXY-5BETA-PREGNANE-3,20-DIONE'),\n", " ('moverz_quant', ''),\n", " ('ri', ''),\n", " ('ri_type', ''),\n", " ('pubchem_id', '44263339'),\n", " ('inchi_key', ''),\n", " ('kegg_id', 'C05475'),\n", " ('other_id', '775216_UNIQUE'),\n", " ('other_id_type', 'UM_Target_ID')]),\n", " OrderedDict([('Metabolite', '11-BETA-HYDROXYANDROST-4-ENE-3,17-DIONE'),\n", " ('moverz_quant', ''),\n", " ('ri', ''),\n", " ('ri_type', ''),\n", " ('pubchem_id', '94141'),\n", " ('inchi_key', ''),\n", " ('kegg_id', 'C05284'),\n", " ('other_id', '771312_PRIMARY'),\n", " ('other_id_type', 'UM_Target_ID')]),\n", " OrderedDict([('Metabolite', '13(S)-HPODE'),\n", " ('moverz_quant', ''),\n", " ('ri', ''),\n", " ('ri_type', ''),\n", " ('pubchem_id', '1426'),\n", " ('inchi_key', ''),\n", " ('kegg_id', 'C04717'),\n", " ('other_id', '775541_UNIQUE'),\n", " ('other_id_type', 'UM_Target_ID')])]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access metabolite data and print first three\n", "mwfile[\"MS_METABOLITE_DATA\"][\"Metabolites\"][:3]" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": 
"text/restructuredtext" }, "source": [ "Manipulating Data From a Single MWTabFile\n", "-----------------------------------------\n", "\n", "To change values within a :class:`~mwtab.mwtab.MWTabFile`, descend into\n", "the appropriate level using square bracket accessors and set a new value.\n", "\n", "* Change regular \"key-value\" pairs:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'734-232-0815'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access phone number information\n", "mwfile[\"PROJECT\"][\"PHONE\"]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# change phone number information\n", "mwfile[\"PROJECT\"][\"PHONE\"] = \"1-530-754-8258\"" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1-530-754-8258'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check that it has been modified\n", "mwfile[\"PROJECT\"][\"PHONE\"]" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Change ``#SUBJECT_SAMPLE_FACTORS`` values:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OrderedDict([('Subject ID', '-'),\n", " ('Sample ID', 'S00009477'),\n", " ('Factors', {'Feeeding': 'Ad lib', 'Running Capacity': 'High'})])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access the first subject sample factor by index\n", "mwfile[\"SUBJECT_SAMPLE_FACTORS\"][0]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# provide additional details to the first subject sample factor\n", "mwfile[\"SUBJECT_SAMPLE_FACTORS\"][0][\"Additional sample data\"] = {\"Additional detail key\": \"Additional detail value\"}" ] }, { 
"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OrderedDict([('Subject ID', '-'),\n", " ('Sample ID', 'S00009477'),\n", " ('Factors', {'Feeeding': 'Ad lib', 'Running Capacity': 'High'}),\n", " ('Additional sample data',\n", " {'Additional detail key': 'Additional detail value'})])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check that it has been modified\n", "mwfile[\"SUBJECT_SAMPLE_FACTORS\"][0]" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Printing a MWTabFile and its Components\n", "---------------------------------------\n", "\n", "``MWTabFile`` objects provide the ``print_file()`` method, which outputs the file in either ``mwTab`` or ``JSON`` format. The method takes a ``file_format`` keyword argument that specifies the output format.\n", "\n", "The MWTabFile can be printed in ``mwTab`` format in its entirety using:\n", "\n", "* mwfile.print_file(file_format=\"mwtab\")\n", "\n", "\n", "* Print the first 20 lines in ``mwTab`` format." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#METABOLOMICS WORKBENCH STUDY_ID:ST000017 ANALYSIS_ID:AN000035 PROJECT_ID:PR000016\n", "VERSION \t1\n", "CREATED_ON \t2016-09-17\n", "#PROJECT\n", "PR:PROJECT_TITLE \tRat Stamina Studies\n", "PR:PROJECT_TYPE \tFeeding\n", "PR:PROJECT_SUMMARY \tStamina in rats\n", "PR:INSTITUTE \tUniversity of Michigan\n", "PR:DEPARTMENT \tInternal Medicine\n", "PR:LABORATORY \tBurant Lab\n", "PR:LAST_NAME \tBeecher\n", "PR:FIRST_NAME \tChris\n", "PR:ADDRESS \t-\n", "PR:EMAIL \tchrisbee@med.umich.edu\n", "PR:PHONE \t1-530-754-8258\n", "PR:FUNDING_SOURCE \tNIH: R01 DK077200\n", "#STUDY\n", "ST:STUDY_TITLE \tRat HCR/LCR Stamina Study\n", "ST:STUDY_TYPE \tLC-MS analysis\n", "ST:STUDY_SUMMARY \tTo determine the basis of running capacity and health differences in outbread\n" ] } ], "source": [ "from io import StringIO\n", "mwtab_file_str = StringIO()\n", "mwfile.print_file(file_format=\"mwtab\", f=mwtab_file_str)\n", "\n", "# print out first 20 lines\n", "print(\"\\n\".join(mwtab_file_str.getvalue().split(\"\\n\")[:20]))" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The MWTabFile can be printed to output in JSON format in its entirety using:\n", "\n", "* mwfile.print_file(file_format=\"json\")\n", "\n", "\n", "* Print the first 20 lines in ``JSON`` format." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"METABOLOMICS WORKBENCH\": {\n", " \"STUDY_ID\": \"ST000017\",\n", " \"ANALYSIS_ID\": \"AN000035\",\n", " \"PROJECT_ID\": \"PR000016\",\n", " \"VERSION\": \"1\",\n", " \"CREATED_ON\": \"2016-09-17\"\n", " },\n", " \"PROJECT\": {\n", " \"PROJECT_TITLE\": \"Rat Stamina Studies\",\n", " \"PROJECT_TYPE\": \"Feeding\",\n", " \"PROJECT_SUMMARY\": \"Stamina in rats\",\n", " \"INSTITUTE\": \"University of Michigan\",\n", " \"DEPARTMENT\": \"Internal Medicine\",\n", " \"LABORATORY\": \"Burant Lab\",\n", " \"LAST_NAME\": \"Beecher\",\n", " \"FIRST_NAME\": \"Chris\",\n", " \"ADDRESS\": \"-\",\n", " \"EMAIL\": \"chrisbee@med.umich.edu\",\n", " \"PHONE\": \"1-530-754-8258\",\n" ] } ], "source": [ "from io import StringIO\n", "mwtab_file_str = StringIO()\n", "mwfile.print_file(file_format=\"json\", f=mwtab_file_str)\n", "\n", "# print out first 20 lines\n", "print(\"\\n\".join(mwtab_file_str.getvalue().split(\"\\n\")[:20]))" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Print single block in ``mwTab`` format." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ST:STUDY_TITLE \tRat HCR/LCR Stamina Study\n", "ST:STUDY_TYPE \tLC-MS analysis\n", "ST:STUDY_SUMMARY \tTo determine the basis of running capacity and health differences in outbread\n", "ST:STUDY_SUMMARY \tN/NIH rats selected for high capacity (HCR) and low capacity (LCR) running (a for\n", "ST:STUDY_SUMMARY \tVO2max) (see:Science. 2005 Jan 21;307(5708):418-20). Plasma collected at 12 of\n", "ST:STUDY_SUMMARY \tage in generation 28 rats after ad lib feeding or 40% caloric restriction at week\n", "ST:STUDY_SUMMARY \t8 of age. 
All animals fasted 4 hours prior to collection between 5-8\n", "ST:INSTITUTE \tUniversity of Michigan\n", "ST:DEPARTMENT \tInternal Medicine\n", "ST:LABORATORY \tBurant Lab (MMOC)\n", "ST:LAST_NAME \tQi\n", "ST:FIRST_NAME \tNathan\n", "ST:ADDRESS \t-\n", "ST:EMAIL \tnathanqi@med.umich.edu\n", "ST:PHONE \t734-232-0815\n", "ST:NUM_GROUPS \t2\n", "ST:TOTAL_SUBJECTS \t42\n" ] } ], "source": [ "mwfile.print_block(\"STUDY\", file_format=\"mwtab\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Print single block in ``JSON`` format." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"STUDY_TITLE\": \"Rat HCR/LCR Stamina Study\",\n", " \"STUDY_TYPE\": \"LC-MS analysis\",\n", " \"STUDY_SUMMARY\": \"To determine the basis of running capacity and health differences in outbread N/NIH rats selected for high capacity (HCR) and low capacity (LCR) running (a for VO2max) (see:Science. 2005 Jan 21;307(5708):418-20). Plasma collected at 12 of age in generation 28 rats after ad lib feeding or 40% caloric restriction at week 8 of age. 
All animals fasted 4 hours prior to collection between 5-8\",\n", " \"INSTITUTE\": \"University of Michigan\",\n", " \"DEPARTMENT\": \"Internal Medicine\",\n", " \"LABORATORY\": \"Burant Lab (MMOC)\",\n", " \"LAST_NAME\": \"Qi\",\n", " \"FIRST_NAME\": \"Nathan\",\n", " \"ADDRESS\": \"-\",\n", " \"EMAIL\": \"nathanqi@med.umich.edu\",\n", " \"PHONE\": \"734-232-0815\",\n", " \"NUM_GROUPS\": \"2\",\n", " \"TOTAL_SUBJECTS\": \"42\"\n", "}\n" ] } ], "source": [ "mwfile.print_block(\"STUDY\", file_format=\"json\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Writing Data From a MWTabFile Object Into a File\n", "------------------------------------------------\n", "Data from a :class:`~mwtab.mwtab.MWTabFile` can be written to a file\n", "in the original ``mwTab`` format or in the equivalent ``JSON`` format using\n", ":meth:`~mwtab.mwtab.MWTabFile.write()`:\n", "\n", "* Writing into a ``mwTab`` formatted file:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "with open(\"out/ST000017_AN000035_modified.txt\", \"w\") as outfile:\n", " mwfile.write(outfile, file_format=\"mwtab\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Writing into a ``JSON`` file:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "with open(\"out/ST000017_AN000035_modified.json\", \"w\") as outfile:\n", " mwfile.write(outfile, file_format=\"json\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Extracting Metadata and Metabolites from mwTab Files\n", "----------------------------------------------------\n", "\n", "The :mod:`mwtab.mwextract` module can be used to extract metadata from ``mwTab``\n", "
The module contains two main methods: 1)\n", ":meth:`~mwtab.mwtab.mwextract.extract_metadata()` which can be used to parse metadata\n", "values from a ``mwTab`` file, and 2)\n", ":meth:`~mwtab.mwtab.mwextract.extract_metabolites()` which can be used to gather a\n", "list of metabolites and samples containing the found metabolites from multiple \n", "``mwTab`` files which contain a given metadata key value pair.\n", "\n", "Extracting Metadata Values\n", "**************************\n", "\n", "* Extracting metadata values from a given ``mwTab`` file:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'STUDY_TYPE': {'LC-MS analysis'}, 'SUBJECT_TYPE': {'Animal'}}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mwtab.mwextract import extract_metadata\n", "\n", "extract_metadata(mwfile, [\"STUDY_TYPE\", \"SUBJECT_TYPE\"])" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Extracting Metabolites Values\n", "*****************************\n", "\n", "* Extracting metabolite information from multiple ``mwTab`` files and outputing the first three metabolites:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['11BETA_21-DIHYDROXY-5BETA-PREGNANE-3_20-DIONE',\n", " '11-BETA-HYDROXYANDROST-4-ENE-3_17-DIONE',\n", " '13(S)-HPODE']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mwtab.mwextract import extract_metabolites, generate_matchers\n", "from mwtab import read_files\n", "\n", "mwtab_gen = read_files(\n", " \"ST000017_AN000035.txt\",\n", " \"ST000040_AN000060.txt\"\n", ")\n", "\n", "matchers = generate_matchers([\n", " (\"ST:STUDY_TYPE\",\n", " \"LC-MS analysis\")\n", "])\n", "list(extract_metabolites(mwtab_gen, matchers).keys())[:3]" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": 
"text/restructuredtext" }, "source": [ "* Extracting metabolite information from multiple ``mwTab`` files using regualar expressions and outputing the first three metabolites:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['11BETA_21-DIHYDROXY-5BETA-PREGNANE-3_20-DIONE',\n", " '11-BETA-HYDROXYANDROST-4-ENE-3_17-DIONE',\n", " '13(S)-HPODE']" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mwtab.mwextract import extract_metabolites, generate_matchers\n", "from mwtab import read_files\n", "from re import compile\n", "\n", "mwtab_gen = read_files(\n", " \"ST000017_AN000035.txt\",\n", " \"ST000040_AN000060.txt\"\n", ")\n", "\n", "matchers = generate_matchers([\n", " (\"ST:STUDY_TYPE\",\n", " compile(\"(LC-MS)\"))\n", "])\n", "list(extract_metabolites(mwtab_gen, matchers).keys())[:3]" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Converting mwTab Files\n", "----------------------\n", "\n", "``mwTab`` files can be converted between the ``mwTab`` file format and their ``JSON``\n", "representation using the :mod:`mwtab.converter` module.\n", "\n", "One-to-one file conversions\n", "***************************\n", "\n", "* Converting from the ``mwTab`` file format into its equivalent ``JSON`` file format:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "from mwtab.converter import Converter\n", "\n", "# Using valid ANALYSIS_ID to access file from URL: from_path=\"1\"\n", "converter = Converter(from_path=\"35\", to_path=\"out/ST000017_AN000035.json\",\n", " from_format=\"mwtab\", to_format=\"json\")\n", "converter.convert()" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Converting from JSON file format back to ``mwTab`` file format:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], 
"source": [ "from mwtab.converter import Converter\n", "\n", "converter = Converter(from_path=\"out/ST000017_AN000035.json\", to_path=\"out/ST000017_AN000035.txt\",\n", " from_format=\"json\", to_format=\"mwtab\")\n", "converter.convert()" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Many-to-many files conversions\n", "******************************\n", "\n", "* Converting from the directory of ``mwTab`` formatted files into their equivalent\n", " ``JSON`` formatted files:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "from mwtab.converter import Converter\n", "\n", "converter = Converter(from_path=\"mwfiles_dir_mwtab\",\n", " to_path=\"out/mwfiles_dir_json\",\n", " from_format=\"mwtab\",\n", " to_format=\"json\")\n", "converter.convert()" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Converting from the directory of ``JSON`` formatted files into their equivalent\n", " ``mwTab`` formatted files:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "from mwtab.converter import Converter\n", "\n", "converter = Converter(from_path=\"out/mwfiles_dir_json\",\n", " to_path=\"out/mwfiles_dir_mwtab\",\n", " from_format=\"json\",\n", " to_format=\"mwtab\")\n", "converter.convert()" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note:: Many-to-many files and one-to-one file conversions are available.\n", " See :mod:`mwtab.converter` for full list of available conversions." 
] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Command-Line Interface\n", "~~~~~~~~~~~~~~~~~~~~~~\n", "\n", "The mwtab Command-Line Interface provides the following functionality:\n", " * Convert from the ``mwTab`` file format into its equivalent ``JSON`` file format and vice versa.\n", " * Download files through Metabolomics Workbench's REST API.\n", " * Validate the ``mwTab`` formatted file.\n", " * Extract metadata and metabolite information from downloaded files." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mwtab command-line interface\r\n", "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n", "\r\n", "Usage:\r\n", " mwtab -h | --help\r\n", " mwtab --version\r\n", " mwtab convert ( ) [--from-format=] [--to-format=] [--validate] [--mw-rest=] [--verbose]\r\n", " mwtab validate [--mw-rest=] [--verbose]\r\n", " mwtab download url [--to-path=] [--verbose]\r\n", " mwtab download study all [--to-path=] [--input-item=] [--output-format=] [--mw-rest=] [--validate] [--verbose]\r\n", " mwtab download study [--to-path=] [--input-item=] [--output-item=] [--output-format=] [--mw-rest=] [--validate] [--verbose]\r\n", " mwtab download (study | compound | refmet | gene | protein) [--output-format=] [--to-path=] [--mw-rest=] [--verbose]\r\n", " mwtab download moverz [--to-path=] [--mw-rest=] [--verbose]\r\n", " mwtab download exactmass [--to-path=] [--mw-rest=] [--verbose]\r\n", " mwtab extract metadata ... [--to-format=] [--no-header]\r\n", " mwtab extract metabolites ( ) ... 
[--to-format=] [--no-header]\r\n", "\r\n", "Options:\r\n", " -h, --help Show this screen.\r\n", " --version Show version.\r\n", " --verbose Print what files are processing.\r\n", " --validate Validate the mwTab file.\r\n", " --from-format= Input file format, available formats: mwtab, json [default: mwtab].\r\n", " --to-format= Output file format [default: json].\r\n", " Available formats for convert:\r\n", " mwtab, json.\r\n", " Available formats for extract:\r\n", " json, csv.\r\n", " --mw-rest= URL to MW REST interface\r\n", " [default: https://www.metabolomicsworkbench.org/rest/].\r\n", " --context= Type of resource to access from MW REST interface, available contexts: study,\r\n", " compound, refmet, gene, protein, moverz, exactmass [default: study].\r\n", " --input-item= Item to search Metabolomics Workbench with.\r\n", " --output-item= Item to be retrieved from Metabolomics Workbench.\r\n", " --output-format= Format for item to be retrieved in, available formats: mwtab, json.\r\n", " --no-header Include header at the top of csv formatted files.\r\n", "\r\n", " For extraction can take a \"-\" which will use stdout.\r\n" ] } ], "source": [ "! mwtab --help" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Converting ``mwTab`` files in bulk\n", "----------------------------------\n", "\n", "CLI one-to-one file conversions\n", "*******************************\n", "\n", "* Convert from a local file in ``mwTab`` format to a local file in ``JSON`` format:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "! 
mwtab convert ST000017_AN000035.txt out/ST000017_AN000035.json \\\n", " --from-format=mwtab --to-format=json" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Convert from a local file in ``JSON`` format to a local file in ``mwTab`` format:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "! mwtab convert ST000017_AN000035.json out/ST000017_AN000035.txt \\\n", " --from-format=json --to-format=mwtab" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Convert from a compressed local file in ``mwTab`` format to a compressed local file in ``JSON`` format:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "! mwtab convert ST000017_AN000035.txt.gz out/ST000017_AN000035.json.gz \\\n", " --from-format=mwtab --to-format=json" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Convert from a compressed local file in ``JSON`` format to a compressed local file in ``mwTab`` format:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "! mwtab convert ST000017_AN000035.json.gz out/ST000017_AN000035.txt.gz \\\n", " --from-format=json --to-format=mwtab" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Convert from an uncompressed URL file in ``mwTab`` format to a compressed local file in ``JSON`` format:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "! mwtab convert 35 out/ST000017_AN000035.json.bz2 \\\n", " --from-format=mwtab --to-format=json" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note:: See :mod:`mwtab.converter` for full list of available conversions." 
] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "CLI Many-to-many files conversions\n", "**********************************\n", "\n", "* Convert from a directory of files in ``mwTab`` format to a directory of files in ``JSON`` format:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "! mwtab convert mwfiles_dir_mwtab out/mwfiles_dir_json \\\n", " --from-format=mwtab --to-format=json" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Convert from a directory of files in ``JSON`` format to a directory of files in ``mwTab`` format:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "! mwtab convert mwfiles_dir_json out/mwfiles_dir_mwtab \\\n", " --from-format=json --to-format=mwtab" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Convert from a directory of files in ``mwTab`` format to a zip archive of files in ``JSON`` format:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "! mwtab convert mwfiles_dir_mwtab out/mwfiles_json.zip \\\n", " --from-format=mwtab --to-format=json" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Convert from a compressed tar archive of files in ``JSON`` format to a directory of files in ``mwTab`` format:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "! mwtab convert mwfiles_json.tar.gz out/mwfiles_dir_mwtab \\\n", " --from-format=json --to-format=mwtab" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Convert from a zip archive of files in ``mwTab`` format to a compressed tar archive of files in ``JSON`` format:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "! 
mwtab convert mwfiles_mwtab.zip out/mwfiles_json.tar.bz2 \\\n", " --from-format=mwtab --to-format=json" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note:: See :mod:`mwtab.converter` for full list of available conversions." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Download files through Metabolomics Workbench's REST API\n", "--------------------------------------------------------\n", "\n", "The :mod:`mwtab` package provides the :mod:`mwtab.mwrest` module, which contains a number of functions and classes for working with Metabolomics Workbench's REST API.\n", "\n", ".. note::\n", " For the full official REST API specification see the following link (``MW REST API (v1.0, 5/7/2019)``):\n", " https://www.metabolomicsworkbench.org/tools/MWRestAPIv1.0.pdf\n", "\n", "Download by URL\n", "***************\n", "\n", "* To download a file based on a given URL, simply call the ``download url`` command with the desired URL and provide an output path:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "! mwtab download url \"https://www.metabolomicsworkbench.org/rest/study/analysis_id/AN000035/mwtab/txt\" --to-path=out/ST000017_AN000035.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* To download single analysis ``mwTab`` files, simply call ``download study`` and specify the analysis ID:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "! mwtab download study AN000035 --to-path=out/ST000017_AN000035.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* To download an entire study ``mwTab`` file, simply call ``download study`` and specify the study ID:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "! 
mwtab download study ST000017 --to-path=out/ST000017_AN000035.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", " It is possible to validate downloaded files by adding the ``--validate`` option to the command line." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Download study, compound, refmet, gene, and protein files\n", "*********************************************************\n", "\n", "* To download study, compound, refmet, gene, and protein context files, call the ``download`` command and specify the context, input item, input value, and output item (optionally specify the output format).\n", "\n", "* Download a study:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "! mwtab download study analysis_id AN000035 mwtab --output-format=txt --to-path=out/ST000017_AN000035.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Download compound:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "! mwtab download compound regno 11 name --to-path=out/tmp.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Download refmet:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "! mwtab download refmet name Cholesterol all --to-path=out/tmp.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Download gene:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "! 
mwtab download gene gene_symbol acaca all --to-path=out/tmp.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* Download protein:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "! mwtab download protein uniprot_id Q13085 all --to-path=out/tmp.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Download all ``mwTab`` formatted files\n", "**************************************\n", "\n", "The :mod:`mwtab` package provides a number of command-line functions for downloading ``mwTab`` formatted files through the Metabolomics Workbench REST API.\n", "\n", "* To download all available analysis files, simply call the ``download study all`` command:\n", "\n", "! mwtab download study all\n", "\n", ".. note::\n", " If an output directory is not specified, the command will download to the current working directory. It is recommended to either run the command in the desired output directory or specify an output directory with the ``--to-path`` argument.\n", "\n", "* It is also possible to download all study files by calling the ``download study all`` command and providing an input item and output path:\n", "\n", "! mwtab download study all --input-item=study_id" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Download moverz and exactmass\n", "*****************************\n", "\n", "* To download moverz files, call the ``download moverz`` command and specify the input value (LIPIDS, MB, or REFMET), m/z value, ion type value, and m/z tolerance value." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "! 
mwtab download moverz MB 635.52 M+H 0.5 --to-path=out/tmp.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "* To download exactmass files, call the ``download exactmass`` command and specify the lipid abbreviation and ion type value." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "! mwtab download exactmass \"PC(34:1)\" M+H --to-path=out/tmp.txt" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", " It is not necessary to specify an output format for exactmass files.\n", "\n", "Extracting metabolite data and metadata from ``mwTab`` files\n", "------------------------------------------------------------\n", "\n", "The :mod:`mwtab` package provides the :func:`~mwtab.mwextract.extract_metabolites` and :func:`~mwtab.mwextract.extract_metadata` functions that can parse ``mwTab`` formatted files. The :func:`~mwtab.mwextract.extract_metabolites` function takes a source (list of ``mwTab`` files) and a list of metadata key-value pairs that are used to search for ``mwTab`` files which contain the given metadata pairs. The :func:`~mwtab.mwextract.extract_metadata` function takes a source (list of ``mwTab`` files) and a list of metadata keys which are used to search the ``mwTab`` files for possible values of the given keys.\n", "\n", "* To extract metabolites from ``mwTab`` files in a directory, call the ``extract metabolites`` command and provide a list of metadata key-value pairs along with an output path and output format:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "! mwtab extract metabolites mwfiles_dir_mwtab out/output_file.csv SU:SUBJECT_TYPE Plant --to-format=csv" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", " It is possible to use regular expressions to match the metadata value (e.g. ... 
SU:SUBJECT_TYPE \"r'(Plant)'\").\n", "\n", "* To extract metadata from ``mwTab`` files in a directory, call the ``extract metadata`` command and provide a list of metadata keys along with an output path and output format:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "! mwtab extract metadata mwfiles_dir_json out/output_file.json SUBJECT_TYPE --to-format=json" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Validating ``mwTab`` files\n", "--------------------------\n", "\n", "The :mod:`mwtab` package provides the :func:`~mwtab.validator.validate_file` function\n", "that can validate files based on a ``JSON`` schema definition. The :mod:`mwtab.mwschema` module\n", "contains schema definitions for every block of a ``mwTab`` formatted file, i.e.\n", "it lists the types of attributes (e.g. :py:class:`str`) and specifies which keys are\n", "optional and which are required.\n", "\n", "* To validate file(s), simply call the ``validate`` command and provide the path to the file(s):" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "! mwtab validate 35" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Using the mwtab Python Package to Find Analyses Involving a Specific Disease or Condition\n", "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", "\n", "The Metabolomics Workbench data repository stores mass spectrometry and nuclear magnetic resonance experimental data and metadata in ``mwTab`` formatted files. Metabolomics Workbench also provides a number of tools for searching or analyzing ``mwTab`` files. 
The mwtab Python package can also be used to perform similar functions through both a programmatic API and command-line interface, which has more search flexibility.\n", "\n", "In order to search the repository of ``mwTab`` files for analyses associated with a specific disease, Metabolomics Workbench provides a web-based interface:\n", " * https://www.metabolomicsworkbench.org/data/metsearch_MS_form2.php\n", "\n", "The mwtab Python package can be used in a number of ways to similar effect. The package provides the :meth:`~mwtab.mwextract.extract_metabolites()` method to extract and organize metabolites from multiple ``mwTab`` files through both Python scripts and a command-line interface. This method has more search flexibility, since it can take either a search string or a regular expression.\n", "\n", "Using mwtab package API to extract study IDs, analysis IDs, and metabolites\n", "---------------------------------------------------------------------------\n", "\n", "The :meth:`~mwtab.mwextract.extract_metabolites()` method takes two parameters: 1) an iterable of :class:`~mwtab.mwtab.MWTabFile` instances and 2) an iterable of :class:`~mwtab.mwextract.ItemMatcher` or :class:`~mwtab.mwextract.ReGeXMatcher` instances. The iterable of :class:`~mwtab.mwtab.MWTabFile` instances can be created by passing ``mwTab`` file sources (filenames, analysis IDs, etc.) to the :meth:`~mwtab.fileio.read_files()` method. 
The iterable of matcher instances can be created using the :meth:`~mwtab.mwextract.generate_matchers()` method.\n", "\n", "* An example of using the mwtab package API to extract data from analyses associated with diabetes and output the first three metabolites:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['1_5-anhydroglucitol', '1-monopalmitin', '1-monostearin']" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mwtab.mwextract import extract_metabolites, generate_matchers\n", "from mwtab import read_files\n", "import re\n", "\n", "mwtab_gen = read_files(\"diabetes/\")\n", "\n", "matchers = generate_matchers([\n", " (\"ST:STUDY_SUMMARY\",\n", " re.compile(\"(diabetes)\"))\n", "])\n", "list(extract_metabolites(mwtab_gen, matchers).keys())[:3]" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Using mwtab CLI to extract study IDs, analysis IDs, and metabolites\n", "-------------------------------------------------------------------\n", "\n", "The mwtab command-line interface includes a ``mwtab extract metabolites`` command which takes a directory of ``mwTab`` files, an output path to save the extracted data in, and a series of ``mwTab`` section item keys and values to be matched (either string values or regular expressions). Additionally, an output format can be specified.\n", "\n", " mwtab extract metabolites ( ) ... [--to-format=] [--no-header]\n", "\n", "* An example of using the mwtab CLI to extract data from analyses associated with diabetes:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "! 
mwtab extract metabolites diabetes/ out/output_file.json ST:STUDY_SUMMARY \"r'(?i)(diabetes)'\" --to-format=json" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 2 }