Database schema #3

averagehat · 2016-05-03T15:29:31Z

I propose the following schema for the input CSV files:

https://github.com/averagehat/pux-starter-app/blob/sequence-db/schema.md

InaMBerry · 2016-05-03T16:40:49Z

Looks good. Here are some suggestions:

Type and subtype for influenza.
Infection# for Dengue (primary, secondary...). I am not sure how many samples have this info though...
Also, the location for influenza is often embedded in the name itself. Some people put the city, some the state, some the country. It would be useful to have Country, State or Region, and City fields as well. I often have to manually change the locations of flu samples so that they are all in country format and I can report how many samples per country I had. It's a pain. Would it be possible to derive the country name automatically from the name of the city or the state (if it is not in the GenBank entry)? Sometimes I need the city info, however. For instance, Russia is big, and some of its cities belong more to Europe and some more to Asia.
You wrote host twice.
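
On the point that the location is often part of the influenza name: a minimal sketch of pulling that token out, assuming names follow the usual `type/location/isolate/year` pattern (or `type/host/location/isolate/year` for non-human hosts). The function name `location_from_strain_name` is hypothetical and not part of the proposed schema.

```python
from typing import Optional


def location_from_strain_name(name: str) -> Optional[str]:
    """Return the location token from a name such as 'A/Alabama/01/2015'.

    Returns None when the name does not have four or five '/'-separated parts.
    """
    # Some metadata files prefix names with '>' (FASTA style); drop it.
    parts = name.strip().lstrip(">").split("/")
    if len(parts) == 4:
        # type / location / isolate number / year
        return parts[1] or None
    if len(parts) == 5:
        # type / host / location / isolate number / year
        return parts[2] or None
    return None


print(location_from_strain_name(">A/Alabama/01/2015"))
```

The token can be a city, a state, or a country, so it would still need to be normalized before counting samples per country.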

averagehat · 2016-05-03T17:13:30Z

I think Infection# was listed as Disease, so I changed its name to Infection#. I added type and subtype as optional fields.
As for looking up the country via state/city--yes, that's possible. I can make the country field optional and add "city" and "state" columns that the country would be derived from (#4).
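
A rough sketch of how the country could be derived when only a state or city is given. The lookup tables and the name `derive_country` are placeholders for illustration; a real implementation would presumably load a gazetteer rather than hard-code entries.

```python
from typing import Dict, Optional

# Hypothetical lookup tables, keyed by lower-cased place name.
STATE_TO_COUNTRY: Dict[str, str] = {"alabama": "US"}
CITY_TO_COUNTRY: Dict[str, str] = {"moscow": "Russia", "vladivostok": "Russia"}


def derive_country(country: Optional[str] = None,
                   state: Optional[str] = None,
                   city: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit country, then fall back to state, then city."""
    if country and country.strip():
        return country.strip()
    for value, table in ((state, STATE_TO_COUNTRY), (city, CITY_TO_COUNTRY)):
        if value:
            hit = table.get(value.strip().lower())
            if hit:
                return hit
    # Unknown place: leave the country empty instead of guessing.
    return None


print(derive_country(state="Alabama"))
print(derive_country(city="Vladivostok"))
```

The city and state columns are kept as entered, so a report can still separate, say, European from Asian cities within Russia. Place names that exist in more than one country would need extra handling.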

I noticed that at least one of the Influenza metadata files has accession entries for each segment (see below). Is this the preferred way to store it, or should I also allow a single accession? It looks like I should maybe make two schemas, one for Dengue and one for Influenza, because they have a lot of differences, but that is really up to you.

| Column | Value in the example row |
| --- | --- |
| SequenceName | >A/Alabama/01/2015 |
| DatabaseName | >A/Alabama/01/2015 |
| Sampling Year | 2015 |
| SamplingDate | |
| Country | US |
| Continent | N.America |
| Subtype | H3N2 |
| Acc# HA | EPI_ISL_173217 |
| SegmentHA | HA_4_567327 |
| Acc# MP | EPI_ISL_173217 |
| SegmentMP | MP_7_567322 |
| Acc# NA | EPI_ISL_173217 |
| SegmentNA | NA_6_567326 |
| Acc# NP | EPI_ISL_173217 |
| SegmentNP | NP_5_567320 |
| Acc# NS | EPI_ISL_173217 |
| SegmentNS | NS_8_567321 |
| Acc# PA | EPI_ISL_173217 |
| SegmentPA | PA_3_567323 |
| Acc# PB1 | EPI_ISL_173217 |
| SegmentPB1 | PB1_2_567325 |
| Acc# PB2 | EPI_ISL_173217 |
| SegmentPB2 | PB2_1_567324 |
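
If per-segment accessions are kept, one option is to split each wide row into one record per segment. The sketch below assumes the file is tab-separated and uses the `Acc# <segment>` / `Segment<segment>` column names from the example; `segment_records` and the output keys are hypothetical.

```python
import csv
import io
from typing import Dict, List

SEGMENTS = ["HA", "MP", "NA", "NP", "NS", "PA", "PB1", "PB2"]


def segment_records(tsv_text: str) -> List[Dict[str, str]]:
    """Turn each wide row into one record per segment that has an accession."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        for seg in SEGMENTS:
            accession = (row.get(f"Acc# {seg}") or "").strip()
            if not accession:
                continue  # segment missing from this row
            records.append({
                "name": row["SequenceName"].lstrip(">"),
                "segment": seg,
                "accession": accession,
                "segment_id": (row.get(f"Segment{seg}") or "").strip(),
            })
    return records


# Two of the eight segments from the example row, to keep the sample short.
header = ["SequenceName", "Subtype", "Acc# HA", "SegmentHA", "Acc# NA", "SegmentNA"]
values = [">A/Alabama/01/2015", "H3N2", "EPI_ISL_173217", "HA_4_567327",
          "EPI_ISL_173217", "NA_6_567326"]
sample = "\t".join(header) + "\n" + "\t".join(values) + "\n"

for record in segment_records(sample):
    print(record)
```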

InaMBerry · 2016-05-03T17:28:51Z

Disease is different: for Dengue it could be DF, DHF1, DHF2, or DHF3, and for flu it could possibly be severe or mild. So Infection# and Disease are different options.

Yes, influenza is submitted to GenBank by segment, so each segment has its own accession number. This is different in the GISAID database (EpiFlu), where each virus has a single unique number that is the same for all of its segments.
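
To illustrate the difference, a small sketch that classifies a segment-to-accession mapping as either one shared ID (the GISAID-style case described here) or one accession per segment (the GenBank-style case). The function name `accession_style` and the `ACC_*` values are placeholders, not real accessions.

```python
from typing import Dict


def accession_style(accessions: Dict[str, str]) -> str:
    """Classify a {segment: accession} mapping by how accessions repeat."""
    if not accessions:
        return "none"
    if len(accessions) == 1:
        return "single"  # one segment only; the two styles look the same
    unique = set(accessions.values())
    if len(unique) == 1:
        return "shared"  # same ID on every segment
    if len(unique) == len(accessions):
        return "per-segment"  # a distinct accession for each segment
    return "mixed"


print(accession_style({"HA": "EPI_ISL_173217", "NA": "EPI_ISL_173217"}))
print(accession_style({"HA": "ACC_HA", "NA": "ACC_NA"}))
```

A schema that stores accessions per segment can hold both cases, since the shared ID is simply repeated for each segment.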

averagehat · 2016-05-04T01:02:05Z

Got it, thanks for clarifying.

averagehat added a commit that referenced this issue May 3, 2016

Integrating #3 (comment)

f100573

averagehat pushed a commit to averagehat/sequence-db that referenced this issue May 31, 2016

added new fields to Seq (see VDBWRAIR/pux-starter-app#3 (comment))

5a74492
