Data Extraction

Using Madeline as part of a linkage data "pipeline" in your lab requires "feeding" data to Madeline in a Madeline-ready format.

With the exception of those who enjoy the tedious and error-prone methods of preparing data files manually, getting data ready for Madeline usually requires writing a script to extract the data from a database system. Fortunately for you, this is not difficult to do, especially since we provide an example script below which you can use as a model or basis for your own.

Of the three input formats that Madeline currently understands, we recommend using Madeline's ASCII Flat File table format. This format is by far the simplest to create using common database and scripting tools, and it is also designed to be readable and editable by humans. The other two input formats, dBase and SAS transport file format, are proprietary binary file formats that cannot be manipulated as conveniently as a simple flat file format, and certainly are not human-readable in ordinary text editors.

Madeline's Flat File Format is described below.

Madeline's Flat File Format

Madeline's flat file format is extremely simple. It consists of two parts:

The header consists of a sequence of column labels separated by white space. The amount of white space is unimportant, and column labels can span as many lines as necessary. Labels can be followed by single-letter column type designators. These column type designators are conditionally optional, but are usually specified even when not strictly required.

What does "conditionally optional" mean? It means that Madeline can usually figure out the type of a column even when you forget to specify it. However, there are some specific cases where you need to tell Madeline exactly how a column should be interpreted. So, the best policy is to always include the column type designators. Then you don't have to worry about anything at all!

Common column type designators are C for a character column, N for a numeric column, and G for a genotype column. If you need to import data from a LINKAGE format, then you may also encounter the letter A used to designate allele columns.

//
// ===============
// RetrieveData():
// ===============
//
// An example function written in the PHP scripting 
// language for extracting marker data from a 
// database and formatting it into a human-readable
// format directly usable by the program Madeline.
//
// The function pulls data from two tables. The FAMILY 
// table stores family structure and affection status 
// information about the sampled individuals:
//
//          UNIQUE
//           KEY
//            |
//            v
//
// FAMILYID STUDYID FATHER MOTHER GENDER AFFECTION_STATUS
// -------- ------- ------ ------ ------ ----------------
// F0001    G00001  .      .      M      AFF
// F0001    G00002  .      .      F      UNA
// F0001    G00003  G00001 G00002 F      AFF
// F0001    G00004  G00001 G00002 F      UNA
// F0001    G00005  G00001 G00002 F      AFF
// ...      ...     ...    ...    ...    ...
//
// Note that STUDYIDs are assumed to be UNIQUE across families
// so that STUDYID by itself is used as a KEY.  
//
// If this is not the case in your data, you will have to
// use both FAMILYID and STUDYID together as the KEY for 
// identifying unique individuals, and modify the 
// function accordingly.
//
// Note that the FAMILY table could also contain additional
// PHENOTYPE variables, but for simplicity only the core
// attribute of AFFECTION_STATUS is shown in this example.
//
// The function first pulls the family structure and
// affection status information from the FAMILY table.
// After that, the marker data are pulled from the DATA
// table.  Then the function iterates over each individual
// in the FAMILY table result set and looks for corresponding
// marker data from the DATA table result set.
//
// The DATA table stores genotypes for each marker for 
// each individual.  For each sampled individual, this 
// table has a row for each marker, i.e.:
//
// STUDYID  MARKERNAME  ALLELE1  ALLELE2
// -------  ----------  -------  -------
// G00004   D10S171     110      112
// G00004   D10S712      97      103
// G00004   D10S455     154      158
// G00004   D10S102     131      137
// G00004   D10S152     122      130
// G00004   D10S814     189      193
// ...      ...         ...      ...
//
// This example function makes use of a library of generalized database
// access functions: db_connect(), db_query(), db_fetch_object(), etc.
// 
// 
// Retrieve user-requested data:
//
//
//
//
//
//
//
//
//
//
//
function  RetrieveData($WhereClause,$){

  global $DB;
  
  //
  // Form was submitted, so retrieve requested data for the user:
  //
  db_connect($DB->hostname, $DB->database, $DB->systemuser, $DB->password);

  //
  // Prepare the SQL WHERE clause based on passed parameters:
  //
  
  //
  // Latest vs. All records:
  //
  if($Latest_vs_All=="L"){
      $data_where=" latest='1'";
  }
  
  //
  // Choose study set based on first character of the StudyID:
  //
  if($ID_Type!=""){
      //
      // Remember that in the Cicada system, a letter like "G"
      // prefaces StudyIDs who were sampled, while the _*_next_*_ letter
      // in the alphabet, in this case the letter "H", represents
      // _*_unsampled_*_ individuals --usually parents and grandparents, etc.
      // 
      // In order to achieve complete pedigree reconstruction, unsampled
      // individuals often need to be included in the database in order
      // to show the connections between siblings in the parental or grand
      // parental generations.  These individuals will not have genotype
      // data in the database.  They will only be present in the familyTable.
      // Nevertheless, they need to be included in the output query.
      //
      // This is why we create the $ID_Type_Unsampled variable here and include
      // it in the where query below:
      //
      $ID_Type_Unsampled=chr(ord($ID_Type)+1);
      
      if($data_where) $data_where .= " and ";
      $add           = "(left(studyid,1)='$ID_Type' or left(studyid,1)='$ID_Type_Unsampled') ";
      $data_where   .= $add;
      $family_where  = $add; 
  }
  
  //
  // Additional selection criteria: FAMILY PARAMETERS
  //
  if($FamilyWhereClause){
      $FamilyWhereClause=Strip_HTML_Entities($FamilyWhereClause);
      if($family_where) $family_where .= " and ";
      $family_where .= $FamilyWhereClause;
  }
  
  //
  // Additional selection criteria: GENOTYPE DATA PARAMETERS
  //
  if($DataWhereClause){
      $DataWhereClause=Strip_HTML_Entities($DataWhereClause);
      if($data_where  ) $data_where   .= " and ";
      $data_where   .= $DataWhereClause;
  }
  
  //
  // Fix up empty where clauses:
  //
  if($family_where=="") $family_where="1";
  if($data_where=="") $data_where="1";
  
  
  //
  // FamilyTable Query:
  //
  if($NewPhenotypeFields){
     $FamilyQuery  = "select familyid,studyid,sex,father,mother,mztwin,dztwin,affected,char_affected,age," .
                     "affected_broad,char_affected_broad,affected_restricted,char_affected_restricted," .
                     "dob,age_dx " .
                     "from familyTable where latest=1 and " . $family_where . " group by 1,2";  
  }else{
     $FamilyQuery  = "select familyid,studyid,sex,father,mother,mztwin,affected,age from familyTable where latest=1 and " . $family_where . " group by 1,2";
  }
  $FamilyResult = db_query($FamilyQuery);
  
  //
  // DataTable Query:
  //
  $DataQuery  = "select studyid,markername,allele1,allele2 from dataTable where " . $data_where . " group by 1,2";
  $DataResult = db_query($DataQuery);
  //
  // PUT MARKER DATA INTO ASSOCIATIVE ARRAY (MAP):
  //
  while ($row = db_fetch_object ($DataResult)){
     $key    = $row->studyid . ":" . $row->markername;
     $val    = $row->allele1 . "/" . $row->allele2;
     $genotype[$key] = $val ;
  }
  
  if($AllMarkers){
     //
     // In this case, we don't use the $marker[] or $markername[] arrays passed in
     // from the form -- those arrays contain all markers across all studies, not just those appropriate to this study
     //
     
     //
     // New, Precise Marker Query:
     //
     $MarkerQuery    = "select distinct markername, count(*) as cases from dataTable where " . $data_where . " group by 1";
     $MarkerResult   = db_query($MarkerQuery);
     $MarkerCount=0;
     while ($row = db_fetch_object ($MarkerResult)) {
         $markername[$MarkerCount++]=$row->markername;
     }
  }

   //
   // PRINT MADELINE-STYLED DATA HEADER:
   //
   if($NewPhenotypeFields){
   
      echo "FAMID C  STUDYID C  SEX X  FATHER C  MOTHER C  MZTWIN C  DZTWIN C  AFFECTED N  CAFFECTED C  AGE N\n";
      echo "AFF_BROAD N  CAFF_BROAD C  AFF_RESTR N  CAFF_RESTR C\n";
      //echo "DOB D\n";
      echo "AGE_DX N\n";
      
   }else{

      echo "FAMID C  STUDYID C  SEX X  FATHER C  MOTHER C  MZTWIN C  AFFECTED N  AGE N\n";

   }
   if($AllMarkers){
      //
      // Print the names of all the markers in the markername array:
      //
      for($i=0;$i<$MarkerCount;$i++){
         echo "$markername[$i] G ";
         if($i%10==0) echo "\n";
      }
   
   }else{
      //
      // Print only the names of the selected markers:
      //
      for($j=$i=0;$i<$MarkerCount;$i++){
         if($marker[$i]){
            echo "$markername[$i] G ";
            ++$j;
            if($j%10==0) echo "\n";
         }
      }
   }
   
   //
   // Separate header from data:
   //
   echo "\n\n\n";
   
   //
   // LOOP TO SPIT OUT THE DATA:
   //
   while ($row_i = db_fetch_object ($FamilyResult)) {

      if($NewPhenotypeFields){

         //
         // FAMID C  STUDYID C  SEX C  FATHER C  MOTHER C  MZTWIN C  DZTWIN C  AFFECTED N  CAFFECTED C  AGE N
         // AFF_BROAD N  CAFF_BROAD C  AFF_RESTR N  CAFF_RESTR C
         // DOB D  AGE_DX N
         //
         echo PadFamilyId($row_i->familyid) . " ";
         echo PadStudyId($row_i->studyid)   . " ";
         echo PadOne($row_i->sex)           . " ";
         echo PadStudyId($row_i->father)    . " ";
         echo PadStudyId($row_i->mother)    . " ";
         echo PadOne($row_i->mztwin)        . " ";
         echo PadOne($row_i->dztwin)        . " ";
         echo PadFour($row_i->affected)     . " ";
         echo PadOne($row_i->char_affected) . " ";
         echo PadFour($row_i->age)          . " ";
         
         echo PadFour($row_i->affected_broad)          . " ";
         echo PadOne($row_i->char_affected_broad)      . " ";
         echo PadFour($row_i->affected_restricted)     . " ";
         echo PadOne($row_i->char_affected_restricted) . " ";
         
         // echo PadDate($row_i->dob)                     . " ";
         echo PadFour($row_i->age_dx)                  . " ";
         
      }else{
         echo PadFamilyId($row_i->familyid) . " ";
         echo PadStudyId($row_i->studyid)   . " ";
         echo PadOne($row_i->sex)           . " ";
         echo PadStudyId($row_i->father)    . " ";
         echo PadStudyId($row_i->mother)    . " ";
         echo PadOne($row_i->mztwin)        . " ";
         echo PadFour($row_i->affected)     . " ";
         echo PadFour($row_i->age)          . " ";
      }
      
      if($AllMarkers){
        //
        // PRINT DATA FOR ALL MARKERS IN ARRAY:
        //
        for($i=0;$i<$MarkerCount;$i++){
            // GET THE KEY:
            $key= $row_i->studyid . ":" . $markername[$i];
            // PRINT THE GENOTYPE:
            echo PadGenotype($genotype[$key]);
        }   
      }else{
        //
        // PRINT DATA ONLY FOR SELECTED MARKERS:
        //
        for($i=0;$i<$MarkerCount;$i++){
            if($marker[$i]){
               // GET THE KEY:
               $key= $row_i->studyid . ":" . $markername[$i];
               // PRINT THE GENOTYPE:
               echo PadGenotype($genotype[$key]);
            }
        }   
      
      }
      //
      // Terminate rows with new lines:
      //
      echo "\n";
   }

}