Using Edith to Prepare Data Files for Madeline

Introduction

This document gives a brief overview of the features of the Edith editor that make it particularly useful for preparing data files for Madeline and for other programs in statistical genetics.

Examples of using the program to prepare data files for use with Madeline and Pedcheck are provided. These examples are similar to what you might encounter in your own work.

Using the Search Engine

One very useful feature of Edith's search engine is the ability to highlight all occurrences of a search target simultaneously.

Example: Madeline was used to investigate a data file received by email from a colleague. On opening the file, Madeline reported conflicting gender roles for individual M05726 (Fig. 1).

Fig. 1. A data file erroneously shows conflicting gender roles for an individual.

 

In Edith, CTRL + F was pressed to bring up the search engine dialog box (Fig. 2), and the option to mark All targets was chosen. This quickly revealed that the mother and father columns had been accidentally reversed for one of the two offspring, either M05725 or M05727. The error most likely occurred in the entry for M05727, since M05726 is clearly listed as a female in the table.

Fig. 2. Edith's ability to highlight all occurrences of a search target simultaneously makes short work of locating an error in a data file. See text for explanation.
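The same kind of check can also be scripted. The following is a minimal sketch, not part of Edith or Madeline: it assumes the pedigree has already been read into (individual, father, mother) tuples, and it flags every ID that appears in both the father column and the mother column. The function name, the missing-value codes, and the parent ID M05724 are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def find_conflicting_parents(rows, missing=(".", "0", "")):
    """Return IDs that occur as both a father and a mother.

    rows    -- iterable of (individual_id, father_id, mother_id) tuples
    missing -- codes meaning "parent unknown" (an assumption; adjust to your file)

    The result maps each conflicting ID to the sorted list of offspring
    whose records mention it, which is where the error should be sought.
    """
    as_father = defaultdict(set)
    as_mother = defaultdict(set)
    for individual, father, mother in rows:
        if father not in missing:
            as_father[father].add(individual)
        if mother not in missing:
            as_mother[mother].add(individual)

    conflicts = {}
    for parent in as_father.keys() & as_mother.keys():
        conflicts[parent] = sorted(as_father[parent] | as_mother[parent])
    return conflicts


# Hypothetical records: the parent columns are swapped in the second row.
rows = [
    ("M05725", "M05724", "M05726"),
    ("M05727", "M05726", "M05724"),
]
conflicts = find_conflicting_parents(rows)
for parent, offspring in sorted(conflicts.items()):
    print(parent, "appears as both father and mother; check", offspring)
```

Like the highlighted search in Edith, the script only narrows the problem down to the offspring records involved. Deciding which record is wrong still requires looking at the gender column.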

 

Selecting Columns of Information

Another feature of Edith that is extremely useful, but present in only a few editors, is the ability to select columns of information spanning many rows.

Example: The Pedcheck program by Jeff O'Connell requires a locus file that lists the order of markers in the data file. In this example, a map file formatted for use in Madeline had already been prepared. However, the map file contained additional columns of information. By holding down the CTRL key and pressing the left mouse button, the first column of this file was quickly selected (Fig. 3). The column was then copied with the standard CTRL-C keys and pasted (CTRL-V) into a new file, which was used as the locus file for running Pedcheck.

Fig. 3. Holding down the CTRL key while pressing the left mouse button allows one to select columns of information from a file in Edith.
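For files too long to select comfortably by hand, the same extraction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of either program's file format: it assumes the marker name is the first whitespace-delimited field on each line, and the marker names and extra columns shown are hypothetical.

```python
def first_column(lines):
    """Return the first whitespace-delimited field of each non-blank line."""
    return [line.split()[0] for line in lines if line.strip()]


# Hypothetical map-file lines with extra columns after the marker name.
map_lines = [
    "MRK001   0.0   note",
    "MRK002   4.2   note",
    "",
    "MRK003  14.0   note",
]

locus_names = first_column(map_lines)
print("\n".join(locus_names))
```

Note that Edith's column selection is positional (a rectangle of characters), whereas this sketch splits on white space. The two agree only when the first field contains no embedded spaces.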

 

Having the ability to cut and paste individual columns of information in a flat file in Edith means that you can painlessly rearrange columns, add extra white space between columns for readability, and do many other things which would be quite tedious, painful, and error-prone in most editors!

 

Verifying the Rectangularity of a Data File Used By Madeline

Here's a final example showing how useful Edith can be. A flat-file data table in Madeline consists of two parts. The "top" part consists of a data header containing column labels (also known as field names) and column type designators. The "bottom" part consists of the actual data table. Madeline reads from this data table directly. However, in order to do so efficiently (by taking advantage of pointer arithmetic in the C code), the data table must be perfectly rectangular. Although it is permissible to have any number of columns of white space between columns and even after the last column, all lines of data must be exactly the same length.

The problem that often occurs in real life, especially with hand-edited data files (as opposed to those generated directly from your database system), is that extra spaces or tabs have been appended to the ends of various data lines, where they are quite invisible when viewed in the editor. The problem manifests itself in a bewildering manner when you run Madeline's RECOGNIZE command: Madeline may report the wrong number of header and data lines, report an incorrect number of columns in the data file, report that the format is incorrect, refuse to read the data file, or produce similarly mysterious errors.

The first thing you should do in such a case is open the data file in Edith and use the column-highlighting feature (i.e., CTRL+<Left Mouse Button>) to select the white space after the last data column. This quickly reveals whether the lines are all the same length or, as is quite likely, not (Fig. 4). If they are not, a simple CTRL-X trims off all of the selected segments. In Fig. 4, also note how true tabs are represented as a series of small dots, "......", which differentiates them visually from space characters.

Fig. 4. Having trouble getting Madeline to RECOGNIZE your data file? More often than not the culprit is hidden spaces and tab characters trailing after the last column of data, making it non-rectangular. Hold down the CTRL key while pressing the left mouse button to select the trailing white space and remove it (CTRL-X).
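The rectangularity check described above can also be approximated with a short script. The sketch below is an assumption-laden illustration rather than a Madeline tool: it expects to be given only the lines of the data table (not the header), and the function names and sample records are hypothetical. It reports which lines end in spaces or tabs, groups line numbers by length, and produces a copy in which trailing white space is stripped and every line is padded with spaces to a common width.

```python
def trailing_whitespace_lines(lines):
    """Return 1-based numbers of lines that end in spaces or tabs."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        body = line.rstrip("\r\n")
        if body != body.rstrip(" \t"):
            flagged.append(number)
    return flagged


def line_length_report(lines):
    """Map each line length (newline excluded) to the 1-based line numbers."""
    by_length = {}
    for number, line in enumerate(lines, start=1):
        by_length.setdefault(len(line.rstrip("\r\n")), []).append(number)
    return by_length


def make_rectangular(lines):
    """Strip trailing spaces/tabs, then pad with spaces to the longest line."""
    stripped = [line.rstrip("\r\n").rstrip(" \t") for line in lines]
    width = max((len(s) for s in stripped), default=0)
    return [s.ljust(width) for s in stripped]


# Hypothetical data-table lines; the second ends in a tab, the third in spaces.
data = [
    "M05725 M05724 M05726\n",
    "M05726 .      .     \t\n",
    "M05727 M05724 M05726  \n",
]

print("lines with trailing white space:", trailing_whitespace_lines(data))
print("distinct line lengths:", sorted(line_length_report(data)))
fixed = make_rectangular(data)
print("rectangular after repair:", len({len(s) for s in fixed}) == 1)
```

Padding with spaces relies on the statement above that white space after the last column is acceptable as long as all lines have the same length. Tabs inside a line are left untouched by this sketch and would still need to be inspected by eye.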

 

Conclusion

This document has introduced you to just a few useful features of Edith. The program has many others; to explore them, download it and try it for yourself.

2002.03.09.ET
