expert media and data
conversion since 1984
...a Disc Interchange technical article

The Four Issues of Platform Conversion


A platform conversion means converting data between different computers and operating systems. This usually involves the following four separate issues.


 


#1 Media

   This is certainly the most obvious issue. If you have an LTO Ultrium tape drive and you get an 8mm tape, you need a conversion. Slightly less obvious but in the same category is conversion between two densities of the same media, such as 3590B to 3590H.

#2 Tape Format or "File Structure"
   This is the main reason for incompatibilities, and the most misunderstood issue.

   In almost all cases, what is recorded on the tape or disk is not just your raw data, but additional information that aids the computer in retrieving your files. For instance, in addition to the content of your files, the tape may contain the names of the files, the subdirectory (or folder) they came from, the size and date of the files, the type of files, the owner and file permissions, the date they were backed up, the operator, and possibly much more information. Additionally, backup programs often have "pointers" to the location of the data on the tape. Each tape program stores this information differently, and that's why you often can't read a foreign tape even if you have the right drive.

   If you try to read a tape on a computer that doesn't understand exactly how all that information is placed on the tape, you will not be able to retrieve your files from the tape, even if the file itself is in a compatible format. For example, a UNIX system can't read a simple text file on a 9-track tape recorded in VAX VMS Backup format, even though both computers use ASCII and can read 9-track tapes. This is because UNIX doesn't understand the structure of VMS Backup tapes.
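
   As a rough illustration of how much non-file information a tape format carries, the sketch below uses the tar format (not mentioned above, chosen only because Python's standard library can read and write it). It builds a one-file archive in memory and lists the stored metadata. The file name, owner, and helper function names are hypothetical.

```python
import io
import tarfile


def build_sample_archive() -> bytes:
    """Create a tiny tar archive in memory holding one text file."""
    payload = b"hello, tape\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name="reports/readme.txt")  # hypothetical file
        info.size = len(payload)
        info.mtime = 0            # modification time, seconds since 1970
        info.mode = 0o644         # UNIX-style permissions
        info.uname = "operator"   # hypothetical owner name
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_metadata(archive_bytes: bytes) -> list:
    """Return the per-file metadata stored alongside the file content."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        return [
            {
                "name": member.name,
                "size": member.size,
                "mtime": member.mtime,
                "mode": member.mode,
                "owner": member.uname,
            }
            for member in archive.getmembers()
        ]


archive_bytes = build_sample_archive()
entries = list_metadata(archive_bytes)
```

   The archive is considerably larger than the 12 bytes of file content, because the format wraps the content in headers and padding. A program that does not know this layout sees only an opaque run of bytes, which is the situation described above for a UNIX system facing a VMS Backup tape.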

   The issues with disks can be even more complex. Computer engineers have devised dozens of ways of putting data onto disks, and few of them are interchangeable. Just as you can't read a Macintosh floppy on MS-DOS, you can't interchange disks between most other operating systems.


#3 File Type
   After you have retrieved the files from the tape's file structure, you still have to deal with the type of file. This takes on different meanings under different platforms. For example, UNIX, MS-DOS, and Windows operating systems treat all files as a stream of bytes and therefore don't have the concept of structured files. On the other hand, mainframes and VAX operating systems retain detailed information about the structure of the file.

   The "file type" and "file content" (below) are interrelated, and their relationship depends on the operating system. On a VAX, for example, indexed files are handled by the operating system, but on a PC the indexing is handled entirely by the application program, as the PC operating system only sees the file as bytes on a disk, without any structure.

   Some common file types are indexed, sequential, and relative. Each can have several subtypes or parameters. For example, a sequential file may be fixed-length, with or without a record delimiter, or may be variable-length, with a variety of delimiters or length codes. Files may also have several types of field delimiters.
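
   The sequential subtypes just described can be sketched as three small record splitters. This is a minimal illustration, not the layout of any particular system: the 2-byte big-endian length code is an assumed convention, and real formats differ (some use longer length fields, or lengths that include the length field itself).

```python
import struct


def split_fixed(data: bytes, record_length: int) -> list:
    """Split fixed-length records that have no delimiter between them."""
    if record_length <= 0 or len(data) % record_length != 0:
        raise ValueError("data is not a whole number of records")
    return [data[i:i + record_length]
            for i in range(0, len(data), record_length)]


def split_delimited(data: bytes, delimiter: bytes = b"\n") -> list:
    """Split variable-length records that end with a delimiter."""
    records = data.split(delimiter)
    if records and records[-1] == b"":
        records.pop()  # drop the empty piece after the final delimiter
    return records


def split_length_prefixed(data: bytes) -> list:
    """Split variable-length records preceded by a length code.

    Assumption: each record starts with a 2-byte big-endian length
    that counts only the bytes following it.
    """
    records = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated length code")
        (length,) = struct.unpack_from(">H", data, pos)
        pos += 2
        if pos + length > len(data):
            raise ValueError("truncated record")
        records.append(data[pos:pos + length])
        pos += length
    return records
```

   The same bytes give different records depending on which splitter is applied, so the file type has to be known before the content can be interpreted.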


#4 File Content
   Once you can access the tape or disk and can retrieve the files from it, you still need to consider the content of those files.

   There are multiple issues to consider, but the two general issues are the code set of the computer (ASCII or EBCDIC), and the application program (word processor, database, spreadsheet, accounting, etc.) that created the file.

   The code set: There are two primary code sets in use today, ASCII and EBCDIC, along with variations on them, such as 8-bit and 7-bit coding.

   IBM mainframes and mid-range computers store characters using EBCDIC character coding, and most other computers, including IBM PCs, store characters using the ASCII code set. In any code set, every letter, number and symbol must be represented in the computer by a binary value. Most computers use 8 bits to store a character, so there are 256 possible values to represent the alphabet and punctuation. How characters are assigned to these values varies between code sets.

   For example, in ASCII the letter "A" is represented by the decimal value 65 (41 hex), while in the EBCDIC code set the letter "A" is represented by the value 193 (C1 hex). Because of these different coding assignments, you can't view EBCDIC text on an ASCII computer, or vice versa, without translating it first. For instance, an EBCDIC space displays as an "@" if viewed on an ASCII computer, and the letters "A", "B", "C" display as graphics characters.
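
   The code values above can be checked with a short sketch. It assumes Python's built-in "cp037" codec (one common EBCDIC variant) and the "cp437" codec (the original IBM PC character set) as stand-ins for the two sides. Other EBCDIC code pages assign some characters differently, so the right codec depends on where the data came from.

```python
def text_to_ebcdic(text: str, codec: str = "cp037") -> bytes:
    """Encode text as EBCDIC bytes (cp037 is an assumed code page)."""
    return text.encode(codec)


def ebcdic_to_text(data: bytes, codec: str = "cp037") -> str:
    """Translate EBCDIC bytes back into text."""
    return data.decode(codec)


ascii_a = "A".encode("ascii")[0]     # 65 decimal, 41 hex
ebcdic_a = text_to_ebcdic("A")[0]    # 193 decimal, C1 hex

# The same three characters as EBCDIC bytes: C1 40 C2.
ebcdic_bytes = text_to_ebcdic("A B")

# Viewing those bytes on a PC without translation: the space (40 hex)
# shows up as "@" and the letters fall in the graphics-character range.
misread = ebcdic_bytes.decode("cp437")

# Translating with the right code set recovers the original text.
recovered = ebcdic_to_text(ebcdic_bytes)
```

   Translation of this kind is only safe for fields that really contain text. Binary and packed numeric fields in the same record would be corrupted if passed through a character translation.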

   The application program: Most application programs store information in unique and proprietary ways. Microsoft Word stores the page formatting differently than WordPerfect does; dBASE, Paradox, Access, FileMaker, etc., all store the database structure and content differently. Many programs can convert between their native format and those of other programs, but there are still almost as many ways to store data as there are programs.

   Additionally, most programming languages -- FORTRAN, BASIC, C, COBOL, Pascal, etc. -- have many data types to represent numbers: 8-, 16-, and 32-bit signed and unsigned integers, various floating-point representations, and so on. Due to the prevalence of COBOL and IBM mainframe computers, COBOL "packed" and "signed" (also called "zoned") field types are common on mainframe tapes. Converting these data types requires knowledge of the record layout, and involves writing a program to convert the data to a field type compatible with the new application.
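
   As a hedged sketch of what such a conversion program does, the functions below decode one packed field and one zoned field under the usual IBM conventions: a packed field holds two decimal digits per byte with the sign in the last half-byte, and an EBCDIC zoned field holds one digit per byte with the sign in the upper half of the last byte (C or F for positive, D for negative). The function names and the `scale` parameter (number of implied decimal places, taken from the record layout) are illustrative assumptions.

```python
from decimal import Decimal

POSITIVE_SIGNS = {0xA, 0xC, 0xE, 0xF}
NEGATIVE_SIGNS = {0xB, 0xD}


def _build(digits: list, sign: int, scale: int) -> Decimal:
    """Combine decimal digits and a sign code into a Decimal value."""
    if any(d > 9 for d in digits):
        raise ValueError("invalid decimal digit in field")
    if sign not in POSITIVE_SIGNS | NEGATIVE_SIGNS:
        raise ValueError("invalid sign code in field")
    value = int("".join(str(d) for d in digits))
    if sign in NEGATIVE_SIGNS:
        value = -value
    return Decimal(value).scaleb(-scale)


def unpack_packed(data: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field: two digits per byte, sign last."""
    if not data:
        raise ValueError("empty field")
    digits = []
    for byte in data[:-1]:
        digits.extend((byte >> 4, byte & 0x0F))
    digits.append(data[-1] >> 4)          # last byte: one digit...
    sign = data[-1] & 0x0F                # ...and the sign code
    return _build(digits, sign, scale)


def unpack_zoned(data: bytes, scale: int = 0) -> Decimal:
    """Decode an EBCDIC zoned-decimal field: one digit per byte."""
    if not data:
        raise ValueError("empty field")
    digits = [byte & 0x0F for byte in data]
    sign = data[-1] >> 4                  # zone of the last byte is the sign
    return _build(digits, sign, scale)
```

   For example, the three packed bytes 12 34 5C (hex) with two implied decimal places decode to 123.45, and the zoned bytes F1 F2 D3 decode to -123. Neither field contains a decimal point or a readable sign, which is why the record layout is needed.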


In order to use data from a different computer you need to address all four of the above issues.