The Supply of Information Technology Workers in the United States

Chapter 9: Data Issues

What Are the Sources of Data on IT Workers?

Data are available from three primary sources: the federal government, professional societies, and the private sector. In a separate project, the National Science Foundation (NSF) funded the Commission on Professionals in Science and Technology (CPST) to develop an Internet-based Guide to Data on Scientists and Engineers.90

Federal data. The most relevant federal data on information technology (IT) workers are provided by NSF's Science Resources Studies (SRS) division, the National Center for Education Statistics (NCES; Department of Education), and the Bureau of Labor Statistics (BLS; Department of Labor).

Data on the employment of those working in the U.S. on temporary visas are very limited and were not particularly helpful to this study. The relevant agencies include the U.S. Department of Justice's Immigration and Naturalization Service (issuance of H-1B visas to individuals) and the U.S. Department of Labor (employer applications for H-1B visas).91

NSF has a wealth of data on scientists and engineers.92 The largest amount of worker-related NSF data is on doctoral scientists, since this has long been considered the principal professional degree for independent researchers. Data also are collected on those who hold a baccalaureate or a master's in science or engineering. The data are maintained at the direction of Congress to provide an inventory of human capital for national prosperity and security, and a source of information for policy formulation by other agencies.

NSF's Science Resources Studies division oversees the Survey of Doctorate Recipients (SDR), a biennial survey of a representative sample of those who have received doctorates in the United States in a science or engineering field.
In addition, NSF oversees the Survey of Earned Doctorates (SED), which is an annual population survey of individuals receiving doctorates that year in science or engineering. SED data are stored in a Doctoral Records File (DRF), from which the sample for the SDR can be pulled for longitudinal follow-ups and for replenishments and retirements every two years.
NSF also collects workforce-related data on individuals with bachelor's and master's degrees in science and engineering fields, as part of the biennial National Survey of Recent College Graduates (NSRCG) and the National Survey of College Graduates (NSCG).93 Data at all degree levels have been merged recently into a database called SESTAT, which is accessible from the Web site.
In addition to the surveys of individuals described thus far, NSF annually collects institutional-level data on enrollments and degrees from the baccalaureate and above in science and engineering at U.S. universities. These data are available in a database called CASPAR, which is accessible from the Web site.
The National Center for Education Statistics focuses on enrollments, degrees, faculty counts, and numerous other statistics on the status of the educational enterprise in the United States.94 The annual Digest of Education Statistics, available on their Web site, contains voluminous data. NCES also sponsors periodic longitudinal surveys that follow up on the work experiences of samples of selected cohorts of high school and college students.
The Bureau of Labor Statistics provides employment data and outlooks based primarily on data collected from employers and extrapolated into the future. BLS data are not usually sorted by degree level or degree field, since the reporting burden on employers would be somewhat onerous.95 BLS is planning a national job vacancy survey in 1999 for the first time in this country.
Professional society data. The Computing Research Association (CRA) conducts an annual survey of Ph.D.-granting departments in computer science and computer engineering to ascertain enrollments, degrees, faculty size and salaries, as well as the placement of doctoral graduates as reported by departments.96
The Engineering Workforce Commission of the American Association of Engineering Societies collects data on enrollments, degrees, and salaries of engineers at all degree levels.97
The American Society for Engineering Education surveyed the employment experiences of recent engineering doctorates as part of an NSF-funded CPST project on outcomes of doctorates.98 CPST also coordinated a comparable survey of computer science doctorates.99 The National Association of Colleges and Employers (formerly the College Placement Council) collects data from college and university placement/career centers covering job and salary offers made to new graduates, primarily at the baccalaureate level.100
The Council of Graduate Schools conducts an annual Survey of Graduate Enrollments, which includes data on applications to graduate schools, as well as enrollments and degrees granted.101
The International Association of Managerial Education collects data on information technology graduates from business schools.
Private sector data. Private data sources include Abbott, Langer & Associates,102 Computerworld,103 and Datamation.104 The first is a professional survey research organization that specializes in salary surveys of employers; the latter two organizations publish IT-related magazines and periodically survey their individual readers. These salary surveys are particularly interesting for a project such as this because they reflect the diversity, inconsistencies, and change among job position titles in organizations that employ IT workers.
Other data sources. The Information Technology Association of America (ITAA) is a trade association that conducts an annual compensation survey with the help of William M. Mercer, Inc.105 As with the private sector surveys, the ITAA survey includes paragraph-length descriptions of an extensive inventory of more than 80 IT and related job descriptions.
As part of the Cooperative Institutional Research Program of the Higher Education Research Institute (HERI) at the University of California, Los Angeles, data have been collected annually since 1966 from freshmen in the U.S. on their intended majors and career plans.106 The American Council on Education is a sponsor. Periodic, longitudinal follow-ups of selected cohorts have been performed as research funding has allowed.
What Are the Limitations of Existing Data on the IT Workforce?

Numerous and serious problems with the supply and demand data make it difficult to establish a sound basis for making policy decisions. Much of the data coming from non-federal sources exhibit the following problems:

The domain of workers studied was too narrow. A number of data studies, some of which are well done, are available on specific regions (IT workers in Georgia,107 software workers in Washington State,108 etc.) or specific categories of workers (National Software Alliance study,109 U.S. International Trade Commission study of software and service workers, etc.). Unfortunately, it is not clear that the geographical regions or categories of workers covered in these studies provide results that are representative of the national IT worker situation.
There were methodological problems with the gathering or analysis of data. A number of studies are flawed because the sample size or the response rate was too small, questions were framed in ambiguous or misleading ways, or final results were calculated from raw data in a questionable fashion. The U.S. General Accounting Office has rightly criticized the low response rates of the ITAA studies, for example. This does not mean, however, that the conclusions ITAA reached are false; it means only that they cannot be assumed to be true. The statistical results do not have the reliability or weight of evidence that one could draw from a survey with much higher response rates.
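The link between sample size, response rate, and reliability can be illustrated with a small calculation. The sketch below is illustrative only: the sample counts are hypothetical and are not taken from the ITAA studies or any other survey, and the formula is the standard normal approximation for a proportion under simple random sampling. It captures sampling error only; bias introduced when non-respondents differ from respondents is a separate problem that this arithmetic does not measure.

```python
import math


def response_rate(completed, sampled):
    """Fraction of sampled units that returned a usable response."""
    return completed / sampled


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated
    from n responses, assuming a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical figures, chosen only to show the arithmetic.
sampled = 1500
completed = 270

rate = response_rate(completed, sampled)
moe_actual = margin_of_error(0.5, completed)  # based on responses received
moe_full = margin_of_error(0.5, sampled)      # if every sampled unit had replied

print(f"response rate: {rate:.0%}")
print(f"margin of error with {completed} responses: +/-{moe_actual:.1%}")
print(f"margin of error with {sampled} responses: +/-{moe_full:.1%}")
```

Fewer responses widen the margin of error, and a low response rate additionally leaves open the question of whether those who answered resemble those who did not.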
Considering both domain and methodology, data collected and processed by federal organizations such as BLS, NSF, and the Department of Education typically have the highest reliability. However, even these data sets have serious problems from the standpoint of being useful to decision-makers interested in IT worker issues.
The data are not timely. It takes time to collect high-quality survey data, to achieve high response rates, and to clean and analyze data properly. Also, because of the expense involved, some surveys are conducted only every other year. As a result, in the case of NSF workforce data, the most recent data available in late 1998 were from 1995. In a field that is changing as rapidly as information technology, data that are three years old have limited value. Data that are at most one year old are needed in order to understand the current status of the fast-paced IT industry. While the delays are in part understandable, ways need to be found to speed up data collection and analysis.
Worker categories have not been updated sufficiently to reflect the current state of IT work. Survey researchers and policymakers must make a trade-off between updating surveys to be more responsive to today's marketplace versus maintaining consistency from survey to survey so that results can be compared historically. The Standard Occupational Classification (SOC), which is used by all federal agencies, had not been changed from 1980 until 1998. Older occupational categories have been preserved over time.
Even apparently good occupational categories, such as computer programmers, are too broad and ill-defined to allow policymakers to understand key issues well. Different kinds of programmers, for example, are not necessarily interchangeable because their work involves different skill sets. Companies have had only modest success at retraining COBOL programmers to be effective Java programmers because of the wide differences in the programming methodology of the two languages. It is not clear that there is much chance of convincing the federal data collectors to break their categories into smaller ones to resolve this issue; but if the professional community collects data, it should keep this issue in mind.
The levels of data aggregation, in either what data were collected or what data were reported publicly, are often problematic. For example, computer science is sometimes combined with mathematics, other times combined with the physical sciences, and still other times with engineering. In most cases, based on the purposes for which the data were originally collected, this is understandable. But it posed a challenge for researching this study.
Data have generally been collected by the federal government only in relation to earlier issues of policy concern. These data do not provide an adequate basis for making current policy decisions. For example, non-degree training such as certificate programs, short courses, and corporate universities are an increasingly important part of the supply system for IT workers. Because these training venues have not been important to national policy in the past, federal statistics about them do not exist.
Even in cases where there had been previous policy issues, data collection is sometimes inadequate. One example concerns H-1B visas. There are no statistics available on how many temporary workers are actually in the United States on H-1B visas, only on the number of approved labor certificates and the number of visas that were awarded. Nor are data available on how many of the visa-holders are doing IT work.
Data on the demand for IT workers are especially poor. Companies are reluctant to report data about their need for workers or disclose the effects of a shortage on their business. This is considered proprietary information. Companies worry that if it were made public, it could harm their public reputation or be exploited by their competitors.
The mismatch between supply and demand data, as collected by different federal agencies, is problematic. One major frustration in researching this report was trying to match data broken out by educational degree fields (computer science, information systems, software engineering, etc.) and levels (bachelor's, master's, etc.), with data based on occupational classifications or job position descriptions (programmer, systems analyst, computer scientist, for example). The study group's attempts to cross-tabulate by education and occupation had limited success. NSF was the only source that had useful occupational data broken out by educational background. BLS data are occupationally oriented, while NCES data are educationally oriented.
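The kind of cross-tabulation described above can be sketched in a few lines, assuming person-level records that carry both an educational field and an occupation. The records, field names, and function names below are all hypothetical and invented for illustration; they do not reproduce NSF, BLS, or NCES data or any agency's file layout.

```python
from collections import Counter

# Hypothetical person-level records: (degree field, degree level, occupation).
# These rows are invented for illustration and are not survey data.
records = [
    ("computer science", "bachelor's", "programmer"),
    ("computer science", "master's", "computer scientist"),
    ("information systems", "bachelor's", "systems analyst"),
    ("mathematics", "bachelor's", "programmer"),
    ("computer science", "bachelor's", "systems analyst"),
    ("information systems", "master's", "systems analyst"),
]


def cross_tabulate(rows):
    """Count records for each (degree field, occupation) pair."""
    return Counter((field, occupation) for field, _level, occupation in rows)


def format_table(counts):
    """Render the counts as tab-separated text, one row per degree field."""
    fields = sorted({field for field, _ in counts})
    occupations = sorted({occ for _, occ in counts})
    header = "\t".join(["degree field"] + occupations)
    rows = [
        "\t".join([field] + [str(counts[(field, occ)]) for occ in occupations])
        for field in fields
    ]
    return "\n".join([header] + rows)


table = cross_tabulate(records)
print(format_table(table))
```

The table can only be built when each record carries both variables. A source that reports occupation without degree field, or degree field without occupation, supplies only one margin of the table, which is the mismatch the study group encountered.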
Using online, publicly available databases is too challenging. The Internet can provide access to microdata or other "raw" data. Researchers can analyze data collected to answer one set of questions in an attempt to answer another new set of questions. However, most current systems are not yet very user-friendly in terms of search engines, time required to process requests, variable naming conventions, table or report generation, etc. In most cases, users have to take considerable time to learn the systems and cannot easily retrieve data.

Footnotes

Copyright © 2004 Computing Research Association. All Rights Reserved.
