Data user guide
A number of tools can be used to navigate the LSAC dataset:
- online LSAC data dictionary
- rationale document
- Excel spreadsheets of the data dictionary (good for creating hard copies).
Users should also consider which documents they want to print out and which they want to look at electronically. We have found that the marked-up questionnaires and interview specifications are best printed and provide the easiest method of browsing to familiarise yourself with the data available. The data dictionary is best used for searching for specific items and mapping items from wave to wave.
These tools are described in more detail below.
The associated variable name has been added beside each question in the questionnaires and interview specifications. Derived variables are also included. See Figure 3 for an example.
Figure 3: Examples of the marked-up questionnaires
A mock questionnaire (interview specifications) has also been generated for the CASI and CAI instruments used in waves 2-6. Figure 4 is a sample of this.
Figure 4: Example of wave 2 interview specification
The frequencies are a listing of the response categories for each question and the number of cases in each category. Table 12 provides an example of the listing.
Table 12: Example of the weighted frequencies
|14/15 - SC - ACASK 33.1.3 - Main activity - sought help from patient|
|hhs55c|Frequency|Percentage (%)|Cumulative frequency|Cumulative percentage (%)|
The frequencies are useful for simple queries related to particular questions (e.g. how many of the births had a normal delivery, or what are the codes used for wave 1 question A15). Variables for which there were a wide variety of responses, meaning unaltered frequencies would run for several pages (e.g. study child weight), have been rounded off to enable the grouping of responses.
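The columns of a frequency listing can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the procedure used to produce the LSAC frequencies; the function name `frequency_table`, the optional `weights` argument and the sample response codes are all assumptions made for illustration.

```python
from collections import Counter


def frequency_table(responses, weights=None):
    """Return rows of (category, frequency, percent, cum_frequency, cum_percent)."""
    if weights is None:
        weights = [1] * len(responses)  # unweighted: every case counts once
    totals = Counter()
    for category, weight in zip(responses, weights):
        totals[category] += weight
    grand_total = sum(totals.values())

    rows = []
    running = 0
    for category in sorted(totals):
        running += totals[category]
        rows.append((
            category,
            totals[category],
            round(100 * totals[category] / grand_total, 1),
            running,
            round(100 * running / grand_total, 1),
        ))
    return rows


# Hypothetical response codes, for illustration only (not LSAC data).
example = [1, 2, 2, 3, 2, 1, 2, 3]
for row in frequency_table(example):
    print(row)
```

Passing a list of survey weights as `weights` gives weighted rather than raw counts; the percentages and cumulative columns are then based on the weighted total.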
The data dictionary is available both online and in Excel. Both versions are searchable and can be sorted. Each record describes a single variable and has the following fields:
- variable name
- variable name without age (useful for sorting)
- topic number (allows derived items to be sorted in with the input variables they come from)
- question id (i.e. variable name without age or subject/informant, useful for sorting)
- file (each of the main datasets is allocated a file name that denotes the cohort and the age of the study child at that wave; e.g. wave 1 = files B0 and K4, wave 2 = files B2 and K6, wave 3 = files B4 and K8, etc.)
- position in file order (the order of the variables in the files)
- position of question in questionnaires
- person label
- child's age
- variable label briefly describing each data item
- question as found in the survey instruments
- response categories
- population with data
- SAS format
- notes field indicating other information about the data item users should know.
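The file codes listed under the 'file' field follow a regular pattern: a cohort letter followed by the study child's age at that wave. The sketch below, in Python, reproduces the three waves given above. The function name `file_code` is hypothetical, and the two-years-per-wave step is an assumption inferred from those three examples, so codes for later waves should be checked against the data dictionary.

```python
def file_code(cohort, wave):
    """Return the file code for a cohort ('B' or 'K') at a given wave (1, 2, 3, ...)."""
    start_age = {"B": 0, "K": 4}  # age at wave 1, taken from the B0 and K4 file names
    if cohort not in start_age:
        raise ValueError("cohort must be 'B' or 'K'")
    if wave < 1:
        raise ValueError("wave must be 1 or greater")
    # Assumption: the age in the file code advances by two years per wave.
    return f"{cohort}{start_age[cohort] + 2 * (wave - 1)}"


for wave in (1, 2, 3):
    print(wave, file_code("B", wave), file_code("K", wave))
```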
7.3.1 Excel data dictionary
The Excel data dictionary contains two spreadsheets: one with the complete detailed listing of variable attributes, and another with a shorter listing in a print-ready format. The print-ready format contains the variable name, question, responses and population fields; users who prefer other fields can easily make their own printable versions.
The Excel version can be easily filtered using the drop-down menus in the first row of the spreadsheet. For example, to find all of the items on teaching practices in the lsacgr6 file (K cohort at wave 2), first click on the drop-down menu in the 'File' field, as shown in Figure 5, and select 'K6'. Next, repeat the process for the 'Topic' field, selecting 'Teaching practices'.
After the search is finished, all variables can be displayed again either by selecting the 'show all' option in each of the fields that have been filtered (see Figure 5) or by selecting 'Data > Filter > Show All' from the menus.
More advanced searches can be performed using the 'Custom Filter' option, which opens a dialogue box to assist with your searching. For example, to find all the questions that contain the word 'internet', go to the 'question' column, open the filter menu and click on 'Custom filter'. In the dialogue box, change 'equals' to 'contains' and type 'internet' in the box next to it.
Figure 5: Example of filtering in Excel
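For users who export the data dictionary and prefer to work in code, the two kinds of filter described in this section (an exact match on fields such as 'File' and 'Topic', and a 'contains' search on the question text) can be sketched in Python as follows. The records, variable names, dictionary keys and helper functions are all hypothetical and are not taken from the LSAC data dictionary.

```python
# Hypothetical records for illustration only; keys and values are assumptions.
records = [
    {"variable": "example_var_1", "file": "K6", "topic": "Teaching practices",
     "question": "How often do children use the Internet in class?"},
    {"variable": "example_var_2", "file": "K6", "topic": "Home activities",
     "question": "How often does the child read at home?"},
    {"variable": "example_var_3", "file": "B2", "topic": "Home activities",
     "question": "Does the household have internet access?"},
]


def filter_equals(rows, **criteria):
    """Keep rows whose fields exactly equal the given values (like a drop-down filter)."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def question_contains(rows, text):
    """Keep rows whose question contains the text, ignoring case (like a 'contains' filter)."""
    needle = text.lower()
    return [r for r in rows if needle in r["question"].lower()]


teaching = filter_equals(records, file="K6", topic="Teaching practices")
print([r["variable"] for r in teaching])

internet = question_contains(records, "internet")
print([r["variable"] for r in internet])
```

Ignoring case in `question_contains` is a design choice of this sketch, made so that 'internet' and 'Internet' are both found.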
7.3.2 Using wildcards for filtering
A good understanding of the variable naming convention is valuable for using the data dictionary. Both the online and Excel versions of the data dictionary can be searched and filtered using wildcards, which can be used to return thematically linked sets of variables. Both versions use two wildcard characters:
* represents any combination of characters
? represents any single character
Some examples of the use of these characters are as follows:
apw23a* returns the range of variables from apw23a1a through to apw23a4b.
apw23a4? returns two variables, apw23a4a and apw23a4b.
?pw23a4a shows if this variable exists over different waves.
apw23?4a shows if this variable exists for different people in the same wave.
?pw23?4a shows if this variable exists for different people in different waves.
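The behaviour of these two wildcards can be tried out with Python's standard `fnmatch` module, which gives `*` and `?` the same meanings. This is only a sketch for experimenting with patterns, not how either version of the data dictionary implements its search. In the list below, `apw23a1a`, `apw23a4a` and `apw23a4b` come from the examples above, while `apw23b4a` and `cpw23a4a` are hypothetical names added to show matches on the person and wave positions.

```python
from fnmatch import fnmatchcase

# The last two names are hypothetical and may not exist in the LSAC datasets.
names = ["apw23a1a", "apw23a4a", "apw23a4b", "apw23b4a", "cpw23a4a"]


def match(pattern, candidates):
    """Return the candidate names that match a wildcard pattern (case-sensitive)."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


print(match("apw23a*", names))   # every name that starts with apw23a
print(match("apw23a4?", names))  # one free character at the end
print(match("?pw23a4a", names))  # one free character at the start
print(match("apw23?4a", names))  # one free character in the sixth position
print(match("?pw23?4a", names))  # free characters in both positions
```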
7.3.3 Some useful tips for navigating the data dictionary
- Only items currently on the main datasets are included in the data dictionary. The User Guide provides information on the composition of other datasets.
- Items on the data dictionary are in the same order as on the data files but can easily be sorted into other orders; for example, grouping topics.
- Searching the online data dictionary finds whole words only (e.g. searching for 'child' won't find 'children'). However, an asterisk represents any combination of characters, so searching for 'child*' will find 'child', 'children', 'childcare', etc.
- The introduction page for the data dictionary contains a list of topics and constructs that can be used for finding the information you want.
- The 'Question ID' field gives the variable name without any wave or person indicators. Filtering by this field is the best way to tell which questions were asked of or about which people at which wave.
- The 'Topic ID' field gives the topic and associated two-digit question number for each item where this is appropriate. It can be used to link derived items with their associated input items.
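The difference between a whole-word search and a search with a trailing asterisk, as described in the tips above, can be illustrated with a regular expression. The sketch below is in Python and is an assumption about how such matching could work, not the online data dictionary's actual implementation; the function names and sample labels are hypothetical.

```python
import re


def search_pattern(term):
    """Compile a whole-word, case-insensitive pattern; '*' stands for any run of word characters."""
    parts = [re.escape(part) for part in term.split("*")]
    body = r"\w*".join(parts)
    return re.compile(rf"\b{body}\b", re.IGNORECASE)


def matches(term, text):
    """Return True if the search term is found in the text."""
    return search_pattern(term).search(text) is not None


# Hypothetical labels, for illustration only.
print(matches("child", "Study child weight"))       # whole word present
print(matches("child", "Number of children"))       # 'children' is a different word
print(matches("child*", "Number of children"))      # asterisk extends the match
print(matches("child*", "Childcare arrangements"))  # case is ignored in this sketch
```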
Please contact the LSAC Data Management team if you need any help with using the data dictionaries.