Data user guide

The Longitudinal Study of Australian Children: An Australian Government Initiative
Data User Guide – August 2018

12. Important issues for data analysis

A Data issues paper for waves 1 to 7 is available and details issues that have been identified over the course of the study. Data users should be aware of these issues when using LSAC data for their analysis. Other important issues are addressed below.

12.1 Weighting and external validity

The LSAC study design, based on a complex probability sample, is specifically designed to produce valid estimates at the population level. Unlike clinically based or convenience samples, the LSAC sample is population based by design. By properly accounting for the survey design when analysing the data, it is possible not only to make inferences about the children and families participating in the study but to make valid inferences about the entire population of children in the relevant age groups.

The LSAC sampling strategy has three important elements that distinguish it from a simple random sample (SRS):

  • stratification to ensure proportional representation of all states and both capital city and ex-metropolitan areas
  • clustering by postcode to both reduce field enumeration costs and allow the study of community-level effects on children's development and wellbeing
  • weighting to adjust for potential non-response bias and to provide population estimates.

It is the responsibility of data users to determine when and how each of these needs to be accounted for when developing their analyses.

12.1.1 Stratification

Stratification, by state and part of state, was employed to ensure that all geographic areas within Australia are represented in the sample in proportion to their population. This produces a more even distribution of the sample across geographic areas than could be expected from a simple random sample.

The use of stratification can be expected to reduce standard errors compared with a simple random sample with no control over the geographic spread of the sample. As such, when trying to extrapolate to the population, the stratification should be incorporated in the analysis of results from the survey in order to correctly calculate standard errors and confidence intervals.

12.1.2 Clustering

The use of clustering in the sample design has important consequences for the analysis of data from the study. Clustering is useful in reducing the field costs associated with the survey enumeration. Clustering also has the added benefit of making possible the analysis of community-level effects, by ensuring that a sufficient sample is selected from each postcode included in the survey.

However, the use of clustering violates the standard assumption of independence of the observations that is fundamental to many statistical routines in major statistical packages. When children or carers have more similar characteristics within a given postcode than children or carers selected purely at random, the responses within postcodes will be correlated. This correlation will lead to an increase in the standard errors and size of the confidence intervals. The extent of this increase is measured by the design effect, which is the ratio of the variance of an estimate from the survey to the variance that would have been achieved by a simple random sample of the same size.

Failure to account for clustering in the analysis can lead to under-estimating the size of standard errors and confidence intervals. In some circumstances, this can result in misleading conclusions of statistical significance.
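As a rough illustration of the design effect described above, the following sketch computes the ratio of the two variances and uses it to inflate a standard error calculated under a simple random sample assumption. The function names and all numeric values are hypothetical and are not LSAC results.

```python
import math


def design_effect(var_design: float, var_srs: float) -> float:
    """Ratio of the design-based variance to the variance that a simple
    random sample of the same size would have achieved."""
    if var_srs <= 0:
        raise ValueError("var_srs must be positive")
    return var_design / var_srs


def adjusted_standard_error(se_srs: float, deff: float) -> float:
    """Inflate an SRS-based standard error by the square root of the
    design effect."""
    return se_srs * math.sqrt(deff)


# Hypothetical variances for one estimate (not taken from LSAC).
deff = design_effect(var_design=0.00045, var_srs=0.0003)
se = adjusted_standard_error(se_srs=0.02, deff=deff)
print(f"design effect: {deff:.2f}, adjusted SE: {se:.4f}")
```

A design effect above 1 means that a standard error calculated as if the data were a simple random sample is too small, which is how ignoring clustering can produce misleading significance tests.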

12.1.3 Weighting

The wave 1 weights provided in the LSAC data files take into account both the probability of selecting each child in the study and an adjustment for non-response. An analysis of possible differences in the characteristics of respondents and non-respondents was undertaken and identified two factors associated with the probability of participating in the survey - whether the mother speaks a language other than English at home, and whether the mother has completed Year 12. Both of these factors were incorporated into the wave 1 survey weighting so that, to the best extent possible, the use of the sample weights offsets the bias that may be introduced into the data by differential non-response patterns.

At each subsequent wave of data collection, weights have been adjusted to account for the differential probability of response as estimated by regression. The weights are then calibrated back to the stratum benchmarks and a small number of cases have their weights top or bottom coded to prevent any case having too great or small an effect on the data.

At wave 3, it became necessary for the first time to produce longitudinal as well as cross-sectional weights, and both have been produced from wave 3 onwards. Cross-sectional weights adjust the sample attained at the current wave to be representative of the population at the time of selection (i.e. when first interviewed), while longitudinal weights do the same for the sample that has responded to all waves of the survey.

More detailed information on the weighting variables can be found in LSAC Technical Papers No. 3, 5, 6, 9, 10, 15 and 16.

Three types of weight are included in the LSAC datasets:

  • Child population weights - these weights are used to produce population estimates based on the LSAC data (e.g. based on LSAC data there are 22,464 children born in March 2003-February 2004 in Australia that were never breastfed).

    The sum of the responding B cohort child population weights is 243,026 and the sum of the K cohort child population weights is 253,202, which are the ABS-estimated resident population counts of children aged 0 years and 4 years, respectively, at end March 2004, adjusted for the remote parts of Australia that were excluded from the study design.

  • Child sample weight - this is the child population weight rescaled such that the sum of the weights matches the number of children in the sample (e.g. 5,107 B cohort and 4,983 K cohort in wave 1).

    This weight is used in analyses that expect the weights to sum to the sample size rather than the population, particularly when tests of statistical significance are involved.

  • Time use data day weight (for waves 1, 2 and 3 only) - this is the sample weight adjusted so that each day of the week receives equal weight in analyses of time use data.
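The relationship between the child population weight and the child sample weight can be sketched as a simple rescaling. This is a minimal illustration with hypothetical weights. It assumes the rescaling is applied across one set of responding cases, and it is not the procedure used to produce the weights on the LSAC files.

```python
def rescale_to_sample_weights(population_weights):
    """Rescale population weights so that they sum to the number of cases."""
    n = len(population_weights)
    total = sum(population_weights)
    if n == 0 or total <= 0:
        raise ValueError("need at least one case and a positive weight total")
    factor = n / total
    return [w * factor for w in population_weights]


# Hypothetical population weights for four responding children.
pop_w = [40.0, 55.0, 45.0, 60.0]
sample_w = rescale_to_sample_weights(pop_w)
print(sample_w, sum(sample_w))
```

The rescaling leaves the relative size of the weights unchanged, so weighted means and proportions are the same under either weight. Only totals, and any test that treats the sum of the weights as the sample size, are affected.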

Data files for wave 1 and wave 2 each have one population weight and one sample weight. Given that there are no cases that responded to wave 2 and didn't respond to wave 1, these weights can be used for both longitudinal and cross-sectional analyses.

At wave 3, two sample weights and two population weights are necessary as this is the first time that respondents could return to the study after missing a wave. The first of these weights the full wave 3 sample and should be used for cross-sectional analyses. The second weights the sample that has responded to all waves, and should be used for longitudinal analyses.

A complete list of LSAC weighting variables is given in Tables 15 (B cohort) and 16 (K cohort).

Table 15: Weighting variables for B cohort
Variable name Cohort Type Waves cases responded to Used for
aweight B Population 1 Wave 1 cross-sectional analyses
aweights B Sample 1 Wave 1 cross-sectional analyses
aweightd B Day 1 Wave 1 cross-sectional analyses
bweight B Population 1 & 2 Wave 2 cross-sectional analyses and longitudinal analyses involving waves 1 & 2
bweights B Sample 1 & 2 Wave 2 cross-sectional analyses and longitudinal analyses involving waves 1 & 2
bweightd B Day 1 & 2 Wave 2 cross-sectional analyses and longitudinal analyses involving waves 1 & 2
cweight B Population 1 & 3 Wave 3 cross-sectional analyses and longitudinal analyses involving waves 1 & 3
cweights B Sample 1 & 3 Wave 3 cross-sectional analyses and longitudinal analyses involving waves 1 & 3
cweightd B Day 1 & 3 Wave 3 cross-sectional analyses and longitudinal analyses involving waves 1 & 3
bcwt B Population 1, 2 & 3 Longitudinal analyses involving all waves up to wave 3
bcwts B Sample 1, 2 & 3 Longitudinal analyses involving all waves up to wave 3
bcwtd B Day 1, 2 & 3 Longitudinal analyses involving all waves up to wave 3
dweight B Population 1 & 4 Wave 4 cross-sectional analyses and longitudinal analyses involving waves 1 & 4
dweights B Sample 1 & 4 Wave 4 cross-sectional analyses and longitudinal analyses involving waves 1 & 4
eweight B Population 1 & 5 Wave 5 cross-sectional analyses and longitudinal analyses involving waves 1 & 5
eweights B Sample 1 & 5 Wave 5 cross-sectional analyses and longitudinal analyses involving waves 1 & 5
bdwt B Population 1, 2 & 4 Longitudinal analyses involving waves 2 & 4, or waves 1, 2 & 4
bdwts B Sample 1, 2 & 4 Longitudinal analyses involving waves 2 & 4, or waves 1, 2 & 4
cdwt B Population 1, 3 & 4 Longitudinal analyses involving waves 3 & 4, or waves 1, 3 & 4
cdwts B Sample 1, 3 & 4 Longitudinal analyses involving waves 3 & 4, or waves 1, 3 & 4
bcdwt B Population 1, 2, 3 & 4 Longitudinal analyses involving all waves up to wave 4
bcdwts B Sample 1, 2, 3 & 4 Longitudinal analyses involving all waves up to wave 4
bcdewt B Population 1, 2, 3, 4 & 5 Longitudinal analyses involving all waves up to wave 5
bcdewts B Sample 1, 2, 3, 4 & 5 Longitudinal analyses involving all waves up to wave 5
fweight B Population 1 & 6 Wave 6 cross-sectional analyses and longitudinal analyses involving waves 1 & 6
fweights B Sample 1 & 6 Wave 6 cross-sectional analyses and longitudinal analyses involving waves 1 & 6
bcdefwt B Population 1, 2, 3, 4, 5 & 6 Longitudinal analyses involving all waves up to wave 6
bcdefwts B Sample 1, 2, 3, 4, 5 & 6 Longitudinal analyses involving all waves up to wave 6
gweight B Population 1 & 7 Wave 7 cross-sectional analyses and longitudinal analyses involving waves 1 & 7
gweights B Sample 1 & 7 Wave 7 cross-sectional analyses and longitudinal analyses involving waves 1 & 7
bcdefgwt B Population 1, 2, 3, 4, 5, 6 & 7 Longitudinal analyses involving all waves up to wave 7
bcdefgwts B Sample 1, 2, 3, 4, 5, 6 & 7 Longitudinal analyses involving all waves up to wave 7
Table 16: Weighting variables for K cohort
Variable name Cohort Type Waves cases responded to Used for
cweight K Population 1 Wave 1 cross-sectional analyses
cweights K Sample 1 Wave 1 cross-sectional analyses
cweightd K Day 1 Wave 1 cross-sectional analyses
dweight K Population 1 & 2 Wave 2 cross-sectional analyses and longitudinal analyses involving waves 1 & 2
dweights K Sample 1 & 2 Wave 2 cross-sectional analyses and longitudinal analyses involving waves 1 & 2
dweightd K Day 1 & 2 Wave 2 cross-sectional analyses and longitudinal analyses involving waves 1 & 2
eweight K Population 1 & 3 Wave 3 cross-sectional analyses and longitudinal analyses involving waves 1 & 3
eweights K Sample 1 & 3 Wave 3 cross-sectional analyses and longitudinal analyses involving waves 1 & 3
eweightd K Day 1 & 3 Wave 3 cross-sectional analyses and longitudinal analyses involving waves 1 & 3
dewt K Population 1, 2 & 3 Longitudinal analyses involving all waves up to wave 3
dewts K Sample 1, 2 & 3 Longitudinal analyses involving all waves up to wave 3
dewtd K Day 1, 2 & 3 Longitudinal analyses involving all waves up to wave 3
fweight K Population 1 & 4 Wave 4 cross-sectional analyses and longitudinal analyses involving waves 1 & 4
fweights K Sample 1 & 4 Wave 4 cross-sectional analyses and longitudinal analyses involving waves 1 & 4
dfwt K Population 1, 2 & 4 Longitudinal analyses involving waves 2 & 4, or waves 1, 2 & 4
dfwts K Sample 1, 2 & 4 Longitudinal analyses involving waves 2 & 4, or waves 1, 2 & 4
efwt K Population 1, 3 & 4 Longitudinal analyses involving waves 3 & 4, or waves 1, 3 & 4
efwts K Sample 1, 3 & 4 Longitudinal analyses involving waves 3 & 4, or waves 1, 3 & 4
defwt K Population 1, 2, 3 & 4 Longitudinal analyses involving all waves up to wave 4
defwts K Sample 1, 2, 3 & 4 Longitudinal analyses involving all waves up to wave 4
gweight K Population 1 & 5 Wave 5 cross-sectional analyses and longitudinal analyses involving waves 1 & 5
gweights K Sample 1 & 5 Wave 5 cross-sectional analyses and longitudinal analyses involving waves 1 & 5
defgwt K Population 1, 2, 3, 4 & 5 Longitudinal analyses involving all waves up to wave 5
defgwts K Sample 1, 2, 3, 4 & 5 Longitudinal analyses involving all waves up to wave 5
hweight K Population 1 & 6 Wave 6 cross-sectional analyses and longitudinal analyses involving waves 1 & 6
hweights K Sample 1 & 6 Wave 6 cross-sectional analyses and longitudinal analyses involving waves 1 & 6
defghwt K Population 1, 2, 3, 4, 5 & 6 Longitudinal analyses involving all waves up to wave 6
defghwts K Sample 1, 2, 3, 4, 5 & 6 Longitudinal analyses involving all waves up to wave 6

iweight K Population 1 & 7 Wave 7 cross-sectional analyses and longitudinal analyses involving waves 1 & 7
iweights K Sample 1 & 7 Wave 7 cross-sectional analyses and longitudinal analyses involving waves 1 & 7
defghiwt K Population 1, 2, 3, 4, 5, 6 & 7 Longitudinal analyses involving all waves up to wave 7
defghiwts K Sample 1, 2, 3, 4, 5, 6 & 7 Longitudinal analyses involving all waves up to wave 7
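The variable names in Tables 15 and 16 appear to follow a regular pattern, which the following sketch reproduces for two common cases: the cross-sectional weight for a wave, and the longitudinal weight for cases that responded to all waves up to that wave. The helper name and its arguments are hypothetical, and the pattern is inferred from the tables rather than documented elsewhere. Weights for other wave combinations (for example bdwt or efwt) are not covered and should be read from the tables.

```python
# Wave letters by cohort, as they appear in Tables 15 and 16.
WAVE_LETTERS = {"B": "abcdefg", "K": "cdefghi"}
SUFFIX = {"population": "", "sample": "s", "day": "d"}


def weight_variable(cohort, wave, kind="population", longitudinal=False):
    """Return the weight variable name implied by the naming pattern."""
    letters = WAVE_LETTERS[cohort]
    if not 1 <= wave <= len(letters):
        raise ValueError("wave must be between 1 and 7")
    if kind == "day" and wave > 3:
        raise ValueError("day weights are listed for waves 1 to 3 only")
    suffix = SUFFIX[kind]
    if longitudinal and wave >= 3:
        # All-waves longitudinal weight: letters for waves 2..wave, then 'wt'.
        return letters[1:wave] + "wt" + suffix
    # Waves 1 and 2 use one weight for both purposes.
    return letters[wave - 1] + "weight" + suffix


print(weight_variable("B", 7, "population", longitudinal=True))
print(weight_variable("K", 7, "sample"))
```

Checking any generated name against the tables before use is advisable, since the same name can refer to different waves in the two cohorts (for example, cweight is wave 3 for the B cohort and wave 1 for the K cohort).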

12.1.4 Survey estimation and analysis techniques

Survey estimation and analysis techniques are available that can take all three key features of the study design into account, and many of these techniques are now included in commercially available software. Incorporating the study design features into analyses of the study can produce externally valid results at the full population level. Estimates of means, proportions and totals incorporating the study design provide the best estimate of the true means, proportions and totals within the total population.
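As a minimal sketch of the point estimates mentioned above, a weighted total and a weighted mean (or proportion, for a 0/1 indicator) can be computed as follows. The indicator values and weights are hypothetical. The functions produce point estimates only; standard errors and confidence intervals still require survey procedures that account for stratification and clustering.

```python
def weighted_total(values, weights):
    """Estimated population total: sum of weight times value."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * y for y, w in zip(values, weights))


def weighted_mean(values, weights):
    """Estimated population mean (a proportion for 0/1 values)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return weighted_total(values, weights) / total_weight


# Hypothetical 0/1 indicator (1 = never breastfed) and population weights.
never_breastfed = [1, 0, 0, 1]
pop_w = [40.0, 55.0, 45.0, 60.0]

est_total = weighted_total(never_breastfed, pop_w)
est_prop = weighted_mean(never_breastfed, pop_w)
print(est_total, est_prop)
```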

Analytic techniques, particularly modelling, aim at exploring relationships within the data and are able to estimate the best fitting model for the underlying population, not just the best fitting model for the sample, when properly applied to account for the study design.

12.1.5 Useful references

An overview of population survey methods is given by Levy and Lemeshow (1999). They discuss the use of stratification, weighting and clustering in survey design and the impact it has on the analysis of sample survey data.

For a thorough discussion of the mathematical techniques used to analyse data from complex surveys, see Chambers and Skinner (2003).

12.1.6 Software

There is now a range of software available from a number of vendors that supports the analysis of data from complex survey designs incorporating stratification, clustering and weighting. These include SAS (using the SURVEYMEANS and SURVEYREG procedures), STATA (using the svy commands), and SPSS (through the SPSS Complex Samples add-on module), as well as software packages specifically designed for the analysis of sample survey data such as WesVar and SUDAAN.

Use of the appropriate analytic techniques from one or more of these packages is recommended for researchers analysing the LSAC data. Results that properly account for the sample design features will have the greatest external validity and should be appropriate for drawing inferences about the total population of children from which the sample was taken.

The following shows a template for using the SURVEYREG and SURVEYMEANS procedures in SAS.

proc surveyreg data=<filename> total=<stratumfile>;
  stratum stratum;
  cluster pcodes;
  model <standard SAS model details>;
  weight weights;
run;

proc surveymeans data=<filename> total=<stratumfile>;
  stratum stratum;
  cluster pcodes;
  var <variable names>;
  weight weights;
run;

Where:

stratum: is a variable you can calculate for lsac0 using the formula:

stratum=int(mod(hicid,10000000)/100000);

pcodes: is the postcode of selection (already on the data file)

weights: is the sample weight (preferred to the population weight for this analysis)

<stratumfile> is a file that contains the number of Primary Sampling Units (in this case postcode clusters) in each stratum. It is included on the data CD or can be set up using the following code.

data stratum;
  input stratum _total_;
  datalines;
11 295
13 168
14 160
21 202
22 58
23 95
24 316
31 116
33 121
34 108
41 110
43 34
44 131
51 82
52 86
53 32
54 103
61 28
63 38
71 9
73 3
74 1
81 23
;
run;
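For users working outside SAS, the stratum formula and the stratum totals above can be sketched in Python as follows. This assumes hicid is held as a non-negative integer, in which case integer division gives the same result as the SAS int() function. The example hicid values are hypothetical and do not refer to real study children.

```python
def stratum_from_hicid(hicid: int) -> int:
    """Python equivalent of: stratum = int(mod(hicid, 10000000) / 100000)."""
    if hicid < 0:
        raise ValueError("hicid is assumed to be non-negative")
    return (hicid % 10_000_000) // 100_000


# Number of primary sampling units (postcode clusters) in each stratum,
# copied from the SAS datalines above.
PSU_TOTALS = {
    11: 295, 13: 168, 14: 160, 21: 202, 22: 58, 23: 95, 24: 316,
    31: 116, 33: 121, 34: 108, 41: 110, 43: 34, 44: 131, 51: 82,
    52: 86, 53: 32, 54: 103, 61: 28, 63: 38, 71: 9, 73: 3, 74: 1,
    81: 23,
}

# Hypothetical identifiers, used only to show the arithmetic.
for example_hicid in (41123456, 52412345):
    stratum = stratum_from_hicid(example_hicid)
    print(example_hicid, stratum, PSU_TOTALS.get(stratum))
```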

12.2 Unit of analysis

The child is the unit of selection in LSAC and estimates produced from this survey are of children, not of parents or families. It is important that this point is understood when producing population estimates from this survey.

Using the estimates to count families/parents will produce an over-count of the number of families/parents, due to the multiple (or over) counting of children from multiple births. Although this will not make a huge difference to the actual numbers, it may be important in the interpretation of the information and in comparing data from other sources.
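The size of the over-count can be illustrated with a small sketch that compares the sum of the child weights with a total that counts each family once. The family identifier, the weights and the rule of keeping the first child per family are all hypothetical. The sketch is intended only to show where the over-count comes from and is not a substitute for family weights.

```python
def child_and_family_totals(records):
    """records: iterable of (family_id, child_weight) pairs, one per study child.

    Returns (sum of all child weights, sum counting each family once).
    """
    child_total = 0.0
    first_weight_by_family = {}
    for family_id, weight in records:
        child_total += weight
        # Keep only the first child's weight seen for each family.
        first_weight_by_family.setdefault(family_id, weight)
    family_total = sum(first_weight_by_family.values())
    return child_total, family_total


# Hypothetical data: family "f2" has twins who are both study children.
records = [("f1", 50.0), ("f2", 48.0), ("f2", 48.0), ("f3", 52.0)]
child_total, family_total = child_and_family_totals(records)
print(child_total, family_total, child_total - family_total)
```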

Although it is possible to produce 'family' weights, it is not considered a worthwhile use of resources given the small number of analyses this could possibly meaningfully affect.

12.3 Age at interview

Different ages of children should be accounted for in any analyses focused on age-dependent measures such as cognitive and motor development. Figures 7 and 8 show the age distribution of the two cohorts at each wave. The figures show the age of the child as a base figure (i.e. 0, 2, 4, 6, 8, 10, 12, 14 and 16 years) plus a number of months. For example, a B cohort child aged three years and one month at time of interview in wave 2 is shown against '13' on the x-axis, that is, 13 months above the wave 2 base age of two years (see the red line).
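The conversion from a child's age to the number of months plotted on the x-axis can be sketched as follows. The base ages used here (0 years for the B cohort and 4 years for the K cohort at wave 1, increasing by two years per wave) are an assumption inferred from the base figures listed above, and the function name is hypothetical.

```python
# Assumed base age in years at wave 1 for each cohort.
BASE_AGE_WAVE1_YEARS = {"B": 0, "K": 4}


def months_above_base(cohort: str, wave: int, age_in_months: int) -> int:
    """Months by which a child's age exceeds the base age for the wave."""
    if not 1 <= wave <= 7:
        raise ValueError("wave must be between 1 and 7")
    base_years = BASE_AGE_WAVE1_YEARS[cohort] + 2 * (wave - 1)
    return age_in_months - 12 * base_years


# The example from the text: B cohort, wave 2, aged 3 years and 1 month.
print(months_above_base("B", 2, 3 * 12 + 1))
```

Analyses of age-dependent measures may prefer to use age in months directly as a covariate rather than this offset; the offset is shown only to help read the figures.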

Figure 7: Age distribution of B cohort sample at each wave

Figure 8: Age distribution of K cohort sample at each wave

12.4 Time between interviews

Effort is made to ensure that the time between interviews is close to two years; however, in some cases this is not possible. Figures 9 and 10 show the distribution of the intervals between waves.

Figure 9: Distribution of time between interviews, B cohort, waves 1-7

Figure 10: Distribution of time between interviews, K cohort, waves 1-7

12.5 Cross-cohort comparisons

The two cohorts of LSAC were selected and weighted to represent similar but distinct populations. For the B cohort, the reference population is '0-year-old children in Australia in 2004 excluding those from certain remote postcodes', while for the K cohort the reference population is '4-year-old children in Australia in 2004 excluding those from certain remote postcodes'. One implication is that the K cohort will include a greater number of children born overseas, as there was more time for families to immigrate to Australia between the birth of their child and selection into the study. In the 2001 Census, 4.4% of 4-year-olds had been born overseas, compared with 0.8% of 0-year-olds. The corresponding weighted percentages in LSAC at wave 1 were 4.2% and 0.4%.

There are also other demographic differences between the populations, which are reflected in the benchmarks used to weight the two cohorts. Figure 11 shows the population percentages in each state by part of state and by gender stratum for the B and K cohorts. The B and K cohort figures generally match closely; however, the population from which the K cohort was selected was a little more likely to live in capital cities (66.5% vs 63.6%). Figure 12 shows the population proportions for mothers having completed Year 12 by state and part of state for each cohort. Mothers in the B cohort population were more likely to have completed Year 12 in every part of the country, with the ABS census figures nationally being 56.6% for the B cohort against 48.3% for the K cohort. Figure 13 shows the population proportions for mothers speaking a language other than English at home by state and part of state for each cohort. These proportions were more closely matched between the B and K cohorts.

The implication is that, even though the two cohorts have been weighted using similar variables, those variables may still be responsible for differences observed between the two cohorts. For example, although non-response related to maternal education has been adjusted for in both cohorts, the cohorts will not necessarily have equal proportions of mothers who had completed Year 12 when the weights are applied. Different levels of maternal education could therefore explain differences observed between the two samples in the educational attainment of children.

Figure 11: Cohort benchmarks by state, part of state and gender

Note: There are no respondents from non-metropolitan ACT.

Figure 12: Proportion of mothers who completed Year 12, cohort benchmarks by state and part of state

Note: There are no respondents from non-metropolitan ACT.

Figure 13: Proportion of mothers who speak a language other than English at home, cohort benchmarks by state and part of state

Note: There are no respondents from non-metropolitan ACT.

12.6 Sample characteristics

To assist in the assessment of the representativeness of the wave 1 sample, selected characteristics were compared with ABS estimates: gender, state and region were compared with the ABS September 2004 Estimated Resident Population figures; the other characteristics were compared with (previously unpublished) population data from the ABS 2001 Census of Population and Housing (see Table 17).

Table 17: Wave 1 sample characteristics compared with ABS data
  B cohort K cohort
Characteristics LSAC % ABS % LSAC % ABS %
Gender*        
Male 51.2 51.3 50.9 51.3
Female 48.8 48.7 49.1 48.7
Family type        
Two resident parents/guardians 90.7 88.1 86.0 82.0
One resident parent/guardian 9.3 11.9 14.0 18.0
Siblings        
Only child 39.5 36.2 11.5 12.1
One sibling 36.8 35.6 48.4 45.9
Two or more siblings 23.7 28.2 40.1 42.0
Ethnicity        
Study child Indigenous 4.5 4.3 3.8 4.3
Mother speaks a language other than English at home 14.5 16.8 15.7 17.6
Educational status        
Mother completed Year 12 66.9 56.6 58.6 48.3
Father completed Year 12 58.5 50.2 52.7 45.3
State*        
New South Wales 31.6 34.1 31.6 33.7
Victoria 24.5 24.6 25.0 23.8
Queensland 20.6 19.3 19.8 19.7
South Australia 6.8 6.8 6.8 7.2
Western Australia 10.4 9.9 10.2 10.1
Tasmania 2.2 2.3 2.7 2.5
Northern Territory 1.7 1.4 1.7 1.6
Australian Capital Territory 2.1 1.7 2.3 1.3
Region        
Capital city statistical division 62.5 63.7 62.1 62.1
Balance of state 37.5 36.3 37.9 37.9
Total 5,107   4,983  

Note: ABS data comes from the 2001 Census for families for 0 and 4 year olds, except where indicated with a *, where it is based on the September 2004 Estimated Resident Population for families of 0 and 4 year olds.

For most characteristics, the wave 1 sample is only marginally different from the ABS data. The largest difference is in the educational status of the parents. Children with mothers who have completed Year 12 are over-represented in the sample, with proportions about 10 percentage points higher than in the 2001 Census (66.9% vs 56.6% for the B cohort and 58.6% vs 48.3% for the K cohort).
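The gap in maternal Year 12 completion can be read from Table 17 either as a difference in percentage points or as a relative difference. The following sketch shows both calculations using the Table 17 figures; the function names are hypothetical.

```python
def percentage_point_gap(sample_pct: float, benchmark_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(sample_pct - benchmark_pct, 1)


def relative_gap_pct(sample_pct: float, benchmark_pct: float) -> float:
    """Difference expressed as a percentage of the benchmark value."""
    return round(100 * (sample_pct - benchmark_pct) / benchmark_pct, 1)


# Mother completed Year 12, LSAC % and ABS % from Table 17.
b_gap = percentage_point_gap(66.9, 56.6)
k_gap = percentage_point_gap(58.6, 48.3)
b_rel = relative_gap_pct(66.9, 56.6)
print(b_gap, k_gap, b_rel)
```

The two measures differ noticeably, so it is worth stating which one is meant when describing over-representation or under-representation.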

Other differences in the wave 1 sample include:

  • Children in lone-parent families are under-represented.
  • Children with two or more siblings are under-represented and only children are over-represented in the B cohort.
  • Children from an Aboriginal and Torres Strait Islander (ATSI) background are under-represented in the K cohort and marginally over-represented in the B cohort.
  • Children with mothers who speak a language other than English at home are under-represented.
  • Children in New South Wales are under-represented.

Table 18 shows the number of children in the wave 1 sample with selected characteristics, and gives the waves 2-7 response rates for children with these characteristics. As can be seen in the table, the greatest sample loss has been from Indigenous families and families where P1 speaks a language other than English at home.

Table 18: Response rates at waves 2-7 by selected sample characteristics
  Wave 1 n % responding at wave 2 % responding at wave 3 % responding at wave 4 % responding at wave 5 % responding at wave 6 % responding at wave 7
Characteristics B K B K B K B K B K B K B K
Full sample 5,107 4,983 90.2 89.6 85.9 86.9 83.1 83.7 80.0 79.4 73.7 71.0 73.4 69.2
Study child male 2,610 2,537 90.0 89.8 86.2 87.2 83.9 84.1 80.3 79.7 73.9 70.9 73.8 69.2
Study child female 2,497 2,446 90.3 89.4 85.5 86.6 82.2 83.2 79.6 79.1 73.4 71.0 73.0 69.2
Study child Indigenous 230 187 78.3 81.8 64.8 66.3 63.0 63.1 60.4 60.4 46.1 44.4 48.3 45.8
Mother speaks language other than English 740 778 83.9 83.8 75.0 76.6 72.0 71.1 68.6 66.1 61.1 58.5 61.4 60.7
Mother did not complete Year 12 1,688 2,044 84.8 86.5 78.8 81.7 74.4 78.1 70.1 72.6 61.4 62.2 62.5 59.9
Father did not complete Year 12 1,890 2,016 90.0 90.1 85.9 87.0 83.6 84.9 79.7 80.9 73.0 71.6 70.9 68.2
New South Wales 1,615 1,573 90.3 90.1 84.4 86.3 81.8 81.8 79.8 78.2 71.2 70.2 70.6 67.9
Victoria 1,251 1,245 88.4 86.3 85.1 86.0 81.9 83.1 76.6 76.7 71.5 68.1 71.5 66.1
Queensland 1,054 988 91.4 90.8 88.0 87.2 84.3 84.0 82.4 80.9 75.4 71.9 76.5 72.0
South Australia 347 339 91.1 89.4 88.2 86.7 85.9 83.2 81.0 79.6 76.1 70.5 73.1 65.0
Western Australia 533 507 89.7 91.5 83.9 87.6 81.6 86.0 78.6 81.1 75.0 73.0 75.3 72.4
Tasmania 113 136 90.3 94.1 92.0 91.2 92.9 90.4 91.2 87.5 88.5 83.1 88.2 82.8
Northern Territory 87 82 90.8 89.0 83.9 87.8 80.5 89.0 81.6 86.6 79.3 72.0 77.2 71.2
Australian Capital Territory 107 113 97.2 94.7 95.3 94.7 93.5 92.0 89.7 89.4 85.0 82.3 78.8 73.8
Capital city statistical division 3,194 3,095 90.6 89.3 86.2 86.8 82.9 82.8 79.9 78.7 74.1 70.4 73.2 69.7
Balance of state 1,913 1,888 89.5 90.0 85.4 87.2 83.3 85.0 80.1 80.5 73.0 72.0 73.8 68.4