After data collection

Contact Martin Zettersten (martincz@princeton.edu) with questions about data submission and data preparation (please read the text below carefully first!).

Data templates

The primary deliverables for the project are two data files, filled out by your lab from the templates provided below. Please download a copy of each template and use it as a guide for formatting your lab’s data. We prefer files in the CSV data format. PLEASE NOTE that saving in this format will remove any formulas or other non-plain-text features of your spreadsheet (e.g., color fills, formatting); all information should be captured in the text within each cell.

Participant Data – A .csv file with one row for each participant, and with columns showing the participant’s subject number, age, demographic information, notes on the session, etc. Participant Data - CSV template

Trial Data – A .csv file with one row for each trial, and with columns showing the participant’s subject number, trial number, trial type, looking time, etc. Trial Data - CSV template

IMPORTANT NOTE 1: These two files must use identical, anonymous subject identifiers (‘participant_id’). We must be able to link participant- and trial-level data across files! You can use your lab’s normal participant numbering convention (e.g., 001, 002, 003, etc.), as long as participant IDs DO NOT include any private information (e.g., initials, birth date, gender).

IMPORTANT NOTE 2: These files must contain de-identified data ONLY. All potentially identifying information should be stripped from your data file before submission. For example, you SHOULD NOT include birth date and test date in your Participant Data file. Instead, use an age calculating tool (e.g., https://www.calculator.net/age-calculator.html) to calculate each participant’s age in days, and report that value in the ‘participant_age_days’ variable. If you have any questions about ensuring your data is de-identified, email contact@manybabies.org.

IMPORTANT NOTE 3: It’s really important to remember that these files are designed to be read by a computer program, not a person. Anything that violates the template (e.g., variables that aren’t of the specified type, formatting, comments, etc.) will not work. For example, cells in the column “lang1_exposure” in the participant data file should contain numbers. If you write “80 to 90”, this will cause errors because it contains characters in addition to numbers (note: please check questionnaire responses before participants leave the lab to avoid NA responses). If you have questions, comments, or calculations, please communicate directly with the analysis team, rather than embedding them in the data.

IMPORTANT NOTE 4: Please do not leave any fields blank. If something does not logically have an answer, or if you did not collect this information, please mark it as “NA”.

Language. If you collect data from children who are learning more than one language, please provide an approximate percentage of exposure to each language, either by parental report or, if it is standard practice in your lab, using a day-in-the-life style questionnaire administered by the RA. The total should add up to 100%.
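For labs that prefer to script these steps rather than do them by hand, the following is a minimal Python sketch of three of the checks described above: computing age in days from birth and test dates (which stay on your own machine), confirming that every ‘participant_id’ appears in both files, and summing language exposure per participant. The column names ‘participant_id’, ‘participant_age_days’, and ‘lang1_exposure’ come from the text above; ‘lang2_exposure’ and ‘trial_num’ are assumed names used only for illustration, so please check the data dictionary for the actual variable names. This is not part of the official pipeline.

```python
import csv
import io
from datetime import date


def age_in_days(birth_date, test_date):
    """Return the participant's age in days on the day of testing."""
    if test_date < birth_date:
        raise ValueError("test date is before birth date")
    return (test_date - birth_date).days


def unlinked_ids(participant_rows, trial_rows):
    """Return participant_id values that appear in only one of the two files."""
    participant_ids = {row["participant_id"] for row in participant_rows}
    trial_ids = {row["participant_id"] for row in trial_rows}
    return sorted(participant_ids ^ trial_ids)


def exposure_total(row, columns=("lang1_exposure", "lang2_exposure")):
    """Sum the language-exposure percentages in one row, skipping 'NA' cells.

    The default column names are partly assumed; adjust them to the dictionary.
    """
    return sum(float(row[c]) for c in columns if row[c] != "NA")


# Tiny made-up files standing in for the two CSVs (all values are invented).
participants = list(csv.DictReader(io.StringIO(
    "participant_id,participant_age_days,lang1_exposure,lang2_exposure\n"
    "001,181,80,20\n"
    "002,175,100,NA\n"
)))
trials = list(csv.DictReader(io.StringIO(
    "participant_id,trial_num\n"
    "001,1\n"
    "003,1\n"
)))

print(age_in_days(date(2023, 1, 1), date(2023, 7, 1)))
print(unlinked_ids(participants, trials))
print([exposure_total(row) for row in participants])
```

Reading the files with csv.DictReader keeps every cell as text, so identifiers such as 001 keep their leading zeros. Any ID returned by unlinked_ids is present in only one file and should be resolved before validation.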

Data dictionary

MB5 Data Dictionary – This spreadsheet lists all of the variables that need to go into the Participant Data and Trial Data files (Note that there is one worksheet/tab for each data file). Each row contains a variable and that variable’s specified format (e.g., string, integer), set of example values, and description. It is important that your lab’s data follows these specifications exactly in order to allow for data harmonization with the full dataset. MB5 Data Dictionary

Data Validation

After preparing your Participant Data and Trial Data files, please use the MB Data Validator to ensure your lab’s data is in the correct format (MB Validator User Manual).

Common issues:

If the validator is rejecting your CSV file, it may be due to regional conventions for number formatting, such as the use of commas as decimal points. Please use periods (‘.’) and not commas (‘,’) as decimal points.

Make sure that there are no stray marks in cells outside of your data range. For example, a space entered in a cell in an otherwise empty row or column will cause an error.

If you encounter any unexpected issues, please send an email to Martin Zettersten (martincz@princeton.edu). Your data files MUST pass validation before submission.
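Before running the validator, it can help to scan a file for the common issues listed above. The sketch below is a hypothetical pre-check written for illustration only; it does not reproduce the MB Data Validator’s rules and does not replace it. It flags blank cells, cells that look like numbers written with a decimal comma, and rows whose cell count differs from the header (which is how a stray mark outside the data range typically shows up).

```python
import csv
import io
import re

# Matches values such as "80,5" or "-3,25": digits, one comma, digits.
DECIMAL_COMMA = re.compile(r"^-?\d+,\d+$")


def precheck(csv_text):
    """Return a list of human-readable problems found in CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    # Row numbers count the header as row 1, as a spreadsheet would.
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} cells, found {len(row)}"
            )
        for name, cell in zip(header, row):
            value = cell.strip()
            if value == "":
                problems.append(f"row {row_no}, {name}: blank cell (use NA)")
            elif DECIMAL_COMMA.match(value):
                problems.append(f"row {row_no}, {name}: decimal comma in {cell!r}")
    return problems


# Made-up example: row 3 uses a decimal comma, row 4 has a blank cell.
sample = 'participant_id,lang1_exposure\n001,80.5\n002,"80,5"\n003,\n'
for problem in precheck(sample):
    print(problem)
```

To check a real file, read its contents into a string first (for example with open(path, newline="", encoding="utf-8").read()) and pass that string to precheck.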

Data submission

Once your data files have passed validation, please upload both the Participant Data and Trial Data files using the MB5 Data Upload form. Please take note of the following:

It is essential that both file names include your ManyBabies LabID. Refer to the LabID list here to find your lab’s unique LabID.

Use the following naming convention for your Participant and Trial Data files:

yourLabID_participant_data.csv (e.g., babylabPrinceton_participant_data.csv)
yourLabID_trial_data.csv (e.g., babylabPrinceton_trial_data.csv)
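If your lab exports its files from a script, the naming convention can be applied automatically. The helper below is a hypothetical illustration (the function name is not part of any ManyBabies tooling); it simply builds the two file names from a LabID.

```python
def submission_filenames(lab_id):
    """Build the two submission file names from a ManyBabies LabID."""
    lab_id = lab_id.strip()
    if not lab_id:
        raise ValueError("LabID must not be empty")
    return {
        "participant": f"{lab_id}_participant_data.csv",
        "trial": f"{lab_id}_trial_data.csv",
    }


names = submission_filenames("babylabPrinceton")
print(names["participant"])
print(names["trial"])
```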

Sharing ‘raw’ data

For most labs, the “raw” data generated by the software that runs the study will be in a different format from the one we are asking labs to submit as their machine-readable, de-identified data. This creates a problem for reproducibility of the data pipeline. In addition to potential error-checking, the raw data may be useful for secondary analyses, for example to map variability between labs.

Ideally, we would ask all labs to share both their actual raw data (generated immediately when the study is run) and any code they use to convert it to the submission format. However, doing so would likely raise concerns about sharing identifiable data, as these raw data outputs may contain birthdates or other participant-identifying information.

A related concern is that many labs may be converting the raw data into the submission-formatted files by hand, which may be prone to human error. We strongly encourage labs to develop processes for making these conversions in an automated way (e.g., in an R script). In the coming months we will be soliciting suggestions and reviewing the practicality of helping labs develop these practices, as well as making a decision about whether and how to collect raw data from labs that do not use eyetrackers. We welcome comments and suggestions from contributors on this issue!

Submitting raw eye-tracking data

There are additional considerations for labs submitting eye-tracking data. First, the raw data is useful for studying variability between labs in the extraction of looking times. Second, the pupil size data provides interesting additional information about cognitive processing that can be used in follow-up or spin-off studies. We are therefore asking all laboratories submitting eye-tracking data to provide these raw data.

For the data to be most useful and accessible for secondary analyses, the preferred format is to submit 1) CSV or TSV files (or any other plain text format); please select the export options so that the file remains as unchanged as possible (select all possible variables, no fixation filters, etc.) and 2) annotated R code (or other script) that transforms these CSV or TSV files into the trial-level format that needs to be submitted. Please upload these files using the MB5 Data Upload form in addition to the Participant Data and Trial Data files (see above for details). For purposes of transparency and replicability, using a standardized format for the eye-tracking data is preferable (e.g., the Peekbank format).

Please take care that the raw eye-tracking data does not contain any identifying information such as birth dates, zip codes, etc.! If you have any concerns regarding the sharing of your raw data (e.g., with respect to participant consent/ethics approval), please contact the project leaders.

In the Lab Questionnaire, report which method you used to compute the looking times.

In your upload, please also include the (annotated!) script that was used to compute the looking times from the raw data.

Video records of data collection (optional)

If you are sharing videos of your data collection (and this is strongly encouraged, if it’s at all possible given your ethics approval), you can store them in Databrary if you are a member. The naming convention for Databrary volumes is “ManyBabies5: yourLabID” (e.g. “ManyBabies5: babylabPrinceton”). We ask that you use this naming convention so people can easily search for all the ManyBabies-related volumes.

Thank you for being part of ManyBabies 5!