22  Data Validation and Submission

22.1 Data Validation

After preparing your Participant Data and Trial Data files, please use the ManyBabies Data Validator to ensure your lab’s data is in the correct format.

Common issues: If the validator rejects your CSV file, the cause may be regional differences in number formatting, such as the use of commas as decimal separators. Please use periods (‘.’), not commas (‘,’), as decimal points. Also make sure that there are no stray marks in cells outside of your data range; for example, a space entered in a cell in an otherwise empty row or column will cause an error. If you encounter any unexpected issues, please email us. Your data files MUST pass validation before submission.
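The common issues above can be screened for locally before running the validator. The following is a minimal Python sketch, not part of the ManyBabies Data Validator; the function name precheck_csv and the rule that the header row defines the data range are assumptions made for illustration.

```python
import csv
import io
import re

# A number written with a comma between digits, e.g. "3,5" or "-0,25".
# This also flags thousands separators such as "1,000".
DECIMAL_COMMA = re.compile(r"^-?\d+,\d+$")


def precheck_csv(text):
    """Return a list of (row, column, message) problems found in CSV text.

    Rows and columns are 1-indexed. Assumption: the header row defines
    the width of the data range.
    """
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return problems
    width = len(rows[0])
    for r, row in enumerate(rows, start=1):
        for c, cell in enumerate(row, start=1):
            if DECIMAL_COMMA.match(cell.strip()):
                problems.append((r, c, "decimal comma"))
            if cell != "" and cell.strip() == "":
                problems.append((r, c, "whitespace-only cell"))
            if c > width and cell != "":
                problems.append((r, c, "content outside data range"))
    return problems
```

For example, precheck_csv(open("yourLabID_trial_data.csv", encoding="utf-8").read()) returns an empty list when none of these issues is found. Note that an unquoted decimal comma in a comma-delimited file splits the value into two cells, so it is reported as content outside the data range rather than as a decimal comma. Passing this pre-check does not replace validation with the ManyBabies Data Validator.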

22.2 Data submission

Once your data files have passed validation, please upload both the Participant Data and Trial Data files using the MB5 Data Upload form. Please take note of the following:

It is essential that both file names include your ManyBabies LabID. Refer to the LabID list here to find your lab’s unique LabID.

Use the following naming convention for your Participant and Trial Data files:

yourLabID_participant_data.csv (e.g., babylabPrinceton_participant_data.csv)
yourLabID_trial_data.csv (e.g., babylabPrinceton_trial_data.csv)
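As a quick check of the naming convention before upload, a lab could compare its file names against the two expected names. This is a minimal Python sketch; the function name missing_submission_files is hypothetical and the check is not performed by the upload form itself.

```python
def missing_submission_files(lab_id, filenames):
    """Return the expected file names that are absent from `filenames`.

    The expected names follow the convention
    yourLabID_participant_data.csv and yourLabID_trial_data.csv.
    """
    expected = {
        f"{lab_id}_participant_data.csv",
        f"{lab_id}_trial_data.csv",
    }
    # Exact, case-sensitive comparison: a misspelled or differently
    # capitalized file name counts as missing.
    return sorted(expected - set(filenames))
```

An empty result means both files are named as expected. Any name returned is one that still needs to be created or renamed.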

Sharing ‘raw’ data

For most labs, the “raw” data generated by the software that runs the study will be in a different format from the one we are asking labs to submit as their machine-readable, de-identified data. This creates a problem for the reproducibility of the data pipeline. Beyond enabling error-checking, the raw data may be useful for secondary analyses, for example to map variability between labs.

Ideally, we would ask all labs to share both their actual raw data (as generated when the study is run) and any code they use to convert it to the submission format. However, doing so would likely raise data-sharing concerns, because these raw data outputs may not be de-identified: they may contain birthdates or other participant-identifying information. A related concern is that many labs may be converting the raw data into the submission format by hand, which is prone to human error. We strongly encourage labs to develop processes for making these conversions in an automated way (e.g., in an R script).

In the coming months we will be soliciting suggestions and reviewing the practicality of helping labs develop these practices, as well as making a decision about whether and how to collect raw data from labs that do not use eyetrackers. We welcome comments and suggestions from contributors on this issue!

Submitting raw eye-tracking data

There are additional considerations for labs submitting eye-tracking data. First, the raw data is useful for studying variability between labs in the extraction of looking times. Second, the pupil size data provides interesting additional information about cognitive processing that can be used in follow-up or spin-off studies. We are therefore asking all laboratories submitting eye-tracking data to provide these raw data.

For the data to be most useful and accessible for secondary analyses, the preferred format is to submit 1) CSV or TSV files (or any other plain text format), with the export options chosen so that the file remains as unchanged as possible (select all possible variables, no fixation filters, etc.), and 2) annotated R code (or another script) that transforms these CSV or TSV files into the trial-level format that needs to be submitted. Please upload these files using the MB5 Data Upload form in addition to the Participant Data and Trial Data files (see above for details). For purposes of transparency and replicability, using a standardized format for the eye-tracking data is preferable (e.g., the Peekbank format).

Please take care that the raw eye-tracking data does not contain any identifying information such as birth dates, zip codes, etc.! If you have any concerns regarding the sharing of your raw data (e.g., with respect to participant consent/ethics approval), please contact the project leaders.

In the Lab Questionnaire, report which method you used to compute the looking times. In your upload, please also include the (annotated!) script that was used to compute the looking times from the raw data.
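One way to reduce the risk of identifying information remaining in a raw export is to drop known identifying columns with a script rather than by hand. The following is a minimal Python sketch under stated assumptions: the column names in IDENTIFYING_COLUMNS are hypothetical and must be adapted to your eye-tracker's export, and the function name drop_identifying_columns is illustrative only. The same idea carries over to an R script.

```python
import csv
import io

# Hypothetical column names; replace with the identifying columns that
# your own software actually writes.
IDENTIFYING_COLUMNS = {"birthdate", "participant_name", "zip_code"}


def drop_identifying_columns(raw_text, delimiter=","):
    """Return the delimited text with identifying columns removed.

    Column names are matched case-insensitively. Use delimiter="\t"
    for TSV exports.
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    kept = [
        name
        for name in (reader.fieldnames or [])
        if name.lower() not in IDENTIFYING_COLUMNS
    ]
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=kept,
        delimiter=delimiter,
        extrasaction="ignore",  # silently skip the dropped columns
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

A script like this only removes the columns it is told about, so the output should still be inspected for identifying information stored elsewhere (for example in file names or free-text fields).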

22.3 Video records of data collection (optional)

If you are sharing videos of your data collection (and this is strongly encouraged, if it’s at all possible given your ethics approval), you can store them in Databrary if you are a member. The naming convention for Databrary volumes is “ManyBabies5: yourLabID” (e.g. “ManyBabies5: babylabPrinceton”). We ask that you use this naming convention so people can easily search for all the ManyBabies-related volumes.

Thank you for being part of ManyBabies 5!