CAT Data FAQ

You've just received an email that your data is ready to download. This page explains how to download CAT sequencing data, what files you will receive, how long data are retained, and what to do if you need help.

How Do I Download My Data?

When your data are ready, you will receive a “Data Ready” email with the exact data path and credentials. You can download using SFTP or rsync. Data are typically available for about 3 months on fast SSD storage, then moved to slower storage for roughly another 3 months. Please download and back up your data as soon as possible.

How Do I Use SFTP (GUI)?

We recommend Cyberduck (Windows/Mac) or FileZilla. Server: fastq.ucsf.edu, Port: 22. Username and password are included in the Data Ready email.

Short video walkthrough coming here.

How Do I Use SFTP (Command Line)?

sftp <username>@fastq.ucsf.edu   # enter password when prompted

Useful sftp commands:

  • ls # list remote directory
  • cd DIR # change remote directory
  • pwd # show remote working directory
  • lls # list local directory
  • lcd DIR # change local directory
  • lpwd # show local working directory
  • get filename # download one file
  • get -r directory # download a whole directory
  • exit # quit SFTP

How Do I Use rsync? (Recommended for Large Transfers)

rsync -a --append-verify --partial --info=progress2 --no-o --no-g --no-perms --compress-level=0 \
    <username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
  

To list files without copying:

rsync -av --list-only <username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
  

Note: “Could not chdir to home directory …” can be ignored.
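
Interrupted transfers can be resumed by re-running the same rsync command. The following is a minimal sketch of a retry wrapper and is not part of the CAT tooling; the function name retry, the attempt count, and the delay are illustrative assumptions.

```shell
# Hypothetical helper: re-run a command until it succeeds or the limit is hit.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry() {
    retry_max=$1
    retry_delay=$2
    retry_attempt=1
    shift 2
    until "$@"; do
        if [ "$retry_attempt" -ge "$retry_max" ]; then
            echo "Giving up after $retry_attempt attempts" >&2
            return 1
        fi
        echo "Attempt $retry_attempt failed; retrying in ${retry_delay}s" >&2
        retry_attempt=$((retry_attempt + 1))
        sleep "$retry_delay"
    done
    return 0
}

# Example with the same placeholders as the rsync command above:
# retry 5 30 rsync -a --append-verify --partial --info=progress2 \
#     --no-o --no-g --no-perms --compress-level=0 \
#     <username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
```

With password login, each new attempt will prompt for the password again, so this is most useful when you can stay near the terminal.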

Data Integrity (Checksums)

Your data folder will include a checksum file (for example fastq_checksums.md5 or checksums.md5). After download, verify integrity with:

md5sum -c fastq_checksums.md5
  

If you are on macOS and don’t have md5sum, install coreutils or use an equivalent tool to compare hashes.
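
As one possible equivalent, stock macOS ships an md5 command whose -q option prints only the hash. The sketch below uses md5sum when it is installed and falls back to md5 otherwise. The function name verify_md5 is hypothetical, and the code assumes the checksum file uses the usual "HASH  FILENAME" layout that md5sum writes.

```shell
# Hypothetical helper: check every entry of an md5 checksum file.
# Assumes the "HASH  FILENAME" layout written by md5sum.
verify_md5() {
    checksum_file=$1
    if command -v md5sum >/dev/null 2>&1; then
        md5sum -c "$checksum_file"
        return $?
    fi
    # Fallback for stock macOS, where `md5 -q FILE` prints only the hash.
    failures=0
    while read -r expected name; do
        [ -n "$expected" ] || continue   # skip blank lines
        name=${name#\*}                  # md5sum marks binary mode with '*'
        actual=$(md5 -q "$name" 2>/dev/null)
        if [ "$actual" = "$expected" ]; then
            echo "$name: OK"
        else
            echo "$name: FAILED"
            failures=$((failures + 1))
        fi
    done < "$checksum_file"
    [ "$failures" -eq 0 ]
}
```

Run it from inside the downloaded folder so that the relative file names in the checksum file resolve, for example: verify_md5 fastq_checksums.md5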

Data Retention and Regeneration

Data are initially available on fast storage at /volume1/SSD/illumina/run_id. After approximately 3 weeks, they are moved to slower storage (/volume2/tortoise/run_id or /volume3/slow/run_id) and then deleted after approximately three months. Raw run data are retained for approximately 6 months. You can request regeneration through iLabs for a $100 fee per project. We recommend backing up your data in at least two locations immediately after download.

We only guarantee data availability for approximately 6 months total.

Missing Samples (Barcodes)

If expected samples are missing from your results, the most common causes are an incorrect sample sheet (wrong barcodes or wrong orientation) or low or absent representation in the library. Before contacting CAT, review Reports/Top_Unknown_Barcodes.csv for barcode pairs that are present in significant quantity but missing from the results. Also review the SampleSheet.csv found in the bcl-convert outputs. The pipeline attempts to identify such barcodes and add them automatically, but review is still recommended.
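
A quick way to see the most abundant unassigned barcodes is to sort the report by read count. This is a sketch under an assumption: it treats the fourth comma-separated column as the read count (a header such as Lane,index,index2,# Reads,...), so check the header line of your own file first. The function name top_unknown is hypothetical.

```shell
# Hypothetical helper: print the header plus the N most frequent unknown barcodes.
# Assumes the read count is the 4th comma-separated column; adjust -k4,4 if
# the header of your file shows a different layout.
top_unknown() {
    csv_file=$1
    limit=${2:-10}
    head -n 1 "$csv_file"
    tail -n +2 "$csv_file" | sort -t, -k4,4nr | head -n "$limit"
}
```

For example: top_unknown Reports/Top_Unknown_Barcodes.csv 20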

How Sample Names and Barcodes Are Edited

Sample IDs are sanitized for compatibility: Unicode is converted to ASCII, punctuation is removed, whitespace is converted to underscores, and only letters, numbers, underscores, and dashes are allowed. If two Sample IDs collide after sanitization, a short hash is appended (_hXXXXXX) to make them unique. Barcodes are normalized to uppercase A/C/G/T only; N characters are interpreted as UMI bases and are output; invalid entries are rejected. Some workflows may omit i5 or enforce a fixed index length.
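
To preview how a Sample ID or barcode might look after these rules, the following sketch approximates them. It is not the CAT pipeline's code: the function names are hypothetical, non-ASCII characters are simply dropped rather than converted, and the collision hash suffix is not reproduced.

```shell
# Hypothetical approximation of the Sample ID rules described above.
# Whitespace becomes '_'; anything outside letters, digits, '_' and '-' is
# dropped (non-ASCII bytes are removed here, not transliterated).
sanitize_id() {
    printf '%s' "$1" | LC_ALL=C sed -e 's/[[:space:]]/_/g' -e 's/[^A-Za-z0-9_-]//g'
}

# Hypothetical barcode check: uppercase the input, accept only A/C/G/T/N,
# and return non-zero for empty or invalid entries.
normalize_barcode() {
    barcode=$(printf '%s' "$1" | tr 'acgtn' 'ACGTN')
    case $barcode in
        ''|*[!ACGTN]*) return 1 ;;
    esac
    printf '%s\n' "$barcode"
}
```

For example, sanitize_id 'Sample 1 (rep#2)' prints Sample_1_rep2, and normalize_barcode ACGX fails because X is not a valid base.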

Barcode Collisions and Automatic Corrections

Some barcodes are too similar (or contain Ns) and may be filtered to avoid misassignment. The CAT pipeline uses the run's top emitted barcodes to infer dominant unexpected barcodes, attempts reverse‑complement corrections when needed, and may write a corrected sample sheet. If corrections are unsafe, it skips the changes and reports a warning.
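
A barcode entered in the wrong orientation shows up in the data as its reverse complement. To check this by hand, the sketch below computes the reverse complement of a barcode so it can be compared with the unknown barcodes reported for the run. The function name revcomp is hypothetical and this is not the pipeline's own correction logic.

```shell
# Hypothetical helper: reverse-complement a barcode (A<->T, C<->G, N stays N).
revcomp() {
    printf '%s\n' "$1" \
        | tr 'ACGTNacgtn' 'TGCANtgcan' \
        | awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }'
}
```

For example, revcomp AACG prints CGTT.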

bcl2fastq vs. bcl-convert

CAT now uses Illumina’s bcl-convert instead of bcl2fastq. The output is still FASTQ, but report/log layout is updated (e.g., Reports/ and Logs/ folders), sample sheet handling is stricter, and some file naming details may differ. If your pipeline depends on the older layout, contact CAT for guidance.

What Is in My Project Folder?

Illumina projects typically include FASTQ files, Reports/, Logs/, and a SampleSheet.csv.

PacBio projects typically include BAM files (and indexes), run metadata, and summary/report files.

Common Errors and Quick Fixes

  • Permission denied or login failures: confirm you are using the correct username/password from the Data Ready email.
  • Slow or interrupted transfers: use rsync with --partial and --append-verify to resume.
  • Partial files: re‑run rsync or verify with checksums.
  • Missing expected samples: review Top_Unknown_Barcodes.csv and your submitted barcodes.

Contact Us

If you need assistance from the CAT Core please email either:

Glossary

  • FASTQ: Text format containing sequencing reads and quality scores (usually compressed as .fastq.gz).
  • BAM: Binary alignment format commonly used for PacBio data; includes read sequence and metadata.
  • Demultiplexing: Assigning reads to samples using barcode/index sequences.
  • Lane: Physical lane on a flow cell; a run can have multiple lanes.
  • Index (Barcode): Short sequence used to identify samples pooled in a run.

Referencing the CAT Core and Publications

 

When submitting manuscripts, please acknowledge the CAT by including the text: “Sequencing was performed at the UCSF CAT, supported by UCSF PBBR, RRP IMIA, and NIH 1S10OD028511-01 grants.” 

 

To aid our facility in future grant submissions, it helps to reference labs and publications that have benefited from our core. Please notify us of your achievement by emailing the PMID, the lab, and the piece of equipment used in the paper to catcore [at] ucsf.edu.