
Data streams

Valentin Kuznetsov edited this page Oct 4, 2017 · 1 revision

Data-streams on HDFS

The CMSSpark framework can deal with the various data streams available on HDFS. Here is the full list of them:

  • DBS global (CSV): /project/awg/cms/CMS_DBS3_PROD_GLOBAL
  • DBS phys01 (CSV): /project/awg/cms/CMS_DBS3_PROD_PHYS01
  • DBS phys02 (CSV): /project/awg/cms/CMS_DBS3_PROD_PHYS02
  • DBS phys03 (CSV): /project/awg/cms/CMS_DBS3_PROD_PHYS03
  • PhEDEx (CSV): /project/awg/cms/phedex/block-replicas-snapshots
  • PhEDEx catalog (CSV): /project/awg/cms/phedex/catalog
  • AAA (JSON): /project/monitoring/archive/xrootd/raw/gled
  • EOS (JSON): /project/monitoring/archive/eos/logs/reports/cms
  • CMSSW (Avro): /project/awg/cms/cmssw-popularity
  • JobMonitoring/CRAB (Avro): /project/awg/cms/jm-data-popularity
  • JobMonitoring (Avro): /project/awg/cms/job-monitoring
  • WMArchive (Avro): /cms/wmarchive/avro

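For programmatic use, the streams above can be captured in a simple lookup table. The dictionary below is an illustrative sketch only: the short keys are hypothetical labels we chose here, not identifiers defined by CMSSpark.

```python
# Illustrative mapping of HDFS data streams to (format, path) pairs.
# The keys are hypothetical labels, not CMSSpark identifiers; paths
# and formats are taken from the list above (subset shown).
DATA_STREAMS = {
    "dbs_global": ("csv", "/project/awg/cms/CMS_DBS3_PROD_GLOBAL"),
    "phedex": ("csv", "/project/awg/cms/phedex/block-replicas-snapshots"),
    "aaa": ("json", "/project/monitoring/archive/xrootd/raw/gled"),
    "eos": ("json", "/project/monitoring/archive/eos/logs/reports/cms"),
    "cmssw": ("avro", "/project/awg/cms/cmssw-popularity"),
    "wmarchive": ("avro", "/cms/wmarchive/avro"),
}

def streams_by_format(fmt):
    """Return the HDFS paths of all streams stored in the given format."""
    return [path for f, path in DATA_STREAMS.values() if f == fmt]
```

Such a table makes it easy to pick, say, all Avro streams in one call instead of hard-coding paths throughout a script.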
As you can see, different data sources use different formats. CSV and JSON are trivial, since they can be read without additional software; one only needs to download the data from HDFS, e.g.

hadoop fs -get <hdfs-path> <local-path>
hadoop fs -get /project/awg/cms/phedex/block-replicas-snapshots/csv/time=2017-10-04_03h10m23s/part-m-00000 ./data.csv
head -1 data.csv > record.csv
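Once downloaded, the record can be inspected with Python's standard csv module. The snippet below is a minimal sketch; the sample line is invented and says nothing about the actual PhEDEx schema.

```python
import csv
import io

# A made-up sample line standing in for record.csv; real PhEDEx
# snapshots have their own columns, which we do not assume here.
sample = "dataset_x,block_y,123456789\n"

# csv.reader splits each line into fields, honoring quoting rules
# that a naive str.split(',') would get wrong.
reader = csv.reader(io.StringIO(sample))
fields = next(reader)
print(fields)  # a list of string fields
```

To read the real file, replace the io.StringIO wrapper with `open("record.csv")`.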

Avro, on the other hand, is a dedicated data format used by Apache Hadoop for data serialization and as a wire format for communication between Hadoop nodes. In order to read it you may follow the following recipe:

hadoop fs -get hdfs:///project/awg/cms/cmssw-popularity/avro-snappy/year=2017/month=10/day=2/part-m-00000.avro ./cmssw.avro
# you may obtain avro-tools jar file from local Hadoop distribution
java -jar /data/users/vk/wma/avro/avro-tools-1.7.7.jar tojson cmssw.avro > cmssw.json
head -1 cmssw.json > record.json
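After the tojson conversion, each line of cmssw.json holds one Avro record as a JSON object, so the standard json module suffices to scan it. The sketch below uses an in-memory stand-in for the file; the sample records and their field names are invented for illustration, not the actual CMSSW popularity schema.

```python
import json
import io

# Stand-in for cmssw.json: one JSON object per line, as produced by
# avro-tools tojson. The field names here are invented examples.
sample_file = io.StringIO(
    '{"FILE_LFN": "/store/foo.root", "READ_BYTES": 1024}\n'
    '{"FILE_LFN": "/store/bar.root", "READ_BYTES": 2048}\n'
)

# Parse one record per non-empty line, then aggregate a field.
records = [json.loads(line) for line in sample_file if line.strip()]
total_bytes = sum(r["READ_BYTES"] for r in records)
print(len(records), total_bytes)  # 2 3072
```

To process the real output, replace the io.StringIO wrapper with `open("cmssw.json")`.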