Data streams
Valentin Kuznetsov edited this page Oct 4, 2017 · 1 revision
The CMSSpark framework can work with several data streams available on HDFS. Here is the full list:
- DBS global (CSV): /project/awg/cms/CMS_DBS3_PROD_GLOBAL
- DBS phys01 (CSV): /project/awg/cms/CMS_DBS3_PROD_PHYS01
- DBS phys02 (CSV): /project/awg/cms/CMS_DBS3_PROD_PHYS02
- DBS phys03 (CSV): /project/awg/cms/CMS_DBS3_PROD_PHYS03
- PhEDEx (CSV): /project/awg/cms/phedex/block-replicas-snapshots
- PhEDEx catalog (CSV): /project/awg/cms/phedex/catalog
- AAA (JSON): /project/monitoring/archive/xrootd/raw/gled
- EOS (JSON): /project/monitoring/archive/eos/logs/reports/cms
- CMSSW (Avro): /project/awg/cms/cmssw-popularity
- JobMonitoring/CRAB (Avro): /project/awg/cms/jm-data-popularity
- JobMonitoring (Avro): /project/awg/cms/job-monitoring
- WMArchive (Avro): /cms/wmarchive/avro
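The list above can be kept in a small lookup helper so that scripts refer to streams by name instead of repeating paths. This is only a sketch: the function name stream_path and the short stream names are hypothetical and not part of CMSSpark, while the paths are copied from the list.

```shell
#!/bin/sh
# Hypothetical helper: map a short stream name (our own naming, not CMSSpark's)
# to the HDFS path given in the list above.
stream_path() {
    case "$1" in
        dbs_global)     echo "/project/awg/cms/CMS_DBS3_PROD_GLOBAL" ;;
        dbs_phys01)     echo "/project/awg/cms/CMS_DBS3_PROD_PHYS01" ;;
        dbs_phys02)     echo "/project/awg/cms/CMS_DBS3_PROD_PHYS02" ;;
        dbs_phys03)     echo "/project/awg/cms/CMS_DBS3_PROD_PHYS03" ;;
        phedex)         echo "/project/awg/cms/phedex/block-replicas-snapshots" ;;
        phedex_catalog) echo "/project/awg/cms/phedex/catalog" ;;
        aaa)            echo "/project/monitoring/archive/xrootd/raw/gled" ;;
        eos)            echo "/project/monitoring/archive/eos/logs/reports/cms" ;;
        cmssw)          echo "/project/awg/cms/cmssw-popularity" ;;
        jm_crab)        echo "/project/awg/cms/jm-data-popularity" ;;
        jm)             echo "/project/awg/cms/job-monitoring" ;;
        wmarchive)      echo "/cms/wmarchive/avro" ;;
        *)              echo "unknown stream: $1" >&2; return 1 ;;
    esac
}

# Example use on a node with a Hadoop client (left commented out here):
# hadoop fs -ls "$(stream_path phedex)"
```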
As you can see, different data sources use different formats. CSV and JSON are trivial since they can be read without additional software. You only need to download the data from HDFS with hadoop fs -get, e.g.
hadoop fs -get /project/awg/cms/phedex/block-replicas-snapshots/csv/time=2017-10-04_03h10m23s/part-m-00000 ./data.csv
head -1 data.csv > record.csv
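Once a record has been extracted, a quick sanity check is to count its columns. The sketch below uses a hypothetical helper, count_fields, and assumes a plain comma delimiter with no quoted commas inside fields, which may not hold for every stream.

```shell
#!/bin/sh
# Hypothetical helper: print the number of comma-separated fields
# found in the first line of the given file.
# Assumption: fields are separated by plain commas and none of them
# contains a quoted comma.
count_fields() {
    head -1 "$1" | awk -F',' '{ print NF }'
}

# Example use after the download step above (left commented out here):
# count_fields record.csv
```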
Avro, by contrast, is a dedicated data format used by Apache Hadoop for data serialization and as a wire format for communication between Hadoop nodes. To read it, you may follow this recipe:
hadoop fs -get hdfs:///project/awg/cms/cmssw-popularity/avro-snappy/year=2017/month=10/day=2/part-m-00000.avro ./cmssw.avro
# you may obtain avro-tools jar file from local Hadoop distribution
java -jar /data/users/vk/wma/avro/avro-tools-1.7.7.jar tojson cmssw.avro > cmssw.json
head -1 cmssw.json > record.json
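The converted output holds one JSON record per line, so the single record saved above can be pretty-printed to see its attributes. This is a minimal sketch that assumes python3 is available on the node; the helper name pretty_record is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: pretty-print a file holding a single JSON record,
# using the json.tool module from the Python standard library.
# Assumption: python3 is installed and the file contains exactly one JSON document.
pretty_record() {
    python3 -m json.tool "$1"
}

# Example use after the conversion step above (left commented out here):
# pretty_record record.json
```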