Updates to the sergeant (Apache Drill connector) Package & a look at Apache Drill 1.14.0 release
This article is originally published at https://rud.is/b
Apache Drill 1.14.0 was recently released, bringing with it many new features and a temporary incompatibility with the current rev of the MapR ODBC drivers. The Drill community expects new ODBC drivers to arrive shortly. The sergeant
is an alternative to ODBC for R users as it provides a dplyr
interface to the REST API along with a JDBC interface and functions to work directly with the REST API in a more programmatic fashion.
First-class dplyr-citizen support for the sergeant JDBC interface
I’ve been primarily using the ODBC interface for a while, now, since it’s dead simple to use it with dplyr
(as has been noted in my still-unfinished, short cookbook on wiring up Drill and R). The ODBC incompatibility is pretty severe since it’s at the protobuf-level, but sergeant::src_drill()
is an easy swap out and does not have any issues since it works against the REST API. Unfortunately, the query endpoint of the REST API mangles the field order when it returns query results. This really isn’t too painful since it’s easy to add in a select()
call after gathering query results to reorder things. However, it’s painful enough that it facilitated rounding out some of the corners to the JDBC interface.
sergeant::drill_jdbc()
now returns a <DrillJDBCConnection>
object which was necessary to add dplyr
classes for just enough bits to enable smooth operation with the tbl()
function (without breaking all your other RJDBC usage in the same session). The next blog section will use the new JDBC interface with dplyr
as it introduces one of Drill’s new features.
Query Image Metadata with Apache Drill 1.14.0
There are quite a few R packages for reading image/media metadata. Since that seems to be en vogue, R folks might be interested in Drill’s new image metadata format plugin. Just point drill to a directory of files and you can use a familiar dplyr
interface to get the deets on your pirated torrent archivefamily photo inventory.
You first need to follow the directions at the aforelinked resource and add the following format to the formats:
section.
formats: {
"image": {
"type": "image",
"extensions": [
"jpg", "jpeg", "jpe", "tif", "tiff", "dng", "psd", "png", "bmp", "gif",
"ico", "pcx", "wav", "wave", "avi", "webp", "mov", "mp4", "m4a", "m4p",
"m4b", "m4r", "m4v", "3gp", "3g2", "eps", "epsf", "epsi", "ai", "arw",
"crw", "cr2", "nef", "orf", "raf", "rw2", "rwl", "srw", "x3f"
],
"fileSystemMetadata": true,
"descriptive": true,
"timeZone": null
}
}
Note that the configuration snippet on Drill’s site (as of the date-stamp on this post) did not have a ,
after the ]
for the extensions
array, so copy this one instead.
I created a media
workspace and set the defaultInputFormat
to image
. Here’s a naive first look at what you can get back from a simple query to a jpg
directory under it (using the new JDBC interface and dplyr
):
library(sergeant)
library(tidyverse)
(con < - drill_jdbc("bigd:2181"))
##
tbl(con, "dfs.media.`/jpg/*`") %>%
glimpse()
## Observations: ??
## Variables: 28
## $ FileSize "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth "4032", "4032", "4032", "4032", "4032", "4032"...
## $ PixelHeight "3024", "3024", "3024", "3024", "3024", "3024"...
## $ BitsPerPixel "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth "72", "72", "72", "72", "72", "72", "72", "72"...
## $ DPIHeight "72", "72", "72", "72", "72", "72", "72", "72"...
## $ Orientaion "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha "false", "false", "false", "false", "false", "...
## $ Duration "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioCodec "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioSampleRate "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ JPEG "{\"CompressionType\":\"Baseline\",\"DataPreci...
## $ JFIF "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no...
## $ ExifIFD0 "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":...
## $ AppleMakernote "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x...
## $ GPS "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4...
## $ XMP "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat...
## $ Photoshop "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3...
## $ IPTC "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio...
## $ Huffman "{\"NumberOfTables\":\"4 Huffman tables\"}", "...
## $ FileType "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...
That’s quite a bit of metadata, but the Drill format plugin page kinda fibs a bit about column types since we see many chr
s there. You may be quick to question the sergeant
package but this isn’t using the REST interface and we can use DBI
calls to ask Drill what’s it’s sending us:
dbSendQuery(con, "SELECT * FROM dfs.media.`/jpg/*`") %>%
dbColumnInfo()
## field.name field.type data.type name
## 1 FileSize CHARACTER VARYING character FileSize
## 2 FileDateTime CHARACTER VARYING character FileDateTime
## 3 Format CHARACTER VARYING character Format
## 4 PixelWidth CHARACTER VARYING character PixelWidth
## 5 PixelHeight CHARACTER VARYING character PixelHeight
## 6 BitsPerPixel CHARACTER VARYING character BitsPerPixel
## 7 DPIWidth CHARACTER VARYING character DPIWidth
## 8 DPIHeight CHARACTER VARYING character DPIHeight
## 9 Orientaion CHARACTER VARYING character Orientaion
## 10 ColorMode CHARACTER VARYING character ColorMode
## 11 HasAlpha CHARACTER VARYING character HasAlpha
## 12 Duration CHARACTER VARYING character Duration
## 13 VideoCodec CHARACTER VARYING character VideoCodec
## 14 FrameRate CHARACTER VARYING character FrameRate
## 15 AudioCodec CHARACTER VARYING character AudioCodec
## 16 AudioSampleSize CHARACTER VARYING character AudioSampleSize
## 17 AudioSampleRate CHARACTER VARYING character AudioSampleRate
## 18 JPEG MAP character JPEG
## 19 JFIF MAP character JFIF
## 20 ExifIFD0 MAP character ExifIFD0
## 21 ExifSubIFD MAP character ExifSubIFD
## 22 AppleMakernote MAP character AppleMakernote
## 23 GPS MAP character GPS
## 24 XMP MAP character XMP
## 25 Photoshop MAP character Photoshop
## 26 IPTC MAP character IPTC
## 27 Huffman MAP character Huffman
## 28 FileType MAP character FileType
We can still work with the results, but there’s also a pretty key element missing: the media filename. The reason it’s not in the listing is that filename
is an implicit column that we have to ask for. So, we need to modify our query to be something like this:
tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>%
glimpse()
## Observations: ??
## Variables: 29
## $ fname "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"...
## $ FileSize "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth "4032", "4032", "4032", "4032", "4032", "4032"...
## $ PixelHeight "3024", "3024", "3024", "3024", "3024", "3024"...
## $ BitsPerPixel "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth "72", "72", "72", "72", "72", "72", "72", "72"...
## $ DPIHeight "72", "72", "72", "72", "72", "72", "72", "72"...
## $ Orientaion "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha "false", "false", "false", "false", "false", "...
## $ Duration "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioCodec "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioSampleRate "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ JPEG "{\"CompressionType\":\"Baseline\",\"DataPreci...
## $ JFIF "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no...
## $ ExifIFD0 "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":...
## $ AppleMakernote "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x...
## $ GPS "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4...
## $ XMP "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat...
## $ Photoshop "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3...
## $ IPTC "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio...
## $ Huffman "{\"NumberOfTables\":\"4 Huffman tables\"}", "...
## $ FileType "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...
We could work with the “map” columns with Drill’s SQL, but this is just metadata and even if there are many files, most R folks have sufficient system memory these days to collect it all and work with it locally. There’s nothing stopping you from working on the SQL side, though, and it may be a better choice if you’ll be using this to process huge archives. But, we’ll do this in R and convert a bunch of field types along the way:
from_map < - function(x) { map(x, jsonlite::fromJSON)}
tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>%
collect() %>%
mutate_at(
.vars = vars(
JPEG, JFIF, ExifSubIFD, AppleMakernote, GPS, XMP, Photoshop, IPTC, Huffman, FileType
),
.funs=funs(from_map)
) %>%
mutate_at(
.vars = vars(
PixelWidth, PixelHeight, DPIWidth, DPIHeight, FrameRate, AudioSampleSize, AudioSampleRate
),
.funs=funs(as.numeric)
) %>%
glimpse() -> media_df
## Observations: 11
## Variables: 29
## $ fname "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"...
## $ FileSize "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth 4032, 4032, 4032, 4032, 4032, 4032, 3024, 4032...
## $ PixelHeight 3024, 3024, 3024, 3024, 3024, 3024, 4032, 3024...
## $ BitsPerPixel "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ DPIHeight 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ Orientaion "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha "false", "false", "false", "false", "false", "...
## $ Duration "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioCodec "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioSampleRate 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ JPEG [["Baseline", "8 bits", "3024 pixels", "4032 ...
## $ JFIF
[["1.1", "none", "72 dots", "72 dots", "0", "...
## $ ExifIFD0 "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD [["1/2227 sec", "f/1.8", "Program normal", "2...
## $ AppleMakernote
[["5", "[558 values]", "[104 values]", "1", "...
## $ GPS
[["N", "44° 19' 6.3\"", "W", "-68° 11' 22.39\...
## $ XMP
[["4", ["2017-06-22T14:28:04"], ["2017-06-22T...
## $ Photoshop
[["48 89 11 77 33 105 192 33 170 252 63 34 43...
## $ IPTC
[["UTF-8", "2", "14:28:04", "2017:06:22", "20...
## $ Huffman
[["4 Huffman tables"], ["4 Huffman tables"], ...
## $ FileType
[["JPEG", "Joint Photographic Experts Group",...
Now, we can do anything with the data, including getting the average file size:
mutate(media_df, FileSize = str_replace(FileSize, " bytes", "") %>% as.numeric()) %>%
summarise(mean(FileSize))
## # A tibble: 1 x 1
## `mean(FileSize)`
##
## 1 3878963.
FIN
The enhancements to the JDBC interface have only been given a light workout but seem to be doing well so-far. Kick the tyres and file an issue if you have any problems. ODBC users should not have to wait long for new drivers and src_drill()
aficionados can keep chugging along as before without issue.
For those new to Apache Drill, there’s now an official Docker image for it so you can get up and running without adding too much cruft to your local systems. I may add support for spinning up and managing a Drill container to the sergeant
package, so keep your eyes on pushes to the repo.
Also keep an eye on the mini-cookbook as I’ll be modifying to to account for the new package changes and introduce additional, new Drill features.
Thanks for visiting r-craft.org
This article is originally published at https://rud.is/b
Please visit source website for post related comments.