Announcing pins for Python
This article is originally published at https://www.rstudio.com/blog/
We’re excited to announce the release of pins for Python!
pins removes the hassle of managing data across projects, colleagues, and teams by
providing a central place for people to store, version and retrieve data.
If you’ve ever chased a CSV through a series of email exchanges, or had to decide between
data-final.csv and data-final-final.csv, then pins is for you.
pins stores data on a board, which can be a local folder, RStudio Connect, or a cloud provider like Amazon S3. Each individual object (such as a dataframe, model, or another pickle-able Python object), together with some metadata, is called a pin.
The Python pins library works with its R counterpart, so that teams working across R and Python have a unified strategy for sharing data. This work emerged as part of RStudio’s investment in Python open source, in order to support bilingual data science teams.
Getting Started
The first step to using pins is installing it from PyPI.
python -m pip install pins
In the examples below, I’ll walk through the basics of pins using a temporary directory
for a board, with board_temp(). This gets deleted after you close Python, so it is
not ideal for collaboration! You can use other boards, like board_rsconnect(),
board_folder(), and board_s3(), in more realistic settings.
import pins
from pins.data import mtcars
board = pins.board_temp()
You can “pin” (save) data to a board with the .pin_write() method. It requires three
arguments (an object, a name, and a pin type):
board.pin_write(mtcars.head(), "mtcars", type="csv")
#> Meta(title='mtcars: a pinned 5 x 11 DataFrame', description=None, created='20220601T175057Z', pin_hash='120a54f7e0818041', file='mtcars.csv', file_size=249, type='csv', api_version=1, version=Version(created=datetime.datetime(2022, 6, 1, 17, 50, 57, 80318), hash='120a54f7e0818041'), name='mtcars', user={})
#>
#> Writing to pin 'mtcars'
Above, we saved the data as a CSV, but depending on
what you’re saving and who else you want to read it, you might use the
type argument to instead save it as a feather, parquet, or joblib file.
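To illustrate the trade-off behind the type argument, here is a minimal standard-library sketch. It does not call pins: the csv and pickle modules stand in for a plain-text format and a Python-only format (joblib files are pickle-based), and the single row is illustrative data. Plain text is readable from any language but leaves type inference to the reader, while a pickle preserves Python types but can only be read from Python.

```python
import csv
import io
import pickle

# Illustrative row standing in for a small table.
rows = [{"model": "Mazda RX4", "mpg": 21.0, "cyl": 6}]

# CSV round trip: portable plain text, but the csv module returns every value as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["model", "mpg", "cyl"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

# Pickle round trip: Python types survive, but only Python can read the bytes.
from_pickle = pickle.loads(pickle.dumps(rows))

print(type(from_csv[0]["mpg"]).__name__)     # str
print(type(from_pickle[0]["mpg"]).__name__)  # float
```

As a rule of thumb under these assumptions, a CSV suits a pin that colleagues will read from R or other tools, and a pickle-based type suits Python-only objects such as fitted models.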
You can later retrieve the pinned data with .pin_read():
board.pin_read("mtcars")
#>     mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
#> 0  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
#> 1  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
#> 2  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
#> 3  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
#> 4  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
You can search for data using .pin_search() and .pin_list().
# prints out a list of all pins
# board.pin_list()
# searches for pins containing "cars"
board.pin_search("cars")
#> name type ... file_size meta
#> 0 mtcars csv ... 249 Meta(title='mtcars: a pinned 5 x 11 DataFrame'...
#>
#> [1 rows x 6 columns]
Two more pieces of important functionality exist:
.pin_write() won’t delete existing data, but versions your data.
.pin_read() caches your data, so subsequent reads are much faster.
See getting started in the pins documentation for more information.
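The versioning behaviour described above can be pictured with a toy, standard-library sketch. This is not how pins is implemented: the write_version and list_versions helpers are hypothetical, the directory naming is only loosely modelled on the created and pin_hash fields in the Meta output shown earlier, and SHA-256 is an arbitrary choice for illustration. The point is that each write lands in its own version directory, so earlier data is never overwritten.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_version(root, name, payload):
    """Save payload (bytes) in a directory named by timestamp and content hash."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Assumption: SHA-256 is used here purely for illustration.
    digest = hashlib.sha256(payload).hexdigest()[:5]
    version_dir = Path(root) / name / f"{stamp}-{digest}"
    version_dir.mkdir(parents=True, exist_ok=True)
    (version_dir / "data.csv").write_bytes(payload)
    return version_dir


def list_versions(root, name):
    """Return version directory names sorted by timestamp, then hash."""
    return sorted(p.name for p in (Path(root) / name).iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    # Two writes under the same name with different contents.
    write_version(tmp, "mtcars", b"mpg,cyl\n21.0,6\n")
    write_version(tmp, "mtcars", b"mpg,cyl\n21.0,6\n22.8,4\n")
    versions = list_versions(tmp, "mtcars")

print(len(versions))
```

Both writes remain available side by side. In pins itself, board.pin_versions("mtcars") should list the stored versions of a pin.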
Interoperability with R pins
Pins stored with Python can be read with R, and vice versa.
For example, here is R code that reads the mtcars pin we wrote to the board above.
Note that TEMP_PATH refers to the temporary directory we created in this blog post for our Python board.
library(pins)
board <- board_folder(TEMP_PATH)
board %>% pin_read("mtcars")
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
This is especially useful when colleagues prefer one language over the other. For real collaborative work like this, you would use a board like board_rsconnect() or board_s3().
Going further
The real power of pins comes when you share a board with multiple people.
To get started, you can use board_folder() with a directory on a shared
drive or in Dropbox, or if you use RStudio Connect you can use board_rsconnect():
board = pins.board_rsconnect()
board.pin_write(tidy_sales_data, "michael/sales-summary", type="csv")
Then, someone else (or an automated report) can read and use your pin:
board = pins.board_rsconnect()
board.pin_read("michael/sales-summary")
The pins package also includes boards that allow you to share data on
services like Amazon’s S3 (board_s3()), with plans to support other backends such as Google Cloud Storage and Azure’s blob storage.
Get in touch
We are so happy about releasing pins for Python, and we want to make sure it supports your workflow. Join our discussion on RStudio Community to let us know what you’re working on, and how pins could help!