jsulz HF staff committed on
Commit
f20b5b5
·
1 Parent(s): 6c1172f

tweaking some words

Files changed (2)
  1. README.md +1 -1
  2. app.py +24 -11
README.md CHANGED
@@ -11,7 +11,7 @@ license: mit
  short_description: An analysis of LFS files on the Hub.
  ---
  
- Running this locally is easiest with Poetry installed.
+ Running this locally is easiest with [Poetry installed](https://python-poetry.org/).
  
  Clone the repository and run:
  
app.py CHANGED
@@ -4,8 +4,16 @@ from plotly import graph_objects as go
  import plotly.io as pio
  import plotly.express as px
  
+ # @TODO: Add a custom template to the plotly figure
+ """
+ pio.templates["custom"] = go.layout.Template()
+ pio.templates["custom"].layout = dict(
+ plot_bgcolor="#bde5ec", paper_bgcolor="#bbd5da"
+ )
+
  # Set the default theme to "plotly_dark"
- pio.templates.default = "plotly_dark"
+ pio.templates.default = "custom"
+ """
  
  
  def process_dataset():
@@ -222,7 +230,7 @@ def plot_total_sum(by_type_arr):
  
  # Update layout
  fig.update_layout(
- title="Top 20 File Extensions by Total Size",
+ title="Top 20 File Extensions by Total Size (in PBs)",
  xaxis_title="File Extension",
  yaxis_title="Total Size (PBs)",
  yaxis=dict(tickformat=".2f"), # Format y-axis labels to 2 decimal places
@@ -292,7 +300,7 @@ def area_plot_by_extension_month(_df):
  fig = px.area(_df, x="date", y="total_size", color="extension")
  # Update layout
  fig.update_layout(
- title="File Extension Cumulative Growth (in PBs) Over Time",
+ title="File Extension Monthly Additions (in PBs) Over Time",
  xaxis_title="Date",
  yaxis_title="Size (PBs)",
  legend_title="Type",
@@ -353,7 +361,7 @@ with gr.Blocks() as demo:
  # Add top level heading and introduction text
  gr.Markdown("# Git LFS Usage Across the Hub")
  gr.Markdown(
- "The Hugging Face Hub has just crossed 1,000,000 models - but where is all that data stored? The short answer is Git LFS. This analysis dives into the LFS storage on the Hub, breaking down the data by repository type, file extension, and growth over time."
+ "The Hugging Face Hub has just crossed 1,000,000 models - but where is all that data stored? Most of it is stored in Git LFS. This analysis dives into the LFS storage on the Hub, breaking down the data by repository type, file extension, and growth over time. The data is based on a snapshot of the Hub's LFS storage, starting in March 2022 and ending September 20th, 2024 (meaning the data is incomplete for September 2024). Right now, this is a one-time analysis, but as we do our work we hope to revisit and update the underlying data to provide more insights."
  )
  
  gr.Markdown(
@@ -362,15 +370,20 @@ with gr.Blocks() as demo:
  gr.HTML(div_px(25))
  # Cumulative growth analysis
  gr.Markdown("## Repository Growth")
- with gr.Row():
- gr.Plot(fig)
+ gr.Markdown(
+ "The plot below shows the growth of Git LFS storage on the Hub over the past two years. The solid lines represent the cumulative growth of models, spaces, and datasets, while the dashed lines represent the growth with file-level deduplication."
+ )
+ gr.Plot(fig)
  
  gr.HTML(div_px(5))
  # @TODO Talk to Allison about variant="panel"
  with gr.Row():
  with gr.Column(scale=1):
  gr.Markdown(
- "This table shows the total number of files, cumulative size of those files across all repositories on the Hub, and the potential file-level dedupe savings. To put this in context, the last [Common Crawl](https://commoncrawl.org/) download was [451 TBs](https://github.com/commoncrawl/cc-crawl-statistics/blob/master/stats/crawler/CC-MAIN-2024-38.json#L31). The Spaces repositories alone outpaces that! Meanwhile, between Datasets and Model repos, the Hub stores **64 Common Crawls** 🤯. Current estimates put total deduplication savings at approximately 3.24 PBs (7.2 Common Crawls)!"
+ "In this table, we can see what the final picture looks like as of September 20th, 2024, along with the potential file-level deduplication savings."
+ )
+ gr.Markdown(
+ "To put this in context, the last [Common Crawl](https://commoncrawl.org/) download was [451 TBs](https://github.com/commoncrawl/cc-crawl-statistics/blob/master/stats/crawler/CC-MAIN-2024-38.json#L31). The Spaces repositories alone outpace that! Meanwhile, between Datasets and Model repos, the Hub stores **64 Common Crawls** 🤯. Current estimates put file deduplication savings at approximately 3.24 PBs (7.2 Common Crawls)!"
  )
  with gr.Column(scale=3):
  # Convert the total size to petabytes and format to two decimal places
@@ -384,14 +397,14 @@ with gr.Blocks() as demo:
  with gr.Row():
  with gr.Column(scale=1):
  gr.Markdown(
- "The cumulative growth of models, spaces, and datasets over time can be seen in the adjacent chart. Beside that is a view of the total change, from the previous month to the current one, of LFS files stored on the hub over 2024. We're averaging nearly **2.3 PBs uploaded to LFS per month!**"
+ "The month-to-month growth of models, spaces, and datasets can be seen in the adjacent table. In 2024, the Hub has averaged nearly **2.3 PBs uploaded to LFS per month!** By the same token, the monthly file deduplication savings are nearly 225 TBs."
  )
  
  gr.Markdown(
- "By the same token, the monthly file deduplication savings are nearly 225TBs. Borrowing from the [Common Crawl](https://commoncrawl.org/) analogy, that's about half a crawl saved each month!"
+ "Borrowing from the Common Crawl analogy, that's about *5 crawls* uploaded every month, with an _easy savings of half a crawl every month_ by deduplicating at the file-level!"
  )
  with gr.Column(scale=3):
- gr.Dataframe(last_10_months, height=250)
+ gr.Dataframe(last_10_months)
  
  gr.HTML(div_px(25))
  # File Extension analysis
@@ -445,7 +458,7 @@ with gr.Blocks() as demo:
  
  gr.HTML(div_px(5))
  gr.Markdown(
- "To dig a little deeper, the following dropdown allows you to filter the area chart by file extension."
+ "To dig a little deeper, the following dropdown allows you to filter the area chart by file extension. Because we're dealing with individual file extensions, the data is presented in terabytes (TBs)."
  )
  
  # build a dropdown using the unique values in the extension column
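
The Common Crawl comparisons quoted in the Markdown strings of this commit (7.2 crawls of deduplication savings, about 5 crawls uploaded per month, about half a crawl saved per month) can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from app.py: the helper names are hypothetical, and it assumes decimal units (1 PB = 1000 TB), which is the convention that reproduces the quoted figures.

```python
# Size of the Common Crawl download cited in the text (CC-MAIN-2024-38), in TB.
COMMON_CRAWL_TB = 451

# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000


def crawls_from_tb(size_tb: float) -> float:
    """Express a size given in terabytes as a number of Common Crawls."""
    return size_tb / COMMON_CRAWL_TB


def crawls_from_pb(size_pb: float) -> float:
    """Express a size given in petabytes as a number of Common Crawls."""
    return crawls_from_tb(size_pb * TB_PER_PB)


# Figures quoted in the commit's Markdown text.
dedupe_savings = round(crawls_from_pb(3.24), 1)   # total file-level dedupe savings
monthly_uploads = round(crawls_from_pb(2.3), 1)   # average monthly LFS uploads in 2024
monthly_savings = round(crawls_from_tb(225), 1)   # monthly dedupe savings

print(dedupe_savings, monthly_uploads, monthly_savings)
```

Under these assumptions the three values round to 7.2, 5.1, and 0.5, which is consistent with the "7.2 Common Crawls", "5 crawls", and "half a crawl" wording in the text.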